* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
@ 2006-05-02 20:18 ` hunt at redhat dot com
2006-05-02 20:25 ` hunt at redhat dot com
` (12 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: hunt at redhat dot com @ 2006-05-02 20:18 UTC (permalink / raw)
To: systemtap
------- Additional Comments From hunt at redhat dot com 2006-05-02 20:18 -------
FC6 kernels have the new kprobes page fault handling and some other changes
including some "boost" code.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
2006-05-02 20:18 ` [Bug kprobes/2637] " hunt at redhat dot com
@ 2006-05-02 20:25 ` hunt at redhat dot com
2006-05-03 13:05 ` prasanna at in dot ibm dot com
` (11 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: hunt at redhat dot com @ 2006-05-02 20:25 UTC (permalink / raw)
To: systemtap
------- Additional Comments From hunt at redhat dot com 2006-05-02 20:25 -------
Another data point:
Eliminating the userspace copy results in no skipped probes. So the attempt to
do the userspace copy causes the skipped probe count to be incremented, which
seems wrong to me.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
2006-05-02 20:18 ` [Bug kprobes/2637] " hunt at redhat dot com
2006-05-02 20:25 ` hunt at redhat dot com
@ 2006-05-03 13:05 ` prasanna at in dot ibm dot com
2006-05-03 13:08 ` fche at redhat dot com
` (10 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: prasanna at in dot ibm dot com @ 2006-05-03 13:05 UTC (permalink / raw)
To: systemtap
------- Additional Comments From prasanna at in dot ibm dot com 2006-05-03 13:05 -------
How about incrementing the nmissed count only if fixup_exception()
fails ? As shown in the patch below.
Thanks
Prasanna
Kprobes increments the nmissed count, even if user-specified pre/post
handler page faults while access data from user address space gets
fixed up. This patch removes the incrementing of nmissed count if
fixup_exception() succeeds on page faults and increments the nmissed
count only if fixup_exception() fails.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
arch/i386/kernel/kprobes.c | 14 +++++++-------
1 files changed, 7 insertions(+), 7 deletions(-)
diff -puN
arch/i386/kernel/kprobes.c~kprobes-increment-nmissed-on-failed-pagefault-handling
arch/i386/kernel/kprobes.c
---
linux-2.6.17-rc3-mm1/arch/i386/kernel/kprobes.c~kprobes-increment-nmissed-on-failed-pagefault-handling
2006-05-03 18:20:59.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/kprobes.c 2006-05-03
18:23:39.000000000 +0530
@@ -605,13 +605,6 @@ static int __kprobes kprobe_fault_handle
case KPROBE_HIT_ACTIVE:
case KPROBE_HIT_SSDONE:
/*
- * We increment the nmissed count for accounting,
- * we can also use npre/npostfault count for accouting
- * these specific fault cases.
- */
- kprobes_inc_nmissed_count(cur);
-
- /*
* We come here because instructions in the pre/post
* handler caused the page_fault, this could happen
* if handler tries to access user space by
@@ -629,6 +622,13 @@ static int __kprobes kprobe_fault_handle
return 1;
/*
+ * We increment the nmissed count for accounting,
+ * we can also use npre/npostfault count for accouting
+ * these specific fault cases.
+ */
+ kprobes_inc_nmissed_count(cur);
+
+ /*
* fixup_exception() could not handle it,
* Let do_page_fault() fix it.
*/
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (2 preceding siblings ...)
2006-05-03 13:05 ` prasanna at in dot ibm dot com
@ 2006-05-03 13:08 ` fche at redhat dot com
2006-05-04 4:22 ` hunt at redhat dot com
` (9 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: fche at redhat dot com @ 2006-05-03 13:08 UTC (permalink / raw)
To: systemtap
------- Additional Comments From fche at redhat dot com 2006-05-03 13:08 -------
While this should make the "skipped" count go away, those accesses would still
fault. The next step is to figure out why.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (3 preceding siblings ...)
2006-05-03 13:08 ` fche at redhat dot com
@ 2006-05-04 4:22 ` hunt at redhat dot com
2006-05-05 6:56 ` prasanna at in dot ibm dot com
` (8 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: hunt at redhat dot com @ 2006-05-04 4:22 UTC (permalink / raw)
To: systemtap
------- Additional Comments From hunt at redhat dot com 2006-05-04 04:22 -------
I verified Prasanna's patch fixed the skipped count.
I did another test to confirm the bad userspace copies were caused by kprobes.
First I build the stock 2.6.16.11 kernel. It does not have any problems. Then I
copied over the new kprobes code from the FC6 kernels and rebuilt 2.6.16.11.
This time it did have the userspace copy failures in the same 3 places in my
test case.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (4 preceding siblings ...)
2006-05-04 4:22 ` hunt at redhat dot com
@ 2006-05-05 6:56 ` prasanna at in dot ibm dot com
2006-05-05 13:49 ` fche at redhat dot com
` (7 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: prasanna at in dot ibm dot com @ 2006-05-05 6:56 UTC (permalink / raw)
To: systemtap
------- Additional Comments From prasanna at in dot ibm dot com 2006-05-05 06:56 -------
The example given by Martin, accesses the data in user address space. Most
likely the data accessed from user address space may not be present in memory.
It is known that such accesses causes page_faults, where in the
fixup_exception() may not succeed most of the times, hence nmissed count goes up
(even with the current above patch).
Should we remove incrementing the nmissed count even if fixup_exception() fails?
Most of the system calls access user address space to copy the agruments and
they may experience the similar page_faults as above.
Is it acceptable, if the probe handler fails most of the time, while accessing
arguments from user address space. OR do we need to find some other way to
get the agruments of the syscalls?
Any thoughts?
Thanks
Prasanna
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (5 preceding siblings ...)
2006-05-05 6:56 ` prasanna at in dot ibm dot com
@ 2006-05-05 13:49 ` fche at redhat dot com
2006-05-05 16:45 ` Martin Hunt
2006-05-05 16:42 ` hunt at redhat dot com
` (6 subsequent siblings)
13 siblings, 1 reply; 17+ messages in thread
From: fche at redhat dot com @ 2006-05-05 13:49 UTC (permalink / raw)
To: systemtap
------- Additional Comments From fche at redhat dot com 2006-05-05 13:49 -------
(In reply to comment #6)
> The example given by Martin, accesses the data in user address space. Most
> likely the data accessed from user address space may not be present in memory.
Quite possible - the faulting strings appear to come from the ps/sh executables'
data pages. Maybe, under new kernels, those are faulted in later than before.
This would be difficult to work around. Some possibilities:
- Teach the translator to look for user_string($targetvar) constructs. Find a
way of "prefetching" the requested area of user memory, before the probe handler
gets under way.
- Defer the user_string() calls in tapset script code to the .return handler, or
some other point *after* the kernel's syscall routine ran (and presumably
faulted in all needed pages).
- Reconsider the atomic probe handler model.
>[...] Should we remove incrementing the nmissed count even if
fixup_exception() fails?
"nmissed" is a misnomer in this case, since the fault occurred once the probe
handler was already running, so may have done work. But for systemtap purposes,
"missed" ideally means "never started; no side-effects occurred". Maybe we can
choose to interpret kprobes.nmissed changing as an error, and only
kretprobes.nmissed as a genuine "missed" event.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Bug kprobes/2637] skipped probes in FC6
2006-05-05 13:49 ` fche at redhat dot com
@ 2006-05-05 16:45 ` Martin Hunt
0 siblings, 0 replies; 17+ messages in thread
From: Martin Hunt @ 2006-05-05 16:45 UTC (permalink / raw)
To: sourceware-bugzilla; +Cc: systemtap
On Fri, 2006-05-05 at 13:49 +0000, fche at redhat dot com wrote:
> Quite possible - the faulting strings appear to come from the ps/sh executables'
> data pages. Maybe, under new kernels, those are faulted in later than before.
See my comment #5. The kernel is the same. Only a fairly small number of
kprobes changes were different between the working and non-working
versions of 2.6.16.11.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (6 preceding siblings ...)
2006-05-05 13:49 ` fche at redhat dot com
@ 2006-05-05 16:42 ` hunt at redhat dot com
2006-05-05 16:45 ` hunt at redhat dot com
` (5 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: hunt at redhat dot com @ 2006-05-05 16:42 UTC (permalink / raw)
To: systemtap
------- Additional Comments From hunt at redhat dot com 2006-05-05 16:42 -------
Subject: Re: skipped probes in FC6
On Fri, 2006-05-05 at 06:56 +0000, prasanna at in dot ibm dot com wrote:
> ------- Additional Comments From prasanna at in dot ibm dot com 2006-05-05 06:56 -------
> The example given by Martin, accesses the data in user address space. Most
> likely the data accessed from user address space may not be present in memory.
True. But such situations have in the past been extremely rare. What is
it about the new kprobes that makes it now so common and predictable? We
need to understand this better.
> It is known that such accesses causes page_faults, where in the
> fixup_exception() may not succeed most of the times, hence nmissed count goes up
> (even with the current above patch).
Could you explain what would cause such an event? It does not happen in
my experience and I don't see how it could. If there is an entry in the
fixup table, which there always should be if you use a proper userspace
copy function, fixup exception will work.
> Is it acceptable, if the probe handler fails most of the time, while accessing
> arguments from user address space. OR do we need to find some other way to
> get the agruments of the syscalls
The current situation fails too often and is completely unacceptable. I
cannot even run my simple demo programs.
If we absolutely cannot fix the problem, then we must as a minimum
change user_string() to just return "" or "<unknown>" and not generate
errors or warnings when copies fail.
Martin
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (7 preceding siblings ...)
2006-05-05 16:42 ` hunt at redhat dot com
@ 2006-05-05 16:45 ` hunt at redhat dot com
2006-05-05 21:33 ` anil dot s dot keshavamurthy at intel dot com
` (4 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: hunt at redhat dot com @ 2006-05-05 16:45 UTC (permalink / raw)
To: systemtap
------- Additional Comments From hunt at redhat dot com 2006-05-05 16:45 -------
Subject: Re: skipped probes in FC6
On Fri, 2006-05-05 at 13:49 +0000, fche at redhat dot com wrote:
> Quite possible - the faulting strings appear to come from the ps/sh executables'
> data pages. Maybe, under new kernels, those are faulted in later than before.
See my comment #5. The kernel is the same. Only a fairly small number of
kprobes changes were different between the working and non-working
versions of 2.6.16.11.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (8 preceding siblings ...)
2006-05-05 16:45 ` hunt at redhat dot com
@ 2006-05-05 21:33 ` anil dot s dot keshavamurthy at intel dot com
2006-05-11 16:30 ` Martin Hunt
2006-05-11 16:30 ` hunt at redhat dot com
` (3 subsequent siblings)
13 siblings, 1 reply; 17+ messages in thread
From: anil dot s dot keshavamurthy at intel dot com @ 2006-05-05 21:33 UTC (permalink / raw)
To: systemtap
------- Additional Comments From anil dot s dot keshavamurthy at intel dot com 2006-05-05 21:33 -------
> True. But such situations have in the past been extremely rare. What is
> it about the new kprobes that makes it now so common and predictable? We
> need to understand this better.
In the past, we did not had the fixup_exception() call from
kprobes_exception_handler() code path and due to which any page_fault
generated by accessing user_data from pre/post handler used to
fall back to do_page_fault() function and hence you were
not seeing failure that often.
Again when we fall back to do_page_fault() function
we will technically be in preempt_disable() state which
is also wrong and potentially we could see a hang the system
on UP system.
Hence the right thing to do here is to catch that page_fault exceptions and
try fixing up the exceptions without falling back on do_page_fault() code path.
With this simple fix, you would now experience any copy_from_user() calls from
pre/post hanlder to fail if that page is not in memory.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Bug kprobes/2637] skipped probes in FC6
2006-05-05 21:33 ` anil dot s dot keshavamurthy at intel dot com
@ 2006-05-11 16:30 ` Martin Hunt
0 siblings, 0 replies; 17+ messages in thread
From: Martin Hunt @ 2006-05-11 16:30 UTC (permalink / raw)
To: sourceware-bugzilla; +Cc: systemtap
On Fri, 2006-05-05 at 21:33 +0000, anil dot s dot keshavamurth
> Hence the right thing to do here is to catch that page_fault exceptions and
> try fixing up the exceptions without falling back on do_page_fault() code path.
> With this simple fix, you would now experience any copy_from_user() calls from
> pre/post hanlder to fail if that page is not in memory.
I agree. I have been looking into this and could use some help.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (9 preceding siblings ...)
2006-05-05 21:33 ` anil dot s dot keshavamurthy at intel dot com
@ 2006-05-11 16:30 ` hunt at redhat dot com
2006-05-11 17:01 ` anil dot s dot keshavamurthy at intel dot com
` (2 subsequent siblings)
13 siblings, 0 replies; 17+ messages in thread
From: hunt at redhat dot com @ 2006-05-11 16:30 UTC (permalink / raw)
To: systemtap
------- Additional Comments From hunt at redhat dot com 2006-05-11 16:30 -------
Subject: Re: skipped probes in FC6
On Fri, 2006-05-05 at 21:33 +0000, anil dot s dot keshavamurth
> Hence the right thing to do here is to catch that page_fault exceptions and
> try fixing up the exceptions without falling back on do_page_fault() code path.
> With this simple fix, you would now experience any copy_from_user() calls from
> pre/post hanlder to fail if that page is not in memory.
I agree. I have been looking into this and could use some help.
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (10 preceding siblings ...)
2006-05-11 16:30 ` hunt at redhat dot com
@ 2006-05-11 17:01 ` anil dot s dot keshavamurthy at intel dot com
2006-05-15 6:52 ` prasanna at in dot ibm dot com
2006-05-31 21:58 ` [Bug runtime/2637] " jkenisto at us dot ibm dot com
13 siblings, 0 replies; 17+ messages in thread
From: anil dot s dot keshavamurthy at intel dot com @ 2006-05-11 17:01 UTC (permalink / raw)
To: systemtap
------- Additional Comments From anil dot s dot keshavamurthy at intel dot com 2006-05-11 17:01 -------
(In reply to comment #11)
> Subject: Re: skipped probes in FC6
> On Fri, 2006-05-05 at 21:33 +0000, anil dot s dot keshavamurth
> > Hence the right thing to do here is to catch that page_fault exceptions and
> > try fixing up the exceptions without falling back on do_page_fault() code
path.
> > With this simple fix, you would now experience any copy_from_user() calls
from
> > pre/post hanlder to fail if that page is not in memory.
> I agree. I have been looking into this and could use some help.
Martin, What exact help are you looking here? Are you worried failing of
copy_from_users() if the user pages are not in memory?
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug kprobes/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (11 preceding siblings ...)
2006-05-11 17:01 ` anil dot s dot keshavamurthy at intel dot com
@ 2006-05-15 6:52 ` prasanna at in dot ibm dot com
2006-05-31 21:58 ` [Bug runtime/2637] " jkenisto at us dot ibm dot com
13 siblings, 0 replies; 17+ messages in thread
From: prasanna at in dot ibm dot com @ 2006-05-15 6:52 UTC (permalink / raw)
To: systemtap
------- Additional Comments From prasanna at in dot ibm dot com 2006-05-15 06:52 -------
Created an attachment (id=1024)
--> (http://sourceware.org/bugzilla/attachment.cgi?id=1024&action=view)
removes the nmissed count usage if handler's access to user address space fails
Martin,
This patch removes the usage of nmissed count if probe handler's access to user
address space fails. Also if accounting of such user address space is required,
we can add another counter later.
Thanks
Prasanna
--
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Bug runtime/2637] skipped probes in FC6
2006-05-02 20:13 [Bug kprobes/2637] New: skipped probes in FC6 hunt at redhat dot com
` (12 preceding siblings ...)
2006-05-15 6:52 ` prasanna at in dot ibm dot com
@ 2006-05-31 21:58 ` jkenisto at us dot ibm dot com
13 siblings, 0 replies; 17+ messages in thread
From: jkenisto at us dot ibm dot com @ 2006-05-31 21:58 UTC (permalink / raw)
To: systemtap
------- Additional Comments From jkenisto at us dot ibm dot com 2006-05-31 21:58 -------
I believe that we have concluded that this is NOT a kprobes bug, and in fact
reflects a fix in kprobes's fault handling. See Anil's analysis in Comment #10.
Kprobes handlers aren't supposed to sleep. When called from a handler, a
function that accesses a non-resident user page must fail. That's what's
happening now. What was happening before was that handler slept (BAD, BAD)
while the non-resident page was brought it. So the memory access succeeded, but
at the risk of hanging the system.
The real bug, as I see it, is that the script now terminates when the memory
access fails.
I'm not sure whether this is a runtime bug or a translator bug. I'm reassigning
this as a runtime bug because most of the proposed solutions involve at least
some modification of the runtime. See
http://sources.redhat.com/ml/systemtap/2006-q2/msg00493.html
--
What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|systemtap at sources dot |hunt at redhat dot com
|redhat dot com |
Status|NEW |ASSIGNED
Component|kprobes |runtime
http://sourceware.org/bugzilla/show_bug.cgi?id=2637
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
^ permalink raw reply [flat|nested] 17+ messages in thread