public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* notify_page_fault() problem
@ 2007-04-30 20:19 Quentin Barnes
  2007-04-30 20:25 ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Quentin Barnes @ 2007-04-30 20:19 UTC (permalink / raw)
  To: systemtap

In testing the ARM Kprobes code, I hit a bug where the kernel would
occasionally lock up.  I tracked this down to the page fault handling
code.

Once I had a grasp of how the code is working, I'm wondering why other
Kprobes implementations aren't hitting the same bug.

On i386, I see they have a fix or workaround for this issue, but in
skimming the code base I don't see something like it on any other
Kprobes supporting platform.

The manifestation of the problem is that the kernel can get stuck
in an infinite loop with repeating page faults until it stack
overruns.  I understand how and why it is doing it, but exactly what
is considered the best fix is unclear to me since I'm not a Linux
kernel expert.

The start of the problem is that the data space for Kprobes modules
comes out of the module_alloc'd memory region.  Like vmalloc'd
memory this memory can cause an MMU fault, which is as expected.
Like vmalloc'd memory, the kernel fills in the hardware PTEs for its
MMU for module_alloc'd memory on an on-demand basis.

The problem is that the call to notify_page_fault() is right up
front in most architecture's data fault handling code.  The function
notify_page_fault() and its registered chain routines can be invoked
_before_ the kernel's had a chance to fill in its mapping.  See the
problem yet?  The MMU generates a fault, the arch-specific fault
handler is called, it immediately calls notify_page_fault() which
calls kprobe_exceptions_notify() which calls kprobe_fault_handler()
which calls get_kprobe().  The function get_kprobe() dereferences
a kprobe data structure which the structure isn't always mapped in
hardware yet, so it faults.  Now we re-start the entire call
sequence again, and then again, and again, and again until the
system falls over.

Now on i386's do_page_fault(), it avoids the above infinite
recursion by checking to see if the fault happened in kernel space.
If it did, it makes sure it was the right kind of fault, and, if
it is, it calls vmalloc_fault() which will attempt to map the missing
page translation with vmalloc_sync_one().  If it is successful,
do_page_fault() returns without letting notify_page_fault() get
called.

This support code on i386 (vmalloc_fault() and vmalloc_sync_one())
is in its architecture-specific layer.  If this is considered
the best fix for this issue (instead of moving the call, more
below), rather than everyone go off and implement it themselves
in their own layer, should something like it be implemented in
architecture-independent code?


One thing I don't understand is why notify_page_fault() is called
so early in everyone's page fault handling code.  Shouldn't it be
called just before the kernel's about to give up, instead of right
up front?

Page fault handling code is a serious critical path in any kernel.
As it is by putting the notify_page_fault() call so early, just
having the check there, whether or not any hooks are registered,
has to have a detrimental impact on overall system performance.  If
it is treated as an option of last resort like other 'die' chain
services, this wouldn't have been a problem in the first place.  Why
did someone find it important for the check not to go last?


For now to workaround this bug for the ARM Kprobes port, I just
removed the call to notify_page_fault() for kernel translation
faults.  I wanted to hear some comments on this problem before
working on a fix.

Quentin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-04-30 20:19 notify_page_fault() problem Quentin Barnes
@ 2007-04-30 20:25 ` Andi Kleen
  2007-04-30 21:15   ` Quentin Barnes
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2007-04-30 20:25 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: systemtap

Quentin Barnes <qbarnes@urbana.css.mot.com> writes:

> Now on i386's do_page_fault(), it avoids the above infinite
> recursion by checking to see if the fault happened in kernel space.

Actually the real avoidance is by calling vmalloc_sync_all() when
the notifier is registered. Probably you need to implement an equivalent
for ARM

> One thing I don't understand is why notify_page_fault() is called
> so early in everyone's page fault handling code. 

So that users like kprobes don't deadlock on a fault inside
a region that takes the mm_sem (or some other lock taken by pf)

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-04-30 20:25 ` Andi Kleen
@ 2007-04-30 21:15   ` Quentin Barnes
  2007-05-01  2:57     ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Quentin Barnes @ 2007-04-30 21:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: systemtap

On Mon, Apr 30, 2007 at 11:23:22PM +0200, Andi Kleen wrote:
>Quentin Barnes <qbarnes@urbana.css.mot.com> writes:
>
>> Now on i386's do_page_fault(), it avoids the above infinite
>> recursion by checking to see if the fault happened in kernel space.
>
>Actually the real avoidance is by calling vmalloc_sync_all() when
>the notifier is registered. Probably you need to implement an equivalent
>for ARM

Ah, okay.  I see a call to vmalloc_sync_all() in
register_page_fault_notifier() which would also resolve this
problem and it was put in around 2.6.18.

However, vmalloc_sync_all() is i386 and x86_64 specific as well
as their change to register_page_fault_notifier().  I don't see
other platform doing anything else doing anything special in their
register_page_fault_notifier().  I have trouble believing that x86
and ARM are unique somehow with needing to address this problem.
Why doesn't anyone else hit this?  Is it a lurking problem or are
there other fixes in other forms out there?

>> One thing I don't understand is why notify_page_fault() is called
>> so early in everyone's page fault handling code. 
>
>So that users like kprobes don't deadlock on a fault inside
>a region that takes the mm_sem (or some other lock taken by pf)

I think I follow, but I'm not positive.

I would expect by the time the page fault control flow got through
all the expected cases and checked them all that those obtained
locks would be surrendered.

I guess part of the answer has to do with what people's expectations
are for intercepting faults with their kprobes fault handler though.

>-Andi

Quentin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-04-30 21:15   ` Quentin Barnes
@ 2007-05-01  2:57     ` Andi Kleen
  2007-05-01  3:24       ` Quentin Barnes
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2007-05-01  2:57 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: Andi Kleen, systemtap

> However, vmalloc_sync_all() is i386 and x86_64 specific as well
> as their change to register_page_fault_notifier().  I don't see
> other platform doing anything else doing anything special in their
> register_page_fault_notifier().  

They probably just haven't tested this particular case yet.
x86 also did it originally to handle NMI notifiers, which is a x86 special
(nested pagefault in NMI can lead to stack corruption because
NMIs are only blocked until the next IRET)

> I have trouble believing that x86
> and ARM are unique somehow with needing to address this problem.
> Why doesn't anyone else hit this?  Is it a lurking problem or are
> there other fixes in other forms out there?

The standard kprobes notifier is not modular so it won't hit this.
 
> I guess part of the answer has to do with what people's expectations
> are for intercepting faults with their kprobes fault handler though.

Yes, some have pretty broad exceptions.  It might be possible
to move it to a kernel address only path, but then some debuggers
seem to want to debug user mode too.

But you're right there has been grumbling about the overhead
of the notifier call in the hot path.

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-05-01  2:57     ` Andi Kleen
@ 2007-05-01  3:24       ` Quentin Barnes
  2007-05-01 15:40         ` Frank Ch. Eigler
  0 siblings, 1 reply; 8+ messages in thread
From: Quentin Barnes @ 2007-05-01  3:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: systemtap

On Tue, May 01, 2007 at 04:57:00AM +0200, Andi Kleen wrote:
>> However, vmalloc_sync_all() is i386 and x86_64 specific as well
>> as their change to register_page_fault_notifier().  I don't see
>> other platform doing anything else doing anything special in their
>> register_page_fault_notifier().  
>
>They probably just haven't tested this particular case yet.

I got it to happen most often when running the syscall.exp test.
But it was still very intermittent though.  I'm guessing it
has to do with what else got placed on the same page with the
kprobe/kretprobe data structure (so it would occasionally get
coincidentally loaded and work) and if the system is running
preempt-enabled and how busy it is to have another page fault occur
before the kprobes data structure could get its translation fault to
happen.  If the system is quiescent, this bug's not going to show up
either.  I'm currently running my lowly 64MB ARM board with network
boot _and_ swap drives so a lot system pounding is going on most all
the time.

>x86 also did it originally to handle NMI notifiers, which is a
>x86 special (nested pagefault in NMI can lead to stack corruption
>because NMIs are only blocked until the next IRET)

Ah, ok.

>> I have trouble believing that x86
>> and ARM are unique somehow with needing to address this problem.
>> Why doesn't anyone else hit this?  Is it a lurking problem or are
>> there other fixes in other forms out there?
>
>The standard kprobes notifier is not modular so it won't hit this.

Are you saying the code in arch/*/kernel/kprobes.c and kernel/kprobes.c
is not marked as a modular so it won't hit this problem?

That doesn't matter.  Just having the kprobes and kretprobes data
structures being in module memory is all that matters.  What
happens is when get_kprobe() and aggr_pre_handler() walk the the
kprobe_table[] list and they stumble across a kprobe or kretprobe
data structure referenced in the table that's not mapped in hardware
yet.  That's what's generating the recursive faults I was seeing.

>> I guess part of the answer has to do with what people's expectations
>> are for intercepting faults with their kprobes fault handler though.
>
>Yes, some have pretty broad exceptions.  It might be possible
>to move it to a kernel address only path, but then some debuggers
>seem to want to debug user mode too.

It would be nice to move those debugger hooks out of there and have
the debuggers use kprobes so their needs don't negatively impact the
system as a whole even when they're not in use.

>But you're right there has been grumbling about the overhead
>of the notifier call in the hot path.

I wasn't aware of any grumbling.  I'm not on the main kernel mailing
list.  I was just disappointed though to see that the fault handlers
are being notified for every single fault in the system, user or
kernel space.  While tracking down this bug, my debug logs were huge
just from having a user land app fault in some shared libraries.
Just a few instructions in a system's fault handler path can have
noticable performance repercussions.

>-Andi

Quentin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-05-01  3:24       ` Quentin Barnes
@ 2007-05-01 15:40         ` Frank Ch. Eigler
  2007-05-01 15:43           ` Quentin Barnes
  2007-05-02 10:33           ` Andi Kleen
  0 siblings, 2 replies; 8+ messages in thread
From: Frank Ch. Eigler @ 2007-05-01 15:40 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: Andi Kleen, systemtap

Quentin Barnes <qbarnes@urbana.css.mot.com> writes:

> [...]  That doesn't matter.  Just having the kprobes and kretprobes
> data structures being in module memory is all that matters.  [...]

I recall thinking of extending the systemtap startup code sequence to
force mapping of the entire module data/text memory, hoping to prevent
vmalloc faults during execution.  Does this sound like a plausible
hack around such problems?

- FChE

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-05-01 15:40         ` Frank Ch. Eigler
@ 2007-05-01 15:43           ` Quentin Barnes
  2007-05-02 10:33           ` Andi Kleen
  1 sibling, 0 replies; 8+ messages in thread
From: Quentin Barnes @ 2007-05-01 15:43 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Andi Kleen, systemtap

On Tue, May 01, 2007 at 11:40:29AM -0400, Frank Ch. Eigler wrote:
>Quentin Barnes <qbarnes@urbana.css.mot.com> writes:
>
>> [...]  That doesn't matter.  Just having the kprobes and kretprobes
>> data structures being in module memory is all that matters.  [...]
>
>I recall thinking of extending the systemtap startup code sequence to
>force mapping of the entire module data/text memory, hoping to prevent
>vmalloc faults during execution.  Does this sound like a plausible
>hack around such problems?

That would be a workaround for the bug for Systemtap modules, but it's
a Kprobe implementation problem, not a Systemtap problem.  I'd rather
not hide the problem, but to see it get addressed directly like it has
on x86_64 and i386.

>- FChE

Quentin

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: notify_page_fault() problem
  2007-05-01 15:40         ` Frank Ch. Eigler
  2007-05-01 15:43           ` Quentin Barnes
@ 2007-05-02 10:33           ` Andi Kleen
  1 sibling, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2007-05-02 10:33 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Quentin Barnes, Andi Kleen, systemtap

On Tue, May 01, 2007 at 11:40:29AM -0400, Frank Ch. Eigler wrote:
> Quentin Barnes <qbarnes@urbana.css.mot.com> writes:
> 
> > [...]  That doesn't matter.  Just having the kprobes and kretprobes
> > data structures being in module memory is all that matters.  [...]
> 
> I recall thinking of extending the systemtap startup code sequence to
> force mapping of the entire module data/text memory, hoping to prevent
> vmalloc faults during execution.  Does this sound like a plausible
> hack around such problems?

You can't do that from a module on x86. That is why vmalloc_sync_all() exist.

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2007-05-02 10:33 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-30 20:19 notify_page_fault() problem Quentin Barnes
2007-04-30 20:25 ` Andi Kleen
2007-04-30 21:15   ` Quentin Barnes
2007-05-01  2:57     ` Andi Kleen
2007-05-01  3:24       ` Quentin Barnes
2007-05-01 15:40         ` Frank Ch. Eigler
2007-05-01 15:43           ` Quentin Barnes
2007-05-02 10:33           ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).