public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* task_finder2 help
@ 2011-08-23 16:24 David Smith
  2011-08-23 18:31 ` Josh Stone
  0 siblings, 1 reply; 3+ messages in thread
From: David Smith @ 2011-08-23 16:24 UTC (permalink / raw)
  To: systemtap

I've been working on replacing the task_finder's use of utrace.  You can
see this work on the dsmith/task_finder2 branch.  How does this work?
I'm using 4 tracepoints and one ftrace function hook to monitor task
fork/exit/exec and syscall entry & syscall exit.

On F15 (kernel 2.6.40-4.fc15.x86_64.debug) what's checked in works
great, all utrace tests pass.  Note that the code isn't finished, there
is a known memory leak for example, but it seems to work.

However, on rawhide (3.1.0-0.rc1.git6.1.fc17.x86_64), the kernel
complains mightily about sleeping functions called from invalid contexts.

There are a couple of classes of these bugs (after working through some
semi-minor issues):

- Use of get_task_mm()/mmput().  There are several places in
runtime/task_finder2.c where we call get_task_mm() to get a task's mm,
we do something with it (like get the task's pathname), then call
mmput() to release the mm.  (A couple of places we just check to see if
the task has a mm, in that case calling get_task_mm()/mmput() is
probably overkill.)

Looking at the source to get_task_mm() (in kernel/fork.c), you'll see
that it locks the task, increments an atomic variable, then unlocks the
task.  The problem is in mmput().  mmput() calls might_sleep()
unconditionally.  Why?  Because when the atomic reaches 0, all the mm's
data is freed.

So, how to work around this?  For now I've been just locking the task,
hoping that nothing will change the mm.  If the task is a single
threaded task, that assumption might be reasonable.  The task is stopped
anyway at the various points I probe (fork/exec/exit/syscalls).
However, if the task is part of a multi-threaded program, that
assumption probably isn't true.  Yes, this task is stopped, but another
running task in the same group could possibly change the mm.

Anyone got any ideas here?

- Memory allocation.  All memory allocation in the task_finder is done
using GFP_KERNEL (with perhaps some additional other flags).  Most of
the allocation is done to store PATH_MAX buffers, to be used when we're
looking at the pathname of a task.  Even as far back as the 2.6.9
kernel, GFP_KERNEL has been defined as (__GFP_WAIT | __GFP_IO |
__GFP_FS).  If __GFP_WAIT is set in the flags passed to kmalloc(),
kmalloc() calls might_sleep().

So, the obvious solution is to change the flags we pass from GFP_KERNEL
to (GFP_KERNEL & ~__GFP_WAIT).  That seems to avoid the BUG, but I'm
unsure of the ramifications (other than the obvious one that more memory
allocations could fail).  It could be that we should have been turning
off __GFP_WAIT even before now.

Does anyone know of pitfalls of turning off __GFP_WAIT?

Thanks for the help.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: task_finder2 help
  2011-08-23 16:24 task_finder2 help David Smith
@ 2011-08-23 18:31 ` Josh Stone
  2011-08-23 19:19   ` David Smith
  0 siblings, 1 reply; 3+ messages in thread
From: Josh Stone @ 2011-08-23 18:31 UTC (permalink / raw)
  To: David Smith; +Cc: systemtap

On 08/23/2011 09:23 AM, David Smith wrote:
> I've been working on replacing the task_finder's use of utrace.  You can
> see this work on the dsmith/task_finder2 branch.  How does this work?
> I'm using 4 tracepoints and one ftrace function hook to monitor task
> fork/exit/exec and syscall entry & syscall exit.
> 
> On F15 (kernel 2.6.40-4.fc15.x86_64.debug) what's checked in works
> great, all utrace tests pass.  Note that the code isn't finished, there
> is a known memory leak for example, but it seems to work.
> 
> However, on rawhide (3.1.0-0.rc1.git6.1.fc17.x86_64), the kernel
> complains mightily about sleeping functions called from invalid contexts.

This is probably just because rawhide has more CONFIG_DEBUG flags on.

> There are a couple of classes of these bugs (after working through some
> semi-minor issues):
> 
> - Use of get_task_mm()/mmput().  There are several places in
> runtime/task_finder2.c where we call get_task_mm() to get a task's mm,
> we do something with it (like get the task's pathname), then call
> mmput() to release the mm.  (A couple of places we just check to see if
> the task has a mm, in that case calling get_task_mm()/mmput() is
> probably overkill.)
> 
> Looking at the source to get_task_mm() (in kernel/fork.c), you'll see
> that it locks the task, increments an atomic variable, then unlocks the
> task.  The problem is in mmput().  mmput() calls might_sleep()
> unconditionally.  Why?  Because when the atomic reaches 0, all the mm's
> data is freed.
> 
> So, how to work around this?  For now I've been just locking the task,
> hoping that nothing will change the mm.  If the task is a single
> threaded task, that assumption might be reasonable.  The task is stopped
> anyway at the various points I probe (fork/exec/exit/syscalls).
> However, if the task is part of a multi-threaded program, that
> assumption probably isn't true.  Yes, this task is stopped, but another
> running task in the same group could possibly change the mm.

The get/put are only protecting from having the mm released.  So long as
you're always doing this within the context of a task that owns this mm,
then it seems safe to say that mm_users will always be >0 -- and so the
get/put are really redundant.  You can see many examples in mmap.c  that
don't bother with get/put on current->mm.

I don't think any of this prevents a *change* in the mm though.  If
that's a concern, I believe you need to down_read mmap_sem.

> - Memory allocation.  All memory allocation in the task_finder is done
> using GFP_KERNEL (with perhaps some additional other flags).  Most of
> the allocation is done to store PATH_MAX buffers, to be used when we're
> looking at the pathname of a task.  Even as far back as the 2.6.9
> kernel, GFP_KERNEL has been defined as (__GFP_WAIT | __GFP_IO |
> __GFP_FS).  If __GFP_WAIT is set in the flags passed to kmalloc(),
> kmalloc() calls might_sleep().
> 
> So, the obvious solution is to change the flags we pass from GFP_KERNEL
> to (GFP_KERNEL & ~__GFP_WAIT).  That seems to avoid the BUG, but I'm
> unsure of the ramifications (other than the obvious one that more memory
> allocations could fail).  It could be that we should have been turning
> off __GFP_WAIT even before now.
> 
> Does anyone know of pitfalls of turning off __GFP_WAIT?

The obvious pitfall is that the allocation is more likely to fail, so
you have to be ready to deal with that.

I'm not sure what the full ramifications are for keeping __GFP_IO and
__GFP_IO, but I think GFP_ATOMIC is for exactly these unsleepable
allocations.  Or possibly GFP_NOWAIT to avoid the emergency pool.

If possible, the best choice is to allocate from a safer context, but I
assume you've already ruled that out...

Josh

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: task_finder2 help
  2011-08-23 18:31 ` Josh Stone
@ 2011-08-23 19:19   ` David Smith
  0 siblings, 0 replies; 3+ messages in thread
From: David Smith @ 2011-08-23 19:19 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

On 08/23/2011 01:31 PM, Josh Stone wrote:
> On 08/23/2011 09:23 AM, David Smith wrote:
>> However, on rawhide (3.1.0-0.rc1.git6.1.fc17.x86_64), the kernel
>> complains mightily about sleeping functions called from invalid contexts.
> 
> This is probably just because rawhide has more CONFIG_DEBUG flags on.

True.

>> There are a couple of classes of these bugs (after working through some
>> semi-minor issues):
>>
>> - Use of get_task_mm()/mmput().

...

> The get/put are only protecting from having the mm released.  So long as
> you're always doing this within the context of a task that owns this mm,
> then it seems safe to say that mm_users will always be >0 -- and so the
> get/put are really redundant.  You can see many examples in mmap.c  that
> don't bother with get/put on current->mm.
> 
> I don't think any of this prevents a *change* in the mm though.  If
> that's a concern, I believe you need to down_read mmap_sem.

When actually looking at the mm data, I do down_read mmap_sem, so I
should be OK there.

I think you are right that I'm doing this from with the context of the
task that owns the mm.  The fork/clone tracepoint happens in the context
of the parent, not the child, so I'll need to look at that code and make
sure I'm only looking at the parent's mm.

>> - Memory allocation.

...

>> Does anyone know of pitfalls of turning off __GFP_WAIT?

> I'm not sure what the full ramifications are for keeping __GFP_IO and
> __GFP_IO, but I think GFP_ATOMIC is for exactly these unsleepable
> allocations.  Or possibly GFP_NOWAIT to avoid the emergency pool.

GFP_NOWAIT sounds interesting.  Note that unless I add new functions in
runtime/alloc.c, we'd be changing all systemtap kmalloc's to use the new
flags, so I don't want to do this without proper thought.

> If possible, the best choice is to allocate from a safer context, but I
> assume you've already ruled that out...

There are 2 classes of allocations in the task_finder, both short term:
(1) pathnames, used when matching "interesting" task pathnames; and (2)
cached vma information.  The new tracepoint-based utrace code makes much
longer-term allocations, for engines and main structures whose lifetime
is that of the task itself (assuming the systemtap script doesn't exit
first).

If you've got ideas on how to allocate any of the above from a safer
context, I'm all ears.


Thanks for the advice.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2011-08-23 19:19 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-23 16:24 task_finder2 help David Smith
2011-08-23 18:31 ` Josh Stone
2011-08-23 19:19   ` David Smith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).