public inbox for systemtap@sourceware.org
* tracking memory map changes
@ 2008-05-30 20:50 David Smith
  2008-05-30 22:35 ` Jim Keniston
  2008-06-02  1:59 ` Frank Ch. Eigler
  0 siblings, 2 replies; 5+ messages in thread
From: David Smith @ 2008-05-30 20:50 UTC (permalink / raw)
  To: Systemtap List

One of the requirements of doing user-space probing is being able to
follow user-space program memory map changes.  I've been poking around
and I thought I'd share what I found and the possibilities of where to
go from here.

I did all of the following on f8, looking at the 2.6.24.4 kernel source.
 I took a peek at the source for 2.6.26, but didn't see any drastic
differences in this area.

Let me start by describing what a memory map is composed of.  A thread's
memory map is described in the kernel by a struct mm_struct.  Each
individual piece is described by a struct vm_area_struct.  Both
structures' definitions can be found in include/linux/mm_types.h.

The easiest way to look at a memory map is by looking at /proc/PID/maps.
On an x86 f8 system, I ran /bin/cat, and here's what you would see in
/proc/PID/maps for /bin/cat (note that I added the column headers):

vm_start-vm_end  flags vm_pgoff MJ:MN inode      path
----------------- ---- -------- ----- -------    -----------------
00110000-00111000 r-xp 00110000 00:00 0          [vdso]
00655000-00670000 r-xp 00000000 fd:00 3080583    /lib/ld-2.7.so
00670000-00671000 r-xp 0001a000 fd:00 3080583    /lib/ld-2.7.so
00671000-00672000 rwxp 0001b000 fd:00 3080583    /lib/ld-2.7.so
00674000-007c7000 r-xp 00000000 fd:00 3083040    /lib/libc-2.7.so
007c7000-007c9000 r-xp 00153000 fd:00 3083040    /lib/libc-2.7.so
007c9000-007ca000 rwxp 00155000 fd:00 3083040    /lib/libc-2.7.so
007ca000-007cd000 rwxp 007ca000 00:00 0
08048000-0804d000 r-xp 00000000 fd:00 2621473    /bin/cat
0804d000-0804e000 rw-p 00004000 fd:00 2621473    /bin/cat
08e7e000-08e9f000 rw-p 08e7e000 00:00 0
b7dbe000-b7fbe000 r--p 00000000 fd:00 2526405    /usr/lib/locale/locale-archive
b7fbe000-b7fc0000 rw-p b7fbe000 00:00 0
bfa14000-bfa29000 rw-p bffea000 00:00 0          [stack]

The vm_start, vm_end, flags, and vm_pgoff columns are fields straight
out of vm_area_struct.  The 'MJ:MN' header denotes the MAJOR and MINOR
numbers of the inode's device.  Both the inode and device come from
looking at vma->vm_file.  vm_pgoff is the offset within the associated
vm_file where this vm_area_struct starts.
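
To make the column layout above concrete, here is a minimal sketch of a
parser for one /proc/PID/maps line.  The names MapEntry and
parse_maps_line are made up for illustration and are not part of
systemtap or the kernel.  It assumes the offset column is a byte offset
into the file (as far as I can tell, the kernel prints vm_pgoff shifted
by the page size) and that the path is optional.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    """One line of /proc/PID/maps (hypothetical helper type)."""
    vm_start: int
    vm_end: int
    flags: str
    offset: int   # offset column, assumed to be a byte offset into the file
    major: int
    minor: int
    inode: int
    path: str     # empty string for anonymous mappings


def parse_maps_line(line: str) -> MapEntry:
    # The path is optional and may contain spaces, so split at most 5 times.
    fields = line.split(None, 5)
    addr_range, flags, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr_range.split("-"))
    major, minor = (int(x, 16) for x in dev.split(":"))
    return MapEntry(start, end, flags, int(offset, 16),
                    major, minor, int(inode), path)


entry = parse_maps_line(
    "00674000-007c7000 r-xp 00000000 fd:00 3083040    /lib/libc-2.7.so")
print(entry.path, hex(entry.vm_start), hex(entry.vm_end), entry.flags)
```

Lines without a path (anonymous mappings) come back with an empty path
field, which is one way a consumer could tell file-backed regions from
the rest.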

At first I was confused by multiple vm_area_structs for /lib/ld-2.7.so,
/lib/libc-2.7.so and /bin/cat, until I realized they were for the .text,
.data, and .bss sections of those files.  According to the 'size'
command, /bin/cat has no .bss section, so it only has 2 vm_area_structs.

Note that there are no explicit flags set on a vm_area_struct for the
differences between sections - in other words, there is nothing that
definitively says that this particular vm_area_struct maps a .text
section vs. a .data section vs. a .bss section.  A guess could be made
that the first (vm_pgoff == 0), read-only, executable, vm_area_struct
associated with a particular file is probably the .text section.  .bss
sections are the only writable sections out of the three.
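
The guess described above could be sketched roughly as follows.  This is
only the heuristic from the paragraph (first mapping of a file with
offset 0 that is readable, executable and not writable), not a reliable
way to identify a .text section.  The function name and the tuple layout
(vm_start, vm_end, flags, offset, path) are assumptions made for this
example.

```python
def guess_text_mappings(entries):
    """Return {path: (vm_start, vm_end)} for the first offset-0, r-x
    mapping of each file.  Heuristic only; it may guess wrong."""
    text = {}
    for start, end, flags, offset, path in entries:
        if not path or path.startswith("["):
            continue  # anonymous mapping or special region such as [vdso]
        if offset == 0 and flags.startswith("r-x") and path not in text:
            text[path] = (start, end)
    return text


# A few rows taken from the /bin/cat listing above.
sample = [
    (0x00110000, 0x00111000, "r-xp", 0x00110000, "[vdso]"),
    (0x00655000, 0x00670000, "r-xp", 0x00000000, "/lib/ld-2.7.so"),
    (0x00670000, 0x00671000, "r-xp", 0x0001a000, "/lib/ld-2.7.so"),
    (0x00674000, 0x007c7000, "r-xp", 0x00000000, "/lib/libc-2.7.so"),
    (0x007ca000, 0x007cd000, "rwxp", 0x007ca000, ""),
    (0x08048000, 0x0804d000, "r-xp", 0x00000000, "/bin/cat"),
    (0x0804d000, 0x0804e000, "rw-p", 0x00004000, "/bin/cat"),
]

for path, (start, end) in guess_text_mappings(sample).items():
    print(path, hex(start), hex(end))
```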

Frank, here are some initial questions.

Q1: What information will the runtime need from each vm_area_struct?
I'd guess the path, vm_start, and vm_end at a minimum.

Q2: Will the runtime want to know only about new text sections being
added or all sections?

Q3: Will the runtime want to know about any of the vm_area_structs not
associated with a file?

When /bin/cat gets exec'ed, the /lib/ld-2.7.so and /bin/cat files are
already mapped in.  As /bin/cat runs, it loads in /lib/libc-2.7.so.
This means that we've got 2 related problems: enumerating the sections
when first attaching to a thread (either by being exec'ed or by
attaching to an existing thread) then tracking memory map changes as
they occur (as in loading /lib/libc-2.7.so or by a thread calling dlopen()).

Enumerating the existing vm_area_structs seems easy enough.  Tracking
new vm_area_structs as they get added is harder.  Finding the right
point and the right method is the problem.

The sys_mmap2() system call is a wrapper around
mm/mmap.c:do_mmap_pgoff(). do_mmap_pgoff() does lots of error checking,
then calls mm/mmap.c:mmap_region() to actually add a new vm_area_struct.
 Toward the end of mmap_region(), vm_stat_account() is called (if
CONFIG_PROC_FS is on).

So, where/how to track memory map changes?  Here are a few ideas:

1) Set a kretprobe on sys_mmap2()/do_mmap_pgoff()/mmap_region().  One
problem here is that kretprobes are limited in quantity.

2) Set a kprobe on vm_stat_account().  This would require that the
kernel was configured with CONFIG_PROC_FS and that vm_stat_account() is
getting called in the correct place in all the kernels we're interested in.

3) Turn on utrace syscall return tracing for that thread and wait for
mmap calls to return.  This is probably the easiest route, but it forces
every syscall for that thread to go through the slow path.  A big
advantage here is that an all utrace solution wouldn't require any
debugging info for the kernel to be present on the system.

Does anyone have any better ideas or preferences here?

In all of the above methods the code won't know what was added, just
that a new vm_area_struct might exist, so I'll have to figure out a way
to track changes.
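
One possible way to track changes when all we know is that something
might have changed is to keep the previous snapshot of the map and
compare it with a fresh one.  The sketch below does that in user-space
terms with plain sets.  The function name and the tuple layout
(vm_start, vm_end, flags, offset, path) are assumptions for
illustration; real code in the runtime would walk the vm_area_structs
instead.

```python
def diff_maps(before, after):
    """Compare two snapshots of a memory map.

    Each snapshot is an iterable of (vm_start, vm_end, flags, offset, path)
    tuples.  Returns (added, removed) as sorted lists.
    """
    old, new = set(before), set(after)
    return sorted(new - old), sorted(old - new)


# Snapshot right after exec: only ld.so and the program are mapped.
at_exec = [
    (0x00655000, 0x00670000, "r-xp", 0x00000000, "/lib/ld-2.7.so"),
    (0x08048000, 0x0804d000, "r-xp", 0x00000000, "/bin/cat"),
]
# Snapshot after an mmap returned: libc has appeared.
after_mmap = at_exec + [
    (0x00674000, 0x007c7000, "r-xp", 0x00000000, "/lib/libc-2.7.so"),
]

added, removed = diff_maps(at_exec, after_mmap)
for start, end, flags, offset, path in added:
    print("added", hex(start), hex(end), flags, path)
```

A region that is resized or whose permissions change shows up here as
one removal plus one addition, which may or may not be what a consumer
wants.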

Finally, the shortest path to something somewhat useful would be to
first work on providing notification of existing vm_area_structs.  This
might help move user-space tracing along while I work on the harder
problem of tracking memory map changes.  Providing notification of
existing vm_area_structs might allow attaching to an existing thread
(which already has all its shared libraries loaded and doesn't call
dlopen()) and being able to figure out the right address to probe.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: tracking memory map changes
  2008-05-30 20:50 tracking memory map changes David Smith
@ 2008-05-30 22:35 ` Jim Keniston
  2008-05-30 22:41   ` David Smith
  2008-06-02  1:59 ` Frank Ch. Eigler
  1 sibling, 1 reply; 5+ messages in thread
From: Jim Keniston @ 2008-05-30 22:35 UTC (permalink / raw)
  To: David Smith; +Cc: Systemtap List

On Fri, 2008-05-30 at 11:42 -0500, David Smith wrote:
> One of the requirements of doing user-space probing is being able to
> follow user-space program memory map changes.  I've been poking around
> and I thought I'd share what I found and the possibilities of where to
> go from here.
> 
> I did all of the following on f8, looking at the 2.6.24.4 kernel source.
>  I took a peek at the source for 2.6.26, but didn't see any drastic
> differences in this area.
> 
> Let me start by describing what a memory map is composed of.

...

Yes, your description is consistent with my understanding.

> Frank, here are some initial questions.
> 
> Q1: What information will the runtime need from each vm_area_struct?
> I'd guess the path, vm_start, and vm_end at a minimum.
> 
> Q2: Will the runtime want to know only about new text sections being
> added or all sections?

It'd also be good to watch for vm areas going away.  I think Anil
Keshavamurthy mentioned that this can happen for things like Java.
Uprobes can theoretically get into trouble if you register a uprobe,
then the underlying vma gets unmapped, then a new vma gets mapped in the
same address range.  We should catch the munmap/mremap and unregister
any associated uprobes.
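
A rough sketch of the bookkeeping this implies: when an unmap of
[start, start + length) is observed, find every registered probe address
inside that range so it can be unregistered.  Everything here (the
function name, the probe addresses) is hypothetical and only illustrates
the range check; it does not use the uprobes API.

```python
def probes_to_unregister(probe_addrs, unmap_start, unmap_len):
    """Return the probe addresses that fall inside an unmapped range."""
    unmap_end = unmap_start + unmap_len
    return sorted(a for a in probe_addrs if unmap_start <= a < unmap_end)


# Hypothetical probe addresses: one inside the libc text mapping from the
# /bin/cat listing (00674000-007c7000), one inside /bin/cat itself.
registered = {0x00680000, 0x08049000}

stale = probes_to_unregister(registered, 0x00674000, 0x153000)
print([hex(a) for a in stale])
```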
  
> 
> Q3: Will the runtime want to know about any of the vm_area_structs not
> associated with a file?
> 
> When /bin/cat gets exec'ed, the /lib/ld-2.7.so and /bin/cat files are
> already mapped in.  As /bin/cat runs, it loads in /lib/libc-2.7.so.
> This means that we've got 2 related problems: enumerating the sections
> when first attaching to a thread (either by being exec'ed or by
> attaching to an existing thread) then tracking memory map changes as
> they occur (as in loading /lib/libc-2.7.so or by a thread calling dlopen()).
> 
> Enumerating the existing vm_area_structs seems easy enough.  Tracking
> new vm_area_structs as they get added is harder.  Finding the right
> point and the right method is the problem.
> 
> The sys_mmap2() system call is a wrapper around
> mm/mmap.c:do_mmap_pgoff(). do_mmap_pgoff() does lots of error checking,
> then calls mm/mmap.c:mmap_region() to actually add a new vm_area_struct.
>  Toward the end of mmap_region(), vm_stat_account() is called (if
> CONFIG_PROC_FS is on).
> 
> So, where/how to track memory map changes?  Here are a few ideas:
> 
> 1) Set a kretprobe on sys_mmap2()/do_mmap_pgoff()/mmap_region().  One
> problem here is that kretprobes are limited in quantity.

Kretprobe_instances are preallocated per-kretprobe, so they're the
limited quantity.  But quantity is a concern only if there are multiple
instances of that particular probed function running concurrently.  We
can play with maxactive to come up with a "safe" number... then triple
that (or make it a -D option ;-)).

A problem with a k[ret]probe is that the handler can't block, so that
limits what we can do.
 
> 
> 2) Set a kprobe on vm_stat_account().  This would require that the
> kernel was configured with CONFIG_PROC_FS and that vm_stat_account() is
> getting called in the correct place in all the kernels we're interested in.
> 
> 3) Turn on utrace syscall return tracing for that thread and wait for
> mmap calls to return.  This is probably the easiest route, but it forces
> every syscall for that thread to go through the slow path.  A big
> advantage here is that an all utrace solution wouldn't require any
> debugging info for the kernel to be present on the system.

Yes.

> 
> Does anyone have any better ideas or preferences here?

"mmap change event" is already on Roland's utrace TODO list.  That'd be
my preference.  It'd be lightweight, and you could block if necessary to
do your magic when notified.  You obviously know the terrain well, so
maybe you could funnel that work into the utrace enhancement.

> 
> In all of the above methods the code won't know what was added, just
> that a new vm_area_struct might exist, so I'll have to figure out a way
> to track changes.
> 
> Finally, the shortest path to something somewhat useful would be to
> first work on providing notification of existing vm_area_structs.  This
> might help move user-space tracing along while I work on the harder
> problem of tracking memory map changes.  Providing notification of
> existing vm_area_structs might allow attaching to an existing thread
> (which already has all its shared libraries loaded and doesn't call
> dlopen()) and being able to figure out the right address to probe.
> 

You don't explicitly say so, but I infer from this that you're making
progress on probing shlibs.  That'll be huge.

Jim


* Re: tracking memory map changes
  2008-05-30 22:35 ` Jim Keniston
@ 2008-05-30 22:41   ` David Smith
  2008-06-02  8:07     ` Frank Ch. Eigler
  0 siblings, 1 reply; 5+ messages in thread
From: David Smith @ 2008-05-30 22:41 UTC (permalink / raw)
  To: Jim Keniston; +Cc: Systemtap List

Jim Keniston wrote:
> On Fri, 2008-05-30 at 11:42 -0500, David Smith wrote:
>> One of the requirements of doing user-space probing is being able to
>> follow user-space program memory map changes.  I've been poking around
>> and I thought I'd share what I found and the possibilities of where to
>> go from here.
>>
>> I did all of the following on f8, looking at the 2.6.24.4 kernel source.
>>  I took a peek at the source for 2.6.26, but didn't see any drastic
>> differences in this area.
>>
>> Let me start by describing what a memory map is composed of.
> 
> ...
> 
> Yes, your description is consistent with my understanding.
> 
>> Frank, here are some initial questions.
>>
>> Q1: What information will the runtime need from each vm_area_struct?
>> I'd guess the path, vm_start, and vm_end at a minimum.
>>
>> Q2: Will the runtime want to know only about new text sections being
>> added or all sections?
> 
> It'd also be good to watch for vm areas going away.  I think Anil
> Keshavamurthy mentioned that this can happen for things like Java.
> Uprobes can theoretically get into trouble if you register a uprobe,
> then the underlying vma gets unmapped, then a new vma gets mapped in the
> same address range.  We should catch the munmap/mremap and unregister
> any associated uprobes.

If CONFIG_PROFILING is on, sys_munmap() has a hook built into it that we
might be able to use.  There is a routine called profile_munmap() that
ends up using the blocking_notifier_chain kernel facility.  Of course
we'd have to see if we can do this without bothering profiling.  (I'm
not sure this will work at all, but it warrants further investigating.)

If that won't work, we'll have to fall back to something similar to what
we decide to do for mmap tracing.

>> So, where/how to track memory map changes?  Here are a few ideas:
>>
>> 1) Set a kretprobe on sys_mmap2()/do_mmap_pgoff()/mmap_region().  One
>> problem here is that kretprobes are limited in quantity.
> 
> Kretprobe_instances are preallocated per-kretprobe, so they're the
> limited quantity.  But quantity is a concern only if there are multiple
> instances of that particular probed function running concurrently.  We
> can play with maxactive to come up with a "safe" number... then triple
> that (or make it a -D option ;-)).
> 
> A problem with a k[ret]probe is that the handler can't block, so that
> limits what we can do.

Ah, OK.

>> 2) Set a kprobe on vm_stat_account().  This would require that the
>> kernel was configured with CONFIG_PROC_FS and that vm_stat_account() is
>> getting called in the correct place in all the kernels we're interested in.
>>
>> 3) Turn on utrace syscall return tracing for that thread and wait for
>> mmap calls to return.  This is probably the easiest route, but it forces
>> every syscall for that thread to go through the slow path.  A big
>> advantage here is that an all utrace solution wouldn't require any
>> debugging info for the kernel to be present on the system.
> 
> Yes.
> 
>> Does anyone have any better ideas or preferences here?
> 
> "mmap change event" is already on Roland's utrace TODO list.  That'd be
> my preference.  It'd be lightweight, and you could block if necessary to
> do your magic when notified.  You obviously know the terrain well, so
> maybe you could funnel that work into the utrace enhancement.

You are right, I'd forgotten this one.  This would certainly be better
than the other methods listed.  The biggest problem with this one is
that it would require kernel changes (which would probably mean we
couldn't use it with older kernels).

>> In all of the above methods the code won't know what was added, just
>> that a new vm_area_struct might exist, so I'll have to figure out a way
>> to track changes.
>>
>> Finally, the shortest path to something somewhat useful would be to
>> first work on providing notification of existing vm_area_structs.  This
>> might help move user-space tracing along while I work on the harder
>> problem of tracking memory map changes.  Providing notification of
>> existing vm_area_structs might allow attaching to an existing thread
>> (which already has all its shared libraries loaded and doesn't call
>> dlopen()) and being able to figure out the right address to probe.
>>
> 
> You don't explicitly say so, but I infer from this that you're making
> progress on probing shlibs.  That'll be huge.

That's the goal.  I'm just trying to figure out how to get there.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: tracking memory map changes
  2008-05-30 20:50 tracking memory map changes David Smith
  2008-05-30 22:35 ` Jim Keniston
@ 2008-06-02  1:59 ` Frank Ch. Eigler
  1 sibling, 0 replies; 5+ messages in thread
From: Frank Ch. Eigler @ 2008-06-02  1:59 UTC (permalink / raw)
  To: David Smith; +Cc: Systemtap List

David Smith <dsmith@redhat.com> writes:

> [...]
> Let me start by describing what a memory map is composed of.

Thanks for getting started with this.

> vm_start-vm_end  flags vm_pgoff MJ:MN inode      path
> ----------------- ---- -------- ----- -------    -----------------
> 00110000-00111000 r-xp 00110000 00:00 0          [vdso]
> [...]
> 00674000-007c7000 r-xp 00000000 fd:00 3083040    /lib/libc-2.7.so
> 007c7000-007c9000 r-xp 00153000 fd:00 3083040    /lib/libc-2.7.so
> 007c9000-007ca000 rwxp 00155000 fd:00 3083040    /lib/libc-2.7.so
> [...]
> 08048000-0804d000 r-xp 00000000 fd:00 2621473    /bin/cat
> 0804d000-0804e000 rw-p 00004000 fd:00 2621473    /bin/cat
> [...]
> At first I was confused by multiple vm_area_structs for /lib/ld-2.7.so,
> /lib/libc-2.7.so and /bin/cat, until I realized they were for the .text,
> .data, and .bss sections of those files.  [...]

Not necessarily: there can exist rw- mappings of the same areas that
are later (or even concurrently) mapped r-x.  BSS regions in
particular aren't even really mapped in from a given binary because
they're not present in there in the first place.

  eu-readelf -S FILE: Type == NOBITS

> Note that there are no explicit flags set on a vm_area_struct for the
> differences between sections - in other words, there is nothing that
> definitively says that this particular vm_area_struct maps a .text
> section vs. a .data section vs. a .bss section.  [...]

That's OK - the kernel doesn't care.  The vm_pgoff value tells us
which page of the underlying ELF file is being mapped.  The translator
will need to pass enough data to the runtime to figure out that, e.g.,
page 0x153000 of libc-2.7.so refers to its text segment.

  eu-readelf -l FILE: Offset
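
A minimal sketch of the lookup described above, under stated
assumptions: the segment table stands in for the Offset, FileSiz and
VirtAddr columns that eu-readelf -l prints, the values below are
illustrative and were not read from a real binary, and the offset taken
from /proc/PID/maps is treated as a byte offset into the file.  The
names Segment, segment_for_offset and runtime_address are hypothetical.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    name: str     # label for illustration only
    offset: int   # file offset where the segment starts
    filesz: int   # number of bytes the segment occupies in the file
    vaddr: int    # link-time virtual address of the segment


def segment_for_offset(segments, file_offset) -> Optional[Segment]:
    """Return the segment whose file range contains file_offset, if any."""
    for seg in segments:
        if seg.offset <= file_offset < seg.offset + seg.filesz:
            return seg
    return None


def runtime_address(vm_start, map_offset, file_offset):
    """Address at which file_offset is visible in a mapping that starts at
    vm_start and maps the file beginning at map_offset."""
    return vm_start + (file_offset - map_offset)


# Illustrative segment table (made-up sizes, not from a real /bin/cat).
segments = [
    Segment("text", 0x0000, 0x4800, 0x08048000),
    Segment("data", 0x4800, 0x0300, 0x0804d800),
]

seg = segment_for_offset(segments, 0x1234)
addr = runtime_address(0x08048000, 0x0, 0x1234)
print(seg.name if seg else None, hex(addr))
```

For a shared library the mapping address differs from the link-time
address, so the subtraction against the mapping's own offset is what
makes the result independent of where the file was loaded.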

> Frank, here are some initial questions.
>
> Q1: What information will the runtime need from each vm_area_struct?
> I'd guess the path, vm_start, and vm_end at a minimum.

And vm_pgoff.

> Q2: Will the runtime want to know only about new text sections being
> added or all sections?

For now, the text stuff (which, for your purposes, may be those pages
that are mapped in with "x" (execute) privileges).  Before long though
I'd like to give the runtime a map of the programs' *data* also, so
that data pointers can be mapped to data symbols.  That would perhaps
allow us to profile "frequently accessed variables".

> Q3: Will the runtime want to know about any of the vm_area_structs not
> associated with a file?

For now, probably not.  This should not be a hard decision though.

> When /bin/cat gets exec'ed, the /lib/ld-2.7.so and /bin/cat files are
> already mapped in. [...]

Yup, and we definitely want to know about those.

> So, where/how to track memory map changes?  Here are a few ideas:
> [...]  3) Turn on utrace syscall return tracing for that thread and
> wait for mmap calls to return.  This is probably the easiest route,
> but it forces every syscall for that thread to go through the slow
> path.  [...]

Let's do this for now.

> In all of the above methods the code won't know what was added, just
> that a new vm_area_struct might exist, so I'll have to figure out a
> way to track changes. [...]

Right.


- FChE


* Re: tracking memory map changes
  2008-05-30 22:41   ` David Smith
@ 2008-06-02  8:07     ` Frank Ch. Eigler
  0 siblings, 0 replies; 5+ messages in thread
From: Frank Ch. Eigler @ 2008-06-02  8:07 UTC (permalink / raw)
  To: David Smith; +Cc: Jim Keniston, Systemtap List

David Smith <dsmith@redhat.com> writes:

> [...]
>> It'd also be good to watch for vm areas going away.  [...]

Absolutely.  We must do that.

> If CONFIG_PROFILING is on, sys_munmap() has a hook built into it
> that we might be able to use.  [...]  If that won't work, we'll have
> to fall back to something similar to what we decide to do for mmap
> tracing.

Considering that we'll already intercept mmap, we might as well
intercept munmap for the same processes using the same mechanism.

- FChE
