public inbox for systemtap@sourceware.org
* Looking for recommendation for using SystemTap
From: Tony Reix @ 2006-09-29 13:02 UTC
  To: systemtap

Hi,

I'm getting several Oopses while running tests of an application which
has:
 - one patch applied to the kernel
 - one kernel module
Analysis of the Oopses clearly shows that "someone" writes strings
(like "ata" or "ejbo") randomly into memory and destroys links in
structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
ulp->proc_list used by lookup_undo in ipc/sem.c.
Maybe my code is the culprit, or not.

Do you think SystemTap can help me find the culprit?
If yes, do you have recommendations and proposals about how to use
SystemTap for that goal?
Can you point me to documentation covering the basics of using
SystemTap in practice?

Thanks,

Tony


* Re: Looking for recommendation for using SystemTap
From: Frank Ch. Eigler @ 2006-09-29 17:00 UTC
  To: Tony Reix; +Cc: systemtap

Tony Reix <tony.reix@bull.net> writes:

> [...]
> Analysis of the Oopses clearly shows that "someone" writes strings
> (like "ata" or "ejbo") randomly into memory and destroys links in
> structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
> ulp->proc_list used by lookup_undo in ipc/sem.c.
> [...]
> Do you think SystemTap can help me find the culprit?
> [...]

Perhaps.  Does the memory corruption occur in predictable places?
Imagine a probe that runs periodically (via a frequently triggered
timer, or a breakpoint at a code point under suspicion).  That probe
could look through selected places that are corrupted, and check for
something suspicious.  For example:

  #! stap -g
  probe kernel.function("after_your_function") { if (checkstuff ()) log ("bug") }
  function checkstuff () /* .... */

What checkstuff() does depends on how a program may be able to assess
corruption.  If it's ASCII strings showing up within known regions of
valid memory, something like this naive search could do it.  (Such a
function could be encapsulated into the systemtap tapset library.)

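  function checkstuff:long () %{
    /* Placeholder bounds; substitute the region under suspicion. */
    char *begin = (char *) 0xdeadbeef;
    char *end = (char *) 0xdeadf00d;
    int found = 0;
    char *p;
    /* p + 3 <= end, so a match in the last three bytes is not missed */
    for (p = begin; p + 3 <= end; p++)
      if (p[0] == 'a' && p[1] == 't' && p[2] == 'a') { found = 1; break; }
    THIS->__retval = found; /* STAP_RETVALUE = found; in SystemTap 1.8+ */
  %}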
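
The naive search above hard-codes one three-byte pattern.  As a rough
sketch of the same idea in plain C, the scan can be written as a generic
helper that takes any pattern (for instance "ejbo") and can be exercised
in user space before being adapted into an embedded-C function.  The name
find_pattern and its signature are assumptions made for illustration;
they are not part of SystemTap or its tapset library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the offset of the first occurrence of
 * pat (len bytes) inside [buf, buf + size), or -1 if it is absent. */
static long find_pattern(const char *buf, size_t size,
                         const char *pat, size_t len)
{
    size_t i;

    if (len == 0 || len > size)
        return -1;
    /* i + len <= size keeps every comparison inside the buffer while
     * still testing a match that ends on the very last byte. */
    for (i = 0; i + len <= size; i++)
        if (memcmp(buf + i, pat, len) == 0)
            return (long) i;
    return -1;
}
```

With such a helper, the body of checkstuff() would reduce to testing
whether find_pattern(begin, end - begin, "ata", 3) is non-negative.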

Later, we will have hardware-assisted watchpoint probes that hit when
a designated area of memory is read and/or written.  That could narrow
the culprits down even further.  This might look something like:

  probe kernel.watch.from(0xdeadbeef).to(0xdeadf00d).string("ata")
    { log ("bug") }


Anyway, this all depends on being able to characterize the corruption
well enough that a routine could be written to safely check for it.
If you don't have even that much information, very drastic measures
may be necessary (such as running the kernel under a simulator or
debugger).


- FChE


* Re: Looking for recommendation for using SystemTap
From: Vara Prasad @ 2006-09-29 17:23 UTC
  To: Tony Reix; +Cc: systemtap

Tony Reix wrote:

>Hi,
>
>I'm getting several Oopses while running tests of an application which
>has:
> - one patch applied to the kernel
> - one kernel module
>Analysis of the Oopses clearly shows that "someone" writes strings
>(like "ata" or "ejbo") randomly into memory and destroys links in
>structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
>ulp->proc_list used by lookup_undo in ipc/sem.c.
>Maybe my code is the culprit, or not.
>
>Do you think SystemTap can help me find the culprit?
>  
>
Yes, it can help you narrow down the areas to look at.

>If yes, do you have recommendations and proposals about how to use
>SystemTap for that goal?
>  
>
A general proposal below.

>Can you point me to documentation covering the basics of using
>SystemTap in practice?
>  
>
Folks have used SystemTap to solve real-life problems, so it is ready for
use. The main documentation you need to look at is the tutorial on the
web and the man pages that come with the install. If you don't find that
sufficient to get going, feel free to send a note to the mailing list and
we will help you ASAP.

Coming to your problem, it looks like you are observing memory
corruption. We don't have the watchpoint feature implemented yet, where
you could say "print the contents of a data structure whenever someone
modifies the contents located at this address".

With the current features, my suggestion would be the following steps:
1) Identify the common code paths you take when you run your workload.
2) Install a few probes in those code paths.
3) Have the probe handler print out the contents of the pointers you
suspect are getting corrupted. If it is a global that is getting
corrupted, it is even easier.

The goal of the above exercise is to first narrow down the places where
the corruption is happening.
If this doesn't work, you could write a probe that fires periodically,
every few milliseconds, and dumps the contents of the suspected data
structures. Here is a skeleton for a timer probe (as written it fires
every 5000 ms; lower the interval as needed):

probe timer.ms(5000) {
        /* code to print your data structures */
}
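
For linked structures, printing the pointers is often enough to spot the
damage by eye, but a handler can also test an invariant directly. The
following is a minimal sketch in plain C, assuming a circular doubly
linked list in the style of the kernel's struct list_head; struct
list_node and check_list are hypothetical names, and a list that is not
doubly linked would need a different invariant. Inside the kernel the
dereferences would also have to be done in a fault-safe way, since a
pointer overwritten with text may not point to mapped memory.

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel-style doubly linked list node. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Walk at most max_nodes entries of a circular list starting at head.
 * Returns 0 if every visited link satisfies node->next->prev == node,
 * the 1-based position of the first node whose next link is broken,
 * or -1 if the list did not loop back to head within max_nodes. */
static int check_list(const struct list_node *head, int max_nodes)
{
    const struct list_node *n = head;
    int i;

    for (i = 1; i <= max_nodes; i++) {
        if (n->next == NULL || n->next->prev != n)
            return i;
        n = n->next;
        if (n == head)
            return 0;
    }
    return -1;
}
```

The max_nodes bound keeps the walk from spinning forever if the
corruption happens to create a cycle that bypasses head.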


Hope this helps.




* Re: Looking for recommendation for using SystemTap
From: David Wilder @ 2006-09-29 20:36 UTC
  To: Tony Reix; +Cc: systemtap

A combination of SystemTap and kdump may help.
My idea is to write a tap script that detects the corruption by
searching for the corruption pattern.  If you know the section of memory
or the list that is often corrupted, you could limit the search area to
speed up the probe.  Then place the probe (maybe a return probe would
work better) at suspected areas of the code.  It may take several tries.
The idea is to get closer to the point where the corruption occurred.
It is possible to probe all kernel functions using a * wildcard, but
that may be overkill.  Your tap script could call panic() to trigger a
crash dump when the corruption is detected.  Searching the active stacks
in the dump may get you closer to the bad code.  You can also look for
patterns across several dumps, like a particular task or subsystem
always being on CPU.  Using that info, move your probe point around.
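
When the damaged field is a pointer, one cheap way to recognise the
reported pattern is to ask whether the pointer's bytes look like text.
A rough heuristic sketch in plain C follows; has_ascii_run and the
min_run threshold are assumptions for illustration, and legitimate
addresses can occasionally contain printable bytes, so a hit is only a
hint and not proof of corruption.

```c
#include <stddef.h>

/* Hypothetical heuristic: return 1 if the in-memory bytes of value
 * contain at least min_run consecutive printable ASCII characters
 * (0x20..0x7e), as happens when a string overwrites a pointer. */
static int has_ascii_run(unsigned long value, int min_run)
{
    const unsigned char *b = (const unsigned char *) &value;
    int run = 0;
    size_t i;

    for (i = 0; i < sizeof value; i++) {
        if (b[i] >= 0x20 && b[i] <= 0x7e) {
            if (++run >= min_run)
                return 1;
        } else {
            run = 0;  /* a non-printable byte breaks the run */
        }
    }
    return 0;
}
```

A detection probe could apply this to each link it walks and only
trigger the crash dump when the check fires.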

I hope this helps.

-- 
David Wilder
IBM Linux Technology Center
Beaverton, Oregon, USA 
dwilder@us.ibm.com
(503)578-3789


* Re: Looking for recommendation for using SystemTap
From: Tony Reix @ 2006-10-02 13:42 UTC
  To: Frank Ch. Eigler; +Cc: systemtap

On Friday, 29 September 2006 at 13:00 -0400, Frank Ch. Eigler wrote:
> Tony Reix <tony.reix@bull.net> writes:
> 
> > [...]
> > Analysis of the Oopses clearly shows that "someone" writes strings
> > (like "ata" or "ejbo") randomly into memory and destroys links in
> > structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
> > ulp->proc_list used by lookup_undo in ipc/sem.c.
> > [...]
> > Do you think SystemTap can help me find the culprit?
> > [...]
> 
> Perhaps.  Does the memory corruption occur in predictable places?
> Imagine a probe that runs periodically (via a frequently triggered
> timer, or a breakpoint at a code point under suspicion).  That probe
> could look through selected places that are corrupted, and check for
> something suspicious.

Up to now, each run (3 of them) has produced an Oops in a different
place (in a different linked list).

Using more options in .config now leads to a crash at the moment the
memory is corrupted. It seems the code I'm trying to test is the culprit!
A suggestion: add some basic SystemTap code to the kernel when these
options are used (memory leak debug, compile kernel with frame ...,
write protect kernel read-only data ...), so that it helps in
understanding which code is writing in the wrong places.


> For example:
> 
>   #! stap -g
>   probe kernel.function("after_your_function") { if (checkstuff ()) log ("bug") }
>   function checkstuff () /* .... */
> 
> What checkstuff() does depends on how a program may be able to assess
> corruption.  If it's ASCII strings showing up within known regions of
> valid memory, something like this naive search could do it.  (Such a
> function could be encapsulated into the systemtap tapset library.)
> 
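>   function checkstuff:long () %{
>     /* Placeholder bounds; substitute the region under suspicion. */
>     char *begin = (char *) 0xdeadbeef;
>     char *end = (char *) 0xdeadf00d;
>     int found = 0;
>     char *p;
>     /* p + 3 <= end, so a match in the last three bytes is not missed */
>     for (p = begin; p + 3 <= end; p++)
>       if (p[0] == 'a' && p[1] == 't' && p[2] == 'a') { found = 1; break; }
>     THIS->__retval = found; /* STAP_RETVALUE = found; in SystemTap 1.8+ */
>   %}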
> 
> Later, we will have hardware-assisted watchpoint probes that hit when
> a designated area of memory is read and/or written.  That could narrow
> the culprits down even further.  This might look something like:
> 
>   probe kernel.watch.from(0xdeadbeef).to(0xdeadf00d).string("ata")
>     { log ("bug") }
> 
> 
> Anyway, this all depends on being able to characterize the corruption
> well enough that a routine could be written to safely check for it.

I think I've got this very important information by recompiling the
kernel with the options I mentioned above (Kernel Hacking).

> If you don't have even that much information, very drastic measures
> may be necessary (such as running the kernel under a simulator or
> debugger).

Yes. We've talked about that with colleagues ... UML, ...
They had to fix bugs in the tools before being able to find their
problem ...

Regards,

Tony
