whitelist for safe-mode probes (or just a better blacklist?)

public inbox for systemtap@sourceware.org

* whitelist for safe-mode probes (or just a better blacklist?)
@ 2006-09-19 16:29 Martin Hunt
  2006-09-20 15:14 ` Frank Ch. Eigler
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Hunt @ 2006-09-19 16:29 UTC
  To: systemtap

There are always going to be small pieces of the kernel where it will be
unsafe to insert a probe. We implemented a blacklist where we can list
functions that are bad to probe; however, it is not well maintained, for
several reasons. One of them is lack of testing. Another is that
sometimes a problem probing a function was due to something in systemtap
that we could fix by removing an unnecessary system dependency. We were
reluctant to add functions to the blacklist until we understood why they
failed. So the current blacklist is not complete, and as kernels change,
the list will have to change with them.  To guarantee a probe will not
crash the kernel it is going to be necessary to generate a whitelist of
probe points.

While this may seem like it would reduce systemtap's usefulness,
remember that we are targeting two very different users. System admins
won't care that they cannot probe the internals of the spinlock code
(for example). They want to know they can do simple things like probe
kernel.function("*") and it won't crash.  Kernel developers will just
use "guru mode" and probe anywhere they want.

An alternative would be to just create a better blacklist and use
thorough testing to guarantee that all other functions in the kernel
work with probes. This seems more difficult to maintain and will add a
new step to releasing each kernel. I think having a whitelist of safe
functions for all 2.6 kernels would require less work and be safer.

How would this all work? The whitelist and blacklist would be files
distributed with Systemtap. They would be updated automatically with a
test script. I think we would not need version checking. One list for
2.6 would probably be OK because functions will be added and deleted
from kernel subversions but they probably won't change from safe to
unsafe.  But if they do, they would need to get removed from the
whitelist for all kernels.

SAFE-MODE - Each probe point must be a kernel function in the whitelist
or a static kernel marker. (If/when markers are widely implemented, we
will be able to do away with the whitelist.)

GURU-MODE - whitelist is ignored. Each kernel function must not be in
the blacklist.  (There should also be an option to ignore the blacklist
for testing.)
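
The two modes can be sketched roughly as follows. This is only an
illustration in Python, not translator code: expand_probe and its
parameters are hypothetical names, the lists are made-up sample data,
and static markers are not modeled.

```python
from fnmatch import fnmatchcase


def expand_probe(pattern, kernel_functions, whitelist, blacklist,
                 guru=False, ignore_blacklist=False):
    """Return the functions a wildcard probe may expand to in each mode."""
    matches = [f for f in kernel_functions if fnmatchcase(f, pattern)]
    if guru:
        # Guru mode: the whitelist is ignored, only the blacklist applies,
        # and even that can be switched off for testing.
        if ignore_blacklist:
            return matches
        return [f for f in matches if f not in blacklist]
    # Safe mode: only functions known to be safe survive the expansion.
    return [f for f in matches if f in whitelist]


# Made-up sample data, for illustration only.
kernel_functions = ["sys_open", "sys_read", "_spin_lock", "do_fork"]
whitelist = {"sys_open", "sys_read"}
blacklist = {"_spin_lock"}

safe = expand_probe("*", kernel_functions, whitelist, blacklist)
guru = expand_probe("*", kernel_functions, whitelist, blacklist, guru=True)
raw = expand_probe("*", kernel_functions, whitelist, blacklist,
                   guru=True, ignore_blacklist=True)
```

With this sample data, safe mode keeps only the two whitelisted
functions, guru mode drops only _spin_lock, and the testing option
returns everything that matched.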

Thoughts?


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-19 16:29 whitelist for safe-mode probes (or just a better blacklist?) Martin Hunt
@ 2006-09-20 15:14 ` Frank Ch. Eigler
  2006-09-20 15:42   ` Martin Hunt
  0 siblings, 1 reply; 9+ messages in thread
From: Frank Ch. Eigler @ 2006-09-20 15:14 UTC
  To: Martin Hunt; +Cc: systemtap

Martin Hunt <hunt@redhat.com> writes:

> [...]  To guarantee a probe will not crash the kernel it is going to
> be necessary to generate a whitelist of probe points.

Sure, except that this guarantee is only as good as the method used to
generate the whitelist.

> [...]  How would this all work? The whitelist and blacklist would be
> files distributed with Systemtap.  They would be updated
> automatically with a test script. [...]

How do you imagine this test script working?  Could it generate a list
roughly matching the "in-our-experience-so-far-safe" set in a
reasonable timeframe?  (It would not be very helpful if it took months
to run, or resulted in a small list.)

- FChE


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-20 15:14 ` Frank Ch. Eigler
@ 2006-09-20 15:42   ` Martin Hunt
  2006-09-20 16:23     ` Vara Prasad
  2006-09-20 18:02     ` David Smith
  0 siblings, 2 replies; 9+ messages in thread
From: Martin Hunt @ 2006-09-20 15:42 UTC
  To: Frank Ch. Eigler; +Cc: systemtap

On Wed, 2006-09-20 at 11:14 -0400, Frank Ch. Eigler wrote:
> Martin Hunt <hunt@redhat.com> writes:
> 
> > [...]  To guarantee a probe will not crash the kernel it is going to
> > be necessary to generate a whitelist of probe points.
> 
> Sure, except that this guarantee is only as good as the method used to
> generate the whitelist.

Of course.

> > [...]  How would this all work? The whitelist and blacklist would be
> > files distributed with Systemtap.  They would be updated
> > automatically with a test script. [...]
> 
> How do you imagine this test script working?  Could it generate a list
> roughly matching the "in-our-experience-so-far-safe" set in a
> reasonable timeframe?  (It would not be very helpful if it took months
> to run, or resulted in a small list.)

I imagine this would be a list, checked into CVS, of functions that have
been tested and have never caused problems.  The only reason to use a
whitelist instead of a blacklist is that we should be paranoid and not
assume, as we do now, that new functions added to the kernel are safely
probeable.

Writing a script to do this testing is not difficult, except for the
problem of lockups, which require a way to reboot a system remotely.
That means assuming either special hardware or a test system running
under a specific virtualization system.  This needs to be done
regardless of what we decide about the need for a whitelist.  I hoped to
provoke some discussion about this.  We've talked about it, but has
anyone actually written any test scripts to test all the kernel
functions this way?

Martin


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-20 15:42   ` Martin Hunt
@ 2006-09-20 16:23     ` Vara Prasad
  2006-09-22  9:43       ` Li Guanglei
  2006-09-20 18:02     ` David Smith
  1 sibling, 1 reply; 9+ messages in thread
From: Vara Prasad @ 2006-09-20 16:23 UTC
  To: Martin Hunt; +Cc: Frank Ch. Eigler, systemtap

Martin Hunt wrote:

> On Wed, 2006-09-20 at 11:14 -0400, Frank Ch. Eigler wrote:
>> Martin Hunt <hunt@redhat.com> writes:
>>
>>> [...]  To guarantee a probe will not crash the kernel it is going to
>>> be necessary to generate a whitelist of probe points.
>>
>> Sure, except that this guarantee is only as good as the method used to
>> generate the whitelist.
>
> Of course.
>
>>> [...]  How would this all work? The whitelist and blacklist would be
>>> files distributed with Systemtap.  They would be updated
>>> automatically with a test script. [...]
>>
>> How do you imagine this test script working?  Could it generate a list
>> roughly matching the "in-our-experience-so-far-safe" set in a
>> reasonable timeframe?  (It would not be very helpful if it took months
>> to run, or resulted in a small list.)
>
> I imagine this would be a list that would be checked into CVS of
> functions that have been tested and never caused problems.  The only
> reason to use a whitelist instead of a blacklist is because we should be
> paranoid and not assume as new functions get added to the kernel, they
> are safely probeable, as we do now.
>
> Writing a script to do this testing is not difficult, except for the
> problems with lockups which require a way to remotely reboot a system.
> This requires we assume the existence of special hardware or that the
> test system is running on a specific virtualization system.  This needs
> done regardless of what we decide about the need for a whitelist.  I
> hoped to provoke some discussion about this.  We've talked about it, but
> has anyone actually written any test scripts to test all the kernel
> functions this way?
>
> Martin
>
If I understand correctly, Martin's goal here is to come up with a list
of functions that we know don't break on a given distribution/kernel.
This list doesn't mean the functions outside the list are not safe; we
just don't know, and we don't want to assume they are safe to probe. We
can start with a simple approach where we maintain this whitelist only
for a few distro releases and the major mainline releases of Linus's
tree (2.6.17, 18, 19, etc.), with no -mm or other git trees and no
release candidates.

It shouldn't be that difficult to use a DWARF library to generate a list
of all exported functions in the kernel. I am focusing on exported
functions first, as their interfaces are more stable than those of some
internal functions, but this method can work on any function. If one of
our tapsets probes a function that is not in that list, we should add
that function as well. Once we have the function names, we generate a
script that puts probes on some percentage of the functions, say 10% at
a time, in a sliding window. It loads the generated module and runs a
standard test like ltp for 10 minutes. The probe handler should print
the name of the function, increment a counter, and also print some
global values like PID, GID, etc. After working through the whole list
of functions, we should then generate a script that puts probes on all
the functions in the whitelist and runs a few standard tests like ltp,
fstest, etc. for 30 minutes to make sure probing all of the functions
together doesn't cause any instability problems.
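
As a rough sketch of the batching step, here is one way to split the
function list and generate a probe script per batch. It is written in
Python purely for illustration; batches and stap_script are hypothetical
helper names, the window is simplified to non-overlapping 10% slices,
and the f0..f19 names stand in for real kernel functions. The handler
uses the tapset functions probefunc(), pid() and gid().

```python
def batches(functions, fraction=0.10):
    """Yield consecutive slices holding about `fraction` of the list each."""
    size = max(1, round(len(functions) * fraction))
    for start in range(0, len(functions), size):
        yield functions[start:start + size]


def stap_script(batch):
    """Build a SystemTap script that probes every function in the batch."""
    points = ",\n      ".join(
        'kernel.function("{}")'.format(f) for f in batch)
    handler = (
        " {\n"
        "    hits[probefunc()]++\n"
        '    printf("%s pid=%d gid=%d\\n", probefunc(), pid(), gid())\n'
        "}\n"
    )
    return "global hits\nprobe " + points + handler


# Placeholder names standing in for the real list of exported functions.
functions = ["f{}".format(i) for i in range(20)]
groups = list(batches(functions))
first_script = stap_script(groups[0])
```

Each generated script would then be loaded while ltp runs for the
10-minute window, and the final pass is simply stap_script() applied to
the whole accumulated whitelist.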

Once we agree upon a format, we can run these tests as part of the
weekly testing we are doing, so we can catch problems early.  Over a
period of a few weeks we can come up with a decent list that we feel
comfortable with. Once we have a big enough safe list, the translator
should by default consult the blacklist and whitelist during wildcard
expansion and expand only to function names from the whitelist. We
should also provide a way to tell the translator "I am testing, don't
restrict me to the whitelist", so that it does the real expansion of
wildcards.

A side effect of this work could be that, after a few weeks of results,
we can identify safe-to-probe routines and perhaps even go ahead and put
some gcc magic macros in the kernel code itself that record, in an ELF
section, which functions are deemed safe to probe. That way, over time,
we may not have to ship a separate whitelist, but that is for the future
(now I am daydreaming :-) ).

Anyone got tomatoes?

bye,
Vara Prasad


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-20 15:42   ` Martin Hunt
  2006-09-20 16:23     ` Vara Prasad
@ 2006-09-20 18:02     ` David Smith
  2006-09-21 22:13       ` David Wilder
  1 sibling, 1 reply; 9+ messages in thread
From: David Smith @ 2006-09-20 18:02 UTC
  To: Martin Hunt; +Cc: Frank Ch. Eigler, systemtap

Martin Hunt wrote:
> On Wed, 2006-09-20 at 11:14 -0400, Frank Ch. Eigler wrote:
>> Martin Hunt <hunt@redhat.com> writes:
>>
>>> [...]  To guarantee a probe will not crash the kernel it is going to
>>> be necessary to generate a whitelist of probe points.
>> Sure, except that this guarantee is only as good as the method used to
>> generate the whitelist.
> 
> Of course.
> 
>>> [...]  How would this all work? The whitelist and blacklist would be
>>> files distributed with Systemtap.  They would be updated
>>> automatically with a test script. [...]
>> How do you imagine this test script working?  Could it generate a list
>> roughly matching the "in-our-experience-so-far-safe" set in a
>> reasonable timeframe?  (It would not be very helpful if it took months
>> to run, or resulted in a small list.)
> 
> I imagine this would be a list that would be checked into CVS of
> functions that have been tested and never caused problems.  The only
> reason to use a whitelist instead of a blacklist is because we should be
> paranoid and not assume as new functions get added to the kernel, they
> are safely probeable, as we do now.
> 
> Writing a script to do this testing is not difficult, except for the
> problems with lockups which require a way to remotely reboot a system.
> This requires we assume the existence of special hardware or that the
> test system is running on a specific virtualization system.  This needs
> done regardless of what we decide about the need for a whitelist.  I
> hoped to provoke some discussion about this.  We've talked about it, but
> has anyone actually written any test scripts to test all the kernel
> functions this way?

I can tell you that while looking into the problems with probing
'kernel.function("*")' on x86 over the last couple of days, I've rebooted
my test system (what seems like) countless times.  I certainly agree
with you that we'll need special hardware (perhaps X10 could be a simple
start) or virtualization to get this going using a script.  I do think
that this testing would be extremely useful, even without a whitelist
feature.

I wonder if we really might need various levels of "whitelists" to 
satisfy customer concerns.  Something like anyone in group A can only 
probe syscalls, users in group B can probe syscalls + exported kernel 
functions, etc.
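
To make the tiers idea concrete, here is a small Python sketch. The
group names, the class labels and the allowed helper are all made up for
illustration; nothing like this exists in systemtap today.

```python
# Hypothetical tiers: each user group may probe a set of function classes.
TIERS = {
    "A": {"syscalls"},
    "B": {"syscalls", "exported"},
}


def allowed(group, function, classes):
    """Return True if `group` may probe `function`.

    `classes` maps a function name to its class label; functions with no
    label, and groups with no tier, are refused.
    """
    return classes.get(function) in TIERS.get(group, set())


# Made-up classification of a few functions.
classes = {"sys_open": "syscalls", "kmalloc": "exported"}

a_can_probe_kmalloc = allowed("A", "kmalloc", classes)
b_can_probe_kmalloc = allowed("B", "kmalloc", classes)
```

Anything unclassified is refused for every group, which matches the
paranoid default being discussed in this thread.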

-- 
David Smith
dsmith@redhat.com
Red Hat, Inc.
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-20 18:02     ` David Smith
@ 2006-09-21 22:13       ` David Wilder
  0 siblings, 0 replies; 9+ messages in thread
From: David Wilder @ 2006-09-21 22:13 UTC
  To: David Smith; +Cc: Martin Hunt, Frank Ch. Eigler, systemtap

David Smith wrote:

> Martin Hunt wrote:
>
>> On Wed, 2006-09-20 at 11:14 -0400, Frank Ch. Eigler wrote:
>>
>>> Martin Hunt <hunt@redhat.com> writes:
>>>
>>>> [...]  To guarantee a probe will not crash the kernel it is going to
>>>> be necessary to generate a whitelist of probe points.
>>>
>>> Sure, except that this guarantee is only as good as the method used to
>>> generate the whitelist.
>>
>>
>> Of course.
>>
>>>> [...]  How would this all work? The whitelist and blacklist would be
>>>> files distributed with Systemtap.  They would be updated
>>>> automatically with a test script. [...]
>>>
>>> How do you imagine this test script working?  Could it generate a list
>>> roughly matching the "in-our-experience-so-far-safe" set in a
>>> reasonable timeframe?  (It would not be very helpful if it took months
>>> to run, or resulted in a small list.)
>>
>>
>> I imagine this would be a list that would be checked into CVS of
>> functions that have been tested and never caused problems.  The only
>> reason to use a whitelist instead of a blacklist is because we should be
>> paranoid and not assume as new functions get added to the kernel, they
>> are safely probeable, as we do now.
>>
>> Writing a script to do this testing is not difficult, except for the
>> problems with lockups which require a way to remotely reboot a system.
>> This requires we assume the existence of special hardware or that the
>> test system is running on a specific virtualization system.  This needs
>> done regardless of what we decide about the need for a whitelist.  I
>> hoped to provoke some discussion about this.  We've talked about it, but
>> has anyone actually written any test scripts to test all the kernel
>> functions this way?
>
>
> I can tell you that looking into the problems probing 
> 'kernel.function("*")' on x86 over the last couple of days I've 
> rebooted my test system (what seems like) countless times.  I 
> certainly agree with you that we'll need special hardware (perhaps x10 
> could be a simple start) or virtualization  to get this going using a 
> script.  I do think that this testing would be extremely useful, even 
> without a whitelist feature.
>
> I wonder if we really might need various levels of "whitelists" to 
> satisfy customer concerns.  Something like anyone in group A can only 
> probe syscalls, users in group B can probe syscalls + exported kernel 
> functions, etc.
>
I would like to chime in.

Let us think of a white list not as a tool to increase systemtap
stability but as a tool to decrease tap script debug time.

If I were a system manager in an environment where my next house payment
depended on system uptime, I would never run any tap script that I had
not fully tested myself, unless it was supplied by my ldp.  Therefore
the white list only helps me in a test environment, by speeding up the
testing of scripts to be used later in production.  In other words, the
white list keeps me from falling into pitfalls by using untested tap
points.  But it won't eliminate finding new pitfalls during my testing.

But thinking about it now, that is the same thing the black list is
doing...

Testing is a good thing, but we should match the effort with the correct
paradigm and work on maintaining just the black list.

-- 
David Wilder
IBM Linux Technology Center
Beaverton, Oregon, USA 
dwilder@us.ibm.com
(503)578-3789


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-20 16:23     ` Vara Prasad
@ 2006-09-22  9:43       ` Li Guanglei
  2006-09-22 15:53         ` Li Guanglei
  0 siblings, 1 reply; 9+ messages in thread
From: Li Guanglei @ 2006-09-22  9:43 UTC
  To: Vara Prasad; +Cc: Martin Hunt, Frank Ch. Eigler, systemtap

Vara Prasad wrote:
> Martin Hunt wrote:
> 
> 
> It shouldn't be that difficult to use DWARF library to generate all 
> exported functions in the kernel. I am only focusing on exported 
> functions first as their interfaces are more stable then some internal 
> functions but this method can work on any function. If there happens to 
> be a function if one of our tapsets is probing that is not in the above 
> list we should add those functions as well. Once we have the function 
> names, generate a script that puts probes in some percentage of the 
> probes let us say 10% at each time in a sliding window. Loads the 
> generated module and runs a standard test like ltp for 10 mins. The 
> content of the probe handler should be to print the name of the 
> function, increment a counter and also print some golbal variables like 
> PID, GID etc. After being done with the whole list of the functions we 
> should then generate a script that puts the probe in all the functions 
> in the white list and runs few standard tests like ltp, fstest etc for 
> 30 min to make sure probing all of the functions doesn't cause any 
> instability problems.
> 
> Once we agree upon a format we can run these tests as part of the weekly 
> test we are doing so we can catch problems early.  Over a period of few 
> weeks we can come up with a decent list that we feel comfortable. Once 
> we have a big enough of safe list translator by default for wild card 
> expansion consult this black list and white list and expand only to the 
> function names from this list. We should also provide a way for us to 
> indicate the translator i am testing i don't want you to restrict to 
> only white list so do the real expansion of wildcards.
> 
> A side effect of this work could be after few weeks of results we can 
> identify safe to probe routines we could probably even go a head and put 
> some gcc magic macros in the kernel code itself that gives us info in 
> the ELF section to say what functions are deemed safe to put probes. 
> That way over a period of time we may not have to ship separate white 
> list, but that is for future (now i am day dreaming :-) ).
> 
> Anyone got tomatoes?
> 
> bye,
> Vara Prasad
> 

Hi,
   I ran

     stap -e 'probe kernel.function("*") {}' -p2 -v | grep "kernel.function" | wc -l

and it reports that 10827 functions would be probed.

   As suggested, we divide all the functions into groups. The number
of groups can't be too big, since we must run the test long enough
for each group. So there will be quite a few functions (~1000 maybe) in
each group. What if one of the groups crashes the kernel? In most
cases we can't know which functions caused the problem, so we have to
narrow the scope step by step and move the functions in this group
into the whitelist gradually, which will be a lot of work. The worst
case is that every group crashes the kernel.
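
   One way to narrow the scope is to bisect the crashing group. The
Python sketch below only illustrates the idea: find_culprit is a
hypothetical helper, and crashes() stands in for "run the test with
probes on these functions and report whether the machine went down".
It assumes a single culprit that crashes every time, which real
kernels will not always give us.

```python
def find_culprit(group, crashes):
    """Halve a crashing group until one function is left.

    Returns the suspect and the number of test runs that were needed.
    Assumes exactly one bad function that crashes reproducibly.
    """
    candidates = list(group)
    runs = 0
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        runs += 1
        if crashes(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0], runs


# Simulated group of 1000 functions with one made-up bad function.
group = ["f{}".format(i) for i in range(1000)]
bad = "f737"
culprit, runs = find_culprit(group, lambda funcs: bad in funcs)
```

For a group of about 1000 functions this needs at most 10 test runs,
but each run may end in a reboot, and a group with several bad
functions needs the search repeated after each one is removed.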

   Clearly, a group passing the tests doesn't prove that all the
functions in it are safe. Maybe some functions were never triggered
during the tests, or were triggered only a few times and never hit
the fatal condition. If one day we find that probing the whole
whitelist crashes the kernel, we will have to take pains to find out
which function in the whitelist has the problem. And finding a suitable
testcase that triggers all the probes is a hard task.

   So after thinking about this topic, I don't think the whole job
will be an easy task. We may find in the end that we spent too much
time getting the whitelist.

   Just my random thoughts.

- Guanglei


* Re: whitelist for safe-mode probes (or just a better blacklist?)
  2006-09-22  9:43       ` Li Guanglei
@ 2006-09-22 15:53         ` Li Guanglei
  0 siblings, 0 replies; 9+ messages in thread
From: Li Guanglei @ 2006-09-22 15:53 UTC
  To: Li Guanglei; +Cc: Vara Prasad, Martin Hunt, Frank Ch. Eigler, systemtap

Li Guanglei wrote:
> 
> Hi,
>   I used: stap -e 'probe kernel.function("*") {}' -p2 -v  | grep 
> "kernel.function" | wc -l, and it shows me 10827 functions will be probed.
> 
>   As suggested, we divide all the functions into groups. The number of 
> group can't be too big since we must the run the test enough long for 
> each group. So there will be quite some functions(~1000 maybe) in each 
> group. How about if one of the groups crashes the kernel? In most cases 
> we can't know which functions cause the problem so we have to shrink the 
> scope by and by to put the functions inside this group gradually into 
> the whitelist, but this will cause a lot of work. A bad situation is 
> that all the groups will crash the kernel.
> 
>   Apparently those groups that pass the tests can't declare all 
> functions contains inside them are safe. Maybe some functions were never 
> triggered during the tests or only were triggered a few times and didn't 
> came across the dead condition. If one day we find probing the whole 
> whitelist crashes the Kernel, we have to take pains to find out which 
> one in the whitelist has the problem. And found a suitable testcase that 
> will trigger all the probes is a hard task.
> 
>   So after thinking about this topic, the whole work may not be an easy 
> task. Maybe finally we find we spent too much time to get the whitelist.
> 
>   Just my random thoughts.
> 
> - Guanglei
> 

We could slightly modify "all_kernel_functions.exp" to make it
periodically print statistics on how often each probe was triggered
into a local file:

set systemtap_script {
     # per-function hit statistics
     global stat
     probe %s {
         stat[probefunc()] <<< 1
     }
     probe begin {
         log("systemtap starting probe")
     }
     # dump the hit counts every 10 seconds and again at exit
     probe timer.ms(10000), end {
         log("systemtap ending probe")
         foreach (func in stat)
             printf("%%d  %%s\n", @count(stat[func]), func)
     }
}

We could also record which groups have passed the test and which group
is being tested. We use an init script to run the testcase right after
the system boots, so each time the system boots into testing we can
resume the tests. And even for a group that failed the test, we can
refer to the statistics and consider moving the functions that were
triggered many times into the safe list.
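
The bookkeeping could look roughly like the Python sketch below. The
state file layout and all the function names are assumptions made for
illustration, not part of the existing testsuite. promote() parses the
"count  function" lines printed by the script above.

```python
import json
import os


def load_state(path):
    """Read the progress file, or start fresh if there is none yet."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"passed": [], "failed": [], "in_progress": None}


def save_state(path, state):
    """Write the progress file and force it to disk before testing starts."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def on_boot(state, total_groups):
    """Called from the init script: pick the next group to test."""
    if state["in_progress"] is not None:
        # We rebooted in the middle of a group, so assume it crashed.
        state["failed"].append(state["in_progress"])
    done = set(state["passed"]) | set(state["failed"])
    remaining = [g for g in range(total_groups) if g not in done]
    state["in_progress"] = remaining[0] if remaining else None
    return state["in_progress"]


def mark_passed(state):
    """Record that the group under test finished without a crash."""
    state["passed"].append(state["in_progress"])
    state["in_progress"] = None


def promote(stats_text, min_hits):
    """Return functions whose probes fired at least `min_hits` times."""
    safe = []
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and int(parts[0]) >= min_hits:
            safe.append(parts[1])
    return safe
```

One caveat: the statistics of a crashed group are only as fresh as the
last dump that reached the disk, so the function that actually caused
the crash may still show a high count. The threshold for promotion
needs some care.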

Another machine could just ping the test machine and, if there is no
response, send a command to reboot it, so that we can run such testing
while we are sleeping. :)
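
A minimal sketch of such a watchdog, in Python for illustration. The
is_alive and reboot callables are placeholders: is_alive might wrap a
single ping to the test machine, and reboot would be whatever the power
switch or hypervisor offers. Neither is specified here.

```python
import time


def watchdog(is_alive, reboot, interval=30.0, max_misses=3,
             sleep=time.sleep, rounds=None):
    """Poll the test machine and reboot it after several missed replies.

    Runs forever when `rounds` is None; returns the number of reboots
    issued otherwise.
    """
    misses = 0
    reboots = 0
    done = 0
    while rounds is None or done < rounds:
        done += 1
        if is_alive():
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                reboot()
                reboots += 1
                misses = 0
        sleep(interval)
    return reboots


# Dry run with canned replies instead of a real machine.
replies = iter([True, False, False, False, True])
issued = []
count = watchdog(lambda: next(replies), lambda: issued.append("reboot"),
                 sleep=lambda seconds: None, rounds=5)
```

Requiring several missed replies in a row keeps one dropped packet
from rebooting a machine that is merely busy running the load.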

- Guanglei


* RE: whitelist for safe-mode probes (or just a better blacklist?)
@ 2006-09-20 17:20 Stone, Joshua I
  0 siblings, 0 replies; 9+ messages in thread
From: Stone, Joshua I @ 2006-09-20 17:20 UTC
  To: Martin Hunt; +Cc: systemtap, Frank Ch. Eigler

On Wednesday, September 20, 2006 8:43 AM, Martin Hunt wrote:
> Writing a script to do this testing is not difficult, except for the
> problems with lockups which require a way to remotely reboot a system.
> This requires we assume the existence of special hardware or that the
> test system is running on a specific virtualization system.  This
> needs done regardless of what we decide about the need for a
> whitelist.  I hoped to provoke some discussion about this.  We've
> talked about it, but has anyone actually written any test scripts to
> test all the kernel functions this way?

See 'src/testsuite/systemtap.stress/all_kernel_functions.exp'.

This test is not enabled in the normal test-runs, because of the
likelihood of inducing crashes.  There's an 'if 0' near the bottom that
gates the test, so just flip that to try it out.

The 'genload' function in the test could probably be improved to get a
more representative kernel test...

Josh


Thread overview: 9 messages
2006-09-19 16:29 whitelist for safe-mode probes (or just a better blacklist?) Martin Hunt
2006-09-20 15:14 ` Frank Ch. Eigler
2006-09-20 15:42   ` Martin Hunt
2006-09-20 16:23     ` Vara Prasad
2006-09-22  9:43       ` Li Guanglei
2006-09-22 15:53         ` Li Guanglei
2006-09-20 18:02     ` David Smith
2006-09-21 22:13       ` David Wilder
2006-09-20 17:20 Stone, Joshua I
