* automated way to find functions that we might want to blacklist
@ 2011-12-22 23:33 Timo Juhani Lindfors
  2011-12-24 14:27 ` Sami Liedes
  0 siblings, 1 reply; 8+ messages in thread
From: Timo Juhani Lindfors @ 2011-12-22 23:33 UTC (permalink / raw)
  To: systemtap

Hi,

http://lindi.iki.fi/lindi/systemtap/torture/systemtap-torture.py

is a quick'n'dirty tool that I wrote to figure out why

stap -e 'probe kernel.function("*") {}'

crashes the system. The tool starts from the complete set of functions
and divides it into smaller and smaller partitions, depending on whether
probing a given partition crashes the system or not.
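
The partitioning idea described above can be sketched roughly as follows. This is only an illustration, not the code of systemtap-torture.py: the `crashes` callback is a hypothetical stand-in for "probe this set, then see whether the machine survived", and the sketch assumes crashes are deterministic and caused by single functions.

```python
from typing import Callable, List, Sequence


def find_crashing_functions(
    funcs: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Bisect `funcs` down to the functions that crash when probed alone.

    `crashes` is a hypothetical predicate: it returns True if probing
    the given set of functions brings the system down.
    """
    if not funcs or not crashes(funcs):
        return []  # nothing in this partition causes a crash
    if len(funcs) == 1:
        return [funcs[0]]  # narrowed down to a single culprit
    mid = len(funcs) // 2
    # Recurse into both halves; each half is re-tested on its own.
    return (find_crashing_functions(funcs[:mid], crashes)
            + find_crashing_functions(funcs[mid:], crashes))
```

Each call to `crashes` costs a full probe run (and possibly a reboot) in practice, so the number of calls matters far more than anything else in this loop.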

I ran the script with no arguments on an amd64 Xen domU running Debian
wheezy.  The resulting logfile

http://lindi.iki.fi/lindi/systemtap/torture/linux-image-3.0.0-1-amd64_3.0.0-3/torture.log

shows which functions we should consider for the blacklist:

$ grep "([1234] funs) CRASHED" torture.log 
Thu Dec 22 11:48:24 2011 HYPERVISOR_physdev_op (39) .. HYPERVISOR_set_debugreg (41) (3 funs) CRASHED
Thu Dec 22 11:51:09 2011 HYPERVISOR_sched_op (40) .. HYPERVISOR_set_debugreg (41) (2 funs) CRASHED
Thu Dec 22 12:08:41 2011 hash_64 (10907) .. hash_futex (10909) (3 funs) CRASHED
Thu Dec 22 12:09:49 2011 hash_64 (10907) .. hash_64 (10907) (1 funs) CRASHED
Thu Dec 22 12:12:18 2011 hash_ptr (10910) .. hash_walk_next (10912) (3 funs) CRASHED
Thu Dec 22 12:13:28 2011 hash_ptr (10910) .. hash_ptr (10910) (1 funs) CRASHED
Thu Dec 22 13:14:18 2011 native_set_pmd_at (15204) .. native_setup_msi_irqs (15207) (4 funs) CRASHED
Thu Dec 22 13:38:13 2011 native_set_pmd_at (15204) .. native_set_pte (15205) (2 funs) CRASHED
Thu Dec 22 13:40:22 2011 native_set_pte (15205) .. native_set_pte (15205) (1 funs) CRASHED

Machine-readable trace that shows full function names (and not just
"..") is also available:

http://lindi.iki.fi/lindi/systemtap/torture/linux-image-3.0.0-1-amd64_3.0.0-3/state.json.bz2

This version does not cope with non-determinism very well. If a set of
function probes crashes the system only sometimes, you may need to run
the torture script multiple times to catch it. If you want to try the
script yourself, here's how I ran it:

1) Install the watchdog package on the domU
2) Use

on_crash = 'restart'

in the Xen domain configuration

3) Run

while true; do
    if ! hping3 --numeric --count 35 --icmp lindi2.lan; then
        xm destroy lindi2
        sleep 3
        xm create /local/xen/lindi2/config
        sleep 120
    fi
done

on dom0 as root.

4) Add

@reboot sleep 55 && /home/lindi/proj/systemtap-torture/systemtap-torture.py

to the crontab of user lindi so that the test continues after each crash.

5) Start "socat UDP-RECV:2346 -" on a computer where you want to send
the logs (specified using --report-host).

6) Run

/home/lindi/proj/systemtap-torture/systemtap-torture.py

manually and wait for the system to enter an intense stress test :-)

-Timo


* Re: automated way to find functions that we might want to blacklist
  2011-12-22 23:33 automated way to find functions that we might want to blacklist Timo Juhani Lindfors
@ 2011-12-24 14:27 ` Sami Liedes
  2012-01-15 21:46   ` Sami Liedes
  0 siblings, 1 reply; 8+ messages in thread
From: Sami Liedes @ 2011-12-24 14:27 UTC (permalink / raw)
  To: systemtap

Here's a list of functions that cause Debian's 3.1.0 kernel to crash
under qemu/kvm for me, found using the script and a fair amount of
manual testing (because some of them only crash rarely). This list is
probably not exhaustive; Timo's tool could use some tweaking to enable
the stress test to be run for longer periods of time and to zero in on
all the functions that only crash sometimes.

* arch_local_irq_enable
* arch_local_irq_restore
* clts
* hash_64
* hash_ptr
* inat_get_opcode_attribute
* native_safe_halt
* native_set_debugreg
* native_set_fixmap
* outw

	Sami


* Re: automated way to find functions that we might want to blacklist
  2011-12-24 14:27 ` Sami Liedes
@ 2012-01-15 21:46   ` Sami Liedes
  2012-01-15 22:13     ` Sami Liedes
  2012-01-17 20:18     ` David Smith
  0 siblings, 2 replies; 8+ messages in thread
From: Sami Liedes @ 2012-01-15 21:46 UTC (permalink / raw)
  To: systemtap

On Fri, Dec 23, 2011 at 11:10:31PM +0200, Sami Liedes wrote:
> Here's a list of functions that cause Debian's 3.1.0 kernel to crash
> under qemu/kvm for me, found using the script and a fair amount of
> manual testing (because some of them only crash rarely). This list is
> probably not exhaustive; Timo's tool could use some tweaking to enable
> the stress test to be run for longer periods of time and to zero in on
> all the functions that only crash sometimes.

After more exhaustive testing and modifying the script to handle
occasional crashes much better, here's an updated list. Should I just
send a patch adding these to the blacklist against the git HEAD of
systemtap?

Note that I've done the tests using systemtap 1.6. Hope that's OK.

Probing any of these functions (eventually) crashes Debian testing's
kernel 3.1.x:

* __find_general_cachep
* arch_local_irq_enable
* arch_local_irq_restore
* clts
* cpu_relax
* hash_64
* hash_ptr
* inat_get_opcode_attribute
* kmem_find_general_cachep
* native_safe_halt
* native_set_debugreg
* native_set_fixmap
* native_write_cr0
* outw
* readl
* rep_nop

I also tested on (mainline) kernel 3.2.0, with the above-mentioned
functions already blacklisted. Placing a probe simultaneously on *all
three* of these functions causes a crash; probing any two but not the
third doesn't seem to crash. Should we just blacklist them all to be
sure?

* test_tsk_thread_flag   AND
* test_tsk_need_resched  AND
* test_ti_thread_flag
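
A case like this, where the crash needs several probes at once, is not found by plain halving, because each half on its own looks safe. A minimal sketch of one alternative follows: drop one function at a time and keep the removal whenever the remaining set still crashes. The `crashes` callback is a hypothetical stand-in for a real probe run, and the sketch assumes deterministic crashes.

```python
from typing import Callable, List, Sequence


def minimize_crashing_set(
    funcs: Sequence[str],
    crashes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Shrink a crashing set until removing any one function stops the crash.

    Assumes `crashes(funcs)` is True for the initial set.
    """
    needed = list(funcs)
    i = 0
    while i < len(needed):
        candidate = needed[:i] + needed[i + 1:]
        if candidate and crashes(candidate):
            needed = candidate  # function i was not required for the crash
        else:
            i += 1  # function i is required; keep it and move on
    return needed
```

This needs one probe run per removal attempt, so it is best applied to a set that bisection has already narrowed down.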

	Sami


* Re: automated way to find functions that we might want to blacklist
  2012-01-15 21:46   ` Sami Liedes
@ 2012-01-15 22:13     ` Sami Liedes
  2012-01-17 20:18     ` David Smith
  1 sibling, 0 replies; 8+ messages in thread
From: Sami Liedes @ 2012-01-15 22:13 UTC (permalink / raw)
  To: systemtap

On Sun, Jan 15, 2012 at 11:44:02PM +0200, Sami Liedes wrote:
> After more exhaustive testing and modifying the script to handle
> occasional crashes much better, here's an updated list. Should I just
> send a patch adding these to the blacklist against the git HEAD of
> systemtap?

Here's a tentative patch against git HEAD. I also took the liberty of
moving a couple of entries to make the previously almost-sorted list
sorted again.

	Sami

------------------------------------------------------------
commit dc5e6ca380d76de702113caa9cc8dbbb1fc5e785
Author: Sami Liedes <sami.liedes@iki.fi>
Date:   Mon Jan 16 00:11:00 2012 +0200

    Blacklist some functions that crash 3.1.x or 3.2.0.

diff --git a/dwflpp.cxx b/dwflpp.cxx
index 4d78187..fe89cf7 100644
--- a/dwflpp.cxx
+++ b/dwflpp.cxx
@@ -3159,7 +3159,11 @@ dwflpp::build_blacklist()
   // also allows detection of problems at translate- rather than
   // run-time.
 
-  blfn += "atomic_notifier_call_chain"; // first blfn; no "|"
+  blfn += "arch_local_irq_enable"; // first blfn; no "|"
+  blfn += "|arch_local_irq_restore";
+  blfn += "|atomic_notifier_call_chain";
+  blfn += "|clts";
+  blfn += "|cpu_relax";
   blfn += "|default_do_nmi";
   blfn += "|__die";
   blfn += "|die_nmi";
@@ -3171,19 +3175,35 @@ dwflpp::build_blacklist()
   blfn += "|do_sparc64_fault";
   blfn += "|do_trap";
   blfn += "|dummy_nmi_callback";
+  blfn += "|__find_general_cachep";
   blfn += "|flush_icache_range";
+  blfn += "|hash_64";
+  blfn += "|hash_ptr";
   blfn += "|ia64_bad_break";
   blfn += "|ia64_do_page_fault";
   blfn += "|ia64_fault";
+  blfn += "|inat_get_opcode_attribute";
   blfn += "|io_check_error";
+  blfn += "|kmem_find_general_cachep";
   blfn += "|mem_parity_error";
+  blfn += "|native_safe_halt";
+  blfn += "|native_set_debugreg";
+  blfn += "|native_set_fixmap";
+  blfn += "|native_write_cr0";
   blfn += "|nmi_watchdog_tick";
   blfn += "|notifier_call_chain";
   blfn += "|oops_begin";
   blfn += "|oops_end";
+  blfn += "|outw";
   blfn += "|program_check_exception";
+  blfn += "|readl";
+  blfn += "|rep_nop";
   blfn += "|single_step_exception";
   blfn += "|sync_regs";
+  blfn += "|system_call_after_swapgs";
+  blfn += "|test_tsk_thread_flag";
+  blfn += "|test_tsk_need_resched";
+  blfn += "|test_ti_thread_flag.*";
   blfn += "|unhandled_fault";
   blfn += "|unknown_nmi_error";
   blfn += "|xen_[gs]et_debugreg";
@@ -3193,9 +3213,6 @@ dwflpp::build_blacklist()
   blfn += "|xen_adjust_exception_frame";
   blfn += "|xen_iret.*";
   blfn += "|xen_sysret64.*";
-  blfn += "|test_ti_thread_flag.*";
-  blfn += "|inat_get_opcode_attribute";
-  blfn += "|system_call_after_swapgs";
 
   // Lots of locks
   blfn += "|.*raw_.*_lock.*";
------------------------------------------------------------
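
To see how entries like these behave as patterns, here is a small sketch that joins a few of them with "|" and tests function names against the result. It assumes the joined pattern has to match the whole function name; the patch excerpt does not show how blfn is compiled, so treat that as an assumption. Python's `re` module is used only for illustration and may differ in detail from the regex engine systemtap uses.

```python
import re

# A handful of entries copied from the patch above (not the full list).
BLACKLIST_ENTRIES = [
    "arch_local_irq_enable",
    "hash_64",
    "test_ti_thread_flag.*",
    "xen_[gs]et_debugreg",
    ".*raw_.*_lock.*",
]

# Join the entries into one alternation, as the patch does with "|".
BLACKLIST_RE = re.compile("|".join(BLACKLIST_ENTRIES))


def is_blacklisted(name: str) -> bool:
    """Return True if the whole function name matches some entry."""
    return BLACKLIST_RE.fullmatch(name) is not None
```

Under that assumption, a plain entry such as "hash_64" covers only that exact name, while entries ending in ".*" also cover longer names with the same prefix.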


* Re: automated way to find functions that we might want to blacklist
  2012-01-15 21:46   ` Sami Liedes
  2012-01-15 22:13     ` Sami Liedes
@ 2012-01-17 20:18     ` David Smith
  2012-01-17 21:55       ` Sami Liedes
  2012-01-19 13:10       ` Sami Liedes
  1 sibling, 2 replies; 8+ messages in thread
From: David Smith @ 2012-01-17 20:18 UTC (permalink / raw)
  To: systemtap

On 01/15/2012 03:44 PM, Sami Liedes wrote:

> Probing any of these functions (eventually) crashes Debian testing's
> kernel 3.1.x:


A couple of these don't make any sense to me:

> * hash_64
> * hash_ptr


There is nothing that those 2 functions do that can crash the kernel.
Those functions (should be) always inlined.  I'd guess the problem isn't
those two functions, but the function that is calling them.

To test for true function calls, you *shouldn't* do:

# stap -e 'probe kernel.function("*") {}'

The above probes inlined functions and real function calls.  Instead you
should do the equivalent of:

# stap -e 'probe kernel.function("*").call {}'
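
For a test harness that probes subsets of functions, the script text could be generated along these lines. This is a sketch only: `build_probe_script` is a hypothetical helper, not part of systemtap or of the torture script.

```python
from typing import Iterable


def build_probe_script(funcs: Iterable[str], calls_only: bool = True) -> str:
    """Build a one-line stap script that probes the given kernel functions.

    With calls_only=True each probe point gets the ".call" suffix, so
    inlined instances are skipped.
    """
    suffix = ".call" if calls_only else ""
    points = [f'kernel.function("{name}"){suffix}' for name in funcs]
    if not points:
        raise ValueError("need at least one function to probe")
    # Several probe points can share one empty handler, separated by commas.
    return "probe " + ", ".join(points) + " {}"
```

The returned string is meant to be passed to `stap -e`.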

Another problem I see with your testing methodology is that you are
using xen.  I don't think we've used xen in a while, but the xen kernel
always gave us different results than a regular kernel.  I'd test on
bare metal or in a kvm instance.

I'll make one final comment here.  In my mind the blacklist is a
semi-temporary thing (although we don't typically remove functions from
it).  The real fix here is to get the crashing functions marked with
'__kprobes' in the upstream kernel.  This fixes the problem for all
kprobes users, not just systemtap.

-- 

David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)


* Re: automated way to find functions that we might want to blacklist
  2012-01-17 20:18     ` David Smith
@ 2012-01-17 21:55       ` Sami Liedes
  2012-01-19 13:10       ` Sami Liedes
  1 sibling, 0 replies; 8+ messages in thread
From: Sami Liedes @ 2012-01-17 21:55 UTC (permalink / raw)
  To: systemtap

On Tue, Jan 17, 2012 at 02:18:26PM -0600, David Smith wrote:
> There is nothing that those 2 functions do that can crash the kernel.
> Those functions (should be) always inlined.  I'd guess the problem isn't
> those two functions, but the function that is calling them.

Yes, I had figured that the crashes come from places where
these are inlined. But I hadn't thought about using .call; will do
that.

> Another problem I see with your testing methodology is that you are
> using xen.  I don't think we've used xen in a while, but the xen kernel
> always gave us different results than a regular kernel.  I'd test on
> bare metal or in a kvm instance.

Timo uses Xen, but I'm running the tests in KVM.

> I'll make one final comment here.  In my mind the blacklist is a
> semi-temporary thing (although we don't typically remove functions from
> it).  The real fix here is to get the crashing functions marked with
> '__kprobes' in the upstream kernel.  This fixes the problem for all
> kprobes users, not just systemtap.

Makes sense to me.

	Sami


* Re: automated way to find functions that we might want to blacklist
  2012-01-17 20:18     ` David Smith
  2012-01-17 21:55       ` Sami Liedes
@ 2012-01-19 13:10       ` Sami Liedes
  2012-01-19 16:42         ` Josh Stone
  1 sibling, 1 reply; 8+ messages in thread
From: Sami Liedes @ 2012-01-19 13:10 UTC (permalink / raw)
  To: systemtap

On Tue, Jan 17, 2012 at 02:18:26PM -0600, David Smith wrote:
> A couple of these don't make any sense to me:
> 
> > * hash_64
> > * hash_ptr
> 
> There is nothing that those 2 functions do that can crash the kernel.
> Those functions (should be) always inlined.  I'd guess the problem isn't
> those two functions, but the function that is calling them.

Is there a way to blacklist a single place where one of these is
inlined? I don't think blacklisting the offending function where these
are inlined would prevent the crash with 'probe
kernel.function("hash_64") {}'. Or would it?

After some testing with 3.2.0 and .call, only these two functions have
caused crashes so far (in KVM):

* inat_get_opcode_attribute
* native_set_debugreg

The first of these is already included in the blacklist of systemtap
git HEAD, but the second one is new.

	Sami


* Re: automated way to find functions that we might want to blacklist
  2012-01-19 13:10       ` Sami Liedes
@ 2012-01-19 16:42         ` Josh Stone
  0 siblings, 0 replies; 8+ messages in thread
From: Josh Stone @ 2012-01-19 16:42 UTC (permalink / raw)
  To: systemtap

On 01/19/2012 05:10 AM, Sami Liedes wrote:
> Is there a way to blacklist a single place where one of these is
> inlined? I don't think blacklisting the offending function where these
> are inlined would prevent the crash with 'probe
> kernel.function("hash_64") {}'. Or would it?

Currently, we don't.  We do check whether the inlined address is
anywhere within a __kprobes-marked function, but the name comparison is
only done on the inline function itself.  We could enhance this though,
because that nesting hierarchy is present in DWARF.
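
The enhancement mentioned above could look roughly like the sketch below: compare the blacklist against every function in the inline nesting chain, not only against the innermost inline function. Everything here is hypothetical. `inline_chain` stands for the names a DWARF walk would produce, innermost first, and no real systemtap or elfutils API is used.

```python
import re
from typing import Sequence


def probe_is_blacklisted(
    inline_chain: Sequence[str],
    blacklist: "re.Pattern[str]",
) -> bool:
    """Return True if any function in the inline chain is blacklisted.

    `inline_chain` lists the inlined function first, followed by the
    functions it is nested in, out to the enclosing real function.
    """
    return any(blacklist.fullmatch(name) is not None for name in inline_chain)
```

With a check like this, an inlined instance of a harmless helper would be skipped only at the sites where it sits inside a blacklisted caller.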
