[Bug runtime/16806] New: kernel crash during repeated module insertion

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* [Bug runtime/16806] New: kernel crash during repeated module insertion
@ 2014-04-03 19:22 jlebon at redhat dot com
  2014-04-03 19:22 ` [Bug runtime/16806] " jlebon at redhat dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: jlebon at redhat dot com @ 2014-04-03 19:22 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

            Bug ID: 16806
           Summary: kernel crash during repeated module insertion
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: jlebon at redhat dot com

This crash sompetimes occurs during the testsuite run of
unprivileged_myproc.exp under f19. I've been able to reproduce it more directly
as follow (files are based on loop.c and libloop.c from
testsuite/systemtap.unprivileged):

$ cat loop2.c
#include <pthread.h>
#include <unistd.h>
#include "sys/sdt.h"

extern int libloopfunc (void);

/* Thread entry point */
void *bar (void *b) {
  int i;
  int *j = (int *)b;
  for (i = 0; i < 10; ++i)
    *j += i;
 a:
  return b;
}

/* We need an inline function. */
inline int ibar (void) {
  return libloopfunc ();
}

/* We need a threaded app. */
inline int tbar (void) {
  void *x;
  int j = 0;
  STAP_PROBE(_test_, main_enter);
  pthread_t thread;
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_create (& thread, & attr, bar, (void*)& j);
  pthread_join (thread, & x);
  return j;
}

main (int argc, char *argv[]) {
  int j = 0;
  for (;;) {
    j += ibar ();
    j += tbar ();
    /* Don't loop if an argument was passed */
    if (argc > 1)
      return 0;
    //usleep (250000); /* 1/4 second pause.  */
  }
  return j;
}
$
$ cat libloop.c
#include "sys/sdt.h"
#include <stdlib.h>

extern void *result;
void *result;

inline int ilibloopfunc (void) {
  int i, j = 0;
  STAP_PROBE(_test_, ilibloopfunc_enter);
  for (i = 0; i < 10; ++i)
    j += i;
  result = malloc(100); /* trigger some plt activity */
  return j;
}

int libloopfunc (void) {
  int i, j = 0;
  if (0) goto a;
 a:
  STAP_PROBE(_test_, libloopfunc_enter);
  for (i = 0; i < 10; ++i)
    j += ilibloopfunc ();
  return j;
}
$
$ gcc libloop.c -g -O -shared -fPIC -o libloop.so
$ gcc loop2.c -g -O -lpthread -lloop -L. -o loop2
$ stap --poison-cache -p4 -m loop2_mod -e 'probe
process("./loop2").function("ibar") { println("hit"); exit() }'
loop2_mod.ko
$ LD_LIBRARY_PATH=$PWD ./loop2 &
[1] 4164
$ while [ true ]; do staprun loop2_mod.ko; done
ERROR: utrace_set_events returned error -114 on pid 10935
hit
hit
ERROR: utrace_set_events returned error -114 on pid 21427
hit
hit
ERROR: utrace_set_events returned error -114 on pid 31276
hit
<kernel crash>

(The 'ERROR: utrace_set_events' messages don't always appear).

This probably has something to do with threads. Even though the function probed
is ibar(), the crash does not occur if we remove bar() & tbar().

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
@ 2014-04-03 19:22 ` jlebon at redhat dot com
  2014-04-03 19:25 ` jlebon at redhat dot com
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jlebon at redhat dot com @ 2014-04-03 19:22 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

--- Comment #1 from Jonathan Lebon <jlebon at redhat dot com> ---
Created attachment 7526
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7526&action=edit
dmesg

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
  2014-04-03 19:22 ` [Bug runtime/16806] " jlebon at redhat dot com
@ 2014-04-03 19:25 ` jlebon at redhat dot com
  2014-04-03 20:38 ` jlebon at redhat dot com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jlebon at redhat dot com @ 2014-04-03 19:25 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

--- Comment #2 from Jonathan Lebon <jlebon at redhat dot com> ---
Created attachment 7527
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7527&action=edit
another dmesg

The backtrace is not always the same, but it seems like the BUG is always
unable to handle kernel paging request.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
  2014-04-03 19:22 ` [Bug runtime/16806] " jlebon at redhat dot com
  2014-04-03 19:25 ` jlebon at redhat dot com
@ 2014-04-03 20:38 ` jlebon at redhat dot com
  2014-04-03 20:38 ` jlebon at redhat dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jlebon at redhat dot com @ 2014-04-03 20:38 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

--- Comment #4 from Jonathan Lebon <jlebon at redhat dot com> ---
Created attachment 7530
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7530&action=edit
stdout with debug flags

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
                   ` (2 preceding siblings ...)
  2014-04-03 20:38 ` jlebon at redhat dot com
@ 2014-04-03 20:38 ` jlebon at redhat dot com
  2014-04-03 20:55 ` jlebon at redhat dot com
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jlebon at redhat dot com @ 2014-04-03 20:38 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

--- Comment #3 from Jonathan Lebon <jlebon at redhat dot com> ---
Created attachment 7529
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7529&action=edit
dmesg with debug flags

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
                   ` (3 preceding siblings ...)
  2014-04-03 20:38 ` jlebon at redhat dot com
@ 2014-04-03 20:55 ` jlebon at redhat dot com
  2014-04-10 16:31 ` dsmith at redhat dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jlebon at redhat dot com @ 2014-04-03 20:55 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

Jonathan Lebon <jlebon at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #7529|0                           |1
        is obsolete|                            |

--- Comment #5 from Jonathan Lebon <jlebon at redhat dot com> ---
Created attachment 7531
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7531&action=edit
dmesg with debug flags

Note that there are multiple runs in there. The run that caused the bug starts
at 936.869106.

To reproduce:

$ stap -DDEBUG_SYMBOLS=9 -DDEBUG_TASK_FINDER=9 -DSTP_TF_DEBUG=9 --poison-cache
-p4 -m loop2_mod_verbose -e 'probe process("./loop2").function("ibar") {
println("hit"); exit() }'
loop2_mod_verbose.ko
$ 
$ staprun loop2_mod_verbose.ko -c 'timeout 0.1 ./loop2'

On average, requires about 7-8 runs before it crashes (seems like it's more
when debug flags are on).

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
                   ` (4 preceding siblings ...)
  2014-04-03 20:55 ` jlebon at redhat dot com
@ 2014-04-10 16:31 ` dsmith at redhat dot com
  2014-04-10 18:14 ` dsmith at redhat dot com
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: dsmith at redhat dot com @ 2014-04-10 16:31 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

David Smith <dsmith at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dsmith at redhat dot com

--- Comment #6 from David Smith <dsmith at redhat dot com> ---
A correction first. The test program compilation should include the rpath flag,
so that the executable can find its library.

# gcc libloop.c -g -O -shared -fPIC -o libloop.so                         
# gcc loop2.c -g -O -lpthread -lloop -Wl,-rpath,`pwd` -L. -o loop2        

Here's a portion of the kernel log of a crash that I've seen:

=====
[  813.192482] utrace_shutdown:373 entry
[  813.196964] utrace_shutdown:392 - freeing task-specific
[  813.200937] utrace_cleanup:324 - removing engine
[  813.204778] utrace_cleanup:358 exit
[  813.208640] utrace_cleanup:324 - removing engine
[  813.212415] utrace_cleanup:324 - removing engine
[  813.216090] utrace_cleanup:358 exit
[  813.471621] BUG: unable to handle kernel paging request at 00007f09303bd000
[  813.472040] IP: [<ffffffff81176072>] anon_vma_chain_link+0x12/0x40
[  813.472040] PGD 38dd3067 PUD 38e42067 PMD 5ad2067 PTE 800000006443b065
[  813.472040] Oops: 0003 [#1] SMP 
[  813.472040] Modules linked in: loop2_mod_verbose(OF) fuse btrfs zlib_deflate
raid6_pq xor vfat msdos fat ext4 mbcache jbd2 binfmt_misc nfsv3 rpcsec_gss_krb5
nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT ipt_REJECT
xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
iptable_security iptable_raw iptable_filter ip_tables snd_hda_codec_generic
ppdev snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm
kvm_amd kvm snd_page_alloc snd_timer snd serio_raw pcspkr i2c_piix4
virtio_console virtio_balloon soundcore parport_pc parport mperf
[  813.472040]  nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod
cdrom ata_generic pata_acpi virtio_blk virtio_net cirrus syscopyarea
sysfillrect sysimgblt drm_kms_helper ttm ata_piix drm virtio_pci virtio_ring
virtio i2c_core libata floppy dm_mirror dm_region_hash dm_log dm_mod
[  813.472040] CPU: 0 PID: 6159 Comm: stapio Tainted: GF         
O--------------   3.10.0-121.el7.x86_64 #1
=====

From that output, here's what I think is happening. The stap module cleans up
and unloads (notice 'loop2_mod_verbose' isn't in the list of linked in
modules). However, something didn't get cleaned up properly and when the kernel
accesses it later, the kernel crashes.

Now I've just got to figure out what got left behind.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
                   ` (5 preceding siblings ...)
  2014-04-10 16:31 ` dsmith at redhat dot com
@ 2014-04-10 18:14 ` dsmith at redhat dot com
  2014-04-14 17:01 ` dsmith at redhat dot com
  2014-04-17 20:13 ` dsmith at redhat dot com
  8 siblings, 0 replies; 10+ messages in thread
From: dsmith at redhat dot com @ 2014-04-10 18:14 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

--- Comment #7 from David Smith <dsmith at redhat dot com> ---
(In reply to David Smith from comment #6)
> A correction first. The test program compilation should include the rpath
> flag, so that the executable can find its library.
> 
> # gcc libloop.c -g -O -shared -fPIC -o libloop.so                         
> # gcc loop2.c -g -O -lpthread -lloop -Wl,-rpath,`pwd` -L. -o loop2        
> 
> Here's a portion of the kernel log of a crash that I've seen:
> 
> =====
> [  813.192482] utrace_shutdown:373 entry
> [  813.196964] utrace_shutdown:392 - freeing task-specific
> [  813.200937] utrace_cleanup:324 - removing engine
> [  813.204778] utrace_cleanup:358 exit
> [  813.208640] utrace_cleanup:324 - removing engine
> [  813.212415] utrace_cleanup:324 - removing engine
> [  813.216090] utrace_cleanup:358 exit
> [  813.471621] BUG: unable to handle kernel paging request at
> 00007f09303bd000
> [  813.472040] IP: [<ffffffff81176072>] anon_vma_chain_link+0x12/0x40
> [  813.472040] PGD 38dd3067 PUD 38e42067 PMD 5ad2067 PTE 800000006443b065
> [  813.472040] Oops: 0003 [#1] SMP 
> [  813.472040] Modules linked in: loop2_mod_verbose(OF) fuse btrfs
> zlib_deflate raid6_pq xor vfat msdos fat ext4 mbcache jbd2 binfmt_misc nfsv3
> rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT
> ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc
> ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter
> ip_tables snd_hda_codec_generic ppdev snd_hda_intel snd_hda_codec snd_hwdep
> snd_seq snd_seq_device snd_pcm kvm_amd kvm snd_page_alloc snd_timer snd
> serio_raw pcspkr i2c_piix4 virtio_console virtio_balloon soundcore
> parport_pc parport mperf
> [  813.472040]  nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod
> cdrom ata_generic pata_acpi virtio_blk virtio_net cirrus syscopyarea
> sysfillrect sysimgblt drm_kms_helper ttm ata_piix drm virtio_pci virtio_ring
> virtio i2c_core libata floppy dm_mirror dm_region_hash dm_log dm_mod
> [  813.472040] CPU: 0 PID: 6159 Comm: stapio Tainted: GF         
> O--------------   3.10.0-121.el7.x86_64 #1
> =====
> 
> From that output, here's what I think is happening. The stap module cleans
> up and unloads (notice 'loop2_mod_verbose' isn't in the list of linked in
> modules). However, something didn't get cleaned up properly and when the
> kernel accesses it later, the kernel crashes.
> 
> Now I've just got to figure out what got left behind.

Sigh. Evidently I'm blind, because the 1st module in the list is
'loop2_mod_verbose'. The 'O' flag means it is an out-of-tree module, and the
'F' flag means it was force loaded (i.e. not via dependency).

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
                   ` (6 preceding siblings ...)
  2014-04-10 18:14 ` dsmith at redhat dot com
@ 2014-04-14 17:01 ` dsmith at redhat dot com
  2014-04-17 20:13 ` dsmith at redhat dot com
  8 siblings, 0 replies; 10+ messages in thread
From: dsmith at redhat dot com @ 2014-04-14 17:01 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

--- Comment #8 from David Smith <dsmith at redhat dot com> ---
Created attachment 7553
  --> https://sourceware.org/bugzilla/attachment.cgi?id=7553&action=edit
potential patch

Here's a patch that fixed the problem for me. Here's my theory about what is
going on. When shutting down, the generated C code calls
stap_stop_task_finder(), which looks like the following (a bit simplified):

====
stap_stop_task_finder(void)                                                     
{                                                                               
        if (atomic_read(&__stp_task_finder_state) == __STP_TF_UNITIALIZED)      
                return;                                                         
        atomic_set(&__stp_task_finder_state, __STP_TF_STOPPING);                

        // The utrace_shutdown() function detaches and cleans up                
        // everything for us - we don't have to go through each                 
        // engine. This also means that the attach_count could end up           
        // > 0 (since we don't got through each engine individually).           
        utrace_shutdown();                                                      

        atomic_set(&__stp_task_finder_state, __STP_TF_STOPPED);                 

        /* Now that all the engines are detached, make sure                     
         * all the callbacks are finished.  If they aren't, we'll               
         * crash the kernel when the module is removed. */                      
        while (atomic_read(&__stp_inuse_count) != 0) {                          
                schedule();                                                     
        }                                                                       

        /* Make sure all outstanding task work requests are canceled. */        
        __stp_tf_cancel_task_work(); 

        utrace_exit();                                                          
}                                                                               
====

The utrace_shutdown() function unregisters all the tracepoint probes, but can
leave some running. The '_stp_inuse_count' loop above waits until they are
finished. When utrace_exit() gets called, it calls stp_task_work_exit(), to
cancel any outstanding task_work items that the utrace probe handlers left
behind.

The patch does a couple of things. First, it adds a call to
stp_task_work_exit() in utrace_shutdown, so it gets called sooner - before the
'_stp_inuse_count' loop. Second, it fixed a possible bug in the way
utrace_shutdown() knows it has already been called.

These changes fix the problem for me.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug runtime/16806] kernel crash during repeated module insertion
  2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
                   ` (7 preceding siblings ...)
  2014-04-14 17:01 ` dsmith at redhat dot com
@ 2014-04-17 20:13 ` dsmith at redhat dot com
  8 siblings, 0 replies; 10+ messages in thread
From: dsmith at redhat dot com @ 2014-04-17 20:13 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=16806

David Smith <dsmith at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from David Smith <dsmith at redhat dot com> ---
Fixed in commit d9edc5f (which is different than the attached patch). With this
patch, a module without debug prints works fine.

With debug prints (-DDEBUG_TASK_FINDER=9 -DSTP_TF_DEBUG=9), we still get an
occasional delay when exiting with this patch. An occasional delay when using
debug prints is certainly better than a crash. The delay doesn't appear to be
related to task_finder/utrace shutdown, for some reason there is a delay in
calling stap_stop_task_finder(). We may have to look into this further in the
future.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-04-17 20:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-03 19:22 [Bug runtime/16806] New: kernel crash during repeated module insertion jlebon at redhat dot com
2014-04-03 19:22 ` [Bug runtime/16806] " jlebon at redhat dot com
2014-04-03 19:25 ` jlebon at redhat dot com
2014-04-03 20:38 ` jlebon at redhat dot com
2014-04-03 20:38 ` jlebon at redhat dot com
2014-04-03 20:55 ` jlebon at redhat dot com
2014-04-10 16:31 ` dsmith at redhat dot com
2014-04-10 18:14 ` dsmith at redhat dot com
2014-04-14 17:01 ` dsmith at redhat dot com
2014-04-17 20:13 ` dsmith at redhat dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).