public inbox for systemtap@sourceware.org
* [Bug runtime/3911] New: Compilation of systemtap causes the system to crash on p570 system.
@ 2007-01-24  9:50 srinivasa at in dot ibm dot com
  2007-01-24 15:13 ` [Bug runtime/3911] " mmlnx at us dot ibm dot com
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: srinivasa at in dot ibm dot com @ 2007-01-24  9:50 UTC (permalink / raw)
  To: systemtap

My environment: systemtap (systemtap-20070120.tar.bz2), elfutils
(elfutils-0.124.tar.gz), kernel (2.6.18-1.2961.el5), on an LPAR'ed p570 system.

I was compiling latest systemtap source on  rhel5(2.6.18-1.2961.el5) kernel and
system dropped to xmon.

The screen looks like this
==================================
FAIL: probefunc:kernel.statement(0xc00000000005be34) compilation
FAIL: probefunc:kernel.function("scheduler_tick") compilation
FAIL: probefunc:kernel.inline("context_switch") compilation
Running /home/systemtap/tmp/stap_testing_200701240903/src/testsuite/systemtap.base/simple.exp ...
Running /home/systemtap/tmp/stap_testing_200701240903/src/testsuite/systemtap.base/timeofday.exp ...
FAIL: timeofday test compilation
Running /home/systemtap/tmp/stap_testing_200701240903/src/testsuite/systemtap.base/timers.exp ...
Running /home/systemtap/tmp/stap_testing_200701240903/src/testsuite/systemtap.base/tri.exp ...
Running /home/systemtap/tmp/stap_testing_200701240903/src/testsuite/systemtap.maps/absentstats.exp ...


/////System crashed/////
=========================================

What xmon shows 
=============================
Unable to handle kernel paging request for data at address 0x420000000007f


5:mon> e
cpu 0x5: Vector: 300 (Data Access) at [c000000028a6b310]
    pc: c000000000349f40: ._spin_lock+0x20/0x88
    lr: c0000000000d6640: .__cache_alloc_node+0x4c/0x174
    sp: c000000028a6b590
   msr: 8000000000001032
   dar: 420000000007f
 dsisr: 40000000
  current = 0xc00000002ad74430
  paca    = 0xc000000000464d00
    pid   = 15804, comm = staprun

5:mon> t
[c000000028a6b610] c0000000000d6640 .__cache_alloc_node+0x4c/0x174
[c000000028a6b6b0] c0000000000d6cc8 .kmem_cache_alloc_node+0x104/0x12c
[c000000028a6b750] d000000000aa6f18 ._stp_map_init+0xa0/0x150 [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6b800] d000000000aa70d4 ._stp_pmap_new+0x10c/0x1f0 [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6b8c0] d000000000aa7328 ._stp_pmap_new_ix+0x170/0x28c [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6b970] d000000000aa757c .systemtap_module_init+0x138/0x254 [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6ba20] d000000000aa76a8 .probe_start+0x10/0x2c [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6baa0] d000000000aa7734 ._stp_handle_start+0x70/0x10c [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6bbb0] d000000000aa79b8 ._stp_proc_write_cmd+0x1e8/0x9b0 [stap_a0def33066a20a930648d1bcbf25f718_896]
[c000000028a6bcf0] c0000000000e026c .vfs_write+0x118/0x200
[c000000028a6bd90] c0000000000e09dc .sys_write+0x4c/0x8c
[c000000028a6be30] c00000000000869c syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000008072e9461c
SP (ffff86eee90) is in userspace


5:mon> di c000000000349f40   (PC value)
c000000000349f40  7d20f828      lwarx   r9,r0,r31
c000000000349f44  2c090000      cmpwi   r9,0
c000000000349f48  40820010      bne     c000000000349f58        # ._spin_lock+0x38/0x88
c000000000349f4c  7c00f92d      stwcx.  r0,r0,r31
c000000000349f50  40a2fff0      bne     c000000000349f40        # ._spin_lock+0x20/0x88



5:mon> r
R00 = 0000000080000005   R16 = 0000000010005b18
R01 = c000000028a6b590   R17 = 0000000010005b10
R02 = c000000000578f80   R18 = 00000ffff86ef390
R03 = 000420000000007f   R19 = 0000000010005b38
R04 = 00000000000012d0   R20 = 0000008072d50698
R05 = 0000000000000010   R21 = 0000000010005c28
R06 = 0000000000000030   R22 = 00000000100170c8
R07 = 0000000000000220   R23 = 0000000000000002
R08 = 0000000000000010   R24 = 0000000000000010
R09 = c00000002dfe2980   R25 = 0000000000000800
R10 = c000000000605448   R26 = 0000000000000010
R11 = 0000000000000000   R27 = c00000002dfe2900
R12 = d000000000aaa3f0   R28 = 8000000000009032
R13 = c000000000464d00   R29 = 00000000000012d0
R14 = 00000000100170c0   R30 = c0000000004a59b8
R15 = 0000000010005638   R31 = 000420000000007f
pc  = c000000000349f40 ._spin_lock+0x20/0x88
lr  = c0000000000d6640 .__cache_alloc_node+0x4c/0x174
msr = 8000000000001032   cr  = 24000444
ctr = c0000000000d6cf0   xer = 0000000020000000   trap =  300
dar = 000420000000007f   dsisr = 40000000



======================================

Thanks 
 Srinivasa Ds

-- 
           Summary: Compilation of systemtap causes the system to crash on
                    p570 system.
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
        AssignedTo: systemtap at sources dot redhat dot com
        ReportedBy: srinivasa at in dot ibm dot com


http://sourceware.org/bugzilla/show_bug.cgi?id=3911

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: mmlnx at us dot ibm dot com @ 2007-01-24 15:13 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From mmlnx at us dot ibm dot com  2007-01-24 15:13 -------
I doubt this was caused by the compilation.  The tests run in parallel so it's
difficult to associate what you see on the screen with the test that caused the
problem.  Is this crash repeatable?  

Also, you're using an old elfutils.  Try using elfutils-0.125 and see if you can
repeat the crash.


* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: fche at redhat dot com @ 2007-01-25 20:27 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From fche at redhat dot com  2007-01-25 20:27 -------
(In reply to comment #0)
> I was compiling latest systemtap source on  rhel5(2.6.18-1.2961.el5) kernel and
> system dropped to xmon.

Not really just compiling: you were running test cases.

> What xmon shows 
> =============================
> Unable to handle kernel paging request for data at address 0x420000000007f
> 5:mon> e
> cpu 0x5: Vector: 300 (Data Access) at [c000000028a6b310]
>     pc: c000000000349f40: ._spin_lock+0x20/0x88

This resembles a memory corruption that may even precede this systemtap module.
It is certainly *before* any probes are even registered, let alone run.  The
runtime is the only part more or less running by this time.
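
One way to see why the faulting address does not look like a valid kernel pointer: every kernel pointer in the xmon dump starts with 0xc (pc, lr, sp) or 0xd (the stap module frames), while the DAR, 0x000420000000007f, starts with neither. The sketch below encodes only that observation as a check on the top four address bits. The enum and the helper `region_of` are illustrative assumptions drawn from the addresses in this report, not a description of the full ppc64 memory map.

```c
#include <assert.h>
#include <stdint.h>

/* Regions as they appear in this report's xmon output (assumption:
 * this is not the complete ppc64 address-space layout). */
enum region { REGION_KERNEL, REGION_MODULE, REGION_OTHER };

/* Classify a 64-bit address by its top four bits. */
static enum region region_of(uint64_t addr)
{
    switch (addr >> 60) {
    case 0xc: return REGION_KERNEL;  /* e.g. pc, lr, sp in the dump   */
    case 0xd: return REGION_MODULE;  /* e.g. the _stp_map_init frame  */
    default:  return REGION_OTHER;   /* not a kernel pointer here     */
    }
}

/* Returns 1 if the address is plausibly a kernel or module pointer. */
static int looks_like_kernel_pointer(uint64_t addr)
{
    return region_of(addr) != REGION_OTHER;
}
```

Fed the values from the dump, `looks_like_kernel_pointer()` accepts the pc and the module return addresses and rejects the DAR, which is consistent with a pointer that was read from the wrong place.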



* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: srinivasa at in dot ibm dot com @ 2007-01-29  8:31 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From srinivasa at in dot ibm dot com  2007-01-29 08:30 -------
I am able to reproduce this bug on the latest upstream kernel, 2.6.20-rc6, as well.




* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: srinivasa at in dot ibm dot com @ 2007-01-30  7:17 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From srinivasa at in dot ibm dot com  2007-01-30 07:17 -------
Here is my analysis of this bug
======================================
 1) Looking at the backtrace, _stp_map_init() calls kmalloc_node() with the CPU
number as its node argument. kmalloc_node() is the same as kmalloc() if NUMA is
not configured, and it calls kmem_cache_alloc_node() if NUMA is configured.

_stp_map_init() is called by _stp_pmap_new() inside a for_each_cpu() loop.

 
=============================================
static int _stp_map_init(MAP m, unsigned max_entries, int type, int key_size,
                         int data_size, int cpu)
{
        int size;

.....................................
.....................................
        for (i = 0; i < max_entries; i++) {
                if (cpu < 0)
                        tmp = kmalloc(size, STP_ALLOC_FLAGS);
                else
                        /* cpu is passed where a NUMA node id is expected */
                        tmp = kmalloc_node(size, STP_ALLOC_FLAGS, cpu);

                if (!tmp)
                        return -1;

                dbug ("allocated %lx\n", (long)tmp);
=========================================================================
static PMAP _stp_pmap_new(unsigned max_entries, int type, int key_size,
                          int data_size)
{
        int i;
        MAP map, m;

        PMAP pmap = (PMAP) kmalloc(sizeof(struct pmap), STP_ALLOC_FLAGS);
        if (pmap == NULL)
                return NULL;
.........................
.........................
        for_each_cpu(i) {
                m = per_cpu_ptr (map, i);
                if (_stp_map_init(m, max_entries, type, key_size, data_size, i)) {
                        goto err1;
                }
        }
============================================

2) Since NUMA is configured on my system, kmalloc_node() calls
kmem_cache_alloc_node() with "cpu" as the node id.
====================================================
#ifdef CONFIG_NUMA
extern void *__kmalloc_node(size_t size, gfp_t flags, int node);

static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
{
        if (__builtin_constant_p(size)) {
                int i = 0;
#define CACHE(x) \
.......................
..............................
return kmem_cache_alloc_node((flags & GFP_DMA) ?
                        malloc_sizes[i].cs_dmacachep :
                        malloc_sizes[i].cs_cachep, flags, node);
        }
========================================================
3) This means the systemtap code expects the number of NUMA nodes to be the same
as the number of CPUs.
  kmem_cache_alloc_node() in turn calls ____cache_alloc_node(), where
cachep->nodelists[nodeid] gives a wrong address because on my system the number
of nodes is less than the number of CPUs.
================================
Mount-cache hash table entries: 4096
Processor 1 found.
Processor 2 found.
Processor 3 found.
Processor 4 found.
Processor 5 found.
Processor 6 found.
Processor 7 found.
Brought up 8 CPUs
Node 0 CPUs: 0-7
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
sizeof(vma)=176 bytes
sizeof(page)=56 bytes
sizeof(inode)=560 bytes
=================================

For example, I have 8 CPUs and 4 NUMA nodes in my system. nodelists[8] gives a
wrong address, and that is causing the oops.
============================================================
static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
                                int nodeid)
{
        struct list_head *entry;
        struct slab *slabp;
        struct kmem_list3 *l3;
        void *obj;
        int x;

        l3 = cachep->nodelists[nodeid];  <<<======== bad l3 is read here (xmon: pc then faults in _spin_lock)
        BUG_ON(!l3);
============================================
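
The indexing problem in step 3 can be reproduced in miniature in user space. The sketch below is a simplified model, not kernel code: `NR_CPUS`, `NR_NODES`, `cpu_node`, `lookup_nodelist` and `count_bad_lookups` are hypothetical names, and the table is sized to the four nodes shown in the boot log, which is an assumption about how to model nodelists rather than how the kernel sizes it. It uses each CPU number as a node id, the way the runtime does, and counts how many lookups fall outside the table or land on a node the CPU does not belong to.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS  8   /* "Brought up 8 CPUs" in the boot log */
#define NR_NODES 4   /* nodes 0-3 in the boot log */

struct nodelist { int node; };

/* One list per NUMA node: a simplified stand-in for
 * cachep->nodelists[]. */
static struct nodelist lists[NR_NODES] = { {0}, {1}, {2}, {3} };

/* "Node 0 CPUs: 0-7": every CPU lives on node 0. */
static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 0, 0, 0, 0 };

/* Bounds-checked lookup. The kernel excerpt above shows no such
 * range check, only a BUG_ON for a NULL entry. */
static struct nodelist *lookup_nodelist(int nodeid)
{
    if (nodeid < 0 || nodeid >= NR_NODES)
        return NULL;
    return &lists[nodeid];
}

/* Use each CPU number as a node id and tally the bad lookups. */
static void count_bad_lookups(int *out_of_range, int *wrong_node)
{
    *out_of_range = 0;
    *wrong_node = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        struct nodelist *l = lookup_nodelist(cpu);
        if (l == NULL)
            (*out_of_range)++;
        else if (l->node != cpu_node[cpu])
            (*wrong_node)++;
    }
}
```

In this model only CPU 0 gets the right list. CPUs 1-3 get the list of a node that has no CPUs, and CPUs 4-7 fall off the end of the table.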

Hence, the assumption in systemtap modules that the number of nodes equals the
number of CPUs is causing this bug.
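
Steps 1 and 2 would also explain why the same systemtap code can run without trouble on kernels built without NUMA: there the node argument is simply dropped. A minimal user-space model of that dispatch follows. `model_kmalloc_node`, `node_valid`, `failed_allocs` and the `numa` flag are hypothetical stand-ins for the CONFIG_NUMA compile-time switch, not kernel interfaces, and a NULL return stands in for the place where the real kernel would index a bad node.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_NODES 4  /* assumption: four nodes, as in the boot log */

/* Returns 1 if nodeid names a node in this model. */
static int node_valid(int nodeid)
{
    return nodeid >= 0 && nodeid < NR_NODES;
}

/* Model of kmalloc_node(): with numa == 0 the node id is ignored
 * (plain allocation); with numa != 0 it must name a real node. */
static void *model_kmalloc_node(size_t size, int nodeid, int numa)
{
    if (numa && !node_valid(nodeid))
        return NULL;
    return malloc(size);
}

/* Count how many of ncpus per-CPU allocations fail when the CPU
 * number is passed as the node id. */
static int failed_allocs(int ncpus, int numa)
{
    int failed = 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        void *p = model_kmalloc_node(64, cpu, numa);
        if (p == NULL)
            failed++;
        free(p);  /* free(NULL) is a no-op */
    }
    return failed;
}
```

With eight CPUs, the non-NUMA path reports no failures, while the NUMA path rejects the four CPU numbers that are not valid node ids.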


Martin 
 Any ideas??

Thanks
 Srinivasa DS

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hunt at redhat dot com



* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: fche at redhat dot com @ 2007-01-30 12:16 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From fche at redhat dot com  2007-01-30 12:15 -------
Very nice analysis.


* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: hunt at redhat dot com @ 2007-01-30 14:37 UTC (permalink / raw)
  To: systemtap


------- Additional Comments From hunt at redhat dot com  2007-01-30 14:36 -------
Looks like there is a missing cpu to node mapping there.  It's probably time to
just give up on NUMA for older kernels and just use alloc_percpu or percpu_alloc
as it is in the kernel. 
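
The missing CPU-to-node mapping could look roughly like the following user-space sketch. It is only an illustration: the table `cpu_node` is filled in from the boot log in this bug (all eight CPUs on node 0), and `node_for_cpu` and `alloc_for_cpu` are hypothetical helpers. In the kernel the translation would come from the kernel's own cpu_to_node() rather than a hand-written table, and malloc() here merely stands in for a node-local allocation.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS  8   /* assumption: the 8 CPUs from the boot log */
#define NR_NODES 4   /* assumption: nodes 0-3 from the boot log  */

/* "Node 0 CPUs: 0-7": every CPU maps to node 0. */
static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 0, 0, 0, 0 };

/* Translate a CPU number to its node, or -1 if the CPU is unknown. */
static int node_for_cpu(int cpu)
{
    if (cpu < 0 || cpu >= NR_CPUS)
        return -1;
    return cpu_node[cpu];
}

/* Allocate memory for a CPU using its node id, not its CPU number.
 * On success *node_out receives the node id that was used. */
static void *alloc_for_cpu(size_t size, int cpu, int *node_out)
{
    int node = node_for_cpu(cpu);
    if (node < 0 || node >= NR_NODES)
        return NULL;
    *node_out = node;
    return malloc(size);  /* stand-in for a node-local allocation */
}
```

Every CPU on the reporter's machine would then allocate from node 0, and an out-of-range CPU number is refused instead of being used as an index.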

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED



* [Bug runtime/3911] Compilation of systemtap causes the system to crash on p570 system.
From: hunt at redhat dot com @ 2007-01-30 14:38 UTC (permalink / raw)
  To: systemtap



-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|systemtap at sources dot    |hunt at redhat dot com
                   |redhat dot com              |


