[Bug runtime/29577] New: NULL pointer dereference in kernel tracepoint int3 execution

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* [Bug runtime/29577] New: NULL pointer dereference in kernel tracepoint int3 execution
@ 2022-09-16 19:52 agentzh at gmail dot com
  2022-09-16 20:49 ` [Bug runtime/29577] " agentzh at gmail dot com
  2022-09-18  1:35 ` agentzh at gmail dot com
  0 siblings, 2 replies; 3+ messages in thread
From: agentzh at gmail dot com @ 2022-09-16 19:52 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=29577

            Bug ID: 29577
           Summary: NULL pointer dereference in kernel tracepoint int3
                    execution
           Product: systemtap
           Version: unspecified
            Status: UNCONFIRMED
          Severity: critical
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: agentzh at gmail dot com
  Target Milestone: ---

Hi folks!

We observed several NULL pointer dereference errors in CentOS 8 kernel
`4.18.0-193.6.3.el8_2.x86_64`. They all look like this:

```
  11948 [39391079.662703] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
  11949 [39391094.375018] NMI watchdog: Watchdog detected hard LOCKUP on cpu
2Modules linked in: orxray_lj_c_off_cpu_2789022(OE) joydev binfmt_misc tcp_diag
inet_diag mpt3sas raid_class scsi_transport_sas 8021q garp mrp stp llc bonding
dell_rbu intel_rapl_msr intel_rapl_c  11949 ommon skx_edac nfit libnvdimm
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate
dell_smbios iTCO_wdt iTCO_vendor_support dell_wmi_descriptor wmi_bmof dcdbas
intel_uncore   11949 pcspkr intel_rapl_perf mei_me ipmi_si mei ipmi_devintf
i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter sch_fq tcp_bbr dummy
ip_tables xfs libcrc32c sd_mod sg mgag200 drm_vram_helper ttm drm_kms_helper
syscopyarea sysfillrect sysimgblt ahci fb_sys_fops libahc  11949 i drm
crc32c_intel libata megaraid_sas igb dca i2c_algo_bit i40e [last unloaded:
orxray_count_off_cpu_frames_2788869]
  11950 [39391094.375028] CPU: 2 PID: 2789064 Comm: ps Kdump: loaded Tainted: G
          OE    --------- -  - 4.18.0-193.6.3.el8_2.x86_64 #1
  11951 [39391094.375029] Hardware name: Dell Inc. PowerEdge R640/06DKY5, BIOS
2.6.4 04/09/2020
  11952 [39391094.375029] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
  11953 [39391094.375030] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89
c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90
<8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
  11954 [39391094.375030] RSP: 0018:ffffb2fdc90f7c68 EFLAGS: 00000002
  11955 [39391094.375031] RAX: 0000000000000101 RBX: ffff88c60fce9540 RCX:
0000000000000003
  11956 [39391094.375031] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff88c60fce9540
  11957 [39391094.375031] RBP: 000000000000000b R08: 0000000000000008 R09:
0000000000333a33
  11958 [39391094.375032] R10: 8080808080808080 R11: 0000000000000010 R12:
ffffb2fdc90f7ca8
  11959 [39391094.375032] R13: 0000000000000040 R14: ffff88c3506a0180 R15:
0000000000000000
  11960 [39391094.375032] FS:  00007f2f11c9d940(0000) GS:ffff88c60fc80000(0000)
knlGS:0000000000000000
  11961 [39391094.375033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  11962 [39391094.375033] CR2: 000055fd3a4ce038 CR3: 0000000d7d370002 CR4:
00000000007606e0
  11963 [39391094.375033] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
  11964 [39391094.375034] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
  11965 [39391094.375034] PKRU: 55555554
  11966 [39391094.375034] Call Trace:
  11967 [39391094.375034]  _raw_spin_lock_irq+0x23/0x26
  11968 [39391094.375034]  wq_worker_comm+0x77/0xd0
  11969 [39391094.375034]  proc_task_name+0x44/0x120
  11970 [39391094.375035]  ? num_to_str+0xac/0xc0
  11971 [39391094.375035]  do_task_stat+0x434/0xde0
  11972 [39391094.375035]  proc_single_show+0x4e/0x90
  11973 [39391094.375035]  seq_read+0x153/0x410
  11974 [39391094.375035]  vfs_read+0x91/0x140
  11975 [39391094.375035]  ksys_read+0x4f/0xb0
  11976 [39391094.375035]  do_syscall_64+0x5b/0x1a0
  11977 [39391094.375036]  entry_SYSCALL_64_after_hwframe+0x65/0xca
  11978 [39391094.375036] RIP: 0033:0x7f2f11049a85
  11979 [39391094.375036] Code: fe ff ff 50 48 8d 3d 32 d8 09 00 e8 b5 0e 02 00
0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 d5 4b 2d 00 8b 00 85 c0 75 0f 31 c0 0f 05
<48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
  11980 [39391094.375036] RSP: 002b:00007ffff7ee8b78 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
  11981 [39391094.375037] RAX: ffffffffffffffda RBX: 00007f2f11a7a678 RCX:
00007f2f11049a85
  11982 [39391094.375037] RDX: 0000000000000800 RSI: 000055fd3a4cb5e0 RDI:
0000000000000006
  11983 [39391094.375037] RBP: 00007f2f11a7a670 R08: 00007f2f11875ea5 R09:
00007f2f110dac80
  11984 [39391094.375037] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000000006
  11985 [39391094.375038] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
  11986 [39391094.375038] Kernel panic - not syncing: Hard LOCKUP
```

BTW, the orxray_XXX modules are stap ko modules.

Basically it is always a BUG for null pointer dereference followed by another
BUG for soft-lockup detected by the NMI watchdog.

Inspecting the kernel backtraces on all the CPUs in the kdump file reveals the
following full kernel backtraces for these two:

```
PID: 29       TASK: ffff88af005c0000  CPU: 3    COMMAND: "ksoftirqd/3"
 #0 [fffffe0000088e50] crash_nmi_callback at ffffffff9984d033
 #1 [fffffe0000088e58] nmi_handle at ffffffff99822893
 #2 [fffffe0000088eb0] default_do_nmi at ffffffff99822d4e
 #3 [fffffe0000088ed0] do_nmi at ffffffff99822f28
 #4 [fffffe0000088ef0] end_repeat_nmi at ffffffff9a2016b4
    [exception RIP: native_queued_spin_lock_slowpath+91]
    RIP: ffffffff9990a92b  RSP: ffffb2fdc683b910  RFLAGS: 00000002
    RAX: 00000000005c0101  RBX: ffff88af00161ec0  RCX: 0000000000000003
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff88c60fce9c40
    RBP: ffff88c60fce9c40   R8: ffff88c60fce9560   R9: ffff88af07803230
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88af00162a7c
    R13: 0000000000000000  R14: 0000000000000046  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb2fdc683b910] native_queued_spin_lock_slowpath at ffffffff9990a92b
 #6 [ffffb2fdc683b910] _raw_spin_lock at ffffffff9a094f4c
 #7 [ffffb2fdc683b918] try_to_wake_up at ffffffff998e1463
 #8 [ffffb2fdc683b978] __queue_work at ffffffff998cdcb3
 #9 [ffffb2fdc683b9c0] queue_work_on at ffffffff998cdfd4
#10 [ffffb2fdc683b9d0] mga_imageblit at ffffffffc0422666 [mgag200]
#11 [ffffb2fdc683b9e8] soft_cursor at ffffffff99c9cfa4
#12 [ffffb2fdc683ba40] bit_cursor at ffffffff99c9cbd2
#13 [ffffb2fdc683bb08] hide_cursor at ffffffff99d2d78a
#14 [ffffb2fdc683bb18] vt_console_print at ffffffff99d2fb8d
#15 [ffffb2fdc683bb80] console_unlock at ffffffff999171d9
#16 [ffffb2fdc683bbc0] vprintk_emit at ffffffff99918e24
#17 [ffffb2fdc683bc18] printk at ffffffff999193a6
#18 [ffffb2fdc683bc78] no_context at ffffffff99868880
#19 [ffffb2fdc683bcd0] do_page_fault at ffffffff99869262
#20 [ffffb2fdc683bd00] page_fault at ffffffff9a20120e
    [exception RIP: poke_int3_handler+56]
    RIP: ffffffff998266e8  RSP: ffffb2fdc683bdb0  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffffb2fdc683bdd8  RCX: ffffffff9a200b87
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffffffff9a08f7f6
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#21 [ffffb2fdc683bdb8] do_int3 at ffffffff9981ed2e
#22 [ffffb2fdc683bdd0] int3 at ffffffff9a201124
#23 [ffffb2fdc683be58] __schedule at ffffffff9a08f7f7
#24 [ffffb2fdc683bed8] schedule at ffffffff9a08fd2f
#25 [ffffb2fdc683bee8] smpboot_thread_fn at ffffffff998d88cb
#26 [ffffb2fdc683bf10] kthread at ffffffff998d4812
#27 [ffffb2fdc683bf50] ret_from_fork at ffffffff9a20023f
```

This one corresponds to the dmesg error

```
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
```

By analyzing the PC value in the kernel backtrace, we found that it is crashed
in the poke_int3_handler() function in file arch/x86/kernel/alternative.c, on
line 832:

```
832         if (tp->addr != ip)
```

The offset of the `addr` in tp is indeed 8.

Also, we use gdb to find out `__schedule at ffffffff9a08f7f7` is for
kernel/sched/core.c:3431:

```
3431         trace_sched_switch(preempt, prev, next);
```

Our stap script indeed defines a `probe kernel.trace("sched_switch")`. So it
makes sense.

And there's another one for another CPU related to stap:

```
PID: 2789022  TASK: ffff88b461eadc40  CPU: 19   COMMAND: "stapio"
 #0 [fffffe0000338e50] crash_nmi_callback at ffffffff9984d033
 #1 [fffffe0000338e58] nmi_handle at ffffffff99822893
 #2 [fffffe0000338eb0] default_do_nmi at ffffffff99822d4e
 #3 [fffffe0000338ed0] do_nmi at ffffffff99822f28
 #4 [fffffe0000338ef0] end_repeat_nmi at ffffffff9a2016b4
    [exception RIP: smp_call_function_many+492]
    RIP: ffffffff9994f2ac  RSP: ffffb2fdc9b63d18  RFLAGS: 00000202
    RAX: 0000000000000003  RBX: ffff88c6100eacc0  RCX: ffff88c60fcf04c0
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffff88af0059f1b0
    RBP: ffffffff99826720   R8: 000000000002e040   R9: ffffffff9984ddda
    R10: ffffe28c203a4600  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000001  R14: 0000000000000020  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb2fdc9b63d18] smp_call_function_many at ffffffff9994f2ac
 #6 [ffffb2fdc9b63d58] on_each_cpu at ffffffff9994f368
 #7 [ffffb2fdc9b63d78] text_poke_bp_batch at ffffffff998270b9
 #8 [ffffb2fdc9b63db8] arch_jump_label_transform_apply at ffffffff9982433e
 #9 [ffffb2fdc9b63dc0] __static_key_slow_dec_cpuslocked at ffffffff99a1464f
#10 [ffffb2fdc9b63de0] static_key_slow_dec at ffffffff99a146d3
#11 [ffffb2fdc9b63df0] tracepoint_probe_unregister at ffffffff99994452
#12 [ffffb2fdc9b63e20] do_tracepoint_probe_unregister at ffffffffc0c905b3
[orxray_lj_c_off_cpu_2789022]
#13 [ffffb2fdc9b63e28] stp_tracepoint_probe_unregister at ffffffffc0ca8147
[orxray_lj_c_off_cpu_2789022]
#14 [ffffb2fdc9b63e50] utrace_shutdown at ffffffffc0ca8440
[orxray_lj_c_off_cpu_2789022]
#15 [ffffb2fdc9b63e58] stap_stop_task_finder at ffffffffc0ca85ad
[orxray_lj_c_off_cpu_2789022]
#16 [ffffb2fdc9b63e60] systemtap_module_exit at ffffffffc0ca932d
[orxray_lj_c_off_cpu_2789022]
#17 [ffffb2fdc9b63e80] _stp_cleanup_and_exit at ffffffffc0ca956e
[orxray_lj_c_off_cpu_2789022]
#18 [ffffb2fdc9b63e90] _stp_ctl_write_cmd at ffffffffc0ca96d1
[orxray_lj_c_off_cpu_2789022]
#19 [ffffb2fdc9b63eb8] proc_reg_write at ffffffff99b4241c
#20 [ffffb2fdc9b63ed0] vfs_write at ffffffff99ac7835
#21 [ffffb2fdc9b63f00] ksys_write at ffffffff99ac7aaf
#22 [ffffb2fdc9b63f38] do_syscall_64 at ffffffff9980419b
#23 [ffffb2fdc9b63f50] entry_SYSCALL_64_after_hwframe at ffffffff9a2000ad
    RIP: 00007f51f720faf7  RSP: 00007ffce4839fa0  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 0000000000000005  RCX: 00007f51f720faf7
    RDX: 0000000000000004  RSI: 00007ffce4839fe8  RDI: 0000000000000005
    RBP: 00007ffce4839fe8   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000004
    R13: 000000000000000b  R14: 00007f51f78356c0  R15: 00007ffce483a094
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
```

This is in the context of stapio while unregistering the module and the
tracepoint probes. But note that it is in utrace_shutdown(), which does not
involve the `sched_switch` tracepoint. So I'm not sure how these two contexts
are related to each other.

We tried hard to reproduce this on the same kernel in a controlled environment
but with no luck, even when the tracepoints are triggered 2M times per second
across 24 CPU cores, and the staprun command is repeated every second.

So far I only found the following upstream bug fix which may be related or may
not:
https://lore.kernel.org/all/20210805132717.23813-1-mathieu.desnoyers@efficios.com/

I have another theory is that maybe a previous run of a stap ko failed to
unregister its sched_switch tracepoint. The same tool was running periodically
in the same box after all.

Any help to debug this further will be highly recommended!

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug runtime/29577] NULL pointer dereference in kernel tracepoint int3 execution
  2022-09-16 19:52 [Bug runtime/29577] New: NULL pointer dereference in kernel tracepoint int3 execution agentzh at gmail dot com
@ 2022-09-16 20:49 ` agentzh at gmail dot com
  2022-09-18  1:35 ` agentzh at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: agentzh at gmail dot com @ 2022-09-16 20:49 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=29577

--- Comment #1 from agentzh <agentzh at gmail dot com> ---
I just proposed a patch to fix it but haven't verified it since it's extremely
hard to reproduce:

https://gist.github.com/agentzh/905021679609619ec1b62011870cd926

It looks like related.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [Bug runtime/29577] NULL pointer dereference in kernel tracepoint int3 execution
  2022-09-16 19:52 [Bug runtime/29577] New: NULL pointer dereference in kernel tracepoint int3 execution agentzh at gmail dot com
  2022-09-16 20:49 ` [Bug runtime/29577] " agentzh at gmail dot com
@ 2022-09-18  1:35 ` agentzh at gmail dot com
  1 sibling, 0 replies; 3+ messages in thread
From: agentzh at gmail dot com @ 2022-09-18  1:35 UTC (permalink / raw)
  To: systemtap

https://sourceware.org/bugzilla/show_bug.cgi?id=29577

agentzh <agentzh at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #2 from agentzh <agentzh at gmail dot com> ---
OK, now after analyzing the kdump file in more detail, I believe it is this
kernel bug fixed upstream almost 3 years ago:

https://lkml.iu.edu/hypermail/linux/kernel/1912.0/00303.html

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-09-18  1:35 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16 19:52 [Bug runtime/29577] New: NULL pointer dereference in kernel tracepoint int3 execution agentzh at gmail dot com
2022-09-16 20:49 ` [Bug runtime/29577] " agentzh at gmail dot com
2022-09-18  1:35 ` agentzh at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).