From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 957D5385E037; Fri, 16 Sep 2022 19:52:41 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 957D5385E037
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1663357961;
	bh=zuambYetZVl7u0rfjbjw2ErKgP13jKR8Yet7RFlAeJk=;
	h=From:To:Subject:Date:From;
	b=T4mXXkCHWrW2WImk5X3bqGFUXX8ybP+QpnNU7CJaPM29RAW0MZhAknCk+5Zvk6Npb
	 JrtgrJiTZr4wtAYE7nIVqNIjuWaM/WzX4URS81FTbN5hjao2hHOZ7kajdZqmFE0T+x
	 rv0bwNpYN/4yudRncX6Gkmo+ry4a6ww5WegMvIDA=
From: "agentzh at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: systemtap@sourceware.org
Subject: [Bug runtime/29577] New: NULL pointer dereference in kernel
 tracepoint int3 execution
Date: Fri, 16 Sep 2022 19:52:41 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: systemtap
X-Bugzilla-Component: runtime
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: critical
X-Bugzilla-Who: agentzh at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: systemtap at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-29577-6586@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <systemtap.sourceware.org>

https://sourceware.org/bugzilla/show_bug.cgi?id=3D29577

            Bug ID: 29577
           Summary: NULL pointer dereference in kernel tracepoint int3
                    execution
           Product: systemtap
           Version: unspecified
            Status: UNCONFIRMED
          Severity: critical
          Priority: P2
         Component: runtime
          Assignee: systemtap at sourceware dot org
          Reporter: agentzh at gmail dot com
  Target Milestone: ---

Hi folks!

We observed several NULL pointer dereference errors in CentOS 8 kernel
`4.18.0-193.6.3.el8_2.x86_64`. They all look like this:

```
  11948 [39391079.662703] BUG: unable to handle kernel NULL pointer derefer=
ence
at 0000000000000008
  11949 [39391094.375018] NMI watchdog: Watchdog detected hard LOCKUP on cpu
2Modules linked in: orxray_lj_c_off_cpu_2789022(OE) joydev binfmt_misc tcp_=
diag
inet_diag mpt3sas raid_class scsi_transport_sas 8021q garp mrp stp llc bond=
ing
dell_rbu intel_rapl_msr intel_rapl_c  11949 ommon skx_edac nfit libnvdimm
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate
dell_smbios iTCO_wdt iTCO_vendor_support dell_wmi_descriptor wmi_bmof dcdbas
intel_uncore   11949 pcspkr intel_rapl_perf mei_me ipmi_si mei ipmi_devintf
i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter sch_fq tcp_bbr dummy
ip_tables xfs libcrc32c sd_mod sg mgag200 drm_vram_helper ttm drm_kms_helper
syscopyarea sysfillrect sysimgblt ahci fb_sys_fops libahc  11949 i drm
crc32c_intel libata megaraid_sas igb dca i2c_algo_bit i40e [last unloaded:
orxray_count_off_cpu_frames_2788869]
  11950 [39391094.375028] CPU: 2 PID: 2789064 Comm: ps Kdump: loaded Tainte=
d: G
          OE    --------- -  - 4.18.0-193.6.3.el8_2.x86_64 #1
  11951 [39391094.375029] Hardware name: Dell Inc. PowerEdge R640/06DKY5, B=
IOS
2.6.4 04/09/2020
  11952 [39391094.375029] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0=
x1d0
  11953 [39391094.375030] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 0=
8 89
c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 =
90
<8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
  11954 [39391094.375030] RSP: 0018:ffffb2fdc90f7c68 EFLAGS: 00000002
  11955 [39391094.375031] RAX: 0000000000000101 RBX: ffff88c60fce9540 RCX:
0000000000000003
  11956 [39391094.375031] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff88c60fce9540
  11957 [39391094.375031] RBP: 000000000000000b R08: 0000000000000008 R09:
0000000000333a33
  11958 [39391094.375032] R10: 8080808080808080 R11: 0000000000000010 R12:
ffffb2fdc90f7ca8
  11959 [39391094.375032] R13: 0000000000000040 R14: ffff88c3506a0180 R15:
0000000000000000
  11960 [39391094.375032] FS:  00007f2f11c9d940(0000) GS:ffff88c60fc80000(0=
000)
knlGS:0000000000000000
  11961 [39391094.375033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  11962 [39391094.375033] CR2: 000055fd3a4ce038 CR3: 0000000d7d370002 CR4:
00000000007606e0
  11963 [39391094.375033] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
  11964 [39391094.375034] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
  11965 [39391094.375034] PKRU: 55555554
  11966 [39391094.375034] Call Trace:
  11967 [39391094.375034]  _raw_spin_lock_irq+0x23/0x26
  11968 [39391094.375034]  wq_worker_comm+0x77/0xd0
  11969 [39391094.375034]  proc_task_name+0x44/0x120
  11970 [39391094.375035]  ? num_to_str+0xac/0xc0
  11971 [39391094.375035]  do_task_stat+0x434/0xde0
  11972 [39391094.375035]  proc_single_show+0x4e/0x90
  11973 [39391094.375035]  seq_read+0x153/0x410
  11974 [39391094.375035]  vfs_read+0x91/0x140
  11975 [39391094.375035]  ksys_read+0x4f/0xb0
  11976 [39391094.375035]  do_syscall_64+0x5b/0x1a0
  11977 [39391094.375036]  entry_SYSCALL_64_after_hwframe+0x65/0xca
  11978 [39391094.375036] RIP: 0033:0x7f2f11049a85
  11979 [39391094.375036] Code: fe ff ff 50 48 8d 3d 32 d8 09 00 e8 b5 0e 0=
2 00
0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 d5 4b 2d 00 8b 00 85 c0 75 0f 31 c0 0f =
05
<48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
  11980 [39391094.375036] RSP: 002b:00007ffff7ee8b78 EFLAGS: 00000246 ORIG_=
RAX:
0000000000000000
  11981 [39391094.375037] RAX: ffffffffffffffda RBX: 00007f2f11a7a678 RCX:
00007f2f11049a85
  11982 [39391094.375037] RDX: 0000000000000800 RSI: 000055fd3a4cb5e0 RDI:
0000000000000006
  11983 [39391094.375037] RBP: 00007f2f11a7a670 R08: 00007f2f11875ea5 R09:
00007f2f110dac80
  11984 [39391094.375037] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000000006
  11985 [39391094.375038] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000
  11986 [39391094.375038] Kernel panic - not syncing: Hard LOCKUP
```

BTW, the orxray_XXX modules are stap ko modules.

Basically it is always a BUG for null pointer dereference followed by anoth=
er
BUG for soft-lockup detected by the NMI watchdog.

Inspecting the kernel backtraces on all the CPUs in the kdump file reveals =
the
following full kernel backtraces for these two:

```
PID: 29       TASK: ffff88af005c0000  CPU: 3    COMMAND: "ksoftirqd/3"
 #0 [fffffe0000088e50] crash_nmi_callback at ffffffff9984d033
 #1 [fffffe0000088e58] nmi_handle at ffffffff99822893
 #2 [fffffe0000088eb0] default_do_nmi at ffffffff99822d4e
 #3 [fffffe0000088ed0] do_nmi at ffffffff99822f28
 #4 [fffffe0000088ef0] end_repeat_nmi at ffffffff9a2016b4
    [exception RIP: native_queued_spin_lock_slowpath+91]
    RIP: ffffffff9990a92b  RSP: ffffb2fdc683b910  RFLAGS: 00000002
    RAX: 00000000005c0101  RBX: ffff88af00161ec0  RCX: 0000000000000003
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff88c60fce9c40
    RBP: ffff88c60fce9c40   R8: ffff88c60fce9560   R9: ffff88af07803230
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88af00162a7c
    R13: 0000000000000000  R14: 0000000000000046  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb2fdc683b910] native_queued_spin_lock_slowpath at ffffffff9990a92b
 #6 [ffffb2fdc683b910] _raw_spin_lock at ffffffff9a094f4c
 #7 [ffffb2fdc683b918] try_to_wake_up at ffffffff998e1463
 #8 [ffffb2fdc683b978] __queue_work at ffffffff998cdcb3
 #9 [ffffb2fdc683b9c0] queue_work_on at ffffffff998cdfd4
#10 [ffffb2fdc683b9d0] mga_imageblit at ffffffffc0422666 [mgag200]
#11 [ffffb2fdc683b9e8] soft_cursor at ffffffff99c9cfa4
#12 [ffffb2fdc683ba40] bit_cursor at ffffffff99c9cbd2
#13 [ffffb2fdc683bb08] hide_cursor at ffffffff99d2d78a
#14 [ffffb2fdc683bb18] vt_console_print at ffffffff99d2fb8d
#15 [ffffb2fdc683bb80] console_unlock at ffffffff999171d9
#16 [ffffb2fdc683bbc0] vprintk_emit at ffffffff99918e24
#17 [ffffb2fdc683bc18] printk at ffffffff999193a6
#18 [ffffb2fdc683bc78] no_context at ffffffff99868880
#19 [ffffb2fdc683bcd0] do_page_fault at ffffffff99869262
#20 [ffffb2fdc683bd00] page_fault at ffffffff9a20120e
    [exception RIP: poke_int3_handler+56]
    RIP: ffffffff998266e8  RSP: ffffb2fdc683bdb0  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffffb2fdc683bdd8  RCX: ffffffff9a200b87
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffffffff9a08f7f6
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#21 [ffffb2fdc683bdb8] do_int3 at ffffffff9981ed2e
#22 [ffffb2fdc683bdd0] int3 at ffffffff9a201124
#23 [ffffb2fdc683be58] __schedule at ffffffff9a08f7f7
#24 [ffffb2fdc683bed8] schedule at ffffffff9a08fd2f
#25 [ffffb2fdc683bee8] smpboot_thread_fn at ffffffff998d88cb
#26 [ffffb2fdc683bf10] kthread at ffffffff998d4812
#27 [ffffb2fdc683bf50] ret_from_fork at ffffffff9a20023f
```

This one corresponds to the dmesg error

```
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
```

By analyzing the PC value in the kernel backtrace, we found that it is cras=
hed
in the poke_int3_handler() function in file arch/x86/kernel/alternative.c, =
on
line 832:

```
832         if (tp->addr !=3D ip)
```

The offset of the `addr` in tp is indeed 8.

Also, we use gdb to find out `__schedule at ffffffff9a08f7f7` is for
kernel/sched/core.c:3431:

```
3431         trace_sched_switch(preempt, prev, next);
```

Our stap script indeed defines a `probe kernel.trace("sched_switch")`. So it
makes sense.

And there's another one for another CPU related to stap:

```
PID: 2789022  TASK: ffff88b461eadc40  CPU: 19   COMMAND: "stapio"
 #0 [fffffe0000338e50] crash_nmi_callback at ffffffff9984d033
 #1 [fffffe0000338e58] nmi_handle at ffffffff99822893
 #2 [fffffe0000338eb0] default_do_nmi at ffffffff99822d4e
 #3 [fffffe0000338ed0] do_nmi at ffffffff99822f28
 #4 [fffffe0000338ef0] end_repeat_nmi at ffffffff9a2016b4
    [exception RIP: smp_call_function_many+492]
    RIP: ffffffff9994f2ac  RSP: ffffb2fdc9b63d18  RFLAGS: 00000202
    RAX: 0000000000000003  RBX: ffff88c6100eacc0  RCX: ffff88c60fcf04c0
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: ffff88af0059f1b0
    RBP: ffffffff99826720   R8: 000000000002e040   R9: ffffffff9984ddda
    R10: ffffe28c203a4600  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000001  R14: 0000000000000020  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb2fdc9b63d18] smp_call_function_many at ffffffff9994f2ac
 #6 [ffffb2fdc9b63d58] on_each_cpu at ffffffff9994f368
 #7 [ffffb2fdc9b63d78] text_poke_bp_batch at ffffffff998270b9
 #8 [ffffb2fdc9b63db8] arch_jump_label_transform_apply at ffffffff9982433e
 #9 [ffffb2fdc9b63dc0] __static_key_slow_dec_cpuslocked at ffffffff99a1464f
#10 [ffffb2fdc9b63de0] static_key_slow_dec at ffffffff99a146d3
#11 [ffffb2fdc9b63df0] tracepoint_probe_unregister at ffffffff99994452
#12 [ffffb2fdc9b63e20] do_tracepoint_probe_unregister at ffffffffc0c905b3
[orxray_lj_c_off_cpu_2789022]
#13 [ffffb2fdc9b63e28] stp_tracepoint_probe_unregister at ffffffffc0ca8147
[orxray_lj_c_off_cpu_2789022]
#14 [ffffb2fdc9b63e50] utrace_shutdown at ffffffffc0ca8440
[orxray_lj_c_off_cpu_2789022]
#15 [ffffb2fdc9b63e58] stap_stop_task_finder at ffffffffc0ca85ad
[orxray_lj_c_off_cpu_2789022]
#16 [ffffb2fdc9b63e60] systemtap_module_exit at ffffffffc0ca932d
[orxray_lj_c_off_cpu_2789022]
#17 [ffffb2fdc9b63e80] _stp_cleanup_and_exit at ffffffffc0ca956e
[orxray_lj_c_off_cpu_2789022]
#18 [ffffb2fdc9b63e90] _stp_ctl_write_cmd at ffffffffc0ca96d1
[orxray_lj_c_off_cpu_2789022]
#19 [ffffb2fdc9b63eb8] proc_reg_write at ffffffff99b4241c
#20 [ffffb2fdc9b63ed0] vfs_write at ffffffff99ac7835
#21 [ffffb2fdc9b63f00] ksys_write at ffffffff99ac7aaf
#22 [ffffb2fdc9b63f38] do_syscall_64 at ffffffff9980419b
#23 [ffffb2fdc9b63f50] entry_SYSCALL_64_after_hwframe at ffffffff9a2000ad
    RIP: 00007f51f720faf7  RSP: 00007ffce4839fa0  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 0000000000000005  RCX: 00007f51f720faf7
    RDX: 0000000000000004  RSI: 00007ffce4839fe8  RDI: 0000000000000005
    RBP: 00007ffce4839fe8   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000004
    R13: 000000000000000b  R14: 00007f51f78356c0  R15: 00007ffce483a094
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
```

This is in the context of stapio while unregistering the module and the
tracepoint probes. But note that it is in utrace_shutdown(), which does not
involve the `sched_switch` tracepoint. So I'm not sure how these two contex=
ts
are related to each other.

We tried hard to reproduce this on the same kernel in a controlled environm=
ent
but with no luck, even when the tracepoints are triggered 2M times per seco=
nd
across 24 CPU cores, and the staprun command is repeated every second.

So far I only found the following upstream bug fix which may be related or =
may
not:
https://lore.kernel.org/all/20210805132717.23813-1-mathieu.desnoyers@effici=
os.com/

I have another theory is that maybe a previous run of a stap ko failed to
unregister its sched_switch tracepoint. The same tool was running periodica=
lly
in the same box after all.

Any help to debug this further will be highly recommended!

--=20
You are receiving this mail because:
You are the assignee for the bug.=