From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 957D5385E037; Fri, 16 Sep 2022 19:52:41 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 957D5385E037 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1663357961; bh=zuambYetZVl7u0rfjbjw2ErKgP13jKR8Yet7RFlAeJk=; h=From:To:Subject:Date:From; b=T4mXXkCHWrW2WImk5X3bqGFUXX8ybP+QpnNU7CJaPM29RAW0MZhAknCk+5Zvk6Npb JrtgrJiTZr4wtAYE7nIVqNIjuWaM/WzX4URS81FTbN5hjao2hHOZ7kajdZqmFE0T+x rv0bwNpYN/4yudRncX6Gkmo+ry4a6ww5WegMvIDA= From: "agentzh at gmail dot com" To: systemtap@sourceware.org Subject: [Bug runtime/29577] New: NULL pointer dereference in kernel tracepoint int3 execution Date: Fri, 16 Sep 2022 19:52:41 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: systemtap X-Bugzilla-Component: runtime X-Bugzilla-Version: unspecified X-Bugzilla-Keywords: X-Bugzilla-Severity: critical X-Bugzilla-Who: agentzh at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P2 X-Bugzilla-Assigned-To: systemtap at sourceware dot org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://sourceware.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://sourceware.org/bugzilla/show_bug.cgi?id=3D29577 Bug ID: 29577 Summary: NULL pointer dereference in kernel tracepoint int3 execution Product: systemtap Version: unspecified Status: UNCONFIRMED Severity: critical Priority: P2 Component: runtime Assignee: systemtap at sourceware dot org Reporter: agentzh at gmail dot com Target Milestone: --- Hi folks! We observed several NULL pointer dereference errors in CentOS 8 kernel `4.18.0-193.6.3.el8_2.x86_64`. They all look like this: ``` 11948 [39391079.662703] BUG: unable to handle kernel NULL pointer derefer= ence at 0000000000000008 11949 [39391094.375018] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2Modules linked in: orxray_lj_c_off_cpu_2789022(OE) joydev binfmt_misc tcp_= diag inet_diag mpt3sas raid_class scsi_transport_sas 8021q garp mrp stp llc bond= ing dell_rbu intel_rapl_msr intel_rapl_c 11949 ommon skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate dell_smbios iTCO_wdt iTCO_vendor_support dell_wmi_descriptor wmi_bmof dcdbas intel_uncore 11949 pcspkr intel_rapl_perf mei_me ipmi_si mei ipmi_devintf i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter sch_fq tcp_bbr dummy ip_tables xfs libcrc32c sd_mod sg mgag200 drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt ahci fb_sys_fops libahc 11949 i drm crc32c_intel libata megaraid_sas igb dca i2c_algo_bit i40e [last unloaded: orxray_count_off_cpu_frames_2788869] 11950 [39391094.375028] CPU: 2 PID: 2789064 Comm: ps Kdump: loaded Tainte= d: G OE --------- - - 4.18.0-193.6.3.el8_2.x86_64 #1 11951 [39391094.375029] Hardware name: Dell Inc. PowerEdge R640/06DKY5, B= IOS 2.6.4 04/09/2020 11952 [39391094.375029] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0= x1d0 11953 [39391094.375030] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 0= 8 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 = 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 11954 [39391094.375030] RSP: 0018:ffffb2fdc90f7c68 EFLAGS: 00000002 11955 [39391094.375031] RAX: 0000000000000101 RBX: ffff88c60fce9540 RCX: 0000000000000003 11956 [39391094.375031] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88c60fce9540 11957 [39391094.375031] RBP: 000000000000000b R08: 0000000000000008 R09: 0000000000333a33 11958 [39391094.375032] R10: 8080808080808080 R11: 0000000000000010 R12: ffffb2fdc90f7ca8 11959 [39391094.375032] R13: 0000000000000040 R14: ffff88c3506a0180 R15: 0000000000000000 11960 [39391094.375032] FS: 00007f2f11c9d940(0000) GS:ffff88c60fc80000(0= 000) knlGS:0000000000000000 11961 [39391094.375033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 11962 [39391094.375033] CR2: 000055fd3a4ce038 CR3: 0000000d7d370002 CR4: 00000000007606e0 11963 [39391094.375033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 11964 [39391094.375034] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 11965 [39391094.375034] PKRU: 55555554 11966 [39391094.375034] Call Trace: 11967 [39391094.375034] _raw_spin_lock_irq+0x23/0x26 11968 [39391094.375034] wq_worker_comm+0x77/0xd0 11969 [39391094.375034] proc_task_name+0x44/0x120 11970 [39391094.375035] ? num_to_str+0xac/0xc0 11971 [39391094.375035] do_task_stat+0x434/0xde0 11972 [39391094.375035] proc_single_show+0x4e/0x90 11973 [39391094.375035] seq_read+0x153/0x410 11974 [39391094.375035] vfs_read+0x91/0x140 11975 [39391094.375035] ksys_read+0x4f/0xb0 11976 [39391094.375035] do_syscall_64+0x5b/0x1a0 11977 [39391094.375036] entry_SYSCALL_64_after_hwframe+0x65/0xca 11978 [39391094.375036] RIP: 0033:0x7f2f11049a85 11979 [39391094.375036] Code: fe ff ff 50 48 8d 3d 32 d8 09 00 e8 b5 0e 0= 2 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 d5 4b 2d 00 8b 00 85 c0 75 0f 31 c0 0f = 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 11980 [39391094.375036] RSP: 002b:00007ffff7ee8b78 EFLAGS: 00000246 ORIG_= RAX: 0000000000000000 11981 [39391094.375037] RAX: ffffffffffffffda RBX: 00007f2f11a7a678 RCX: 00007f2f11049a85 11982 [39391094.375037] RDX: 0000000000000800 RSI: 000055fd3a4cb5e0 RDI: 0000000000000006 11983 [39391094.375037] RBP: 00007f2f11a7a670 R08: 00007f2f11875ea5 R09: 00007f2f110dac80 11984 [39391094.375037] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006 11985 [39391094.375038] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 11986 [39391094.375038] Kernel panic - not syncing: Hard LOCKUP ``` BTW, the orxray_XXX modules are stap ko modules. Basically it is always a BUG for null pointer dereference followed by anoth= er BUG for soft-lockup detected by the NMI watchdog. Inspecting the kernel backtraces on all the CPUs in the kdump file reveals = the following full kernel backtraces for these two: ``` PID: 29 TASK: ffff88af005c0000 CPU: 3 COMMAND: "ksoftirqd/3" #0 [fffffe0000088e50] crash_nmi_callback at ffffffff9984d033 #1 [fffffe0000088e58] nmi_handle at ffffffff99822893 #2 [fffffe0000088eb0] default_do_nmi at ffffffff99822d4e #3 [fffffe0000088ed0] do_nmi at ffffffff99822f28 #4 [fffffe0000088ef0] end_repeat_nmi at ffffffff9a2016b4 [exception RIP: native_queued_spin_lock_slowpath+91] RIP: ffffffff9990a92b RSP: ffffb2fdc683b910 RFLAGS: 00000002 RAX: 00000000005c0101 RBX: ffff88af00161ec0 RCX: 0000000000000003 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88c60fce9c40 RBP: ffff88c60fce9c40 R8: ffff88c60fce9560 R9: ffff88af07803230 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88af00162a7c R13: 0000000000000000 R14: 0000000000000046 R15: 0000000000000003 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- --- #5 [ffffb2fdc683b910] native_queued_spin_lock_slowpath at ffffffff9990a92b #6 [ffffb2fdc683b910] _raw_spin_lock at ffffffff9a094f4c #7 [ffffb2fdc683b918] try_to_wake_up at ffffffff998e1463 #8 [ffffb2fdc683b978] __queue_work at ffffffff998cdcb3 #9 [ffffb2fdc683b9c0] queue_work_on at ffffffff998cdfd4 #10 [ffffb2fdc683b9d0] mga_imageblit at ffffffffc0422666 [mgag200] #11 [ffffb2fdc683b9e8] soft_cursor at ffffffff99c9cfa4 #12 [ffffb2fdc683ba40] bit_cursor at ffffffff99c9cbd2 #13 [ffffb2fdc683bb08] hide_cursor at ffffffff99d2d78a #14 [ffffb2fdc683bb18] vt_console_print at ffffffff99d2fb8d #15 [ffffb2fdc683bb80] console_unlock at ffffffff999171d9 #16 [ffffb2fdc683bbc0] vprintk_emit at ffffffff99918e24 #17 [ffffb2fdc683bc18] printk at ffffffff999193a6 #18 [ffffb2fdc683bc78] no_context at ffffffff99868880 #19 [ffffb2fdc683bcd0] do_page_fault at ffffffff99869262 #20 [ffffb2fdc683bd00] page_fault at ffffffff9a20120e [exception RIP: poke_int3_handler+56] RIP: ffffffff998266e8 RSP: ffffb2fdc683bdb0 RFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffb2fdc683bdd8 RCX: ffffffff9a200b87 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9a08f7f6 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #21 [ffffb2fdc683bdb8] do_int3 at ffffffff9981ed2e #22 [ffffb2fdc683bdd0] int3 at ffffffff9a201124 #23 [ffffb2fdc683be58] __schedule at ffffffff9a08f7f7 #24 [ffffb2fdc683bed8] schedule at ffffffff9a08fd2f #25 [ffffb2fdc683bee8] smpboot_thread_fn at ffffffff998d88cb #26 [ffffb2fdc683bf10] kthread at ffffffff998d4812 #27 [ffffb2fdc683bf50] ret_from_fork at ffffffff9a20023f ``` This one corresponds to the dmesg error ``` BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 ``` By analyzing the PC value in the kernel backtrace, we found that it is cras= hed in the poke_int3_handler() function in file arch/x86/kernel/alternative.c, = on line 832: ``` 832 if (tp->addr !=3D ip) ``` The offset of the `addr` in tp is indeed 8. Also, we use gdb to find out `__schedule at ffffffff9a08f7f7` is for kernel/sched/core.c:3431: ``` 3431 trace_sched_switch(preempt, prev, next); ``` Our stap script indeed defines a `probe kernel.trace("sched_switch")`. So it makes sense. And there's another one for another CPU related to stap: ``` PID: 2789022 TASK: ffff88b461eadc40 CPU: 19 COMMAND: "stapio" #0 [fffffe0000338e50] crash_nmi_callback at ffffffff9984d033 #1 [fffffe0000338e58] nmi_handle at ffffffff99822893 #2 [fffffe0000338eb0] default_do_nmi at ffffffff99822d4e #3 [fffffe0000338ed0] do_nmi at ffffffff99822f28 #4 [fffffe0000338ef0] end_repeat_nmi at ffffffff9a2016b4 [exception RIP: smp_call_function_many+492] RIP: ffffffff9994f2ac RSP: ffffb2fdc9b63d18 RFLAGS: 00000202 RAX: 0000000000000003 RBX: ffff88c6100eacc0 RCX: ffff88c60fcf04c0 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88af0059f1b0 RBP: ffffffff99826720 R8: 000000000002e040 R9: ffffffff9984ddda R10: ffffe28c203a4600 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000020 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- --- #5 [ffffb2fdc9b63d18] smp_call_function_many at ffffffff9994f2ac #6 [ffffb2fdc9b63d58] on_each_cpu at ffffffff9994f368 #7 [ffffb2fdc9b63d78] text_poke_bp_batch at ffffffff998270b9 #8 [ffffb2fdc9b63db8] arch_jump_label_transform_apply at ffffffff9982433e #9 [ffffb2fdc9b63dc0] __static_key_slow_dec_cpuslocked at ffffffff99a1464f #10 [ffffb2fdc9b63de0] static_key_slow_dec at ffffffff99a146d3 #11 [ffffb2fdc9b63df0] tracepoint_probe_unregister at ffffffff99994452 #12 [ffffb2fdc9b63e20] do_tracepoint_probe_unregister at ffffffffc0c905b3 [orxray_lj_c_off_cpu_2789022] #13 [ffffb2fdc9b63e28] stp_tracepoint_probe_unregister at ffffffffc0ca8147 [orxray_lj_c_off_cpu_2789022] #14 [ffffb2fdc9b63e50] utrace_shutdown at ffffffffc0ca8440 [orxray_lj_c_off_cpu_2789022] #15 [ffffb2fdc9b63e58] stap_stop_task_finder at ffffffffc0ca85ad [orxray_lj_c_off_cpu_2789022] #16 [ffffb2fdc9b63e60] systemtap_module_exit at ffffffffc0ca932d [orxray_lj_c_off_cpu_2789022] #17 [ffffb2fdc9b63e80] _stp_cleanup_and_exit at ffffffffc0ca956e [orxray_lj_c_off_cpu_2789022] #18 [ffffb2fdc9b63e90] _stp_ctl_write_cmd at ffffffffc0ca96d1 [orxray_lj_c_off_cpu_2789022] #19 [ffffb2fdc9b63eb8] proc_reg_write at ffffffff99b4241c #20 [ffffb2fdc9b63ed0] vfs_write at ffffffff99ac7835 #21 [ffffb2fdc9b63f00] ksys_write at ffffffff99ac7aaf #22 [ffffb2fdc9b63f38] do_syscall_64 at ffffffff9980419b #23 [ffffb2fdc9b63f50] entry_SYSCALL_64_after_hwframe at ffffffff9a2000ad RIP: 00007f51f720faf7 RSP: 00007ffce4839fa0 RFLAGS: 00000293 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f51f720faf7 RDX: 0000000000000004 RSI: 00007ffce4839fe8 RDI: 0000000000000005 RBP: 00007ffce4839fe8 R8: 0000000000000000 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000004 R13: 000000000000000b R14: 00007f51f78356c0 R15: 00007ffce483a094 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b ``` This is in the context of stapio while unregistering the module and the tracepoint probes. But note that it is in utrace_shutdown(), which does not involve the `sched_switch` tracepoint. So I'm not sure how these two contex= ts are related to each other. We tried hard to reproduce this on the same kernel in a controlled environm= ent but with no luck, even when the tracepoints are triggered 2M times per seco= nd across 24 CPU cores, and the staprun command is repeated every second. So far I only found the following upstream bug fix which may be related or = may not: https://lore.kernel.org/all/20210805132717.23813-1-mathieu.desnoyers@effici= os.com/ I have another theory is that maybe a previous run of a stap ko failed to unregister its sched_switch tracepoint. The same tool was running periodica= lly in the same box after all. Any help to debug this further will be highly recommended! --=20 You are receiving this mail because: You are the assignee for the bug.=