[Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels

public inbox for systemtap@sourceware.org

* [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
@ 2012-05-14 15:41 mjw at redhat dot com
  2012-05-14 15:48 ` [Bug runtime/14107] " mjw at redhat dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-14 15:41 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

             Bug #: 14107
           Summary: Bad user unwinding from kernel fatal signal handler
                    for some x86_64 kernels
           Product: systemtap
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: runtime
        AssignedTo: systemtap@sourceware.org
        ReportedBy: mjw@redhat.com
                CC: atomlin@redhat.com, bmr@redhat.com
    Classification: Unclassified


The following program:

int
func (void)
{
        int *foo = (void *) 0x1234;
        *foo = 0x12345;
        return 0;
}

int
main (void)
{
  return func ();
}

compiled with gcc -o bad_code bad_code.c and the following stap script:

probe kernel.function("show_signal_msg") {
        /* 0x6 == (PF_USER | PF_WRITE) */
        if (execname() == "bad_code") {
                if ($error_code & 0x6) {
                        printf ("\nUser mode process %s [pid: %d] received a SIGSEGV - error_code: 0x%x\n",
                                execname(), pid(), $error_code)
                        print_ubacktrace()
                }
        }
}


run with: stap -d ./bad_code --ldd show_signal_msg.stp -c ./bad_code

produces the following (correct) user backtrace on 3.3.5-2.fc16.x86_64:

User mode process bad_code [pid: 18431] received a SIGSEGV - error_code: 0x6
 0x400484 : func+0x10/0x1d [/usr/local/build/systemtap-obj/bad_code]
 0x40049a : main+0x9/0xf [/usr/local/build/systemtap-obj/bad_code]
 0x7fd419d1069d : __libc_start_main+0xed/0x1c0 [/lib64/libc-2.14.90.so]
 0x4003b9 : _start+0x29/0x2c [/usr/local/build/systemtap-obj/bad_code]

But on some other x86_64 kernels it produces:

WARNING: _stp_read_address failed to access memory location

User mode process bad_code [pid: 12152] received a SIGSEGV - error_code: 0x6
 0x400484 : func+0x10/0x1d [/home/mark/build/systemtap-obj/bad_code]
Warning: child process exited with signal 11 (Segmentation fault)
WARNING: Number of errors: 0, skipped probes: 1
WARNING: /usr/local/install/systemtap/bin/staprun exited with status: 1
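
For readers unfamiliar with the `$error_code & 0x6` test in the script above, here is a minimal Python sketch of how the x86 page-fault error code bits decode. The bit names follow the Linux kernel's naming; the helper functions `decode_error_code` and `is_user_write` are hypothetical and only for illustration. Note that `& 0x6` is true when either bit is set, while the stricter helper requires both.

```python
# Bit meanings of the x86 page-fault error code (names as used in the Linux kernel).
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: fault in kernel mode, 1: fault in user mode
PF_RSVD = 1 << 3   # reserved bit was set in a page-table entry
PF_INSTR = 1 << 4  # fault was caused by an instruction fetch

_NAMES = [
    ("PF_PROT", PF_PROT),
    ("PF_WRITE", PF_WRITE),
    ("PF_USER", PF_USER),
    ("PF_RSVD", PF_RSVD),
    ("PF_INSTR", PF_INSTR),
]


def decode_error_code(code):
    """Return the names of all bits set in a page-fault error code."""
    return [name for name, bit in _NAMES if code & bit]


def is_user_write(code):
    """True only if both PF_USER and PF_WRITE are set (hypothetical helper)."""
    mask = PF_USER | PF_WRITE
    return (code & mask) == mask


# The error code reported in both runs above.
print(decode_error_code(0x6))
```

An error code of 0x6 therefore describes a user-mode write to a page that is not present, which matches the store through the bogus pointer 0x1234 in bad_code.c.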

-- 
Configure bugmail: http://sourceware.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


* [Bug runtime/14107] Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
  2012-05-14 15:41 [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels mjw at redhat dot com
@ 2012-05-14 15:48 ` mjw at redhat dot com
  2012-05-14 22:17 ` mjw at redhat dot com
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-14 15:48 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

--- Comment #1 from Mark Wielaard <mjw at redhat dot com> 2012-05-14 15:47:52 UTC ---
The issue is that on x86_64 (it doesn't happen on i686) stap tries to recover
the user space registers by unwinding the kernel stack. This succeeds on the
f16 kernel and then the unwinder takes those recovered registers to do the user
space unwind. But it fails on the rhel6 kernel. See -DDEBUG_UNWIND=99 output:

_stp_get_uregs:194: unwind levels: 15, ret: -5, pc=0xffffffff814ef8f5
_stp_get_uregs:209: failed to recover user reg state

And then for user space the unwinder has to make do with partial register
values and fails...


* [Bug runtime/14107] Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
  2012-05-14 15:41 [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels mjw at redhat dot com
  2012-05-14 15:48 ` [Bug runtime/14107] " mjw at redhat dot com
@ 2012-05-14 22:17 ` mjw at redhat dot com
  2012-05-14 22:23 ` mjw at redhat dot com
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-14 22:17 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

--- Comment #2 from Mark Wielaard <mjw at redhat dot com> 2012-05-14 22:16:13 UTC ---
(In reply to comment #1)
> See -DDEBUG_UNWIND=99 output:
> 
> _stp_get_uregs:194: unwind levels: 15, ret: -5, pc=0xffffffff814ef8f5
> _stp_get_uregs:209: failed to recover user reg state
> 
> And then for user space the unwinder has to make do with partial register
> values and fails...

According to /proc/kallsyms:

ffffffff814ef8d0 T page_fault
ffffffff814ef900 T machine_check
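
The point of the two kallsyms lines is that the failing pc lies inside page_fault. A minimal Python sketch of that lookup, using only the two symbols quoted above (the `symbolize` helper is hypothetical, not part of systemtap):

```python
import bisect

# The two /proc/kallsyms entries quoted above, sorted by address.
SYMBOLS = [
    (0xffffffff814ef8d0, "page_fault"),
    (0xffffffff814ef900, "machine_check"),
]


def symbolize(pc, symbols=SYMBOLS):
    """Return 'name+0xoffset' for the nearest symbol at or below pc.

    With only two symbols in the table, any pc above the last entry is
    attributed to it; a full kallsyms table would not have that problem.
    """
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None  # pc is below the first known symbol
    addr, name = symbols[i]
    return "%s+0x%x" % (name, pc - addr)


# The pc at which _stp_get_uregs gave up (ret: -5).
print(symbolize(0xffffffff814ef8f5))  # page_fault+0x25
```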


* [Bug runtime/14107] Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
  2012-05-14 15:41 [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels mjw at redhat dot com
  2012-05-14 15:48 ` [Bug runtime/14107] " mjw at redhat dot com
  2012-05-14 22:17 ` mjw at redhat dot com
@ 2012-05-14 22:23 ` mjw at redhat dot com
  2012-05-15 14:09 ` mjw at redhat dot com
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-14 22:23 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

--- Comment #3 from Mark Wielaard <mjw at redhat dot com> 2012-05-14 22:22:51 UTC ---
And we do actually go through do_page_fault just before this frame:

_stp_get_uregs:194: unwind levels: 17, ret: 0, pc=0xffffffff814f253e
unwind:1452: pc=ffffffff814f253d, ffffffff814f253e
unwind:1492: trying debug_frame
set_no_state_rule:375: reg=10, where=1
_stp_search_unwind_hdr:777: binary search for ffffffff814f253d
_stp_search_unwind_hdr:839: fde off=26520
_stp_search_unwind_hdr:849: returning fde=ffffffffa14be360 startLoc=ffffffff814f2500
unwind_frame:1184: kernel: fde=ffffffffa14be360
unwind_frame:1189: kernel: cie=ffffffffa14bde28
parse_fde_cie:282: map retAddrReg value 16 to reg_info idx 16
unwind_frame:1203: startLoc: ffffffff814f2500, endLoc: ffffffff814f2597
unwind_frame:1251: cie=ffffffffa14bde28 fde=ffffffffa14be360 startLoc=ffffffff814f2500 endLoc=ffffffff814f2597, pc=ffffffff814f253d
unwind_frame:1271: processCFI for CIE
[...]
unwind_frame:1426: returning 0 (ffffffff814ef8f5)
_stp_get_uregs:194: unwind levels: 16, ret: 0, pc=0xffffffff814ef8f5
unwind:1452: pc=ffffffff814ef8f4, ffffffff814ef8f5
unwind:1492: trying debug_frame
set_no_state_rule:375: reg=10, where=1
_stp_search_unwind_hdr:777: binary search for ffffffff814ef8f4
_stp_search_unwind_hdr:839: fde off=113238
_stp_search_unwind_hdr:849: returning fde=ffffffffa15ab078 startLoc=ffffffff814ef680
unwind_frame:1184: kernel: fde=ffffffffa15ab078
unwind_frame:1189: kernel: cie=ffffffffa15aafb0
parse_fde_cie:282: map retAddrReg value 16 to reg_info idx 16
unwind_frame:1203: startLoc: ffffffff814ef680, endLoc: ffffffff814ef707
unwind_frame:1205: pc (ffffffff814ef8f4) > endLoc(ffffffff814ef707)
unwind:1496: debug_frame failed: 1, trying eh_frame
unwind_frame:1168: Module kernel: no unwind frame data
_stp_get_uregs:194: unwind levels: 15, ret: -5, pc=0xffffffff814ef8f5
_stp_get_uregs:209: failed to recover user reg state

Since do_page_fault is the actual errorentry for page_fault it looks like the
CFI for do_page_fault is wrong, or we don't process it correctly.
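
The step that fails in the log above is the FDE lookup for the second pc: the binary search returns the nearest preceding FDE, and the range check then rejects it. A minimal Python sketch of that behaviour, using only the two (startLoc, endLoc) pairs visible in the log; `find_fde` is a hypothetical stand-in and not the actual `_stp_search_unwind_hdr` code:

```python
import bisect

# (startLoc, endLoc) pairs taken from the debug log above. A real table
# would hold one entry per FDE in .debug_frame.
FDES = [
    (0xffffffff814ef680, 0xffffffff814ef707),
    (0xffffffff814f2500, 0xffffffff814f2597),
]


def find_fde(pc, fdes=FDES):
    """Return the (start, end) range covering pc, or None if no FDE covers it."""
    starts = [start for start, _ in fdes]
    # Binary search for the last FDE whose start is <= pc.
    i = bisect.bisect_right(starts, pc) - 1
    if i < 0:
        return None
    start, end = fdes[i]
    # Mirrors the check reported in the log: "pc (...) > endLoc(...)".
    if pc > end:
        return None
    return (start, end)


# First lookup in the log: pc inside do_page_fault, an FDE is found.
print(find_fde(0xffffffff814f253d))
# Second lookup: the nearest preceding FDE ends before the pc, so nothing matches.
print(find_fde(0xffffffff814ef8f4))
```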

The CFI for do_page_fault looks as follows for 2.6.32-220.7.1.el6.x86_64:

 [ 25fe8] CIE length=20
   CIE_id:                   18446744073709551615
   version:                  3
   augmentation:             ""
   code_alignment_factor:    1
   data_alignment_factor:    -8
   return_address_register:  16

   Program:
     def_cfa r7 (rsp) at offset 8
     offset_extended_sf r16 (rip) at cfa-8
     nop
     nop
     nop
     nop
     nop

 [ 26520] FDE length=76 cie=[ 25fe8]
   CIE_pointer:              155624
   initial_location:         0xffffffff814f2500 <do_page_fault>
   address_range:            0x97

   Program:
     advance_loc4 1 to 0x1
     def_cfa_offset 16
     offset_extended_sf r6 (rbp) at cfa-16
     advance_loc4 3 to 0x4
     def_cfa_register r6 (rbp)
     advance_loc4 23 to 0x1b
     offset_extended_sf r14 (r14) at cfa-24
     offset_extended_sf r13 (r13) at cfa-32
     offset_extended_sf r12 (r12) at cfa-40
     offset_extended_sf r3 (rbx) at cfa-48
     advance_loc4 83 to 0x6e
     remember_state
     restore r6 (rbp)
     def_cfa r7 (rsp) at offset 8
     restore r14 (r14)
     restore r13 (r13)
     restore r12 (r12)
     restore r3 (rbx)
     advance_loc4 1 to 0x6f
     restore_state
     nop
     nop
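
To read the dump above, it helps to evaluate the CFI program up to the pc of interest. The pc in the log, ffffffff814f253d, is offset 0x3d into do_page_fault. Below is a minimal Python sketch of a CFI interpreter that handles only the operations appearing in this dump. The programs are transcribed by hand, register names replace the DWARF numbers, and `advance_loc` carries the absolute "to" offset; all function names are hypothetical.

```python
import copy

CIE_PROGRAM = [
    ("def_cfa", "rsp", 8),
    ("offset", "rip", -8),
]
FDE_PROGRAM = [
    ("advance_loc", 0x1),
    ("def_cfa_offset", 16),
    ("offset", "rbp", -16),
    ("advance_loc", 0x4),
    ("def_cfa_register", "rbp"),
    ("advance_loc", 0x1b),
    ("offset", "r14", -24),
    ("offset", "r13", -32),
    ("offset", "r12", -40),
    ("offset", "rbx", -48),
    ("advance_loc", 0x6e),
    ("remember_state",),
    ("restore", "rbp"),
    ("def_cfa", "rsp", 8),
    ("restore", "r14"),
    ("restore", "r13"),
    ("restore", "r12"),
    ("restore", "rbx"),
    ("advance_loc", 0x6f),
    ("restore_state",),
]


def run(program, state, initial, target):
    """Apply CFI operations to state, stopping before rows that start after target."""
    stack = []
    for op, *args in program:
        if op == "advance_loc":
            if args[0] > target:
                break  # the next row begins after the pc of interest
        elif op == "def_cfa":
            state["cfa"] = (args[0], args[1])
        elif op == "def_cfa_offset":
            state["cfa"] = (state["cfa"][0], args[0])
        elif op == "def_cfa_register":
            state["cfa"] = (args[0], state["cfa"][1])
        elif op == "offset":
            state["regs"][args[0]] = args[1]  # register saved at CFA + offset
        elif op == "restore":
            # Go back to the rule the CIE established (or to no rule at all).
            if args[0] in initial["regs"]:
                state["regs"][args[0]] = initial["regs"][args[0]]
            else:
                state["regs"].pop(args[0], None)
        elif op == "remember_state":
            stack.append(copy.deepcopy(state))
        elif op == "restore_state":
            saved = stack.pop()
            state["cfa"], state["regs"] = saved["cfa"], saved["regs"]
    return state


def rules_at(offset):
    """Unwind rules in effect at the given offset into do_page_fault."""
    initial = run(CIE_PROGRAM, {"cfa": None, "regs": {}}, {"regs": {}}, offset)
    return run(FDE_PROGRAM, copy.deepcopy(initial), initial, offset)


print(rules_at(0x3d))
```

Under these assumptions the rules at offset 0x3d are CFA = rbp + 16 with the return address stored at CFA - 8, which is the ordinary frame-pointer layout.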


* [Bug runtime/14107] Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
  2012-05-14 15:41 [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels mjw at redhat dot com
                   ` (2 preceding siblings ...)
  2012-05-14 22:23 ` mjw at redhat dot com
@ 2012-05-15 14:09 ` mjw at redhat dot com
  2012-05-15 14:16 ` mjw at redhat dot com
  2012-05-21 11:02 ` mjw at redhat dot com
  5 siblings, 0 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-15 14:09 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

--- Comment #4 from Mark Wielaard <mjw at redhat dot com> 2012-05-15 14:07:27 UTC ---
The problem isn't the CFI for do_page_fault, but that there is no CFI for
page_fault. Nor does there seem to be any CFI for any assembly symbol defined
in entry_64.S. Which explains why unwinding to the kernel/user space barrier
just fails.

No idea yet why the CFI isn't included in /usr/lib/debug/lib/modules/*/vmlinux
for the RHEL6 kernel; it certainly is there in the entry_64.S source code. And
it is also in the Fedora version:
$ eu-readelf --debug-dump=frames \
    /usr/lib/debug/lib/modules/3.3.5-2.fc16.x86_64/vmlinux | grep -B2 -A1 page_fault
 [  7ae0] FDE length=68 cie=[  6da8]
   CIE_pointer:              28072
   initial_location:         0xffffffff815f4850 <page_fault>
   address_range:            0x2a
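
The same check can be scripted for several symbols at once. The following Python sketch scans `eu-readelf --debug-dump=frames` text for `initial_location` lines and reports which wanted symbols have no FDE. It assumes the output format matches the snippet quoted above; the function names are hypothetical, and the sample text is just the four lines from this comment.

```python
import re

# The FDE quoted above, used as sample input.
SAMPLE = """\
 [  7ae0] FDE length=68 cie=[  6da8]
   CIE_pointer:              28072
   initial_location:         0xffffffff815f4850 <page_fault>
   address_range:            0x2a
"""

# Assumed line shape: "initial_location: 0x<hex> <symbol>" (symbol optional).
INITIAL_LOCATION = re.compile(
    r"initial_location:\s+(0x[0-9a-f]+)(?:\s+<([^>]+)>)?"
)


def fde_symbols(dump_text):
    """Map symbol name to start address for every FDE that names a symbol."""
    found = {}
    for match in INITIAL_LOCATION.finditer(dump_text):
        addr, name = match.groups()
        if name:
            found[name] = int(addr, 16)
    return found


def missing_cfi(dump_text, wanted):
    """Return the wanted symbols for which the dump contains no FDE."""
    have = fde_symbols(dump_text)
    return [name for name in wanted if name not in have]


print(missing_cfi(SAMPLE, ["page_fault", "machine_check"]))
```

On a real system the text would come from running eu-readelf on the vmlinux debuginfo file rather than from an embedded sample.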


* [Bug runtime/14107] Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
  2012-05-14 15:41 [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels mjw at redhat dot com
                   ` (3 preceding siblings ...)
  2012-05-15 14:09 ` mjw at redhat dot com
@ 2012-05-15 14:16 ` mjw at redhat dot com
  2012-05-21 11:02 ` mjw at redhat dot com
  5 siblings, 0 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-15 14:16 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

--- Comment #5 from Mark Wielaard <mjw at redhat dot com> 2012-05-15 14:14:41 UTC ---
Looks like the RHEL6 kernel is missing this:

commit 9e565292270a2d55524be38835104c564ac8f795
Author: Roland McGrath <roland@redhat.com>
Date:   Thu May 13 21:43:03 2010 -0700

    x86: Use .cfi_sections for assembly code

    The newer assemblers support the .cfi_sections directive so we can put
    the CFI from .S files into the .debug_frame section that is preserved
    in unstripped vmlinux and in separate debuginfo, rather than the
    .eh_frame section that is now discarded by vmlinux.lds.S.

    Signed-off-by: Roland McGrath <roland@redhat.com>
    LKML-Reference: <20100514044303.A6FE7400BE@magilla.sf.frob.com>
    Signed-off-by: H. Peter Anvin <hpa@zytor.com>


* [Bug runtime/14107] Bad user unwinding from kernel fatal signal handler for some x86_64 kernels
  2012-05-14 15:41 [Bug runtime/14107] New: Bad user unwinding from kernel fatal signal handler for some x86_64 kernels mjw at redhat dot com
                   ` (4 preceding siblings ...)
  2012-05-15 14:16 ` mjw at redhat dot com
@ 2012-05-21 11:02 ` mjw at redhat dot com
  5 siblings, 0 replies; 7+ messages in thread
From: mjw at redhat dot com @ 2012-05-21 11:02 UTC (permalink / raw)
  To: systemtap

http://sourceware.org/bugzilla/show_bug.cgi?id=14107

Mark Wielaard <mjw at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #6 from Mark Wielaard <mjw at redhat dot com> 2012-05-21 11:01:19 UTC ---
Not really "fixed", but with the proper kernel patch, see comment #5, this
should just work. Added a testcase to check the system is behaving properly.

commit 07c9d78ebb28b888f01aed9c206e724f0e72db25
Author: Mark Wielaard <mjw@redhat.com>
Date:   Mon May 21 12:57:41 2012 +0200

    Add testcase for PR14107 Bad user unwinding from kernel fatal signal handler

    This is really a kernel bug, see bug report, when the CFI for the assembly
    code is missing we cannot properly recover the register state for the user
    process and might give a bad/missing user backtrace.

