public inbox for systemtap@sourceware.org
* RE: double fault
@ 2005-11-24  2:37 Stone, Joshua I
  0 siblings, 0 replies; 10+ messages in thread
From: Stone, Joshua I @ 2005-11-24  2:37 UTC
  To: Roland McGrath, Martin Hunt; +Cc: systemtap

Roland McGrath wrote:
> The second crash had an esp of 0xf5bd4f98.  If that's a proper stack
> pointer, it's only 104 bytes from the beginning of the stack. 
> Considering that the trap frame itself is 60 bytes, that's fairly
> small for a realistic stack.  It might well be that in fact it's an
> overflowed stack that grew down from below 0xf5bd6000 and overflowed
> by getting below 0xf5bd5034 (which is the end of the struct
> thread_info at the base of the stack). 

I added a check to monitor the stack on the probe entrance, like this:

	unsigned left = (unsigned)CONTEXT->regs & 0xfff;
	printk("stap_debug: %u bytes on the stack\n", left);

Once I added that, I started getting only a single output and then a
crash every time.  The value reported is consistently 3976 bytes - only
120 bytes from the top.  And the eip is now consistently at that stack
read within do_page_fault as well.
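
As a rough sketch of the arithmetic behind that check (not the probe code
itself), the C helpers below assume 4 KiB kernel stacks, which is what the
0xfff mask implies.  The names STACK_SIZE, stack_offset and bytes_from_top
are made up for illustration, and the address 0xf5bd4f88 mentioned afterwards
is hypothetical, chosen only so that its offset equals the reported 3976.

```c
#include <stdint.h>

/* Assumption: 4 KiB kernel stacks, as implied by the 0xfff mask. */
#define STACK_SIZE 4096u

/* Offset of an on-stack address within its 4 KiB stack area.
 * This is the value the debug printk reports. */
static unsigned stack_offset(uint32_t addr)
{
    return addr & (STACK_SIZE - 1);
}

/* Distance from that address up to the top of the stack area. */
static unsigned bytes_from_top(uint32_t addr)
{
    return STACK_SIZE - stack_offset(addr);
}
```

With these, stack_offset(0xf5bd4f98) is 3992 and bytes_from_top(0xf5bd4f98)
is 104, matching the figure Roland quoted.  An address whose offset is 3976,
such as the hypothetical 0xf5bd4f88, comes out 120 bytes from the top.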

>> Is there a way I can get the double-fault to print a full oops, with
>> a stack trace?
> 
> No, it's a special trap handler that uses its own stack and just has
> the simple printks you've seen.  You'd have to do something like put
> a probe on the line in doublefault_fn where it printk's the esp et
> al, and have that call show_trace on t->esp or something.

A probe here doesn't work.  I tried it, and the system hung up
completely (a triple-fault?).  I think things must be hosed up pretty
bad by the time it gets to doublefault_fn.

And thanks to the infinite wisdom of Linus, it's a pain to get a
debugger in there.  I tried kdb first, but kdb doesn't automatically
catch double-faults.  I put a breakpoint on doublefault_fn, and it
triggered, but kdb just panicked about invalid memory references as it
was trying to take over.  Again, to me this seems to indicate trouble
with the stack.  I couldn't get kgdb to work at all on the RHEL4 kernel
- likely patching issues.


Martin Hunt wrote:
> But I'm not sure it's worth pursuing further because it appears to not
> happen in the newer version of kprobes.

Perhaps, or perhaps there's still a landmine in there that is just
better obscured in the newer kprobes.  I would feel much better if there
was a known fix that occurred, instead of the problem magically
disappearing.  I don't think I will spend much more time on this though,
at least until someone runs into the same issue on the new kprobes.

At the very least, judging by the side conversations, we now appear to
have quite a few people looking closely at the fault handling code...

Thanks,

Josh

* RE: double fault
@ 2005-11-22  3:46 Stone, Joshua I
  2005-11-22 11:00 ` Roland McGrath
  0 siblings, 1 reply; 10+ messages in thread
From: Stone, Joshua I @ 2005-11-22  3:46 UTC
  To: Roland McGrath; +Cc: systemtap

>From: Roland McGrath [mailto:roland@redhat.com] 
>
>The stack overflow notion sounds plausible.  To investigate that angle, one
>thing to try comes to mind off hand.  In each probe that might be hitting,
>stick some %{ ... %} code to do a "stack getting small" check.  It can do
>something like:
>
>	unsigned left = (unsigned)regs & 0xfff;
>	if (left < 256) panic("stack getting close");
>
>That might manage to print out a full oops with backtrace details that show
>the cascade of page fault frames or whatever the situation actually is.
>
>
>Thanks,
>Roland
>

I tried the code you gave (using CONTEXT->regs), but I don't understand
how that computes how much stack space is left.  Shouldn't it be
CONTEXT->regs->esp?  And even then, you can see the two esp's from the
register dumps I gave - the first would have triggered your panic, and
the second wouldn't.  Am I missing something?
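
To make that comparison concrete, here is a minimal userspace sketch of the
quoted check applied to the two esp values from the register dumps.  It
assumes 4 KiB stacks (implied by the 0xfff mask); stack_getting_close,
STACK_MASK and LOW_WATER are illustrative names, not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define STACK_MASK 0xfffu  /* assumption: 4 KiB kernel stacks */
#define LOW_WATER  256u    /* threshold used in the quoted check */

/* Returns true when the address lies within LOW_WATER bytes of the
 * bottom of its 4 KiB stack area, i.e. when the quoted check would panic. */
static bool stack_getting_close(uint32_t addr)
{
    return (addr & STACK_MASK) < LOW_WATER;
}
```

stack_getting_close(0xf4b6500c) is true (offset 12), while
stack_getting_close(0xf5bd4f98) is false (offset 3992), which is the
"first would trigger, second would not" behaviour described above.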

Anyway, I tried it both ways.  It immediately panics, but there's no
oops info.  It just says "Kernel panic - not syncing".  I added a
dump_stack call, but that all looks innocent.

Is there a way I can get the double-fault to print a full oops, with a
stack trace?

I'm pretty new to kernel-debugging, so sorry if I'm asking simple
questions...

Thanks,

Josh

* double fault
@ 2005-11-22  1:12 Stone, Joshua I
  2005-11-22  1:25 ` Roland McGrath
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Stone, Joshua I @ 2005-11-22  1:12 UTC
  To: systemtap

I am seeing sporadic double-faults when running tests on systemtap.  I
am trying to run systemtap.base/lt.exp, though others fail as well.  It
doesn't always fail, but if I run it four or five times in succession
that's usually enough to trigger the fault.  Below are manual copies of
a couple of the faults dumped to the console:

double fault, gdt at c0358000 [255 bytes]
double fault, tss at c03dc000
eip = ffffffff, esp = f4b6500c
eax = ffffffff, ebx = ffffffff, ecx = 0000007b, edx = f4b65018
esi = ffffffff, edi = ffffffff, ebp = 00000000
 
double fault, gdt at c0358000 [255 bytes]
double fault, tss at c03dc000
eip = c011a799, esp = f5bd4f98
eax = f959a380, ebx = f5bd5170, ecx = 0000007b, edx = f4bd505c
esi = 00000000, edi = c011a785, ebp = 00000000

The first dump doesn't tell much, but the edi and eip values in the
second dump are interesting.  'c011a785' is the beginning of
do_page_fault, and the instruction at 'c011a799' is a read from the
stack.  Methinks the stack runneth over?
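
The position of the faulting instruction inside do_page_fault follows
directly from those two register values.  A tiny C sketch of the
subtraction, with offset_in_function as a made-up helper name:

```c
#include <stdint.h>

/* Byte offset of an instruction address from the start of its function. */
static uint32_t offset_in_function(uint32_t eip, uint32_t func_start)
{
    return eip - func_start;
}
```

offset_in_function(0xc011a799, 0xc011a785) gives 20 (0x14), so the stack
read sits 20 bytes into do_page_fault.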

This is on RHEL4 U2, i686, kernel 2.6.9-22.EL.  I verified this crash on
two different machines with this kernel: an IBM T42 laptop (1.7GHz
Pentium M, 1GB RAM), and a desktop (3.6GHz Pentium 4 HT/EM64T, 2GB RAM).
I couldn't reproduce the problem with the 2.6.9-22.ELsmp kernel.  I also
tried the desktop in x86_64 mode, and could not reproduce the problem
with the UP kernel nor the SMP kernel.

Please let me know if there's any other information I can provide to
help track this down...

Thanks,

Josh Stone


Thread overview: 10+ messages
2005-11-24  2:37 double fault Stone, Joshua I
  -- strict thread matches above, loose matches on Subject: below --
2005-11-22  3:46 Stone, Joshua I
2005-11-22 11:00 ` Roland McGrath
2005-11-22  1:12 Stone, Joshua I
2005-11-22  1:25 ` Roland McGrath
2005-11-22  9:29 ` Richard J Moore
2005-11-23  8:34 ` Martin Hunt
2005-11-23 17:21   ` Mathieu Desnoyers
2005-11-23 17:54     ` Martin Hunt
2005-11-23 18:09       ` Mathieu Desnoyers
