public inbox for systemtap@sourceware.org
* double fault
@ 2005-11-22  1:12 Stone, Joshua I
  2005-11-22  1:25 ` Roland McGrath
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Stone, Joshua I @ 2005-11-22  1:12 UTC (permalink / raw)
  To: systemtap

I am seeing sporadic double-faults when running tests on systemtap.  I
am trying to run systemtap.base/lt.exp, though others fail as well.  It
doesn't always fail, but if I run it four or five times in succession
that's usually enough to trigger the fault.  Below are manual copies of
a couple of the faults dumped to the console:

double fault, gdt at c0358000 [255 bytes]
double fault, tss at c03dc000
eip = ffffffff, esp = f4b6500c
eax = ffffffff, ebx = ffffffff, ecx = 0000007b, edx = f4b65018
esi = ffffffff, edi = ffffffff, ebp = 00000000
 
double fault, gdt at c0358000 [255 bytes]
double fault, tss at c03dc000
eip = c011a799, esp = f5bd4f98
eax = f959a380, ebx = f5bd5170, ecx = 0000007b, edx = f4bd505c
esi = 00000000, edi = c011a785, ebp = 00000000

The first dump doesn't tell much, but the edi and eip values in the
second dump are interesting.  'c011a785' is the beginning of
do_page_fault, and the instruction at 'c011a799' is a read from the
stack.  Methinks the stack runneth over?

This is on RHEL4 U2, i686, kernel 2.6.9-22.EL.  I verified this crash on
two different machines with this kernel: an IBM T42 laptop (1.7GHz
Pentium M, 1GB RAM), and a desktop (3.6GHz Pentium 4 HT/EM64T, 2GB RAM).
I couldn't reproduce the problem with the 2.6.9-22.ELsmp kernel.  I also
tried the desktop in x86_64 mode, and could not reproduce the problem
with the UP kernel nor the SMP kernel.

Please let me know if there's any other information I can provide to
help track this down...

Thanks,

Josh Stone


* Re: double fault
  2005-11-22  1:12 double fault Stone, Joshua I
@ 2005-11-22  1:25 ` Roland McGrath
  2005-11-22  9:29 ` Richard J Moore
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Roland McGrath @ 2005-11-22  1:25 UTC (permalink / raw)
  To: Stone, Joshua I; +Cc: systemtap

The stack overflow notion sounds plausible.  To investigate that angle, one
thing to try comes to mind off hand.  In each probe that might be hitting,
stick some %{ ... %} code to do a "stack getting small" check.  It can do
something like:

	/* Bytes left below regs, assuming a 4K stack that grows down. */
	unsigned left = (unsigned)regs & 0xfff;
	if (left < 256) panic("stack getting close");

That might manage to print out a full oops with backtrace details that show
the cascade of page fault frames or whatever the situation actually is.
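
The check above can be tried outside the kernel as plain arithmetic. The
sketch below is a user-space model, not probe code: stack_bytes_left and
stack_getting_small are hypothetical helper names, and the stack size is a
parameter because the thread does not say whether this kernel uses 4K or 8K
stacks. It also ignores the thread_info structure at the bottom of the
stack, so the real headroom is smaller than the value returned.

```c
#include <stdint.h>

/*
 * User-space model of the "stack getting small" test.
 * Assumption: the kernel stack is thread_size bytes, aligned to
 * thread_size, and grows downward, so the offset of an address
 * within the stack is the number of bytes left beneath it.
 */
static uint32_t stack_bytes_left(uint32_t addr, uint32_t thread_size)
{
    return addr & (thread_size - 1);
}

/* Returns 1 when fewer than `threshold` bytes remain below addr. */
static int stack_getting_small(uint32_t addr, uint32_t thread_size,
                               uint32_t threshold)
{
    return stack_bytes_left(addr, thread_size) < threshold;
}
```

With the esp values from the two dumps and an assumed 4096-byte stack,
stack_bytes_left(0xf4b6500c, 4096) is 12 and
stack_bytes_left(0xf5bd4f98, 4096) is 3992, so under that assumption only
the first esp sits right at the bottom of its page.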


Thanks,
Roland


* Re: double fault
  2005-11-22  1:12 double fault Stone, Joshua I
  2005-11-22  1:25 ` Roland McGrath
@ 2005-11-22  9:29 ` Richard J Moore
  2005-11-22 14:00 ` double fault -> PAGE_KERNEL flagged memory Mathieu Desnoyers
  2005-11-23  8:34 ` double fault Martin Hunt
  3 siblings, 0 replies; 32+ messages in thread
From: Richard J Moore @ 2005-11-22  9:29 UTC (permalink / raw)
  To: Stone, Joshua I; +Cc: systemtap





We need to distinguish between recursive behaviour that causes stack
depletion and insufficient stack space. If you browse the stack, do you see:

1) a great chunk of unused space, or
2) a regular pattern of return addresses

If you follow the stack frames, are there any huge jumps, indicating
excessive amounts of local data allocation?
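
One way to make these two questions mechanical is sketched below. It is a
user-space illustration only: it assumes the saved frame-pointer values and
return addresses have already been copied out of a stack dump into arrays,
and the names largest_frame and max_repeats are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Largest gap between consecutive saved frame pointers, innermost
 * frame first. On a downward-growing stack each caller's frame
 * pointer is higher than its callee's, so a large gap suggests a
 * function with a lot of local data.
 */
static uint32_t largest_frame(const uint32_t *fp, size_t n)
{
    uint32_t max = 0;
    for (size_t i = 1; i < n; i++) {
        uint32_t gap = fp[i] - fp[i - 1];
        if (gap > max)
            max = gap;
    }
    return max;
}

/*
 * Highest number of times any single return address occurs.
 * A count far above 1 hints at recursion (or re-entered handlers).
 * O(n^2) on purpose: n is the depth of one stack, which is small.
 */
static size_t max_repeats(const uint32_t *ret, size_t n)
{
    size_t best = 0;
    for (size_t i = 0; i < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j < n; j++)
            if (ret[j] == ret[i])
                count++;
        if (count > best)
            best = count;
    }
    return best;
}
```

A large largest_frame result points towards case of excessive local data,
while a large max_repeats result points towards recursive behaviour.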



--
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22  1:12 double fault Stone, Joshua I
  2005-11-22  1:25 ` Roland McGrath
  2005-11-22  9:29 ` Richard J Moore
@ 2005-11-22 14:00 ` Mathieu Desnoyers
  2005-11-22 15:12   ` Tom Zanussi
                     ` (2 more replies)
  2005-11-23  8:34 ` double fault Martin Hunt
  3 siblings, 3 replies; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-22 14:00 UTC (permalink / raw)
  To: Stone, Joshua I, Tom Zanussi, michel.dagenais; +Cc: systemtap

I suspect that your double fault may come from the SystemTAP logging code. Do
you have an instrumentation point in any fault handler ?

For Tom : can you flag the RelayFS buffer memory PAGE_KERNEL instead of
GFP_KERNEL ? Otherwise, it leads to page faults when those pages are accessed
for the first time (seen with LTTng).

For instance, if you log an event for the page fault handler, and this logging
code itself generates a page fault, then you get a double fault.

The same could apply to unaligned memory access.

Make sure that the SystemTAP code is _always_ in contiguous memory that cannot
be swapped to disk :

The Linux kernel module loader makes sure that all module code is memory
locked (see module.c) by first loading the whole module in a vmap area (which
is swappable) and then copying the code in a region of memory flagged
PAGE_KERNEL_EXEC (see vmalloc.c:vmalloc_exec()).

Furthermore, make sure that all data memory regions are also non-swappable.
That means the RelayFS buffer too.

So :

- memory in which the SystemTAP code is loaded should be allocated with
  vmalloc_exec() (or with the GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC
  flags).
- SystemTAP global data structures should be in memory protected from swap out,
  with a flag like PAGE_KERNEL.
- RelayFS buffers should be PAGE_KERNEL too (not GFP_KERNEL).


Mathieu


* Stone, Joshua I (joshua.i.stone@intel.com) wrote:
> I am seeing sporadic double-faults when running tests on systemtap.  I
> am trying to run systemtap.base/lt.exp, though others fail as well.  It
> doesn't always fail, but if I run it four or five times in succession
> that's usually enough to trigger the fault.  Below are manual copies of
> a couple of the faults dumped to the console:
> 
> double fault, gdt at c0358000 [255 bytes]
> double fault, tss at c03dc000
> eip = ffffffff, esp = f4b6500c
> eax = ffffffff, ebx = ffffffff, ecx = 0000007b, edx = f4b65018
> esi = ffffffff, edi = ffffffff, ebp = 00000000
>  
> double fault, gdt at c0358000 [255 bytes]
> double fault, tss at c03dc000
> eip = c011a799, esp = f5bd4f98
> eax = f959a380, ebx = f5bd5170, ecx = 0000007b, edx = f4bd505c
> esi = 00000000, edi = c011a785, ebp = 00000000
> 
> The first dump doesn't tell much, but the edi and eip values in the
> second dump are interesting.  'c011a785' is the beginning of
> do_page_fault, and the instruction at 'c011a799' is a read from the
> stack.  Methinks the stack runneth over?
> 
> This is on RHEL4 U2, i686, kernel 2.6.9-22.EL.  I verified this crash on
> two different machines with this kernel: an IBM T42 laptop (1.7GHz
> Pentium M, 1GB RAM), and a desktop (3.6GHz Pentium 4 HT/EM64T, 2GB RAM).
> I couldn't reproduce the problem with the 2.6.9-22.ELsmp kernel.  I also
> tried the desktop in x86_64 mode, and could not reproduce the problem
> with the UP kernel nor the SMP kernel.
> 
> Please let me know if there's any other information I can provide to
> help track this down...
> 
> Thanks,
> 
> Josh Stone
> 
OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 14:00 ` double fault -> PAGE_KERNEL flagged memory Mathieu Desnoyers
@ 2005-11-22 15:12   ` Tom Zanussi
  2005-11-22 15:24     ` Mathieu Desnoyers
  2005-11-22 15:17   ` Mathieu Desnoyers
  2005-11-22 15:42   ` Frank Ch. Eigler
  2 siblings, 1 reply; 32+ messages in thread
From: Tom Zanussi @ 2005-11-22 15:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Stone, Joshua I, Tom Zanussi, michel.dagenais, systemtap

Mathieu Desnoyers writes:
 > I suspect that your double fault may come from the systemTAP logging code. Do
 > you have an instrumentation point in any fault handler ?
 > 
 > For Tom : can you flag the RelayFS buffer memory PAGE_KERNEL instead of
 > GFP_KERNEL ? Otherwise, it leads to page faults when accessing those pages when
 > accessed for the first time (seen with LTTng).

It already is PAGE_KERNEL.  The page faults you see with relayfs are
vmalloc page faults i.e. minor faults that just update the kernel part
of the current process's page table with the buffer pages.

Anyway, systemtap doesn't use relayfs unless -b (bulk) is specified on
the command line, so unless that's the case, it can't be relayfs
causing the problem.

tom



* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 14:00 ` double fault -> PAGE_KERNEL flagged memory Mathieu Desnoyers
  2005-11-22 15:12   ` Tom Zanussi
@ 2005-11-22 15:17   ` Mathieu Desnoyers
  2005-11-22 15:42   ` Frank Ch. Eigler
  2 siblings, 0 replies; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-22 15:17 UTC (permalink / raw)
  To: Stone, Joshua I, Tom Zanussi, michel.dagenais; +Cc: systemtap

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:

(talking about causes of a double fault in a Linux Kernel)

[...]
> 
> The same could apply to unaligned memory access.
> 
[...]

Sorry to reply to myself, but this isn't true in the Linux Kernel. The CPU
EFLAGS Alignment Check flag is never activated in the Linux Kernel code.
Furthermore, alignment check exceptions never generate a double fault, as they
are "benign exceptions".

It still applies to page faults :

(from IA32 Intel Architecture Software Developer's Manual p. 6-14 vol.1)

#DF Double Fault

Source :

Any instruction that can generate an exception, an NMI or an INTR.

Well, not "any". For details, see table at vol. 3, p 5-38, same reference. In
fact, NMI and INTR seem to never generate a double fault.

Causes of a double fault :

- Divide error, invalid TSS, segment not present, stack fault or general
  protection, nested on a page fault.
- Page fault, nested on another page fault.
- Divide error, invalid TSS, segment not present, stack fault or general
  protection, nested on divide error, invalid TSS, segment not present,
  stack fault or general protection.

In the other cases, the exceptions are handled serially.
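
The rules listed above can be written as a small decision function. This is
a sketch of the table as summarised in this message, not kernel code: the
enum and function names are hypothetical, and only the vectors named above
are classified, with everything else treated as benign for simplicity. As
the manual describes it, the table covers a second exception raised while
the processor is still trying to call the handler for the first one.

```c
/* Exception classes used by the double-fault table. */
enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT };

/* Classify an IA-32 exception vector number. */
static enum exc_class classify(int vector)
{
    switch (vector) {
    case 0:   /* #DE divide error */
    case 10:  /* #TS invalid TSS */
    case 11:  /* #NP segment not present */
    case 12:  /* #SS stack fault */
    case 13:  /* #GP general protection */
        return EXC_CONTRIBUTORY;
    case 14:  /* #PF page fault */
        return EXC_PAGE_FAULT;
    default:  /* simplification: all other vectors count as benign */
        return EXC_BENIGN;
    }
}

/*
 * Returns 1 if `second`, raised while `first` is being delivered,
 * produces a double fault; 0 if the two are handled serially.
 */
static int is_double_fault(int first, int second)
{
    enum exc_class a = classify(first);
    enum exc_class b = classify(second);

    if (a == EXC_CONTRIBUTORY)
        return b == EXC_CONTRIBUTORY;
    if (a == EXC_PAGE_FAULT)
        return b != EXC_BENIGN;
    return 0;
}
```

For example, is_double_fault(14, 14) is 1 (page fault on page fault), while
is_double_fault(13, 14) is 0 (a page fault following a general protection
fault is handled serially).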


Mathieu


OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 15:12   ` Tom Zanussi
@ 2005-11-22 15:24     ` Mathieu Desnoyers
  2005-11-22 16:52       ` Tom Zanussi
  0 siblings, 1 reply; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-22 15:24 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: Stone, Joshua I, michel.dagenais, systemtap

* Tom Zanussi (zanussi@us.ibm.com) wrote:
> It already is PAGE_KERNEL. 

You are right.

> The page faults you see with relayfs are
> vmalloc page faults i.e. minor faults that just update the kernel part
> of the current process's page table with the buffer pages.
> 

Generally speaking, what would happen if :

- A page fault handler has a trace point that logs an event in the RelayFS
  buffer.
- This exact write causes a (minor) page fault.

It may not happen often, but I think it would lead to a double fault. Is this
case possible ?


Mathieu


OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 14:00 ` double fault -> PAGE_KERNEL flagged memory Mathieu Desnoyers
  2005-11-22 15:12   ` Tom Zanussi
  2005-11-22 15:17   ` Mathieu Desnoyers
@ 2005-11-22 15:42   ` Frank Ch. Eigler
  2005-11-22 16:01     ` Mathieu Desnoyers
  2 siblings, 1 reply; 32+ messages in thread
From: Frank Ch. Eigler @ 2005-11-22 15:42 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: michel.dagenais, systemtap


Mathieu Desnoyers wrote:

> [...]  For Tom : can you flag the RelayFS buffer memory PAGE_KERNEL
> instead of GFP_KERNEL ? Otherwise, it leads to page faults when
> accessing those pages when accessed for the first time (seen with
> LTTng).

FWIW, relayfs is not used by default.

> [...]  Make sure that the SystemTAP code is _always_ in contiguous
> memory non swappable to disk :

Is my impression correct that no ordinary kernel-allocated memory such
as .ko .text/.data is swappable?  Remember, this is what systemtap
creates, plus a few batches of dynamically allocated (k/vmalloc)
memory during module initialization.

- FChE


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 15:42   ` Frank Ch. Eigler
@ 2005-11-22 16:01     ` Mathieu Desnoyers
  0 siblings, 0 replies; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-22 16:01 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: michel.dagenais, systemtap

* Frank Ch. Eigler (fche@redhat.com) wrote:
> 
> Mathieu Desnoyers wrote:
> > [...]  Make sure that the SystemTAP code is _always_ in contiguous
> > memory non swappable to disk :
> 
> Is my impression correct that no ordinary kernel-allocated memory such
> as .ko .text/.data is swappable?

Yes, this is right (module.c:load_module puts all memory allocatable sections
into PAGE_KERNEL_EXEC memory, which is non swappable).


> Remember, this is what systemtap
> creates, plus a few batches of dynamically allocated (k/vmalloc)
> memory during module initialization.
> 

The vmalloc'd areas will cause a minor page fault when they are not present in
the page table of the current process. If you do not instrument the page fault
handler, it shouldn't be a problem, but if you do, then you can get a double
fault.

kmalloc'd memory does not have this problem, but does not have the advantages of
virtual memory (especially interesting when allocating big chunks of contiguous
memory in a fragmented pool).
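
The minor fault described above can be pictured with a toy model. The sketch
below is user-space C, not kernel code: it assumes each process has its own
copy of the top-level kernel mappings and that the fault handler repairs a
missing entry by copying it from a reference table. All names
(reference_pgd, struct process, map_vmalloc_area, touch) are hypothetical.

```c
#include <stddef.h>

#define PGD_ENTRIES 8   /* toy size, not the real i386 value */

/* Reference table: updated when a vmalloc area is mapped. */
static unsigned long reference_pgd[PGD_ENTRIES];

struct process {
    unsigned long pgd[PGD_ENTRIES];  /* per-process copy */
    unsigned minor_faults;           /* how often we had to sync */
};

/* Model of vmalloc(): only the reference table learns the mapping. */
static void map_vmalloc_area(size_t slot, unsigned long entry)
{
    reference_pgd[slot] = entry;
}

/*
 * Model of a kernel access made while `p` is the current process.
 * A missing entry triggers a "minor fault" that copies it from the
 * reference table. Returns 0 on success, -1 for a genuinely bad
 * address (out of range or absent from the reference table).
 */
static int touch(struct process *p, size_t slot)
{
    if (slot >= PGD_ENTRIES)
        return -1;
    if (p->pgd[slot] == 0) {
        if (reference_pgd[slot] == 0)
            return -1;
        p->pgd[slot] = reference_pgd[slot];
        p->minor_faults++;
    }
    return 0;
}
```

In this model the first touch from each process costs one fault and later
ones cost none, which matches the "first access" behaviour reported earlier
in the thread.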

Mathieu


OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 15:24     ` Mathieu Desnoyers
@ 2005-11-22 16:52       ` Tom Zanussi
  2005-11-22 18:59         ` Mathieu Desnoyers
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Zanussi @ 2005-11-22 16:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tom Zanussi, Stone, Joshua I, michel.dagenais, systemtap

Mathieu Desnoyers writes:
 > * Tom Zanussi (zanussi@us.ibm.com) wrote:
 > > It already is PAGE_KERNEL. 
 > 
 > You are right.
 > 
 > > The page faults you see with relayfs are
 > > vmalloc page faults i.e. minor faults that just update the kernel part
 > > of the current process's page table with the buffer pages.
 > > 
 > 
 > Generally speaking, what would happen if :
 > 
 > - A page fault handler has a trace point that logs an event in the RelayFS
 >   buffer.
 > - This exact write causes a (minor) page fault.
 > 
 > It may not happen often, but I think it would lead to a double fault. Is this
 > case possible ?
 > 

Yes, I'm sure it is possible, in fact I'm sure it happens all the time
in LTT and it's never been a problem.  Have you verified that it is a
problem (e.g. tracing only page faults and seeing a double fault).

Tom




* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 16:52       ` Tom Zanussi
@ 2005-11-22 18:59         ` Mathieu Desnoyers
  2005-11-23 15:13           ` Tom Zanussi
  0 siblings, 1 reply; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-22 18:59 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: Stone, Joshua I, michel.dagenais, systemtap, ltt-dev

* Tom Zanussi (zanussi@us.ibm.com) wrote:
> Mathieu Desnoyers writes:
>  > Generally speaking, what would happen if :
>  > 
>  > - A page fault handler has a trace point that logs an event in the RelayFS
>  >   buffer.
>  > - This exact write causes a (minor) page fault.
>  > 
>  > It may not happen often, but I think it would lead to a double fault. Is this
>  > case possible ?
>  > 
> 
> Yes, I'm sure it is possible, in fact I'm sure it happens all the time
> in LTT and it's never been a problem.  Have you verified that it is a
> problem (e.g. tracing only page faults and seeing a double fault).
> 
> 

I have just done this, and nothing bad happened and I discovered why :

The LTT instrumentation of do_page_fault comes right after the
"We fault-in kernel-space virtual memory on-demand." section. So, minor page
faults are not instrumented.

Maybe it isn't so interesting to instrument, but if we would like to, I
suggest adding optional support for kmalloc'd memory areas in RelayFS so
the client can ask for a channel which consists of a small buffer that is
reentrant for the memory subsystem. This should be documented with a big
warning about the small size the client should ask for.


Mathieu

OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault
  2005-11-22  1:12 double fault Stone, Joshua I
                   ` (2 preceding siblings ...)
  2005-11-22 14:00 ` double fault -> PAGE_KERNEL flagged memory Mathieu Desnoyers
@ 2005-11-23  8:34 ` Martin Hunt
  2005-11-23 17:21   ` Mathieu Desnoyers
  3 siblings, 1 reply; 32+ messages in thread
From: Martin Hunt @ 2005-11-23  8:34 UTC (permalink / raw)
  To: Stone, Joshua I; +Cc: systemtap

[-- Attachment #1: Type: text/plain, Size: 1748 bytes --]

On Mon, 2005-11-21 at 17:12 -0800, Stone, Joshua I wrote: 
> I am seeing sporadic double-faults when running tests on systemtap.  I
> am trying to run systemtap.base/lt.exp, though others fail as well.  It
> doesn't always fail, but if I run it four or five times in succession
> that's usually enough to trigger the fault.  Below are manual copies of
> a couple of the faults dumped to the console:

Sorry I didn't respond sooner. I've been a bit slow the last couple days
due to the flu.

This looks like the same double-fault I've been seeing sporadically on
my laptop running RHEL4 (and nowhere else).  I tried a couple of ways to
track it down but it isn't easy.  I never did get my laptop working with
netdump either.

It appeared to me that the faults were originating in kprobes. In fact
the same OS on the same hardware with the scalability patches does not
have this problem.

I stripped down the generated C file to something very small that still
demonstrated the problem. Basically it has the giant context array and
sets a single kprobe on sys_open that simply returns.

Changing the kprobe to other functions does not always trigger the bug.

The problem also has something to do with the size of the context array.
Changing NR_CPUS to 128 (which makes the array really huge) was enough
to cause the double fault to happen on all my RHEL machines (including
x86_64) except for ones running under vmware. I changed the code to use
vmalloc (we really want vmalloc_node() but RHEL4 doesn't have it) and
all the crashes stopped on every machine.

Confused yet? I've attached my simple C file that triggers the bug. But
I'm not sure it's worth pursuing further because it appears to not happen
in the newer version of kprobes.

Martin



[-- Attachment #2: stap_crash.c --]
[-- Type: text/x-csrc, Size: 2859 bytes --]

#define MAXNESTING 30
#define MAXSTRINGLEN 128
#define STP_STRING_SIZE MAXSTRINGLEN
#include "runtime.h"
#include <linux/string.h>
#include <linux/timer.h>
#include "loc2c-runtime.h" 
typedef char string_t[MAXSTRINGLEN];

struct context {
  atomic_t busy;
  const char *probe_point;
  unsigned actioncount;
  unsigned nesting;
  const char *last_error;
  const char *last_stmt;
  struct pt_regs *regs;
  union {
    struct probe_0_locals {
    } probe_0;
    struct function_my_sys_open_mode_str_locals {
      string_t bs;
      int64_t f;
      string_t __tmp0;
      string_t __tmp1;
      string_t __tmp2;
      string_t __tmp3;
      string_t __tmp4;
      string_t __tmp5;
      string_t __tmp6;
      string_t __tmp7;
      string_t __tmp8;
      string_t __tmp9;
      string_t __tmp10;
      string_t __tmp11;
      string_t __tmp12;
      string_t __tmp13;
      string_t __tmp14;
      string_t __tmp15;
      string_t __tmp16;
      string_t __tmp17;
      string_t __tmp18;
      string_t __tmp19;
      string_t __tmp20;
      string_t __tmp21;
      string_t __tmp22;
      string_t __tmp23;
      string_t __tmp24;
      string_t __tmp25;
      string_t __tmp26;
      string_t __tmp27;
      string_t __tmp28;
      string_t __tmp29;
      string_t __tmp30;
      string_t __tmp31;
      string_t __tmp32;
      string_t __tmp33;
      string_t __tmp34;
      string_t __tmp35;
      string_t __retvalue;
    } function_my_sys_open_mode_str;
  } locals [MAXNESTING];
} contexts [128];


static struct kprobe dwarf_kprobe_0[1]= {
  {.addr= (void *) 0xc016765e}
};

char const * dwarf_kprobe_0_location_names[1] = {
  "kernel.function(\"sys_open@fs/open.c:947\")"
};

static int 
dwarf_kprobe_0_enter (struct kprobe *probe_instance, struct pt_regs *regs) {
  return 0;
}

static int systemtap_module_init (void);
int systemtap_module_init () {
  int rc = 0;
  const char *probe_point = "";
  /* register probe #0, 1 location(s) */
  probe_point = "kernel.function(\"sys_open@fs/open.c:947\")";
  {
    int i;
    printk("in module_init() contexts = %zu\n", sizeof(contexts));
    for (i = 0; i < 1; i++) {
      ssleep(5);
      dwarf_kprobe_0[i].pre_handler = &dwarf_kprobe_0_enter;
      rc = rc || register_kprobe (&(dwarf_kprobe_0[i]));
      if (unlikely (rc)) {
        probe_point = dwarf_kprobe_0_location_names[i];
        break;
      }
      printk("probe registered\n");
    }
    if (unlikely (rc)) while (--i >= 0)
      unregister_kprobe (&(dwarf_kprobe_0[i]));
  }
  
  printk("DONE rc=%d\n", rc);
  ssleep(5);
  return rc;
}

void systemtap_module_exit (void) {
  int i;
  for (i = 0; i < 1; i++)
    unregister_kprobe (&(dwarf_kprobe_0[i]));
}

int probe_start () {
  return systemtap_module_init () ? -1 : 0;
}

void probe_exit () {
  systemtap_module_exit ();
}

MODULE_DESCRIPTION("systemtap probe");
MODULE_LICENSE("GPL");
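
To give a sense of scale for the contexts array in the attachment above, the
sketch below rebuilds only the per-nesting locals in user space and
multiplies out the array dimensions. It is an estimate under stated
assumptions: struct locals_model and contexts_lower_bound are hypothetical
names, the fixed header fields of struct context are left out (so the result
is a lower bound), and padding may differ between compilers and
architectures.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXNESTING   30
#define MAXSTRINGLEN 128

typedef char string_t[MAXSTRINGLEN];

/*
 * Same shape as function_my_sys_open_mode_str_locals in the
 * attachment: bs, f, __tmp0..__tmp35 and __retvalue.
 */
struct locals_model {
    string_t bs;
    int64_t  f;
    string_t tmp[36];
    string_t retvalue;
};

/* Lower bound on sizeof(contexts) for a given number of CPUs. */
static size_t contexts_lower_bound(size_t ncpus)
{
    return ncpus * MAXNESTING * sizeof(struct locals_model);
}
```

Where sizeof(struct locals_model) is 4872 bytes, as it is with the common
i386 and x86_64 ABIs, contexts_lower_bound(128) comes to 18708480 bytes,
or about 17.8 MiB of static module data.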


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-22 18:59         ` Mathieu Desnoyers
@ 2005-11-23 15:13           ` Tom Zanussi
  2005-11-23 17:53             ` Mathieu Desnoyers
  0 siblings, 1 reply; 32+ messages in thread
From: Tom Zanussi @ 2005-11-23 15:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tom Zanussi, Stone, Joshua I, michel.dagenais, systemtap, ltt-dev

Mathieu Desnoyers writes:
 > * Tom Zanussi (zanussi@us.ibm.com) wrote:
 > > Mathieu Desnoyers writes:
 > >  > Generally speaking, what would happen if :
 > >  > 
 > >  > - A page fault handler has a trace point that logs an event in the RelayFS
 > >  >   buffer.
 > >  > - This exact write causes a (minor) page fault.
 > >  > 
 > >  > It may not happen often, but I think it would lead to a double fault. Is this
 > >  > case possible ?
 > >  > 
 > > 
 > > Yes, I'm sure it is possible, in fact I'm sure it happens all the time
 > > in LTT and it's never been a problem.  Have you verified that it is a
 > > problem (e.g. tracing only page faults and seeing a double fault).
 > > 
 > > 
 > 
 > I have just done this, and nothing bad happened and I discovered why :
 > 
 > The LTT instrumentation of do_page_fault comes right after the
 > "We fault-in kernel-space virtual memory on-demand." section. So, minor page
 > faults are not instrumented.
 > 
 > Maybe this isn't so interesting to instrument it, but if we would like to, I
 > suggest to add an optionnal support for kmalloc'd memory areas in RelayFS so
 > the client can ask for a channel which consists of a small buffer that is
 > reentrant for the memory subsystem. This should be documented with a big
 > warning about the small size the client should ask for.
 > 

That would be one option, but how about instrumenting the vmalloc
fault case to check whether the faulting address lies inside the
buffer and if so, don't log an event in that case.  You wouldn't see
those vmalloc faults touching the relayfs buffers, but I'm not sure
that's interesting anyway...
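
The filter suggested above amounts to a range check on the faulting address.
A minimal sketch follows, in user-space C with hypothetical names
(addr_in_buffer, should_log_fault); how the real relayfs code would obtain
the buffer's start and length is not covered in this thread and is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr lies in [start, start + len). */
static int addr_in_buffer(uintptr_t addr, uintptr_t start, size_t len)
{
    /* Written as a subtraction so start + len cannot overflow. */
    return addr >= start && addr - start < len;
}

/*
 * Decide whether a vmalloc fault should be logged. Faults on the
 * trace buffer itself are skipped, since logging them would access
 * the buffer again from inside the fault handler.
 */
static int should_log_fault(uintptr_t fault_addr,
                            uintptr_t buf_start, size_t buf_len)
{
    return !addr_in_buffer(fault_addr, buf_start, buf_len);
}
```

The half-open range means an address exactly at start + len counts as
outside the buffer and would still be logged.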

Tom



* Re: double fault
  2005-11-23  8:34 ` double fault Martin Hunt
@ 2005-11-23 17:21   ` Mathieu Desnoyers
  2005-11-23 17:54     ` Martin Hunt
  0 siblings, 1 reply; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-23 17:21 UTC (permalink / raw)
  To: Martin Hunt; +Cc: Stone, Joshua I, systemtap

* Martin Hunt (hunt@redhat.com) wrote:
> Changing the kprobe to other functions does not always trigger the bug.
> 
> The problem also has something to do with the size of the context array.
> Changing NR_CPUS to 128 (which makes the array really huge) was enough
> to cause the double fault to happen on all my RHEL machines (including
> x86_64) except for ones running under vmware. I changed the code to use
> vmalloc (we really want vmalloc_node() but RHEL4 doesn't have it) and
> all the crashes stopped on every machine.
> 

What are the flags used for the memory allocated by vmalloc ?

Did you try : 

- allocating the memory with kmalloc instead of vmalloc ?
- to see if there is a code path that goes from do_page_fault to sys_open ? I
  would be surprised about it, but we never know...

Mathieu

OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 15:13           ` Tom Zanussi
@ 2005-11-23 17:53             ` Mathieu Desnoyers
  2005-11-23 20:16               ` Tom Zanussi
  0 siblings, 1 reply; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-23 17:53 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: Stone, Joshua I, michel.dagenais, systemtap, ltt-dev

* Tom Zanussi (zanussi@us.ibm.com) wrote:
> Mathieu Desnoyers writes:
>  > * Tom Zanussi (zanussi@us.ibm.com) wrote:
>  > > Mathieu Desnoyers writes:
>  > >  > Generally speaking, what would happen if :
>  > >  > 
>  > >  > - A page fault handler has a trace point that logs an event in the RelayFS
>  > >  >   buffer.
>  > >  > - This exact write causes a (minor) page fault.
>  > >  > 
>  > >  > It may not happen often, but I think it would lead to a double fault. Is this
>  > >  > case possible ?
>  > >  > 
>  > > 
>  > > Yes, I'm sure it is possible, in fact I'm sure it happens all the time
>  > > in LTT and it's never been a problem.  Have you verified that it is a
>  > > problem (e.g. tracing only page faults and seeing a double fault).
>  > > 
>  > > 
>  > 
>  > I have just done this, and nothing bad happened and I discovered why :
>  > 
>  > The LTT instrumentation of do_page_fault comes right after the
>  > "We fault-in kernel-space virtual memory on-demand." section. So, minor page
>  > faults are not instrumented.
>  > 
>  > Maybe this isn't so interesting to instrument it, but if we would like to, I
>  > suggest to add an optionnal support for kmalloc'd memory areas in RelayFS so
>  > the client can ask for a channel which consists of a small buffer that is
>  > reentrant for the memory subsystem. This should be documented with a big
>  > warning about the small size the client should ask for.
>  > 
> 
> That would be one option, but how about instrumenting the vmalloc
> fault case to check whether the faulting address lies inside the
> buffer and if so, don't log an event in that case.  You wouldn't see
> those vmalloc faults touching the relayfs buffers, but I'm not sure
> that's interesting anyway...
> 

What about :
- On any minor page fault
  - logging a vmalloc fault (from somewhere else in the kernel)
    - test if faulted address lies within the relayfs buffer : no
    - generate a vmalloc fault because of buffer access
      -> double fault.

I'm afraid it won't fix the problem.

Mathieu

OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 


* Re: double fault
  2005-11-23 17:21   ` Mathieu Desnoyers
@ 2005-11-23 17:54     ` Martin Hunt
  2005-11-23 18:09       ` Mathieu Desnoyers
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Hunt @ 2005-11-23 17:54 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Stone, Joshua I, systemtap

On Wed, 2005-11-23 at 12:21 -0500, Mathieu Desnoyers wrote:
> * Martin Hunt (hunt@redhat.com) wrote:
> > I changed the code to use
> > vmalloc (we really want vmalloc_node() but RHEL4 doesn't have it) and
> > all the crashes stopped on every machine.
> > 
> 
> What are the flags used for the memory allocated by vmalloc ?

From mm/vmalloc.c:
void *vmalloc(unsigned long size)
{
       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
}


> Did you try : 
> 
> - allocating the memory with kmalloc instead of vmalloc ?

Why would I do that?  What would I look for?  vmalloc already works.

> - to see if there is a code path that goes from do_page_fault to sys_open ? I
>   would be surprised about it, but we never know...
I think we can assume that files aren't being opened in do_page_fault.
But I checked anyway and I don't see anything like that.

Martin


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault
  2005-11-23 17:54     ` Martin Hunt
@ 2005-11-23 18:09       ` Mathieu Desnoyers
  0 siblings, 0 replies; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-23 18:09 UTC (permalink / raw)
  To: Martin Hunt; +Cc: Stone, Joshua I, systemtap

* Martin Hunt (hunt@redhat.com) wrote:
> > Did you try : 
> > 
> > - allocating the memory with kmalloc instead of vmalloc ?
> 
> Why would I do that?  What would I look for?  vmalloc already works.
> 

Memory allocated by vmalloc will generate a minor page fault when accessed from
kernel space on behalf of a different process. The page fault handler will
simply update the process's page table in that case. If the page fault handler is
instrumented for a minor fault and the logging code generates a page fault, it
clearly causes a double fault.

But as you only instrument sys_open, this case does not apply.

Mathieu

OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 17:53             ` Mathieu Desnoyers
@ 2005-11-23 20:16               ` Tom Zanussi
  2005-11-23 21:00                 ` Frank Ch. Eigler
  2005-11-23 23:21                 ` Mathieu Desnoyers
  0 siblings, 2 replies; 32+ messages in thread
From: Tom Zanussi @ 2005-11-23 20:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Tom Zanussi, Stone, Joshua I, michel.dagenais, systemtap, ltt-dev

Mathieu Desnoyers writes:
 > * Tom Zanussi (zanussi@us.ibm.com) wrote:
 > > Mathieu Desnoyers writes:
 > >  > * Tom Zanussi (zanussi@us.ibm.com) wrote:
 > >  > > Mathieu Desnoyers writes:
 > >  > >  > Generally speaking, what would happen if :
 > >  > >  > 
 > >  > >  > - A page fault handler has a trace point that logs an event in the RelayFS
 > >  > >  >   buffer.
 > >  > >  > - This exact write causes a (minor) page fault.
 > >  > >  > 
 > >  > >  > It may not happen often, but I think it would lead to a double fault. Is this
 > >  > >  > case possible ?
 > >  > >  > 
 > >  > > 
 > >  > > Yes, I'm sure it is possible, in fact I'm sure it happens all the time
 > >  > > in LTT and it's never been a problem.  Have you verified that it is a
 > >  > > problem (e.g. tracing only page faults and seeing a double fault).
 > >  > > 
 > >  > > 
 > >  > 
 > >  > I have just done this, and nothing bad happened and I discovered why :
 > >  > 
 > >  > The LTT instrumentation of do_page_fault comes right after the
 > >  > "We fault-in kernel-space virtual memory on-demand." section. So, minor page
 > >  > faults are not instrumented.
 > >  > 
 > >  > Maybe this isn't so interesting to instrument it, but if we would like to, I
 >  >  > suggest to add an optional support for kmalloc'd memory areas in RelayFS so
 > >  > the client can ask for a channel which consists of a small buffer that is
 > >  > reentrant for the memory subsystem. This should be documented with a big
 > >  > warning about the small size the client should ask for.
 > >  > 
 > > 
 > > That would be one option, but how about instrumenting the vmalloc
 > > fault case to check whether the faulting address lies inside the
 > > buffer and if so, don't log an event in that case.  You wouldn't see
 > > those vmalloc faults touching the relayfs buffers, but I'm not sure
 > > that's interesting anyway...
 > > 
 > 
 > What about :
 > - On any minor page fault
 >   - logging a vmalloc fault (from somewhere else in the kernel)
 >     - test if faulted address lies within the relayfs buffer : no
 >     - generate a vmalloc fault because of buffer access
 >       -> double fault.
 > 
 > I'm afraid it won't fix the problem.
 > 

I just tried a simple experiment instrumenting do_page_fault() to log
in the same spot LTT does, and also at the very end of the
vmalloc_fault case, and see the vmalloc faults (to the relayfs buffer,
caused by a write from another page fault, since that's all that's
instrumented) being logged without double faults.  So you don't even
have to test whether the fault is in the relayfs buffer, just wait
until the page table has been updated.  What would cause a double
fault would be if the vmalloc_fault tried logging before the page
table was updated, which would cause the same vmalloc fault.
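
The ordering argument above can be sketched as a small userspace toy
model.  This is only an illustration under stated assumptions, not
kernel code: struct toy_mm, toy_fault(), toy_log_event(), toy_run() and
the TOY_MAX_DEPTH cap are all hypothetical names, and the cap merely
stands in for "the machine eventually gives up" (it does not claim which
failure mode real hardware shows).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEPTH 8 /* hypothetical cap on nested faults in this model */

/* One address space: has it synced the mapping of the trace buffer yet? */
struct toy_mm {
    bool buffer_mapped;
    int depth;          /* current nesting of toy_fault() */
    int events_logged;
};

static bool toy_fault(struct toy_mm *mm, bool log_before_sync);

/* Writing an event touches the buffer; if unmapped, that write faults. */
static bool toy_log_event(struct toy_mm *mm, bool log_before_sync)
{
    if (!mm->buffer_mapped && !toy_fault(mm, log_before_sync))
        return false;
    mm->events_logged++;
    return true;
}

/* Model of the vmalloc fault path: sync the mapping, optionally log. */
static bool toy_fault(struct toy_mm *mm, bool log_before_sync)
{
    bool ok = true;

    assert(mm->depth >= 0);
    if (++mm->depth > TOY_MAX_DEPTH) {  /* runaway recursion */
        mm->depth--;
        return false;
    }
    if (log_before_sync)                /* logs while still unmapped */
        ok = toy_log_event(mm, log_before_sync);
    if (ok) {
        mm->buffer_mapped = true;       /* the "page table update" */
        if (!log_before_sync)           /* logs once the mapping exists */
            ok = toy_log_event(mm, log_before_sync);
    }
    mm->depth--;
    return ok;
}

/* Log one event from a fresh address space; report how many got logged. */
bool toy_run(bool log_before_sync, int *events)
{
    struct toy_mm mm = { false, 0, 0 };
    bool ok = toy_log_event(&mm, log_before_sync);

    *events = mm.events_logged;
    return ok;
}
```

With log_before_sync false the model logs both the original event and
the fault event; with it true, every attempt to log re-enters the fault
path and nothing is ever written.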

Tom



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 20:16               ` Tom Zanussi
@ 2005-11-23 21:00                 ` Frank Ch. Eigler
  2005-11-23 21:51                   ` Roland McGrath
  2005-11-23 22:18                   ` Tom Zanussi
  2005-11-23 23:21                 ` Mathieu Desnoyers
  1 sibling, 2 replies; 32+ messages in thread
From: Frank Ch. Eigler @ 2005-11-23 21:00 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: ltt-dev, systemtap


zanussi wrote:

> [...]  What would cause a double fault would be if the vmalloc_fault
> tried logging before the page table was updated, which would cause
> the same vmalloc fault.

Then this is analogous to the problem of calling printk from within an
inconveniently placed kprobe.  What can we do to eliminate this
vulnerability?  Can we somehow arrange to "fault in" all probe-related
kernel-space vmalloc areas into new process' address spaces, so we don't
encounter this unintentional and undesirable reentrancy?

- FChE

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 21:00                 ` Frank Ch. Eigler
@ 2005-11-23 21:51                   ` Roland McGrath
  2005-11-23 21:56                     ` Mathieu Desnoyers
  2005-11-23 22:18                   ` Tom Zanussi
  1 sibling, 1 reply; 32+ messages in thread
From: Roland McGrath @ 2005-11-23 21:51 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

> Then this is analogous to the problem of calling printk from within an
> inconveniently placed kprobe.  What can we do to eliminate this
> vulnerability?  Can we somehow arrange to "fault in" all probe-related
> kernel-space vmalloc areas into new process' address spaces, so we don't
> encounter this unintentional and undesirable reentrancy?

What's the reason for using vmalloc then?  Why not use kmalloc that doesn't
need page table changes?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 21:51                   ` Roland McGrath
@ 2005-11-23 21:56                     ` Mathieu Desnoyers
  2005-11-23 22:34                       ` Roland McGrath
  0 siblings, 1 reply; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-23 21:56 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Frank Ch. Eigler, systemtap

* Roland McGrath (roland@redhat.com) wrote:
> > Then this is analogous to the problem of calling printk from within an
> > inconveniently placed kprobe.  What can we do to eliminate this
> > vulnerability?  Can we somehow arrange to "fault in" all probe-related
> > kernel-space vmalloc areas into new process' address spaces, so we don't
> > encounter this unintentional and undesirable reentrancy?
> 
> What's the reason for using vmalloc then?  Why not use kmalloc that doesn't
> need page table changes?
> 

kmalloc needs contiguous pages of memory. It can be problematic in a system
where the memory is fragmented and the requested allocation size is big.

Mathieu

OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 21:00                 ` Frank Ch. Eigler
  2005-11-23 21:51                   ` Roland McGrath
@ 2005-11-23 22:18                   ` Tom Zanussi
  2005-11-23 22:25                     ` Frank Ch. Eigler
  1 sibling, 1 reply; 32+ messages in thread
From: Tom Zanussi @ 2005-11-23 22:18 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Tom Zanussi, ltt-dev, systemtap

Frank Ch. Eigler writes:
 > 
 > zanussi wrote:
 > 
 > > [...]  What would cause a double fault would be if the vmalloc_fault
 > > tried logging before the page table was updated, which would cause
 > > the same vmalloc fault.
 > 
 > Then this is analogous to the problem of calling printk from within an
 > inconveniently placed kprobe.  What can we do to eliminate this
 > vulnerability?  Can we somehow arrange to "fault in" all probe-related
 > kernel-space vmalloc areas into new process' address spaces, so we don't
 > encounter this unintentional and undesirable reentrancy?
 > 

I'll think about it, but it doesn't sound like fun.  It sounds like it
might be one of those cases where you only allow a tapset to
instrument a certain area, in this case a page fault tapset to
instrument the page fault path.  I can't remember, how is the
possibility of a printk() in a problematic function currently handled
in systemtap?

Tom


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 22:18                   ` Tom Zanussi
@ 2005-11-23 22:25                     ` Frank Ch. Eigler
  0 siblings, 0 replies; 32+ messages in thread
From: Frank Ch. Eigler @ 2005-11-23 22:25 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: ltt-dev, systemtap

Hi -

On Wed, Nov 23, 2005 at 04:17:42PM -0600, Tom Zanussi wrote:

> [...] It sounds like it might be one of those cases where you only
> allow a tapset to instrument a certain area, in this case a page
> fault tapset to instrument the page fault path.

Enforcing this would be tricky as long as dwarf probes are generally
accessible.  Maybe we should more seriously look at kmalloc.  At least
memory allocation is an initialization-time-only step that can fail
cleanly.

> I can't remember, how is the possibility of a printk() in a
> problematic function currently handled in systemtap?

By eliminating printk() calls from the runtime/tapsets etc. :-)

- FChE

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 21:56                     ` Mathieu Desnoyers
@ 2005-11-23 22:34                       ` Roland McGrath
  2005-11-23 22:44                         ` Mathieu Desnoyers
                                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Roland McGrath @ 2005-11-23 22:34 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: systemtap

> kmalloc needs contiguous pages of memory. It can be problematic in a system
> where the memory is fragmented and the requested allocation size is big.

Ok.  vmalloc makes sense for big buffers.  It's not all that clear to me
that we can prefault the pages in the right address space ahead of time,
unless we iterate over every mm.  That is, we have to know in what context
we'll be running probe code that needs to use a given buffer, but we can't
necessarily know that before the probe actually hits.  At that point, it
may be unsafe to prefault the vmalloc'd buffer area because the probe point
is in the middle of page table futzing or whatnot.  

In the current Linus kernel, the i386 do_page_fault routine is marked with
__kprobes.  As I understand it, this puts it in a special section and no
kprobe can be inserted inside the function.  If that's the case, I'm a bit
confused as to how this is coming up.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 22:34                       ` Roland McGrath
@ 2005-11-23 22:44                         ` Mathieu Desnoyers
  2005-11-23 23:20                         ` Tom Zanussi
  2005-11-24 18:10                         ` Frank Ch. Eigler
  2 siblings, 0 replies; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-23 22:44 UTC (permalink / raw)
  To: Roland McGrath; +Cc: systemtap

* Roland McGrath (roland@redhat.com) wrote:
> In the current Linus kernel, the i386 do_page_fault routine is marked with
> __kprobes.  As I understand it, this puts it in a special section and no
> kprobe can be inserted inside the function.  If that's the case, I'm a bit
> confused as to how this is coming up.
> 

Well, it's getting clear now that only the vmalloc_fault path of this function
is problematic. It's ok as long as this path does not call any external
function.

Mathieu

OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 22:34                       ` Roland McGrath
  2005-11-23 22:44                         ` Mathieu Desnoyers
@ 2005-11-23 23:20                         ` Tom Zanussi
  2005-11-24 18:10                         ` Frank Ch. Eigler
  2 siblings, 0 replies; 32+ messages in thread
From: Tom Zanussi @ 2005-11-23 23:20 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Mathieu Desnoyers, systemtap

Roland McGrath writes:
 > 
 > In the current Linus kernel, the i386 do_page_fault routine is marked with
 > __kprobes.  As I understand it, this puts it in a special section and no
 > kprobe can be inserted inside the function.  If that's the case, I'm a bit
 > confused as to how this is coming up.

This part of the thread has been a sidebar, AFAICS - the original
poster didn't mention anything about instrumenting the page fault
path.

Tom


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 20:16               ` Tom Zanussi
  2005-11-23 21:00                 ` Frank Ch. Eigler
@ 2005-11-23 23:21                 ` Mathieu Desnoyers
  1 sibling, 0 replies; 32+ messages in thread
From: Mathieu Desnoyers @ 2005-11-23 23:21 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: Stone, Joshua I, michel.dagenais, systemtap, ltt-dev

* Tom Zanussi (zanussi@us.ibm.com) wrote:
 
> I just tried a simple experiment instrumenting do_page_fault() to log
> in the same spot LTT does, and also at the very end of the
> vmalloc_fault case, and see the vmalloc faults (to the relayfs buffer,
> caused by a write from another page fault, since that's all that's
> instrumented) being logged without double faults.  So you don't even
> have to test whether the fault is in the relayfs buffer, just wait
> until the page table has been updated.  What would cause a double
> fault would be if the vmalloc_fault tried logging before the page
> table was updated, which would cause the same vmalloc fault.
> 


Yes, it fits with p. 5-52, vol. 3 from IA32 Intel Software Developer Manual : it
explains that a page fault handler can generate another page fault in its code.
It must save the content of the CR2 register (address of the fault) before a
nested PF can occur.
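
A minimal userspace sketch of why the save has to come first, assuming a
single variable standing in for CR2.  All names here (toy_cr2,
toy_raise_fault(), toy_handler_late_read(), toy_handler_early_save())
are hypothetical and only illustrate the ordering; this is not how the
kernel reads the register.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for CR2: holds the address of the most recent fault. */
static uintptr_t toy_cr2;

/* Any fault, first or nested, overwrites the single register. */
static void toy_raise_fault(uintptr_t addr)
{
    toy_cr2 = addr;
}

/* Reads the register only after a nested fault: sees the nested address. */
uintptr_t toy_handler_late_read(uintptr_t first, uintptr_t nested)
{
    assert(first != nested);    /* the demonstration needs two addresses */
    toy_raise_fault(first);
    toy_raise_fault(nested);    /* nested fault inside the handler */
    return toy_cr2;
}

/* Saves the register on entry, before anything that can fault again. */
uintptr_t toy_handler_early_save(uintptr_t first, uintptr_t nested)
{
    uintptr_t saved;

    assert(first != nested);
    toy_raise_fault(first);
    saved = toy_cr2;            /* copy before a nested fault can occur */
    toy_raise_fault(nested);
    return saved;
}
```

The late read returns the nested address, so the handler would repair
the wrong page; the early save still has the original faulting address.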

After having read the "Interrupt 8 - Double fault exception", vol. 3, p. 5-37, five
times, I get the difference between :

"while calling an exception handler for a prior exception"
and
"while the processor is executing an NMI interrupt handler"

The page fault case causing a double fault is only when the call to the PF
handler will itself generate a page fault. The code of the PF handler seems to
be allowed to generate a PF without any double fault occurring.

But having the PF handler causing a PF on the same address as the preceding one
before the page table has been updated could clearly end up in a recursive page
fault. I just tried it, I don't think this ends up in a double fault
(nothing shown on my debug console) and I get a system freeze (I use spinlocks
to protect my buffer in this test). It looks like a deadlock.

If we want to instrument this function before the page table is updated
(i.e. log every function entry, including this one, to get the real time
spent in do_page_fault), it could be interesting to have a small,
low-traffic, kmalloc'd relayfs channel that is reentrant with respect to
the memory subsystem to log this particular path.

Mathieu


OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-23 22:34                       ` Roland McGrath
  2005-11-23 22:44                         ` Mathieu Desnoyers
  2005-11-23 23:20                         ` Tom Zanussi
@ 2005-11-24 18:10                         ` Frank Ch. Eigler
  2005-11-24 22:09                           ` Roland McGrath
  2 siblings, 1 reply; 32+ messages in thread
From: Frank Ch. Eigler @ 2005-11-24 18:10 UTC (permalink / raw)
  To: systemtap


roland wrote:

> Ok.  vmalloc makes sense for big buffers.  [...]

But less so for other stuff.  I'll get the translator to switch to
kmalloc as much as possible.

> In the current Linus kernel, the i386 do_page_fault routine is
> marked with __kprobes.  [...]

Maybe that decoration was missing from some transitively called function?

- FChE

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: double fault -> PAGE_KERNEL flagged memory
  2005-11-24 18:10                         ` Frank Ch. Eigler
@ 2005-11-24 22:09                           ` Roland McGrath
  0 siblings, 0 replies; 32+ messages in thread
From: Roland McGrath @ 2005-11-24 22:09 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

> But less so for other stuff.  I'll get the translator to switch to
> kmalloc as much as possible.

More or less only for things larger than a page, I think.

> > In the current Linus kernel, the i386 do_page_fault routine is
> > marked with __kprobes.  [...]
> 
> Maybe that decoration was missing from some transitively called function?

Could be, though there aren't any actual calls in the vmalloc_fault path, I
think.  The annotations (and the protections) don't exist at all in RHEL4
kernels, looking at it now.  (I was looking at the current upstream kernel
earlier.)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: double fault
@ 2005-11-24  2:37 Stone, Joshua I
  0 siblings, 0 replies; 32+ messages in thread
From: Stone, Joshua I @ 2005-11-24  2:37 UTC (permalink / raw)
  To: Roland McGrath, Martin Hunt; +Cc: systemtap

Roland McGrath wrote:
> The second crash had an esp of 0xf5bd4f98.  If that's a proper stack
> pointer, it's only 104 bytes from the beginning of the stack. 
> Considering that the trap frame itself is 60 bytes, that's fairly
> small for a realistic stack.  It might well be that in fact it's an
> overflowed stack that grew down from below 0xf5bd6000 and overflowed
> by getting below 0xf5bd5034 (which is the end of the struct
> thread_info at the base of the stack). 

I added a check to monitor the stack on the probe entrance, like this:

	unsigned left = (unsigned)CONTEXT->regs & 0xfff;
	printk("stap_debug: %u bytes on the stack\n", left);

Once I added that, I started getting only a single output and then a
crash every time.  The value reported is consistently 3976 bytes - only
120 bytes from the top.  And the eip is now consistently at that stack
read within do_page_fault as well.

>> Is there a way I can get the double-fault to print a full oops, with
>> a stack trace?
> 
> No, it's a special trap handler that uses its own stack and just has
> the simple printks you've seen.  You'd have to do something like put
> a probe on the line in doublefault_fn where it printk's the esp et
> al, and have that call show_trace on t->esp or something.

A probe here doesn't work.  I tried it, and the system hung up
completely (a triple-fault?).  I think things must be hosed up pretty
bad by the time it gets to doublefault_fn.

And thanks to the infinite wisdom of Linus, it's a pain to get a
debugger in there.  I tried kdb first, but kdb doesn't automatically
catch double-faults.  I put a breakpoint on doublefault_fn, and it
triggered, but kdb just panicked about invalid memory references as it
was trying to take over.  Again, to me this seems to indicate trouble
with the stack.  I couldn't get kgdb to work at all on the RHEL4 kernel
- likely patching issues.


Martin Hunt wrote:
> But I'm not sure it's worth pursuing further because it appears to not
> happen in the newer version of kprobes.

Perhaps, or perhaps there's still a landmine in there that is just
better obscured in the newer kprobes.  I would feel much better if there
was a known fix that occurred, instead of the problem magically
disappearing.  I don't think I will spend much more time on this though,
at least until someone runs into the same issue on the new kprobes.

At the very least, judging by the side conversations, we now appear to
have quite a few people looking closely at the fault handling code...

Thanks,

Josh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: double fault
  2005-11-22  3:46 Stone, Joshua I
@ 2005-11-22 11:00 ` Roland McGrath
  0 siblings, 0 replies; 32+ messages in thread
From: Roland McGrath @ 2005-11-22 11:00 UTC (permalink / raw)
  To: Stone, Joshua I; +Cc: systemtap

> Shouldn't it be CONTEXT->regs->esp?

Nope.  For kernel traps the sp and ss are not saved by the i386 hardware, so
that part of the struct pt_regs is not actually there.  However, that
struct itself is the trap frame of the registers that are pushed on the
stack and so it is a stack address near the sp at the time of the fault.

> I tried the code you gave (using CONTEXT->regs), but I don't understand
> how that computes how much stack space is left.  

The stacks are 4k and aligned, so & 0xfff is that sp relative to the base
of the stack.  If sp & 0xfff is very tiny, then the stack is about to
overflow.

> And even then, you can see the two esp's from the register dumps I gave -
> the first would have triggered your panic, and the second wouldn't.  

The second crash had an esp of 0xf5bd4f98.  If that's a proper stack
pointer, it's only 104 bytes from the beginning of the stack.  Considering
that the trap frame itself is 60 bytes, that's fairly small for a realistic
stack.  It might well be that in fact it's an overflowed stack that grew
down from below 0xf5bd6000 and overflowed by getting below 0xf5bd5034
(which is the end of the struct thread_info at the base of the stack).
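
The arithmetic can be checked with a small userspace sketch, assuming
4k aligned stacks as described above.  The helper names stack_offset()
and bytes_below_top() are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of sp within its aligned stack: sp & (size - 1), i.e. sp & 0xfff. */
uint32_t stack_offset(uint32_t sp, uint32_t stack_size)
{
    /* The mask trick only works for power-of-two sizes. */
    assert(stack_size != 0 && (stack_size & (stack_size - 1)) == 0);
    return sp & (stack_size - 1);
}

/* Bytes between sp and the top (highest address) of the same stack. */
uint32_t bytes_below_top(uint32_t sp, uint32_t stack_size)
{
    return stack_size - stack_offset(sp, stack_size);
}
```

For esp = 0xf5bd4f98 the offset is 0xf98 = 3992, so the pointer sits 104
bytes below 0xf5bd5000.  For the first dump, esp = 0xf4b6500c gives an
offset of only 12, which is under the 256-byte threshold of the
suggested check.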

Of course, it's all just speculation that stack overflow is the issue.

> Is there a way I can get the double-fault to print a full oops, with a
> stack trace?

No, it's a special trap handler that uses its own stack and just has the
simple printks you've seen.  You'd have to do something like put a probe on
the line in doublefault_fn where it printk's the esp et al, and have that
call show_trace on t->esp or something.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: double fault
@ 2005-11-22  3:46 Stone, Joshua I
  2005-11-22 11:00 ` Roland McGrath
  0 siblings, 1 reply; 32+ messages in thread
From: Stone, Joshua I @ 2005-11-22  3:46 UTC (permalink / raw)
  To: Roland McGrath; +Cc: systemtap

>From: Roland McGrath [mailto:roland@redhat.com] 
>
>The stack overflow notion sounds plausible.  To investigate that angle, one
>thing to try comes to mind off hand.  In each probe that might be hitting,
>stick some %{ ... %} code to do a "stack getting small" check.  It can do
>something like:
>
>	unsigned left = (unsigned)regs & 0xfff;
>	if (left < 256) panic("stack getting close");
>
>That might manage to print out a full oops with backtrace details that show
>the cascade of page fault frames or whatever the situation actually is.
>
>
>Thanks,
>Roland
>

I tried the code you gave (using CONTEXT->regs), but I don't understand
how that computes how much stack space is left.  Shouldn't it be
CONTEXT->regs->esp?  And even then, you can see the two esp's from the
register dumps I gave - the first would have triggered your panic, and
the second wouldn't.  Am I missing something?

Anyway, I tried it both ways.  It immediately panics, but there's no
oops info.  It just says "Kernel panic - not syncing".  I added a
dump_stack call, but that all looks innocent.

Is there a way I can get the double-fault to print a full oops, with a
stack trace?

I'm pretty new to kernel-debugging, so sorry if I'm asking simple
questions...

Thanks,

Josh

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2005-11-24 22:09 UTC | newest]

Thread overview: 32+ messages
2005-11-22  1:12 double fault Stone, Joshua I
2005-11-22  1:25 ` Roland McGrath
2005-11-22  9:29 ` Richard J Moore
2005-11-22 14:00 ` double fault -> PAGE_KERNEL flagged memory Mathieu Desnoyers
2005-11-22 15:12   ` Tom Zanussi
2005-11-22 15:24     ` Mathieu Desnoyers
2005-11-22 16:52       ` Tom Zanussi
2005-11-22 18:59         ` Mathieu Desnoyers
2005-11-23 15:13           ` Tom Zanussi
2005-11-23 17:53             ` Mathieu Desnoyers
2005-11-23 20:16               ` Tom Zanussi
2005-11-23 21:00                 ` Frank Ch. Eigler
2005-11-23 21:51                   ` Roland McGrath
2005-11-23 21:56                     ` Mathieu Desnoyers
2005-11-23 22:34                       ` Roland McGrath
2005-11-23 22:44                         ` Mathieu Desnoyers
2005-11-23 23:20                         ` Tom Zanussi
2005-11-24 18:10                         ` Frank Ch. Eigler
2005-11-24 22:09                           ` Roland McGrath
2005-11-23 22:18                   ` Tom Zanussi
2005-11-23 22:25                     ` Frank Ch. Eigler
2005-11-23 23:21                 ` Mathieu Desnoyers
2005-11-22 15:17   ` Mathieu Desnoyers
2005-11-22 15:42   ` Frank Ch. Eigler
2005-11-22 16:01     ` Mathieu Desnoyers
2005-11-23  8:34 ` double fault Martin Hunt
2005-11-23 17:21   ` Mathieu Desnoyers
2005-11-23 17:54     ` Martin Hunt
2005-11-23 18:09       ` Mathieu Desnoyers
2005-11-22  3:46 Stone, Joshua I
2005-11-22 11:00 ` Roland McGrath
2005-11-24  2:37 Stone, Joshua I
