[RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
@ 2008-07-15 18:33 James Bottomley
  2008-07-16 22:42 ` Masami Hiramatsu
  2008-07-18  9:11 ` [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) Andi Kleen
  0 siblings, 2 replies; 49+ messages in thread
From: James Bottomley @ 2008-07-15 18:33 UTC (permalink / raw)
  To: linux-kernel, systemtap

One of the big nasties of systemtap is the way it tries to embed
virtually the entirety of the kernel symbol table in the probe modules
it constructs.  This is highly undesirable because it represents a
subversion of the kernel API to gain access to unexported symbols.  At
least for kprobes, the correct way to do this is to specify the probe
point by symbol and offset.

This patch converts systemtap to use the correct kprobe
symbol_name/offset pair to identify the probe location.

This only represents a baby step:  after this is done, there are at
least three other consumers of the systemtap module relocation
machinery:

     1. unwind information.  I think the consumers of this can be
        converted to use the arch specific unwinders that already exist
        within the kernel
     2. systemtap specific functions that use kernel internals.  This
        was things like get_cycles() but I think they all now use a
        sanctioned API ... need to check
     3. Access to unexported global variables used by the probes.  This
        one is a bit tricky; the dwarf gives a probe the ability to
        access any variable available from the probed stack frame,
        including all globals.  We could just make the globals off
        limits, but that weakens the value of the debugger.
        Alternatively, we could expand the kprobe API to allow probes
        access to named global variables (tricky to get right without
        effectively giving general symbol access).  Thoughts?

If you're going to try this out, you currently need to specify --kelf on
the command line to tell systemtap to use the kernel elf to derive
symbol names and offsets (it will just segfault without this ATM).

James

---

diff --git a/tapsets.cxx b/tapsets.cxx
index 9037e15..a6a6dd3 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -2306,13 +2306,15 @@ struct dwarf_derived_probe: public derived_probe
                        const string& module,
                        const string& section,
 		       Dwarf_Addr dwfl_addr,
-		       Dwarf_Addr addr,
+		       string symbol,
+		       unsigned int offset,
 		       dwarf_query & q,
                        Dwarf_Die* scope_die);
 
   string module;
   string section;
-  Dwarf_Addr addr;
+  string kprobe_symbol;
+  unsigned int kprobe_offset;
   bool has_return;
   bool has_maxactive;
   long maxactive_val;
@@ -3260,9 +3262,18 @@ dwarf_query::add_probe_point(const string& funcname,
 
   if (! bad)
     {
+      struct module_info *mi = dw.mod_info;
+      if (!mi->sym_table)
+	mi->get_symtab(this);
+      struct symbol_table *sym_tab = mi->sym_table;
+      func_info *symbol = sym_tab->get_func_containing_address(addr);
+
       sess.unwindsym_modules.insert (module);
       probe = new dwarf_derived_probe(funcname, filename, line,
-                                      module, reloc_section, addr, reloc_addr, *this, scope_die);
+                                      module, reloc_section, reloc_addr,
+				      symbol->name,
+				      (unsigned int)(addr - symbol->addr),
+				      *this, scope_die);
       results.push_back(probe);
     }
 }
@@ -4380,7 +4391,8 @@ dwarf_derived_probe::printsig (ostream& o) const
   // function instances.  This is distinct from the verbose/clog
   // output, since this part goes into the cache hash calculations.
   sole_location()->print (o);
-  o << " /* pc=0x" << hex << addr << dec << " */";
+  o << " /* pc=<" << kprobe_symbol << "+0x" << hex 
+    << kprobe_offset << dec << "> */";
   printsig_nested (o);
 }
 
@@ -4406,22 +4418,26 @@ dwarf_derived_probe::dwarf_derived_probe(const string& funcname,
                                          // NB: dwfl_addr is the virtualized
                                          // address for this symbol.
                                          Dwarf_Addr dwfl_addr,
-                                         // addr is the section-offset for
-                                         // actual relocation.
-                                         Dwarf_Addr addr,
+                                         // symbol is the closest known symbol
+                                         // and offset is the offset from the symbol
+					 string symbol,
+					 unsigned int offset,
                                          dwarf_query& q,
                                          Dwarf_Die* scope_die /* may be null */)
   : derived_probe (q.base_probe, new probe_point(*q.base_loc) /* .components soon rewritten */ ),
-    module (module), section (section), addr (addr),
+    module (module), section (section), kprobe_symbol(symbol),
+    kprobe_offset(offset),
     has_return (q.has_return),
     has_maxactive (q.has_maxactive),
     maxactive_val (q.maxactive_val)
 {
   // Assert relocation invariants
+#if 0
   if (section == "" && dwfl_addr != addr) // addr should be absolute
     throw semantic_error ("missing relocation base against", q.base_loc->tok);
   if (section != "" && dwfl_addr == addr) // addr should be an offset
     throw semantic_error ("inconsistent relocation address", q.base_loc->tok);
+#endif
 
   this->tok = q.base_probe->tok;
 
@@ -4620,8 +4636,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
   // Let's find some stats for the three embedded strings.  Maybe they
   // are small and uniform enough to justify putting char[MAX]'s into
   // the array instead of relocated char*'s.
-  size_t module_name_max = 0, section_name_max = 0, pp_name_max = 0;
-  size_t module_name_tot = 0, section_name_tot = 0, pp_name_tot = 0;
+  size_t pp_name_max = 0, pp_name_tot = 0;
+  size_t symbol_name_name_max  = 0, symbol_name_name_tot = 0;
   size_t all_name_cnt = probes_by_module.size(); // for average
   for (p_b_m_iterator it = probes_by_module.begin(); it != probes_by_module.end(); it++)
     {
@@ -4630,9 +4646,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
         size_t var##_size = (expr) + 1;                 \
         var##_max = max (var##_max, var##_size);        \
         var##_tot += var##_size; } while (0)
-      DOIT(module_name, p->module.size());
-      DOIT(section_name, p->section.size());
       DOIT(pp_name, lex_cast_qstring(*p->sole_location()).size());
+      DOIT(symbol_name_name, p->kprobe_symbol.size());
 #undef DOIT
     }
 
@@ -4652,11 +4667,10 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
       if (s.verbose > 2) clog << "stap_dwarf_probe *" << #var << endl;  \
     }
 
-  CALCIT(module);
-  CALCIT(section);
   CALCIT(pp);
+  CALCIT(symbol_name);
 
-  s.op->newline() << "const unsigned long address;";
+  s.op->newline() << "unsigned int offset;";
   s.op->newline() << "void (* const ph) (struct context*);";
   s.op->newline(-1) << "} stap_dwarf_probes[] = {";
   s.op->indent(1);
@@ -4673,9 +4687,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
           assert (p->maxactive_val >= 0 && p->maxactive_val <= USHRT_MAX);
           s.op->line() << " .maxactive_val=" << p->maxactive_val << ",";
         }
-      s.op->line() << " .address=0x" << hex << p->addr << dec << "UL,";
-      s.op->line() << " .module=\"" << p->module << "\",";
-      s.op->line() << " .section=\"" << p->section << "\",";
+      s.op->line() << " .symbol_name=\"" << p->kprobe_symbol << "\",";
+      s.op->line() << " .offset=0x" << hex << p->kprobe_offset << dec << ",";
       s.op->line() << " .pp=" << lex_cast_qstring (*p->sole_location()) << ",";
       s.op->line() << " .ph=&" << p->name;
       s.op->line() << " },";
@@ -4735,11 +4748,10 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
   s.op->newline() << "for (i=0; i<" << probes_by_module.size() << "; i++) {";
   s.op->newline(1) << "struct stap_dwarf_probe *sdp = & stap_dwarf_probes[i];";
   s.op->newline() << "struct stap_dwarf_kprobe *kp = & stap_dwarf_kprobes[i];";
-  s.op->newline() << "unsigned long relocated_addr = _stp_module_relocate (sdp->module, sdp->section, sdp->address);";
-  s.op->newline() << "if (relocated_addr == 0) continue;"; // quietly; assume module is absent
   s.op->newline() << "probe_point = sdp->pp;";
   s.op->newline() << "if (sdp->return_p) {";
-  s.op->newline(1) << "kp->u.krp.kp.addr = (void *) relocated_addr;";
+  s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
+  s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
   s.op->newline() << "if (sdp->maxactive_p) {";
   s.op->newline(1) << "kp->u.krp.maxactive = sdp->maxactive_val;";
   s.op->newline(-1) << "} else {";
@@ -4748,7 +4760,8 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
   s.op->newline() << "kp->u.krp.handler = &enter_kretprobe_probe;";
   s.op->newline() << "rc = register_kretprobe (& kp->u.krp);";
   s.op->newline(-1) << "} else {";
-  s.op->newline(1) << "kp->u.kp.addr = (void *) relocated_addr;";
+  s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
+  s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
   s.op->newline() << "kp->u.kp.pre_handler = &enter_kprobe_probe;";
   s.op->newline() << "rc = register_kprobe (& kp->u.kp);";
   s.op->newline(-1) << "}";
@@ -4885,12 +4898,20 @@ dwarf_builder::build(systemtap_session & sess,
           throw semantic_error ("absolute statement probe in unprivileged script", q.base_probe->tok);
         }
 
+      struct module_info *mi = dw->mod_info;
+      if (!mi->sym_table)
+	mi->get_symtab(&q);
+      struct symbol_table *sym_tab = mi->sym_table;
+      func_info *symbol = sym_tab->get_func_containing_address(q.statement_num_val);
+
       // For kernel.statement(NUM).absolute probe points, we bypass
       // all the debuginfo stuff: We just wire up a
       // dwarf_derived_probe right here and now.
       dwarf_derived_probe* p =
         new dwarf_derived_probe ("", "", 0, "kernel", "",
-                                 q.statement_num_val, q.statement_num_val,
+                                 q.statement_num_val,
+				 symbol->name,
+				 (unsigned int)(q.statement_num_val - symbol->addr),
                                  q, 0);
       finished_results.push_back (p);
       sess.unwindsym_modules.insert ("kernel");


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 18:33 [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) James Bottomley
@ 2008-07-16 22:42 ` Masami Hiramatsu
  2008-07-16 23:03   ` James Bottomley
  2008-07-18  9:11 ` [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) Andi Kleen
  1 sibling, 1 reply; 49+ messages in thread
From: Masami Hiramatsu @ 2008-07-16 22:42 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

James Bottomley wrote:
> One of the big nasties of systemtap is the way it tries to embed
> virtually the entirety of the kernel symbol table in the probe modules
> it constructs.  This is highly undesirable because it represents a
> subversion of the kernel API to gain access to unexported symbols.  At
> least for kprobes, the correct way to do this is to specify the probe
> point by symbol and offset.
> 
> This patch converts systemtap to use the correct kprobe
> symbol_name/offset pair to identify the probe location.

Hi James,

I think your suggestion is a good step. Of course, it might
have to solve some issues.

Unfortunately, current kprobe's symbol_name interface is not
so clever. For example, if you specify a static function
which is defined at several places in the kernel(ex. do_open),
it always pick up the first one in kallsyms, even if systemtap
can find all of those functions.
(you can find many duplicated symbols in /proc/kallsyms)

So, we might better improve kallsyms to treat this case
and find what is a better way to specify symbols and addresses.

> 
> This only represents a baby step:  after this is done, there are at
> least three other consumers of the systemtap module relocation
> machinery:
> 
>      1. unwind information.  I think the consumers of this can be
>         converted to use the arch specific unwinders that already exist
>         within the kernel
>      2. systemtap specific functions that use kernel internals.  This
>         was things like get_cycles() but I think they all now use a
>         sanctioned API ... need to check

Sure, those functions must be well isolated from other parts of kernel.
unfortunately, relayfs is not enough isolated. see below;
http://sources.redhat.com/bugzilla/show_bug.cgi?id=6487

>      3. Access to unexported global variables used by the probes.  This
>         one is a bit tricky; the dwarf gives a probe the ability to
>         access any variable available from the probed stack frame,
>         including all globals.  We could just make the globals off
>         limits, but that weakens the value of the debugger.
>         Alternatively, we could expand the kprobe API to allow probes
>         access to named global variables (tricky to get right without
>         effectively giving general symbol access).  Thoughts?

Could we provide a separated GPL'd interface to access named global
symbols which is based on kallsyms?

Thank you,

> If you're going to try this out, you currently need to specify --kelf on
> the command line to tell systemtap to use the kernel elf to derive
> symbol names and offsets (it will just segfault without this ATM).
> 
> James

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-16 22:42 ` Masami Hiramatsu
@ 2008-07-16 23:03   ` James Bottomley
  2008-07-17  0:07     ` Masami Hiramatsu
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-16 23:03 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, systemtap

On Wed, 2008-07-16 at 18:40 -0400, Masami Hiramatsu wrote:
> James Bottomley wrote:
> > One of the big nasties of systemtap is the way it tries to embed
> > virtually the entirety of the kernel symbol table in the probe modules
> > it constructs.  This is highly undesirable because it represents a
> > subversion of the kernel API to gain access to unexported symbols.  At
> > least for kprobes, the correct way to do this is to specify the probe
> > point by symbol and offset.
> > 
> > This patch converts systemtap to use the correct kprobe
> > symbol_name/offset pair to identify the probe location.
> 
> Hi James,
> 
> I think your suggestion is a good step. Of course, it might
> have to solve some issues.
> 
> Unfortunately, current kprobe's symbol_name interface is not
> so clever. For example, if you specify a static function
> which is defined at several places in the kernel(ex. do_open),
> it always pick up the first one in kallsyms, even if systemtap
> can find all of those functions.
> (you can find many duplicated symbols in /proc/kallsyms)

Right, but realistically only functions which have a strict existence
(i.e. those for whom an address could be taken) can be used; functions
which are fully inlined (as in have no separate existence) can't.
That's why the patch finds the closest function with an address to match
on.

> So, we might better improve kallsyms to treat this case
> and find what is a better way to specify symbols and addresses.

Well, both the dwarf and the kallsyms know which are the functions that
have a real existence, so the tool can work it out.  It has a real
meaning too because the chosen symbol must be the parent routine of all
the nested inlines.

> > This only represents a baby step:  after this is done, there are at
> > least three other consumers of the systemtap module relocation
> > machinery:
> > 
> >      1. unwind information.  I think the consumers of this can be
> >         converted to use the arch specific unwinders that already exist
> >         within the kernel
> >      2. systemtap specific functions that use kernel internals.  This
> >         was things like get_cycles() but I think they all now use a
> >         sanctioned API ... need to check
> 
> Sure, those functions must be well isolated from other parts of kernel.
> unfortunately, relayfs is not enough isolated. see below;
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=6487

This is just "who guards the guards" or in this case, you can't probe
pieces of the kernel that the probe internals use.  However, as long as
the separation is tight this shouldn't be too much of a problem.

> >      3. Access to unexported global variables used by the probes.  This
> >         one is a bit tricky; the dwarf gives a probe the ability to
> >         access any variable available from the probed stack frame,
> >         including all globals.  We could just make the globals off
> >         limits, but that weakens the value of the debugger.
> >         Alternatively, we could expand the kprobe API to allow probes
> >         access to named global variables (tricky to get right without
> >         effectively giving general symbol access).  Thoughts?
> 
> Could we provide a separated GPL'd interface to access named global
> symbols which is based on kallsyms?

Yes, I think so ... it's just a case of working out what and how; but to
do that we need a consumer of the interface.

James


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-16 23:03   ` James Bottomley
@ 2008-07-17  0:07     ` Masami Hiramatsu
  2008-07-17  1:50       ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Masami Hiramatsu @ 2008-07-17  0:07 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

James Bottomley wrote:
> On Wed, 2008-07-16 at 18:40 -0400, Masami Hiramatsu wrote:
>> James Bottomley wrote:
>>> One of the big nasties of systemtap is the way it tries to embed
>>> virtually the entirety of the kernel symbol table in the probe modules
>>> it constructs.  This is highly undesirable because it represents a
>>> subversion of the kernel API to gain access to unexported symbols.  At
>>> least for kprobes, the correct way to do this is to specify the probe
>>> point by symbol and offset.
>>>
>>> This patch converts systemtap to use the correct kprobe
>>> symbol_name/offset pair to identify the probe location.
>> Hi James,
>>
>> I think your suggestion is a good step. Of course, it might
>> have to solve some issues.
>>
>> Unfortunately, current kprobe's symbol_name interface is not
>> so clever. For example, if you specify a static function
>> which is defined at several places in the kernel(ex. do_open),
>> it always pick up the first one in kallsyms, even if systemtap
>> can find all of those functions.
>> (you can find many duplicated symbols in /proc/kallsyms)
> 
> Right, but realistically only functions which have a strict existence
> (i.e. those for whom an address could be taken) can be used; functions
> which are fully inlined (as in have no separate existence) can't.
> That's why the patch finds the closest function with an address to match
> on.

Sure, inlined functions are embedded in a caller function, so the
closest function is the correct owner.

However, I meant local-scope functions can have same name if
they are defined in different scope. And even though, both of
them are shown in kallsyms. This mean, you can see the functions
which have real different existence but have same symbol.

Would you mean systemtap should not probe those name-conflicted
functions?

>> So, we might better improve kallsyms to treat this case
>> and find what is a better way to specify symbols and addresses.
> 
> Well, both the dwarf and the kallsyms know which are the functions that
> have a real existence, so the tool can work it out.  It has a real
> meaning too because the chosen symbol must be the parent routine of all
> the nested inlines.

Hmm, here is what I got with your patch;
$ stap --kelf -e 'probe kernel.function("do_open"){}' -p2
# probes
kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
kernel.function("do_open@fs/block_dev.c:928") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x55> */ /* <- kernel.function("do_open") */
kernel.function("do_open@ipc/mqueue.c:630") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */

Without your patch;
$ stap -e 'probe kernel.function("do_open"){}' -p2
# probes
kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=0x10382 */ /* <- kernel.function("do_open") */
kernel.function("do_open@fs/block_dev.c:928") /* pc=0xa0750 */ /* <- kernel.function("do_open") */
kernel.function("do_open@fs/nfsctl.c:24") /* pc=0xa6411 */ /* <- kernel.function("do_open") */
kernel.function("do_open@ipc/mqueue.c:630") /* pc=0xc55a6 */ /* <- kernel.function("do_open") */

Obviously, the 3rd "do_open" is fully inlined, so it can be
correctly handled by kprobes, because it has different
symbol(sys_nfsservctl). However, other "do_open" have
same symbol(do_open). these will be put on same
address (at the first symbol on kallsyms list).

So, we need a bridge for the gap of function addresses
between kallsyms and dwarf.

[...]
>> Could we provide a separated GPL'd interface to access named global
>> symbols which is based on kallsyms?
> 
> Yes, I think so ... it's just a case of working out what and how; but to
> do that we need a consumer of the interface.

I agree with you, we need to change both of systemtap and kernel.

Thank you,

> 
> James
> 
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17  0:07     ` Masami Hiramatsu
@ 2008-07-17  1:50       ` James Bottomley
  2008-07-17 14:18         ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-17  1:50 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, systemtap

On Wed, 2008-07-16 at 20:05 -0400, Masami Hiramatsu wrote:
> James Bottomley wrote:
> > On Wed, 2008-07-16 at 18:40 -0400, Masami Hiramatsu wrote:
> >> James Bottomley wrote:
> >>> One of the big nasties of systemtap is the way it tries to embed
> >>> virtually the entirety of the kernel symbol table in the probe modules
> >>> it constructs.  This is highly undesirable because it represents a
> >>> subversion of the kernel API to gain access to unexported symbols.  At
> >>> least for kprobes, the correct way to do this is to specify the probe
> >>> point by symbol and offset.
> >>>
> >>> This patch converts systemtap to use the correct kprobe
> >>> symbol_name/offset pair to identify the probe location.
> >> Hi James,
> >>
> >> I think your suggestion is a good step. Of course, it might
> >> have to solve some issues.
> >>
> >> Unfortunately, current kprobe's symbol_name interface is not
> >> so clever. For example, if you specify a static function
> >> which is defined at several places in the kernel(ex. do_open),
> >> it always pick up the first one in kallsyms, even if systemtap
> >> can find all of those functions.
> >> (you can find many duplicated symbols in /proc/kallsyms)
> > 
> > Right, but realistically only functions which have a strict existence
> > (i.e. those for whom an address could be taken) can be used; functions
> > which are fully inlined (as in have no separate existence) can't.
> > That's why the patch finds the closest function with an address to match
> > on.
> 
> Sure, inlined functions are embedded in a caller function, so the
> closest function is the correct owner.
> 
> However, I meant local-scope functions can have same name if
> they are defined in different scope. And even though, both of
> them are shown in kallsyms. This mean, you can see the functions
> which have real different existence but have same symbol.
> 
> Would you mean systemtap should not probe those name-conflicted
> functions?

Actually, I wasn't aware we had any.

> >> So, we might better improve kallsyms to treat this case
> >> and find what is a better way to specify symbols and addresses.
> > 
> > Well, both the dwarf and the kallsyms know which are the functions that
> > have a real existence, so the tool can work it out.  It has a real
> > meaning too because the chosen symbol must be the parent routine of all
> > the nested inlines.
> 
> Hmm, here is what I got with your patch;
> $ stap --kelf -e 'probe kernel.function("do_open"){}' -p2
> # probes
> kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/block_dev.c:928") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x55> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@ipc/mqueue.c:630") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
> 
> Without your patch;
> $ stap -e 'probe kernel.function("do_open"){}' -p2
> # probes
> kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=0x10382 */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/block_dev.c:928") /* pc=0xa0750 */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/nfsctl.c:24") /* pc=0xa6411 */ /* <- kernel.function("do_open") */
> kernel.function("do_open@ipc/mqueue.c:630") /* pc=0xc55a6 */ /* <- kernel.function("do_open") */
> 
> Obviously, the 3rd "do_open" is fully inlined, so it can be
> correctly handled by kprobes, because it has different
> symbol(sys_nfsservctl). However, other "do_open" have
> same symbol(do_open). these will be put on same
> address (at the first symbol on kallsyms list).
> 
> So, we need a bridge for the gap of function addresses
> between kallsyms and dwarf.

You mean this particular problem:

hobholes:/home/jejb/git/BUILD-2.6# grep do_open /proc/kallsyms 
c01af160 t do_open
c01d5d40 t do_open

It's certainly a material defect in the current API.  I'll think about
it and see if I can come up with a solution.

> [...]
> >> Could we provide a separated GPL'd interface to access named global
> >> symbols which is based on kallsyms?
> > 
> > Yes, I think so ... it's just a case of working out what and how; but to
> > do that we need a consumer of the interface.
> 
> I agree with you, we need to change both of systemtap and kernel.
> 
> Thank you,

You're welcome.

James


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17  1:50       ` James Bottomley
@ 2008-07-17 14:18         ` James Bottomley
  2008-07-17 16:59           ` James Bottomley
       [not found]           ` <1216313914.5515.25.camel__21144.9282979176$1216314027$gmane$org@localhost.localdomain>
  0 siblings, 2 replies; 49+ messages in thread
From: James Bottomley @ 2008-07-17 14:18 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, systemtap

On Wed, 2008-07-16 at 20:49 -0500, James Bottomley wrote:
> On Wed, 2008-07-16 at 20:05 -0400, Masami Hiramatsu wrote:
> > Hmm, here is what I got with your patch;
> > $ stap --kelf -e 'probe kernel.function("do_open"){}' -p2
> > # probes
> > kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/block_dev.c:928") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x55> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@ipc/mqueue.c:630") /* pc=<do_open+0x0> */ /* <- kernel.function("do_open") */
> > 
> > Without your patch;
> > $ stap -e 'probe kernel.function("do_open"){}' -p2
> > # probes
> > kernel.function("do_open@arch/x86/kernel/apm_32.c:1557") /* pc=0x10382 */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/block_dev.c:928") /* pc=0xa0750 */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/nfsctl.c:24") /* pc=0xa6411 */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@ipc/mqueue.c:630") /* pc=0xc55a6 */ /* <- kernel.function("do_open") */
> > 
> > Obviously, the 3rd "do_open" is fully inlined, so it can be
> > correctly handled by kprobes, because it has different
> > symbol(sys_nfsservctl). However, other "do_open" have
> > same symbol(do_open). these will be put on same
> > address (at the first symbol on kallsyms list).
> > 
> > So, we need a bridge for the gap of function addresses
> > between kallsyms and dwarf.
> 
> You mean this particular problem:
> 
> hobholes:/home/jejb/git/BUILD-2.6# grep do_open /proc/kallsyms 
> c01af160 t do_open
> c01d5d40 t do_open
> 
> It's certainly a material defect in the current API.  I'll think about
> it and see if I can come up with a solution.

OK, thought about it.  There seem to be two possible solutions

     1. Get systemtap always to offset from non-static functions.  This
        will use the standard linker to ensure uniqueness (a module
        qualifier will still need to be added to the struct kprobe for
        lookup, since modules can duplicate unexported kernel symbols).
     2. Add the filename as a discriminator for duplicate symbols in the
        kallsyms program (would still need module qualifier).  This is
        appealing because the path name would be printed in the kernel
        trace to help with oops tracking

This is where negotiations come in.  To me 2. looks to be better because
it will help us with oops tracking.  On the other hand, it's usually
pretty obvious from the stack trace context which files the duplicate
symbols are actually in; what do other people think?

James


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 14:18         ` James Bottomley
@ 2008-07-17 16:59           ` James Bottomley
  2008-07-17 21:38             ` Masami Hiramatsu
       [not found]           ` <1216313914.5515.25.camel__21144.9282979176$1216314027$gmane$org@localhost.localdomain>
  1 sibling, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-17 16:59 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, systemtap

On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
> OK, thought about it.  There seem to be two possible solutions
> 
>      1. Get systemtap always to offset from non-static functions.  This
>         will use the standard linker to ensure uniqueness (a module
>         qualifier will still need to be added to the struct kprobe for
>         lookup, since modules can duplicate unexported kernel symbols).
>      2. Add the filename as a discriminator for duplicate symbols in the
>         kallsyms program (would still need module qualifier).  This is
>         appealing because the path name would be printed in the kernel
>         trace to help with oops tracking
> 
> This is where negotiations come in.  To me 2. looks to be better because
> it will help us with oops tracking.  On the other hand, it's usually
> pretty obvious from the stack trace context which files the duplicate
> symbols are actually in; what do other people think?

Just by way of illustration, this is systemtap fixed up according to
suggestion number 1.  You can see now using your test case that we get:

# probes
kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */

James

---

From 0203b75a9ca97fc7463e6372baee897d1029b799 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Tue, 15 Jul 2008 13:25:00 -0500
Subject: Part1 use symbol_name/offset to locate dwarf probes

---
 tapsets.cxx |  119 +++++++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 83 insertions(+), 36 deletions(-)

diff --git a/tapsets.cxx b/tapsets.cxx
index 4d1e469..da198c9 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -491,6 +491,7 @@ func_info
   Dwarf_Addr entrypc;
   Dwarf_Addr prologue_end;
   bool weak;
+  bool global;
   // Comparison functor for list of functions sorted by address. The
   // two versions that take a Dwarf_Addr let us use the STL algorithms
   // upper_bound, equal_range et al., but we don't know whether the
@@ -591,6 +592,7 @@ symbol_table
   module_info *mod_info;	// associated module
   map<string, func_info*> map_by_name;
   vector<func_info*> list_by_addr;
+  vector<func_info*> global_list_by_addr;
   typedef vector<func_info*>::iterator iterator_t;
   typedef pair<iterator_t, iterator_t> range_t;
 #ifdef __powerpc__
@@ -598,7 +600,7 @@ symbol_table
 #endif
   // add_symbol doesn't leave symbol table in order; call
   // symbol_table::sort() when done adding symbols.
-  void add_symbol(const char *name, bool weak, Dwarf_Addr addr,
+  void add_symbol(const char *name, bool weak, bool global, Dwarf_Addr addr,
                                                Dwarf_Addr *high_addr);
   void sort();
   enum info_status read_symbols(FILE *f, const string& path);
@@ -611,7 +613,7 @@ symbol_table
   void purge_syscall_stubs();
   func_info *lookup_symbol(const string& name);
   Dwarf_Addr lookup_symbol_address(const string& name);
-  func_info *get_func_containing_address(Dwarf_Addr addr);
+  func_info *get_func_containing_address(Dwarf_Addr addr, bool global);
 
   symbol_table(module_info *mi) : mod_info(mi) {}
   ~symbol_table();
@@ -2306,13 +2308,15 @@ struct dwarf_derived_probe: public derived_probe
                        const string& module,
                        const string& section,
 		       Dwarf_Addr dwfl_addr,
-		       Dwarf_Addr addr,
+		       string symbol,
+		       unsigned int offset,
 		       dwarf_query & q,
                        Dwarf_Die* scope_die);
 
   string module;
   string section;
-  Dwarf_Addr addr;
+  string kprobe_symbol;
+  unsigned int kprobe_offset;
   bool has_return;
   bool has_maxactive;
   long maxactive_val;
@@ -2835,7 +2839,7 @@ dwarf_query::query_module_symtab()
       // Find the "function" in which the indicated address resides.
       Dwarf_Addr addr =
       		(has_function_num ? function_num_val : statement_num_val);
-      fi = sym_table->get_func_containing_address(addr);
+      fi = sym_table->get_func_containing_address(addr, false);
       if (!fi)
         {
           cerr << "Warning: address "
@@ -3260,9 +3264,18 @@ dwarf_query::add_probe_point(const string& funcname,
 
   if (! bad)
     {
+      struct module_info *mi = dw.mod_info;
+      if (!mi->sym_table)
+	mi->get_symtab(this);
+      struct symbol_table *sym_tab = mi->sym_table;
+      func_info *symbol = sym_tab->get_func_containing_address(addr, true);
+
       sess.unwindsym_modules.insert (module);
       probe = new dwarf_derived_probe(funcname, filename, line,
-                                      module, reloc_section, addr, reloc_addr, *this, scope_die);
+                                      module, reloc_section, reloc_addr,
+				      symbol->name,
+				      (unsigned int)(addr - symbol->addr),
+				      *this, scope_die);
       results.push_back(probe);
     }
 }
@@ -4380,7 +4393,8 @@ dwarf_derived_probe::printsig (ostream& o) const
   // function instances.  This is distinct from the verbose/clog
   // output, since this part goes into the cache hash calculations.
   sole_location()->print (o);
-  o << " /* pc=0x" << hex << addr << dec << " */";
+  o << " /* pc=<" << kprobe_symbol << "+0x" << hex
+    << kprobe_offset << dec << "> */";
   printsig_nested (o);
 }
 
@@ -4406,22 +4420,26 @@ dwarf_derived_probe::dwarf_derived_probe(const string& funcname,
                                          // NB: dwfl_addr is the virtualized
                                          // address for this symbol.
                                          Dwarf_Addr dwfl_addr,
-                                         // addr is the section-offset for
-                                         // actual relocation.
-                                         Dwarf_Addr addr,
+                                         // symbol is the closest known symbol
+                                         // and offset is the offset from the symbol
+					 string symbol,
+					 unsigned int offset,
                                          dwarf_query& q,
                                          Dwarf_Die* scope_die /* may be null */)
   : derived_probe (q.base_probe, new probe_point(*q.base_loc) /* .components soon rewritten */ ),
-    module (module), section (section), addr (addr),
+    module (module), section (section), kprobe_symbol(symbol),
+    kprobe_offset(offset),
     has_return (q.has_return),
     has_maxactive (q.has_maxactive),
     maxactive_val (q.maxactive_val)
 {
   // Assert relocation invariants
+#if 0
   if (section == "" && dwfl_addr != addr) // addr should be absolute
     throw semantic_error ("missing relocation base against", q.base_loc->tok);
   if (section != "" && dwfl_addr == addr) // addr should be an offset
     throw semantic_error ("inconsistent relocation address", q.base_loc->tok);
+#endif
 
   this->tok = q.base_probe->tok;
 
@@ -4620,8 +4638,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
   // Let's find some stats for the three embedded strings.  Maybe they
   // are small and uniform enough to justify putting char[MAX]'s into
   // the array instead of relocated char*'s.
-  size_t module_name_max = 0, section_name_max = 0, pp_name_max = 0;
-  size_t module_name_tot = 0, section_name_tot = 0, pp_name_tot = 0;
+  size_t pp_name_max = 0, pp_name_tot = 0;
+  size_t symbol_name_name_max  = 0, symbol_name_name_tot = 0;
   size_t all_name_cnt = probes_by_module.size(); // for average
   for (p_b_m_iterator it = probes_by_module.begin(); it != probes_by_module.end(); it++)
     {
@@ -4630,9 +4648,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
         size_t var##_size = (expr) + 1;                 \
         var##_max = max (var##_max, var##_size);        \
         var##_tot += var##_size; } while (0)
-      DOIT(module_name, p->module.size());
-      DOIT(section_name, p->section.size());
       DOIT(pp_name, lex_cast_qstring(*p->sole_location()).size());
+      DOIT(symbol_name_name, p->kprobe_symbol.size());
 #undef DOIT
     }
 
@@ -4652,11 +4669,10 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
       if (s.verbose > 2) clog << "stap_dwarf_probe *" << #var << endl;  \
     }
 
-  CALCIT(module);
-  CALCIT(section);
   CALCIT(pp);
+  CALCIT(symbol_name);
 
-  s.op->newline() << "const unsigned long address;";
+  s.op->newline() << "unsigned int offset;";
   s.op->newline() << "void (* const ph) (struct context*);";
   s.op->newline(-1) << "} stap_dwarf_probes[] = {";
   s.op->indent(1);
@@ -4673,9 +4689,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
           assert (p->maxactive_val >= 0 && p->maxactive_val <= USHRT_MAX);
           s.op->line() << " .maxactive_val=" << p->maxactive_val << ",";
         }
-      s.op->line() << " .address=0x" << hex << p->addr << dec << "UL,";
-      s.op->line() << " .module=\"" << p->module << "\",";
-      s.op->line() << " .section=\"" << p->section << "\",";
+      s.op->line() << " .symbol_name=\"" << p->kprobe_symbol << "\",";
+      s.op->line() << " .offset=0x" << hex << p->kprobe_offset << dec << ",";
       s.op->line() << " .pp=" << lex_cast_qstring (*p->sole_location()) << ",";
       s.op->line() << " .ph=&" << p->name;
       s.op->line() << " },";
@@ -4735,11 +4750,10 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
   s.op->newline() << "for (i=0; i<" << probes_by_module.size() << "; i++) {";
   s.op->newline(1) << "struct stap_dwarf_probe *sdp = & stap_dwarf_probes[i];";
   s.op->newline() << "struct stap_dwarf_kprobe *kp = & stap_dwarf_kprobes[i];";
-  s.op->newline() << "unsigned long relocated_addr = _stp_module_relocate (sdp->module, sdp->section, sdp->address);";
-  s.op->newline() << "if (relocated_addr == 0) continue;"; // quietly; assume module is absent
   s.op->newline() << "probe_point = sdp->pp;";
   s.op->newline() << "if (sdp->return_p) {";
-  s.op->newline(1) << "kp->u.krp.kp.addr = (void *) relocated_addr;";
+  s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
+  s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
   s.op->newline() << "if (sdp->maxactive_p) {";
   s.op->newline(1) << "kp->u.krp.maxactive = sdp->maxactive_val;";
   s.op->newline(-1) << "} else {";
@@ -4748,7 +4762,8 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
   s.op->newline() << "kp->u.krp.handler = &enter_kretprobe_probe;";
   s.op->newline() << "rc = register_kretprobe (& kp->u.krp);";
   s.op->newline(-1) << "} else {";
-  s.op->newline(1) << "kp->u.kp.addr = (void *) relocated_addr;";
+  s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
+  s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
   s.op->newline() << "kp->u.kp.pre_handler = &enter_kprobe_probe;";
   s.op->newline() << "rc = register_kprobe (& kp->u.kp);";
   s.op->newline(-1) << "}";
@@ -4885,12 +4900,20 @@ dwarf_builder::build(systemtap_session & sess,
           throw semantic_error ("absolute statement probe in unprivileged script", q.base_probe->tok);
         }
 
+      struct module_info *mi = dw->mod_info;
+      if (!mi->sym_table)
+	mi->get_symtab(&q);
+      struct symbol_table *sym_tab = mi->sym_table;
+      func_info *symbol = sym_tab->get_func_containing_address(q.statement_num_val, true);
+
       // For kernel.statement(NUM).absolute probe points, we bypass
       // all the debuginfo stuff: We just wire up a
       // dwarf_derived_probe right here and now.
       dwarf_derived_probe* p =
         new dwarf_derived_probe ("", "", 0, "kernel", "",
-                                 q.statement_num_val, q.statement_num_val,
+                                 q.statement_num_val,
+				 symbol->name,
+				 (unsigned int)(q.statement_num_val - symbol->addr),
                                  q, 0);
       finished_results.push_back (p);
       sess.unwindsym_modules.insert ("kernel");
@@ -4908,8 +4931,8 @@ symbol_table::~symbol_table()
 }
 
 void
-symbol_table::add_symbol(const char *name, bool weak, Dwarf_Addr addr,
-                                                      Dwarf_Addr *high_addr)
+symbol_table::add_symbol(const char *name, bool weak, bool global, 
+			 Dwarf_Addr addr, Dwarf_Addr *high_addr)
 {
 #ifdef __powerpc__
   // Map ".sys_foo" to "sys_foo".
@@ -4920,10 +4943,13 @@ symbol_table::add_symbol(const char *name, bool weak, Dwarf_Addr addr,
   fi->addr = addr;
   fi->name = name;
   fi->weak = weak;
+  fi->global = global;
   map_by_name[fi->name] = fi;
   // TODO: Use a multimap in case there are multiple static
   // functions with the same name?
   list_by_addr.push_back(fi);
+  if (global)
+    global_list_by_addr.push_back(fi);
 }
 
 enum info_status
@@ -4961,7 +4987,8 @@ symbol_table::read_symbols(FILE *f, const string& path)
           break;
         }
       if (type == 'T' || type == 't' || type == 'W')
-        add_symbol(name, (type == 'W'), (Dwarf_Addr) addr, &high_addr);
+        add_symbol(name, (type == 'W'), (type == 'T'),
+		   (Dwarf_Addr) addr, &high_addr);
     }
 
   if (list_by_addr.size() < 1)
@@ -5080,7 +5107,8 @@ symbol_table::get_from_elf()
       if (name && GELF_ST_TYPE(sym.st_info) == STT_FUNC &&
                                                    !reject_section(section))
         add_symbol(name, (GELF_ST_BIND(sym.st_info) == STB_WEAK),
-                                              sym.st_value, &high_addr);
+		   GELF_ST_BIND(sym.st_info) == STB_GLOBAL,
+		   sym.st_value, &high_addr);
     }
   sort();
   return info_present;
@@ -5121,14 +5149,29 @@ symbol_table::mark_dwarf_redundancies(dwflpp *dw)
 }
 
 func_info *
-symbol_table::get_func_containing_address(Dwarf_Addr addr)
+symbol_table::get_func_containing_address(Dwarf_Addr addr, bool global)
 {
-  iterator_t iter = upper_bound(list_by_addr.begin(), list_by_addr.end(), addr,
-                              func_info::Compare());
-  if (iter == list_by_addr.begin())
-    return NULL;
+  iterator_t iter;
+
+  if (global)
+    {
+      iter = upper_bound(global_list_by_addr.begin(),
+			 global_list_by_addr.end(), addr,
+			 func_info::Compare());
+      if (iter == global_list_by_addr.begin())
+	return NULL;
+      else
+	return *(iter - 1);
+    }
   else
-    return *(iter - 1);
+    {
+      iter = upper_bound(list_by_addr.begin(), list_by_addr.end(), addr,
+			 func_info::Compare());
+      if (iter == list_by_addr.begin())
+	return NULL;
+      else
+	return *(iter - 1);
+    }
 }
 
 func_info *
@@ -5181,12 +5224,16 @@ symbol_table::purge_syscall_stubs()
   list_by_addr.erase(remove(purge_range.first, purge_range.second,
                             (func_info*)0),
                      purge_range.second);
+  // NOTE: At the moment global_list_by_addr has no weak symbols
+  // so nothing needs to be removed from it.
 }
 
 void
 symbol_table::sort()
 {
   stable_sort(list_by_addr.begin(), list_by_addr.end(), func_info::Compare());
+  stable_sort(global_list_by_addr.begin(), global_list_by_addr.end(),
+	      func_info::Compare());
 }
 
 void
-- 
1.5.6



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
       [not found]           ` <1216313914.5515.25.camel__21144.9282979176$1216314027$gmane$org@localhost.localdomain>
@ 2008-07-17 18:32             ` Frank Ch. Eigler
  2008-07-17 20:13               ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-17 18:32 UTC (permalink / raw)
  To: James Bottomley; +Cc: Masami Hiramatsu, linux-kernel, systemtap

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> [...]
> Just by way of illustration, this is systemtap fixed up according to
> suggestion number 1.  You can see now using your test case that we get:
>
> # probes
> kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */
> [...]

Can you explain in detail how you believe this is materially
different from offsetting from _stext?

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 18:32             ` Frank Ch. Eigler
@ 2008-07-17 20:13               ` James Bottomley
  2008-07-17 20:28                 ` Frank Ch. Eigler
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-17 20:13 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Masami Hiramatsu, linux-kernel, systemtap

On Thu, 2008-07-17 at 14:30 -0400, Frank Ch. Eigler wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > [...]
> > Just by way of illustration, this is systemtap fixed up according to
> > suggestion number 1.  You can see now using your test case that we get:
> >
> > # probes
> > kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */
> > [...]
> 
> Can you explain in detail how you believe this is materially
> different from offsetting from _stext?

Basically because _stext is an incredibly dangerous symbol; being linker
generated it doesn't actually get put in the right place if you look:

jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
ffffffff80209000 T _stext
ffffffff80200000 A _text

Since we can't do negative offsets, you've lost access to the symbols in
the sections that start before _stext.  Assuming you meant _text (which
is dangerous because it's a define in the kernel linker script and could
change).  Then you can't offset into other sections, like init sections
or modules.

James

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 20:13               ` James Bottomley
@ 2008-07-17 20:28                 ` Frank Ch. Eigler
  2008-07-17 21:06                   ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-17 20:28 UTC (permalink / raw)
  To: James Bottomley; +Cc: Masami Hiramatsu, linux-kernel, systemtap

Hi -

On Thu, Jul 17, 2008 at 03:12:26PM -0500, James Bottomley wrote:
> [...]
> > Can you explain in detail how you believe this is materially
> > different from offsetting from _stext?
> 
> Basically because _stext is an incredibly dangerous symbol; being linker
> generated it doesn't actually get put in the right place if you look:

Thank you for your response.

> jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
> ffffffff80209000 T _stext
> ffffffff80200000 A _text
> 
> Since we can't do negative offsets

Actually, "we" as in systemtap could do it just fine if that were
desired.  And really _stext is therefore an arbitrary choice - it
could be any other reference.

My point is that the proposed effort to identify a nearby function
symbol to use as a base for each probe's symbol+offset calculation is
wasted.

> you've lost access to the symbols in the sections that start before _stext.  

What's between _text and _stext appears to consist of kernel boot-time
functions that are unmapped the time anything like systemtap could
run.

> Assuming you meant _text (which is dangerous because it's a define
> in the kernel linker script and could change).

By "dangerous" do you only mean that it may require a one-liner
catch-up patch in systemtap if the kernel linker scripts change?

> Then you can't offset into other sections, like init sections or
> modules.

Kernel init sections are unprobeable by definition, so that doesn't
matter.  Modules are also irrelevant, since their addresses are
relative to their relocation bases / sections, not to a kernel
(vmlinux) symbol.

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 20:28                 ` Frank Ch. Eigler
@ 2008-07-17 21:06                   ` James Bottomley
  2008-07-17 21:35                     ` Frank Ch. Eigler
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-17 21:06 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Masami Hiramatsu, linux-kernel, systemtap

On Thu, 2008-07-17 at 16:26 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Thu, Jul 17, 2008 at 03:12:26PM -0500, James Bottomley wrote:
> > [...]
> > > Can you explain in detail how you believe this is materially
> > > different from offsetting from _stext?
> > 
> > Basically because _stext is an incredibly dangerous symbol; being linker
> > generated it doesn't actually get put in the right place if you look:
> 
> Thank you for your response.
> 
> > jejb@sparkweed> nm vmlinux |egrep -w '_stext|_text'
> > ffffffff80209000 T _stext
> > ffffffff80200000 A _text
> > 
> > Since we can't do negative offsets
> 
> Actually, "we" as in systemtap could do it just fine if that were
> desired.  And really _stext is therefore an arbitrary choice - it
> could be any other reference.
> 
> My point is that the proposed effort to identify a nearby function
> symbol to use as a base for each probe's symbol+offset calculation is
> wasted.

It's not exactly wasted ... the calculations have to be done anyway for
modules.

> > you've lost access to the symbols in the sections that start before _stext.  
> 
> What's between _text and _stext appears to consist of kernel boot-time
> functions that are unmapped the time anything like systemtap could
> run.

Well, no, they're the head code.  It's actually used in CPU boot and
tear down, one of the things it's useful to probe, I think.

> > Assuming you meant _text (which is dangerous because it's a define
> > in the kernel linker script and could change).
> 
> By "dangerous" do you only mean that it may require a one-liner
> catch-up patch in systemtap if the kernel linker scripts change?

Dangerous as in it's not necessarily part of the kernel linker scripts.
Some arches have it defined as a symbol, some have it as a linker script
definition ... that's why it's location is strange.

The point, really, is to remove some of the fragile dependencies between
systemtap and the kernel.

> > Then you can't offset into other sections, like init sections or
> > modules.
> 
> Kernel init sections are unprobeable by definition, so that doesn't
> matter.  Modules are also irrelevant, since their addresses are
> relative to their relocation bases / sections, not to a kernel
> (vmlinux) symbol.

Then the definition needs altering.  I can see that the industrial
customers aren't interested but kernel developers are ... a lot of
problems occur in the init sections.

I think you'll find that systemtap will run quite happily from a shell
in an initramfs before the init sections are discarded.  Plus there's
always module init sections which can appear at any time.

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 21:06                   ` James Bottomley
@ 2008-07-17 21:35                     ` Frank Ch. Eigler
  2008-07-17 22:05                       ` Masami Hiramatsu
  2008-07-22 18:00                       ` Rik van Riel
  0 siblings, 2 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-17 21:35 UTC (permalink / raw)
  To: James Bottomley; +Cc: Masami Hiramatsu, linux-kernel, systemtap

Hi -


On Thu, Jul 17, 2008 at 04:06:09PM -0500, James Bottomley wrote:

> [...]
> > My point is that the proposed effort to identify a nearby function
> > symbol to use as a base for each probe's symbol+offset calculation is
> > wasted.
> 
> It's not exactly wasted ... the calculations have to be done anyway for
> modules.

Not really - we just anchor off a different (per-module) reference
symbol or address.  At the moment, we use the .text* section bases.


> > > you've lost access to the symbols in the sections that start before _stext.  
> > 
> > What's between _text and _stext appears to consist of kernel boot-time
> > functions that are unmapped the time anything like systemtap could
> > run.
> 
> Well, no, they're the head code.  It's actually used in CPU boot and
> tear down, one of the things it's useful to probe, I think.

Fair enough - conceivably probing that stuff is useful, as is module
initialization.  We don't try to do it yet (and indeed kprobes blocks
it all).

In any case, the method of probe address calculation doesn't affect
that issue.  We can calculate .init* addresses relative to any
convenient reference in exactly the same way as non-.init addresses.


> > > Assuming you meant _text (which is dangerous because it's a define
> > > in the kernel linker script and could change).
> > 
> > By "dangerous" do you only mean that it may require a one-liner
> > catch-up patch in systemtap if the kernel linker scripts change?
> 
> Dangerous as in it's not necessarily part of the kernel linker scripts.
> [...]
> The point, really, is to remove some of the fragile dependencies between
> systemtap and the kernel.

Yes, that is generally desirable - each case is usually a question of
cost/benefit.  One significant requirement for us is to keep working
with older kernels.


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 16:59           ` James Bottomley
@ 2008-07-17 21:38             ` Masami Hiramatsu
  2008-07-17 22:03               ` James Bottomley
  2008-07-21 14:21               ` James Bottomley
  0 siblings, 2 replies; 49+ messages in thread
From: Masami Hiramatsu @ 2008-07-17 21:38 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

James Bottomley wrote:
> On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
>> OK, thought about it.  There seem to be two possible solutions
>>
>>      1. Get systemtap always to offset from non-static functions.  This
>>         will use the standard linker to ensure uniqueness (a module
>>         qualifier will still need to be added to the struct kprobe for
>>         lookup, since modules can duplicate unexported kernel symbols).
>>      2. Add the filename as a discriminator for duplicate symbols in the
>>         kallsyms program (would still need module qualifier).  This is
>>         appealing because the path name would be printed in the kernel
>>         trace to help with oops tracking
>>
>> This is where negotiations come in.  To me 2. looks to be better because
>> it will help us with oops tracking.  On the other hand, it's usually
>> pretty obvious from the stack trace context which files the duplicate
>> symbols are actually in; what do other people think?
> 
> Just by way of illustration, this is systemtap fixed up according to
> suggestion number 1.  You can see now using your test case that we get:
> 
> # probes
> kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */

Hi James,

Thank you for updating the patch.
Unfortunately, I found another scenario; if someone make a module which
has EXPORT_SYMBOL(do_open), it's a non-static function. but there are
other static version do_open in kallsyms.
Here, I tested it and got below;

$ stap --kelf -e 'probe module("test").function("do_open"){}' -p2
# probes
module("test").function("do_open@?") /* pc=<do_open+0x0> */ /* <- module("test").function("do_open") */

And I think similar issue will occur even if it is embedded in vmlinux.

By the way, can this patch solve the issue of -ffunction-sections?

Anyway, I think we still need solution no.2.


Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 21:38             ` Masami Hiramatsu
@ 2008-07-17 22:03               ` James Bottomley
  2008-07-21 14:21               ` James Bottomley
  1 sibling, 0 replies; 49+ messages in thread
From: James Bottomley @ 2008-07-17 22:03 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, systemtap

On Thu, 2008-07-17 at 17:36 -0400, Masami Hiramatsu wrote:
> James Bottomley wrote:
> > On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
> >> OK, thought about it.  There seem to be two possible solutions
> >>
> >>      1. Get systemtap always to offset from non-static functions.  This
> >>         will use the standard linker to ensure uniqueness (a module
> >>         qualifier will still need to be added to the struct kprobe for
> >>         lookup, since modules can duplicate unexported kernel symbols).
> >>      2. Add the filename as a discriminator for duplicate symbols in the
> >>         kallsyms program (would still need module qualifier).  This is
> >>         appealing because the path name would be printed in the kernel
> >>         trace to help with oops tracking
> >>
> >> This is where negotiations come in.  To me 2. looks to be better because
> >> it will help us with oops tracking.  On the other hand, it's usually
> >> pretty obvious from the stack trace context which files the duplicate
> >> symbols are actually in; what do other people think?
> > 
> > Just by way of illustration, this is systemtap fixed up according to
> > suggestion number 1.  You can see now using your test case that we get:
> > 
> > # probes
> > kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */
> 
> Hi James,
> 
> Thank you for updating the patch.
> Unfortunately, I found another scenario; if someone make a module which
> has EXPORT_SYMBOL(do_open), it's a non-static function. but there are
> other static version do_open in kallsyms.
> Here, I tested it and got below;
> 
> $ stap --kelf -e 'probe module("test").function("do_open"){}' -p2
> # probes
> module("test").function("do_open@?") /* pc=<do_open+0x0> */ /* <- module("test").function("do_open") */
> 
> And I think similar issue will occur even if it is embedded in vmlinux.

Actually, no.  This is only a module problem ... it's triggered by the
fact that the module namespace is different from the kernel's global
namespace.  To get around this, I think the actual module (or null for
kernel) has to become an extra parameter to struct kprobe.

> By the way, can this patch solve the issue of -ffunction-sections?

Actually not entirely, no, if we go for only global symbols.  The
compiler is entitled to spit out a section even for a static function as
long as it has a real body.  If the module loader insterts stubs then
even an offset from a nearby function could end up being wrong

> Anyway, I think we still need solution no.2.

I'll cook up a patch and run it by lkml to try to gauge the reaction.

James


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 21:35                     ` Frank Ch. Eigler
@ 2008-07-17 22:05                       ` Masami Hiramatsu
  2008-07-22 18:00                       ` Rik van Riel
  1 sibling, 0 replies; 49+ messages in thread
From: Masami Hiramatsu @ 2008-07-17 22:05 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: James Bottomley, linux-kernel, systemtap

Frank Ch. Eigler wrote:
> Hi -
> 
> 
> On Thu, Jul 17, 2008 at 04:06:09PM -0500, James Bottomley wrote:
> 
>> [...]
>>> My point is that the proposed effort to identify a nearby function
>>> symbol to use as a base for each probe's symbol+offset calculation is
>>> wasted.
>> It's not exactly wasted ... the calculations have to be done anyway for
>> modules.
> 
> Not really - we just anchor off a different (per-module) reference
> symbol or address.  At the moment, we use the .text* section bases.
> 
> 
>>>> you've lost access to the symbols in the sections that start before _stext.  
>>> What's between _text and _stext appears to consist of kernel boot-time
>>> functions that are unmapped the time anything like systemtap could
>>> run.
>> Well, no, they're the head code.  It's actually used in CPU boot and
>> tear down, one of the things it's useful to probe, I think.
> 
> Fair enough - conceivably probing that stuff is useful, as is module
> initialization.  We don't try to do it yet (and indeed kprobes blocks
> it all).
> 
> In any case, the method of probe address calculation doesn't affect
> that issue.  We can calculate .init* addresses relative to any
> convenient reference in exactly the same way as non-.init addresses.
> 
> 
>>>> Assuming you meant _text (which is dangerous because it's a define
>>>> in the kernel linker script and could change).
>>> By "dangerous" do you only mean that it may require a one-liner
>>> catch-up patch in systemtap if the kernel linker scripts change?
>> Dangerous as in it's not necessarily part of the kernel linker scripts.
>> [...]
>> The point, really, is to remove some of the fragile dependencies between
>> systemtap and the kernel.
> 
> Yes, that is generally desirable - each case is usually a question of
> cost/benefit.  One significant requirement for us is to keep working
> with older kernels.

Hi Frank,

I know we'd better archive that requirement. However, if we lose support
from developers because we are too much focusing on that, we'll also
lose the future of systemtap itself. We have to see the cost/benefit
from the long-term of view.

Could we separate systemtap parser/elaborator and code generator
to support both of old and new kernels?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 18:33 [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) James Bottomley
  2008-07-16 22:42 ` Masami Hiramatsu
@ 2008-07-18  9:11 ` Andi Kleen
  2008-07-18  9:23   ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2008-07-18  9:11 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap, jbeulich

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> One of the big nasties of systemtap is the way it tries to embed
> virtually the entirety of the kernel symbol table in the probe modules
> it constructs.  This is highly undesirable because it represents a
> subversion of the kernel API to gain access to unexported symbols.  At
> least for kprobes, the correct way to do this is to specify the probe
> point by symbol and offset.
>
> This patch converts systemtap to use the correct kprobe
> symbol_name/offset pair to identify the probe location.
>
> This only represents a baby step:  after this is done, there are at
> least three other consumers of the systemtap module relocation
> machinery:
>
>      1. unwind information.  I think the consumers of this can be
>         converted to use the arch specific unwinders that already exist
>         within the kernel


Right now x86 doesn't really have a good reliable unwinder that
works without frame pointer. I think systemtap
recently switched to Jan Beulich's dwarf2 unwinder. Before
switching to the in kernel unwinder that one would need to be 
re-merged again.

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18  9:11 ` [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) Andi Kleen
@ 2008-07-18  9:23   ` Peter Zijlstra
  2008-07-18 10:31     ` Andi Kleen
  2008-07-18 13:04     ` Frank Ch. Eigler
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2008-07-18  9:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: James Bottomley, linux-kernel, systemtap, jbeulich

On Fri, 2008-07-18 at 11:11 +0200, Andi Kleen wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > One of the big nasties of systemtap is the way it tries to embed
> > virtually the entirety of the kernel symbol table in the probe modules
> > it constructs.  This is highly undesirable because it represents a
> > subversion of the kernel API to gain access to unexported symbols.  At
> > least for kprobes, the correct way to do this is to specify the probe
> > point by symbol and offset.
> >
> > This patch converts systemtap to use the correct kprobe
> > symbol_name/offset pair to identify the probe location.
> >
> > This only represents a baby step:  after this is done, there are at
> > least three other consumers of the systemtap module relocation
> > machinery:
> >
> >      1. unwind information.  I think the consumers of this can be
> >         converted to use the arch specific unwinders that already exist
> >         within the kernel
> 
> 
> Right now x86 doesn't really have a good reliable unwinder that
> works without frame pointer. I think systemtap
> recently switched to Jan Beulich's dwarf2 unwinder. Before
> switching to the in kernel unwinder that one would need to be 
> re-merged again.

Those are two separate issues.

 1) stap ought to use the kernel's infrastructure and not re-implement
its own.

 2) if the kernel's infrastructure doesn't meet requirements, improve
it.

But while the x86 might not be perfect, its fairly ok these days. Its
not the utter piece of shite x86_64 had for a long time - today's traces
mostly make sense.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18  9:23   ` Peter Zijlstra
@ 2008-07-18 10:31     ` Andi Kleen
  2008-07-18 10:44       ` Peter Zijlstra
  2008-07-18 13:04     ` Frank Ch. Eigler
  1 sibling, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2008-07-18 10:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: James Bottomley, linux-kernel, systemtap, jbeulich

>  1) stap ought to use the kernel's infrastructure and not re-implement
> its own.
> 
>  2) if the kernel's infrastructure doesn't meet requirements, improve
> it.

No argument on either of those. Right now the kernel infrastructure
is only comparable to what systemtap overs at very high overhead
costs (see below)

> But while the x86 might not be perfect, its fairly ok these days. Its
> not the utter piece of shite x86_64 had for a long time 

Not sure what you're referring to with this. AFAIK the x86-64 unwinder
for a normal frame pointer less kernel was not any worse (or better)
than a i386 kernel without frame pointers.

- today's traces
> mostly make sense.

If you enable frame pointers? Making your complete kernel slower?
Generating much worse code on i386 by wasting >20% of its available
registers?  Getting pipeline stalls on each function call/exit on many CPUs?

Right now unfortunately there are a few rogue CONFIGs who select that
so more and more kernels have, but I found that always distateful because
enabling frame pointers has such a large impact on all kernel code, especially
on the register starved i386.

I still think the right solution eventually is to have a dwarf2 unwinder
by default for i386/x86-64 and get rid of all these nasty "select
FRAME_POINTER"s which have unfortunately sneaked in.

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 10:31     ` Andi Kleen
@ 2008-07-18 10:44       ` Peter Zijlstra
  2008-07-18 10:52         ` Andi Kleen
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2008-07-18 10:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: James Bottomley, linux-kernel, systemtap, jbeulich, arjan

On Fri, 2008-07-18 at 12:31 +0200, Andi Kleen wrote:

> > But while the x86 might not be perfect, its fairly ok these days. Its
> > not the utter piece of shite x86_64 had for a long time 
> 
> Not sure what you're referring to with this. AFAIK the x86-64 unwinder
> for a normal frame pointer less kernel was not any worse (or better)
> than a i386 kernel without frame pointers.

I hardly ever compile a kernel without frame pointers, as debugability
is top priority for me, so I'm afraid I'm not sure how bad it gets
without.

But it used to be that x86_64 was crap even with frame pointers, and
without it it was just random gibberish - Arjan fixed that up somewhere
around .25 iirc.

> - today's traces
> > mostly make sense.
> 
> If you enable frame pointers? Making your complete kernel slower?
> Generating much worse code on i386 by wasting >20% of its available
> registers?  Getting pipeline stalls on each function call/exit on many CPUs?
> 
> Right now unfortunately there are a few rogue CONFIGs who select that
> so more and more kernels have, but I found that always distateful because
> enabling frame pointers has such a large impact on all kernel code, especially
> on the register starved i386.
> 
> I still think the right solution eventually is to have a dwarf2 unwinder
> by default for i386/x86-64 and get rid of all these nasty "select
> FRAME_POINTER"s which have unfortunately sneaked in.

No argument on i386 vs frame pointers, fortunately I hardly ever build a
32bit kernel these days, 32bit hardware is disappearing fast here :-) 

As to merging the dwarf unwinder, I have no particular objection to
getting that merged as I'm rather ignorant on these matters - but from
reading emails around the last merge attempt, Linus had some strong
opinions on the matter.

Have those been resolved since?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 10:44       ` Peter Zijlstra
@ 2008-07-18 10:52         ` Andi Kleen
  0 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-07-18 10:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: James Bottomley, linux-kernel, systemtap, jbeulich, arjan

Peter Zijlstra wrote:
> On Fri, 2008-07-18 at 12:31 +0200, Andi Kleen wrote:
> 
>>> But while the x86 might not be perfect, its fairly ok these days. Its
>>> not the utter piece of shite x86_64 had for a long time 
>> Not sure what you're referring to with this. AFAIK the x86-64 unwinder
>> for a normal frame pointer less kernel was not any worse (or better)
>> than a i386 kernel without frame pointers.
> 
> I hardly ever compile a kernel without frame pointers, as debugability
> is top priority for me, so I'm afraid I'm not sure how bad it gets
> without.
> 
> But it used to be that x86_64 was crap even with frame pointers, and
> without it it was just random gibberish - Arjan fixed that up somewhere
> around .25 iirc.

Well yes it was designed for dwarf2 unwinding and Linus' revert of that
wrecked it. Also even the frame pointer walker was in the dwarf2
code, so with that removed it fell back to a somewhat dumb unwinder.

AFAIK most of the assembler code still doesn't set up frame pointers,
so you'll likely still run into problems.

> 
>> - today's traces
>>> mostly make sense.
>> If you enable frame pointers? Making your complete kernel slower?
>> Generating much worse code on i386 by wasting >20% of its available
>> registers?  Getting pipeline stalls on each function call/exit on many CPUs?
>>
>> Right now unfortunately there are a few rogue CONFIGs who select that
>> so more and more kernels have, but I found that always distateful because
>> enabling frame pointers has such a large impact on all kernel code, especially
>> on the register starved i386.
>>
>> I still think the right solution eventually is to have a dwarf2 unwinder
>> by default for i386/x86-64 and get rid of all these nasty "select
>> FRAME_POINTER"s which have unfortunately sneaked in.
> 
> No argument on i386 vs frame pointers, fortunately I hardly ever build a
> 32bit kernel these days, 32bit hardware is disappearing fast here :-) 

It hurts even on 64bit, e.g. the pipe line stall issue is there which
can be a significant part of a function call cost of a small function
of which the kernel has plenty (that was the major reason why the x86-64
ABI discouraged frame pointers BTW and encouraged unwinding). Given several
modern cores have special hardware to avoid that, but there's still a lot of the
older cores around and even on the modern cores it's not free.

Also it's ~10% of the registers there. And it has some other impact.

> As to merging the dwarf unwinder, I have no particular objection to
> getting that merged as I'm rather ignorant on these matters - but from
> reading emails around the last merge attempt, Linus had some strong
> opinions on the matter.
> 
> Have those been resolved since?

I added some checks back then to satisfy Linus. Also the dwarf2 unwinder
is shipping successfully for some time both in systemtap and in opensuse 11.0,
[although that one unfortunately fell victim to one of the rogue select
FRAME_POINTERs, sigh!] so there shouldn't be much teething problems left.

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18  9:23   ` Peter Zijlstra
  2008-07-18 10:31     ` Andi Kleen
@ 2008-07-18 13:04     ` Frank Ch. Eigler
  2008-07-18 13:09       ` Andi Kleen
                         ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-18 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, James Bottomley, linux-kernel, systemtap, jbeulich

Peter Zijlstra <peterz@infradead.org> writes:

> [...]
>> Right now x86 doesn't really have a good reliable unwinder that
>> works without frame pointer. I think systemtap
>> recently switched to Jan Beulich's dwarf2 unwinder. Before
>> switching to the in kernel unwinder that one would need to be 
>> re-merged again.
>
> Those are two separate issues.
>
> 1) stap ought to use the kernel's infrastructure and not re-implement
> its own.
> 2) if the kernel's infrastructure doesn't meet requirements, improve
> it.

They are related to the extent that readers may not realize some
implications of systemtap being/becoming a *kernel-resident* but not
*kernel-focused* tool.

For example, we're about to do unwinding/stack-traces of userspace
programs.  To what extent do you think the kernel unwinder (should one
reappear in git) would welcome patches that provide zero benefit to
the kernel, but only enable a peculiar (nonintrusive) sort of
unwinding we would need for complex userspace stacks?

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 13:04     ` Frank Ch. Eigler
@ 2008-07-18 13:09       ` Andi Kleen
  2008-07-18 13:10       ` Peter Zijlstra
  2008-07-18 13:21       ` James Bottomley
  2 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-07-18 13:09 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Peter Zijlstra, Andi Kleen, James Bottomley, linux-kernel,
	systemtap, jbeulich

> For example, we're about to do unwinding/stack-traces of userspace
> programs.  To what extent do you think the kernel unwinder (should one
> reappear in git) would welcome patches that provide zero benefit to
> the kernel, but only enable a peculiar (nonintrusive) sort of
> unwinding we would need for complex userspace stacks?

In theory if the tables are right you don't need any special
for the kernel dwarf2 unwinder.  If the tables are not right
fix them.

I believe oprofile did this already in some cases (generating
user back traces from the kernel) 

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 13:04     ` Frank Ch. Eigler
  2008-07-18 13:09       ` Andi Kleen
@ 2008-07-18 13:10       ` Peter Zijlstra
  2008-07-18 13:31         ` Frank Ch. Eigler
  2008-07-18 13:36         ` Andi Kleen
  2008-07-18 13:21       ` James Bottomley
  2 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2008-07-18 13:10 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Andi Kleen, James Bottomley, linux-kernel, systemtap, jbeulich,
	arjan, sandmann, Ingo Molnar

On Fri, 2008-07-18 at 09:02 -0400, Frank Ch. Eigler wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > [...]
> >> Right now x86 doesn't really have a good reliable unwinder that
> >> works without frame pointer. I think systemtap
> >> recently switched to Jan Beulich's dwarf2 unwinder. Before
> >> switching to the in kernel unwinder that one would need to be 
> >> re-merged again.
> >
> > Those are two separate issues.
> >
> > 1) stap ought to use the kernel's infrastructure and not re-implement
> > its own.
> > 2) if the kernel's infrastructure doesn't meet requirements, improve
> > it.
> 
> They are related to the extent that readers may not realize some
> implications of systemtap being/becoming a *kernel-resident* but not
> *kernel-focused* tool.
> 
> For example, we're about to do unwinding/stack-traces of userspace
> programs.  To what extent do you think the kernel unwinder (should one
> reappear in git) would welcome patches that provide zero benefit to
> the kernel, but only enable a peculiar (nonintrusive) sort of
> unwinding we would need for complex userspace stacks?

I think sysprof (kernel/trace/trace_sysprof.c) already does user-space
stack unwinding. So pushing that capability further up the chain when a
second user (stap) comes along makes perfect sense.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 13:04     ` Frank Ch. Eigler
  2008-07-18 13:09       ` Andi Kleen
  2008-07-18 13:10       ` Peter Zijlstra
@ 2008-07-18 13:21       ` James Bottomley
  2008-07-18 13:40         ` Frank Ch. Eigler
  2 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-18 13:21 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Peter Zijlstra, Andi Kleen, linux-kernel, systemtap, jbeulich

On Fri, 2008-07-18 at 09:02 -0400, Frank Ch. Eigler wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > [...]
> >> Right now x86 doesn't really have a good reliable unwinder that
> >> works without frame pointer. I think systemtap
> >> recently switched to Jan Beulich's dwarf2 unwinder. Before
> >> switching to the in kernel unwinder that one would need to be 
> >> re-merged again.
> >
> > Those are two separate issues.
> >
> > 1) stap ought to use the kernel's infrastructure and not re-implement
> > its own.
> > 2) if the kernel's infrastructure doesn't meet requirements, improve
> > it.
> 
> They are related to the extent that readers may not realize some
> implications of systemtap being/becoming a *kernel-resident* but not
> *kernel-focused* tool.
> 
> For example, we're about to do unwinding/stack-traces of userspace
> programs.  To what extent do you think the kernel unwinder (should one
> reappear in git) would welcome patches that provide zero benefit to
> the kernel, but only enable a peculiar (nonintrusive) sort of
> unwinding we would need for complex userspace stacks?

I'm not entirely convinced systemtap wants full stack unwinding in the
kernel.  What the kernel wants is the call trace, which it can do with
kallsyms.  However, systemtap in userspace sees all the relevant dwarf
information as well ... it could do a much better job of unwinding: give
file and line and arguments for function calls, for instance.  All it
really needs is to have the relevant pieces of the stack relayed back.

James


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 13:10       ` Peter Zijlstra
@ 2008-07-18 13:31         ` Frank Ch. Eigler
  2008-07-18 13:36         ` Andi Kleen
  1 sibling, 0 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-18 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, James Bottomley, linux-kernel, systemtap, jbeulich,
	arjan, sandmann, Ingo Molnar

Hi -

On Fri, Jul 18, 2008 at 03:10:27PM +0200, Peter Zijlstra wrote:
> [...]
> > For example, we're about to do unwinding/stack-traces of userspace
> > programs.  To what extent do you think the kernel unwinder (should one
> > reappear in git) would welcome patches that provide zero benefit to
> > the kernel, but only enable a peculiar (nonintrusive) sort of
> > unwinding we would need for complex userspace stacks?
> 
> I think sysprof (kernel/trace/trace_sysprof.c) already does user-space
> stack unwinding. So pushing that capability further up the chain when a
> second user (stap) comes along makes perfect sense.

trace_sysprof relies on dump_stack, which is x86-only and does not do
elf/dwarf unwinding proper.  (Likewise oprofile, etc.)  They can't
even start, because they don't have unwind data available - something
we plan to make available to the systemtap runtime.


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 13:10       ` Peter Zijlstra
  2008-07-18 13:31         ` Frank Ch. Eigler
@ 2008-07-18 13:36         ` Andi Kleen
  1 sibling, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2008-07-18 13:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frank Ch. Eigler, James Bottomley, linux-kernel, systemtap,
	jbeulich, arjan, sandmann, Ingo Molnar

Peter Zijlstra <peterz@infradead.org> writes:
>
> I think sysprof (kernel/trace/trace_sysprof.c) already does user-space
> stack unwinding. So pushing that capability further up the chain when a
> second user (stap) comes along makes perfect sense.

oprofile call trace profiling also does it.

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-18 13:21       ` James Bottomley
@ 2008-07-18 13:40         ` Frank Ch. Eigler
  0 siblings, 0 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-18 13:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Peter Zijlstra, Andi Kleen, linux-kernel, systemtap, jbeulich

Hi -

On Fri, Jul 18, 2008 at 08:21:24AM -0500, James Bottomley wrote:

> [...]  I'm not entirely convinced systemtap wants full stack
> unwinding in the kernel.

Sure we "want" it if we can get it.  It enables richer data gathering.
It lets scripts act on the contents of the call stack ("is this probe
being run due to a callback from this or that shared library?").

> [...] However, systemtap in userspace sees all the relevant dwarf
> information as well ... it could do a much better job of unwinding:
> give file and line and arguments for function calls, for instance.
> All it really needs is to have the relevant pieces of the stack
> relayed back.

The relevant bits of stack for a userspace program could include
several megabytes per thread; without unwind info we can't be sure
which parts are which.  (A full gdb symbolic backtrace takes seconds
to compute - something we can't possibly afford in situ.)

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 21:38             ` Masami Hiramatsu
  2008-07-17 22:03               ` James Bottomley
@ 2008-07-21 14:21               ` James Bottomley
  1 sibling, 0 replies; 49+ messages in thread
From: James Bottomley @ 2008-07-21 14:21 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: linux-kernel, systemtap

On Thu, 2008-07-17 at 17:36 -0400, Masami Hiramatsu wrote:
> James Bottomley wrote:
> > On Thu, 2008-07-17 at 09:18 -0500, James Bottomley wrote:
> >> OK, thought about it.  There seem to be two possible solutions
> >>
> >>      1. Get systemtap always to offset from non-static functions.  This
> >>         will use the standard linker to ensure uniqueness (a module
> >>         qualifier will still need to be added to the struct kprobe for
> >>         lookup, since modules can duplicate unexported kernel symbols).
> >>      2. Add the filename as a discriminator for duplicate symbols in the
> >>         kallsyms program (would still need module qualifier).  This is
> >>         appealing because the path name would be printed in the kernel
> >>         trace to help with oops tracking
> >>
> >> This is where negotiations come in.  To me 2. looks to be better because
> >> it will help us with oops tracking.  On the other hand, it's usually
> >> pretty obvious from the stack trace context which files the duplicate
> >> symbols are actually in; what do other people think?
> > 
> > Just by way of illustration, this is systemtap fixed up according to
> > suggestion number 1.  You can see now using your test case that we get:
> > 
> > # probes
> > kernel.function("do_open@fs/block_dev.c:929") /* pc=<lookup_bdev+0x90> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@fs/nfsctl.c:24") /* pc=<sys_nfsservctl+0x6a> */ /* <- kernel.function("do_open") */
> > kernel.function("do_open@ipc/mqueue.c:642") /* pc=<sys_mq_unlink+0x130> */ /* <- kernel.function("do_open") */
> 
> Hi James,
> 
> Thank you for updating the patch.
> Unfortunately, I found another scenario; if someone make a module which
> has EXPORT_SYMBOL(do_open), it's a non-static function. but there are
> other static version do_open in kallsyms.
> Here, I tested it and got below;
> 
> $ stap --kelf -e 'probe module("test").function("do_open"){}' -p2
> # probes
> module("test").function("do_open@?") /* pc=<do_open+0x0> */ /* <- module("test").function("do_open") */

OK, I fixed this.  It turns out that kprobes already had a syntax for
this problem; it's <module>:<function> (in fact it won't work without
this).  I updated the code (attached below) to work correctly.  I'm
still looking at the kallsyms code for a uniqueness solution.

James

---

From: James Bottomley <James.Bottomley@HansenPartnership.com>
Subject: Part1 use symbol_name/offset to locate dwarf probes

---
 tapsets.cxx |  120 ++++++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 83 insertions(+), 37 deletions(-)

diff --git a/tapsets.cxx b/tapsets.cxx
index a08a8ab..f24dc10 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -491,6 +491,7 @@ func_info
   Dwarf_Addr entrypc;
   Dwarf_Addr prologue_end;
   bool weak;
+  bool global;
   // Comparison functor for list of functions sorted by address. The
   // two versions that take a Dwarf_Addr let us use the STL algorithms
   // upper_bound, equal_range et al., but we don't know whether the
@@ -591,6 +592,7 @@ symbol_table
   module_info *mod_info;	// associated module
   map<string, func_info*> map_by_name;
   vector<func_info*> list_by_addr;
+  vector<func_info*> global_list_by_addr;
   typedef vector<func_info*>::iterator iterator_t;
   typedef pair<iterator_t, iterator_t> range_t;
 #ifdef __powerpc__
@@ -598,7 +600,7 @@ symbol_table
 #endif
   // add_symbol doesn't leave symbol table in order; call
   // symbol_table::sort() when done adding symbols.
-  void add_symbol(const char *name, bool weak, Dwarf_Addr addr,
+  void add_symbol(const char *name, bool weak, bool global, Dwarf_Addr addr,
                                                Dwarf_Addr *high_addr);
   void sort();
   enum info_status read_symbols(FILE *f, const string& path);
@@ -611,7 +613,7 @@ symbol_table
   void purge_syscall_stubs();
   func_info *lookup_symbol(const string& name);
   Dwarf_Addr lookup_symbol_address(const string& name);
-  func_info *get_func_containing_address(Dwarf_Addr addr);
+  func_info *get_func_containing_address(Dwarf_Addr addr, bool global);
 
   symbol_table(module_info *mi) : mod_info(mi) {}
   ~symbol_table();
@@ -2309,13 +2311,15 @@ struct dwarf_derived_probe: public derived_probe
                        const string& module,
                        const string& section,
 		       Dwarf_Addr dwfl_addr,
-		       Dwarf_Addr addr,
+		       string symbol,
+		       unsigned int offset,
 		       dwarf_query & q,
                        Dwarf_Die* scope_die);
 
   string module;
   string section;
-  Dwarf_Addr addr;
+  string kprobe_symbol;
+  unsigned int kprobe_offset;
   bool has_return;
   bool has_maxactive;
   long maxactive_val;
@@ -2840,7 +2844,7 @@ dwarf_query::query_module_symtab()
       // Find the "function" in which the indicated address resides.
       Dwarf_Addr addr =
       		(has_function_num ? function_num_val : statement_num_val);
-      fi = sym_table->get_func_containing_address(addr);
+      fi = sym_table->get_func_containing_address(addr, false);
       if (!fi)
         {
           cerr << "Warning: address "
@@ -3265,9 +3269,18 @@ dwarf_query::add_probe_point(const string& funcname,
 
   if (! bad)
     {
+      struct module_info *mi = dw.mod_info;
+      if (!mi->sym_table)
+	mi->get_symtab(this);
+      struct symbol_table *sym_tab = mi->sym_table;
+      func_info *symbol = sym_tab->get_func_containing_address(addr, true);
+
       sess.unwindsym_modules.insert (module);
       probe = new dwarf_derived_probe(funcname, filename, line,
-                                      module, reloc_section, addr, reloc_addr, *this, scope_die);
+                                      module, reloc_section, reloc_addr,
+				      symbol->name,
+				      (unsigned int)(addr - symbol->addr),
+				      *this, scope_die);
       results.push_back(probe);
     }
 }
@@ -4385,7 +4398,8 @@ dwarf_derived_probe::printsig (ostream& o) const
   // function instances.  This is distinct from the verbose/clog
   // output, since this part goes into the cache hash calculations.
   sole_location()->print (o);
-  o << " /* pc=0x" << hex << addr << dec << " */";
+  o << " /* pc=<" << kprobe_symbol << "+0x" << hex
+    << kprobe_offset << dec << "> */";
   printsig_nested (o);
 }
 
@@ -4411,22 +4425,26 @@ dwarf_derived_probe::dwarf_derived_probe(const string& funcname,
                                          // NB: dwfl_addr is the virtualized
                                          // address for this symbol.
                                          Dwarf_Addr dwfl_addr,
-                                         // addr is the section-offset for
-                                         // actual relocation.
-                                         Dwarf_Addr addr,
+                                         // symbol is the closest known symbol
+                                         // and offset is the offset from the symbol
+					 string symbol,
+					 unsigned int offset,
                                          dwarf_query& q,
                                          Dwarf_Die* scope_die /* may be null */)
   : derived_probe (q.base_probe, new probe_point(*q.base_loc) /* .components soon rewritten */ ),
-    module (module), section (section), addr (addr),
+    module (module), section (section), kprobe_symbol(symbol),
+    kprobe_offset(offset),
     has_return (q.has_return),
     has_maxactive (q.has_maxactive),
     maxactive_val (q.maxactive_val)
 {
   // Assert relocation invariants
+#if 0
   if (section == "" && dwfl_addr != addr) // addr should be absolute
     throw semantic_error ("missing relocation base against", q.base_loc->tok);
   if (section != "" && dwfl_addr == addr) // addr should be an offset
     throw semantic_error ("inconsistent relocation address", q.base_loc->tok);
+#endif
 
   this->tok = q.base_probe->tok;
 
@@ -4634,8 +4652,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
   // Let's find some stats for the three embedded strings.  Maybe they
   // are small and uniform enough to justify putting char[MAX]'s into
   // the array instead of relocated char*'s.
-  size_t module_name_max = 0, section_name_max = 0, pp_name_max = 0;
-  size_t module_name_tot = 0, section_name_tot = 0, pp_name_tot = 0;
+  size_t pp_name_max = 0, pp_name_tot = 0;
+  size_t symbol_name_name_max  = 0, symbol_name_name_tot = 0;
   size_t all_name_cnt = probes_by_module.size(); // for average
   for (p_b_m_iterator it = probes_by_module.begin(); it != probes_by_module.end(); it++)
     {
@@ -4644,9 +4662,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
         size_t var##_size = (expr) + 1;                 \
         var##_max = max (var##_max, var##_size);        \
         var##_tot += var##_size; } while (0)
-      DOIT(module_name, p->module.size());
-      DOIT(section_name, p->section.size());
       DOIT(pp_name, lex_cast_qstring(*p->sole_location()).size());
+      DOIT(symbol_name_name, p->kprobe_symbol.size());
 #undef DOIT
     }
 
@@ -4666,11 +4683,10 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
       if (s.verbose > 2) clog << "stap_dwarf_probe *" << #var << endl;  \
     }
 
-  CALCIT(module);
-  CALCIT(section);
   CALCIT(pp);
+  CALCIT(symbol_name);
 
-  s.op->newline() << "const unsigned long address;";
+  s.op->newline() << "unsigned int offset;";
   s.op->newline() << "void (* const ph) (struct context*);";
   s.op->newline(-1) << "} stap_dwarf_probes[] = {";
   s.op->indent(1);
@@ -4687,9 +4703,8 @@ dwarf_derived_probe_group::emit_module_decls (systemtap_session& s)
           assert (p->maxactive_val >= 0 && p->maxactive_val <= USHRT_MAX);
           s.op->line() << " .maxactive_val=" << p->maxactive_val << ",";
         }
-      s.op->line() << " .address=0x" << hex << p->addr << dec << "UL,";
-      s.op->line() << " .module=\"" << p->module << "\",";
-      s.op->line() << " .section=\"" << p->section << "\",";
+      s.op->line() << " .symbol_name=\"" << p->kprobe_symbol << "\",";
+      s.op->line() << " .offset=0x" << hex << p->kprobe_offset << dec << ",";
       s.op->line() << " .pp=" << lex_cast_qstring (*p->sole_location()) << ",";
       s.op->line() << " .ph=&" << p->name;
       s.op->line() << " },";
@@ -4749,11 +4764,10 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
   s.op->newline() << "for (i=0; i<" << probes_by_module.size() << "; i++) {";
   s.op->newline(1) << "struct stap_dwarf_probe *sdp = & stap_dwarf_probes[i];";
   s.op->newline() << "struct stap_dwarf_kprobe *kp = & stap_dwarf_kprobes[i];";
-  s.op->newline() << "unsigned long relocated_addr = _stp_module_relocate (sdp->module, sdp->section, sdp->address);";
-  s.op->newline() << "if (relocated_addr == 0) continue;"; // quietly; assume module is absent
   s.op->newline() << "probe_point = sdp->pp;";
   s.op->newline() << "if (sdp->return_p) {";
-  s.op->newline(1) << "kp->u.krp.kp.addr = (void *) relocated_addr;";
+  s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
+  s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
   s.op->newline() << "if (sdp->maxactive_p) {";
   s.op->newline(1) << "kp->u.krp.maxactive = sdp->maxactive_val;";
   s.op->newline(-1) << "} else {";
@@ -4774,8 +4788,8 @@ dwarf_derived_probe_group::emit_module_init (systemtap_session& s)
   s.op->newline() << "rc = register_kretprobe (& kp->u.krp);";
   s.op->newline() << "#endif";
   s.op->newline(-1) << "} else {";
-  // to ensure safeness of bspcache, always use aggr_kprobe on ia64
-  s.op->newline(1) << "kp->u.kp.addr = (void *) relocated_addr;";
+  s.op->newline(1) << "kp->u.krp.kp.symbol_name = sdp->symbol_name;";
+  s.op->newline(1) << "kp->u.krp.kp.offset = sdp->offset;";
   s.op->newline() << "kp->u.kp.pre_handler = &enter_kprobe_probe;";
   s.op->newline() << "#ifdef __ia64__";
   s.op->newline() << "kp->dummy.addr = kp->u.kp.addr;";
@@ -4939,12 +4953,20 @@ dwarf_builder::build(systemtap_session & sess,
           throw semantic_error ("absolute statement probe in unprivileged script", q.base_probe->tok);
         }
 
+      struct module_info *mi = dw->mod_info;
+      if (!mi->sym_table)
+	mi->get_symtab(&q);
+      struct symbol_table *sym_tab = mi->sym_table;
+      func_info *symbol = sym_tab->get_func_containing_address(q.statement_num_val, true);
+
       // For kernel.statement(NUM).absolute probe points, we bypass
       // all the debuginfo stuff: We just wire up a
       // dwarf_derived_probe right here and now.
       dwarf_derived_probe* p =
         new dwarf_derived_probe ("", "", 0, "kernel", "",
-                                 q.statement_num_val, q.statement_num_val,
+                                 q.statement_num_val,
+				 symbol->name,
+				 (unsigned int)(q.statement_num_val - symbol->addr),
                                  q, 0);
       finished_results.push_back (p);
       sess.unwindsym_modules.insert ("kernel");
@@ -4962,8 +4984,8 @@ symbol_table::~symbol_table()
 }
 
 void
-symbol_table::add_symbol(const char *name, bool weak, Dwarf_Addr addr,
-                                                      Dwarf_Addr *high_addr)
+symbol_table::add_symbol(const char *name, bool weak, bool global, 
+			 Dwarf_Addr addr, Dwarf_Addr *high_addr)
 {
 #ifdef __powerpc__
   // Map ".sys_foo" to "sys_foo".
@@ -4974,10 +4996,13 @@ symbol_table::add_symbol(const char *name, bool weak, Dwarf_Addr addr,
   fi->addr = addr;
   fi->name = name;
   fi->weak = weak;
+  fi->global = global;
   map_by_name[fi->name] = fi;
   // TODO: Use a multimap in case there are multiple static
   // functions with the same name?
   list_by_addr.push_back(fi);
+  if (global)
+    global_list_by_addr.push_back(fi);
 }
 
 enum info_status
@@ -5015,7 +5040,8 @@ symbol_table::read_symbols(FILE *f, const string& path)
           break;
         }
       if (type == 'T' || type == 't' || type == 'W')
-        add_symbol(name, (type == 'W'), (Dwarf_Addr) addr, &high_addr);
+        add_symbol(name, (type == 'W'), (type == 'T'),
+		   (Dwarf_Addr) addr, &high_addr);
     }
 
   if (list_by_addr.size() < 1)
@@ -5134,7 +5160,8 @@ symbol_table::get_from_elf()
       if (name && GELF_ST_TYPE(sym.st_info) == STT_FUNC &&
                                                    !reject_section(section))
         add_symbol(name, (GELF_ST_BIND(sym.st_info) == STB_WEAK),
-                                              sym.st_value, &high_addr);
+		   GELF_ST_BIND(sym.st_info) == STB_GLOBAL,
+		   sym.st_value, &high_addr);
     }
   sort();
   return info_present;
@@ -5175,14 +5202,29 @@ symbol_table::mark_dwarf_redundancies(dwflpp *dw)
 }
 
 func_info *
-symbol_table::get_func_containing_address(Dwarf_Addr addr)
+symbol_table::get_func_containing_address(Dwarf_Addr addr, bool global)
 {
-  iterator_t iter = upper_bound(list_by_addr.begin(), list_by_addr.end(), addr,
-                              func_info::Compare());
-  if (iter == list_by_addr.begin())
-    return NULL;
+  iterator_t iter;
+
+  if (global)
+    {
+      iter = upper_bound(global_list_by_addr.begin(),
+			 global_list_by_addr.end(), addr,
+			 func_info::Compare());
+      if (iter == global_list_by_addr.begin())
+	return NULL;
+      else
+	return *(iter - 1);
+    }
   else
-    return *(iter - 1);
+    {
+      iter = upper_bound(list_by_addr.begin(), list_by_addr.end(), addr,
+			 func_info::Compare());
+      if (iter == list_by_addr.begin())
+	return NULL;
+      else
+	return *(iter - 1);
+    }
 }
 
 func_info *
@@ -5235,12 +5277,16 @@ symbol_table::purge_syscall_stubs()
   list_by_addr.erase(remove(purge_range.first, purge_range.second,
                             (func_info*)0),
                      purge_range.second);
+  // NOTE: At the moment global_list_by_addr has no weak symbols
+  // so nothing needs to be removed from it.
 }
 
 void
 symbol_table::sort()
 {
   stable_sort(list_by_addr.begin(), list_by_addr.end(), func_info::Compare());
+  stable_sort(global_list_by_addr.begin(), global_list_by_addr.end(),
+	      func_info::Compare());
 }
 
 void
-- 
1.5.6.2



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-17 21:35                     ` Frank Ch. Eigler
  2008-07-17 22:05                       ` Masami Hiramatsu
@ 2008-07-22 18:00                       ` Rik van Riel
  2008-07-22 18:12                         ` Frank Ch. Eigler
  2008-07-23 15:06                         ` systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs Frank Ch. Eigler
  1 sibling, 2 replies; 49+ messages in thread
From: Rik van Riel @ 2008-07-22 18:00 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

On Thu, 17 Jul 2008 17:33:55 -0400
"Frank Ch. Eigler" <fche@redhat.com> wrote:

> Yes, that is generally desirable - each case is usually a question of
> cost/benefit.  One significant requirement for us is to keep working
> with older kernels.

You will have to weigh that against the benefits of making
systemtap generally useful for kernel developers, which
would result in a more active systemtap community and,
eventually, more available scripts and easier end user
functionality.

If a project is not aimed squarely at the developers who 
could give the project critical mass, it is essentially 
doomed.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-22 18:00                       ` Rik van Riel
@ 2008-07-22 18:12                         ` Frank Ch. Eigler
  2008-07-22 18:31                           ` Peter Zijlstra
       [not found]                           ` <1216751477.7257.115.camel__19834.5970632092$1216751567$gmane$org@twins>
  2008-07-23 15:06                         ` systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs Frank Ch. Eigler
  1 sibling, 2 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-22 18:12 UTC (permalink / raw)
  To: Rik van Riel; +Cc: James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

Hi -

On Tue, Jul 22, 2008 at 02:00:15PM -0400, Rik van Riel wrote:
> [...]
> > Yes, that is generally desirable - each case is usually a question of
> > cost/benefit.  One significant requirement for us is to keep working
> > with older kernels.
> 
> You will have to weigh that against the benefits of making
> systemtap generally useful for kernel developers [...]

Understood & agreed, Rik.  If an issue arises where there is genuine
conflict between kernel-developer-usability and something else, we'll
try to solve it favouring the former if at all possible.

(The kprobes addressing argument cannot reasonably be placed into this
category.)


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-22 18:12                         ` Frank Ch. Eigler
@ 2008-07-22 18:31                           ` Peter Zijlstra
       [not found]                           ` <1216751477.7257.115.camel__19834.5970632092$1216751567$gmane$org@twins>
  1 sibling, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2008-07-22 18:31 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Rik van Riel, James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

On Tue, 2008-07-22 at 14:11 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Tue, Jul 22, 2008 at 02:00:15PM -0400, Rik van Riel wrote:
> > [...]
> > > Yes, that is generally desirable - each case is usually a question of
> > > cost/benefit.  One significant requirement for us is to keep working
> > > with older kernels.
> > 
> > You will have to weigh that against the benefits of making
> > systemtap generally useful for kernel developers [...]
> 
> Understood & agreed, Rik.  If an issue arises where there is genuine
> conflict between kernel-developer-usability and something else, we'll
> try to solve it favouring the former if at all possible.
> 
> (The kprobes addressing argument cannot reasonably be placed into this
> category.)

You have your viewpoint inverted, if the kernel developers think you
have a problem, and you fail to address it, they will walk away.

If you want the kernel people to endorse your project, you'll have to
please them. Its that simple. If that means having to radically
re-structure your design, and/or break backwards compatibility then so
be it. Such are the costs for not collaborating from the start.

If you stubbornly refuse to co-operate you'll either break the project
or invite a fork/rewrite by someone else if the idea is deemed
worthwhile enough.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
       [not found]                           ` <1216751477.7257.115.camel__19834.5970632092$1216751567$gmane$org@twins>
@ 2008-07-22 18:50                             ` Frank Ch. Eigler
  0 siblings, 0 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-22 18:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

Peter Zijlstra <peterz@infradead.org> writes:

> [...]
>> > You will have to weigh that against the benefits of making
>> > systemtap generally useful for kernel developers [...]
>> 
>> Understood & agreed, Rik.  If an issue arises where there is genuine
>> conflict between kernel-developer-usability and something else, we'll
>> try to solve it favouring the former if at all possible.
>> 
>> (The kprobes addressing argument cannot reasonably be placed into this
>> category.)

> You have your viewpoint inverted, if the kernel developers think you
> have a problem, and you fail to address it, they will walk away.
>
> If you want the kernel people to endorse your project, you'll have
> to please them. It's that simple. [...]

We have and will try to accomodate anything reasonable.  I trust no
one is suggesting that every systemtap-related suggestion from lkml is
to be treated as if infallible, and that we can continue to debate the
wisdom of each idea on its merits.

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs
  2008-07-22 18:00                       ` Rik van Riel
  2008-07-22 18:12                         ` Frank Ch. Eigler
@ 2008-07-23 15:06                         ` Frank Ch. Eigler
  2008-07-23 15:29                           ` Arjan van de Ven
                                             ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-23 15:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

Hi -

I wrote:

> > [...] One significant requirement for us is to keep working with
> > older kernels.  [...]

Maybe it's worth elaborating on why the need for backward
compatibility is different for systemtap than for typical kernel-side
code.

The bulk of systemtap is a user-space program, and it does very
user-spacey things like parsing dwarf and invoking compilers, running
network servers.  Soon it will include user-space libraries.  It is so
different from the stuff normally found there that no one has AFAIK
seriously proposed that the entire software be made part of the kernel
git tree.  So it is an ordinary separate user-space package, built by
users and distributors.

It does happen to *generate* kernel modules.  The way that such a
module must interface with any particular kernel is naturally subject
to the whims & desires of the kernel du jour.  This is why we have a
mass of mechanism to try to automatically speak to each kernel version
as appropriate.

It is desirable to minimize this mass for obvious reasons.  When a new
upstream kernel comes out with a tasty new feature -- or a less tasty
API rewrite -- we need to extend systemtap to support that too.  We
cannot easily take old support away, because then the same user-space
code base would no longer run against actually installed kernels.

To draw an analogy, systemtap is somewhat like low-level userspace
code like glibc or syslogd or udevd.  I hope no one would seriously
propose casually committing code to those packages that would make
them unusable on prior kernel versions.  Accepting such a patch would
require their maintainers to fork outright every time a kernel change
occurs.

Things are good however if the low-level userspace changes are
backward-compatible, so that the new kernel facility is used when
present, but the software does not regress if it is not.  I believe
this is what we need to aim for, even though it puts the bulk of the
burden on systemtap (or glibc, or ...).

I hope this fills in some of the gaps in the background.

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:  begin the process of using proper kernel APIs
  2008-07-23 15:06                         ` systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs Frank Ch. Eigler
@ 2008-07-23 15:29                           ` Arjan van de Ven
  2008-07-23 15:33                           ` Peter Zijlstra
       [not found]                           ` <20080723082856.334f9c17__2909.60763018138$1216827051$gmane$org@infradead.org>
  2 siblings, 0 replies; 49+ messages in thread
From: Arjan van de Ven @ 2008-07-23 15:29 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Rik van Riel, James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

On Wed, 23 Jul 2008 11:04:34 -0400
"Frank Ch. Eigler" <fche@redhat.com> wrote:

> Hi -
> 
> I wrote:
> 
> > > [...] One significant requirement for us is to keep working with
> > > older kernels.  [...]
> 
> Maybe it's worth elaborating on why the need for backward
> compatibility is different for systemtap than for typical kernel-side
> code.
> 
> The bulk of systemtap is a user-space program, and it does very
> user-spacey things like parsing dwarf and invoking compilers, running
> network servers.  Soon it will include user-space libraries.  It is so
> different from the stuff normally found there that no one has AFAIK
> seriously proposed that the entire software be made part of the kernel
> git tree.  So it is an ordinary separate user-space package, built by
> users and distributors.

so far so good...

> 
> It does happen to *generate* kernel modules.  The way that such a
> module must interface with any particular kernel is naturally subject
> to the whims & desires of the kernel du jour.  This is why we have a
> mass of mechanism to try to automatically speak to each kernel version
> as appropriate.

and this is where I strongly disagree.
THIS part *has* to be in the kernel source, so that we can change it
WITH the kernel as we change it. If this means that there's some
userland .so code in the kernel source, so be it. If it means we provide
some template files that your userland fills in the blanks for, even
better. (paint-by-number kernel modules!)

But to have any chance at all of systemtap being sustainable, this part
of the stack has to be together with where the changes happen.

> 
> It is desirable to minimize this mass for obvious reasons.  When a new
> upstream kernel comes out with a tasty new feature -- or a less tasty
> API rewrite -- we need to extend systemtap to support that too.

At that point you are already 3 months too late for me, and probably
for most of my fellow kernel hackers. THIS is exactly what makes
systemtap not usable for kernel hackers, and this is exactly why you
see very little contributions from kernel hackers.
(and when it's seen it gets a rather luke warm reception, but that's a
different story).

It also means that unless I want to package and build systemtap myself,
I have to wait for my OS vendor to think about moving to the kernel I'm
on before I can use systemtap. For me as kernel developer.. that's the
second show stopper already.


> To draw an analogy, systemtap is somewhat like low-level userspace
> code like glibc or syslogd or udevd.  I hope no one would seriously
> propose casually committing code to those packages that would make
> them unusable on prior kernel versions.  Accepting such a patch would
> require their maintainers to fork outright every time a kernel change
> occurs.

we've discussed pulling udev into the kernel source several times, and
the jury is still out on it.
But systemtap is NOT like udev or glibc or ..
it's a kernel component.. at least the part that generates the kernel
code is. it needs to breathe and move together with the kernel.
> 
> I hope this fills in some of the gaps in the background.

it explains where you're coming from, which is good. However I for one
really disagree with the assumption, and i just tried to point out that
the consequences of this are rather dreadful.

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:  begin the process of using proper kernel APIs
  2008-07-23 15:06                         ` systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs Frank Ch. Eigler
  2008-07-23 15:29                           ` Arjan van de Ven
@ 2008-07-23 15:33                           ` Peter Zijlstra
  2008-07-23 20:26                             ` Masami Hiramatsu
       [not found]                           ` <20080723082856.334f9c17__2909.60763018138$1216827051$gmane$org@infradead.org>
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2008-07-23 15:33 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Rik van Riel, James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

On Wed, 2008-07-23 at 11:04 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> I wrote:
> 
> > > [...] One significant requirement for us is to keep working with
> > > older kernels.  [...]
> 
> Maybe it's worth elaborating on why the need for backward
> compatibility is different for systemtap than for typical kernel-side
> code.
> 
> The bulk of systemtap is a user-space program, and it does very
> user-spacey things like parsing dwarf and invoking compilers, running
> network servers.  Soon it will include user-space libraries.  It is so
> different from the stuff normally found there that no one has AFAIK
> seriously proposed that the entire software be made part of the kernel
> git tree.  So it is an ordinary separate user-space package, built by
> users and distributors.
> 
> It does happen to *generate* kernel modules.  The way that such a
> module must interface with any particular kernel is naturally subject
> to the whims & desires of the kernel du jour.  This is why we have a
> mass of mechanism to try to automatically speak to each kernel version
> as appropriate.
> 
> It is desirable to minimize this mass for obvious reasons.  When a new
> upstream kernel comes out with a tasty new feature -- or a less tasty
> API rewrite -- we need to extend systemtap to support that too.  We
> cannot easily take old support away, because then the same user-space
> code base would no longer run against actually installed kernels.
> 
> To draw an analogy, systemtap is somewhat like low-level userspace
> code like glibc or syslogd or udevd.  I hope no one would seriously
> propose casually committing code to those packages that would make
> them unusable on prior kernel versions.  Accepting such a patch would
> require their maintainers to fork outright every time a kernel change
> occurs.
> 
> Things are good however if the low-level userspace changes are
> backward-compatible, so that the new kernel facility is used when
> present, but the software does not regress if it is not.  I believe
> this is what we need to aim for, even though it puts the bulk of the
> burden on systemtap (or glibc, or ...).
> 
> I hope this fills in some of the gaps in the background.

Why does a new version of stap have to work on ancient kernels?

A new gnome version requires a new gtk version, a new kde version
requires a new qt etc.. so why does a new stap not require a new kernel?

Why isn't only supporting the last few kernels, say for example as far
back as there are -stable series at the moment of release, good enough?

People who insist on running stale kernels are usually the same people
who run stale userspace - we call those enterprise people - so why can't
they run matching stale version of the kernel and stap?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:  begin the process of using proper kernel APIs
       [not found]                           ` <20080723082856.334f9c17__2909.60763018138$1216827051$gmane$org@infradead.org>
@ 2008-07-23 16:43                             ` Frank Ch. Eigler
  2008-07-23 16:54                               ` Adrian Bunk
  2008-07-23 22:14                               ` Masami Hiramatsu
  0 siblings, 2 replies; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-23 16:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Rik van Riel, James Bottomley, Masami Hiramatsu, linux-kernel, systemtap

Arjan van de Ven <arjan@infradead.org> writes:

> [...]
>> It does happen to *generate* kernel modules.  The way that such a
>> module must interface with any particular kernel is naturally subject
>> to the whims & desires of the kernel du jour.  This is why we have a
>> mass of mechanism to try to automatically speak to each kernel version
>> as appropriate.

> and this is where I strongly disagree.  THIS part *has* to be in the
> kernel source, so that we can change it WITH the kernel as we change
> it. [...]  But to have any chance at all of systemtap being
> sustainable, this part of the stack has to be together with where
> the changes happen.

OK.  It will take us some time to figure out to what extent this would
be feasible.  Maybe a topic for Portland.

>> It is desirable to minimize this mass for obvious reasons.  When a new
>> upstream kernel comes out with a tasty new feature -- or a less tasty
>> API rewrite -- we need to extend systemtap to support that too.
>
> At that point you are already 3 months too late for me, and probably
> for most of my fellow kernel hackers. 

(Really?  Have we ever been 3 months behind supporting the git kernel?)

> (and when it's seen it gets a rather luke warm reception, but that's
> a different story).

I hope the backward compatibility issue, as it stands today, helps
explain the reasons for the current deal with kprobes.

In the interim (before we come up with a way of moving more
kernel-coupled systemtap code into kernel.org/git), would y'all
consider an arrangement?  Those of you who care about systemtap, and
are intending to make an incompatible kernel/module interface change,
please run the systemtap testsuite before & after.  If it regresses,
send us a note or a patch.  If practical, we'll integrate it (and add
any backward-compatibility hacks if needed) into systemtap.

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:  begin the process of using proper kernel APIs
  2008-07-23 16:43                             ` Frank Ch. Eigler
@ 2008-07-23 16:54                               ` Adrian Bunk
  2008-07-23 17:43                                 ` Frank Ch. Eigler
  2008-07-23 22:14                               ` Masami Hiramatsu
  1 sibling, 1 reply; 49+ messages in thread
From: Adrian Bunk @ 2008-07-23 16:54 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Arjan van de Ven, Rik van Riel, James Bottomley,
	Masami Hiramatsu, linux-kernel, systemtap

On Wed, Jul 23, 2008 at 12:41:37PM -0400, Frank Ch. Eigler wrote:
>...
> In the interim (before we come up with a way of moving more
> kernel-coupled systemtap code into kernel.org/git), would y'all
> consider an arrangement?  Those of you who care about systemtap, and
> are intending to make an incompatible kernel/module interface change,
> please run the systemtap testsuite before & after.  If it regresses,
> send us a note or a patch.  If practical, we'll integrate it (and add
> any backward-compatibility hacks if needed) into systemtap.

The number of incompatible kernel/module interface changes per kernel is 
most likely some medium three digit number.

What you have in mind is therefore not really feasible.

But we have several levels of kernels that incorporate future stuff, 
from -rc over -next to -mm, and you could implement automated runs 
of the systemtap testsuite against the latest ones.

> - FChE

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs
  2008-07-23 16:54                               ` Adrian Bunk
@ 2008-07-23 17:43                                 ` Frank Ch. Eigler
  2008-07-23 18:41                                   ` Adrian Bunk
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-23 17:43 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Arjan van de Ven, Rik van Riel, James Bottomley,
	Masami Hiramatsu, linux-kernel, systemtap

Hi -

On Wed, Jul 23, 2008 at 07:54:05PM +0300, Adrian Bunk wrote:
> On Wed, Jul 23, 2008 at 12:41:37PM -0400, Frank Ch. Eigler wrote:
> >...
> > In the interim (before we come up with a way of moving more
> > kernel-coupled systemtap code into kernel.org/git), would y'all
> > consider an arrangement?  Those of you who care about systemtap, and
> > are intending to make an incompatible kernel/module interface change,
> > please run the systemtap testsuite before & after.  [...]

> The number of incompatible kernel/module interface changes per
> kernel is most likely some medium three digit number.  What you have
> in mind is therefore not really feasible.

Do you think so, even if the historical pattern continues that only a
small fraction of those changes actually impact systemtap?


> But we have several levels of kernels that incorporate future stuff,
> from -rc over -next to -mm, and you could implement automated runs
> of the systemtap testsuite against the latest ones.

We do some of that already and should do more.


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:  begin the process of using proper kernel APIs
  2008-07-23 17:43                                 ` Frank Ch. Eigler
@ 2008-07-23 18:41                                   ` Adrian Bunk
  0 siblings, 0 replies; 49+ messages in thread
From: Adrian Bunk @ 2008-07-23 18:41 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Arjan van de Ven, Rik van Riel, James Bottomley,
	Masami Hiramatsu, linux-kernel, systemtap

On Wed, Jul 23, 2008 at 01:34:38PM -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Wed, Jul 23, 2008 at 07:54:05PM +0300, Adrian Bunk wrote:
> > On Wed, Jul 23, 2008 at 12:41:37PM -0400, Frank Ch. Eigler wrote:
> > >...
> > > In the interim (before we come up with a way of moving more
> > > kernel-coupled systemtap code into kernel.org/git), would y'all
> > > consider an arrangement?  Those of you who care about systemtap, and
> > > are intending to make an incompatible kernel/module interface change,
> > > please run the systemtap testsuite before & after.  [...]
> 
> > The number of incompatible kernel/module interface changes per
> > kernel is most likely some medium three digit number.  What you have
> > in mind is therefore not really feasible.
> 
> Do you think so, even if the historical pattern continues that only a
> small fraction of those changes actually impact systemtap?

But which are these?

All commits that either remove EXPORT_SYMBOL's or touch include/ are 
candidates would have to be checked, and that's a four digit number
of commits per kernel release.

What is not feasible is to check each of thesse commits.

> > But we have several levels of kernels that incorporate future stuff,
> > from -rc over -next to -mm, and you could implement automated runs
> > of the systemtap testsuite against the latest ones.
> 
> We do some of that already and should do more.
> 
> - FChE

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:  begin the process of using proper kernel APIs
  2008-07-23 15:33                           ` Peter Zijlstra
@ 2008-07-23 20:26                             ` Masami Hiramatsu
  0 siblings, 0 replies; 49+ messages in thread
From: Masami Hiramatsu @ 2008-07-23 20:26 UTC (permalink / raw)
  To: Peter Zijlstra, Frank Ch. Eigler
  Cc: Rik van Riel, James Bottomley, linux-kernel, systemtap

Hi,

Peter Zijlstra wrote:
> On Wed, 2008-07-23 at 11:04 -0400, Frank Ch. Eigler wrote:
>> Hi -
>>
>> I wrote:
>>
>>>> [...] One significant requirement for us is to keep working with
>>>> older kernels.  [...]
>> Maybe it's worth elaborating on why the need for backward
>> compatibility is different for systemtap than for typical kernel-side
>> code.
>>
>> The bulk of systemtap is a user-space program, and it does very
>> user-spacey things like parsing dwarf and invoking compilers, running
>> network servers.  Soon it will include user-space libraries.  It is so
>> different from the stuff normally found there that no one has AFAIK
>> seriously proposed that the entire software be made part of the kernel
>> git tree.  So it is an ordinary separate user-space package, built by
>> users and distributors.
>>
>> It does happen to *generate* kernel modules.  The way that such a
>> module must interface with any particular kernel is naturally subject
>> to the whims & desires of the kernel du jour.  This is why we have a
>> mass of mechanism to try to automatically speak to each kernel version
>> as appropriate.
>>
>> It is desirable to minimize this mass for obvious reasons.  When a new
>> upstream kernel comes out with a tasty new feature -- or a less tasty
>> API rewrite -- we need to extend systemtap to support that too.  We
>> cannot easily take old support away, because then the same user-space
>> code base would no longer run against actually installed kernels.
>>
>> To draw an analogy, systemtap is somewhat like low-level userspace
>> code like glibc or syslogd or udevd.  I hope no one would seriously
>> propose casually committing code to those packages that would make
>> them unusable on prior kernel versions.  Accepting such a patch would
>> require their maintainers to fork outright every time a kernel change
>> occurs.
>>
>> Things are good however if the low-level userspace changes are
>> backward-compatible, so that the new kernel facility is used when
>> present, but the software does not regress if it is not.  I believe
>> this is what we need to aim for, even though it puts the bulk of the
>> burden on systemtap (or glibc, or ...).
>>
>> I hope this fills in some of the gaps in the background.
> 
> Why does a new version of stap have to work on ancient kernels?
>
> A new gnome version requires a new gtk version, a new kde version
> requires a new qt etc.. so why does a new stap not require a new kernel?
> 
> Why isn't only supporting the last few kernels, say for example as far
> back as there are -stable series at the moment of release, good enough?
> 
> People who insist on running stale kernels are usually the same people
> who run stale userspace - we call those enterprise people - so why can't
> they run matching stale version of the kernel and stap?

I agree with you. currently, systemtap is increasingly evolving
on single source tree. But it is obvious that this developing style
can't catch up the upstream development.
I'd like to suggest that we might better branch the tree --
one is stable tree for old kernel, another aims to merge into upstream.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: systemtap & backward compatibility, was Re: [RFC] systemtap:   begin the process of using proper kernel APIs
  2008-07-23 16:43                             ` Frank Ch. Eigler
  2008-07-23 16:54                               ` Adrian Bunk
@ 2008-07-23 22:14                               ` Masami Hiramatsu
  1 sibling, 0 replies; 49+ messages in thread
From: Masami Hiramatsu @ 2008-07-23 22:14 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Arjan van de Ven, Rik van Riel, James Bottomley, linux-kernel, systemtap

Hi Frank,

Frank Ch. Eigler wrote:
> Arjan van de Ven <arjan@infradead.org> writes:
[...]
>> (and when it's seen it gets a rather luke warm reception, but that's
>> a different story).
> 
> I hope the backward compatibility issue, as it stands today, helps
> explain the reasons for the current deal with kprobes.

I understood that the current deal with kprobes is also for integrating
user probe logic and kernel probe logic.
Obviously, it is hard uprobe to provide same symbol_name interface,
because it requires to access(and analyze) userspace symbol
information from kernel.

> In the interim (before we come up with a way of moving more
> kernel-coupled systemtap code into kernel.org/git), would y'all
> consider an arrangement?  Those of you who care about systemtap, and
> are intending to make an incompatible kernel/module interface change,
> please run the systemtap testsuite before & after.  If it regresses,
> send us a note or a patch.  If practical, we'll integrate it (and add
> any backward-compatibility hacks if needed) into systemtap.

Hmm, I think it's very costly way for both of kernel developers and
systemtap developers.
From the long term of viewpoint, I think it's better (less costly)
to merge systemtap runtime/tapset into upstream kernel and maintain
it. Then, we can stabilize its API by ourselves on upstream.
Since it reduces the catchup/maintenance cost and it enables users
to use stap on upstream kernel, I think it is benefit for both.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-16 10:58             ` Frank Ch. Eigler
@ 2008-07-16 14:56               ` James Bottomley
  0 siblings, 0 replies; 49+ messages in thread
From: James Bottomley @ 2008-07-16 14:56 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: linux-kernel, systemtap

On Wed, 2008-07-16 at 06:56 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Tue, Jul 15, 2008 at 09:06:23PM -0500, James Bottomley wrote:
> > [...]
> > > Please choose your words more carefully.  We don't "subvert" anything,
> > > where one would mean sneaking around some sort of protection.
> > 
> > Actually, I did and you do.  One of the OED's definition of subvert is
> > "to undermine or overturn a condition or order of things, a principle or
> > a law etc."  In this particular case, this:
> > 
> > commit 3a872d89baae821a0f6e2c1055d4b47650661137
> > Author: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> > Date:   Mon Oct 2 02:17:30 2006 -0700
> >     [PATCH] Kprobes: Make kprobe modules more portable
> >
> > Which provided a portable input to kprobes (the symbol_name/offset one)
> > and revoked the global accessibility of the kallsyms_lookup_name().
> 
> That patch served two purposes: a helpful utility for other kprobes
> users, and it enabling what LKML deemed more important - unexporting
> kallsyms*.
> 
> 
> > It's actually worse than this, though.  The kernel API isn't fixed in
> > stone, it evolves usually by trying to make problematic use cases
> > better.  By refusing to consider using the replacement API [...]
> 
> Your lecture is based upon a misundertanding ...
> 
> 
> > [...]
> > It emits a single probe and produces this in the module build:
> > -rw-r--r-- 1 root root  17996 2008-07-15 20:45 stap_2154.c
> > About 600 lines.
> > However, it also needs this for the symbol table:
> > -rw-r--r-- 1 root root 446137 2008-07-15 20:45 stap-symbols.h
> 
> ... that this is somehow connected to the kprobe api issue.
> 
> IT IS NOT.
> 
> We do not use those symbol tables for kprobe placement purposes.
> (This part is partially a prototype for user-space parts, and the
> sizes will not stay large.)

There's no misunderstanding that you're currently embedding the kernel
symbol table and have expanded the sizes of the modules enormously to do
so.  The real truth is that coming up with convoluted ways to recompute
some information the kernel already knows is fragile and shouldn't be
done.  What should be done is to engage in discussion about how to get
that information out of the kernel ... preferably with use cases and
patches.

> The way we set up kprobes now could be trivially converted to
> "_stext"+offset.  Would that alone allay your concerns?

No ... because as I already said, that's broken for -ffunction-sections.
It needs to be the symbol of the actual function the probe is in and the
offset into that function.

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-16  2:06           ` James Bottomley
@ 2008-07-16 10:58             ` Frank Ch. Eigler
  2008-07-16 14:56               ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-16 10:58 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

Hi -

On Tue, Jul 15, 2008 at 09:06:23PM -0500, James Bottomley wrote:
> [...]
> > Please choose your words more carefully.  We don't "subvert" anything,
> > where one would mean sneaking around some sort of protection.
> 
> Actually, I did and you do.  One of the OED's definition of subvert is
> "to undermine or overturn a condition or order of things, a principle or
> a law etc."  In this particular case, this:
> 
> commit 3a872d89baae821a0f6e2c1055d4b47650661137
> Author: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
> Date:   Mon Oct 2 02:17:30 2006 -0700
>     [PATCH] Kprobes: Make kprobe modules more portable
>
> Which provided a portable input to kprobes (the symbol_name/offset one)
> and revoked the global accessibility of the kallsyms_lookup_name().

That patch served two purposes: a helpful utility for other kprobes
users, and it enabling what LKML deemed more important - unexporting
kallsyms*.


> It's actually worse than this, though.  The kernel API isn't fixed in
> stone, it evolves usually by trying to make problematic use cases
> better.  By refusing to consider using the replacement API [...]

Your lecture is based upon a misundertanding ...


> [...]
> It emits a single probe and produces this in the module build:
> -rw-r--r-- 1 root root  17996 2008-07-15 20:45 stap_2154.c
> About 600 lines.
> However, it also needs this for the symbol table:
> -rw-r--r-- 1 root root 446137 2008-07-15 20:45 stap-symbols.h

... that this is somehow connected to the kprobe api issue.

IT IS NOT.

We do not use those symbol tables for kprobe placement purposes.
(This part is partially a prototype for user-space parts, and the
sizes will not stay large.)

The way we set up kprobes now could be trivially converted to
"_stext"+offset.  Would that alone allay your concerns?


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 22:19         ` Frank Ch. Eigler
@ 2008-07-16  2:06           ` James Bottomley
  2008-07-16 10:58             ` Frank Ch. Eigler
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-16  2:06 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: linux-kernel, systemtap

On Tue, 2008-07-15 at 18:18 -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> On Tue, Jul 15, 2008 at 03:24:22PM -0500, James Bottomley wrote:
> > [...]
> > > > > > This is highly undesirable because it represents a subversion of the
> > > > > > kernel API to gain access to unexported symbols.
> > > [...]
> > > Maybe, but what "subversion" are you talking about?
> > 
> > using a hand crafted relocation function to gain access to kernel
> > symbols instead of the provided API.
> 
> Please choose your words more carefully.  We don't "subvert" anything,
> where one would mean sneaking around some sort of protection.

Actually, I did and you do.  One of the OED's definition of subvert is
"to undermine or overturn a condition or order of things, a principle or
a law etc."  In this particular case, this:

commit 3a872d89baae821a0f6e2c1055d4b47650661137
Author: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Date:   Mon Oct 2 02:17:30 2006 -0700

    [PATCH] Kprobes: Make kprobe modules more portable

Which provided a portable input to kprobes (the symbol_name/offset one)
and revoked the global accessibility of the kallsyms_lookup_name().

The design was for kprobes users to stop using kallsyms_lookup_name()
and to use the symbol_name/offset instead.  What systemtap did is code
its own _stp_module_lookup() as a fairly direct replacement for
kallsyms_lookup_name().  That's deliberately overturning the condition
or order of things, because you deliberately ignored the specific
replacement API in rolling your own, hence subversion.

It's actually worse than this, though.  The kernel API isn't fixed in
stone, it evolves usually by trying to make problematic use cases
better.  By refusing to consider using the replacement API, you lost the
opportunity to point out the shortcomings and negotiate for a better
one, so it's languished for two years with no real testing or update.
Worse still, you cut yourself off from the development flow of the
kernel and effectively forked a private API for you own use.  Now,
because of this, most kernel developers will be far less inclined to
listen to your input because you've chosen not to listen to theirs.  The
give and take of open source development that produces the virtuous
circle of innovation is broken.  To redress this, you have to use the
correct API and begin engaging in the dialogue which stalled two years
ago.

But let's examine the consequences objectively.  I have a simple single
probe file:

probe kernel.statement("*@block/bsg.c:144") {
        print ("here\n");
}

It emits a single probe and produces this in the module build:

-rw-r--r-- 1 root root  17996 2008-07-15 20:45 stap_2154.c

About 600 lines.

However, it also needs this for the symbol table:

-rw-r--r-- 1 root root 446137 2008-07-15 20:45 stap-symbols.h

About 12,500 lines just for the symbols.

Together these produce a module

-rw-r--r-- 1 root root 652509 2008-07-15 20:46 stap_2154.ko

That's well over half a megabyte largely because of the symbols.

By now the embedded guys are already having WTF attacks about your
module wanting a significant portion of their available ram ... and so
on ... Are you seriously arguing that a good 0.6MB of bloat just because
you refuse to use a provided API is a good thing?

I'm afraid this is your classic fish or cut bait issue:  You can choose
either to engage in dialogue with the kernel community, try to use the
provided API and improve it based on demonstrated use cases (and if you
choose to do this, I can help you with it, since interaction will need
to go both ways) and thus benefit from the open source innovation
stream, or you can keep within your own community, eschewing the broader
kernel development community, ignoring their feedback and spending all
your effort constructing work arounds for what you consider to be kernel
problems leading the systemtap users to be unhappy and Sun Marketing
pulverising us over our lack of useful tracing tools.

Which is it to be?

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 20:24       ` James Bottomley
@ 2008-07-15 22:19         ` Frank Ch. Eigler
  2008-07-16  2:06           ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-15 22:19 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

Hi -

On Tue, Jul 15, 2008 at 03:24:22PM -0500, James Bottomley wrote:
> [...]
> > > > > This is highly undesirable because it represents a subversion of the
> > > > > kernel API to gain access to unexported symbols.
> > [...]
> > Maybe, but what "subversion" are you talking about?
> 
> using a hand crafted relocation function to gain access to kernel
> symbols instead of the provided API.

Please choose your words more carefully.  We don't "subvert" anything,
where one would mean sneaking around some sort of protection.  We use
an established, existing facility (placing kprobes by address).  We
compute addresses correctly (an error would be obvious), and is
conceptually not so different from an address being passed in at run
time from /proc/kallsyms.


> > Of course, but for our purposes, the kernel will be just one amongst
> > many probing targets.  We will be tracking multiple symbol tables and
> > unwind data for user-space.
> 
> You're going to hand roll your own symbol resolution for user space too?
> Isn't it pretty easy simply to get ld.so to do that for you?

I don't see how.  We can't call into ld.so from kernel space.  One may
need to probe ld.so itself.


> [...] I made sure that pure addressed based kprobes still work even
> when they have to use the symbol_name/offset resolution method
> ... the new code just works out the closest symbol and applies an
> offset.

(Not important, but did you consider the possibility that this chosen
reference symbol may, for whatever reason, not be listed in the
kernel's kprobe-assisting symbol table?)


> > Again, please consider user-space.  The runtime will need similar
> > symbol resolution code *for user space* anyway.  Keeping it in there
> > for the kernel is no incremental complexity - if anything, the
> > opposite.
>
> I really think there might have to be separate runtime pieces for
> user space and for the kernel.  Trying to build a single scheme that
> works in both places looks cumbersome.  In the separate case, the
> kernel piece, which is potentially movable inside the kernel,
> becomes a lot simpler. [...]

It may just be in the eye of the beholder.  To me, a single scheme
that supports all the various address spaces (and kernel versions and
configurations!) we deal with is just as appealing to me as increasing
kernel specialization is to you.


> OK, with -ffunction-sections you can't offset from _stext which
> seems to be what _stp_module_relocate() uses.  [...]  That means
> that each function address can potentially move depending on the
> number of relocation stubs embedded between it and the next
> function.

I may be missing something, but doesn't all that happen during
linking?  We process linked executables, not object files subject to
further run-time relaxation/shrinkage.


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 20:08     ` Frank Ch. Eigler
@ 2008-07-15 20:24       ` James Bottomley
  2008-07-15 22:19         ` Frank Ch. Eigler
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-15 20:24 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: linux-kernel, systemtap

On Tue, 2008-07-15 at 16:07 -0400, Frank Ch. Eigler wrote:
> On Tue, Jul 15, 2008 at 02:52:06PM -0500, James Bottomley wrote:
> 
> > [...]
> > > > One of the big nasties of systemtap is the way it tries to embed
> > > > virtually the entirety of the kernel symbol table in the probe
> > > > modules it constructs.
> > > 
> > > It is a compromise of conflicting requirements. 
> > 
> > Well ... in order to make forward progress, since the systemtap people
> > expressed a desire to be better integrated with the kernel, the first
> > order of business is to use the correct APIs [...]
> 
> Let's concentrate then on those areas where this is more clear-cut.

This is the most clear cut of the areas.  Until the systemtap modules
get moved to the proper APIs, it's very difficult to tell what else
needs to be cleaned up or changed inside the kernel to support them.

> > > > This is highly undesirable because it represents a subversion of the
> > > > kernel API to gain access to unexported symbols.
> > > 
> > > Please elaborate.  Does the translator or its runtime use unexported
> > > symbols?  (That would arouse the question about why.)
> > > 
> > > Or are you talking about being able to *probe* unexported functions or
> > > access unexported data?  That would be a deliberate feature.
> > 
> > No ... I'm talking about _stp_module_relocate() at this point.  It's an
> > unnecessary function
> 
> Maybe, but what "subversion" are you talking about?

using a hand crafted relocation function to gain access to kernel
symbols instead of the provided API.  Even if it's not used as a
template for every producer of binary modules, it's just incredibly
fragile.

> > since the kprobes API provides a way to attach to a symbol and an
> > offset.  The API allows access to unexported functions.
> 
> ... but not to e.g. data, which also uses this common mechanism.

That was number three in the original list of problems on my email.  I
think it's solvable reasonably easily (as I outlined).

> > > > At least for kprobes, the correct way to do this is to specify the
> > > > probe point by symbol and offset.
> > > 
> > > But there won't be just kprobes.  Much of this code was built with
> > > anticipation of user-space probing, and there the kernel won't have a
> > > similar mechanism.  Similarly, the code is written to work with old
> > > kernels too - ones that predate the symbol+offset kprobe API.
> > 
> > OK ... you've got me there ... why would user space probing necessitate
> > resolution of kernel space symbols?  Surely you plan to use an exported
> > module API of utrace or whatever the agreed framework is?
> 
> Of course, but for our purposes, the kernel will be just one amongst
> many probing targets.  We will be tracking multiple symbol tables and
> unwind data for user-space.

You're going to hand roll your own symbol resolution for user space too?
Isn't it pretty easy simply to get ld.so to do that for you?

> > > Unless someone is about to rip out pure address-based kprobes, I see
> > > no reason to complicate the code.
> > 
> > If you actually look, you'll see that pure addressed based kprobes still
> > work.
> 
> No need for the snark.  I know they work; we've been using them for
> years.  I am simply happy to stay with them.

I meant if you read the patch I posted, I made sure that pure addressed
based kprobes still work even when they have to use the
symbol_name/offset resolution method ... the new code just works out the
closest symbol and applies an offset.

> > Also, I think you'll find it simplifies the code, since tons of the
> > runtime junk that duplicate the in-kernel symbol resolution can be
> > thrown out, plus the corresponding pieces of systemtap that have to
> > worry about this.
> 
> Again, please consider user-space.  The runtime will need similar
> symbol resolution code *for user space* anyway.  Keeping it in there
> for the kernel is no incremental complexity - if anything, the
> opposite.

I really think there might have to be separate runtime pieces for user
space and for the kernel.  Trying to build a single scheme that works in
both places looks cumbersome.  In the separate case, the kernel piece,
which is potentially movable inside the kernel, becomes a lot simpler.

> > There's also the architectural worry: this scheme you currently use
> > is very fragile.  For instance, I don't see it surviving a move to
> > -ffunction-sections (which patches are already going over
> > linux-arch).
> 
> Let's try it.  Whatever actual problems that throws up, we'd also
> encounter with userspace.

OK, with -ffunction-sections you can't offset from _stext which seems to
be what _stp_module_relocate() uses.  The reason this gets used on
something like parisc is so that we can place jump stubs in between the
function sections if necessary to widen the PCREL17 relocations.  That
means that each function address can potentially move depending on the
number of relocation stubs embedded between it and the next function.
The exact same problem occurs between DSOs in user space.  Your problem
is that you don't know the separations until someone attempts linking
(and even then, there's nothing except common sense that requires the
linker to use the minimum number of stubs).

Now, you could still offset from the start address of the section, but
that's simply the address of the function, so resolution by
symbol_name/offset is the effective solution.

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 19:52   ` James Bottomley
@ 2008-07-15 20:08     ` Frank Ch. Eigler
  2008-07-15 20:24       ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-15 20:08 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

Hi -

On Tue, Jul 15, 2008 at 02:52:06PM -0500, James Bottomley wrote:

> [...]
> > > One of the big nasties of systemtap is the way it tries to embed
> > > virtually the entirety of the kernel symbol table in the probe
> > > modules it constructs.
> > 
> > It is a compromise of conflicting requirements. 
> 
> Well ... in order to make forward progress, since the systemtap people
> expressed a desire to be better integrated with the kernel, the first
> order of business is to use the correct APIs [...]

Let's concentrate then on those areas where this is more clear-cut.


> > > This is highly undesirable because it represents a subversion of the
> > > kernel API to gain access to unexported symbols.
> > 
> > Please elaborate.  Does the translator or its runtime use unexported
> > symbols?  (That would arouse the question about why.)
> > 
> > Or are you talking about being able to *probe* unexported functions or
> > access unexported data?  That would be a deliberate feature.
> 
> No ... I'm talking about _stp_module_relocate() at this point.  It's an
> unnecessary function

Maybe, but what "subversion" are you talking about?

> since the kprobes API provides a way to attach to a symbol and an
> offset.  The API allows access to unexported functions.

... but not to e.g. data, which also uses this common mechanism.


> > > At least for kprobes, the correct way to do this is to specify the
> > > probe point by symbol and offset.
> > 
> > But there won't be just kprobes.  Much of this code was built with
> > anticipation of user-space probing, and there the kernel won't have a
> > similar mechanism.  Similarly, the code is written to work with old
> > kernels too - ones that predate the symbol+offset kprobe API.
> 
> OK ... you've got me there ... why would user space probing necessitate
> resolution of kernel space symbols?  Surely you plan to use an exported
> module API of utrace or whatever the agreed framework is?

Of course, but for our purposes, the kernel will be just one amongst
many probing targets.  We will be tracking multiple symbol tables and
unwind data for user-space.


> > Unless someone is about to rip out pure address-based kprobes, I see
> > no reason to complicate the code.
> 
> If you actually look, you'll see that pure addressed based kprobes still
> work.

No need for the snark.  I know they work; we've been using them for
years.  I am simply happy to stay with them.


> Also, I think you'll find it simplifies the code, since tons of the
> runtime junk that duplicate the in-kernel symbol resolution can be
> thrown out, plus the corresponding pieces of systemtap that have to
> worry about this.

Again, please consider user-space.  The runtime will need similar
symbol resolution code *for user space* anyway.  Keeping it in there
for the kernel is no incremental complexity - if anything, the
opposite.


> There's also the architectural worry: this scheme you currently use
> is very fragile.  For instance, I don't see it surviving a move to
> -ffunction-sections (which patches are already going over
> linux-arch).

Let's try it.  Whatever actual problems that throws up, we'd also
encounter with userspace.


- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
  2008-07-15 19:42 ` Frank Ch. Eigler
@ 2008-07-15 19:52   ` James Bottomley
  2008-07-15 20:08     ` Frank Ch. Eigler
  0 siblings, 1 reply; 49+ messages in thread
From: James Bottomley @ 2008-07-15 19:52 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: linux-kernel, systemtap

On Tue, 2008-07-15 at 15:41 -0400, Frank Ch. Eigler wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> 
> > One of the big nasties of systemtap is the way it tries to embed
> > virtually the entirety of the kernel symbol table in the probe
> > modules it constructs.
> 
> It is a compromise of conflicting requirements. 

Well ... in order to make forward progress, since the systemtap people
expressed a desire to be better integrated with the kernel, the first
order of business is to use the correct APIs ... that has to happen even
before an evaluation can be made of which pieces of the systemtap
runtime should move into the kernel.

> > This is highly undesirable because it represents a subversion of the
> > kernel API to gain access to unexported symbols.
> 
> Please elaborate.  Does the translator or its runtime use unexported
> symbols?  (That would arouse the question about why.)
> 
> Or are you talking about being able to *probe* unexported functions or
> access unexported data?  That would be a deliberate feature.

No ... I'm talking about _stp_module_relocate() at this point.  It's an
unnecessary function, since the kprobes API provides a way to attach to
a symbol and an offset.  The API allows access to unexported functions.

> > At least for kprobes, the correct way to do this is to specify the
> > probe point by symbol and offset.
> 
> But there won't be just kprobes.  Much of this code was built with
> anticipation of user-space probing, and there the kernel won't have a
> similar mechanism.  Similarly, the code is written to work with old
> kernels too - ones that predate the symbol+offset kprobe API.

OK ... you've got me there ... why would user space probing necessitate
resolution of kernel space symbols?  Surely you plan to use an exported
module API of utrace or whatever the agreed framework is?

> Unless someone is about to rip out pure address-based kprobes, I see
> no reason to complicate the code.

If you actually look, you'll see that pure addressed based kprobes still
work.

Also, I think you'll find it simplifies the code, since tons of the
runtime junk that duplicate the in-kernel symbol resolution can be
thrown out, plus the corresponding pieces of systemtap that have to
worry about this.

There's also the architectural worry:  this scheme you currently use is
very fragile.  For instance, I don't see it surviving a move to
-ffunction-sections (which patches are already going over linux-arch).

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] systemtap: begin the process of using proper kernel APIs  (part1: use kprobe symbol_name/offset instead of address)
       [not found] <1216146802.3312.95.camel__45052.4692344063$1216146917$gmane$org@localhost.localdomain>
@ 2008-07-15 19:42 ` Frank Ch. Eigler
  2008-07-15 19:52   ` James Bottomley
  0 siblings, 1 reply; 49+ messages in thread
From: Frank Ch. Eigler @ 2008-07-15 19:42 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-kernel, systemtap

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> One of the big nasties of systemtap is the way it tries to embed
> virtually the entirety of the kernel symbol table in the probe
> modules it constructs.

It is a compromise of conflicting requirements. 

> This is highly undesirable because it represents a subversion of the
> kernel API to gain access to unexported symbols.

Please elaborate.  Does the translator or its runtime use unexported
symbols?  (That would arouse the question about why.)

Or are you talking about being able to *probe* unexported functions or
access unexported data?  That would be a deliberate feature.

> At least for kprobes, the correct way to do this is to specify the
> probe point by symbol and offset.

But there won't be just kprobes.  Much of this code was built with
anticipation of user-space probing, and there the kernel won't have a
similar mechanism.  Similarly, the code is written to work with old
kernels too - ones that predate the symbol+offset kprobe API.

Unless someone is about to rip out pure address-based kprobes, I see
no reason to complicate the code.

- FChE

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2008-07-23 22:14 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-15 18:33 [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) James Bottomley
2008-07-16 22:42 ` Masami Hiramatsu
2008-07-16 23:03   ` James Bottomley
2008-07-17  0:07     ` Masami Hiramatsu
2008-07-17  1:50       ` James Bottomley
2008-07-17 14:18         ` James Bottomley
2008-07-17 16:59           ` James Bottomley
2008-07-17 21:38             ` Masami Hiramatsu
2008-07-17 22:03               ` James Bottomley
2008-07-21 14:21               ` James Bottomley
     [not found]           ` <1216313914.5515.25.camel__21144.9282979176$1216314027$gmane$org@localhost.localdomain>
2008-07-17 18:32             ` Frank Ch. Eigler
2008-07-17 20:13               ` James Bottomley
2008-07-17 20:28                 ` Frank Ch. Eigler
2008-07-17 21:06                   ` James Bottomley
2008-07-17 21:35                     ` Frank Ch. Eigler
2008-07-17 22:05                       ` Masami Hiramatsu
2008-07-22 18:00                       ` Rik van Riel
2008-07-22 18:12                         ` Frank Ch. Eigler
2008-07-22 18:31                           ` Peter Zijlstra
     [not found]                           ` <1216751477.7257.115.camel__19834.5970632092$1216751567$gmane$org@twins>
2008-07-22 18:50                             ` Frank Ch. Eigler
2008-07-23 15:06                         ` systemtap & backward compatibility, was Re: [RFC] systemtap: begin the process of using proper kernel APIs Frank Ch. Eigler
2008-07-23 15:29                           ` Arjan van de Ven
2008-07-23 15:33                           ` Peter Zijlstra
2008-07-23 20:26                             ` Masami Hiramatsu
     [not found]                           ` <20080723082856.334f9c17__2909.60763018138$1216827051$gmane$org@infradead.org>
2008-07-23 16:43                             ` Frank Ch. Eigler
2008-07-23 16:54                               ` Adrian Bunk
2008-07-23 17:43                                 ` Frank Ch. Eigler
2008-07-23 18:41                                   ` Adrian Bunk
2008-07-23 22:14                               ` Masami Hiramatsu
2008-07-18  9:11 ` [RFC] systemtap: begin the process of using proper kernel APIs (part1: use kprobe symbol_name/offset instead of address) Andi Kleen
2008-07-18  9:23   ` Peter Zijlstra
2008-07-18 10:31     ` Andi Kleen
2008-07-18 10:44       ` Peter Zijlstra
2008-07-18 10:52         ` Andi Kleen
2008-07-18 13:04     ` Frank Ch. Eigler
2008-07-18 13:09       ` Andi Kleen
2008-07-18 13:10       ` Peter Zijlstra
2008-07-18 13:31         ` Frank Ch. Eigler
2008-07-18 13:36         ` Andi Kleen
2008-07-18 13:21       ` James Bottomley
2008-07-18 13:40         ` Frank Ch. Eigler
     [not found] <1216146802.3312.95.camel__45052.4692344063$1216146917$gmane$org@localhost.localdomain>
2008-07-15 19:42 ` Frank Ch. Eigler
2008-07-15 19:52   ` James Bottomley
2008-07-15 20:08     ` Frank Ch. Eigler
2008-07-15 20:24       ` James Bottomley
2008-07-15 22:19         ` Frank Ch. Eigler
2008-07-16  2:06           ` James Bottomley
2008-07-16 10:58             ` Frank Ch. Eigler
2008-07-16 14:56               ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).