new static user probe types

public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed

* new static user probe types
@ 2009-06-26 21:26 Stan Cox
  2009-07-15 16:19 ` Stan Cox
  0 siblings, 1 reply; 16+ messages in thread
From: Stan Cox @ 2009-06-26 21:26 UTC (permalink / raw)
  To: systemtap

There are now two new static user probe types, one that uses utrace
and one that uses kprobes.  They are currently available by compiling
with -DEXPERIMENTAL_UTRACE_SDT and -DEXPERIMENTAL_KPROBE_SDT.  The
code in tapsets.cxx has been refactored by creating a probe_table and an
sdt_var_expanding_visitor.  The probe_table handles retrieving probes
from the .probes section and building an stap probe.  The
sdt_var_expanding_visitor handles expanding arguments.

A kprobe probe currently maps to the getegid syscall.  The
probe_table::convert_probe method will synthesize something like this:
The string comparison is acknowledged to be expensive and needs to be
replaced by an integer comparison.

kernel.function("*sys_getegid*")
{
// Only done for i386
regparm(0)
// Did we get here for the probe in question?
if ((user_string(ulong_arg(1))) != ("test_probe_4")) next
}

A utrace probe uses a dummy syscall value and uses the stap utrace
syscall infrastructure.  The probe_table::convert_probe method will
synthesize something like this:  

process("PATH").syscall
{
// Did we get here for the probe in question?
if ((user_string(_utrace_syscall_arg(0))) != ("test_probe_4")) next
// Is this the expected task?
if ((task_tid(task_current())) != (_utrace_syscall_arg(1))) next
// Is this the "fake" syscall from the sdt probe?
if ((_utrace_syscall_nr()) != (48813)) next
}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-06-26 21:26 new static user probe types Stan Cox
@ 2009-07-15 16:19 ` Stan Cox
  2009-07-15 18:39   ` Josh Stone
  2009-07-20 18:34   ` Stan Cox
  0 siblings, 2 replies; 16+ messages in thread
From: Stan Cox @ 2009-07-15 16:19 UTC (permalink / raw)
  To: systemtap

The performance for the respective static probe types, for a test 
program that invokes a static user probe in a loop 1,000,000 times, is:
kprobe 1.61user 4.73system
utrace 1.38user 22.02system
uprobe 1.24user 11.30system

It ended up that the guard statements in the probes were too permissive; 
fixing this and doing some guard statement performance tweaks yields:
kprobe 1.65user 3.78system
utrace 1.19user 6.03system
uprobe 1.24user 11.30system

The guard statements this experiments is using look like:
kprobe:
if ((task_tid(task_current())) != (_utrace_syscall_arg(1))) next   # 
current task?
if ((pointer_arg(1)) != (_stp_probe_name_test)) {      # probe name ptr 
differs from saved?
if ((user_int(pointer_arg(1))) != (1953719668)) next  # check first word 
of probe name
if ((user_string(pointer_arg(1))) != ("test")) next       # check string 
if all else fails
}
(_stp_probe_name_test) = (pointer_arg(1))               # remember saved 
probe name ptr

utrace:
if ((_utrace_syscall_nr()) != (48813)) next               # is this the 
fake sdt "syscall"
if ((pointer_arg(1)...                                               # 
as above

Running oprofile:
kprobe old 27% strlcpy 11% int3
          exp 13% __ticket_spin_lock 10% int3
utrace old  25% search_extable 18% page_fault 16% strlcpy
         exp  43% start_callback 5% system_call

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-15 16:19 ` Stan Cox
@ 2009-07-15 18:39   ` Josh Stone
  2009-07-15 20:47     ` Stan Cox
  2009-07-20 18:34   ` Stan Cox
  1 sibling, 1 reply; 16+ messages in thread
From: Josh Stone @ 2009-07-15 18:39 UTC (permalink / raw)
  To: Stan Cox; +Cc: systemtap

On 07/15/2009 09:19 AM, Stan Cox wrote:
> The guard statements this experiments is using look like:
> kprobe:
> if ((task_tid(task_current())) != (_utrace_syscall_arg(1))) next   # 
> current task?

task_tid(task_current()) == tid(), and the latter should be faster.  But
anyway, it's not clear to me what this is guarding against.

> if ((pointer_arg(1)) != (_stp_probe_name_test)) {      # probe name ptr 
> differs from saved?
> if ((user_int(pointer_arg(1))) != (1953719668)) next  # check first word 
> of probe name
> if ((user_string(pointer_arg(1))) != ("test")) next       # check string 
> if all else fails
> }
> (_stp_probe_name_test) = (pointer_arg(1))               # remember saved 
> probe name ptr

Beware that the 1953719668 constant is little-endian specific.

I'm also nervous about saving _stp_probe_name_XXX -- murphy's law may
eventually produce a mark in one app that happens to have the same
pointer address as a different mark in a different app.


Josh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-15 18:39   ` Josh Stone
@ 2009-07-15 20:47     ` Stan Cox
  2009-07-15 21:57       ` Josh Stone
  0 siblings, 1 reply; 16+ messages in thread
From: Stan Cox @ 2009-07-15 20:47 UTC (permalink / raw)
  To: Josh Stone, systemtap

Josh Stone wrote:
> On 07/15/2009 09:19 AM, Stan Cox wrote:
>   
>
> task_tid(task_current()) == tid(), and the latter should be faster.  But
> anyway, it's not clear to me what this is guarding against.
>
>   
An application sets up a kprobe probe by hijacking a syscall, currently 
SYS_getegid, and passing the tid.  On the stap side the .mark probe gets 
converted to kernel.function("sys_getegid") so the guard wants to make 
certain we didn't get here via another task doing a SYS_getegid syscall.
> Beware that the 1953719668 constant is little-endian specific.
>
> I'm also nervous about saving _stp_probe_name_XXX -- murphy's law may
> eventually produce a mark in one app that happens to have the same
> pointer address as a different mark in a different app.
>   
There will be a series of probes for the hijacked syscall, one per 
marker type.  So if this is postgres there would be one for 
transaction__start, lwlock__acquire, etc.  So it is necessary to check 
that the probe for transaction__start actually arrived here from a 
transaction__start.  The probe name is passed in so the easy way to do 
this is to compare the passed in probe name to the .mark("marker_name") 
and check that they match.  However it would be nice to "optimize" this 
so the string comparison is avoided if there is a lower cost 
alternative.  That is what my experiment was trying to achieve. Any 
clever suggestions on how to do that?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-15 20:47     ` Stan Cox
@ 2009-07-15 21:57       ` Josh Stone
  2009-07-16 13:44         ` Stan Cox
  0 siblings, 1 reply; 16+ messages in thread
From: Josh Stone @ 2009-07-15 21:57 UTC (permalink / raw)
  To: Stan Cox; +Cc: systemtap

On 07/15/2009 01:47 PM, Stan Cox wrote:
> Josh Stone wrote:
>> On 07/15/2009 09:19 AM, Stan Cox wrote:
>>   
>>
>> task_tid(task_current()) == tid(), and the latter should be faster.  But
>> anyway, it's not clear to me what this is guarding against.
>>
>>   
> An application sets up a kprobe probe by hijacking a syscall, currently 
> SYS_getegid, and passing the tid.  On the stap side the .mark probe gets 
> converted to kernel.function("sys_getegid") so the guard wants to make 
> certain we didn't get here via another task doing a SYS_getegid syscall.

Ok, so really any magic value would work, right?  Why not use the
STAP_GUARD constant and avoid the extra gettid syscalls in the app?
Then the guard doesn't need to be part of the other arguments either.

Also, I see this in sdt.h:
> # if defined EXPERIMENTAL_KPROBE_SDT
> # define STAP_SYSCALL __NR_getegid
> # define STAP_GUARD 0x32425250
> # define GETTID 0
> # elif defined EXPERIMENTAL_UTRACE_SDT
> # define STAP_SYSCALL 0xbead
> # define STAP_GUARD 0x33425250
> # define GETTID syscall(SYS_gettid)
> # endif

It would appear that you're always passing 0 in place of a tid in the
kprobe version.  Wouldn't that mean the kprobe version is always
skipping the probe body?

And then the guard code you gave for utrace is not using the tid check,
even though it is generating a real GETTID.  It seems that the gettid()
penalties are reversed here.

But again, perhaps we don't need the tid at all...

>> Beware that the 1953719668 constant is little-endian specific.
>>
>> I'm also nervous about saving _stp_probe_name_XXX -- murphy's law may
>> eventually produce a mark in one app that happens to have the same
>> pointer address as a different mark in a different app.
>>   
> There will be a series of probes for the hijacked syscall, one per 
> marker type.  So if this is postgres there would be one for 
> transaction__start, lwlock__acquire, etc.  So it is necessary to check 
> that the probe for transaction__start actually arrived here from a 
> transaction__start.  The probe name is passed in so the easy way to do 
> this is to compare the passed in probe name to the .mark("marker_name") 
> and check that they match.  However it would be nice to "optimize" this 
> so the string comparison is avoided if there is a lower cost 
> alternative.  That is what my experiment was trying to achieve. Any 
> clever suggestions on how to do that?

Correctness always trumps optimization (unfortunately).

It looks like a large part of the oprofile overhead is in strlcpy, which
is probably script code preparing to do the strcmp.  You might be able
to reduce the overhead by doing the checks in native C rather than
synthesizing script code.  You could write C in emit_probe_local_init
instead, and then you'd even be able to skip probes before locks are
attempted.

Using a hash or UUID instead of strings might also help, but I don't
know how to generate that in macros...

Josh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-15 21:57       ` Josh Stone
@ 2009-07-16 13:44         ` Stan Cox
  0 siblings, 0 replies; 16+ messages in thread
From: Stan Cox @ 2009-07-16 13:44 UTC (permalink / raw)
  To: Josh Stone; +Cc: systemtap

Josh Stone wrote:
> On 07/15/2009 01:47 PM, Stan Cox wrote:
>   
> You could write C in emit_probe_local_init
> instead, and then you'd even be able to skip probes before locks are
> attempted.
>   
Ah that's a good hint; I didn't know about emit_probe_local_init.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-15 16:19 ` Stan Cox
  2009-07-15 18:39   ` Josh Stone
@ 2009-07-20 18:34   ` Stan Cox
  2009-07-22 10:42     ` Mark Wielaard
  1 sibling, 1 reply; 16+ messages in thread
From: Stan Cox @ 2009-07-20 18:34 UTC (permalink / raw)
  To: systemtap

I did some tweaking and a bit more testing.  Running a program which 
invokes a probe in a loop.  Invoked via stap running a script with only 
a begin in it.
utrace 0.71user 0.34system 0:01.10elapsed # probe is a syscall, no arg setup
kprobe 0.72user 0.33system 0:01.09elapsed # probe is a syscall, no arg setup
uprobe 0.57user 0.09system 0:00.68elapsed # probe is a nop, possibly 
also arg setup

Invoked via stap running a script which actually has handlers for the probes
utrace 0.82user 4.28system 0:05.17elapsed
kprobe 1.59user 5.18system 0:08.50elapsed
uprobe 1.17user 11.12system 0:12.35elapsed

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-20 18:34   ` Stan Cox
@ 2009-07-22 10:42     ` Mark Wielaard
  2009-07-22 14:39       ` Frank Ch. Eigler
  2009-07-23  3:07       ` Roland McGrath
  0 siblings, 2 replies; 16+ messages in thread
From: Mark Wielaard @ 2009-07-22 10:42 UTC (permalink / raw)
  To: Stan Cox; +Cc: systemtap

On Mon, 2009-07-20 at 14:34 -0400, Stan Cox wrote:
> I did some tweaking and a bit more testing.  Running a program which 
> invokes a probe in a loop.  Invoked via stap running a script with only 
> a begin in it.
> utrace 0.71user 0.34system 0:01.10elapsed # probe is a syscall, no arg setup
> kprobe 0.72user 0.33system 0:01.09elapsed # probe is a syscall, no arg setup
> uprobe 0.57user 0.09system 0:00.68elapsed # probe is a nop, possibly 
> also arg setup

So uprobes is the clear winner for almost zero-overhead when disabled.

> Invoked via stap running a script which actually has handlers for the probes
> utrace 0.82user 4.28system 0:05.17elapsed
> kprobe 1.59user 5.18system 0:08.50elapsed
> uprobe 1.17user 11.12system 0:12.35elapsed

But it has the largest overhead when enabled. Clearly we want the uprobe
mechanism when probes are disabled, but the utrace mechanism when probes
are enabled!

Would it be possible to change the uprobes mechanism for inserting trap
instructions to insert any instructions (sequence)?

Then we could have the best of both worlds, or could even decide at
runtime what to insert to when the probe gets enabled.

You would make sure that there are enough nops in the place of the probe
point for the instruction sequence you want to replace and then the
uprobes insert instruction mechanism would (after checking it had enough
nop space) insert the instruction sequence (preferable the one used by
the utrace mechanism).

It would also help with implementing the idea for the ENABLED mechanism
idea suggested in
http://sourceware.org/bugzilla/show_bug.cgi?id=10013#c2
but there you would replace a specific instruction with another (the
inserting mechanism should check of course).

So, it might be a bit like what Srikar posted to utrace-devel:
https://www.redhat.com/archives/utrace-devel/2009-June/msg00018.html
But with the Ubp (ubp_bkpt insn) replaced with inserting of an
"arbitrary" instruction instead of a trap (and possibly a check that the
instruction being replaced is the intended one). And without the need
for Ssol in that case.

Cheers,

Mark

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-22 10:42     ` Mark Wielaard
@ 2009-07-22 14:39       ` Frank Ch. Eigler
  2009-07-22 17:10         ` Mark Wielaard
  2009-07-23  3:07       ` Roland McGrath
  1 sibling, 1 reply; 16+ messages in thread
From: Frank Ch. Eigler @ 2009-07-22 14:39 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Stan Cox, systemtap

Mark Wielaard <mjw@redhat.com> writes:

> So uprobes is the clear winner for almost zero-overhead when
> disabled.  [...]  But it has the largest overhead when
> enabled. Clearly we want the uprobe mechanism when probes are
> disabled, but the utrace mechanism when probes are enabled! [...]

Yup.

> [...] It would also help with implementing the idea for the ENABLED
> mechanism idea suggested in [PR10013].

Or how about this.  We could expand STAP_PROBE(...) to

   { extern char stap_probe_NNNN_enabled_p;
     if (unlikely(stap_probe_NNN_enabled_p)) {
        /* current inline-asm stuff, but adding
           &enabled_p to the descriptor struct. */
     }
   }

Since this FOO_enabled_p variable would be initialized to 0, this
would allow bypass of all the instrumentation inline-asm in the
instrumentation-off case.  To turn on the instrumentation, the
systemtap runtime would poke at the target process's memory to
(atomically) increment that flag byte, and vice versa.  (Increment
instead of set, in case multiple systemtap sessions are targeting the
same process.)

The inline-asm inside could be the fastest enabled variant, probably
the kprobe-based one.  (This would make user-space sdt.h usable
without utrace & uprobes.)

The explicit ENABLED() test from PR10013 could use the same flag
variable, and thus piggyback on the same run-time mechanism.  (Its
declaration would have to be put into some dtrace-script-generated
header file, I guess.)

- FChE

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-22 14:39       ` Frank Ch. Eigler
@ 2009-07-22 17:10         ` Mark Wielaard
  2009-07-29 15:44           ` Stan Cox
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Wielaard @ 2009-07-22 17:10 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Stan Cox, systemtap

Hi Frank,

On Wed, 2009-07-22 at 10:39 -0400, Frank Ch. Eigler wrote:
> Or how about this.  We could expand STAP_PROBE(...) to
> 
>    { extern char stap_probe_NNNN_enabled_p;
>      if (unlikely(stap_probe_NNN_enabled_p)) {
>         /* current inline-asm stuff, but adding
>            &enabled_p to the descriptor struct. */
>      }
>    }
> 
> Since this FOO_enabled_p variable would be initialized to 0, this
> would allow bypass of all the instrumentation inline-asm in the
> instrumentation-off case.  To turn on the instrumentation, the
> systemtap runtime would poke at the target process's memory to
> (atomically) increment that flag byte, and vice versa.  (Increment
> instead of set, in case multiple systemtap sessions are targeting the
> same process.)
> 
> The inline-asm inside could be the fastest enabled variant, probably
> the kprobe-based one.  (This would make user-space sdt.h usable
> without utrace & uprobes.)

O, I like it. It is probably not as zero-overhead as the "pure nop"
approach, but it would be easy (well, relatively, the sdt macros
assembly stuff is slightly hard, at least for me) to implement (easier
than extending the ubp stuff at least). Certainly warrants a try and
benchmark. Making this only need kprobes is also appealing (even though
I really wish to see utrace and uprobes move forward and into the
mainline kernel).

BTW. For storing changeable variables the .probes section should become
alloc, rw now always (it currently is only for relocatable objects). How
to keep the section/parsing compatible with older versions already
compiled into programs is also left as an exercise to the reader...

Cheers,

Mark

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-22 17:10         ` Mark Wielaard
@ 2009-07-29 15:44           ` Stan Cox
  2009-07-29 15:51             ` Stan Cox
  0 siblings, 1 reply; 16+ messages in thread
From: Stan Cox @ 2009-07-29 15:44 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Frank Ch. Eigler, systemtap

So here is what I have for SW10013:

1. Given a 'dtrace' input file:
provider tstsyscall
{
 probe test(short arg1, int arg2, int arg3, int arg4, struct astruct arg5)
}

2. The dtrace script produces a header file containing:
#define STAP_HAS_SEMAPHORES 1
...
/* TSTSYSCALL_TEST ( short arg1, int arg2, int arg3, int arg4, struct astruct arg5) */
#define TSTSYSCALL_TEST_ENABLED() test_semaphore
__extension__ extern long test_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TSTSYSCALL_TEST(arg1,arg2,arg3,arg4,arg5) \
STAP_PROBE5(provider,test,arg1,arg2,arg3,arg4,arg5)
// these are to help kprobe and utrace access struct astruct
struct astruct {int a; int b;};
typedef struct astruct test_arg5;  test_arg5 test_arg5_v;

3. where STAP_PROBE5 is defined as:
#define STAP_PROBE5_(probe,label,parm1,parm2,parm3,parm4,parm5)
STAP_SEMAPHORE(probe)
do {
  __extension__ struct {size_t arg1 __attribute__((aligned(8))); 
	  size_t arg2 __attribute__((aligned(8))); 
	  size_t arg3 __attribute__((aligned(8))); 
	  size_t arg4 __attribute__((aligned(8))); 
	  size_t arg5 __attribute__((aligned(8)));} 
  stap_probe5_args = {(size_t)parm1, (size_t)parm2, (size_t)parm3, (size_t)parm4, 
	(size_t)parm5}; 
  STAP_PROBE_DATA(probe,STAP_GUARD,5); 
  syscall (STAP_SYSCALL, #probe, STAP_GUARD, &stap_probe5_args); 
 } while (0)

 4. and STAP_SEMAPHORE is defined as (so if the dtrace script isn't
    being used there is no test):
#ifdef STAP_HAS_SEMAPHORE 
#define STAP_SEMAPHORE(probe) 
  if ( probe ## _semaphore ) 
#else
#define STAP_SEMAPHORE(probe) ;
#endif

5. and the dtrace script also produces tstsyscall_.c.  This defines
    the test value and creates a ctor where stap can set the value.
__extension__ long test_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));

_stap_constructor ()
{
  __asm__ volatile ("/* %0 */" :: "g"(test_semaphore));
}

6. this needs to be improved but currently if stap sees a test probe then it 
   'automatically' produces this probe which requires -g.  
process(EXECUTABLE).statement("_stap_constructor")
{ test_semaphore = 1 }

7. The execution time of a synthetic test not run under stap influence
    is more or less equivalent with the above and when run under stap:
utrace 0.76user 4.30system 0:05.22elapsed
kprobe 1.38user 5.09system 0:06.92elapsed 
uprobe 0.92user 10.99system 0:12.01elapsed 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-29 15:44           ` Stan Cox
@ 2009-07-29 15:51             ` Stan Cox
  0 siblings, 0 replies; 16+ messages in thread
From: Stan Cox @ 2009-07-29 15:51 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Frank Ch. Eigler, systemtap


>
> 5. and the dtrace script also produces tstsyscall_.c. 
>  
> _stap_constructor ()
Forgot to mention that _stap_constructor is declared as:
void _stap_constructor () __attribute__((constructor));

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-22 10:42     ` Mark Wielaard
  2009-07-22 14:39       ` Frank Ch. Eigler
@ 2009-07-23  3:07       ` Roland McGrath
  2009-07-23 10:28         ` Mark Wielaard
  1 sibling, 1 reply; 16+ messages in thread
From: Roland McGrath @ 2009-07-23  3:07 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Stan Cox, systemtap

> So uprobes is the clear winner for almost zero-overhead when disabled.

Well, sure.  It's just a nop or three before it gets a breakpoint inserted
(aside from argument packing overhead).

> But it has the largest overhead when enabled. Clearly we want the uprobe
> mechanism when probes are disabled, but the utrace mechanism when probes
> are enabled!

:-)

> Would it be possible to change the uprobes mechanism for inserting trap
> instructions to insert any instructions (sequence)?

That's what jump-optimized probes et al are about, using instruction
analysis.  But here you don't have to do the general case, just a
precisely-chosen code patch of an exactly-known nop sequence the sdt.h
macros generate.

> Then we could have the best of both worlds, or could even decide at
> runtime what to insert to when the probe gets enabled.

Indeed.

> You would make sure that there are enough nops in the place of the probe
> point for the instruction sequence you want to replace and then the
> uprobes insert instruction mechanism would (after checking it had enough
> nop space) insert the instruction sequence (preferable the one used by
> the utrace mechanism).

It can be more precisely-tailored than that, you don't need to think of it
as being a "uprobes method" at all.  It's very simple hard-wired code patching.
i.e., the macro produces one long nop and you patch that to a relative call.
You can make it a call to a stock function we provide in some .a you link
with, or to a stub generated directly in an alternate section by the macros.
(If you don't need different stubs, it could be in a linkonce section.)

> It would also help with implementing the idea for the ENABLED mechanism

That's just another variant of code-patching for the same purpose.

> So, it might be a bit like what Srikar posted to utrace-devel: [...]

By which you just mean it's another kind of code-patching.

> Or how about this.  We could expand STAP_PROBE(...) to
> 
>    { extern char stap_probe_NNNN_enabled_p;
>      if (unlikely(stap_probe_NNN_enabled_p)) {
>         /* current inline-asm stuff, but adding
>            &enabled_p to the descriptor struct. */
>      }
>    }

The point of this is to skip any argument-packing work generated by the
compiler, which would be inside the "if unlikely" block, right?

> The inline-asm inside could be the fastest enabled variant, probably
> the kprobe-based one.  (This would make user-space sdt.h usable
> without utrace & uprobes.)

Presumably the really fastest would be an ill-used syscall that has a
tracepoint in it.  I have also been thinking about vDSO ways of doing that.

> O, I like it. It is probably not as zero-overhead as the "pure nop"

Less for some and more for others, I would guess.  That is, in the base
case the test and branch not taken would be slower than the nop.  But if
just a few more instructions are required to set up the tracepoint
arguments and those are skipped in the disabled case, perhaps that tips it.

> Certainly warrants a try and benchmark.

I think this is a lot like some things Mathieu already experimented with
and measured in the kernel context.  I think he pursued a code-patching
flavor that patched an immediate operand because that was measured as
faster than having the actual extra load of a simple enabled_p variable.

> BTW. For storing changeable variables the .probes section should become
> alloc, rw now always (it currently is only for relocatable objects). 

It doesn't make sense that it should differ in relocatable objects.
I don't understand that.

As to the general question: eh, maybe.  From inside the kernel it's no
problem to poke a r/o text word just like a user-mutable data page.  Either
way the file-mapped page from the executable/DSO gets COW'd so you can
change it.  If you throw it into general data, it's more likely you will
share a page with other random data and so won't actually add a page's COW
overhead.  But the flip side is that keeping it in read-only means that a
program going haywire can't (without mprotect calls) accidentally set/clear
the flag, which is always winds up being lots of fun in the truly weirdest
debugging scenarios.

Thanks,
Roland

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-23  3:07       ` Roland McGrath
@ 2009-07-23 10:28         ` Mark Wielaard
  2009-07-23 14:40           ` Frank Ch. Eigler
  2009-07-23 19:33           ` Roland McGrath
  0 siblings, 2 replies; 16+ messages in thread
From: Mark Wielaard @ 2009-07-23 10:28 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Stan Cox, systemtap

On Wed, 2009-07-22 at 20:06 -0700, Roland McGrath wrote:
> > You would make sure that there are enough nops in the place of the probe
> > point for the instruction sequence you want to replace and then the
> > uprobes insert instruction mechanism would (after checking it had enough
> > nop space) insert the instruction sequence (preferable the one used by
> > the utrace mechanism).
> 
> It can be more precisely-tailored than that, you don't need to think of it
> as being a "uprobes method" at all.  It's very simple hard-wired code patching.
> i.e., the macro produces one long nop and you patch that to a relative call.
> You can make it a call to a stock function we provide in some .a you link
> with, or to a stub generated directly in an alternate section by the macros.
> (If you don't need different stubs, it could be in a linkonce section.)
>
> > It would also help with implementing the idea for the ENABLED mechanism
> 
> That's just another variant of code-patching for the same purpose.
> 
> > So, it might be a bit like what Srikar posted to utrace-devel: [...]
> 
> By which you just mean it's another kind of code-patching.

Yes. I am thinking of it as a "uprobes method" since that already
contains the UBP mechanism, interfaces and data structures we would need
to make this more general.

> > Or how about this.  We could expand STAP_PROBE(...) to
> > 
> >    { extern char stap_probe_NNNN_enabled_p;
> >      if (unlikely(stap_probe_NNN_enabled_p)) {
> >         /* current inline-asm stuff, but adding
> >            &enabled_p to the descriptor struct. */
> >      }
> >    }
> 
> The point of this is to skip any argument-packing work generated by the
> compiler, which would be inside the "if unlikely" block, right?

Partly, but it is sadly not guaranteed by GCC.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40207
The main speed win would be not tripping over the "probe trigger".
And the advantage to have a generalized approach to being able to check
whether a probe is enabled or not.

> > Certainly warrants a try and benchmark.
> 
> I think this is a lot like some things Mathieu already experimented with
> and measured in the kernel context.  I think he pursued a code-patching
> flavor that patched an immediate operand because that was measured as
> faster than having the actual extra load of a simple enabled_p variable.

That sounds like what I proposed in
http://sourceware.org/bugzilla/show_bug.cgi?id=10013#c2
the disadvantage is that it needs some code-patching magic. The
advantage of the above approach is that it wouldn't need anything not
already in the kernel mainline, just tricking gcc and the preprocessor
to setup things like we would want.

> > BTW. For storing changeable variables the .probes section should become
> > alloc, rw now always (it currently is only for relocatable objects). 
> 
> It doesn't make sense that it should differ in relocatable objects.
> I don't understand that.

The .probes section stores the addresses of the generated labels that
are used to find the probe addresses. This isn't a problem for an
executable that isn't relocatable. But it is for shared libraries which
has relocatable addresses (if you have selinux memory protection turned
on). In that case the section has to be writable for the linker. See
http://sourceware.org/bugzilla/show_bug.cgi?id=10381

Cheers,

Mark

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-23 10:28         ` Mark Wielaard
@ 2009-07-23 14:40           ` Frank Ch. Eigler
  2009-07-23 19:33           ` Roland McGrath
  1 sibling, 0 replies; 16+ messages in thread
From: Frank Ch. Eigler @ 2009-07-23 14:40 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Roland McGrath, Stan Cox, systemtap


Mark Wielaard <mjw@redhat.com> writes:

> [...]
>> The point of this is to skip any argument-packing work generated by the
>> compiler, which would be inside the "if unlikely" block, right?

(Yes.)

> Partly, but it is sadly not guaranteed by GCC.
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40207

Note that this was for an implementation that relies on static inline
functions.  Pure macros are OK.

>> I think this is a lot like some things Mathieu already experimented with
>> and measured in the kernel context.  I think he pursued a code-patching
>> flavor that patched an immediate operand because that was measured as
>> faster than having the actual extra load of a simple enabled_p variable.

(Yes, and other tracing widgets from other OS's have discovered the
same thing.  Of course that costs per-arch complications instead, thus
the simpler proposal as a first step.)


- FChE

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: new static user probe types
  2009-07-23 10:28         ` Mark Wielaard
  2009-07-23 14:40           ` Frank Ch. Eigler
@ 2009-07-23 19:33           ` Roland McGrath
  1 sibling, 0 replies; 16+ messages in thread
From: Roland McGrath @ 2009-07-23 19:33 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: Stan Cox, systemtap

> The .probes section stores the addresses of the generated labels that
> are used to find the probe addresses. This isn't a problem for an
> executable that isn't relocatable. But it is for shared libraries which
> has relocatable addresses (if you have selinux memory protection turned
> on). In that case the section has to be writable for the linker. See
> http://sourceware.org/bugzilla/show_bug.cgi?id=10381

Bah!  I'm sure I already told you about doing this right!  The .probes
section should contain relative addresses fixed at compile time, just like
every other kind of static annotation (and code itself) is made PIC-friendly.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-07-29 15:51 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-26 21:26 new static user probe types Stan Cox
2009-07-15 16:19 ` Stan Cox
2009-07-15 18:39   ` Josh Stone
2009-07-15 20:47     ` Stan Cox
2009-07-15 21:57       ` Josh Stone
2009-07-16 13:44         ` Stan Cox
2009-07-20 18:34   ` Stan Cox
2009-07-22 10:42     ` Mark Wielaard
2009-07-22 14:39       ` Frank Ch. Eigler
2009-07-22 17:10         ` Mark Wielaard
2009-07-29 15:44           ` Stan Cox
2009-07-29 15:51             ` Stan Cox
2009-07-23  3:07       ` Roland McGrath
2009-07-23 10:28         ` Mark Wielaard
2009-07-23 14:40           ` Frank Ch. Eigler
2009-07-23 19:33           ` Roland McGrath

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).