public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* revamp sdt.h
@ 2010-08-03  3:07 Roland McGrath
  2010-08-09 15:16 ` Rayson Ho
  0 siblings, 1 reply; 15+ messages in thread
From: Roland McGrath @ 2010-08-03  3:07 UTC (permalink / raw)
  To: systemtap

[-- Attachment #1: message body text --]
[-- Type: text/plain, Size: 12955 bytes --]

I have spoken before about some of the shortcomings of the .probes
section format the sdt.h macros generate.  (I'm not really sure how
much I've written that in any postings here and how much it may have
been only in verbal grumblings in some unrecorded voice meetings.)
With Rayson's recent work, we've also noted the need to have sdt.h
macros that can work with hand-written assembly code.

So here is a first discussion draft of an entirely revamped set of
sdt.h macros and binary format they generate.  There is no conceptual
change here at all, it is just a new encoding of exactly the same
information as today's v2 sdt.h probes.  The only change to the
translator is the new binary format decoder.  Actually, that's not
true, but the changes are small and I'll explain them all in a moment.

Two files are attached at the end, which you should read or skim along
with my explanation here.  First is the core macro nest of the new
sdt.h, with some example macro uses.  The other file is a small
standalone C program based on libelf (elfutils >= 0.130) that decodes
the new binary format and prints out the probes.  That can serve as
the model for the new translator code.

The essential macros in this first draft are actually pretty complete
and usable (I didn't include all the sugar, just enough for examples).
The one thing they are not is friendly to -pedantic (unless used with
-std=c99).  They use variadic macros heavily and I'm not sure I could
have hashed this out without them and not become homicidal.  But
chances are we can rejigger the macro nest without them later if we
have to, with only slight dangers to life and limb of bystanders.
Anyway, not an issue for a discussion draft.

This version addresses these issues done poorly by the existing stuff,
some of which are purely about the macros and some of which are about
the format itself.

* can be used in assembly (.S) source files
* can be used inside inline asm statements in C source

Both of these matter for the places probes should go in libpthread functions.

* no data relocs

The old formats are non-starters for libc/libpthread, where the number
of dynamic relocs of any kind is very carefully tuned to keep the
startup cost on every program in the system as small as possible.

* minimal memory footprint of any kind

The cost is exactly one byte of rodata (rounded to alignment, so one
word at least) total in the final file, plus just the size of the nop
instruction itself times the number of probes.

These last two are achieved by putting the data into a non-allocated
ELF note.  This has some nice properties we get for free:
* no runtime cost, it's all fixed at link time and never in memory
* preserved in both stripped files and .debug files
It also has one new wrinkle we didn't have before (which is the flip
side of not having any dynamic relocs), which is that prelink won't
adjust its contents for address offsets.

One drawback of using a naked ELF note for every probe is that there
is a proportionally large per-probe overhead for the note headers.
But that just means something like another 20 bytes on top of the 16
or 24 you might have had for each probe, before counting the name
strings.  We could make a much more compact note format if we wanted
to rely on a link-time step.  But the absolute numbers involved in the
size of the notes are still pretty small (I think it's smaller per
probe than v2 .probes is, and it's just ELF file size instead of being
runtime memory footprint).  IMHO there is quite a lot to be said for
'#include <sys/sdt.h>' (and maybe later -lsdt, but now not even that)
being the sum total of extra fiddling to an existing build setup
needed to add static probes.

Ok, now it's time to look at new-sdt.h, attached below.  You can just
look at /* Example uses */ and below for the moment.  That file can be
compiled either as C or as assembly to show those examples in a binary.
(The assembly is for x86-64, though you can trivially change the operand
expressions to something that will work on another machine if you want
to see an example there.)

The scenario below is for building a DSO.  You could just as well drop
the -fPIC and -shared flags and create an executable instead.  I'm only
showing the one example because both are really just the same, and the
DSO case lets me illustrate the prelink issue.

	$ gcc -c -o s.o -xc new-sdt.h -O2  -fPIC
	$ gcc -c -o s2.o -Dfrob=diddle -Dmain=dummy -xc new-sdt.h -O2 -fPIC
	$ gcc -c -o s3.o -x assembler-with-cpp new-sdt.h -O2 -fPIC
	$ gcc -shared -o s.so s.o s2.o s3.o

Ok.  So now we compiled two objects from C sources and one from assembly
sources, and linked those together into a DSO (or executable).  The
different objects use overlapping sets of provider and probe names,
i.e. some probes have instances in two of the objects.

Now let's build the little decoder program:

	$ gcc -std=gnu99 -g sdt-extractor.c -o sdt-extractor -lelf

And now we can run it:

	$ ./sdt-extractor s.so
	0x5a0	libfoo.noargs              :
	0x5a1	libfoo.frob                -4@%edi 4@(%rsi)
	0x5a2	libfoo.diddle              8@%rsi -4@%edi
	0x5a3	libfoo.asm_noargs          
	0x5a4	libfoo.asmfrob             %edi %rax (%rsi)
	0x5af	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x5d0	libfoo.noargs              :
	0x5d1	libfoo.diddle              -4@%edi 4@(%rsi)
	0x5d2	libfoo.diddle              8@%rsi -4@%edi
	0x5d3	libfoo.asm_noargs          
	0x5d4	libfoo.asmfrob             %edi %rax (%rsi)
	0x5df	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x5e4	libfoo.noargs              :
	0x5e5	libfoo.frob                %rax, -20(%rbp)
	0x5e6	libfoo.diddle              (%rdi), %rax

As you can see, we have a probe address, a provider name, a probe name,
and an argument format string.  Don't worry yet about the argument
details.  I'll get to that after covering the prelink issue.

So, all this probe information is stored in the .note.stapsdt section,
which is not allocated data, has no relocs, and does not get touched by
prelink.  So the probe addresses stored in there at link time stay as
they started.  But, prelink might adjust the actual text addresses:

	$ prelink -r 0x1000000 s.so
	$ ./sdt-extractor s.so
	0x10005a0	libfoo.noargs              :
	0x10005a1	libfoo.frob                -4@%edi 4@(%rsi)
	0x10005a2	libfoo.diddle              8@%rsi -4@%edi
	0x10005a3	libfoo.asm_noargs          
	0x10005a4	libfoo.asmfrob             %edi %rax (%rsi)
	0x10005af	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x10005d0	libfoo.noargs              :
	0x10005d1	libfoo.diddle              -4@%edi 4@(%rsi)
	0x10005d2	libfoo.diddle              8@%rsi -4@%edi
	0x10005d3	libfoo.asm_noargs          
	0x10005d4	libfoo.asmfrob             %edi %rax (%rsi)
	0x10005df	libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
	0x10005e4	libfoo.noargs              :
	0x10005e5	libfoo.frob                %rax, -20(%rbp)
	0x10005e6	libfoo.diddle              (%rdi), %rax
	$

As you can see, everything is still correct: the probe addresses got the
prelink offset applied, and nothing else changed.  So how does this work?

It uses the .stapsdt.base section.  This is a special section we add to
the text.  All the .ifndef and comdat magic in the macro for this is
just there so that we only ever have one of these sections in a final
link and it's only ever one byte long.  Really it could be 0 bytes long,
but the linker swallows the section if we make it empty, so we pad it
with a byte (and alignment padding will usually mean that it consumes at
least one word in the binary's text segment).  Nothing about this
section itself matters, we just use it as a marker to detect prelink
address adjustments.

Each probe note records the link-time address of the .stapsdt.base
section alongside the probe PC address.  The decoder compares the base
address stored in the note with the .stapsdt.base section's sh_addr.
Initially these are the same, but the section header will be adjusted by
prelink.  So the decoder applies the difference to the probe PC address
to get the correct prelinked PC address.

I've put this magic into the macro and note format unconditionally, but
none of that is necessary for executables.  We could make it conditional
on #ifdef __PIC__.  But the cost (a word per note, plus the 1-byte
section of runtime rodata) seems small enough that it's nicer not to
bother with two variants of the format.

A library or application built using a custom linker script could
possibly remove, rename, or hide the .stapsdt.base section.  But that is
a rare thing to do (and even with some custom linker scripts, it may
well come through fine).  We do rely on the decoder in the translator
being able to find that section by name, but that is certainly no more
than the old .probes schemes relied on.


Now, some notes about the note format.

Note that the name of the notes section is not normative, and in a final
executable/DSO you might actually be looking at intermixed notes of
other kinds (follow the sdt-extractor.c example to consider all
appropriate sections and check all notes in them via gelf_getnote).

The ELF note format is variable-sized and includes a "vendor string" and
a type code.  Both the header and the "payload" after that are aligned
to 4 bytes within the section.

We're using the string "stapsdt" and that give us complete control of
the meaning of that (32-bit) type code (GElf_Nhdr.n_type).  So if we
want to have different flavors of probes, or different encoding formats,
now or in the future, we can encode all such selections in that type
code.  For this discussion draft, I'm using just one flavor (intended
for uprobes probes, i.e. a nop) and n_type=3 (for "sdt v3").

After the note header, the n_descsz bytes are:

	probe PC address (4 or 8 bytes)
	link-time sh_addr of .stapsdt.base section (4 or 8 bytes)
	provider name (null-terminated string)
	probe name (null-terminated string)
	argument format (null-terminated string)

Finally, I've made some changes to the v2 argument format string,
some trivial and one substantive.

* For no arguments, the string can be either "" or ":".
* Arguments can be separated by commas, whitespace, or both.

These differences are just for the convenience of writing the macro nest.

* Sized arguments.

In looking at the proposed libpthread probes using v2 sdt.h, I noticed
that adding the probes introduced not only the nop instructions
themselves, but some extra code before them to sign-extend or
zero-extend int arguments (on the hot path, even adding register
pressure!).  We really don't want that perturbation of the code
generation just for the common situation of having int-typed probe
parameters on a 64-bit machine.

So, these macros do not cast a probe argument to size_t as the existing
macros do.  Instead, they just make it an rvalue of int or wider (by
doing a plain + 0).  That coerces short (and bitfields and whatever) to
int, and coerces array references to pointers.  So arguments will still
wind up integers, and be either 32 or 64 bits (on a 32-bit machine,
there could be a 64-bit probe argument, which would be forced into a
memory operand).

This is encoded in the argument format string.  Each argument might
still be a plain assembly operand (from hand-written assembly), in which
case you should assume it's meant to be natural word size, or perhaps
the word size indicated by the register syntax (e.g. %eax or %r11d on
x86-64 mean the low 32 bits only).  But normally each argument will look
like "N@OP" where OP is the actual assembly operand, and N is one of:

	4	32 bits unsigned
	-4	32 bits signed
	8	64 bits unsigned
	-8	64 bits signed

The signedness doesn't really matter for 64 bits, though you could
potentially still use it to choose %d vs %u formatting for $parms$
and that sort of thing.  The -4@ notation tells you that you need to
extract it as 32 bits (low 32 of a register, or only address 4 bytes
if a memory access) and sign-extend it to 64 bits for a stap long.

This shifts the work of sign extension (when you want it) to the
translator/generated probe runtime code, rather than putting it into
the probed hot path code to be run even when no probes are in use.
With this, we can choose probe points and arguments carefully for
libpthread/libc and reasonably expect not to perturb the generated
code at all beyond the actual nop insertions.


I think I've explained everything.  
The discussion draft for sdt.h glosses over some trivial nits,
but I think I was pretty thorough about all the important nits.

The one thing I didn't mention is the semaphore option.  There isn't
one.  I can't tell what the story is with the semaphore these days, but
it looks like we're not really doing that any more.  If we want it in,
or even optionally in at compile time, then it is easy enough to add it
to these macros, and use new n_type values to indicate with vs without
variants of the note format.


Thanks,
Roland



[-- Attachment #2: sdt.h macro fragments with example probe uses --]
[-- Type: text/plain, Size: 5333 bytes --]

#ifdef __ASSEMBLER__
# define _SDT_PROBE(provider, name, arg_format, ...) \
  _SDT_ASM_BODY(provider, name, arg_format, __VA_ARGS__)
# define _SDT_ASM_1(...)		__VA_ARGS__;
# define _SDT_ASM_STRING_1(...)		.asciz #__VA_ARGS__;
# define _SDT_ASM_ARGS(format, ...)	_SDT_ASM_STRING_1(__VA_ARGS__)
# define _SDT_ARG(n, x)			x
#else
# define _SDT_PROBE(provider, name, arg_format, ...) __asm__ __volatile__ \
  (_SDT_ASM_BODY(provider, name, arg_format, :) :: __VA_ARGS__)
# define _SDT_ASM_1(...)		#__VA_ARGS__ "\n"
# define _SDT_ASM_STRING_1(...)		_SDT_ASM_1(.asciz #__VA_ARGS__)
# define _SDT_ASM_ARGS(format, ...)	_SDT_ASM_STRING_1(format)
# define _SDT_ARGFMT(n)			%c[_SDT_S##n]@_SDT_ARGTMPL(_SDT_A##n)
# define _SDT_ARG(n, x)			\
  [_SDT_S##n] "n" ((__builtin_constant_p ((x) + 0 < 0) ? 1 : -1) \
		   * (int) sizeof ((x) + 0)),		 \
  [_SDT_A##n] "nor" ((x) + 0)
#endif
#define _SDT_ASM(...)			_SDT_ASM_1(__VA_ARGS__)
#define _SDT_ASM_STRING(...)		_SDT_ASM_STRING_1(__VA_ARGS__)

#if defined __powerpc__ || defined __powerpc64__
# define _SDT_ARGTMPL(id)	%I[id]%[id]
#else
# define _SDT_ARGTMPL(id)	%[id]
#endif

#include <bits/wordsize.h>
#if __WORDSIZE == 64
# define _SDT_ASM_ADDR	.quad
#else
# define _SDT_ASM_ADDR	.long
#endif

#define _SDT_NOP	nop

#define _SDT_NOTE_NAME	"stapsdt"
#define _SDT_NOTE_TYPE	3

#define _SDT_ASM_BODY(provider, name, arg_format, ...)			      \
  _SDT_ASM(990:	_SDT_NOP)						      \
  _SDT_ASM(	.section .note.stapsdt,"","note")			      \
  _SDT_ASM(	.balign 4)						      \
  _SDT_ASM(	.int 992f-991f, 994f-993f, _SDT_NOTE_TYPE)		      \
  _SDT_ASM(991:	.asciz _SDT_NOTE_NAME)					      \
  _SDT_ASM(992:	.balign 4)						      \
  _SDT_ASM(993:	_SDT_ASM_ADDR 990b)					      \
  _SDT_ASM(	_SDT_ASM_ADDR _.stapsdt.base)				      \
  _SDT_ASM_STRING(provider)						      \
  _SDT_ASM_STRING(name)							      \
  _SDT_ASM_ARGS(arg_format, __VA_ARGS__)				      \
  _SDT_ASM(994:	.balign 4)						      \
  _SDT_ASM(	.previous)						      \
  _SDT_ASM(.ifndef _.stapsdt.base)					      \
  _SDT_ASM(	.section .stapsdt.base,"aG","progbits",.stapsdt.base,comdat)  \
  _SDT_ASM(	.weak _.stapsdt.base)					      \
  _SDT_ASM(	.hidden _.stapsdt.base)					      \
  _SDT_ASM(_.stapsdt.base: .space 1)					      \
  _SDT_ASM(	.size _.stapsdt.base, 1)				      \
  _SDT_ASM(	.previous)						      \
  _SDT_ASM(.endif)

#define PROBE0(provider, name) \
  _SDT_PROBE(provider, name, :, :)
#define PROBE1(provider, name, arg1) \
  _SDT_PROBE(provider, name, _SDT_ARGFMT(1), _SDT_ARG(1, arg1))
#define PROBE2(provider, name, arg1, arg2) \
  _SDT_PROBE(provider, name, _SDT_ARGFMT(1) _SDT_ARGFMT(2), \
	     _SDT_ARG(1, arg1), _SDT_ARG(2, arg2))

#define PROBE_ASM(provider, name, ...)		\
  _SDT_ASM_BODY(provider, name, __VA_ARGS__, :)
#define PROBE_ASM_TEMPLATE(n)		_SDT_ASM_TEMPLATE_##n
#define PROBE_ASM_OPERANDS(n, ...)	_SDT_ASM_OPERANDS_##n(__VA_ARGS__)
#define _SDT_ASM_TEMPLATE_0		:
#define _SDT_ASM_TEMPLATE_1		_SDT_ARGFMT(1)
#define _SDT_ASM_TEMPLATE_2		_SDT_ASM_TEMPLATE_1 _SDT_ARGFMT(2)
#define _SDT_ASM_TEMPLATE_3		_SDT_ASM_TEMPLATE_2 _SDT_ARGFMT(3)
#define _SDT_ASM_OPERANDS_0()		/* no operands */
#define _SDT_ASM_OPERANDS_1(arg1)	_SDT_ARG(1, arg1)
#define _SDT_ASM_OPERANDS_2(arg1, arg2)	_SDT_ARG(1, arg1), _SDT_ARG(2, arg2)
#define _SDT_ASM_OPERANDS_3(arg1, arg2, arg3)	\
  _SDT_ARG(1, arg1), _SDT_ARG(2, arg2), _SDT_ARG(3, arg3)


/* Example uses */

#define LIB libfoo    /* Probe do macros support indirecting the names.  */

#ifdef __ASSEMBLER__

#define ARG1 %rax
#define ARG2 -20(%rbp)

/* Here in an assembly source file, probes look just like in C source.
   The arguments are assembly operands that the sdt decoder can grok;
   e.g. constants might need to be marked, etc.  */
PROBE0(LIB, noargs)
PROBE2(LIB, frob, ARG1, ARG2)
PROBE2(LIB, diddle, (%rdi), %rax)

#else

struct bar { unsigned int baz; short int spaz; };

void frob (int foo, struct bar *bar)
{
  /* Plain C use is as before.  */
  PROBE0(LIB, noargs);
  PROBE2(LIB, frob, foo, bar->baz);
  PROBE2(LIB, diddle, bar, bar->spaz);

  /* Here's a use inside traditional inline asm.
     Note that GCC does not do %format handling in this case.  */
  __asm (PROBE_ASM(LIB, asm_noargs)
	 "# standalone asm: %0 et al not translated, no %% needed");

  /* Here's a use inside a fancy GCC asm using operands from C.
     Here the asm writer is choosing which assembly operands to
     tell sdt, just like writing a probe in an assembly source file.
     Note spaces with no commas between the operands.
     Those might or might not be substituted GCC %format thingies.  */
  __asm volatile ("# do something with %0\n"
		  PROBE_ASM(LIB, asmfrob, %0 %%rax %1)
		  "# do something with %1"
		  : : "r" (foo), "m" (bar->baz));

  /* Here's an asm use where the probe arguments are specified separately
     in C, so they behave just like a plain C probe would.  The
     PROBE_ASM_TEMPLATE(n) macro says we have n arguments from C.
     Then PROBE_ASM_OPERANDS(n, ...) can appear anywhere in the
     asm's list of input operands.  */
  const int fold[3] = { 1, 2, 3 };
  static int ugh[3] = { 1, 2, 3 }; /* array as arg demonstrates why + 0 */
  __asm volatile (PROBE_ASM(LIB, asmfrobarg, PROBE_ASM_TEMPLATE(3))
		  "# magic insn uses no operands"
		  : : PROBE_ASM_OPERANDS(3, bar[foo].baz, ugh, fold[1]));
}

int main () {}

#endif

[-- Attachment #3: reference C program using libelf to extract sdt.h probes from a linked binary --]
[-- Type: text/plain, Size: 3902 bytes --]

#define _SDT_NOTE_NAME	"stapsdt"
#define _SDT_NOTE_TYPE	3

#define _GNU_SOURCE
#include <gelf.h>
#include <fcntl.h>
#include <unistd.h>
#include <error.h>
#include <errno.h>
#include <string.h>
#include <inttypes.h>
#include <assert.h>
#include <stdio.h>

static void
handle_probe (Elf *elf, GElf_Addr base, int type, const char *data, size_t len)
{
  if (type != _SDT_NOTE_TYPE)
    {
      error (0, 0, "unknown %s n_type %u", _SDT_NOTE_NAME, type);
      return;
    }

  union
  {
    Elf64_Addr a64[2];
    Elf32_Addr a32[2];
  } buf;
  Elf_Data dst =
    {
      .d_type = ELF_T_ADDR, .d_version = EV_CURRENT,
      .d_buf = &buf, .d_size = gelf_fsize (elf, ELF_T_ADDR, 2, EV_CURRENT)
    };
  assert (dst.d_size <= sizeof buf);

  if (len < dst.d_size + 3)
    {
      error (0, 0, "short note");
      return;
    }

  Elf_Data src =
    {
      .d_type = ELF_T_ADDR, .d_version = EV_CURRENT,
      .d_buf = (void *) data, .d_size = dst.d_size
    };

  if (gelf_xlatetom (elf, &dst, &src,
		     elf_getident (elf, NULL)[EI_DATA]) == NULL)
    error (0, 0, "gelf_xlatetom: %s", elf_errmsg (-1));

  const char *provider = data + dst.d_size;
  const char *name = memchr (provider, '\0', data + len - provider);
  if (name == NULL)
    {
      error (0, 0, "corrupt probe");
      return;
    }

  ++name;
  const char *args = memchr (name, '\0', data + len - name);
  if (args++ == NULL ||
      memchr (args, '\0', data + len - name) != data + len - 1)
  if (name == NULL)
    {
      error (0, 0, "corrupt probe");
      return;
    }

  GElf_Addr pc;
  GElf_Addr base_ref;
  if (gelf_getclass (elf) == ELFCLASS32)
    {
      pc = buf.a32[0];
      base_ref = buf.a32[1];
    }
  else
    {
      pc = buf.a64[0];
      base_ref = buf.a64[1];
    }

  pc += base - base_ref;

  printf ("%#" PRIx64 "\t%s.%-20s%s\n", pc, provider, name, args);
}

static void
handle_notes (Elf *elf, Elf_Scn *scn, GElf_Addr base)
{
  if (base == (GElf_Addr) -1)
    {
      error (0, 0, "notes before base section");
      base = 0;
    }

  Elf_Data *data = elf_getdata (scn, NULL);
  size_t next;
  GElf_Nhdr nhdr;
  size_t name_off;
  size_t desc_off;
  for (size_t offset = 0;
       (next = gelf_getnote (data, offset, &nhdr, &name_off, &desc_off)) > 0;
       offset = next)
    if (nhdr.n_namesz == sizeof _SDT_NOTE_NAME
	&& !memcmp (data->d_buf + name_off,
		    _SDT_NOTE_NAME, sizeof _SDT_NOTE_NAME))
      handle_probe (elf, base,
		    nhdr.n_type, data->d_buf + desc_off, nhdr.n_descsz);
}

static void
handle_elf (Elf *elf)
{
  size_t shstrndx;
  if (elf_getshdrstrndx (elf, &shstrndx))
    {
      error (0, 0, "elf_getshdrstrndx: %s", elf_errmsg (-1));
      return;
    }

  GElf_Addr base = -1;

  Elf_Scn *scn = NULL;
  while ((scn = elf_nextscn (elf, scn)) != NULL)
    {
      GElf_Shdr shdr;
      if (gelf_getshdr (scn, &shdr) == NULL)
	{
	  error (0, 0, "elf_getshdr: %s", elf_errmsg (-1));
	  continue;
	}
      switch (shdr.sh_type)
	{
	case SHT_NOTE:
	  if (!(shdr.sh_flags & SHF_ALLOC))
	    handle_notes (elf, scn, base);
	  break;

	case SHT_PROGBITS:
	  if (base == (GElf_Addr) -1
	      && (shdr.sh_flags & SHF_ALLOC) && shdr.sh_name != 0)
	    {
	      const char *scn_name = elf_strptr (elf, shstrndx, shdr.sh_name);
	      if (scn_name != NULL && !strcmp (scn_name, ".stapsdt.base"))
		base = shdr.sh_addr;
	    }
	  break;
	}
    }
}

static void
handle_file (const char *file)
{
  int fd = open64 (file, O_RDONLY);
  if (fd < 0)
    error (0, errno, "%s", file);
  else
    {
      Elf *elf = elf_begin (fd, ELF_C_READ_MMAP_PRIVATE, NULL);
      if (elf == NULL)
	error (0, 0, "elf_begin: %s: %s", elf_errmsg (-1));
      else
	{
	  handle_elf (elf);
	  elf_end (elf);
	}
      close (fd);
    }
}

int
main (int argc, char **argv)
{
  elf_version (EV_CURRENT);

  for (int argi = 1; argi < argc; ++argi)
    handle_file (argv[argi]);

  return error_message_count > 0;
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-08-03  3:07 revamp sdt.h Roland McGrath
@ 2010-08-09 15:16 ` Rayson Ho
  2010-08-09 16:38   ` Roland McGrath
  0 siblings, 1 reply; 15+ messages in thread
From: Rayson Ho @ 2010-08-09 15:16 UTC (permalink / raw)
  To: systemtap

Thanks a lot Roland,

I have modified your sdt.h slightly and used it for the pthread
probes. Is there a special branch of systemtap or utrace that I need
to use in order to test/benchmark the overhead of the existance of the
new sdt probes in libpthread?

Rayson



On Mon, Aug 2, 2010 at 11:06 PM, Roland McGrath <roland@redhat.com> wrote:
> I have spoken before about some of the shortcomings of the .probes
> section format the sdt.h macros generate.  (I'm not really sure how
> much I've written that in any postings here and how much it may have
> been only in verbal grumblings in some unrecorded voice meetings.)
> With Rayson's recent work, we've also noted the need to have sdt.h
> macros that can work with hand-written assembly code.
>
> So here is a first discussion draft of an entirely revamped set of
> sdt.h macros and binary format they generate.  There is no conceptual
> change here at all, it is just a new encoding of exactly the same
> information as today's v2 sdt.h probes.  The only change to the
> translator is the new binary format decoder.  Actually, that's not
> true, but the changes are small and I'll explain them all in a moment.
>
> Two files are attached at the end, which you should read or skim along
> with my explanation here.  First is the core macro nest of the new
> sdt.h, with some example macro uses.  The other file is a small
> standalone C program based on libelf (elfutils >= 0.130) that decodes
> the new binary format and prints out the probes.  That can serve as
> the model for the new translator code.
>
> The essential macros in this first draft are actually pretty complete
> and usable (I didn't include all the sugar, just enough for examples).
> The one thing they are not is friendly to -pedantic (unless used with
> -std=c99).  They use variadic macros heavily and I'm not sure I could
> have hashed this out without them and not become homicidal.  But
> chances are we can rejigger the macro nest without them later if we
> have to, with only slight dangers to life and limb of bystanders.
> Anyway, not an issue for a discussion draft.
>
> This version addresses these issues done poorly by the existing stuff,
> some of which are purely about the macros and some of which are about
> the format itself.
>
> * can be used in assembly (.S) source files
> * can be used inside inline asm statements in C source
>
> Both of these matter for the places probes should go in libpthread functions.
>
> * no data relocs
>
> The old formats are non-starters for libc/libpthread, where the number
> of dynamic relocs of any kind is very carefully tuned to keep the
> startup cost on every program in the system as small as possible.
>
> * minimal memory footprint of any kind
>
> The cost is exactly one byte of rodata (rounded to alignment, so one
> word at least) total in the final file, plus just the size of the nop
> instruction itself times the number of probes.
>
> These last two are achieved by putting the data into a non-allocated
> ELF note.  This has some nice properties we get for free:
> * no runtime cost, it's all fixed at link time and never in memory
> * preserved in both stripped files and .debug files
> It also has one new wrinkle we didn't have before (which is the flip
> side of not having any dynamic relocs), which is that prelink won't
> adjust its contents for address offsets.
>
> One drawback of using a naked ELF note for every probe is that there
> is a proportionally large per-probe overhead for the note headers.
> But that just means something like another 20 bytes on top of the 16
> or 24 you might have had for each probe, before counting the name
> strings.  We could make a much more compact note format if we wanted
> to rely on a link-time step.  But the absolute numbers involved in the
> size of the notes are still pretty small (I think it's smaller per
> probe than v2 .probes is, and it's just ELF file size instead of being
> runtime memory footprint).  IMHO there is quite a lot to be said for
> '#include <sys/sdt.h>' (and maybe later -lsdt, but now not even that)
> being the sum total of extra fiddling to an existing build setup
> needed to add static probes.
>
> Ok, now it's time to look at new-sdt.h, attached below.  You can just
> look at /* Example uses */ and below for the moment.  That file can be
> compiled either as C or as assembly to show those examples in a binary.
> (The assembly is for x86-64, though you can trivially change the operand
> expressions to something that will work on another machine if you want
> to see an example there.)
>
> The scenario below is for building a DSO.  You could just as well drop
> the -fPIC and -shared flags and create an executable instead.  I'm only
> showing the one example because both are really just the same, and the
> DSO case lets me illustrate the prelink issue.
>
>        $ gcc -c -o s.o -xc new-sdt.h -O2  -fPIC
>        $ gcc -c -o s2.o -Dfrob=diddle -Dmain=dummy -xc new-sdt.h -O2 -fPIC
>        $ gcc -c -o s3.o -x assembler-with-cpp new-sdt.h -O2 -fPIC
>        $ gcc -shared -o s.so s.o s2.o s3.o
>
> Ok.  So now we compiled two objects from C sources and one from assembly
> sources, and linked those together into a DSO (or executable).  The
> different objects use overlapping sets of provider and probe names,
> i.e. some probes have instances in two of the objects.
>
> Now let's build the little decoder program:
>
>        $ gcc -std=gnu99 -g sdt-extractor.c -o sdt-extractor -lelf
>
> And now we can run it:
>
>        $ ./sdt-extractor s.so
>        0x5a0   libfoo.noargs              :
>        0x5a1   libfoo.frob                -4@%edi 4@(%rsi)
>        0x5a2   libfoo.diddle              8@%rsi -4@%edi
>        0x5a3   libfoo.asm_noargs
>        0x5a4   libfoo.asmfrob             %edi %rax (%rsi)
>        0x5af   libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
>        0x5d0   libfoo.noargs              :
>        0x5d1   libfoo.diddle              -4@%edi 4@(%rsi)
>        0x5d2   libfoo.diddle              8@%rsi -4@%edi
>        0x5d3   libfoo.asm_noargs
>        0x5d4   libfoo.asmfrob             %edi %rax (%rsi)
>        0x5df   libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
>        0x5e4   libfoo.noargs              :
>        0x5e5   libfoo.frob                %rax, -20(%rbp)
>        0x5e6   libfoo.diddle              (%rdi), %rax
>
> As you can see, we have a probe address, a provider name, a probe name,
> and an argument format string.  Don't worry yet about the argument
> details.  I'll get to that after covering the prelink issue.
>
> So, all this probe information is stored in the .note.stapsdt section,
> which is not allocated data, has no relocs, and does not get touched by
> prelink.  So the probe addresses stored in there at link time stay as
> they started.  But, prelink might adjust the actual text addresses:
>
>        $ prelink -r 0x1000000 s.so
>        $ ./sdt-extractor s.so
>        0x10005a0       libfoo.noargs              :
>        0x10005a1       libfoo.frob                -4@%edi 4@(%rsi)
>        0x10005a2       libfoo.diddle              8@%rsi -4@%edi
>        0x10005a3       libfoo.asm_noargs
>        0x10005a4       libfoo.asmfrob             %edi %rax (%rsi)
>        0x10005af       libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
>        0x10005d0       libfoo.noargs              :
>        0x10005d1       libfoo.diddle              -4@%edi 4@(%rsi)
>        0x10005d2       libfoo.diddle              8@%rsi -4@%edi
>        0x10005d3       libfoo.asm_noargs
>        0x10005d4       libfoo.asmfrob             %edi %rax (%rsi)
>        0x10005df       libfoo.asmfrobarg          4@(%rsi,%rdi,4) 8@%rax 4@$2
>        0x10005e4       libfoo.noargs              :
>        0x10005e5       libfoo.frob                %rax, -20(%rbp)
>        0x10005e6       libfoo.diddle              (%rdi), %rax
>        $
>
> As you can see, everything is still correct: the probe addresses got the
> prelink offset applied, and nothing else changed.  So how does this work?
>
> It uses the .stapsdt.base section.  This is a special section we add to
> the text.  All the .ifndef and comdat magic in the macro for this is
> just there so that we only ever have one of these sections in a final
> link and it's only ever one byte long.  Really it could be 0 bytes long,
> but the linker swallows the section if we make it empty, so we pad it
> with a byte (and alignment padding will usually mean that it consumes at
> least one word in the binary's text segment).  Nothing about this
> section itself matters, we just use it as a marker to detect prelink
> address adjustments.
>
> Each probe note records the link-time address of the .stapsdt.base
> section alongside the probe PC address.  The decoder compares the base
> address stored in the note with the .stapsdt.base section's sh_addr.
> Initially these are the same, but the section header will be adjusted by
> prelink.  So the decoder applies the difference to the probe PC address
> to get the correct prelinked PC address.
>
> I've put this magic into the macro and note format unconditionally, but
> none of that is necessary for executables.  We could make it conditional
> on #ifdef __PIC__.  But the cost (a word per note, plus the 1-byte
> section of runtime rodata) seems small enough that it's nicer not to
> bother with two variants of the format.
>
> A library or application built using a custom linker script could
> possibly remove, rename, or hide the .stapsdt.base section.  But that is
> a rare thing to do (and even with some custom linker scripts, it may
> well come through fine).  We do rely on the decoder in the translator
> being able to find that section by name, but that is certainly no more
> than the old .probes schemes relied on.
>
>
> Now, some notes about the note format.
>
> Note that the name of the notes section is not normative, and in a final
> executable/DSO you might actually be looking at intermixed notes of
> other kinds (follow the sdt-extractor.c example to consider all
> appropriate sections and check all notes in them via gelf_getnote).
>
> The ELF note format is variable-sized and includes a "vendor string" and
> a type code.  Both the header and the "payload" after that are aligned
> to 4 bytes within the section.
>
> We're using the string "stapsdt" and that give us complete control of
> the meaning of that (32-bit) type code (GElf_Nhdr.n_type).  So if we
> want to have different flavors of probes, or different encoding formats,
> now or in the future, we can encode all such selections in that type
> code.  For this discussion draft, I'm using just one flavor (intended
> for uprobes probes, i.e. a nop) and n_type=3 (for "sdt v3").
>
> After the note header, the n_descsz bytes are:
>
>        probe PC address (4 or 8 bytes)
>        link-time sh_addr of .stapsdt.base section (4 or 8 bytes)
>        provider name (null-terminated string)
>        probe name (null-terminated string)
>        argument format (null-terminated string)
>
> Finally, I've made some changes to the v2 argument format string,
> some trivial and one substantive.
>
> * For no arguments, the string can be either "" or ":".
> * Arguments can be separated by commas, whitespace, or both.
>
> These differences are just for the convenience of writing the macro nest.
>
> * Sized arguments.
>
> In looking at the proposed libpthread probes using v2 sdt.h, I noticed
> that adding the probes introduced not only the nop instructions
> themselves, but some extra code before them to sign-extend or
> zero-extend int arguments (on the hot path, even adding register
> pressure!).  We really don't want that perturbation of the code
> generation just for the common situation of having int-typed probe
> parameters on a 64-bit machine.
>
> So, these macros do not cast a probe argument to size_t as the existing
> macros do.  Instead, they just make it an rvalue of int or wider (by
> doing a plain + 0).  That coerces short (and bitfields and whatever) to
> int, and coerces array references to pointers.  So arguments will still
> wind up integers, and be either 32 or 64 bits (on a 32-bit machine,
> there could be a 64-bit probe argument, which would be forced into a
> memory operand).
>
> This is encoded in the argument format string.  Each argument might
> still be a plain assembly operand (from hand-written assembly), in which
> case you should assume it's meant to be natural word size, or perhaps
> the word size indicated by the register syntax (e.g. %eax or %r11d on
> x86-64 mean the low 32 bits only).  But normally each argument will look
> like "N@OP" where OP is the actual assembly operand, and N is one of:
>
>        4       32 bits unsigned
>        -4      32 bits signed
>        8       64 bits unsigned
>        -8      64 bits signed
>
> The signedness doesn't really matter for 64 bits, though you could
> potentially still use it to choose %d vs %u formatting for $parms$
> and that sort of thing.  The -4@ notation tells you that you need to
> extract it as 32 bits (low 32 of a register, or only address 4 bytes
> if a memory access) and sign-extend it to 64 bits for a stap long.
>
> This shifts the work of sign extension (when you want it) to the
> translator/generated probe runtime code, rather than putting it into
> the probed hot path code to be run even when no probes are in use.
> With this, we can choose probe points and arguments carefully for
> libpthread/libc and reasonably expect not to perturb the generated
> code at all beyond the actual nop insertions.
>
>
> I think I've explained everything.
> The discussion draft for sdt.h glosses over some trivial nits,
> but I think I was pretty thorough about all the important nits.
>
> The one thing I didn't mention is the semaphore option.  There isn't
> one.  I can't tell what the story is with the semaphore these days, but
> it looks like we're not really doing that any more.  If we want it in,
> or even optionally in at compile time, then it is easy enough to add it
> to these macros, and use new n_type values to indicate with vs without
> variants of the note format.
>
>
> Thanks,
> Roland
>
>
>
> #ifdef __ASSEMBLER__
> # define _SDT_PROBE(provider, name, arg_format, ...) \
>  _SDT_ASM_BODY(provider, name, arg_format, __VA_ARGS__)
> # define _SDT_ASM_1(...)                __VA_ARGS__;
> # define _SDT_ASM_STRING_1(...)         .asciz #__VA_ARGS__;
> # define _SDT_ASM_ARGS(format, ...)     _SDT_ASM_STRING_1(__VA_ARGS__)
> # define _SDT_ARG(n, x)                 x
> #else
> # define _SDT_PROBE(provider, name, arg_format, ...) __asm__ __volatile__ \
>  (_SDT_ASM_BODY(provider, name, arg_format, :) :: __VA_ARGS__)
> # define _SDT_ASM_1(...)                #__VA_ARGS__ "\n"
> # define _SDT_ASM_STRING_1(...)         _SDT_ASM_1(.asciz #__VA_ARGS__)
> # define _SDT_ASM_ARGS(format, ...)     _SDT_ASM_STRING_1(format)
> # define _SDT_ARGFMT(n)                 %c[_SDT_S##n]@_SDT_ARGTMPL(_SDT_A##n)
> # define _SDT_ARG(n, x)                 \
>  [_SDT_S##n] "n" ((__builtin_constant_p ((x) + 0 < 0) ? 1 : -1) \
>                   * (int) sizeof ((x) + 0)),            \
>  [_SDT_A##n] "nor" ((x) + 0)
> #endif
> #define _SDT_ASM(...)                   _SDT_ASM_1(__VA_ARGS__)
> #define _SDT_ASM_STRING(...)            _SDT_ASM_STRING_1(__VA_ARGS__)
>
> #if defined __powerpc__ || defined __powerpc64__
> # define _SDT_ARGTMPL(id)       %I[id]%[id]
> #else
> # define _SDT_ARGTMPL(id)       %[id]
> #endif
>
> #include <bits/wordsize.h>
> #if __WORDSIZE == 64
> # define _SDT_ASM_ADDR  .quad
> #else
> # define _SDT_ASM_ADDR  .long
> #endif
>
> #define _SDT_NOP        nop
>
> #define _SDT_NOTE_NAME  "stapsdt"
> #define _SDT_NOTE_TYPE  3
>
> #define _SDT_ASM_BODY(provider, name, arg_format, ...)                        \
>  _SDT_ASM(990: _SDT_NOP)                                                     \
>  _SDT_ASM(     .section .note.stapsdt,"","note")                             \
>  _SDT_ASM(     .balign 4)                                                    \
>  _SDT_ASM(     .int 992f-991f, 994f-993f, _SDT_NOTE_TYPE)                    \
>  _SDT_ASM(991: .asciz _SDT_NOTE_NAME)                                        \
>  _SDT_ASM(992: .balign 4)                                                    \
>  _SDT_ASM(993: _SDT_ASM_ADDR 990b)                                           \
>  _SDT_ASM(     _SDT_ASM_ADDR _.stapsdt.base)                                 \
>  _SDT_ASM_STRING(provider)                                                   \
>  _SDT_ASM_STRING(name)                                                       \
>  _SDT_ASM_ARGS(arg_format, __VA_ARGS__)                                      \
>  _SDT_ASM(994: .balign 4)                                                    \
>  _SDT_ASM(     .previous)                                                    \
>  _SDT_ASM(.ifndef _.stapsdt.base)                                            \
>  _SDT_ASM(     .section .stapsdt.base,"aG","progbits",.stapsdt.base,comdat)  \
>  _SDT_ASM(     .weak _.stapsdt.base)                                         \
>  _SDT_ASM(     .hidden _.stapsdt.base)                                       \
>  _SDT_ASM(_.stapsdt.base: .space 1)                                          \
>  _SDT_ASM(     .size _.stapsdt.base, 1)                                      \
>  _SDT_ASM(     .previous)                                                    \
>  _SDT_ASM(.endif)
>
> #define PROBE0(provider, name) \
>  _SDT_PROBE(provider, name, :, :)
> #define PROBE1(provider, name, arg1) \
>  _SDT_PROBE(provider, name, _SDT_ARGFMT(1), _SDT_ARG(1, arg1))
> #define PROBE2(provider, name, arg1, arg2) \
>  _SDT_PROBE(provider, name, _SDT_ARGFMT(1) _SDT_ARGFMT(2), \
>             _SDT_ARG(1, arg1), _SDT_ARG(2, arg2))
>
> #define PROBE_ASM(provider, name, ...)          \
>  _SDT_ASM_BODY(provider, name, __VA_ARGS__, :)
> #define PROBE_ASM_TEMPLATE(n)           _SDT_ASM_TEMPLATE_##n
> #define PROBE_ASM_OPERANDS(n, ...)      _SDT_ASM_OPERANDS_##n(__VA_ARGS__)
> #define _SDT_ASM_TEMPLATE_0             :
> #define _SDT_ASM_TEMPLATE_1             _SDT_ARGFMT(1)
> #define _SDT_ASM_TEMPLATE_2             _SDT_ASM_TEMPLATE_1 _SDT_ARGFMT(2)
> #define _SDT_ASM_TEMPLATE_3             _SDT_ASM_TEMPLATE_2 _SDT_ARGFMT(3)
> #define _SDT_ASM_OPERANDS_0()           /* no operands */
> #define _SDT_ASM_OPERANDS_1(arg1)       _SDT_ARG(1, arg1)
> #define _SDT_ASM_OPERANDS_2(arg1, arg2) _SDT_ARG(1, arg1), _SDT_ARG(2, arg2)
> #define _SDT_ASM_OPERANDS_3(arg1, arg2, arg3)   \
>  _SDT_ARG(1, arg1), _SDT_ARG(2, arg2), _SDT_ARG(3, arg3)
>
>
> /* Example uses */
>
> #define LIB libfoo    /* Probe do macros support indirecting the names.  */
>
> #ifdef __ASSEMBLER__
>
> #define ARG1 %rax
> #define ARG2 -20(%rbp)
>
> /* Here in an assembly source file, probes look just like in C source.
>   The arguments are assembly operands that the sdt decoder can grok;
>   e.g. constants might need to be marked, etc.  */
> PROBE0(LIB, noargs)
> PROBE2(LIB, frob, ARG1, ARG2)
> PROBE2(LIB, diddle, (%rdi), %rax)
>
> #else
>
> struct bar { unsigned int baz; short int spaz; };
>
> void frob (int foo, struct bar *bar)
> {
>  /* Plain C use is as before.  */
>  PROBE0(LIB, noargs);
>  PROBE2(LIB, frob, foo, bar->baz);
>  PROBE2(LIB, diddle, bar, bar->spaz);
>
>  /* Here's a use inside traditional inline asm.
>     Note that GCC does not do %format handling in this case.  */
>  __asm (PROBE_ASM(LIB, asm_noargs)
>         "# standalone asm: %0 et al not translated, no %% needed");
>
>  /* Here's a use inside a fancy GCC asm using operands from C.
>     Here the asm writer is choosing which assembly operands to
>     tell sdt, just like writing a probe in an assembly source file.
>     Note spaces with no commas between the operands.
>     Those might or might not be substituted GCC %format thingies.  */
>  __asm volatile ("# do something with %0\n"
>                  PROBE_ASM(LIB, asmfrob, %0 %%rax %1)
>                  "# do something with %1"
>                  : : "r" (foo), "m" (bar->baz));
>
>  /* Here's an asm use where the probe arguments are specified separately
>     in C, so they behave just like a plain C probe would.  The
>     PROBE_ASM_TEMPLATE(n) macro says we have n arguments from C.
>     Then PROBE_ASM_OPERANDS(n, ...) can appear anywhere in the
>     asm's list of input operands.  */
>  const int fold[3] = { 1, 2, 3 };
>  static int ugh[3] = { 1, 2, 3 }; /* array as arg demonstrates why + 0 */
>  __asm volatile (PROBE_ASM(LIB, asmfrobarg, PROBE_ASM_TEMPLATE(3))
>                  "# magic insn uses no operands"
>                  : : PROBE_ASM_OPERANDS(3, bar[foo].baz, ugh, fold[1]));
> }
>
> int main () {}
>
> #endif
>
> #define _SDT_NOTE_NAME  "stapsdt"
> #define _SDT_NOTE_TYPE  3
>
> #define _GNU_SOURCE
> #include <gelf.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <error.h>
> #include <errno.h>
> #include <string.h>
> #include <inttypes.h>
> #include <assert.h>
> #include <stdio.h>
>
> static void
> handle_probe (Elf *elf, GElf_Addr base, int type, const char *data, size_t len)
> {
>  if (type != _SDT_NOTE_TYPE)
>    {
>      error (0, 0, "unknown %s n_type %u", _SDT_NOTE_NAME, type);
>      return;
>    }
>
>  union
>  {
>    Elf64_Addr a64[2];
>    Elf32_Addr a32[2];
>  } buf;
>  Elf_Data dst =
>    {
>      .d_type = ELF_T_ADDR, .d_version = EV_CURRENT,
>      .d_buf = &buf, .d_size = gelf_fsize (elf, ELF_T_ADDR, 2, EV_CURRENT)
>    };
>  assert (dst.d_size <= sizeof buf);
>
>  if (len < dst.d_size + 3)
>    {
>      error (0, 0, "short note");
>      return;
>    }
>
>  Elf_Data src =
>    {
>      .d_type = ELF_T_ADDR, .d_version = EV_CURRENT,
>      .d_buf = (void *) data, .d_size = dst.d_size
>    };
>
>  if (gelf_xlatetom (elf, &dst, &src,
>                     elf_getident (elf, NULL)[EI_DATA]) == NULL)
>    error (0, 0, "gelf_xlatetom: %s", elf_errmsg (-1));
>
>  const char *provider = data + dst.d_size;
>  const char *name = memchr (provider, '\0', data + len - provider);
>  if (name == NULL)
>    {
>      error (0, 0, "corrupt probe");
>      return;
>    }
>
>  ++name;
>  const char *args = memchr (name, '\0', data + len - name);
>  if (args++ == NULL ||
>      memchr (args, '\0', data + len - name) != data + len - 1)
>  if (name == NULL)
>    {
>      error (0, 0, "corrupt probe");
>      return;
>    }
>
>  GElf_Addr pc;
>  GElf_Addr base_ref;
>  if (gelf_getclass (elf) == ELFCLASS32)
>    {
>      pc = buf.a32[0];
>      base_ref = buf.a32[1];
>    }
>  else
>    {
>      pc = buf.a64[0];
>      base_ref = buf.a64[1];
>    }
>
>  pc += base - base_ref;
>
>  printf ("%#" PRIx64 "\t%s.%-20s%s\n", pc, provider, name, args);
> }
>
> static void
> handle_notes (Elf *elf, Elf_Scn *scn, GElf_Addr base)
> {
>  if (base == (GElf_Addr) -1)
>    {
>      error (0, 0, "notes before base section");
>      base = 0;
>    }
>
>  Elf_Data *data = elf_getdata (scn, NULL);
>  size_t next;
>  GElf_Nhdr nhdr;
>  size_t name_off;
>  size_t desc_off;
>  for (size_t offset = 0;
>       (next = gelf_getnote (data, offset, &nhdr, &name_off, &desc_off)) > 0;
>       offset = next)
>    if (nhdr.n_namesz == sizeof _SDT_NOTE_NAME
>        && !memcmp (data->d_buf + name_off,
>                    _SDT_NOTE_NAME, sizeof _SDT_NOTE_NAME))
>      handle_probe (elf, base,
>                    nhdr.n_type, data->d_buf + desc_off, nhdr.n_descsz);
> }
>
> static void
> handle_elf (Elf *elf)
> {
>  size_t shstrndx;
>  if (elf_getshdrstrndx (elf, &shstrndx))
>    {
>      error (0, 0, "elf_getshdrstrndx: %s", elf_errmsg (-1));
>      return;
>    }
>
>  GElf_Addr base = -1;
>
>  Elf_Scn *scn = NULL;
>  while ((scn = elf_nextscn (elf, scn)) != NULL)
>    {
>      GElf_Shdr shdr;
>      if (gelf_getshdr (scn, &shdr) == NULL)
>        {
>          error (0, 0, "elf_getshdr: %s", elf_errmsg (-1));
>          continue;
>        }
>      switch (shdr.sh_type)
>        {
>        case SHT_NOTE:
>          if (!(shdr.sh_flags & SHF_ALLOC))
>            handle_notes (elf, scn, base);
>          break;
>
>        case SHT_PROGBITS:
>          if (base == (GElf_Addr) -1
>              && (shdr.sh_flags & SHF_ALLOC) && shdr.sh_name != 0)
>            {
>              const char *scn_name = elf_strptr (elf, shstrndx, shdr.sh_name);
>              if (scn_name != NULL && !strcmp (scn_name, ".stapsdt.base"))
>                base = shdr.sh_addr;
>            }
>          break;
>        }
>    }
> }
>
> static void
> handle_file (const char *file)
> {
>  int fd = open64 (file, O_RDONLY);
>  if (fd < 0)
>    error (0, errno, "%s", file);
>  else
>    {
>      Elf *elf = elf_begin (fd, ELF_C_READ_MMAP_PRIVATE, NULL);
>      if (elf == NULL)
>        error (0, 0, "elf_begin: %s: %s", elf_errmsg (-1));
>      else
>        {
>          handle_elf (elf);
>          elf_end (elf);
>        }
>      close (fd);
>    }
> }
>
> int
> main (int argc, char **argv)
> {
>  elf_version (EV_CURRENT);
>
>  for (int argi = 1; argi < argc; ++argi)
>    handle_file (argv[argi]);
>
>  return error_message_count > 0;
> }
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-08-09 15:16 ` Rayson Ho
@ 2010-08-09 16:38   ` Roland McGrath
  2010-08-11 13:11     ` Rayson Ho
  2010-09-03 20:36     ` revamp sdt.h Stan Cox
  0 siblings, 2 replies; 15+ messages in thread
From: Roland McGrath @ 2010-08-09 16:38 UTC (permalink / raw)
  To: Rayson Ho; +Cc: systemtap

> I have modified your sdt.h slightly and used it for the pthread
> probes. Is there a special branch of systemtap or utrace that I need
> to use in order to test/benchmark the overhead of the existance of the
> new sdt probes in libpthread?

The translator work has yet to be done.  So at the moment all you could
test is having the probes in there statically (i.e. addition of nop
instructions) and not using them.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-08-09 16:38   ` Roland McGrath
@ 2010-08-11 13:11     ` Rayson Ho
  2010-09-08  0:44       ` Roland McGrath
  2010-09-29 17:18       ` Adding systemtap probe points in pthread library (was: Re: revamp sdt.h) Rayson Ho
  2010-09-03 20:36     ` revamp sdt.h Stan Cox
  1 sibling, 2 replies; 15+ messages in thread
From: Rayson Ho @ 2010-08-11 13:11 UTC (permalink / raw)
  To: systemtap

Thanks Roland,

I've ported my pthread probe changes against your systemtap glibc tree.
I think the interface of the new sdt.h and that of the LIBC_PROBE()
marco is almost the identical, but I am not sure if we are going to have
macros like PROBE2() in the new-sdt -- not an issue at all as it is just
a handy macro.

I've inline-attached the diff, and the biggest change is that
"pthread_probe.h" used with the old sdt.h is now gone.

I have not attached the diffs for the probes in the inline assembly code
yet, I will do that soon.

Rayson



diff --git a/nptl/pthread_join.c b/nptl/pthread_join.c
index 6a87a8b..9f67a58 100644
--- a/nptl/pthread_join.c
+++ b/nptl/pthread_join.c
@@ -23,6 +23,8 @@
 #include <atomic.h>
 #include "pthreadP.h"
 
+#include <stap-probe.h>
+
 
 static void
 cleanup (void *arg)
@@ -55,6 +57,8 @@ pthread_join (threadid, thread_return)
   struct pthread *self = THREAD_SELF;
   int result = 0;
 
+  LIBC_PROBE(pthread_join, 1, threadid);
+
   /* During the wait we change to asynchronous cancellation.  If we
      are canceled the thread we are waiting for must be marked as
      un-wait-ed for again.  */
@@ -110,5 +114,7 @@ pthread_join (threadid, thread_return)
       __free_tcb (pd);
     }
 
+  LIBC_PROBE(pthread_join_ret, 2, threadid, result);  
+
   return result;
 }
diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
index e2c9f8a..dd690cd 100644
--- a/nptl/pthread_mutex_destroy.c
+++ b/nptl/pthread_mutex_destroy.c
@@ -20,6 +20,7 @@
 #include <errno.h>
 #include "pthreadP.h"
 
+#include <stap-probe.h>
 
 int
 __pthread_mutex_destroy (mutex)
@@ -32,6 +33,8 @@ __pthread_mutex_destroy (mutex)
   /* Set to an invalid value.  */
   mutex->__data.__kind = -1;
 
+  LIBC_PROBE(pthread_mutex_destroy, 1, mutex);
+
   return 0;
 }
 strong_alias (__pthread_mutex_destroy, pthread_mutex_destroy)
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index d9b1ef0..e4fcae7 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -24,6 +24,8 @@
 #include <kernel-features.h>
 #include "pthreadP.h"
 
+#include <stap-probe.h>
+
 static const struct pthread_mutexattr default_attr =
   {
     /* Default is a normal mutex, not shared between processes.  */
@@ -47,6 +49,8 @@ __pthread_mutex_init (mutex, mutexattr)
 
   imutexattr = (const struct pthread_mutexattr *) mutexattr ?:
&default_attr;
 
+  LIBC_PROBE(pthread_mutex_init, 2, mutex, mutexattr);
+
   /* Sanity checks.  */
   switch (__builtin_expect (imutexattr->mutexkind
 			    & PTHREAD_MUTEXATTR_PROTOCOL_MASK,
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 50dc188..b754372 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -25,6 +25,7 @@
 #include "pthreadP.h"
 #include <lowlevellock.h>
 
+#include <stap-probe.h>
 
 #ifndef LLL_MUTEX_LOCK
 # define LLL_MUTEX_LOCK(mutex) \
@@ -48,6 +49,10 @@ __pthread_mutex_lock (mutex)
   assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
 
   unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
+
+  /* systemtap marker */
+  LIBC_PROBE(pthread_mutex_lock, 1, mutex);
+
   if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
     return __pthread_mutex_lock_full (mutex);
 
@@ -60,6 +65,8 @@ __pthread_mutex_lock (mutex)
       /* Normal mutex.  */
       LLL_MUTEX_LOCK (mutex);
       assert (mutex->__data.__owner == 0);
+
+      LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
     {
@@ -75,6 +82,11 @@ __pthread_mutex_lock (mutex)
 
 	  ++mutex->__data.__count;
 
+          /* currently, the systemtap pthread probe does not have a */
+          /* probe point here because the thread already owns this */
+          /* recursive lock before the call to this function. */
+          /* this might change in the future */
+
 	  return 0;
 	}
 
@@ -83,6 +95,8 @@ __pthread_mutex_lock (mutex)
 
       assert (mutex->__data.__owner == 0);
       mutex->__data.__count = 1;
+
+      LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
@@ -94,6 +108,7 @@ __pthread_mutex_lock (mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
 			     mutex->__data.__spins * 2 + 10);
+
 	  do
 	    {
 	      if (cnt++ >= max_cnt)
@@ -108,6 +123,8 @@ __pthread_mutex_lock (mutex)
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
+          LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
+
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       assert (mutex->__data.__owner == 0);
@@ -127,6 +144,8 @@ __pthread_mutex_lock (mutex)
   ++mutex->__data.__nusers;
 #endif
 
+  LIBC_PROBE(pthread_mutex_lock_acquire, 1, mutex);
+
   return 0;
 }
 
@@ -277,6 +296,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 		++mutex->__data.__count;
 
+                /* currently, the systemtap pthread probe does not have
a */
+                /* probe point here because the thread already owns
this */
+                /* recursive lock before the call to this function. */
+                /* this might change in the future */
 		return 0;
 	      }
 	  }
@@ -393,6 +416,11 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 		  /* Overflow of the counter.  */
 		  return EAGAIN;
 
+                 /* currently, the systemtap pthread probe does not
have a */
+                 /* probe point here because the thread already owns
this */
+                 /* recursive lock before the call to this function. */
+                 /* this might change in the future */
+
 		++mutex->__data.__count;
 
 		return 0;
@@ -451,6 +479,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  }
 	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
 
+        LIBC_PROBE(pthread_mutex_lock_full_block, 1, mutex);
+
 	assert (mutex->__data.__owner == 0);
 	mutex->__data.__count = 1;
       }
@@ -467,6 +497,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
   ++mutex->__data.__nusers;
 #endif
 
+  LIBC_PROBE(pthread_mutex_lock_full_acquire, 1, mutex);
+
   return 0;
 }
 #ifndef __pthread_mutex_lock
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index f9fe10b..e0f305e 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -23,6 +23,8 @@
 #include "pthreadP.h"
 #include <lowlevellock.h>
 
+#include <stap-probe.h>
+
 static int
 internal_function
 __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
@@ -50,6 +52,7 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
 
       /* Unlock.  */
       lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
+      LIBC_PROBE(pthread_mutex_release, 1, mutex);
       return 0;
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
@@ -60,6 +63,10 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+        
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto normal;
     }
@@ -104,6 +111,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
 
       goto robust;
@@ -149,6 +160,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto continue_pi_non_robust;
 
@@ -171,6 +186,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
 
       goto continue_pi_robust;
@@ -237,6 +256,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto pp;
 
@@ -272,6 +295,9 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
int decr)
 			PTHREAD_MUTEX_PSHARED (mutex));
 
       int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
+
+      LIBC_PROBE(pthread_mutex_release, 1, mutex);
+
       return __pthread_tpp_change_priority (oldprio, -1);
 
     default:
@@ -279,6 +305,8 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
int decr)
       return EINVAL;
     }
 
+  LIBC_PROBE(pthread_mutex_release, 1, mutex);
+
   return 0;
 }
 
diff --git a/nptl/pthread_rwlock_destroy.c
b/nptl/pthread_rwlock_destroy.c
index 28fd24b..b14de8f 100644
--- a/nptl/pthread_rwlock_destroy.c
+++ b/nptl/pthread_rwlock_destroy.c
@@ -19,12 +19,14 @@
 
 #include "pthreadP.h"
 
+#include <stap-probe.h>
 
 int
 __pthread_rwlock_destroy (rwlock)
      pthread_rwlock_t *rwlock;
 {
   /* Nothing to be done.  For now.  */
+  LIBC_PROBE(pthread_rwlock_destroy, 1, rwlock);
   return 0;
 }
 strong_alias (__pthread_rwlock_destroy, pthread_rwlock_destroy)
diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
index 31eb508..7b4d8f0 100644
--- a/nptl/pthread_rwlock_rdlock.c
+++ b/nptl/pthread_rwlock_rdlock.c
@@ -23,6 +23,8 @@
 #include <pthread.h>
 #include <pthreadP.h>
 
+#include <stap-probe.h>
+
 
 /* Acquire read lock for RWLOCK.  */
 int
@@ -31,6 +33,8 @@ __pthread_rwlock_rdlock (rwlock)
 {
   int result = 0;
 
+  LIBC(pthread_rwlock_rdlock, 1, rwlock);
+
   /* Make sure we are along.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
@@ -49,6 +53,13 @@ __pthread_rwlock_rdlock (rwlock)
 	      --rwlock->__data.__nr_readers;
 	      result = EAGAIN;
 	    }
+          else
+            {
+              /* systemtap pthread probe - this is the only place where
*/
+              /* we get this read-write lock */
+              LIBC_PROBE(pthread_rwlock_rdlock, 1, rwlock);
+            }
+
 
 	  break;
 	}
diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
index a7ef71a..ba9620b 100644
--- a/nptl/pthread_rwlock_unlock.c
+++ b/nptl/pthread_rwlock_unlock.c
@@ -23,10 +23,14 @@
 #include <pthread.h>
 #include <pthreadP.h>
 
+#include <stap-probe.h>
+
 /* Unlock RWLOCK.  */
 int
 __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
 {
+  LIBC_PROBE(pthread_rwlock_unlock, 1, rwlock);
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
   if (rwlock->__data.__writer)
     rwlock->__data.__writer = 0;
diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
index 64fe970..09b9454 100644
--- a/nptl/pthread_rwlock_wrlock.c
+++ b/nptl/pthread_rwlock_wrlock.c
@@ -23,6 +23,7 @@
 #include <pthread.h>
 #include <pthreadP.h>
 
+#include <stap-probe.h>
 
 /* Acquire write lock for RWLOCK.  */
 int
@@ -31,6 +32,8 @@ __pthread_rwlock_wrlock (rwlock)
 {
   int result = 0;
 
+  LIBC_PROBE(pthread_rwlock_wrlock, 1, rwlock);
+
   /* Make sure we are along.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
@@ -41,6 +44,11 @@ __pthread_rwlock_wrlock (rwlock)
 	{
 	  /* Mark self as writer.  */
 	  rwlock->__data.__writer = THREAD_GETMEM (THREAD_SELF, tid);
+
+          /* systemtap pthread probe - this is the only place where we
can */
+          /* get this read-write lock. */
+          LIBC_PROBE(pthread_rwlock_wrlock_acquire, 1, rwlock);
+
 	  break;
 	}
 


On Mon, 2010-08-09 at 09:37 -0700, Roland McGrath wrote:
> > I have modified your sdt.h slightly and used it for the pthread
> > probes. Is there a special branch of systemtap or utrace that I need
> > to use in order to test/benchmark the overhead of the existance of the
> > new sdt probes in libpthread?
> 
> The translator work has yet to be done.  So at the moment all you could
> test is having the probes in there statically (i.e. addition of nop
> instructions) and not using them.
> 
> 
> Thanks,
> Roland


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-08-09 16:38   ` Roland McGrath
  2010-08-11 13:11     ` Rayson Ho
@ 2010-09-03 20:36     ` Stan Cox
  2010-09-03 20:38       ` Roland McGrath
  1 sibling, 1 reply; 15+ messages in thread
From: Stan Cox @ 2010-09-03 20:36 UTC (permalink / raw)
  To: systemtap

On 08/09/2010 12:37 PM, Roland McGrath wrote:
> The translator work has yet to be done.

The roland/new-sdt branch now has a version of the translator that supports 
version 3 sdt.h.  The only regressions are a couple sdt.exp tests that are 
failing with -pedantic.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-03 20:36     ` revamp sdt.h Stan Cox
@ 2010-09-03 20:38       ` Roland McGrath
  2010-09-03 20:42         ` Stan Cox
  0 siblings, 1 reply; 15+ messages in thread
From: Roland McGrath @ 2010-09-03 20:38 UTC (permalink / raw)
  To: Stan Cox; +Cc: systemtap

> On 08/09/2010 12:37 PM, Roland McGrath wrote:
> > The translator work has yet to be done.
> 
> The roland/new-sdt branch now has a version of the translator that supports 
> version 3 sdt.h.  The only regressions are a couple sdt.exp tests that are 
> failing with -pedantic.

I'll look into those.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-03 20:38       ` Roland McGrath
@ 2010-09-03 20:42         ` Stan Cox
  2010-09-03 21:02           ` Roland McGrath
  0 siblings, 1 reply; 15+ messages in thread
From: Stan Cox @ 2010-09-03 20:42 UTC (permalink / raw)
  To: Roland McGrath, systemtap

On 09/03/2010 04:38 PM, Roland McGrath wrote:
>> On 08/09/2010 12:37 PM, Roland McGrath wrote:
>>> The translator work has yet to be done.
>>
>> The roland/new-sdt branch now has a version of the translator that supports
>> version 3 sdt.h.  The only regressions are a couple sdt.exp tests that are
>> failing with -pedantic.
>
> I'll look into those.

Thanks Roland.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-03 20:42         ` Stan Cox
@ 2010-09-03 21:02           ` Roland McGrath
  2010-09-07 15:08             ` Stan Cox
  0 siblings, 1 reply; 15+ messages in thread
From: Roland McGrath @ 2010-09-03 21:02 UTC (permalink / raw)
  To: Stan Cox; +Cc: systemtap

> On 09/03/2010 04:38 PM, Roland McGrath wrote:
> >> On 08/09/2010 12:37 PM, Roland McGrath wrote:
> >>> The translator work has yet to be done.
> >>
> >> The roland/new-sdt branch now has a version of the translator that supports
> >> version 3 sdt.h.  The only regressions are a couple sdt.exp tests that are
> >> failing with -pedantic.
> >
> > I'll look into those.
> 
> Thanks Roland.

Your systemtap.exp changes look all wrong to me.  
Please explain their rationale.

Your log entry fails to use correct file names for subdirectory files.

Your addition of _SDT_SEMAPHORE was cosmetically sloppy,
and also uses a symbol name that is not name space clean.
I thought we had abandoned the semaphore thing, or that 
someone would have answered me about that and asked me to
encode it right in the macros.


diff --git a/includes/sys/sdt.h b/includes/sys/sdt.h
index d9dc7cc..0000000 100644  
--- a/includes/sys/sdt.h
+++ b/includes/sys/sdt.h
@@ -77,6 +77,12 @@
 # define _SDT_ASM_AUTOGROUP ""
 #endif
 
+#ifdef _SDT_HAS_SEMAPHORES
+# define _SDT_SEMAPHORE(provider, name)	_sdt_semaphore_##provider##name
+#else
+# define _SDT_SEMAPHORE(provider, name)	0
+#endif
+
 #define _SDT_ASM_BODY(provider, name, pack_args, args)			      \
   _SDT_ASM_1(990:	_SDT_NOP)					      \
   _SDT_ASM_3(		.pushsection .note.stapsdt,_SDT_ASM_AUTOGROUP,"note") \
@@ -86,7 +92,7 @@
   _SDT_ASM_1(992:	.balign 4)					      \
   _SDT_ASM_1(993:	_SDT_ASM_ADDR 990b)				      \
   _SDT_ASM_1(		_SDT_ASM_ADDR _.stapsdt.base)			      \
-  _SDT_SEMAPHORE(provider,name)						      \
+  _SDT_ASM_1(		_SDT_ASM_ADDR _SDT_SEMAPHORE(provider, name))	      \
   _SDT_ASM_STRING(provider)						      \
   _SDT_ASM_STRING(name)							      \
   pack_args args							      \
@@ -102,12 +108,6 @@
   _SDT_ASM_1(		.popsection)					      \
   _SDT_ASM_1(.endif)
 
-#if defined _SDT_HAS_SEMAPHORES
-#define _SDT_SEMAPHORE(p,n) _SDT_ASM_1(		_SDT_ASM_ADDR p##_##n##_semaphore)
-#else
-#define _SDT_SEMAPHORE(p,n) _SDT_ASM_1(		_SDT_ASM_ADDR 0)
-#endif
-
 #define _SDT_ASM_TEMPLATE_0		/* no arguments */
 #define _SDT_ASM_TEMPLATE_1		_SDT_ARGFMT(1)
 #define _SDT_ASM_TEMPLATE_2		_SDT_ASM_TEMPLATE_1 _SDT_ARGFMT(2)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-03 21:02           ` Roland McGrath
@ 2010-09-07 15:08             ` Stan Cox
  2010-09-08  0:27               ` Roland McGrath
  0 siblings, 1 reply; 15+ messages in thread
From: Stan Cox @ 2010-09-07 15:08 UTC (permalink / raw)
  To: Roland McGrath; +Cc: systemtap

On 09/03/2010 05:02 PM, Roland McGrath wrote:


> Your systemtap.exp changes look all wrong to me.
> Please explain their rationale.

Yes, I had not noticed that there is now a src/testsuite/sys/sdt.h.  I'll 
change systemtap.exp back.


> Your addition of _SDT_SEMAPHORE... I thought we had abandoned the semaphore thing,

The implementation is unchanged;  I changed the name to match the pattern in 
sdt.h.  Nothing sacred about the name or implementation;  we can certainly 
change it.  The semaphore mechanism is used to implement the *_ENABLED 
mechanism, the usefulness of which is described by Mark here:
  http://sourceware.org/ml/systemtap/2009-q1/msg00959.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-07 15:08             ` Stan Cox
@ 2010-09-08  0:27               ` Roland McGrath
  2010-09-08 15:44                 ` Stan Cox
  0 siblings, 1 reply; 15+ messages in thread
From: Roland McGrath @ 2010-09-08  0:27 UTC (permalink / raw)
  To: Stan Cox; +Cc: systemtap

Hi Stan,

I was in a lousy mood on Friday, so sorry for the tone of my message.

My review of your changes had objections to some parts thoroughly isolated
to the tests, or the headers, and no objections to the translator changes.
For a variety of reasons like that, I find it far preferable always to do
multiple small commits rather than individual large ones.  Of course, it
always helps collaboration to push your branch frequently while you work,
and doing smaller-grained commits makes that easier.

Whenever you would use two separate paragraphs in traditional ChangeLog
format (i.e. a blank line between two * lines), then those should be two
separate commits.  Those should be separate whenever one change is
separable from the other (and sometimes even when they aren't really).

For example, I would have done one commit of just the systemtap.exp change,
one of just the sdt.h change, and another with the translator changes.

This makes it easier to revert one part cleanly (or just avoid merging it).
When reviewing changes, it makes it easier to approve/reject one part, etc.

Doing reversions by hand almost always results in unintended changes, and
one really never knows until much later whether they were in fact as
inconsequential as they seemed.

> > Your addition of _SDT_SEMAPHORE... I thought we had abandoned the semaphore thing,
> 
> The implementation is unchanged;  I changed the name to match the pattern in 
> sdt.h.  Nothing sacred about the name or implementation;  we can certainly 
> change it.  The semaphore mechanism is used to implement the *_ENABLED 
> mechanism, the usefulness of which is described by Mark here:
>   http://sourceware.org/ml/systemtap/2009-q1/msg00959.html

I don't really have an opinion about the semaphore feature.  It's just that
in my long posting proposing what we're now calling "sdt v3", I said I was
leaving out the semaphore because it wasn't being used, and nobody followed
up to say otherwise.  I had expected others to participate with me in the
discussion of my code before anyone just started reworking it.  I spent a
long time making the header file as pretty as it can be, so I'm more
sensitive about willy-nilly changes than is entirely reasonable.


The includes magic changes I made for the tests were done quite carefully,
and I did test them (using runcheck but not installcheck), so I'm surprised
you had issues I didn't see.  It's all rather subtle, so I think it's wise
to discuss any such changes rather than just sweep them in.  (I suppose I
should have done so.)  And, frankly, for anything that even might either be
subtle or ever be important, commits with a log explanation of "Tweak it"
are just inadequate.

Those paths need to be exactly just right to have installcheck actually
test the installed headers so we can know they got installed properly,
which is its purpose.  For the runcheck case, they need to be differently
exactly just right to have it find the right headers in the build and
source directories and no others.  For both cases, it's important that they
be found in an -isystem path so that -pedantic doesn't emit meaningless
stupid warnings that there is no other way to suppress (until GCC is fixed).


Thanks,
Roland

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-08-11 13:11     ` Rayson Ho
@ 2010-09-08  0:44       ` Roland McGrath
  2010-09-29 17:18       ` Adding systemtap probe points in pthread library (was: Re: revamp sdt.h) Rayson Ho
  1 sibling, 0 replies; 15+ messages in thread
From: Roland McGrath @ 2010-09-08  0:44 UTC (permalink / raw)
  To: Rayson Ho; +Cc: systemtap

> I've ported my pthread probe changes against your systemtap glibc tree.
> I think the interface of the new sdt.h and that of the LIBC_PROBE()
> marco is almost the identical, but I am not sure if we are going to have
> macros like PROBE2() in the new-sdt -- not an issue at all as it is just
> a handy macro.

Aside from asm support, the sdt.h macro interface will not change.

> I've inline-attached the diff, and the biggest change is that
> "pthread_probe.h" used with the old sdt.h is now gone.

Good.  You are almost there on cosmetic issues for glibc submissions.
Put a space before a paren, as you see in the rest of the code.
Use full sentences with proper punctuation in comments, and follow
the comment indentation style you see in the rest of the code.

> I have not attached the diffs for the probes in the inline assembly code
> yet, I will do that soon.

Ok.  I have only skimmed these patches as to the semantics of the probes.
So I only have a few comments, but the choices of other probe locations and
parameters still needs a more thorough review.

> @@ -55,6 +57,8 @@ pthread_join (threadid, thread_return)
>    struct pthread *self = THREAD_SELF;
>    int result = 0;
>  
> +  LIBC_PROBE(pthread_join, 1, threadid);

A vanilla probe for non-bogus entry to pthread_join is probably fine.  But
perhaps it's more useful to have (either only, or a separate additional)
probe when pthread_join actually blocks (or close to, i.e. lll_wait_tid
call).

> +  LIBC_PROBE(pthread_join_ret, 2, threadid, result);  

The error code is useful to give here, sure.  But I think pd->result
is the most obviously interesting and important value, so don't miss that.
That might mean different probe names for success/error cases, or perhaps
just a dummy pd->result value of NULL for the error case.

> @@ -32,6 +33,8 @@ __pthread_mutex_destroy (mutex)
>    /* Set to an invalid value.  */
>    mutex->__data.__kind = -1;
>  
> +  LIBC_PROBE(pthread_mutex_destroy, 1, mutex);

For any probe that is named simply with the function name, it should be
before it does its work, not after.  Here, the probe fires after the mutex
has been destroyed, so it's no longer possible to inspect its (former) state.

> @@ -47,6 +49,8 @@ __pthread_mutex_init (mutex, mutexattr)
>  
>    imutexattr = (const struct pthread_mutexattr *) mutexattr ?:
> &default_attr;
>  
> +  LIBC_PROBE(pthread_mutex_init, 2, mutex, mutexattr);

Here might be exception where the opposite makes more sense.  For a *_init
function, the mutex is known to be garbage beforehand, so there is nothing
there to inspect.  OTOH, if tracing a bug wherein it gets reinitialized
wrongly, perhaps you want to see it first.

> +  /* systemtap marker */
> +  LIBC_PROBE(pthread_mutex_lock, 1, mutex);

Omit useless comments.

> @@ -94,6 +108,7 @@ __pthread_mutex_lock (mutex)
>  	  int cnt = 0;
>  	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>  			     mutex->__data.__spins * 2 + 10);
> +
>  	  do
>  	    {
>  	      if (cnt++ >= max_cnt)

Do not make any changes even to whitespace outside your additions.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-08  0:27               ` Roland McGrath
@ 2010-09-08 15:44                 ` Stan Cox
  2010-09-08 17:37                   ` Mark Wielaard
  0 siblings, 1 reply; 15+ messages in thread
From: Stan Cox @ 2010-09-08 15:44 UTC (permalink / raw)
  To: Roland McGrath; +Cc: systemtap


> semaphore feature.

Ah sorry about that; didn't realize you were seeking that kind of feedback. 
Interpreters, java in particular, tcl and perhaps python to a lesser degree 
use #ifdef IS_JAVA_PROBEA_ENABLED(), as I understand it, to avoid runtime 
probe argument setup overhead, so the semaphore feature was added to implement 
*ENABLED.

> The includes magic changes I made for the tests were done quite carefully,
> and I did test them (using runcheck but not installcheck), so I'm surprised
> you had issues I didn't see.  It's all rather subtle, so I think it's wise
> to discuss any such changes rather than just sweep them in.  (I suppose I
> should have done so.)  And, frankly, for anything that even might either be
> subtle or ever be important, commits with a log explanation of "Tweak it"
> are just inadequate.


> Those paths need to be exactly just right to have installcheck actually
> test the installed headers so we can know they got installed properly,
> which is its purpose.  For the runcheck case, they need to be differently
> exactly just right to have it find the right headers in the build and
> source directories and no others.  For both cases, it's important that they
> be found in an -isystem path so that -pedantic doesn't emit meaningless
> stupid warnings that there is no other way to suppress (until GCC is fixed).

What I missed initially when I ran with installcheck, which I caught later 
when hand compiling with gcc -v, was the systemtap.exp use of -isystem=${dir}. 
  For gcc 4.4.4 20100630 on FC13,  gcc tries to use an include directory 
called =/path/to/include/dir (with the '=') and then ignores it since it 
doesn't exist.  Offhand I would think that if --sysroot were not specified 
then gcc might just ignore '=', but it doesn't.  So systemtap.exp is now 
exactly what you had plus 1) add env to global 2) remove '=' from -isystem 
setup.  Hmm, good point about -pedantic.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: revamp sdt.h
  2010-09-08 15:44                 ` Stan Cox
@ 2010-09-08 17:37                   ` Mark Wielaard
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Wielaard @ 2010-09-08 17:37 UTC (permalink / raw)
  To: Stan Cox; +Cc: Roland McGrath, systemtap

On Wed, 2010-09-08 at 11:42 -0400, Stan Cox wrote:
> > semaphore feature.
> 
> Ah sorry about that; didn't realize you were seeking that kind of feedback. 
> Interpreters, java in particular, tcl and perhaps python to a lesser degree 
> use #ifdef IS_JAVA_PROBEA_ENABLED(), as I understand it, to avoid runtime 
> probe argument setup overhead, so the semaphore feature was added to implement 
> *ENABLED.

Apologies here too. I had missed the request for feedback on the
semaphores. Here are a couple of examples of code using the ENABLED
probes. In python the main PyEval_EvalFrameEx() function has the
following near the start:

    if (PYTHON_FUNCTION_ENTRY_ENABLED())
        dtrace_entry(f);

and then near the end:

    if (PYTHON_FUNCTION_RETURN_ENABLED())
        dtrace_return(f);

This is done because the entry and exit probes provide arguments which
might take some time to setup. e.g for entry:

static void
dtrace_entry(PyFrameObject *f)
{
    const char *filename;
    const char *fname;
    int lineno;

    filename = PyString_AsString(f->f_code->co_filename);
    fname = PyString_AsString(f->f_code->co_name);
    lineno = PyCode_Addr2Line(f->f_code, f->f_lasti);

    PYTHON_FUNCTION_ENTRY((char *)filename, (char *)fname, lineno);

Something similar can be found in tcl and firefox.

The java hotspot support doesn't use this feature (yet). There we fetch
all arguments through user*() stap functions in the accompanied tapset.
But even though the stap language is pretty rich and flexible, this is a
lot of magic (re)coding of what the (c++) hotspot code already does.

If we like to be completely source compatible with dtrace markers
already in existing programs/libraries supporting the ENABLED macros
would be really nice/essential.

Cheers,

Mark 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Adding systemtap probe points in pthread library (was: Re: revamp sdt.h)
  2010-08-11 13:11     ` Rayson Ho
  2010-09-08  0:44       ` Roland McGrath
@ 2010-09-29 17:18       ` Rayson Ho
  2010-10-06 14:27         ` RFC: " Rayson Ho
  1 sibling, 1 reply; 15+ messages in thread
From: Rayson Ho @ 2010-09-29 17:18 UTC (permalink / raw)
  To: libc-alpha, systemtap

With the new sdt.h and translator in the systemtap git tree, I added the
probes in assembly code in lowlevellock.S to only trace mutex lock calls
that are contented (ie. those that end up calling futex(2)).

Additions & modifications in this revision:
1) nptl/DESIGN-systemtap-probes.txt - brief docs about the probes
2) nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S - added 2 probes
in the assembly routines right before they perform the SYS_futex
syscall.

(Please see the patch at the end of this message)

The micro-benchmark results are much better, with a simple program that
does nothing much but pthread_mutex_lock() & pthread_mutex_unlock():


#include <stdio.h>
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

thread()
{
 int i;

 for (i=0;i<100000000;i++)
 {
   pthread_mutex_lock(&lock);
   pthread_mutex_unlock(&lock);
 }
}

main()
{
 pthread_t tid, tid2, tid3, tid4;

 pthread_create(&tid,  NULL, thread, NULL);
 pthread_create(&tid2, NULL, thread, NULL);
 pthread_create(&tid3, NULL, thread, NULL);
 pthread_create(&tid4, NULL, thread, NULL);


 thread();

 pthread_join(tid,  NULL);
 pthread_join(tid2, NULL);
 pthread_join(tid3, NULL);
 pthread_join(tid4, NULL);

}

With 5 threads (the version above), there were only 408613 calls to
futex(2), which reduced the firing of mutex_acquire() by 1223.6 times! I
repeated the same test with smaller number (2-4) of threads, and in all
cases, the number of times futex(2) is entered to get the lock is low --
which is similar to the behavior of well-written threaded code (using
DTrace on OpenSolaris, MySQL was not extremely contented.)

So instead of a slow-down of several times when I benchmarked this
micro-benchmark with the older probes, we are only several % slower than
the same code without being instrumented by systemtap. (I benchmarked
the code on my laptop, and I will repeat the benchmark again on a much
quieter machine, esp. with X running and release the final results.)

Rayson




diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 43ca44c..2f1a4c1 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -33,6 +33,7 @@
 #include <atomic.h>
 #include <kernel-features.h>
 
+#include "pthread_probe.h"
 
 /* Atomic operations on TLS memory.  */
 #ifndef THREAD_ATOMIC_CMPXCHG_VAL
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 34d83f9..66f44cb 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -296,6 +296,11 @@ start_thread (void *arg)
 	  CANCEL_RESET (oldtype);
 	}
 
+      /* SystemTap probe
+         All of the normal thread creation work would
+         be done after this point */
+      PTHREAD_PROBE_START(pd->arg);
+
       /* Run the code the user provided.  */
 #ifdef CALL_THREAD_FCT
       THREAD_SETMEM (pd, result, CALL_THREAD_FCT (pd));
@@ -552,6 +557,9 @@ __pthread_create_2_1 (newthread, attr,
start_routine, arg)
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
+  /* Systemtap probe */
+  PTHREAD_PROBE_CREATE(newthread, start_routine, arg);
+
   /* Start the thread.  */
   return create_thread (pd, iattr, STACK_VARIABLES_ARGS);
 }
diff --git a/nptl/pthread_join.c b/nptl/pthread_join.c
index 6a87a8b..58171a3 100644
--- a/nptl/pthread_join.c
+++ b/nptl/pthread_join.c
@@ -55,6 +55,8 @@ pthread_join (threadid, thread_return)
   struct pthread *self = THREAD_SELF;
   int result = 0;
 
+  PTHREAD_PROBE_JOIN(threadid);
+
   /* During the wait we change to asynchronous cancellation.  If we
      are canceled the thread we are waiting for must be marked as
      un-wait-ed for again.  */
@@ -110,5 +112,7 @@ pthread_join (threadid, thread_return)
       __free_tcb (pd);
     }
 
+  PTHREAD_PROBE_JOIN_RET(threadid, result);
+
   return result;
 }
diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
index e2c9f8a..2217f58 100644
--- a/nptl/pthread_mutex_destroy.c
+++ b/nptl/pthread_mutex_destroy.c
@@ -29,6 +29,8 @@ __pthread_mutex_destroy (mutex)
       && mutex->__data.__nusers != 0)
     return EBUSY;
 
+  PTHREAD_PROBE_MUTEX_DESTROY(mutex);
+
   /* Set to an invalid value.  */
   mutex->__data.__kind = -1;
 
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index d9b1ef0..bf395dd 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -45,6 +45,8 @@ __pthread_mutex_init (mutex, mutexattr)
 
   assert (sizeof (pthread_mutex_t) <= __SIZEOF_PTHREAD_MUTEX_T);
 
+  PTHREAD_PROBE_MUTEX_INIT(mutex);
+
   imutexattr = (const struct pthread_mutexattr *) mutexattr ?:
&default_attr;
 
   /* Sanity checks.  */
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 50dc188..a4ccefe 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -48,6 +48,10 @@ __pthread_mutex_lock (mutex)
   assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
 
   unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
+
+  /* systemtap marker */
+  PTHREAD_PROBE_MUTEX_ENTRY(mutex);
+
   if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
     return __pthread_mutex_lock_full (mutex);
 
@@ -60,6 +64,8 @@ __pthread_mutex_lock (mutex)
       /* Normal mutex.  */
       LLL_MUTEX_LOCK (mutex);
       assert (mutex->__data.__owner == 0);
+
+      PTHREAD_PROBE_MUTEX_BLOCK(mutex);
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
     {
@@ -75,6 +81,11 @@ __pthread_mutex_lock (mutex)
 
 	  ++mutex->__data.__count;
 
+          /* currently, the systemtap pthread probe does not have a */
+          /* probe point here because the thread already owns this */
+          /* recursive lock before the call to this function. */
+          /* this might change in the future */
+
 	  return 0;
 	}
 
@@ -83,6 +94,8 @@ __pthread_mutex_lock (mutex)
 
       assert (mutex->__data.__owner == 0);
       mutex->__data.__count = 1;
+
+      PTHREAD_PROBE_MUTEX_BLOCK(mutex);
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
@@ -94,6 +107,7 @@ __pthread_mutex_lock (mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
 			     mutex->__data.__spins * 2 + 10);
+
 	  do
 	    {
 	      if (cnt++ >= max_cnt)
@@ -108,6 +122,8 @@ __pthread_mutex_lock (mutex)
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
+          PTHREAD_PROBE_MUTEX_BLOCK(mutex);
+
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       assert (mutex->__data.__owner == 0);
@@ -127,6 +143,8 @@ __pthread_mutex_lock (mutex)
   ++mutex->__data.__nusers;
 #endif
 
+  PTHREAD_PROBE_MUTEX_ACQUIRE(mutex);
+
   return 0;
 }
 
@@ -277,6 +295,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 		++mutex->__data.__count;
 
+                /* currently, the systemtap pthread probe does not have
a */
+                /* probe point here because the thread already owns
this */
+                /* recursive lock before the call to this function. */
+                /* this might change in the future */
 		return 0;
 	      }
 	  }
@@ -393,6 +415,11 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 		  /* Overflow of the counter.  */
 		  return EAGAIN;
 
+                 /* currently, the systemtap pthread probe does not
have a */
+                 /* probe point here because the thread already owns
this */
+                 /* recursive lock before the call to this function. */
+                 /* this might change in the future */
+
 		++mutex->__data.__count;
 
 		return 0;
@@ -442,8 +469,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 		  break;
 
 		if (oldval != ceilval)
+                {
 		  lll_futex_wait (&mutex->__data.__lock, ceilval | 2,
 				  PTHREAD_MUTEX_PSHARED (mutex));
+                }
 	      }
 	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 							ceilval | 2, ceilval)
@@ -451,6 +480,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  }
 	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
 
+        PTHREAD_PROBE_MUTEX_BLOCK(mutex);
+
 	assert (mutex->__data.__owner == 0);
 	mutex->__data.__count = 1;
       }
@@ -467,6 +498,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
   ++mutex->__data.__nusers;
 #endif
 
+  PTHREAD_PROBE_MUTEX_ACQUIRE(mutex);
+
   return 0;
 }
 #ifndef __pthread_mutex_lock
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index f9fe10b..50cbc5c 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -50,6 +50,7 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
 
       /* Unlock.  */
       lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
+      PTHREAD_PROBE_MUTEX_RELEASE(mutex);
       return 0;
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
@@ -60,6 +61,10 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+        
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto normal;
     }
@@ -104,6 +109,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
 
       goto robust;
@@ -149,6 +158,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto continue_pi_non_robust;
 
@@ -171,6 +184,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
 
       goto continue_pi_robust;
@@ -237,6 +254,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto pp;
 
@@ -272,6 +293,9 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
int decr)
 			PTHREAD_MUTEX_PSHARED (mutex));
 
       int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
+
+      PTHREAD_PROBE_MUTEX_RELEASE(mutex);
+
       return __pthread_tpp_change_priority (oldprio, -1);
 
     default:
@@ -279,6 +303,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
int decr)
       return EINVAL;
     }
 
+  PTHREAD_PROBE_MUTEX_RELEASE(mutex);
   return 0;
 }
 
diff --git a/nptl/pthread_rwlock_destroy.c
b/nptl/pthread_rwlock_destroy.c
index 28fd24b..b4cd7ab 100644
--- a/nptl/pthread_rwlock_destroy.c
+++ b/nptl/pthread_rwlock_destroy.c
@@ -24,6 +24,8 @@ int
 __pthread_rwlock_destroy (rwlock)
      pthread_rwlock_t *rwlock;
 {
+  PTHREAD_PROBE_RWLOCK_DESTROY(rwlock);
+
   /* Nothing to be done.  For now.  */
   return 0;
 }
diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
index 31eb508..954b414 100644
--- a/nptl/pthread_rwlock_rdlock.c
+++ b/nptl/pthread_rwlock_rdlock.c
@@ -31,6 +31,8 @@ __pthread_rwlock_rdlock (rwlock)
 {
   int result = 0;
 
+  PTHREAD_PROBE_RLOCK_ENTRY(rwlock);
+
   /* Make sure we are along.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
@@ -49,6 +51,12 @@ __pthread_rwlock_rdlock (rwlock)
 	      --rwlock->__data.__nr_readers;
 	      result = EAGAIN;
 	    }
+          else
+            {
+              /* systemtap pthread probe - this is the only place where
*/
+              /* we get this read-write lock */
+              PTHREAD_PROBE_RWLOCK_ACQUIRE_READ(rwlock);
+            }
 
 	  break;
 	}
diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
index a7ef71a..e7d1568 100644
--- a/nptl/pthread_rwlock_unlock.c
+++ b/nptl/pthread_rwlock_unlock.c
@@ -27,6 +27,8 @@
 int
 __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
 {
+  PTHREAD_PROBE_RWLOCK_UNLOCK(rwlock);
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
   if (rwlock->__data.__writer)
     rwlock->__data.__writer = 0;
diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
index 64fe970..abf3083 100644
--- a/nptl/pthread_rwlock_wrlock.c
+++ b/nptl/pthread_rwlock_wrlock.c
@@ -31,6 +31,8 @@ __pthread_rwlock_wrlock (rwlock)
 {
   int result = 0;
 
+  PTHREAD_PROBE_WLOCK_ENTRY(rwlock);
+
   /* Make sure we are along.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
@@ -41,6 +43,11 @@ __pthread_rwlock_wrlock (rwlock)
 	{
 	  /* Mark self as writer.  */
 	  rwlock->__data.__writer = THREAD_GETMEM (THREAD_SELF, tid);
+
+          /* systemtap pthread probe - this is the only place where we
can */
+          /* get this read-write lock. */
+          PTHREAD_PROBE_RWLOCK_ACQUIRE_WRITE(rwlock);
+
 	  break;
 	}
diff --git a/nptl/pthread_probe.h b/nptl/pthread_probe.h
index e69de29..8178ac9 100644
--- a/nptl/pthread_probe.h
+++ b/nptl/pthread_probe.h
@@ -0,0 +1,33 @@
+#include <sys/sdt.h>
+
+/* #include "new-sdt3.h" */
+
+#define PTHREAD_PROBE_CREATE(arg1,arg2,arg3)
STAP_PROBE3(provider,create,arg1,arg2,arg3)
+#define PTHREAD_PROBE_JOIN(arg1) STAP_PROBE1(provider,join,arg1)
+#define PTHREAD_PROBE_JOIN_RET(arg1,arg2)
STAP_PROBE2(provider,join_ret,arg1,arg2)
+#define PTHREAD_PROBE_START(arg1) STAP_PROBE1(provider,start,arg1)
+#define PTHREAD_PROBE_END(arg1) STAP_PROBE1(provider,end,arg1)
+#define PTHREAD_PROBE_MUTEX_INIT(arg1)
STAP_PROBE1(provider,mutex_init,arg1)
+#define PTHREAD_PROBE_MUTEX_DESTROY(arg1)
STAP_PROBE1(provider,mutex_destroy,arg1)
+#define PTHREAD_PROBE_MUTEX_ACQUIRE(arg1)
STAP_PROBE1(provider,mutex_acquire,arg1)
+#define PTHREAD_PROBE_MUTEX_RELEASE(arg1)
STAP_PROBE1(provider,mutex_release,arg1)
+#define PTHREAD_PROBE_MUTEX_BLOCK(arg1)
STAP_PROBE1(provider,mutex_block,arg1)
+#define PTHREAD_PROBE_COND_INIT(arg1)
STAP_PROBE1(provider,cond_init,arg1)
+#define PTHREAD_PROBE_COND_DESTROY(arg1)
STAP_PROBE1(provider,cond_destroy,arg1)
+#define PTHREAD_PROBE_COND_WAIT(arg1,arg2)
STAP_PROBE2(provider,cond_wait,arg1,arg2)
+#define PTHREAD_PROBE_COND_WAKE(arg1,arg2)
STAP_PROBE2(provider,cond_wake,arg1,arg2)
+#define PTHREAD_PROBE_COND_SIGNAL(arg1)
STAP_PROBE1(provider,cond_signal,arg1)
+#define PTHREAD_PROBE_COND_BROADCAST(arg1)
STAP_PROBE1(provider,cond_broadcast,arg1)
+#define PTHREAD_PROBE_RWLOCK_ACQUIRE_WRITE(arg1)
STAP_PROBE1(provider,rwlock_acquire_write,arg1)
+#define PTHREAD_PROBE_RWLOCK_ACQUIRE_READ(arg1)
STAP_PROBE1(provider,rwlock_acquire_read,arg1)
+#define PTHREAD_PROBE_RWLOCK_DESTROY(arg1)
STAP_PROBE1(provider,rwlock_destroy,arg1)
+#define PTHREAD_PROBE_RWLOCK_UNLOCK(arg1)
STAP_PROBE1(provider,rwlock_unlock,arg1)
+#define PTHREAD_PROBE_MUTEX_ENTRY(arg1)
STAP_PROBE1(provider,mutex_entry,arg1)
+#define PTHREAD_PROBE_RLOCK_ENTRY(arg1)
STAP_PROBE1(provider,rlock_entry,arg1)
+#define PTHREAD_PROBE_RLOCK_BLOCK(arg1)
STAP_PROBE1(provider,rlock_block,arg1)
+#define PTHREAD_PROBE_WLOCK_ENTRY(arg1)
STAP_PROBE1(provider,wlock_entry,arg1)
+#define PTHREAD_PROBE_WLOCK_BLOCK(arg1)
STAP_PROBE1(provider,wlock_block,arg1)
+
+/* the following probe points are in low-level assembly/inline assembly
code */
+#define PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(arg1)
STAP_PROBE1(provider,lock_wait_private,arg1)
+#define PTHREAD_PROBE_LL_LOCKWAIT(arg1)
STAP_PROBE1(provider,lock_wait,arg1)

diff --git a/nptl/DESIGN-systemtap-probes.txt
b/nptl/DESIGN-systemtap-probes.txt
index e69de29..d8bbbd7 100644
--- a/nptl/DESIGN-systemtap-probes.txt
+++ b/nptl/DESIGN-systemtap-probes.txt
@@ -0,0 +1,34 @@
+Systemtap is a dynamic tracingi/instrumenting tool available on Linux.
Probes that are not fired at run time have extremely close to zero
overhead.
+
+The following probes are available for NPTL:
+
+Thread creation & Join Probes
+=============================
+create   - probe for pthread_create(3) - arg1 = thread ID, arg2 =
start_routine, arg3 = arguments
+start    - probe for actual thread creation, arg1 = struct pthread
(members include thread ID, process ID)
+join     - probe for pthread_join(3)   - arg1 = thread ID
+join_ret - probe for pthread_join(3) return - arg1 = thread ID, arg2 =
return value
+
+Lock-related Probes
+===================
+mutex_init    - probe for pthread_mutex_init(3) - arg1 = address of
mutex lock
+mutex_acquired- probe for pthread_mutex_lock(3) - arg1 = address of
mutex lock
+mutex_block   - probe for resume from _possible_ mutex block event -
arg1 = address of mutex lock
+mutex_entry   - probe for entry to the pthread_mutex_lock(3) function,
- arg1 = address of mutex lock
+mutex_release - probe for pthread_mutex_unlock(3) after the successful
release of a mutex lock - arg1 = address of mutex lock
+mutex_destroy - probe for pthread_mutex_destroy(3) - arg1 = address of
mutex lock
+rwlock_destroy- probe for pthread_rwlock_destroy(3) - arg1 = address of
rw lock
+rwlock_acquire_write-probe for pthread_rwlock_wrlock(3) - arg1 =
address of rw lock
+rwlock_unlock - probe for pthread_rwlock_unlock(3) - arg1 = address of
rw lock
+
+lock_wait         - probe in low-level lock code, only fired when futex
is called (i.e. when trying to acquire a contented lock)
+lock_wait_private - probe in low-level lock code, only fired when futex
is called (i.e. when trying to acquire a contented lock)
+
+Condition variable Probes
+=========================
+cond_init - probe for pthread_cond_init(3) - arg1 = condition, arg2 =
attr
+cond_destroy- probe for pthread_condattr_destroy(3) - arg1 = attr
+cond_wait - probe for pthread_cond_wait(3) - arg1 = condition, arg2 =
mutex lock
+cond_signal - probe for pthread_cond_signal(3) - arg1 = condition
+cond_broadcast - probe for pthread_cond_broadcast(3) - arg1 = condition
+
diff --git a/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
b/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
index 8de9cf4..b6d9847 100644
--- a/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
+++ b/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
@@ -21,6 +21,7 @@
 #include <pthread-errnos.h>
 #include <kernel-features.h>
 #include <lowlevellock.h>
+#include <pthread_probe.h>
 
        .text
 
@@ -130,7 +131,8 @@ __lll_lock_wait:
        cmpl    %edx, %eax      /* NB:   %edx == 2 */
        jne     2f
 
-1:     movl    $SYS_futex, %eax
+1:     PTHREAD_PROBE_LL_LOCKWAIT(%rdi)
+        movl   $SYS_futex, %eax
        syscall
 
 2:     movl    %edx, %eax
@@ -180,7 +182,8 @@ __lll_timedlock_wait:
        cmpl    %edx, %eax
        jne     2f
 
-1:     movl    $SYS_futex, %eax
+1:     PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(%rdi)
+        movl   $SYS_futex, %eax
        movl    $2, %edx
        syscall
 

On Wed, 2010-08-11 at 09:10 -0400, Rayson Ho wrote:
> Thanks Roland,
> 
> I've ported my pthread probe changes against your systemtap glibc tree.
> I think the interface of the new sdt.h and that of the LIBC_PROBE()
> marco is almost the identical, but I am not sure if we are going to have
> macros like PROBE2() in the new-sdt -- not an issue at all as it is just
> a handy macro.
> 
> I've inline-attached the diff, and the biggest change is that
> "pthread_probe.h" used with the old sdt.h is now gone.
> 
> I have not attached the diffs for the probes in the inline assembly code
> yet, I will do that soon.
> 
> Rayson
> 
> 
> 
> diff --git a/nptl/pthread_join.c b/nptl/pthread_join.c
> index 6a87a8b..9f67a58 100644
> --- a/nptl/pthread_join.c
> +++ b/nptl/pthread_join.c
> @@ -23,6 +23,8 @@
>  #include <atomic.h>
>  #include "pthreadP.h"
>  
> +#include <stap-probe.h>
> +
>  
>  static void
>  cleanup (void *arg)
> @@ -55,6 +57,8 @@ pthread_join (threadid, thread_return)
>    struct pthread *self = THREAD_SELF;
>    int result = 0;
>  
> +  LIBC_PROBE(pthread_join, 1, threadid);
> +
>    /* During the wait we change to asynchronous cancellation.  If we
>       are canceled the thread we are waiting for must be marked as
>       un-wait-ed for again.  */
> @@ -110,5 +114,7 @@ pthread_join (threadid, thread_return)
>        __free_tcb (pd);
>      }
>  
> +  LIBC_PROBE(pthread_join_ret, 2, threadid, result);  
> +
>    return result;
>  }
> diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
> index e2c9f8a..dd690cd 100644
> --- a/nptl/pthread_mutex_destroy.c
> +++ b/nptl/pthread_mutex_destroy.c
> @@ -20,6 +20,7 @@
>  #include <errno.h>
>  #include "pthreadP.h"
>  
> +#include <stap-probe.h>
>  
>  int
>  __pthread_mutex_destroy (mutex)
> @@ -32,6 +33,8 @@ __pthread_mutex_destroy (mutex)
>    /* Set to an invalid value.  */
>    mutex->__data.__kind = -1;
>  
> +  LIBC_PROBE(pthread_mutex_destroy, 1, mutex);
> +
>    return 0;
>  }
>  strong_alias (__pthread_mutex_destroy, pthread_mutex_destroy)
> diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
> index d9b1ef0..e4fcae7 100644
> --- a/nptl/pthread_mutex_init.c
> +++ b/nptl/pthread_mutex_init.c
> @@ -24,6 +24,8 @@
>  #include <kernel-features.h>
>  #include "pthreadP.h"
>  
> +#include <stap-probe.h>
> +
>  static const struct pthread_mutexattr default_attr =
>    {
>      /* Default is a normal mutex, not shared between processes.  */
> @@ -47,6 +49,8 @@ __pthread_mutex_init (mutex, mutexattr)
>  
>    imutexattr = (const struct pthread_mutexattr *) mutexattr ?:
> &default_attr;
>  
> +  LIBC_PROBE(pthread_mutex_init, 2, mutex, mutexattr);
> +
>    /* Sanity checks.  */
>    switch (__builtin_expect (imutexattr->mutexkind
>  			    & PTHREAD_MUTEXATTR_PROTOCOL_MASK,
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 50dc188..b754372 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -25,6 +25,7 @@
>  #include "pthreadP.h"
>  #include <lowlevellock.h>
>  
> +#include <stap-probe.h>
>  
>  #ifndef LLL_MUTEX_LOCK
>  # define LLL_MUTEX_LOCK(mutex) \
> @@ -48,6 +49,10 @@ __pthread_mutex_lock (mutex)
>    assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
>  
>    unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
> +
> +  /* systemtap marker */
> +  LIBC_PROBE(pthread_mutex_lock, 1, mutex);
> +
>    if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
>      return __pthread_mutex_lock_full (mutex);
>  
> @@ -60,6 +65,8 @@ __pthread_mutex_lock (mutex)
>        /* Normal mutex.  */
>        LLL_MUTEX_LOCK (mutex);
>        assert (mutex->__data.__owner == 0);
> +
> +      LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
>      }
>    else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
>      {
> @@ -75,6 +82,11 @@ __pthread_mutex_lock (mutex)
>  
>  	  ++mutex->__data.__count;
>  
> +          /* currently, the systemtap pthread probe does not have a */
> +          /* probe point here because the thread already owns this */
> +          /* recursive lock before the call to this function. */
> +          /* this might change in the future */
> +
>  	  return 0;
>  	}
>  
> @@ -83,6 +95,8 @@ __pthread_mutex_lock (mutex)
>  
>        assert (mutex->__data.__owner == 0);
>        mutex->__data.__count = 1;
> +
> +      LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
>      }
>    else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
>      {
> @@ -94,6 +108,7 @@ __pthread_mutex_lock (mutex)
>  	  int cnt = 0;
>  	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>  			     mutex->__data.__spins * 2 + 10);
> +
>  	  do
>  	    {
>  	      if (cnt++ >= max_cnt)
> @@ -108,6 +123,8 @@ __pthread_mutex_lock (mutex)
>  	    }
>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
> +          LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
> +
>  	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>  	}
>        assert (mutex->__data.__owner == 0);
> @@ -127,6 +144,8 @@ __pthread_mutex_lock (mutex)
>    ++mutex->__data.__nusers;
>  #endif
>  
> +  LIBC_PROBE(pthread_mutex_lock_acquire, 1, mutex);
> +
>    return 0;
>  }
>  
> @@ -277,6 +296,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  
>  		++mutex->__data.__count;
>  
> +                /* currently, the systemtap pthread probe does not have
> a */
> +                /* probe point here because the thread already owns
> this */
> +                /* recursive lock before the call to this function. */
> +                /* this might change in the future */
>  		return 0;
>  	      }
>  	  }
> @@ -393,6 +416,11 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  		  /* Overflow of the counter.  */
>  		  return EAGAIN;
>  
> +                 /* currently, the systemtap pthread probe does not
> have a */
> +                 /* probe point here because the thread already owns
> this */
> +                 /* recursive lock before the call to this function. */
> +                 /* this might change in the future */
> +
>  		++mutex->__data.__count;
>  
>  		return 0;
> @@ -451,6 +479,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  	  }
>  	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
>  
> +        LIBC_PROBE(pthread_mutex_lock_full_block, 1, mutex);
> +
>  	assert (mutex->__data.__owner == 0);
>  	mutex->__data.__count = 1;
>        }
> @@ -467,6 +497,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>    ++mutex->__data.__nusers;
>  #endif
>  
> +  LIBC_PROBE(pthread_mutex_lock_full_acquire, 1, mutex);
> +
>    return 0;
>  }
>  #ifndef __pthread_mutex_lock
> diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
> index f9fe10b..e0f305e 100644
> --- a/nptl/pthread_mutex_unlock.c
> +++ b/nptl/pthread_mutex_unlock.c
> @@ -23,6 +23,8 @@
>  #include "pthreadP.h"
>  #include <lowlevellock.h>
>  
> +#include <stap-probe.h>
> +
>  static int
>  internal_function
>  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
> @@ -50,6 +52,7 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
>  
>        /* Unlock.  */
>        lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
> +      LIBC_PROBE(pthread_mutex_release, 1, mutex);
>        return 0;
>      }
>    else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
> @@ -60,6 +63,10 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +        
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>        goto normal;
>      }
> @@ -104,6 +111,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>  
>        goto robust;
> @@ -149,6 +160,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>        goto continue_pi_non_robust;
>  
> @@ -171,6 +186,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>  
>        goto continue_pi_robust;
> @@ -237,6 +256,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>        goto pp;
>  
> @@ -272,6 +295,9 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
> int decr)
>  			PTHREAD_MUTEX_PSHARED (mutex));
>  
>        int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
> +
> +      LIBC_PROBE(pthread_mutex_release, 1, mutex);
> +
>        return __pthread_tpp_change_priority (oldprio, -1);
>  
>      default:
> @@ -279,6 +305,8 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
> int decr)
>        return EINVAL;
>      }
>  
> +  LIBC_PROBE(pthread_mutex_release, 1, mutex);
> +
>    return 0;
>  }
>  
> diff --git a/nptl/pthread_rwlock_destroy.c
> b/nptl/pthread_rwlock_destroy.c
> index 28fd24b..b14de8f 100644
> --- a/nptl/pthread_rwlock_destroy.c
> +++ b/nptl/pthread_rwlock_destroy.c
> @@ -19,12 +19,14 @@
>  
>  #include "pthreadP.h"
>  
> +#include <stap-probe.h>
>  
>  int
>  __pthread_rwlock_destroy (rwlock)
>       pthread_rwlock_t *rwlock;
>  {
>    /* Nothing to be done.  For now.  */
> +  LIBC_PROBE(pthread_rwlock_destroy, 1, rwlock);
>    return 0;
>  }
>  strong_alias (__pthread_rwlock_destroy, pthread_rwlock_destroy)
> diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
> index 31eb508..7b4d8f0 100644
> --- a/nptl/pthread_rwlock_rdlock.c
> +++ b/nptl/pthread_rwlock_rdlock.c
> @@ -23,6 +23,8 @@
>  #include <pthread.h>
>  #include <pthreadP.h>
>  
> +#include <stap-probe.h>
> +
>  
>  /* Acquire read lock for RWLOCK.  */
>  int
> @@ -31,6 +33,8 @@ __pthread_rwlock_rdlock (rwlock)
>  {
>    int result = 0;
>  
> +  LIBC(pthread_rwlock_rdlock, 1, rwlock);
> +
>    /* Make sure we are along.  */
>    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
>  
> @@ -49,6 +53,13 @@ __pthread_rwlock_rdlock (rwlock)
>  	      --rwlock->__data.__nr_readers;
>  	      result = EAGAIN;
>  	    }
> +          else
> +            {
> +              /* systemtap pthread probe - this is the only place where
> */
> +              /* we get this read-write lock */
> +              LIBC_PROBE(pthread_rwlock_rdlock, 1, rwlock);
> +            }
> +
>  
>  	  break;
>  	}
> diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
> index a7ef71a..ba9620b 100644
> --- a/nptl/pthread_rwlock_unlock.c
> +++ b/nptl/pthread_rwlock_unlock.c
> @@ -23,10 +23,14 @@
>  #include <pthread.h>
>  #include <pthreadP.h>
>  
> +#include <stap-probe.h>
> +
>  /* Unlock RWLOCK.  */
>  int
>  __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
>  {
> +  LIBC_PROBE(pthread_rwlock_unlock, 1, rwlock);
> +
>    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
>    if (rwlock->__data.__writer)
>      rwlock->__data.__writer = 0;
> diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
> index 64fe970..09b9454 100644
> --- a/nptl/pthread_rwlock_wrlock.c
> +++ b/nptl/pthread_rwlock_wrlock.c
> @@ -23,6 +23,7 @@
>  #include <pthread.h>
>  #include <pthreadP.h>
>  
> +#include <stap-probe.h>
>  
>  /* Acquire write lock for RWLOCK.  */
>  int
> @@ -31,6 +32,8 @@ __pthread_rwlock_wrlock (rwlock)
>  {
>    int result = 0;
>  
> +  LIBC_PROBE(pthread_rwlock_wrlock, 1, rwlock);
> +
>    /* Make sure we are along.  */
>    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
>  
> @@ -41,6 +44,11 @@ __pthread_rwlock_wrlock (rwlock)
>  	{
>  	  /* Mark self as writer.  */
>  	  rwlock->__data.__writer = THREAD_GETMEM (THREAD_SELF, tid);
> +
> +          /* systemtap pthread probe - this is the only place where we
> can */
> +          /* get this read-write lock. */
> +          LIBC_PROBE(pthread_rwlock_wrlock_acquire, 1, rwlock);
> +
>  	  break;
>  	}
>  
> 
> 
> On Mon, 2010-08-09 at 09:37 -0700, Roland McGrath wrote:
> > > I have modified your sdt.h slightly and used it for the pthread
> > > probes. Is there a special branch of systemtap or utrace that I need
> > > to use in order to test/benchmark the overhead of the existance of the
> > > new sdt probes in libpthread?
> > 
> > The translator work has yet to be done.  So at the moment all you could
> > test is having the probes in there statically (i.e. addition of nop
> > instructions) and not using them.
> > 
> > 
> > Thanks,
> > Roland
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RFC: Adding systemtap probe points in pthread library (was: Re: revamp sdt.h)
  2010-09-29 17:18       ` Adding systemtap probe points in pthread library (was: Re: revamp sdt.h) Rayson Ho
@ 2010-10-06 14:27         ` Rayson Ho
  0 siblings, 0 replies; 15+ messages in thread
From: Rayson Ho @ 2010-10-06 14:27 UTC (permalink / raw)
  To: libc-alpha; +Cc: systemtap

I missed the review comment from Roland (Sorry... it got mixed with
other sdt.h discussions). I've addressed most of his comments, and also
addressed the glibc list maintainer's comment.

Changes relative to the last email:

- removed nptl/pthread_probe.h (originally designed to use the "ENABLED"
macro, and as the sdt.h PROBES are virtually free in terms of
performance, so there is not a lot of need for nptl/pthread_probe.h).
And because of the way the header files are included, the "IN_LIB"
macros are not always defined. Thus I removed the whole thing and gone
with inline probes.
- added ChangeLog
- now uses LIBC_PROBE
- cleaned up the code to follow the coding standard (and added space
before the opening paren.

Note that I've only tested on x86 & x64, and the code compiles fine on
PPC (but not tested on it).

There are more probes for glibc (malloc & free, a few more pthread
probes), but I believe I should just send the idential patch but with
the problems addressed rather than a new patch every time or else it
would take a long time.

Thanks,
Rayson





diff --git a/ChangeLog b/ChangeLog
index b300207..94951c0 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,19 @@
+2010-10-06  Rayson Ho <rho@redhat.com>
+
+	* nptl/DESIGN-systemtap-probes.txt: New file.
+	* nptl/pthread_create.c: SystemTap probes.
+        * nptl/pthread_join.c: Likewise.
+        * nptl/pthread_mutex_destroy.c: Likewise.
+        * nptl/pthread_mutex_init.c: Likewise.
+        * nptl/pthread_mutex_lock.c: Likewise.
+        * nptl/pthread_mutex_unlock.c: Likewise.
+        * nptl/pthread_rwlock_destroy.c: Likewise.
+        * nptl/pthread_rwlock_rdlock.c: Likewise.
+        * nptl/pthread_rwlock_unlock.c: Likewise.
+        * nptl/pthread_rwlock_wrlock.c: Likewise.
+        * nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: Likewise.
+
+
 2010-08-25  Roland McGrath  <roland@redhat.com>
 
 	* include/stap-probe.h: New file.

diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 4075dd9..5fdf2c9 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -556,6 +556,8 @@ __pthread_create_2_1 (newthread, attr,
start_routine, arg)
   /* Pass the descriptor to the caller.  */
   *newthread = (pthread_t) pd;
 
+  LIBC_PROBE (newthread, 2, start_routine, arg);
+
   /* Start the thread.  */
   return create_thread (pd, iattr, STACK_VARIABLES_ARGS);
 }
diff --git a/nptl/pthread_join.c b/nptl/pthread_join.c
index 6a87a8b..447ee31 100644
--- a/nptl/pthread_join.c
+++ b/nptl/pthread_join.c
@@ -23,6 +23,8 @@
 #include <atomic.h>
 #include "pthreadP.h"
 
+#include <stap-probe.h>
+
 
 static void
 cleanup (void *arg)
@@ -55,6 +57,8 @@ pthread_join (threadid, thread_return)
   struct pthread *self = THREAD_SELF;
   int result = 0;
 
+  LIBC_PROBE (join, 1, threadid);
+
   /* During the wait we change to asynchronous cancellation.  If we
      are canceled the thread we are waiting for must be marked as
      un-wait-ed for again.  */
@@ -110,5 +114,7 @@ pthread_join (threadid, thread_return)
       __free_tcb (pd);
     }
 
+  LIBC_PROBE (join_ret, 3, threadid, result, pd->result);
+
   return result;
 }
diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
index e2c9f8a..181c4aa 100644
--- a/nptl/pthread_mutex_destroy.c
+++ b/nptl/pthread_mutex_destroy.c
@@ -20,11 +20,15 @@
 #include <errno.h>
 #include "pthreadP.h"
 
+#include <stap-probe.h>
+
 
 int
 __pthread_mutex_destroy (mutex)
      pthread_mutex_t *mutex;
 {
+  LIBC_PROBE (cond_destroy, 1, mutex);
+
   if ((mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP) == 0
       && mutex->__data.__nusers != 0)
     return EBUSY;
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index d9b1ef0..d22abea 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -24,6 +24,8 @@
 #include <kernel-features.h>
 #include "pthreadP.h"
 
+#include <stap-probe.h>
+
 static const struct pthread_mutexattr default_attr =
   {
     /* Default is a normal mutex, not shared between processes.  */
@@ -135,6 +137,8 @@ __pthread_mutex_init (mutex, mutexattr)
   // mutex->__spins = 0;	already done by memset
   // mutex->__next = NULL;	already done by memset
 
+  LIBC_PROBE (mutex_init, 1, mutex);
+
   return 0;
 }
 strong_alias (__pthread_mutex_init, pthread_mutex_init)
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 50dc188..a2d9188 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -24,6 +24,7 @@
 #include <not-cancel.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
+#include <stap-probe.h>
 
 
 #ifndef LLL_MUTEX_LOCK
@@ -48,6 +49,9 @@ __pthread_mutex_lock (mutex)
   assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
 
   unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
+
+  LIBC_PROBE (mutex_entry, 1, mutex);
+
   if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
     return __pthread_mutex_lock_full (mutex);
 
@@ -60,6 +64,8 @@ __pthread_mutex_lock (mutex)
       /* Normal mutex.  */
       LLL_MUTEX_LOCK (mutex);
       assert (mutex->__data.__owner == 0);
+
+      LIBC_PROBE (mutex_block, 1, mutex);
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
     {
@@ -75,6 +81,11 @@ __pthread_mutex_lock (mutex)
 
 	  ++mutex->__data.__count;
 
+          /* currently, the systemtap pthread probe does not have a */
+          /* probe point here because the thread already owns this */
+          /* recursive lock before the call to this function. */
+          /* this might change in the future */
+
 	  return 0;
 	}
 
@@ -83,6 +94,8 @@ __pthread_mutex_lock (mutex)
 
       assert (mutex->__data.__owner == 0);
       mutex->__data.__count = 1;
+
+      LIBC_PROBE (mutex_block, 1, mutex);
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
@@ -94,6 +107,7 @@ __pthread_mutex_lock (mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
 			     mutex->__data.__spins * 2 + 10);
+
 	  do
 	    {
 	      if (cnt++ >= max_cnt)
@@ -108,6 +122,8 @@ __pthread_mutex_lock (mutex)
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
+          LIBC_PROBE (mutex_block, 1, mutex);
+
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       assert (mutex->__data.__owner == 0);
@@ -127,6 +143,8 @@ __pthread_mutex_lock (mutex)
   ++mutex->__data.__nusers;
 #endif
 
+  LIBC_PROBE (mutex_acquire, 1, mutex);
+
   return 0;
 }
 
@@ -277,6 +295,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 		++mutex->__data.__count;
 
+                /* currently, the systemtap pthread probe does not have
a */
+                /* probe point here because the thread already owns
this */
+                /* recursive lock before the call to this function. */
+                /* this might change in the future */
 		return 0;
 	      }
 	  }
@@ -393,6 +415,11 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 		  /* Overflow of the counter.  */
 		  return EAGAIN;
 
+                 /* currently, the systemtap pthread probe does not
have a */
+                 /* probe point here because the thread already owns
this */
+                 /* recursive lock before the call to this function. */
+                 /* this might change in the future */
+
 		++mutex->__data.__count;
 
 		return 0;
@@ -451,6 +478,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  }
 	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
 
+        LIBC_PROBE (mutex_block, 1, mutex);
+
 	assert (mutex->__data.__owner == 0);
 	mutex->__data.__count = 1;
       }
@@ -467,6 +496,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
   ++mutex->__data.__nusers;
 #endif
 
+  LIBC_PROBE (mutex_acquire, 1, mutex);
+
   return 0;
 }
 #ifndef __pthread_mutex_lock
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index f9fe10b..b6c358e 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -22,6 +22,7 @@
 #include <stdlib.h>
 #include "pthreadP.h"
 #include <lowlevellock.h>
+#include <stap-probe.h>
 
 static int
 internal_function
@@ -50,6 +51,9 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
 
       /* Unlock.  */
       lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
+
+      LIBC_PROBE (mutex_release, 1, mutex);
+
       return 0;
     }
   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
@@ -60,6 +64,10 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+        
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto normal;
     }
@@ -104,6 +112,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
 
       goto robust;
@@ -149,6 +161,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto continue_pi_non_robust;
 
@@ -171,6 +187,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
 
       goto continue_pi_robust;
@@ -237,6 +257,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
*mutex, int decr)
 
       if (--mutex->__data.__count != 0)
 	/* We still hold the mutex.  */
+
+        /* currently, the systemtap pthread probe does not have */
+        /* probe point here because the thread still owns the lock */
+        /* this might change in the future */
 	return 0;
       goto pp;
 
@@ -272,6 +296,9 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
int decr)
 			PTHREAD_MUTEX_PSHARED (mutex));
 
       int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
+
+      LIBC_PROBE (mutex_release, 1, mutex);
+
       return __pthread_tpp_change_priority (oldprio, -1);
 
     default:
@@ -279,6 +306,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
int decr)
       return EINVAL;
     }
 
+  LIBC_PROBE (mutex_release, 1, mutex);
   return 0;
 }
 
diff --git a/nptl/pthread_rwlock_destroy.c
b/nptl/pthread_rwlock_destroy.c
index 28fd24b..84aa693 100644
--- a/nptl/pthread_rwlock_destroy.c
+++ b/nptl/pthread_rwlock_destroy.c
@@ -18,12 +18,15 @@
    02111-1307 USA.  */
 
 #include "pthreadP.h"
+#include <stap-probe.h>
 
 
 int
 __pthread_rwlock_destroy (rwlock)
      pthread_rwlock_t *rwlock;
 {
+  LIBC_PROBE (rwlock_destroy, 1, rwlock);
+
   /* Nothing to be done.  For now.  */
   return 0;
 }
diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
index 31eb508..26b42b0 100644
--- a/nptl/pthread_rwlock_rdlock.c
+++ b/nptl/pthread_rwlock_rdlock.c
@@ -22,6 +22,7 @@
 #include <lowlevellock.h>
 #include <pthread.h>
 #include <pthreadP.h>
+#include <stap-probe.h>
 
 
 /* Acquire read lock for RWLOCK.  */
@@ -31,6 +32,8 @@ __pthread_rwlock_rdlock (rwlock)
 {
   int result = 0;
 
+  LIBC_PROBE (rlock_entry, 1, rwlock);
+
   /* Make sure we are along.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
@@ -49,6 +52,12 @@ __pthread_rwlock_rdlock (rwlock)
 	      --rwlock->__data.__nr_readers;
 	      result = EAGAIN;
 	    }
+          else
+            {
+              /* systemtap pthread probe - this is the only place where
*/
+              /* we get this read-write lock */
+              LIBC_PROBE (rwlock_acquire_read, 1, rwlock);
+            }
 
 	  break;
 	}
diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
index a7ef71a..a6e8d87 100644
--- a/nptl/pthread_rwlock_unlock.c
+++ b/nptl/pthread_rwlock_unlock.c
@@ -22,11 +22,14 @@
 #include <lowlevellock.h>
 #include <pthread.h>
 #include <pthreadP.h>
+#include <stap-probe.h>
 
 /* Unlock RWLOCK.  */
 int
 __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
 {
+  LIBC_PROBE (rwlock_unlock, 1, rwlock);
+
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
   if (rwlock->__data.__writer)
     rwlock->__data.__writer = 0;
diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
index 64fe970..0c3d471 100644
--- a/nptl/pthread_rwlock_wrlock.c
+++ b/nptl/pthread_rwlock_wrlock.c
@@ -22,6 +22,7 @@
 #include <lowlevellock.h>
 #include <pthread.h>
 #include <pthreadP.h>
+#include <stap-probe.h>
 
 
 /* Acquire write lock for RWLOCK.  */
@@ -31,6 +32,8 @@ __pthread_rwlock_wrlock (rwlock)
 {
   int result = 0;
 
+  LIBC_PROBE (wlock_entry, 1, rwlock);
+
   /* Make sure we are along.  */
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
 
@@ -41,6 +44,11 @@ __pthread_rwlock_wrlock (rwlock)
 	{
 	  /* Mark self as writer.  */
 	  rwlock->__data.__writer = THREAD_GETMEM (THREAD_SELF, tid);
+
+          /* systemtap pthread probe - this is the only place where we
can */
+          /* get this read-write lock. */
+          LIBC_PROBE (rwlock_acquire_write, 1, rwlock);
+
 	  break;
 	}
 
diff --git a/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
b/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
index 3195db2..3457b18 100644
--- a/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
+++ b/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
@@ -22,6 +22,15 @@
 #include <kernel-features.h>
 #include <lowlevellock.h>
 
+#if defined IN_LIB || !defined NOT_IN_libc
+ #include <stap-probe.h>
+ #define PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(arg1)
LIBC_PROBE_ASM(lll_lock_wait_private, arg1)
+ #define PTHREAD_PROBE_LL_LOCKWAIT(arg1)
LIBC_PROBE_ASM(lll_lock_wait, arg1)
+#else
+ #define PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(arg1)
+ #define PTHREAD_PROBE_LL_LOCKWAIT(arg1)
+#endif
+
 	.text
 
 #ifdef __ASSUME_PRIVATE_FUTEX
@@ -91,7 +100,8 @@ __lll_lock_wait_private:
 	cmpl	%edx, %eax	/* NB:	 %edx == 2 */
 	jne	2f
 
-1:	movl	$SYS_futex, %eax
+1:	PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(%rdi)
+	movl	$SYS_futex, %eax
 	syscall
 
 2:	movl	%edx, %eax
@@ -130,7 +140,8 @@ __lll_lock_wait:
 	cmpl	%edx, %eax	/* NB:	 %edx == 2 */
 	jne	2f
 
-1:	movl	$SYS_futex, %eax
+1:	PTHREAD_PROBE_LL_LOCKWAIT(%rdi)
+	movl	$SYS_futex, %eax
 	syscall
 
 2:	movl	%edx, %eax




On Wed, 2010-09-29 at 13:17 -0400, Rayson Ho wrote:
> With the new sdt.h and translator in the systemtap git tree, I added the
> probes in assembly code in lowlevellock.S to only trace mutex lock calls
> that are contented (ie. those that end up calling futex(2)).
> 
> Additions & modifications in this revision:
> 1) nptl/DESIGN-systemtap-probes.txt - brief docs about the probes
> 2) nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S - added 2 probes
> in the assembly routines right before they perform the SYS_futex
> syscall.
> 
> (Please see the patch at the end of this message)
> 
> The micro-benchmark results are much better, with a simple program that
> does nothing much but pthread_mutex_lock() & pthread_mutex_unlock():
> 
> 
> #include <stdio.h>
> #include <pthread.h>
> 
> pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
> 
> thread()
> {
>  int i;
> 
>  for (i=0;i<100000000;i++)
>  {
>    pthread_mutex_lock(&lock);
>    pthread_mutex_unlock(&lock);
>  }
> }
> 
> main()
> {
>  pthread_t tid, tid2, tid3, tid4;
> 
>  pthread_create(&tid,  NULL, thread, NULL);
>  pthread_create(&tid2, NULL, thread, NULL);
>  pthread_create(&tid3, NULL, thread, NULL);
>  pthread_create(&tid4, NULL, thread, NULL);
> 
> 
>  thread();
> 
>  pthread_join(tid,  NULL);
>  pthread_join(tid2, NULL);
>  pthread_join(tid3, NULL);
>  pthread_join(tid4, NULL);
> 
> }
> 
> With 5 threads (the version above), there were only 408613 calls to
> futex(2), which reduced the firing of mutex_acquire() by 1223.6 times! I
> repeated the same test with smaller number (2-4) of threads, and in all
> cases, the number of times futex(2) is entered to get the lock is low --
> which is similar to the behavior of well-written threaded code (using
> DTrace on OpenSolaris, MySQL was not extremely contented.)
> 
> So instead of a slow-down of several times when I benchmarked this
> micro-benchmark with the older probes, we are only several % slower than
> the same code without being instrumented by systemtap. (I benchmarked
> the code on my laptop, and I will repeat the benchmark again on a much
> quieter machine, esp. with X running and release the final results.)
> 
> Rayson
> 
> 
> 
> 
> diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
> index 43ca44c..2f1a4c1 100644
> --- a/nptl/pthreadP.h
> +++ b/nptl/pthreadP.h
> @@ -33,6 +33,7 @@
>  #include <atomic.h>
>  #include <kernel-features.h>
>  
> +#include "pthread_probe.h"
>  
>  /* Atomic operations on TLS memory.  */
>  #ifndef THREAD_ATOMIC_CMPXCHG_VAL
> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index 34d83f9..66f44cb 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -296,6 +296,11 @@ start_thread (void *arg)
>  	  CANCEL_RESET (oldtype);
>  	}
>  
> +      /* SystemTap probe
> +         All of the normal thread creation work would
> +         be done after this point */
> +      PTHREAD_PROBE_START(pd->arg);
> +
>        /* Run the code the user provided.  */
>  #ifdef CALL_THREAD_FCT
>        THREAD_SETMEM (pd, result, CALL_THREAD_FCT (pd));
> @@ -552,6 +557,9 @@ __pthread_create_2_1 (newthread, attr,
> start_routine, arg)
>    /* Pass the descriptor to the caller.  */
>    *newthread = (pthread_t) pd;
>  
> +  /* Systemtap probe */
> +  PTHREAD_PROBE_CREATE(newthread, start_routine, arg);
> +
>    /* Start the thread.  */
>    return create_thread (pd, iattr, STACK_VARIABLES_ARGS);
>  }
> diff --git a/nptl/pthread_join.c b/nptl/pthread_join.c
> index 6a87a8b..58171a3 100644
> --- a/nptl/pthread_join.c
> +++ b/nptl/pthread_join.c
> @@ -55,6 +55,8 @@ pthread_join (threadid, thread_return)
>    struct pthread *self = THREAD_SELF;
>    int result = 0;
>  
> +  PTHREAD_PROBE_JOIN(threadid);
> +
>    /* During the wait we change to asynchronous cancellation.  If we
>       are canceled the thread we are waiting for must be marked as
>       un-wait-ed for again.  */
> @@ -110,5 +112,7 @@ pthread_join (threadid, thread_return)
>        __free_tcb (pd);
>      }
>  
> +  PTHREAD_PROBE_JOIN_RET(threadid, result);
> +
>    return result;
>  }
> diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
> index e2c9f8a..2217f58 100644
> --- a/nptl/pthread_mutex_destroy.c
> +++ b/nptl/pthread_mutex_destroy.c
> @@ -29,6 +29,8 @@ __pthread_mutex_destroy (mutex)
>        && mutex->__data.__nusers != 0)
>      return EBUSY;
>  
> +  PTHREAD_PROBE_MUTEX_DESTROY(mutex);
> +
>    /* Set to an invalid value.  */
>    mutex->__data.__kind = -1;
>  
> diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
> index d9b1ef0..bf395dd 100644
> --- a/nptl/pthread_mutex_init.c
> +++ b/nptl/pthread_mutex_init.c
> @@ -45,6 +45,8 @@ __pthread_mutex_init (mutex, mutexattr)
>  
>    assert (sizeof (pthread_mutex_t) <= __SIZEOF_PTHREAD_MUTEX_T);
>  
> +  PTHREAD_PROBE_MUTEX_INIT(mutex);
> +
>    imutexattr = (const struct pthread_mutexattr *) mutexattr ?:
> &default_attr;
>  
>    /* Sanity checks.  */
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 50dc188..a4ccefe 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -48,6 +48,10 @@ __pthread_mutex_lock (mutex)
>    assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
>  
>    unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
> +
> +  /* systemtap marker */
> +  PTHREAD_PROBE_MUTEX_ENTRY(mutex);
> +
>    if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
>      return __pthread_mutex_lock_full (mutex);
>  
> @@ -60,6 +64,8 @@ __pthread_mutex_lock (mutex)
>        /* Normal mutex.  */
>        LLL_MUTEX_LOCK (mutex);
>        assert (mutex->__data.__owner == 0);
> +
> +      PTHREAD_PROBE_MUTEX_BLOCK(mutex);
>      }
>    else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
>      {
> @@ -75,6 +81,11 @@ __pthread_mutex_lock (mutex)
>  
>  	  ++mutex->__data.__count;
>  
> +          /* currently, the systemtap pthread probe does not have a */
> +          /* probe point here because the thread already owns this */
> +          /* recursive lock before the call to this function. */
> +          /* this might change in the future */
> +
>  	  return 0;
>  	}
>  
> @@ -83,6 +94,8 @@ __pthread_mutex_lock (mutex)
>  
>        assert (mutex->__data.__owner == 0);
>        mutex->__data.__count = 1;
> +
> +      PTHREAD_PROBE_MUTEX_BLOCK(mutex);
>      }
>    else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
>      {
> @@ -94,6 +107,7 @@ __pthread_mutex_lock (mutex)
>  	  int cnt = 0;
>  	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
>  			     mutex->__data.__spins * 2 + 10);
> +
>  	  do
>  	    {
>  	      if (cnt++ >= max_cnt)
> @@ -108,6 +122,8 @@ __pthread_mutex_lock (mutex)
>  	    }
>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
> +          PTHREAD_PROBE_MUTEX_BLOCK(mutex);
> +
>  	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>  	}
>        assert (mutex->__data.__owner == 0);
> @@ -127,6 +143,8 @@ __pthread_mutex_lock (mutex)
>    ++mutex->__data.__nusers;
>  #endif
>  
> +  PTHREAD_PROBE_MUTEX_ACQUIRE(mutex);
> +
>    return 0;
>  }
>  
> @@ -277,6 +295,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  
>  		++mutex->__data.__count;
>  
> +                /* currently, the systemtap pthread probe does not have
> a */
> +                /* probe point here because the thread already owns
> this */
> +                /* recursive lock before the call to this function. */
> +                /* this might change in the future */
>  		return 0;
>  	      }
>  	  }
> @@ -393,6 +415,11 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  		  /* Overflow of the counter.  */
>  		  return EAGAIN;
>  
> +                 /* currently, the systemtap pthread probe does not
> have a */
> +                 /* probe point here because the thread already owns
> this */
> +                 /* recursive lock before the call to this function. */
> +                 /* this might change in the future */
> +
>  		++mutex->__data.__count;
>  
>  		return 0;
> @@ -442,8 +469,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  		  break;
>  
>  		if (oldval != ceilval)
> +                {
>  		  lll_futex_wait (&mutex->__data.__lock, ceilval | 2,
>  				  PTHREAD_MUTEX_PSHARED (mutex));
> +                }
>  	      }
>  	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
>  							ceilval | 2, ceilval)
> @@ -451,6 +480,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  	  }
>  	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
>  
> +        PTHREAD_PROBE_MUTEX_BLOCK(mutex);
> +
>  	assert (mutex->__data.__owner == 0);
>  	mutex->__data.__count = 1;
>        }
> @@ -467,6 +498,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>    ++mutex->__data.__nusers;
>  #endif
>  
> +  PTHREAD_PROBE_MUTEX_ACQUIRE(mutex);
> +
>    return 0;
>  }
>  #ifndef __pthread_mutex_lock
> diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
> index f9fe10b..50cbc5c 100644
> --- a/nptl/pthread_mutex_unlock.c
> +++ b/nptl/pthread_mutex_unlock.c
> @@ -50,6 +50,7 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
>  
>        /* Unlock.  */
>        lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
> +      PTHREAD_PROBE_MUTEX_RELEASE(mutex);
>        return 0;
>      }
>    else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
> @@ -60,6 +61,10 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +        
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>        goto normal;
>      }
> @@ -104,6 +109,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>  
>        goto robust;
> @@ -149,6 +158,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>        goto continue_pi_non_robust;
>  
> @@ -171,6 +184,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>  
>        goto continue_pi_robust;
> @@ -237,6 +254,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> *mutex, int decr)
>  
>        if (--mutex->__data.__count != 0)
>  	/* We still hold the mutex.  */
> +
> +        /* currently, the systemtap pthread probe does not have */
> +        /* probe point here because the thread still owns the lock */
> +        /* this might change in the future */
>  	return 0;
>        goto pp;
>  
> @@ -272,6 +293,9 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
> int decr)
>  			PTHREAD_MUTEX_PSHARED (mutex));
>  
>        int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
> +
> +      PTHREAD_PROBE_MUTEX_RELEASE(mutex);
> +
>        return __pthread_tpp_change_priority (oldprio, -1);
>  
>      default:
> @@ -279,6 +303,7 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
> int decr)
>        return EINVAL;
>      }
>  
> +  PTHREAD_PROBE_MUTEX_RELEASE(mutex);
>    return 0;
>  }
>  
> diff --git a/nptl/pthread_rwlock_destroy.c
> b/nptl/pthread_rwlock_destroy.c
> index 28fd24b..b4cd7ab 100644
> --- a/nptl/pthread_rwlock_destroy.c
> +++ b/nptl/pthread_rwlock_destroy.c
> @@ -24,6 +24,8 @@ int
>  __pthread_rwlock_destroy (rwlock)
>       pthread_rwlock_t *rwlock;
>  {
> +  PTHREAD_PROBE_RWLOCK_DESTROY(rwlock);
> +
>    /* Nothing to be done.  For now.  */
>    return 0;
>  }
> diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
> index 31eb508..954b414 100644
> --- a/nptl/pthread_rwlock_rdlock.c
> +++ b/nptl/pthread_rwlock_rdlock.c
> @@ -31,6 +31,8 @@ __pthread_rwlock_rdlock (rwlock)
>  {
>    int result = 0;
>  
> +  PTHREAD_PROBE_RLOCK_ENTRY(rwlock);
> +
>    /* Make sure we are along.  */
>    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
>  
> @@ -49,6 +51,12 @@ __pthread_rwlock_rdlock (rwlock)
>  	      --rwlock->__data.__nr_readers;
>  	      result = EAGAIN;
>  	    }
> +          else
> +            {
> +              /* systemtap pthread probe - this is the only place where
> */
> +              /* we get this read-write lock */
> +              PTHREAD_PROBE_RWLOCK_ACQUIRE_READ(rwlock);
> +            }
>  
>  	  break;
>  	}
> diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
> index a7ef71a..e7d1568 100644
> --- a/nptl/pthread_rwlock_unlock.c
> +++ b/nptl/pthread_rwlock_unlock.c
> @@ -27,6 +27,8 @@
>  int
>  __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
>  {
> +  PTHREAD_PROBE_RWLOCK_UNLOCK(rwlock);
> +
>    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
>    if (rwlock->__data.__writer)
>      rwlock->__data.__writer = 0;
> diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
> index 64fe970..abf3083 100644
> --- a/nptl/pthread_rwlock_wrlock.c
> +++ b/nptl/pthread_rwlock_wrlock.c
> @@ -31,6 +31,8 @@ __pthread_rwlock_wrlock (rwlock)
>  {
>    int result = 0;
>  
> +  PTHREAD_PROBE_WLOCK_ENTRY(rwlock);
> +
>    /* Make sure we are along.  */
>    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
>  
> @@ -41,6 +43,11 @@ __pthread_rwlock_wrlock (rwlock)
>  	{
>  	  /* Mark self as writer.  */
>  	  rwlock->__data.__writer = THREAD_GETMEM (THREAD_SELF, tid);
> +
> +          /* systemtap pthread probe - this is the only place where we
> can */
> +          /* get this read-write lock. */
> +          PTHREAD_PROBE_RWLOCK_ACQUIRE_WRITE(rwlock);
> +
>  	  break;
>  	}
> diff --git a/nptl/pthread_probe.h b/nptl/pthread_probe.h
> index e69de29..8178ac9 100644
> --- a/nptl/pthread_probe.h
> +++ b/nptl/pthread_probe.h
> @@ -0,0 +1,33 @@
> +#include <sys/sdt.h>
> +
> +/* #include "new-sdt3.h" */
> +
> +#define PTHREAD_PROBE_CREATE(arg1,arg2,arg3)
> STAP_PROBE3(provider,create,arg1,arg2,arg3)
> +#define PTHREAD_PROBE_JOIN(arg1) STAP_PROBE1(provider,join,arg1)
> +#define PTHREAD_PROBE_JOIN_RET(arg1,arg2)
> STAP_PROBE2(provider,join_ret,arg1,arg2)
> +#define PTHREAD_PROBE_START(arg1) STAP_PROBE1(provider,start,arg1)
> +#define PTHREAD_PROBE_END(arg1) STAP_PROBE1(provider,end,arg1)
> +#define PTHREAD_PROBE_MUTEX_INIT(arg1)
> STAP_PROBE1(provider,mutex_init,arg1)
> +#define PTHREAD_PROBE_MUTEX_DESTROY(arg1)
> STAP_PROBE1(provider,mutex_destroy,arg1)
> +#define PTHREAD_PROBE_MUTEX_ACQUIRE(arg1)
> STAP_PROBE1(provider,mutex_acquire,arg1)
> +#define PTHREAD_PROBE_MUTEX_RELEASE(arg1)
> STAP_PROBE1(provider,mutex_release,arg1)
> +#define PTHREAD_PROBE_MUTEX_BLOCK(arg1)
> STAP_PROBE1(provider,mutex_block,arg1)
> +#define PTHREAD_PROBE_COND_INIT(arg1)
> STAP_PROBE1(provider,cond_init,arg1)
> +#define PTHREAD_PROBE_COND_DESTROY(arg1)
> STAP_PROBE1(provider,cond_destroy,arg1)
> +#define PTHREAD_PROBE_COND_WAIT(arg1,arg2)
> STAP_PROBE2(provider,cond_wait,arg1,arg2)
> +#define PTHREAD_PROBE_COND_WAKE(arg1,arg2)
> STAP_PROBE2(provider,cond_wake,arg1,arg2)
> +#define PTHREAD_PROBE_COND_SIGNAL(arg1)
> STAP_PROBE1(provider,cond_signal,arg1)
> +#define PTHREAD_PROBE_COND_BROADCAST(arg1)
> STAP_PROBE1(provider,cond_broadcast,arg1)
> +#define PTHREAD_PROBE_RWLOCK_ACQUIRE_WRITE(arg1)
> STAP_PROBE1(provider,rwlock_acquire_write,arg1)
> +#define PTHREAD_PROBE_RWLOCK_ACQUIRE_READ(arg1)
> STAP_PROBE1(provider,rwlock_acquire_read,arg1)
> +#define PTHREAD_PROBE_RWLOCK_DESTROY(arg1)
> STAP_PROBE1(provider,rwlock_destroy,arg1)
> +#define PTHREAD_PROBE_RWLOCK_UNLOCK(arg1)
> STAP_PROBE1(provider,rwlock_unlock,arg1)
> +#define PTHREAD_PROBE_MUTEX_ENTRY(arg1)
> STAP_PROBE1(provider,mutex_entry,arg1)
> +#define PTHREAD_PROBE_RLOCK_ENTRY(arg1)
> STAP_PROBE1(provider,rlock_entry,arg1)
> +#define PTHREAD_PROBE_RLOCK_BLOCK(arg1)
> STAP_PROBE1(provider,rlock_block,arg1)
> +#define PTHREAD_PROBE_WLOCK_ENTRY(arg1)
> STAP_PROBE1(provider,wlock_entry,arg1)
> +#define PTHREAD_PROBE_WLOCK_BLOCK(arg1)
> STAP_PROBE1(provider,wlock_block,arg1)
> +
> +/* the following probe points are in low-level assembly/inline assembly
> code */
> +#define PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(arg1)
> STAP_PROBE1(provider,lock_wait_private,arg1)
> +#define PTHREAD_PROBE_LL_LOCKWAIT(arg1)
> STAP_PROBE1(provider,lock_wait,arg1)
> 
> diff --git a/nptl/DESIGN-systemtap-probes.txt
> b/nptl/DESIGN-systemtap-probes.txt
> index e69de29..d8bbbd7 100644
> --- a/nptl/DESIGN-systemtap-probes.txt
> +++ b/nptl/DESIGN-systemtap-probes.txt
> @@ -0,0 +1,34 @@
> +Systemtap is a dynamic tracingi/instrumenting tool available on Linux.
> Probes that are not fired at run time have extremely close to zero
> overhead.
> +
> +The following probes are available for NPTL:
> +
> +Thread creation & Join Probes
> +=============================
> +create   - probe for pthread_create(3) - arg1 = thread ID, arg2 =
> start_routine, arg3 = arguments
> +start    - probe for actual thread creation, arg1 = struct pthread
> (members include thread ID, process ID)
> +join     - probe for pthread_join(3)   - arg1 = thread ID
> +join_ret - probe for pthread_join(3) return - arg1 = thread ID, arg2 =
> return value
> +
> +Lock-related Probes
> +===================
> +mutex_init    - probe for pthread_mutex_init(3) - arg1 = address of
> mutex lock
> +mutex_acquired- probe for pthread_mutex_lock(3) - arg1 = address of
> mutex lock
> +mutex_block   - probe for resume from _possible_ mutex block event -
> arg1 = address of mutex lock
> +mutex_entry   - probe for entry to the pthread_mutex_lock(3) function,
> - arg1 = address of mutex lock
> +mutex_release - probe for pthread_mutex_unlock(3) after the successful
> release of a mutex lock - arg1 = address of mutex lock
> +mutex_destroy - probe for pthread_mutex_destroy(3) - arg1 = address of
> mutex lock
> +rwlock_destroy- probe for pthread_rwlock_destroy(3) - arg1 = address of
> rw lock
> +rwlock_acquire_write-probe for pthread_rwlock_wrlock(3) - arg1 =
> address of rw lock
> +rwlock_unlock - probe for pthread_rwlock_unlock(3) - arg1 = address of
> rw lock
> +
> +lock_wait         - probe in low-level lock code, only fired when futex
> is called (i.e. when trying to acquire a contented lock)
> +lock_wait_private - probe in low-level lock code, only fired when futex
> is called (i.e. when trying to acquire a contented lock)
> +
> +Condition variable Probes
> +=========================
> +cond_init - probe for pthread_cond_init(3) - arg1 = condition, arg2 =
> attr
> +cond_destroy- probe for pthread_condattr_destroy(3) - arg1 = attr
> +cond_wait - probe for pthread_cond_wait(3) - arg1 = condition, arg2 =
> mutex lock
> +cond_signal - probe for pthread_cond_signal(3) - arg1 = condition
> +cond_broadcast - probe for pthread_cond_broadcast(3) - arg1 = condition
> +
> diff --git a/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> b/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> index 8de9cf4..b6d9847 100644
> --- a/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> +++ b/nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
> @@ -21,6 +21,7 @@
>  #include <pthread-errnos.h>
>  #include <kernel-features.h>
>  #include <lowlevellock.h>
> +#include <pthread_probe.h>
>  
>         .text
>  
> @@ -130,7 +131,8 @@ __lll_lock_wait:
>         cmpl    %edx, %eax      /* NB:   %edx == 2 */
>         jne     2f
>  
> -1:     movl    $SYS_futex, %eax
> +1:     PTHREAD_PROBE_LL_LOCKWAIT(%rdi)
> +        movl   $SYS_futex, %eax
>         syscall
>  
>  2:     movl    %edx, %eax
> @@ -180,7 +182,8 @@ __lll_timedlock_wait:
>         cmpl    %edx, %eax
>         jne     2f
>  
> -1:     movl    $SYS_futex, %eax
> +1:     PTHREAD_PROBE_LL_LOCKWAIT_PRIVATE(%rdi)
> +        movl   $SYS_futex, %eax
>         movl    $2, %edx
>         syscall
>  
> 
> On Wed, 2010-08-11 at 09:10 -0400, Rayson Ho wrote:
> > Thanks Roland,
> > 
> > I've ported my pthread probe changes against your systemtap glibc tree.
> > I think the interface of the new sdt.h and that of the LIBC_PROBE()
> > marco is almost the identical, but I am not sure if we are going to have
> > macros like PROBE2() in the new-sdt -- not an issue at all as it is just
> > a handy macro.
> > 
> > I've inline-attached the diff, and the biggest change is that
> > "pthread_probe.h" used with the old sdt.h is now gone.
> > 
> > I have not attached the diffs for the probes in the inline assembly code
> > yet, I will do that soon.
> > 
> > Rayson
> > 
> > 
> > 
> > diff --git a/nptl/pthread_join.c b/nptl/pthread_join.c
> > index 6a87a8b..9f67a58 100644
> > --- a/nptl/pthread_join.c
> > +++ b/nptl/pthread_join.c
> > @@ -23,6 +23,8 @@
> >  #include <atomic.h>
> >  #include "pthreadP.h"
> >  
> > +#include <stap-probe.h>
> > +
> >  
> >  static void
> >  cleanup (void *arg)
> > @@ -55,6 +57,8 @@ pthread_join (threadid, thread_return)
> >    struct pthread *self = THREAD_SELF;
> >    int result = 0;
> >  
> > +  LIBC_PROBE(pthread_join, 1, threadid);
> > +
> >    /* During the wait we change to asynchronous cancellation.  If we
> >       are canceled the thread we are waiting for must be marked as
> >       un-wait-ed for again.  */
> > @@ -110,5 +114,7 @@ pthread_join (threadid, thread_return)
> >        __free_tcb (pd);
> >      }
> >  
> > +  LIBC_PROBE(pthread_join_ret, 2, threadid, result);  
> > +
> >    return result;
> >  }
> > diff --git a/nptl/pthread_mutex_destroy.c b/nptl/pthread_mutex_destroy.c
> > index e2c9f8a..dd690cd 100644
> > --- a/nptl/pthread_mutex_destroy.c
> > +++ b/nptl/pthread_mutex_destroy.c
> > @@ -20,6 +20,7 @@
> >  #include <errno.h>
> >  #include "pthreadP.h"
> >  
> > +#include <stap-probe.h>
> >  
> >  int
> >  __pthread_mutex_destroy (mutex)
> > @@ -32,6 +33,8 @@ __pthread_mutex_destroy (mutex)
> >    /* Set to an invalid value.  */
> >    mutex->__data.__kind = -1;
> >  
> > +  LIBC_PROBE(pthread_mutex_destroy, 1, mutex);
> > +
> >    return 0;
> >  }
> >  strong_alias (__pthread_mutex_destroy, pthread_mutex_destroy)
> > diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
> > index d9b1ef0..e4fcae7 100644
> > --- a/nptl/pthread_mutex_init.c
> > +++ b/nptl/pthread_mutex_init.c
> > @@ -24,6 +24,8 @@
> >  #include <kernel-features.h>
> >  #include "pthreadP.h"
> >  
> > +#include <stap-probe.h>
> > +
> >  static const struct pthread_mutexattr default_attr =
> >    {
> >      /* Default is a normal mutex, not shared between processes.  */
> > @@ -47,6 +49,8 @@ __pthread_mutex_init (mutex, mutexattr)
> >  
> >    imutexattr = (const struct pthread_mutexattr *) mutexattr ?:
> > &default_attr;
> >  
> > +  LIBC_PROBE(pthread_mutex_init, 2, mutex, mutexattr);
> > +
> >    /* Sanity checks.  */
> >    switch (__builtin_expect (imutexattr->mutexkind
> >  			    & PTHREAD_MUTEXATTR_PROTOCOL_MASK,
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index 50dc188..b754372 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -25,6 +25,7 @@
> >  #include "pthreadP.h"
> >  #include <lowlevellock.h>
> >  
> > +#include <stap-probe.h>
> >  
> >  #ifndef LLL_MUTEX_LOCK
> >  # define LLL_MUTEX_LOCK(mutex) \
> > @@ -48,6 +49,10 @@ __pthread_mutex_lock (mutex)
> >    assert (sizeof (mutex->__size) >= sizeof (mutex->__data));
> >  
> >    unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
> > +
> > +  /* systemtap marker */
> > +  LIBC_PROBE(pthread_mutex_lock, 1, mutex);
> > +
> >    if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
> >      return __pthread_mutex_lock_full (mutex);
> >  
> > @@ -60,6 +65,8 @@ __pthread_mutex_lock (mutex)
> >        /* Normal mutex.  */
> >        LLL_MUTEX_LOCK (mutex);
> >        assert (mutex->__data.__owner == 0);
> > +
> > +      LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
> >      }
> >    else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
> >      {
> > @@ -75,6 +82,11 @@ __pthread_mutex_lock (mutex)
> >  
> >  	  ++mutex->__data.__count;
> >  
> > +          /* currently, the systemtap pthread probe does not have a */
> > +          /* probe point here because the thread already owns this */
> > +          /* recursive lock before the call to this function. */
> > +          /* this might change in the future */
> > +
> >  	  return 0;
> >  	}
> >  
> > @@ -83,6 +95,8 @@ __pthread_mutex_lock (mutex)
> >  
> >        assert (mutex->__data.__owner == 0);
> >        mutex->__data.__count = 1;
> > +
> > +      LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
> >      }
> >    else if (__builtin_expect (type == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
> >      {
> > @@ -94,6 +108,7 @@ __pthread_mutex_lock (mutex)
> >  	  int cnt = 0;
> >  	  int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
> >  			     mutex->__data.__spins * 2 + 10);
> > +
> >  	  do
> >  	    {
> >  	      if (cnt++ >= max_cnt)
> > @@ -108,6 +123,8 @@ __pthread_mutex_lock (mutex)
> >  	    }
> >  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> >  
> > +          LIBC_PROBE(pthread_mutex_lock_block, 1, mutex);
> > +
> >  	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> >  	}
> >        assert (mutex->__data.__owner == 0);
> > @@ -127,6 +144,8 @@ __pthread_mutex_lock (mutex)
> >    ++mutex->__data.__nusers;
> >  #endif
> >  
> > +  LIBC_PROBE(pthread_mutex_lock_acquire, 1, mutex);
> > +
> >    return 0;
> >  }
> >  
> > @@ -277,6 +296,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >  
> >  		++mutex->__data.__count;
> >  
> > +                /* currently, the systemtap pthread probe does not have
> > a */
> > +                /* probe point here because the thread already owns
> > this */
> > +                /* recursive lock before the call to this function. */
> > +                /* this might change in the future */
> >  		return 0;
> >  	      }
> >  	  }
> > @@ -393,6 +416,11 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >  		  /* Overflow of the counter.  */
> >  		  return EAGAIN;
> >  
> > +                 /* currently, the systemtap pthread probe does not
> > have a */
> > +                 /* probe point here because the thread already owns
> > this */
> > +                 /* recursive lock before the call to this function. */
> > +                 /* this might change in the future */
> > +
> >  		++mutex->__data.__count;
> >  
> >  		return 0;
> > @@ -451,6 +479,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >  	  }
> >  	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
> >  
> > +        LIBC_PROBE(pthread_mutex_lock_full_block, 1, mutex);
> > +
> >  	assert (mutex->__data.__owner == 0);
> >  	mutex->__data.__count = 1;
> >        }
> > @@ -467,6 +497,8 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >    ++mutex->__data.__nusers;
> >  #endif
> >  
> > +  LIBC_PROBE(pthread_mutex_lock_full_acquire, 1, mutex);
> > +
> >    return 0;
> >  }
> >  #ifndef __pthread_mutex_lock
> > diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
> > index f9fe10b..e0f305e 100644
> > --- a/nptl/pthread_mutex_unlock.c
> > +++ b/nptl/pthread_mutex_unlock.c
> > @@ -23,6 +23,8 @@
> >  #include "pthreadP.h"
> >  #include <lowlevellock.h>
> >  
> > +#include <stap-probe.h>
> > +
> >  static int
> >  internal_function
> >  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
> > @@ -50,6 +52,7 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
> >  
> >        /* Unlock.  */
> >        lll_unlock (mutex->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex));
> > +      LIBC_PROBE(pthread_mutex_release, 1, mutex);
> >        return 0;
> >      }
> >    else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
> > @@ -60,6 +63,10 @@ __pthread_mutex_unlock_usercnt (mutex, decr)
> >  
> >        if (--mutex->__data.__count != 0)
> >  	/* We still hold the mutex.  */
> > +        
> > +        /* currently, the systemtap pthread probe does not have */
> > +        /* probe point here because the thread still owns the lock */
> > +        /* this might change in the future */
> >  	return 0;
> >        goto normal;
> >      }
> > @@ -104,6 +111,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> > *mutex, int decr)
> >  
> >        if (--mutex->__data.__count != 0)
> >  	/* We still hold the mutex.  */
> > +
> > +        /* currently, the systemtap pthread probe does not have */
> > +        /* probe point here because the thread still owns the lock */
> > +        /* this might change in the future */
> >  	return 0;
> >  
> >        goto robust;
> > @@ -149,6 +160,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> > *mutex, int decr)
> >  
> >        if (--mutex->__data.__count != 0)
> >  	/* We still hold the mutex.  */
> > +
> > +        /* currently, the systemtap pthread probe does not have */
> > +        /* probe point here because the thread still owns the lock */
> > +        /* this might change in the future */
> >  	return 0;
> >        goto continue_pi_non_robust;
> >  
> > @@ -171,6 +186,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> > *mutex, int decr)
> >  
> >        if (--mutex->__data.__count != 0)
> >  	/* We still hold the mutex.  */
> > +
> > +        /* currently, the systemtap pthread probe does not have */
> > +        /* probe point here because the thread still owns the lock */
> > +        /* this might change in the future */
> >  	return 0;
> >  
> >        goto continue_pi_robust;
> > @@ -237,6 +256,10 @@ __pthread_mutex_unlock_full (pthread_mutex_t
> > *mutex, int decr)
> >  
> >        if (--mutex->__data.__count != 0)
> >  	/* We still hold the mutex.  */
> > +
> > +        /* currently, the systemtap pthread probe does not have */
> > +        /* probe point here because the thread still owns the lock */
> > +        /* this might change in the future */
> >  	return 0;
> >        goto pp;
> >  
> > @@ -272,6 +295,9 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
> > int decr)
> >  			PTHREAD_MUTEX_PSHARED (mutex));
> >  
> >        int oldprio = newval >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
> > +
> > +      LIBC_PROBE(pthread_mutex_release, 1, mutex);
> > +
> >        return __pthread_tpp_change_priority (oldprio, -1);
> >  
> >      default:
> > @@ -279,6 +305,8 @@ __pthread_mutex_unlock_full (pthread_mutex_t *mutex,
> > int decr)
> >        return EINVAL;
> >      }
> >  
> > +  LIBC_PROBE(pthread_mutex_release, 1, mutex);
> > +
> >    return 0;
> >  }
> >  
> > diff --git a/nptl/pthread_rwlock_destroy.c
> > b/nptl/pthread_rwlock_destroy.c
> > index 28fd24b..b14de8f 100644
> > --- a/nptl/pthread_rwlock_destroy.c
> > +++ b/nptl/pthread_rwlock_destroy.c
> > @@ -19,12 +19,14 @@
> >  
> >  #include "pthreadP.h"
> >  
> > +#include <stap-probe.h>
> >  
> >  int
> >  __pthread_rwlock_destroy (rwlock)
> >       pthread_rwlock_t *rwlock;
> >  {
> >    /* Nothing to be done.  For now.  */
> > +  LIBC_PROBE(pthread_rwlock_destroy, 1, rwlock);
> >    return 0;
> >  }
> >  strong_alias (__pthread_rwlock_destroy, pthread_rwlock_destroy)
> > diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
> > index 31eb508..7b4d8f0 100644
> > --- a/nptl/pthread_rwlock_rdlock.c
> > +++ b/nptl/pthread_rwlock_rdlock.c
> > @@ -23,6 +23,8 @@
> >  #include <pthread.h>
> >  #include <pthreadP.h>
> >  
> > +#include <stap-probe.h>
> > +
> >  
> >  /* Acquire read lock for RWLOCK.  */
> >  int
> > @@ -31,6 +33,8 @@ __pthread_rwlock_rdlock (rwlock)
> >  {
> >    int result = 0;
> >  
> > +  LIBC(pthread_rwlock_rdlock, 1, rwlock);
> > +
> >    /* Make sure we are along.  */
> >    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
> >  
> > @@ -49,6 +53,13 @@ __pthread_rwlock_rdlock (rwlock)
> >  	      --rwlock->__data.__nr_readers;
> >  	      result = EAGAIN;
> >  	    }
> > +          else
> > +            {
> > +              /* systemtap pthread probe - this is the only place where
> > */
> > +              /* we get this read-write lock */
> > +              LIBC_PROBE(pthread_rwlock_rdlock, 1, rwlock);
> > +            }
> > +
> >  
> >  	  break;
> >  	}
> > diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
> > index a7ef71a..ba9620b 100644
> > --- a/nptl/pthread_rwlock_unlock.c
> > +++ b/nptl/pthread_rwlock_unlock.c
> > @@ -23,10 +23,14 @@
> >  #include <pthread.h>
> >  #include <pthreadP.h>
> >  
> > +#include <stap-probe.h>
> > +
> >  /* Unlock RWLOCK.  */
> >  int
> >  __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
> >  {
> > +  LIBC_PROBE(pthread_rwlock_unlock, 1, rwlock);
> > +
> >    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
> >    if (rwlock->__data.__writer)
> >      rwlock->__data.__writer = 0;
> > diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
> > index 64fe970..09b9454 100644
> > --- a/nptl/pthread_rwlock_wrlock.c
> > +++ b/nptl/pthread_rwlock_wrlock.c
> > @@ -23,6 +23,7 @@
> >  #include <pthread.h>
> >  #include <pthreadP.h>
> >  
> > +#include <stap-probe.h>
> >  
> >  /* Acquire write lock for RWLOCK.  */
> >  int
> > @@ -31,6 +32,8 @@ __pthread_rwlock_wrlock (rwlock)
> >  {
> >    int result = 0;
> >  
> > +  LIBC_PROBE(pthread_rwlock_wrlock, 1, rwlock);
> > +
> >    /* Make sure we are along.  */
> >    lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
> >  
> > @@ -41,6 +44,11 @@ __pthread_rwlock_wrlock (rwlock)
> >  	{
> >  	  /* Mark self as writer.  */
> >  	  rwlock->__data.__writer = THREAD_GETMEM (THREAD_SELF, tid);
> > +
> > +          /* systemtap pthread probe - this is the only place where we
> > can */
> > +          /* get this read-write lock. */
> > +          LIBC_PROBE(pthread_rwlock_wrlock_acquire, 1, rwlock);
> > +
> >  	  break;
> >  	}
> >  
> > 
> > 
> > On Mon, 2010-08-09 at 09:37 -0700, Roland McGrath wrote:
> > > > I have modified your sdt.h slightly and used it for the pthread
> > > > probes. Is there a special branch of systemtap or utrace that I need
> > > > to use in order to test/benchmark the overhead of the existance of the
> > > > new sdt probes in libpthread?
> > > 
> > > The translator work has yet to be done.  So at the moment all you could
> > > test is having the probes in there statically (i.e. addition of nop
> > > instructions) and not using them.
> > > 
> > > 
> > > Thanks,
> > > Roland
> > 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-10-06 14:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-03  3:07 revamp sdt.h Roland McGrath
2010-08-09 15:16 ` Rayson Ho
2010-08-09 16:38   ` Roland McGrath
2010-08-11 13:11     ` Rayson Ho
2010-09-08  0:44       ` Roland McGrath
2010-09-29 17:18       ` Adding systemtap probe points in pthread library (was: Re: revamp sdt.h) Rayson Ho
2010-10-06 14:27         ` RFC: " Rayson Ho
2010-09-03 20:36     ` revamp sdt.h Stan Cox
2010-09-03 20:38       ` Roland McGrath
2010-09-03 20:42         ` Stan Cox
2010-09-03 21:02           ` Roland McGrath
2010-09-07 15:08             ` Stan Cox
2010-09-08  0:27               ` Roland McGrath
2010-09-08 15:44                 ` Stan Cox
2010-09-08 17:37                   ` Mark Wielaard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).