public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* Implementing a generic binary trace interface.
@ 2006-01-24 15:20 Jose R. Santos
  2006-01-24 16:21 ` William Cohen
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Jose R. Santos @ 2006-01-24 15:20 UTC (permalink / raw)
  To: SystemTAP

Hi folks,

My team is currently implementing a trace tool using SystemTap that 
currently does logging by means of printf mechanism.  We want to move to 
a binary trace format but there is no such mechanism on SystemTap for 
doing this.  I've looked at what the folks at Hitachi have done with 
BTI, but this seems to force a specific trace format that is not 
suitable for what we need.  Ideally, the trace format should be left to 
the tapset using the interface not the BTI.  I propose to slightly alter 
the BTI from Hitachi to allow other trace implementations to use the 
trace format that's most convenient for the people implementing them.

To facilitate this, Tom Zanussi has been talking about implementing a 
basic form struct to the SystemTap language.  The basic idea of how the 
tapset would use the new BTI and struct data types looks like:

probe kernel.function("sys_open")
{
      trace_extra.hookid = HOOKID_OPEN;
      trace_extra.flags = $flags;
      trace_extra.mode = $mode;
      trace_extra.name = $filename;
      trace_extra.fd = $fd;

      lket_tracer(trace_extra);
}


function lket_trace(trace_extra:struct) {

          trace_format.timestamp = ...
          trace_format.cpu = ...
          trace_format.....
          .....

          trace(trace_format, trace_extra)
}


Unlike the BTI current implementation, the format of the trace hook is defined by lket_trace and not by the generic interface.  This design has the following benefits over the Hitachi interface.

1) It allows for anyone to implement their own trace hooks as they see fit. Making it a truly generic interface.

2) It does not limit the number or type of arguments that a trace hook can have.  The current implementation limits you to 16 data points of size long.


Aside from the support to add struct to the SystemTap language, the rest of the changes to the BTI should be pretty straight forward and they would not significantly impact the current work that Hitachi has done with their implementation.

Thoughts?

-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
@ 2006-01-24 16:21 ` William Cohen
  2006-01-24 17:03   ` Jose R. Santos
  2006-01-24 16:42 ` Mathieu Desnoyers
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: William Cohen @ 2006-01-24 16:21 UTC (permalink / raw)
  To: jrs; +Cc: SystemTAP

Jose R. Santos wrote:
> Hi folks,
> 
> My team is currently implementing a trace tool using SystemTap that 
> currently does logging by means of printf mechanism.  We want to move to 
> a binary trace format but there is no such mechanism on SystemTap for 
> doing this.  I've looked at what the folks at Hitachi have done with 
> BTI, but this seems to force a specific trace format that is not 
> suitable for what we need.  Ideally, the trace format should be left to 
> the tapset using the interface not the BTI.  I propose to slightly alter 
> the BTI from Hitachi to allow other trace implementations to use the 
> trace format that's most convenient for the people implementing them.
> 
> To facilitate this, Tom Zanussi has been talking about implementing a 
> basic form struct to the SystemTap language.  The basic idea of how the 
> tapset would use the new BTI and struct data types looks like:
> 
> probe kernel.function("sys_open")
> {
>      trace_extra.hookid = HOOKID_OPEN;
>      trace_extra.flags = $flags;
>      trace_extra.mode = $mode;
>      trace_extra.name = $filename;
>      trace_extra.fd = $fd;
> 
>      lket_tracer(trace_extra);
> }
> 
> 
> function lket_trace(trace_extra:struct) {
> 
>          trace_format.timestamp = ...
>          trace_format.cpu = ...
>          trace_format.....
>          .....
> 
>          trace(trace_format, trace_extra)
> }
> 
> 
> Unlike the BTI current implementation, the format of the trace hook is 
> defined by lket_trace and not by the generic interface.  This design has 
> the following benefits over the Hitachi interface.
> 
> 1) It allows for anyone to implement their own trace hooks as they see 
> fit. Making it a truly generic interface.
> 
> 2) It does not limit the number or type of arguments that a trace hook 
> can have.  The current implementation limits you to 16 data points of 
> size long.
> 
> 
> Aside from the support to add struct to the SystemTap language, the rest 
> of the changes to the BTI should be pretty straight forward and they 
> would not significantly impact the current work that Hitachi has done 
> with their implementation.
> 
> Thoughts?
> 
> -JRS

Are you proposing something like the trace buffers in perfmon2?

http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html
http://sourceforge.net/projects/perfmon2

The perfmon2 allows multiple types of trace buffers, each has a unique 
idetifier. The perfmon2 trace buffers can be interpretted in what ever 
manner desired. OProfile has used a version of this mechanism in perfmon 
to collect data on ia64 machines. The code that writes to the buffer can 
be custom. There is a default trace buffer that writes a data to the buffer.

-Will

-Will

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
  2006-01-24 16:21 ` William Cohen
@ 2006-01-24 16:42 ` Mathieu Desnoyers
  2006-01-24 16:55 ` Tom Zanussi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2006-01-24 16:42 UTC (permalink / raw)
  To: Jose R. Santos; +Cc: SystemTAP

Have you looked at the LTTng/LTTV trace format ?

It supports nested structures, unions, sequences, arrays. It uses completely
portable types and optionnaly does dynamic data alignment.

It's described there :

http://ltt.polymtl.ca/svn/ltt/branches/poly/doc/developer/format.html


If what you are looking for is something "truly generic", you might be
interested.


Mathieu


* Jose R. Santos (jrs@us.ibm.com) wrote:
> Hi folks,
> 
> My team is currently implementing a trace tool using SystemTap that 
> currently does logging by means of printf mechanism.  We want to move to 
> a binary trace format but there is no such mechanism on SystemTap for 
> doing this.  I've looked at what the folks at Hitachi have done with 
> BTI, but this seems to force a specific trace format that is not 
> suitable for what we need.  Ideally, the trace format should be left to 
> the tapset using the interface not the BTI.  I propose to slightly alter 
> the BTI from Hitachi to allow other trace implementations to use the 
> trace format that's most convenient for the people implementing them.
> 
> To facilitate this, Tom Zanussi has been talking about implementing a 
> basic form struct to the SystemTap language.  The basic idea of how the 
> tapset would use the new BTI and struct data types looks like:
> 
> probe kernel.function("sys_open")
> {
>      trace_extra.hookid = HOOKID_OPEN;
>      trace_extra.flags = $flags;
>      trace_extra.mode = $mode;
>      trace_extra.name = $filename;
>      trace_extra.fd = $fd;
> 
>      lket_tracer(trace_extra);
> }
> 
> 
> function lket_trace(trace_extra:struct) {
> 
>          trace_format.timestamp = ...
>          trace_format.cpu = ...
>          trace_format.....
>          .....
> 
>          trace(trace_format, trace_extra)
> }
> 
> 
> Unlike the BTI current implementation, the format of the trace hook is 
> defined by lket_trace and not by the generic interface.  This design has 
> the following benefits over the Hitachi interface.
> 
> 1) It allows for anyone to implement their own trace hooks as they see fit. 
> Making it a truly generic interface.
> 
> 2) It does not limit the number or type of arguments that a trace hook can 
> have.  The current implementation limits you to 16 data points of size long.
> 
> 
> Aside from the support to add struct to the SystemTap language, the rest of 
> the changes to the BTI should be pretty straight forward and they would not 
> significantly impact the current work that Hitachi has done with their 
> implementation.
> 
> Thoughts?
> 
> -JRS
> 
OpenPGP public key:              http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint:     8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
  2006-01-24 16:21 ` William Cohen
  2006-01-24 16:42 ` Mathieu Desnoyers
@ 2006-01-24 16:55 ` Tom Zanussi
  2006-01-26  2:27   ` Frank Ch. Eigler
  2006-01-24 17:18 ` Frank Ch. Eigler
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Tom Zanussi @ 2006-01-24 16:55 UTC (permalink / raw)
  To: jrs; +Cc: SystemTAP

Jose R. Santos writes:
 > Hi folks,
 > 
 > My team is currently implementing a trace tool using SystemTap that 
 > currently does logging by means of printf mechanism.  We want to move to 
 > a binary trace format but there is no such mechanism on SystemTap for 
 > doing this.  I've looked at what the folks at Hitachi have done with 
 > BTI, but this seems to force a specific trace format that is not 
 > suitable for what we need.  Ideally, the trace format should be left to 
 > the tapset using the interface not the BTI.  I propose to slightly alter 
 > the BTI from Hitachi to allow other trace implementations to use the 
 > trace format that's most convenient for the people implementing them.
 > 
 > To facilitate this, Tom Zanussi has been talking about implementing a 
 > basic form struct to the SystemTap language.  The basic idea of how the 
 > tapset would use the new BTI and struct data types looks like:
 > 
 > probe kernel.function("sys_open")
 > {
 >       trace_extra.hookid = HOOKID_OPEN;
 >       trace_extra.flags = $flags;
 >       trace_extra.mode = $mode;
 >       trace_extra.name = $filename;
 >       trace_extra.fd = $fd;
 > 
 >       lket_tracer(trace_extra);
 > }
 > 
 > 
 > function lket_trace(trace_extra:struct) {
 > 
 >           trace_format.timestamp = ...
 >           trace_format.cpu = ...
 >           trace_format.....
 >           .....
 > 
 >           trace(trace_format, trace_extra)
 > }
 > 
 > 
 > Unlike the BTI current implementation, the format of the trace hook is defined by lket_trace and not by the generic interface.  This design has the following benefits over the Hitachi interface.
 > 
 > 1) It allows for anyone to implement their own trace hooks as they see fit. Making it a truly generic interface.
 > 
 > 2) It does not limit the number or type of arguments that a trace hook can have.  The current implementation limits you to 16 data points of size long.
 > 
 > 
 > Aside from the support to add struct to the SystemTap language, the rest of the changes to the BTI should be pretty straight forward and they would not significantly impact the current work that Hitachi has done with their implementation.
 > 
 > Thoughts?

I'm not sure what you mean with the trace_format part - it seems
confusing to me.  Maybe simplify it to this to make it more clear what
we want to be able to do:

probe kernel.function("sys_open")
{
	data.hook_id = HOOKID_OPEN

	fill_in_common(data)

	data.flags = $flags;
	data.mode = $mode;
	data.name = $filename;
	data.fd = $fd;

	trace(data);
}

function fill_in_common(data)
{
	data.timestamp = timestamp_us();
}

Basically, we want to be able to write a probe like the above, the end
result being that a block of data corresponding to the struct gets
logged by the trace() function, which is just a simple function that
logs N bytes of raw data i.e. sizeof(data).  The struct syntax is to
me just a language convenience making it easy for the translator to
generate the code needed to do this, and doesn't need to do be
anything more than that, so it seems that it should be relatively easy
to add, but I don't know the translator very well, I may be mistaken.

Here's an approximation of the code that might be generated for the
probe script (I didn't bother to look at the code being generated
currently for probes to stick in here, but this should give the
general idea).

First a struct is generated.  The field types are inferred from the
assignments.  Currently there are only 2 systemtap types that can be
used in scripts, and there's no way to specify a field's type
explicitly e.g. u16 or char.  It would be nice in order to save space
for tracing, but I'm assuming for the sake of this example the
capability is there; otherwise just substitute longs for everything
but strings.  In any case, each field can be of any type.  There's
also no arbitrary limit on the number of items.

struct sys_open_trace_struct
{
	u16 hookid;
	unsigned long long timestamp;
	int flags;
	int mode;
	char name[NAME_MAX+1];
	int fd;
};

Finally, in the probe handler that's generated, a hypothetical
function _stp_reserve() is called to reserve space in the stp buffer.
It can basically be treated as if it were a kmalloc(), though all it's
doing is reserving a spot in the tracing buffer.  The code that fills
up the struct is then generated as usual for a probe, and when the
last field is filled in, it's done - there's no trace function or
anything needed after that, because the handler is directly filling in
the trace buffer memory.  Again, I'm just making up the _stp-specific
code off the top of my head - I'll look up the actual generated code
more carefully if it would help make it more clear.

void sys_open_handler(void)
{
	struct sys_open_trace_struct *event;
	event = _stp_reserve(sizeof(struct sys_open_trace_struct));

	event->hook_id = HOOKID_OPEN;
	event->timestamp = _stp_gettimeofday_us();
	event->flags = _stp_get_target("flags");
	event->mode = _stp_get_target("mode");
	_stp_copy_user_string(event->name, _stp_get_target("filename"));
	event->fd = _stp_get_target("fd");
}


Just for comparison, the above probe is really the equivalent of this
probe that uses globals and a dedicated 6-param trace function:

global timestamp, hook_id, flags, mode, filename, fd

probe kernel.function("sys_open")
{
	data.hook_id = HOOKID_OPEN

	fill_in_common()

	flags = $flags;
	mode = $mode;
	filename = $filename;
	fd = $fd;

	trace(hook_id, timestamp, flags, mode, filename, fd);
}

function fill_in_common()
{
	timestamp = timestamp_us();
}

To do the same logging as with the struct example the trace function
here would reserve or allocate an array of 6 longs (or whatever the
trace6() function is hard-coded to handle) and a string, assign the
param values and log.  The problem with this is that there needs to be
a dedicated trace() function with 6 params and a string and if later
its decided to be more efficient and use sizes for each field
according to its real type, there's no opportunity to do that, unless
you want to enumerate every possibility of number of params and param
type.

Considering that the first probe is so similar to the second, which
can be done today, I think this should be a relatively minor addition
to the language, but I'll let the translator folks comment on that.

This scheme could also be used in the reverse direction with another
slight addition to the language for making trace events from external
tracers visible in systemtap.

global timestamp, hook_id, flags, mode, filename, fd

probe trace.ltt("open_event")
{
	flags = $ltt_event.flags;
	mode = $ltt_event.mode;
	name = $ltt_event.filename;
	fd = $ltt_event.fd;
}

Here, the ltt event fields are available using the same struct syntax,
but prepended with '$'.

You could even combine the two and do 'retracing' if you wanted to
e.g. suppose the static tracepoint didn't provide some piece of info
that you needed.  You could grab what it does provide, and since
you're still in the context of the open syscall, you should be able to
also grab at that point other values, stick it all into a new struct
and log it e.g.

probe trace.ltt("open_event")
{
	data.hook_id = HOOKID_OPEN

	fill_in_common()

	data.flags = $ltt_event.flags;
	data.filename = $ltt_event.filename;
	data.fd = $ltt_event.fd;
	data.mode = $mode;

	trace(data);
}

function fill_in_common()
{
	timestamp = timestamp_us();
}

Here, the first three fields come from the ltt event, and the mode
comes from the mode param that wasn't available in the ltt event but
should be from dwarf info.

Tom


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 16:21 ` William Cohen
@ 2006-01-24 17:03   ` Jose R. Santos
  0 siblings, 0 replies; 29+ messages in thread
From: Jose R. Santos @ 2006-01-24 17:03 UTC (permalink / raw)
  To: William Cohen; +Cc: SystemTAP

William Cohen wrote:

>Jose R. Santos wrote:
>> Hi folks,
>> 
>> My team is currently implementing a trace tool using SystemTap that 
>> currently does logging by means of printf mechanism.  We want to move to 
>> a binary trace format but there is no such mechanism on SystemTap for 
>> doing this.  I've looked at what the folks at Hitachi have done with 
>> BTI, but this seems to force a specific trace format that is not 
>> suitable for what we need.  Ideally, the trace format should be left to 
>> the tapset using the interface not the BTI.  I propose to slightly alter 
>> the BTI from Hitachi to allow other trace implementations to use the 
>> trace format that's most convenient for the people implementing them.
>> 
>> To facilitate this, Tom Zanussi has been talking about implementing a 
>> basic form struct to the SystemTap language.  The basic idea of how the 
>> tapset would use the new BTI and struct data types looks like:
>> 
>> probe kernel.function("sys_open")
>> {
>>      trace_extra.hookid = HOOKID_OPEN;
>>      trace_extra.flags = $flags;
>>      trace_extra.mode = $mode;
>>      trace_extra.name = $filename;
>>      trace_extra.fd = $fd;
>> 
>>      lket_tracer(trace_extra);
>> }
>> 
>> 
>> function lket_trace(trace_extra:struct) {
>> 
>>          trace_format.timestamp = ...
>>          trace_format.cpu = ...
>>          trace_format.....
>>          .....
>> 
>>          trace(trace_format, trace_extra)
>> }
>> 
>> 
>> Unlike the BTI current implementation, the format of the trace hook is 
>> defined by lket_trace and not by the generic interface.  This design has 
>> the following benefits over the Hitachi interface.
>> 
>> 1) It allows for anyone to implement their own trace hooks as they see 
>> fit. Making it a truly generic interface.
>> 
>> 2) It does not limit the number or type of arguments that a trace hook 
>> can have.  The current implementation limits you to 16 data points of 
>> size long.
>> 
>> 
>> Aside from the support to add struct to the SystemTap language, the rest 
>> of the changes to the BTI should be pretty straight forward and they 
>> would not significantly impact the current work that Hitachi has done 
>> with their implementation.
>> 
>> Thoughts?
>> 
>> -JRS
>
>Are you proposing something like the trace buffers in perfmon2?
>
>http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html
>http://sourceforge.net/projects/perfmon2
>
>The perfmon2 allows multiple types of trace buffers, each has a unique 
>idetifier. The perfmon2 trace buffers can be interpretted in what ever 
>manner desired. OProfile has used a version of this mechanism in perfmon 
>to collect data on ia64 machines. The code that writes to the buffer can 
>be custom. There is a default trace buffer that writes a data to the buffer.
>
>-Will
>
>-Will
>
>
>  
>
No.  The first thing Im proposing is to alter Hitachi's Binary Trace 
Interface (BTI) into something more generic that myself and other can 
use on their projects.  What BTI does is basically write data into the 
SystemTap buffers in raw binary form.  This reduces the overhead of 
using printf and it provides a smaller trace file.  Masami Hiramatsu 
submitted their BTI patches Dec/2005 which I liked, but would like to 
see some small changes in order to not tie it to a particular trace format.

The second thing I'm proposing is to add support for a struct type in 
the SystemTap language to simplify the writing on trace hooks.  This 
idea originally came from Tom Zanussi and he sold me on it.  Adding 
support for a struct type structure in the SystemTap language would be 
of great use to simplify the writing of the trace hooks that use the 
binary trace interface.  I believe Tom had already an idea of how to 
implement this on to SystemTap.

-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
                   ` (2 preceding siblings ...)
  2006-01-24 16:55 ` Tom Zanussi
@ 2006-01-24 17:18 ` Frank Ch. Eigler
  2006-01-24 17:55   ` Jose R. Santos
  2006-01-24 18:30   ` Tom Zanussi
  2006-01-25  6:37 ` Masami Hiramatsu
  2006-01-26  9:11 ` Martin Hunt
  5 siblings, 2 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-24 17:18 UTC (permalink / raw)
  To: jrs; +Cc: SystemTAP


"Jose R. Santos" <jrs@us.ibm.com> writes:

> My team is currently implementing a trace tool using SystemTap that
> currently does logging by means of printf mechanism.  We want to
> move to a binary trace format [...]

Could you specify *why*?  This is important, because a naive "struct"
construct in systemtap (involving string copying, integer casting) is
likely to rob you of any extra performance you might imagine this
would give you.


> [...]
> probe kernel.function("sys_open")
> {
>       trace_extra.hookid = HOOKID_OPEN;
>       trace_extra.flags = $flags;
> [...]
>       lket_tracer(trace_extra);
> }
> 
> function lket_trace(trace_extra:struct) {
> [...]
>           trace_format.cpu = ...
> [...]
>           trace(trace_format, trace_extra)
> }

How do you folks imagine this working?  Treat the field names as if
they were array indexes, and trace() just iterates over the array?  In
any particular order?

Maybe what you're looking for is programming convenience, such as
varargs, or generic functions, or local arrays?


> [...]
> 2) It does not limit the number or type of arguments that a trace
> hook can have.  The current implementation limits you to 16 data
> points of size long.

You might find that those fixed typing/size constraints are exactly
what allows the Hitachi code to go fast.


- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 17:18 ` Frank Ch. Eigler
@ 2006-01-24 17:55   ` Jose R. Santos
  2006-01-24 18:30   ` Tom Zanussi
  1 sibling, 0 replies; 29+ messages in thread
From: Jose R. Santos @ 2006-01-24 17:55 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: SystemTAP

Frank Ch. Eigler wrote:

>
>> [...]
>> 2) It does not limit the number or type of arguments that a trace
>> hook can have.  The current implementation limits you to 16 data
>> points of size long.
>
>You might find that those fixed typing/size constraints are exactly
>what allows the Hitachi code to go fast.
>
>
>- FChE
>
>  
>
And I completely agree. My point is that such constraints should be part 
of tapset using the generic interface and not part of the generic 
interface itself. The other alternative is to have a different interface 
for each different tracing methodology which I think is redundant.

-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 17:18 ` Frank Ch. Eigler
  2006-01-24 17:55   ` Jose R. Santos
@ 2006-01-24 18:30   ` Tom Zanussi
  1 sibling, 0 replies; 29+ messages in thread
From: Tom Zanussi @ 2006-01-24 18:30 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: jrs, SystemTAP

Frank Ch. Eigler writes:
 > 
 > "Jose R. Santos" <jrs@us.ibm.com> writes:
 > 
 > > My team is currently implementing a trace tool using SystemTap that
 > > currently does logging by means of printf mechanism.  We want to
 > > move to a binary trace format [...]
 > 
 > Could you specify *why*?  This is important, because a naive "struct"
 > construct in systemtap (involving string copying, integer casting) is
 > likely to rob you of any extra performance you might imagine this
 > would give you.
 > 

Just being allowed to avoid the kernel printf() functions should be a
big win - Jose says that's the reason the omininous 'this will slow
down your system very much' message is in the README.

Avoiding the run-time overhead is one part, but also not having to log
10 times the amount of data to get the same information would be nice.

With structs, you also know the size of the data to be logged ahead of
time, so you don't have to first sprintf it to a buffer, then copy the
number of bytes returned by sprintf() to another buffer.

 > 
 > > [...]
 > > probe kernel.function("sys_open")
 > > {
 > >       trace_extra.hookid = HOOKID_OPEN;
 > >       trace_extra.flags = $flags;
 > > [...]
 > >       lket_tracer(trace_extra);
 > > }
 > > 
 > > function lket_trace(trace_extra:struct) {
 > > [...]
 > >           trace_format.cpu = ...
 > > [...]
 > >           trace(trace_format, trace_extra)
 > > }
 > 
 > How do you folks imagine this working?  Treat the field names as if
 > they were array indexes, and trace() just iterates over the array?  In
 > any particular order?
 > 
 > Maybe what you're looking for is programming convenience, such as
 > varargs, or generic functions, or local arrays?
 > 
 > 
 > > [...]
 > > 2) It does not limit the number or type of arguments that a trace
 > > hook can have.  The current implementation limits you to 16 data
 > > points of size long.
 > 
 > You might find that those fixed typing/size constraints are exactly
 > what allows the Hitachi code to go fast.
 > 

How is

	ent->args[i] = (long)va_arg(args, int64_t);

(from the BTI code) better than

      data->argi = (long)argi;

?

Tom


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
                   ` (3 preceding siblings ...)
  2006-01-24 17:18 ` Frank Ch. Eigler
@ 2006-01-25  6:37 ` Masami Hiramatsu
  2006-01-25 14:45   ` Jose R. Santos
  2006-01-26  9:11 ` Martin Hunt
  5 siblings, 1 reply; 29+ messages in thread
From: Masami Hiramatsu @ 2006-01-25  6:37 UTC (permalink / raw)
  To: jrs; +Cc: SystemTAP, Yumiko Sugita, Satoshi Oshima, Hideo Aoki

[-- Attachment #1: Type: text/plain, Size: 3469 bytes --]

Hi,

Jose R. Santos wrote:
> Hi folks,
>
> My team is currently implementing a trace tool using SystemTap that
> currently does logging by means of printf mechanism.  We want to move to
> a binary trace format but there is no such mechanism on SystemTap for
> doing this.  I've looked at what the folks at Hitachi have done with
> BTI, but this seems to force a specific trace format that is not
> suitable for what we need.  Ideally, the trace format should be left to
> the tapset using the interface not the BTI.  I propose to slightly alter
> the BTI from Hitachi to allow other trace implementations to use the
> trace format that's most convenient for the people implementing them.

We had recognized the problems of previous BTI implementation. And we
have developed more generic method called gBTI (generic Binary Transport
Interface) which can be used with current ATI.
gBTI itself uses no formatted structure but an array of long (64bit integer).
It just transports those several binary data from tapset to user daemon. It
works like IP packet envelope.

To support this gBTI, I think, systemtap just introduce only three features below:
1. _stp_binary_write() function in runtime.
2. stpd enhancement to handle binary data correctly.
3. some wrapping functions for user scripts. (i.e. lket_trace(), binary_log(), etc)
So I think supporting gBTI is much easier and simpler than supporting other
method.

1. runtime function
gBTI provides a runtime function called _stp_binary_write() for tapsets.
_stp_binary_write() should be invoked with a number as the first argument and
variable argument lists. In other words, the synopsis is:

 void _stp_binary_write(int num, ...);

The "num" argument specifies the number of arguments following this argument.
The type of other variables must be int64_t.
For example, you can use any format of followings;

 _stp_binary_write(3, (int64_t)arg1, (int64_t)arg2, (int64_t)arg3);
 _stp_binary_write(2, (int64_t)current->pid, (int64_t)current);
 _stp_binary_write(5, (int64_t)lkst_header1(), (int64_t)lkst_header2(),
                  (int64_t)etype, (int64_t)arg1, (int64_t)arg2);

 And so on.

The _stp_binary_write() function writes the data as a binary packet into a
relayfs channel which is shared by current ATI. The gBTI also share the
sequential-ID with the ATI by using _stp_seq_inc(). The packet format is
described below.

gBTI packet:
[seq-id][\0][num][arg1][arg2]....[arg(num-1)]

In the other hands, ATI packet is;
[seq-id][string][\0]
(Length of string is greater than 0, this is checked by runtime functions.)

Thus, gBTI can share the channel safely, because if the packet whose first
character of the string is '\0', it is binary data packet.

2. stpd enhancement
We should enhance stpd daemon to handle both ascii packet and binary packet
correctly. But it is not so difficalt.

3. wrapping functions
I also think we can define the various interfaces for example LKST, LKET,
or more generic binary_log() interface over the gBTI runtime.

What would you think about the gBTI?

I have already a concrete implementation of gBTI(for relayfs). I developed
it as a tapset script including runtime part.
I attach two files to this mail.

- gbti.stp:      gBTI core runtime (_stp_binary_write())
- lkst_gbti.stp: LKST compatible wrapping function of gBTI

Please review it.

Best regards,

-- 
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: hiramatu@sdl.hitachi.co.jp


[-- Attachment #2: gbti.stp --]
[-- Type: text/plain, Size: 1485 bytes --]

// generic Binary Transport Interface script
// Copyright (C) 2006 Hitachi, Ltd., Systems Development Laboratory
// Written by Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
// Instead of replacing whole runtime, this file includes a C source.

%{
#ifndef _GBTI_C_ /* -*- linux-c -*- */
#define _GBTI_C_

#include <linux/config.h>
#include <linux/percpu.h>
#include "io.c"

#ifndef MAXBINARYARGS
#define MAXBINARYARGS 127
#endif

#define STP_BIN_PACKET 0

struct stp_packet_h { 
	int32_t seq;
	u_int8_t flag;
	u_int8_t num;
}  __attribute__((packed));

#ifdef STP_RELAYFS

/* need to disable irqs */
static void _stp_binary_write (int num, ...)
{
	va_list vargs;
	int i;
	void *ptr;
	unsigned length;

	if (unlikely(num == 0))
		return;

	if (unlikely(num > MAXBINARYARGS))
		num = MAXBINARYARGS;
	length = sizeof(struct stp_packet_h) + num*sizeof(int64_t);

	/*
	 * I wonder if I should disable irqs here. SystemTap probe
	 * is not reentrant. Aren't we already doing this atomically?
	 */
	{
	unsigned long flags;
	local_irq_save(flags);
	ptr = relay_reserve(_stp_chan, length);
	local_irq_restore(flags);
	}

	if (ptr != 0) {
		struct stp_packet_h *hd = ptr;
		int64_t *args = ptr + sizeof(struct stp_packet_h);
		hd->seq = _stp_seq_inc();
		hd->flag = STP_BIN_PACKET;
		hd->num = num;
		va_start(vargs, num);
		for (i = 0; i < num; i++) {
			args[i] = va_arg(vargs, int64_t);
		}
		va_end(vargs);
	}
}
#endif

#endif /* _GBTI_C_ */
%}

function gbti_init () %{
	/* do nothing */
%}

[-- Attachment #3: lkst_gbti.stp --]
[-- Type: text/plain, Size: 1341 bytes --]

// lkst generic binary transport interface script
// Copyright (C) 2006 Hitachi, Ltd., Systems Development Laboratory
// Written by Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>

%{
static void _stp_binary_write (int num, ...);
%}
/*
I'd like to provide new interface:
function binary_log:unkown (num:long, ...)
It should be a special embeded function.
*/

/* This interface is basically same as classic LKST. */
function lkst_trace_n(n:long, etype:long, arg1:long, arg2:long, arg3:long, arg4:long) %{
	int64_t tsc;
	rdtscll(tsc);
	/* The size of classic LKST's etype is 12bits.
	 On the systemtap, it is expanded to 24 bits.*/
	_stp_binary_write (THIS->n+2,
			  THIS->etype << 40 | 
			  (int64_t)smp_processor_id() << 32 | current->pid,
			  tsc, THIS->arg1, THIS->arg2, 
			  THIS->arg3, THIS->arg4);
%}

function lkst_trace(etype:long, arg1:long, arg2:long, arg3:long, arg4:long) %{
	int64_t tsc;
	rdtscll(tsc);
	_stp_binary_write (6,
			  THIS->etype << 40 | 
			  (int64_t)smp_processor_id() << 32 | current->pid,
			  tsc, THIS->etype, THIS->arg1, THIS->arg2, 
			  THIS->arg3, THIS->arg4);
%}


probe begin {
      /* initial event */
      gbti_init()
      log("start tracing")
      lkst_trace_n (4, EVENT_SYNCTIME, gettimeofday_s(), gettimeofday_us(),
      		   cpu_khz(), get_tsc());
}

probe end {
      log("end tracing")
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-25  6:37 ` Masami Hiramatsu
@ 2006-01-25 14:45   ` Jose R. Santos
  2006-01-26  8:51     ` Masami Hiramatsu
  0 siblings, 1 reply; 29+ messages in thread
From: Jose R. Santos @ 2006-01-25 14:45 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: SystemTAP, Yumiko Sugita, Satoshi Oshima, Hideo Aoki

Masami Hiramatsu wrote:

>Hi,
>
>Jose R. Santos wrote:
>> Hi folks,
>>
>> My team is currently implementing a trace tool using SystemTap that
>> currently does logging by means of printf mechanism.  We want to move to
>> a binary trace format but there is no such mechanism on SystemTap for
>> doing this.  I've looked at what the folks at Hitachi have done with
>> BTI, but this seems to force a specific trace format that is not
>> suitable for what we need.  Ideally, the trace format should be left to
>> the tapset using the interface not the BTI.  I propose to slightly alter
>> the BTI from Hitachi to allow other trace implementations to use the
>> trace format that's most convenient for the people implementing them.
>
>We had recognized the problems of previous BTI implementation. And we
>have developed more generic method called gBTI (generic Binary Transport
>Interface) which can be used with current ATI.
>gBTI itself uses no formatted structure but an array of long (64bit integer).
>It just transports those several binary data from tapset to user daemon. It
>works like IP packet envelope.
>
>To support this gBTI, I think, systemtap just introduce only three features below:
>1. _stp_binary_write() function in runtime.
>2. stpd enhancement to handle binary data correctly.
>3. some wrapping functions for user scripts. (i.e. lket_trace(), binary_log(), etc)
>So I think supporting gBTI is much easier and simpler than supporting other
>method.
>
>1. runtime function
>gBTI provides a runtime function called _stp_binary_write() for tapsets.
>_stp_binary_write() should be invoked with a number as the first argument and
>variable argument lists. In other words, the synopsis is:
>
> void _stp_binary_write(int num, ...);
>
>The "num" argument specifies the number of arguments following this argument.
>The type of other variables must be int64_t.
>For example, you can use any format of followings;
>
> _stp_binary_write(3, (int64_t)arg1, (int64_t)arg2, (int64_t)arg3);
> _stp_binary_write(2, (int64_t)current->pid, (int64_t)current);
> _stp_binary_write(5, (int64_t)lkst_header1(), (int64_t)lkst_header2(),
>                  (int64_t)etype, (int64_t)arg1, (int64_t)arg2);
>
> And so on.
>
>The _stp_binary_write() function writes the data as a binary packet into a
>relayfs channel which is shared by current ATI. The gBTI also share the
>sequential-ID with the ATI by using _stp_seq_inc(). The packet format is
>described below.
>
>gBTI packet:
>[seq-id][\0][num][arg1][arg2]....[arg(num-1)]
>
>In the other hands, ATI packet is;
>[seq-id][string][\0]
>(Length of string is greater than 0, this is checked by runtime functions.)
>
>Thus, gBTI can share the channel safely, because if the packet whose first
>character of the string is '\0', it is binary data packet.
>  
>

Nice...

This implementation is more generic and its available now.  While this 
is suitable for our current trace work, I still agree with Tom that this 
approach may not be as flexible for gathering other data types.  While 
having fixed data types make it easy to decode the binary trace, there 
may be situations were we may want to grab data that may not fit on 
either a size long arg or the 128 arg limit.  Also, while possible to 
write a function call with 128 args, this would look really ugly and 
seems error prone.  Having a basic struct data type on SystemTap would 
solve some of these issues.

>2. stpd enhancement
>We should enhance stpd daemon to handle both ascii packet and binary packet
>correctly. But it is not so difficalt.
>  
>
I claim complete ignorance on how the stpd daemon works.  Can you 
elaborate on what the problem and possible enhancements are?  Does this 
still require your prohibit merging patch?

>3. wrapping functions
>I also think we can define the various interfaces for example LKST, LKET,
>or more generic binary_log() interface over the gBTI runtime.
>  
>
This I like.  I would use this if it were available now on SystemTap.

>What would you think about the gBTI?
>  
>
Its a good solution and seems like we could port our trace to use this 
mechanism easily (which we would had to do anyway if we implemented our 
own BTI).  This could later be enhance with what Tom is proposing.  I 
would like to see this or an equivalent functionality as part of 
SystemTap though.  Since you move the functionality to a tapset, are you 
looking at maintaining this as part of LKST or do you have plans to get 
this into SystemTap itself?

>I have already a concrete implementation of gBTI(for relayfs). I developed
>it as a tapset script including runtime part.
>I attach two files to this mail.
>
>- gbti.stp:      gBTI core runtime (_stp_binary_write())
>- lkst_gbti.stp: LKST compatible wrapping function of gBTI
>
>Please review it.
>
>Best regards,
>  
>
I've done a visual inspection and don't see any mayor issues with it.  I 
will play with it a bit and see how it goes.

What is the purpose of having gbti_init()?

Thanks

-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 16:55 ` Tom Zanussi
@ 2006-01-26  2:27   ` Frank Ch. Eigler
  2006-01-26  4:29     ` Jose R. Santos
  2006-01-26 14:23     ` Tom Zanussi
  0 siblings, 2 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-26  2:27 UTC (permalink / raw)
  To: Tom Zanussi; +Cc: systemtap


zanussi wrote:

> [...]

Thank you for your rather detailed proposal.


> [...]
> probe kernel.function("sys_open")
> {
> 	data.hook_id = HOOKID_OPEN
> 	fill_in_common(data)
> 	data.flags = $flags;
> [...]
> }
> 
> function fill_in_common(data)
> {
> 	data.timestamp = timestamp_us();
> }

From this discussion, it seems to me that the sole use of this sort of
struct construct is for purposes of tracing.  There are no language
elements proposed for generically extracting information such as field
names / values.  Have you thought about what happens in connection
with conditionals / loops?  How should functions be type checked when
they take structs?  And return them?  And how are these connected to
arrays, when they are also passed by reference?  Can they be global?
Why not?

This seems to me as a too special-purpose and of too limited
application for it to show up in the script language in this
misleadingly general-looking form.

Here is a possible way of accomplishing the same goal - lightweight
binary tracing.  No new misleading notation, just special functions,
like printf* today.

# probe kernel.function("sys_open")
# {
# 	trace("hook_id", HOOKID_OPEN)
# 	trace_common()
# 	trace("flags", $flags)
# 	[...]
# }
# 
# function trace_common()
# {
# 	trace("timestamp", timestamp_us())
# }

The translator could generate similar code to your sketch: a field
assignment in a synthesized buffer struct for each executed trace().
It could avoid widening $target values to systemtap types, or include
explicit encoding/casting directives.  It could infer a "flush" at the
end of a probe that calls trace().  It could invent field names if you
don't care to give one for each call site.

For that matter, this could all be hidden in printf itself, with
special formatting directives.  (Don't assume that this would
necessarily invoke the kernel vsnprintf.  The translator could be a
smarter than to call the runtime _stp_printf function.)

By the way, in either the structy or printy design, how do you expect
the actual binary struct metadata to be used at the end?  How would
the binary trace data be decoded?

- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26  2:27   ` Frank Ch. Eigler
@ 2006-01-26  4:29     ` Jose R. Santos
  2006-01-26  5:07       ` Frank Ch. Eigler
  2006-01-26 14:23     ` Tom Zanussi
  1 sibling, 1 reply; 29+ messages in thread
From: Jose R. Santos @ 2006-01-26  4:29 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Tom Zanussi, systemtap

Frank Ch. Eigler wrote:

>zanussi wrote:
>
>> [...]
>
>Thank you for your rather detailed proposal.
>
>
>> [...]
>> probe kernel.function("sys_open")
>> {
>> 	data.hook_id = HOOKID_OPEN
>> 	fill_in_common(data)
>> 	data.flags = $flags;
>> [...]
>> }
>> 
>> function fill_in_common(data)
>> {
>> 	data.timestamp = timestamp_us();
>> }
>
From this discussion, it seems to me that the sole use of this sort of
>struct construct is for purposes of tracing.  There are no language
>elements proposed for generically extracting information such as field
>names / values.  Have you thought about what happens in connection
>with conditionals / loops?  How should functions be type checked when
>they take structs?  And return them?  And how are these connected to
>arrays, when they are also passed by reference?  Can they be global?
>Why not?
>
>This seems to me as a too special-purpose and of too limited
>application for it to show up in the script language in this
>misleadingly general-looking form.
>
>Here is a possible way of accomplishing the same goal - lightweight
>binary tracing.  No new misleading notation, just special functions,
>like printf* today.
>  
>
I wonder, would local arrays be less misleading in your opinion.  If the 
array could contain multiple types and be use as both an argument and 
return value, this would also work for tracing.  The use of arrays does 
seem to better fit the SystemTap coding style.

># probe kernel.function("sys_open")
># {
># 	trace("hook_id", HOOKID_OPEN)
># 	trace_common()
># 	trace("flags", $flags)
># 	[...]
># }
># 
># function trace_common()
># {
># 	trace("timestamp", timestamp_us())
># }
>
>The translator could generate similar code to your sketch: a field
>assignment in a synthesized buffer struct for each executed trace().
>It could avoid widening $target values to systemtap types, or include
>explicit encoding/casting directives.  It could infer a "flush" at the
>end of a probe that calls trace().  It could invent field names if you
>don't care to give one for each call site.
>
>For that matter, this could all be hidden in printf itself, with
>special formatting directives.  (Don't assume that this would
>necessarily invoke the kernel vsnprintf.  The translator could be a
>smarter than to call the runtime _stp_printf function.)
>
>  
>
Printf is the current mechanism that is used in LKET.  While it works, I 
am finding the code does not look a clean as it could with a better 
interface. BTI does offer nice interface that is very easy to decode and 
easy to implement.  Then new gBTI is better, but still somewhat not as 
flexible to the any type of data that we may want to put.

>By the way, in either the structy or printy design, how do you expect
>the actual binary struct metadata to be used at the end?  How would
>the binary trace data be decoded?
>
>- FChE
>
>  
>
Thats up to the trace format to provide enough information to decode the 
trace.  All of the hooks will have some fixed size data followed by data 
that is specific to the hook.  For example:

(int)HookID|(long long)Timestamp|(int)CPU|(int)pid|(long)sizeof 
hook|.........

We can do partial decoding of all the fixed size items and as long as we 
have the size of the hook somewhere in there, we can calculate the 
offset of the next hook.  Its up to the user space tool to know what 
each trace hook represents and do decoding of the data that cant be 
interpreted by a generic decoder.


-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26  4:29     ` Jose R. Santos
@ 2006-01-26  5:07       ` Frank Ch. Eigler
  0 siblings, 0 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-26  5:07 UTC (permalink / raw)
  To: Jose R. Santos; +Cc: systemtap

Hi -

jrs wrote:

> [...]  I wonder, would local arrays be less misleading in your
> opinion.  If the array could contain multiple types and be use as
> both an argument and return value, this would also work for tracing.
> The use of arrays does seem to better fit the SystemTap coding
> style.

Yes, that would be another way, but then it would be hard to provide
the same degree of efficiency (no data copying, no index lookups,
etc.)  as the other solutions.


> [...] Printf is the current mechanism that is used in LKET.  While
> it works, I am finding the code does not look a clean as it could
> with a better interface.  BTI does offer nice interface that is very
> easy to decode and easy to implement.  Then new gBTI is better, but
> still somewhat not as flexible to the any type of data that we may
> want to put.

We can keep separate the low-level C API that is ultimately used to
send trace records forth, and the script-level constructs that will
invoke them.  The translator should be able to map between reasonable
pairs.  I hope to nudge the script side toward harmony with the rest
of the language.

- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-25 14:45   ` Jose R. Santos
@ 2006-01-26  8:51     ` Masami Hiramatsu
  2006-01-26 14:41       ` Frank Ch. Eigler
  0 siblings, 1 reply; 29+ messages in thread
From: Masami Hiramatsu @ 2006-01-26  8:51 UTC (permalink / raw)
  To: jrs
  Cc: SystemTAP, Yumiko Sugita, Satoshi Oshima, Hideo Aoki,
	Frank Ch. Eigler, Tom Zanussi

Jose R. Santos wrote:
> This implementation is more generic and its available now.  While this
> is suitable for our current trace work, I still agree with Tom that this
> approach may not be as flexible for gathering other data types.  While
> having fixed data types make it easy to decode the binary trace, there
> may be situations were we may want to grab data that may not fit on
> either a size long arg or the 128 arg limit.

Thanks, and I think those situations would be rare case. Even if you
would like to transport some data structures totally, you can do it
over gBTI with a type signature by using custom tapsets.
However, in my opinion, interpretation of type information should be
left to each user. Because there are so many types and structures that
SystemTap can not handle all of them.

>  Also, while possible to
> write a function call with 128 args, this would look really ugly and
> seems error prone.  Having a basic struct data type on SystemTap would
> solve some of these issues.

I agree with the requirement of some special interface to transport
such a large data. One of simpler ways, I think, is to allow the
script functions to take long arrays as its arguments. Thus, we can
transport whole the array to user space as like below;

function array_log ( map:long[] ) %{
	struct map_node *node = _stp_map_start(THIS->map);
	// reserve region and write header
	int64_t * ptr = _stp_binary_reserve(THIS->map->num);
	while (node) {
		*(ptr++) = _stp_get_int64(node);
		node = _stp_map_iter(THIS->map, node);
	}
%}

This function does not consume the stack so much.

>> 2. stpd enhancement
>> We should enhance stpd daemon to handle both ascii packet and binary
>> packet
>> correctly. But it is not so difficalt.
>>
> I claim complete ignorance on how the stpd daemon works.  Can you
> elaborate on what the problem and possible enhancements are?  Does this
> still require your prohibit merging patch?

Yes, it is required, because of the reason below.
With relayfs, stpd daemon writes per-cpu data into per-cpu files. And
it sorts the entries of those files and merges to one file (or to
stdout) just before terminating. Currently, stpd's merging routine
can't handle binary packets. It will treat it as a 0 length string
and can't find the entries after a binary packet. So, you should
prohibit stpd merging files.

I think the solution is to improve stpd to merge binary packet and
ascii packet. I also developed external merging program. Its display
format of binary data should be discussed.

>> 3. wrapping functions
>> I also think we can define the various interfaces for example LKST, LKET,
>> or more generic binary_log() interface over the gBTI runtime.
>>
> This I like.  I would use this if it were available now on SystemTap.
>
>> What would you think about the gBTI?
>>
> Its a good solution and seems like we could port our trace to use this
> mechanism easily (which we would had to do anyway if we implemented our
> own BTI).  This could later be enhance with what Tom is proposing.  I
> would like to see this or an equivalent functionality as part of
> SystemTap though.  Since you move the functionality to a tapset, are you
> looking at maintaining this as part of LKST or do you have plans to get
> this into SystemTap itself?

Yes, I would like to merge this feature to SystemTap as a runtime.
The patch is under been developing.

>> I have already a concrete implementation of gBTI(for relayfs). I
>> developed
>> it as a tapset script including runtime part.
>> I attach two files to this mail.
>>
>> - gbti.stp:      gBTI core runtime (_stp_binary_write())
>> - lkst_gbti.stp: LKST compatible wrapping function of gBTI
>>
>> Please review it.
>>
>> Best regards,
>>
>>
> I've done a visual inspection and don't see any mayor issues with it.  I
> will play with it a bit and see how it goes.
>
> What is the purpose of having gbti_init()?

This is dummy function to implement _stp_binary_write() as a independent
tapset script (gbti.stp). Systemtap translator merges only the tapset
scripts which include some symbols used by the user script. Thus,
without calling gbti_init() from your script, the translator doesn't
merge gbti.stp and fails to build.

When gBTI is merged with runtime, gbti_init() is not required.

>
> Thanks
>
> -JRS
>

Best regards,

-- 
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: hiramatu@sdl.hitachi.co.jp

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
                   ` (4 preceding siblings ...)
  2006-01-25  6:37 ` Masami Hiramatsu
@ 2006-01-26  9:11 ` Martin Hunt
  2006-01-26 17:46   ` Jose R. Santos
  2006-01-27  2:58   ` Frank Ch. Eigler
  5 siblings, 2 replies; 29+ messages in thread
From: Martin Hunt @ 2006-01-26  9:11 UTC (permalink / raw)
  To: jrs; +Cc: systemtap

Sorry for jumping in this discussion so late.  I read quickly through
the thread. The problem appears to be

1. Need for binary data
2. Need for flexible formating
3. printf too slow (although it really isn't too bad)

I like the gBTI proposal a lot, but it doesn't allow mixing strings and
ints and doesn't use varargs.  However that can be easily changed. But
lets consider just using something similar to printf, except it is far
simpler and doesn't do integer-to-ascii conversion, just writes binary.

probe kernel.function("sys_open")
{
	trace("%d%d%d%s%d", HOOKID_OPEN, $flags, $mode, $filename, $fd)
}
so that would write:
64-bit integer, HOOKID_OPEN
64-bit integer, $flags
64-bit integer, $mode
Null-terminated string, $filename
64-bit integer, $fd

Of course we could add length modifiers if we prefer more compact
output. We could also make it totally compatible with printf if we use
something other than %d and %s for binary. Something like:
%d - ASCII int
%1b - 1 byte binary number
%4b - 4 byte binary number
%8b - 8 byte binary number
%s  - string
%0s - null-terminated string
Which would make the previous example something like
trace("%1b%1b%1b%0s%1b", HOOKID_OPEN, $flags, $mode, $filename, $fd)

The trace function would then just directly call _stp_trace(), which
would be simple to implement.

This seems low-level enough to allow implementing almost any desired
output format.

Martin



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26  2:27   ` Frank Ch. Eigler
  2006-01-26  4:29     ` Jose R. Santos
@ 2006-01-26 14:23     ` Tom Zanussi
  1 sibling, 0 replies; 29+ messages in thread
From: Tom Zanussi @ 2006-01-26 14:23 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Tom Zanussi, systemtap

Frank Ch. Eigler writes:
 > 
 > zanussi wrote:
 > 
 > > [...]
 > 
 > Thank you for your rather detailed proposal.
 > 
 > 
 > > [...]
 > > probe kernel.function("sys_open")
 > > {
 > > 	data.hook_id = HOOKID_OPEN
 > > 	fill_in_common(data)
 > > 	data.flags = $flags;
 > > [...]
 > > }
 > > 
 > > function fill_in_common(data)
 > > {
 > > 	data.timestamp = timestamp_us();
 > > }
 > 
 > >From this discussion, it seems to me that the sole use of this sort of
 > struct construct is for purposes of tracing.  There are no language
 > elements proposed for generically extracting information such as field
 > names / values.  Have you thought about what happens in connection
 > with conditionals / loops?  How should functions be type checked when
 > they take structs?  And return them?  And how are these connected to
 > arrays, when they are also passed by reference?  Can they be global?
 > Why not?
 > 
 > This seems to me as a too special-purpose and of too limited
 > application for it to show up in the script language in this
 > misleadingly general-looking form.
 > 

Yes, you're right - the sole purpose is for tracing.  I'll admit I
wasn't entirely comfortable attempting to usurp such an important
language construct for such trivial reasons, but hey, nobody else was
using it ;-)

 > Here is a possible way of accomplishing the same goal - lightweight
 > binary tracing.  No new misleading notation, just special
 > functions, like printf* today.
 > 
 > # probe kernel.function("sys_open")
 > # {
 > # 	trace("hook_id", HOOKID_OPEN)
 > # 	trace_common()
 > # 	trace("flags", $flags)
 > # 	[...]
 > # }
 > # 
 > # function trace_common()
 > # {
 > # 	trace("timestamp", timestamp_us())
 > # }
 > 
 > The translator could generate similar code to your sketch: a field
 > assignment in a synthesized buffer struct for each executed trace().
 > It could avoid widening $target values to systemtap types, or include
 > explicit encoding/casting directives.  It could infer a "flush" at the
 > end of a probe that calls trace().  It could invent field names if you
 > don't care to give one for each call site.
 > 

Sounds like it should do the trick.  Thanks for the suggestion.

 > For that matter, this could all be hidden in printf itself, with
 > special formatting directives.  (Don't assume that this would
 > necessarily invoke the kernel vsnprintf.  The translator could be a
 > smarter than to call the runtime _stp_printf function.)
 > 
 > By the way, in either the structy or printy design, how do you expect
 > the actual binary struct metadata to be used at the end?  How would
 > the binary trace data be decoded?
 > 

There are probably as many different ways of doing this as there are
different trace applications.

For instance, I have working code that registers event types by
creating proc files which describe them.  Each contains enough
information to completely decode any event in the event stream
e.g. event id, event size and for each field, type, length (for
supporting arrays/strings), etc.

Using this scheme, and since I'm hard-pressed to find a reason to
define more than 256 events, my trace 'format' consists of a single
byte at the beginning of each event.

Tom


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26  8:51     ` Masami Hiramatsu
@ 2006-01-26 14:41       ` Frank Ch. Eigler
  2006-01-26 17:32         ` Jose R. Santos
  2006-01-27  3:04         ` Masami Hiramatsu
  0 siblings, 2 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-26 14:41 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: systemtap

Hi -

> [...] One of simpler ways, I think, is to allow the script functions
> to take long arrays as its arguments. Thus, we can transport whole
> the array to user space as like below;
>
> function array_log ( map:long[] ) %{
> [...]
> %}

Can you think of a scenario where tracing an entire array could be
necessary in a time-critical place?  If not, then coding the array
printing by an explicit foreach would not be a serious burden.

- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26 14:41       ` Frank Ch. Eigler
@ 2006-01-26 17:32         ` Jose R. Santos
  2006-01-27  3:04         ` Masami Hiramatsu
  1 sibling, 0 replies; 29+ messages in thread
From: Jose R. Santos @ 2006-01-26 17:32 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: Masami Hiramatsu, systemtap

Frank Ch. Eigler wrote:

>Hi -
>
>> [...] One of simpler ways, I think, is to allow the script functions
>> to take long arrays as its arguments. Thus, we can transport whole
>> the array to user space as like below;
>>
>> function array_log ( map:long[] ) %{
>> [...]
>> %}
>
>Can you think of a scenario where tracing an entire array could be
>necessary in a time-critical place?  If not, then coding the array
>printing by an explicit foreach would not be a serious burden.
>
>- FChE
>
>  
>
Its not really a problem if you are only planning on having a single 
trace point.  The problem is that as you add more trace hooks little 
burdens start adding up.  Some of the uses that we have for our trace 
tool require multiple hooks working together.  If running a trace 
impacts the performance to much, that impact may change some of the 
performance characteristics of the workload making the trace useless.

-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26  9:11 ` Martin Hunt
@ 2006-01-26 17:46   ` Jose R. Santos
  2006-01-26 18:06     ` Martin Hunt
  2006-01-27  2:58   ` Frank Ch. Eigler
  1 sibling, 1 reply; 29+ messages in thread
From: Jose R. Santos @ 2006-01-26 17:46 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

Martin Hunt wrote:

>Sorry for jumping in this discussion so late.  I read quickly through
>the thread. The problem appears to be
>
>1. Need for binary data
>2. Need for flexible formating
>3. printf too slow (although it really isn't too bad)
>
>I like the gBTI proposal a lot, but it doesn't allow mixing strings and
>ints and doesn't use varargs.  However that can be easily changed. But
>lets consider just using something similar to printf, except it is far
>simpler and doesn't do integer-to-ascii conversion, just writes binary.
>
>probe kernel.function("sys_open")
>{
>	trace("%d%d%d%s%d", HOOKID_OPEN, $flags, $mode, $filename, $fd)
>}
>so that would write:
>64-bit integer, HOOKID_OPEN
>64-bit integer, $flags
>64-bit integer, $mode
>Null-terminated string, $filename
>64-bit integer, $fd
>
>Of course we could add length modifiers if we prefer more compact
>output. We could also make it totally compatible with printf if we use
>something other than %d and %s for binary. Something like:
>%d - ASCII int
>%1b - 1 byte binary number
>%4b - 4 byte binary number
>%8b - 8 byte binary number
>%s  - string
>%0s - null-terminated string
>Which would make the previous example something like
>trace("%1b%1b%1b%0s%1b", HOOKID_OPEN, $flags, $mode, $filename, $fd)
>
>The trace function would then just directly call _stp_trace(), which
>would be simple to implement.
>
>This seems low-level enough to allow implementing almost any desired
>output format.
>
>Martin
>  
>
This sound like a good idea, but I wonder how it will perform compared 
to gBTI.  For the purposes of trace, I think the best implementation is 
the one with the best performance and smallest impact on the system.  If 
we can get something like this to perform as good or better that gBTI 
the is also a good candidate.  It does have some of the flexibility that 
we would like to see on a trace interface.

-JRS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26 17:46   ` Jose R. Santos
@ 2006-01-26 18:06     ` Martin Hunt
  2006-01-26 19:11       ` Satoshi Oshima
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Hunt @ 2006-01-26 18:06 UTC (permalink / raw)
  To: jrs; +Cc: systemtap

On Thu, 2006-01-26 at 11:46 -0600, Jose R. Santos wrote:

> This sound like a good idea, but I wonder how it will perform compared 
> to gBTI.  For the purposes of trace, I think the best implementation is 
> the one with the best performance and smallest impact on the system.  If 
> we can get something like this to perform as good or better that gBTI 
> the is also a good candidate.  

The difference between the two is minor. gBTI passes in an integer that
is the number of 64-bit ints to log. 

_stp_binary_write(2, (int64_t)current->pid, (int64_t)current);

my proposal passes is a simple format string

_stp_trace("%2b%4b", current->pid, current)

There is a very small additional overhead involved in parsing that
format string. That should be offset by not always having to write 64-
bit binary values.

gBTI also defines a packet format to be used over the transport.
[seq-id][\0][num][arg1][arg2]....[arg(num-1)]

I am not doing that. Whatever program is processing the output just
needs to understand the format you write.

Martin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26 18:06     ` Martin Hunt
@ 2006-01-26 19:11       ` Satoshi Oshima
  0 siblings, 0 replies; 29+ messages in thread
From: Satoshi Oshima @ 2006-01-26 19:11 UTC (permalink / raw)
  To: Martin Hunt; +Cc: jrs, systemtap

Martin Hunt wrote:
> On Thu, 2006-01-26 at 11:46 -0600, Jose R. Santos wrote:
> 
> 
>>This sound like a good idea, but I wonder how it will perform compared 
>>to gBTI.  For the purposes of trace, I think the best implementation is 
>>the one with the best performance and smallest impact on the system.  If 
>>we can get something like this to perform as good or better that gBTI 
>>the is also a good candidate.  
> 
> 
> The difference between the two is minor. gBTI passes in an integer that
> is the number of 64-bit ints to log. 
> 
> _stp_binary_write(2, (int64_t)current->pid, (int64_t)current);
> 
> my proposal passes is a simple format string
> 
> _stp_trace("%2b%4b", current->pid, current)
> 
> There is a very small additional overhead involved in parsing that
> format string. That should be offset by not always having to write 64-
> bit binary values.
> 
> gBTI also defines a packet format to be used over the transport.
> [seq-id][\0][num][arg1][arg2]....[arg(num-1)]
> 
> I am not doing that. Whatever program is processing the output just
> needs to understand the format you write.

Why Masami add [num] field and use 64bit fixed length is
to understand the length of each data.

Tapsets usually write some kind of data structures consist
of different format.

You can see how lkst_gbti.stp find the format type.

function lkst_trace(etype:long, arg1:long, arg2:long, arg3:long, arg4:long) %{
	int64_t tsc;
	rdtscll(tsc);
	_stp_binary_write (6,
			  THIS->etype << 40 |
			  (int64_t)smp_processor_id() << 32 | current->pid,
			  tsc, THIS->etype, THIS->arg1, THIS->arg2,
			  THIS->arg3, THIS->arg4);
%}

lkst_trace() have etype argument. This indicates the format.

Generally, we need to include length and format type information
in the header of binary data packet.

gBTI provides length information but dones not provide format
type information. So user who writes tapset should/must include
format information as a contents.

On the other hand, your proposal requires that user must calculate
and inlude length and format information. And it is much flexible
comparing to gBTI.

I think we should have both.


------------------
Satoshi Oshima
Hitachi Computer Product (America) Inc.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26  9:11 ` Martin Hunt
  2006-01-26 17:46   ` Jose R. Santos
@ 2006-01-27  2:58   ` Frank Ch. Eigler
  2006-01-27 10:50     ` Martin Hunt
  1 sibling, 1 reply; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-27  2:58 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap


hunt wrote:

> [...]
> probe kernel.function("sys_open")
> {
> 	trace("%d%d%d%s%d", HOOKID_OPEN, $flags, $mode, $filename, $fd)
> }
> [...]
> The trace function would then just directly call _stp_trace(), which
> would be simple to implement.
> [...]

There are several differences between this scheme (basically a printf
with a different back-end function in the runtime) and the others.
The straightforward implementation would require copying of the
individual data values onto and off the stack, a function call,
format string parsing in the runtime, and possible problems with
composing a large trace record from several pieces (e.g.  common
timestamps).

Think of my proposed trace() special function as something that is
compiled right down into a single assignment per field.  The runtime
would not be called except at the beginning and the end of the probe,
only to reserve and to flush a trace structure.

- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-26 14:41       ` Frank Ch. Eigler
  2006-01-26 17:32         ` Jose R. Santos
@ 2006-01-27  3:04         ` Masami Hiramatsu
  2006-02-02  0:05           ` Frank Ch. Eigler
  1 sibling, 1 reply; 29+ messages in thread
From: Masami Hiramatsu @ 2006-01-27  3:04 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: systemtap, jrs, Yumiko Sugita, Satoshi Oshima, Hideo Aoki

Hi,

Frank Ch. Eigler wrote:
> Hi -
>
>
>>[...] One of simpler ways, I think, is to allow the script functions
>>to take long arrays as its arguments. Thus, we can transport whole
>>the array to user space as like below;
>>
>>function array_log ( map:long[] ) %{
>>[...]
>>%}
>
>
> Can you think of a scenario where tracing an entire array could be
> necessary in a time-critical place?  If not, then coding the array
> printing by an explicit foreach would not be a serious burden.

I think this feature is useful for the cases below.

- Shortage of Stack
There may be trace points at where there is not enough memory of
stack. The variable arguments may consume a lot of memory from
the stack.
On the other hand, an array argument consumes a bit of memory to
store a pointer.

- Single Probe for Various Events
For simplicity, single probe should be able to handle various
events. Please Assume a probe which traces two events; One event
has two argument, and another has five arguments. Obviously, the
probe should record five arguments for each event as below:

// definitions of events
probe kernel.trace.event1 = kernel.function("sys_gettimeofday").return {
	etype = 1; arg1 = retval(); nargs = 1
}
probe kernel.trace.event2 = kernel.function(""do_fork") {
	etype = 2; arg1 = pid(); arg2 = $clone_flags;
	arg3 = $stack_start;arg4 = $stack_size; nargs = 4;
}
// probe handler
probe kernel.trace.* {
	binary_log(nargs + 1,etype, arg1,arg2,arg3,arg4);
}
In this case, binary_log() function *always* gets six arguments,
and transports two or five arguments to user space. I think it
would be just wastes of stack.
On the other hand, an array uses only necessary memory.

Best regards,

-- 
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: hiramatu@sdl.hitachi.co.jp

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-27  2:58   ` Frank Ch. Eigler
@ 2006-01-27 10:50     ` Martin Hunt
  2006-01-27 13:05       ` Frank Ch. Eigler
  2006-01-27 16:06       ` Tom Zanussi
  0 siblings, 2 replies; 29+ messages in thread
From: Martin Hunt @ 2006-01-27 10:50 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

If I understand it correctly, you propose getting a buffer, having
numerous trace() calls, which the translator directly writes into the
buffer, then a call to the runtime to send the accumulated trace data?
How would this interact with normal ASCII data (if you mix printf() with
trace())? Also how do you specify the format of the data stored?

On Thu, 2006-01-26 at 21:57 -0500, Frank Ch. Eigler wrote:
> There are several differences between this scheme (basically a printf
> with a different back-end function in the runtime) and the others.
> The straightforward implementation would require copying of the
> individual data values onto and off the stack, a function call,
> format string parsing in the runtime, and possible problems with
> composing a large trace record from several pieces (e.g.  common
> timestamps).

If you use the trace function I proposed, each call logs as many args as
you give it in the format provided by the format string. There is no
"composing large trace record" problem. Each call is one record.

> Think of my proposed trace() special function as something that is
> compiled right down into a single assignment per field.  The runtime
> would not be called except at the beginning and the end of the probe,
> only to reserve and to flush a trace structure.

That might be good, depending on the details. But it is not more
efficient.

----

While I've got the numbers handy, I did some profiling and here are the
results on my 2.6GHz desktop.

printf() of 6 64-bit integers, converted to ASCII averaged 2.5 usecs
but depends on how large the numbers are. Function call overhead,
including copying the values on the stack, etc was .03 usecs.

I expect a binary printf of 6 64-bit values to take between 0.2 and 0.4
usecs based on a quick prototype. ASCII printf performance could be
improved too with a rewrite.

Martin



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-27 10:50     ` Martin Hunt
@ 2006-01-27 13:05       ` Frank Ch. Eigler
  2006-01-27 18:25         ` Martin Hunt
  2006-01-27 16:06       ` Tom Zanussi
  1 sibling, 1 reply; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-27 13:05 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap

Hi -

hunt wrote:

> If I understand it correctly, you propose getting a buffer, having
> numerous trace() calls, which the translator directly writes into the
> buffer, then a call to the runtime to send the accumulated trace data?

Yes, or rather Tom Zanussi did in
<http://sourceware.org/ml/systemtap/2006-q1/msg00257.html>. 
My humble contribution is the notation - not struct field writes
but function calls.

> How would this interact with normal ASCII data (if you mix printf()
> with trace())?

I don't know, I'm only the notation guy.  :-) I figure wherever the
proposed gbti function would plop its output is where this would go.

> Also how do you specify the format of the data stored?

The translator could actually emit some metadata (field names & types)
to allow decoding of the binary trace data, in the form of a perl
script, LTTV xml stuff, C struct declaration.

> [...]
> While I've got the numbers handy, I did some profiling and here are the
> results on my 2.6GHz desktop.

Thanks.

> printf() of 6 64-bit integers, converted to ASCII averaged 2.5 usecs
> [...] I expect a binary printf of 6 64-bit values to take between
> 0.2 and 0.4 usecs based on a quick prototype.  ASCII printf
> performance could be improved too with a rewrite.

That's still orders of magnitude higher than a single binary
assignment.  This may be one of those situations where the runtime's
assistance is neither helpful nor necessary.

- FCHE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-27 10:50     ` Martin Hunt
  2006-01-27 13:05       ` Frank Ch. Eigler
@ 2006-01-27 16:06       ` Tom Zanussi
  1 sibling, 0 replies; 29+ messages in thread
From: Tom Zanussi @ 2006-01-27 16:06 UTC (permalink / raw)
  To: Martin Hunt; +Cc: Frank Ch. Eigler, systemtap

Martin Hunt writes:
 > If I understand it correctly, you propose getting a buffer, having
 > numerous trace() calls, which the translator directly writes into the
 > buffer, then a call to the runtime to send the accumulated trace data?
 > How would this interact with normal ASCII data (if you mix printf() with
 > trace())? Also how do you specify the format of the data stored?
 > 

Sorry, I've been occupied with other things, so haven't been able to
spend much time on this...

Anyway, the gist of it was that it with some language support, it
should be possible to have the translator generate the most efficient
code possible for logging variable-length trace records, which in many
cases would be a high-frequency occurence and so, worth optimizing.

I assumed we could add a transport-independent run-time function
called say _stp_reserve() which would reserve a slot of a given length
in the current 'trace buffer' whatever that might be for a given
transport.  The translator would have enough info from the set of
trace() calls Frank suggested to 1) calculate the size of the slot
needed to reserve space in the buffer and 2) generate the code that
would write the result of the trace() calls into the reserved slot.
There shouldn't need to be a 'flush' or anything else after the last
trace(), because the data would already be in its final destination.
I think Frank also suggested this could be done using something like
your trace syntax.

As for mixing the normal ASCII data and binary data, maybe it makes
more sense to not have to deal with that at all and use separate
streams for each...



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-27 13:05       ` Frank Ch. Eigler
@ 2006-01-27 18:25         ` Martin Hunt
  2006-01-30 15:12           ` Frank Ch. Eigler
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Hunt @ 2006-01-27 18:25 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

On Fri, 2006-01-27 at 08:05 -0500, Frank Ch. Eigler wrote:

> > printf() of 6 64-bit integers, converted to ASCII averaged 2.5 usecs
> > [...] I expect a binary printf of 6 64-bit values to take between
> > 0.2 and 0.4 usecs based on a quick prototype.  ASCII printf
> > performance could be improved too with a rewrite.
> 
> That's still orders of magnitude higher than a single binary
> assignment.  

Yes. It appears to be the buffer reservation and manipulation that is
expensive. I have some ideas how to speed it up. It is done wrong now
anyway using old-fashioned percpu code.

> This may be one of those situations where the runtime's
> assistance is neither helpful nor necessary.

It may also be that a few tweaks of the current code can speed up
everything and satisfy the requirement for a generic binary interface.
I prefer not to create some special case until we understand the whole
problem.

Martin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-27 18:25         ` Martin Hunt
@ 2006-01-30 15:12           ` Frank Ch. Eigler
  0 siblings, 0 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-01-30 15:12 UTC (permalink / raw)
  To: Martin Hunt; +Cc: systemtap


hunt wrote:

> [...]
> > That's still orders of magnitude higher than a single binary
> > assignment.  
> 
> Yes. It appears to be the buffer reservation and manipulation that is
> expensive. [...]

Please explain why you think so ("the").  Others have observed that
kernel vscnprintf is a problem too.

In any case, I don't like the binary tracing idea only because it does
away with some of the buffering/copying of the formatted output.  It's
because it eliminates run-time conditionals (interpreting of
formatting directives) and unnecessary copying of the individual data
values (to a runtime formatting function).  It's the difference
between compilation and interpretation.

> > This may be one of those situations where the runtime's
> > assistance is neither helpful nor necessary.
> 
> It may also be that a few tweaks of the current code can speed up
> everything and satisfy the requirement for a generic binary interface.

A glance at the runtime/transport code suggests that while there are
opportunities for improvement (e.g. copying between various buffers in
io.c vs transport/*), these benefits would be shared by both printf
and the hypothetical binary trace widget.

> I prefer not to create some special case until we understand the
> whole problem.

What understanding do you think we lack?


- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implementing a generic binary trace interface.
  2006-01-27  3:04         ` Masami Hiramatsu
@ 2006-02-02  0:05           ` Frank Ch. Eigler
  0 siblings, 0 replies; 29+ messages in thread
From: Frank Ch. Eigler @ 2006-02-02  0:05 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: systemtap

Hi -


On Fri, Jan 27, 2006 at 12:04:23PM +0900, Masami Hiramatsu wrote:

> [...]
> > Can you think of a scenario where tracing an entire array could be
> > necessary in a time-critical place?  [...]
> 
> I think this feature is useful for the cases below.
> - Shortage of Stack
>
> There may be trace points at where there is not enough memory of
> stack. The variable arguments may consume a lot of memory from
> the stack. [...]

But to insert values into an array, or to pull them back out, involves
possibly deeply nested function calls into the runtime.  These take
far more dynamic stack space than one or two extra paramters to a
single function.

Please note that my current favorite scheme is different, where
individual traced values would not be passed to a function like the
gBTI trace(), but rather passed as an specialized binary struct to
some buffer reservation routine, so the above is just hypothetical.

> - Single Probe for Various Events
> [...]
> // definitions of events
> probe kernel.trace.event1 = kernel.function("sys_gettimeofday").return {
> 	etype = 1; arg1 = retval(); nargs = 1
> }
> probe kernel.trace.event2 = kernel.function(""do_fork") {
> 	etype = 2; arg1 = pid(); arg2 = $clone_flags;
> 	arg3 = $stack_start;arg4 = $stack_size; nargs = 4;
> }
> // probe handler
> probe kernel.trace.* {
> 	binary_log(nargs + 1,etype, arg1,arg2,arg3,arg4);
> }
> [...]

To spell out the other (struct-based) binary tracing scenario, the
above could look like this instead:
 
# probe kernel.trace.event1 = kernel.function("sys_gettimeofday").return {
#       trace(retval())
# }
# probe kernel.trace.event2 = kernel.function(""do_fork") {
#       trace(pid())
#       trace($clone_flags)
#       trace($stack_start)
#       trace($stack_size)
# }
# // invoke all aliases
# probe kernel.trace.* { }

Note:

- systemtap could uniquely identify each trace record by assigning
  each of them a unique type (ID) number automatically
- that mapping can be reversed to the probe point (pp()) automatically
  when the binary data is decoded
- no explicit function call is needed to send the data on its way,
  that too is automatic (due to presence of trace())

 
> On the other hand, an array uses only necessary memory.

Even an empty systemtap array takes a lot of heap memory.


- FChE

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2006-02-02  0:05 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-24 15:20 Implementing a generic binary trace interface Jose R. Santos
2006-01-24 16:21 ` William Cohen
2006-01-24 17:03   ` Jose R. Santos
2006-01-24 16:42 ` Mathieu Desnoyers
2006-01-24 16:55 ` Tom Zanussi
2006-01-26  2:27   ` Frank Ch. Eigler
2006-01-26  4:29     ` Jose R. Santos
2006-01-26  5:07       ` Frank Ch. Eigler
2006-01-26 14:23     ` Tom Zanussi
2006-01-24 17:18 ` Frank Ch. Eigler
2006-01-24 17:55   ` Jose R. Santos
2006-01-24 18:30   ` Tom Zanussi
2006-01-25  6:37 ` Masami Hiramatsu
2006-01-25 14:45   ` Jose R. Santos
2006-01-26  8:51     ` Masami Hiramatsu
2006-01-26 14:41       ` Frank Ch. Eigler
2006-01-26 17:32         ` Jose R. Santos
2006-01-27  3:04         ` Masami Hiramatsu
2006-02-02  0:05           ` Frank Ch. Eigler
2006-01-26  9:11 ` Martin Hunt
2006-01-26 17:46   ` Jose R. Santos
2006-01-26 18:06     ` Martin Hunt
2006-01-26 19:11       ` Satoshi Oshima
2006-01-27  2:58   ` Frank Ch. Eigler
2006-01-27 10:50     ` Martin Hunt
2006-01-27 13:05       ` Frank Ch. Eigler
2006-01-27 18:25         ` Martin Hunt
2006-01-30 15:12           ` Frank Ch. Eigler
2006-01-27 16:06       ` Tom Zanussi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).