RE: Proposed systemtap access to perfmon hardware

public inbox for systemtap@sourceware.org

* RE: Proposed systemtap access to perfmon hardware
@ 2006-03-22 23:46 Stone, Joshua I
  2006-03-23 12:54 ` Maynard Johnson
  0 siblings, 1 reply; 18+ messages in thread
From: Stone, Joshua I @ 2006-03-22 23:46 UTC (permalink / raw)
  To: maynardj, William Cohen; +Cc: SystemTAP

Maynard Johnson wrote:
> William Cohen wrote:
>> The individual start and stop operations would be allowed.
> This is not so good.  Besides the fact that it may be difficult (or
> impossible) to do, I don't see it being all that useful.  But then,
> I'm a tool developer, not a performance analyst, so I could be
> missing the point.

Enabling start & stop lets you narrow the context that you want to
measure.  Perfmon can only give you thread level virtualization of the
counters.  With start & stop I can, for example, start the counters when
I enter sys_open and stop when I return.  Now if I want I can get a
microbenchmark of IPC for the sys_open call (and its callees).

But this also opens up possibilities for more obscure "contexts" -
perhaps I want to start counting when a network packet is received and
stop when it is delivered to the thread.  Any probepoint you can do
today can become a start/stop point for the counters.

Josh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Proposed systemtap access to perfmon hardware
  2006-03-22 23:46 Proposed systemtap access to perfmon hardware Stone, Joshua I
@ 2006-03-23 12:54 ` Maynard Johnson
  2006-03-23 14:46   ` William Cohen
  0 siblings, 1 reply; 18+ messages in thread
From: Maynard Johnson @ 2006-03-23 12:54 UTC (permalink / raw)
  To: Stone, Joshua I; +Cc: William Cohen, SystemTAP

Stone, Joshua I wrote:

>Maynard Johnson wrote:
>>William Cohen wrote:
>>>The individual start and stop operations would be allowed.
>>This is not so good.  Besides the fact that it may be difficult (or
>>impossible) to do, I don't see it being all that useful.  But then,
>>I'm a tool developer, not a performance analyst, so I could be
>>missing the point.
>
>Enabling start & stop lets you narrow the context that you want to
>measure.  Perfmon can only give you thread level virtualization of the
>counters.  With start & stop I can, for example, start the counters when
>I enter sys_open and stop when I return.  Now if I want I can get a
>microbenchmark of IPC for the sys_open call (and its callees).
>
>But this also opens up possibilities for more obscure "contexts" -
>perhaps I want to start counting when a network packet is received and
>stop when it is delivered to the thread.  Any probepoint you can do
>today can become a start/stop point for the counters.
>  
>
Yes, I can certainly see this benefit.  It gives you PAPI-level control 
without having to modify source code.  My concern, however, was that if 
you have multiple counters configured, then individual control of them 
presents an extra level of difficulty.  But, as I've been thinking about 
this a bit more, I think this could be done if you can guarantee that 
the operation is not preempted or interrupted.  Then, the PMU can be 
disabled, reloaded with any changes, and then re-enabled.  Then, any 
counters that had been running before the operation -- and that were not 
changed by the operation -- will be reloaded with their previous count 
and continue running from where they left off.
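
The disable / reload / re-enable sequence can be sketched with a toy model. This is Python rather than kernel C, and the `Pmu` class, its fields, and `reconfigure` are hypothetical names for illustration only; none of them come from perfmon2. The sketch assumes the caller has already ruled out preemption and interrupts for the duration of the call.

```python
class Pmu:
    """Toy model of a PMU: a global enable bit plus per-counter state."""

    def __init__(self):
        self.enabled = False
        # index -> {"event": str, "count": int, "running": bool}
        self.counters = {}

    def tick(self, events):
        """Advance running counters; `events` maps event name -> occurrences."""
        if not self.enabled:
            return
        for c in self.counters.values():
            if c["running"]:
                c["count"] += events.get(c["event"], 0)


def reconfigure(pmu, changes):
    """Apply `changes` (index -> new counter state) while the PMU is disabled."""
    pmu.enabled = False                                   # freeze every counter
    saved = {i: dict(c) for i, c in pmu.counters.items()}
    reloaded = {}
    for i in set(saved) | set(changes):
        # Counters named in `changes` get the new state; all others are
        # reloaded with the count they held when the PMU was disabled.
        reloaded[i] = dict(changes[i]) if i in changes else saved[i]
    pmu.counters = reloaded
    pmu.enabled = True                                    # resume counting
```

A counter that is not named in `changes` resumes from its saved count, which is the property the paragraph above relies on.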

-Maynard

>
>Josh
>  
>



* Re: Proposed systemtap access to perfmon hardware
  2006-03-23 12:54 ` Maynard Johnson
@ 2006-03-23 14:46   ` William Cohen
  0 siblings, 0 replies; 18+ messages in thread
From: William Cohen @ 2006-03-23 14:46 UTC (permalink / raw)
  To: Maynard Johnson; +Cc: Stone, Joshua I, SystemTAP

Maynard Johnson wrote:
> Stone, Joshua I wrote:
> 
>> Maynard Johnson wrote:
>>> William Cohen wrote:
>>>> The individual start and stop operations would be allowed.
>>> This is not so good.  Besides the fact that it may be difficult (or
>>> impossible) to do, I don't see it being all that useful.  But then,
>>> I'm a tool developer, not a performance analyst, so I could be
>>> missing the point.
>>
>> Enabling start & stop lets you narrow the context that you want to
>> measure.  Perfmon can only give you thread level virtualization of the
>> counters.  With start & stop I can, for example, start the counters when
>> I enter sys_open and stop when I return.  Now if I want I can get a
>> microbenchmark of IPC for the sys_open call (and its callees).

The place where I see start and stop being most useful is sampling. 
Start the sampling when some event occurs and stop it when another 
event occurs, to get a statistical picture of what is going on in a 
certain region. It would be possible to have a flag in the sample 
routine to turn recording of samples on and off. However, this would 
mean the sampling would start counting when the sampling is turned on.

For the interval measurements it may be possible to leave the counter 
counting. This would avoid messing with the performance counter state. 
In perfmon_start_counter(), mark the status as running and accumulate 
the count in the running state. In perfmon_stop_counter(), mark the 
status as stopped and accumulate the count in the stopped state. This 
could be implemented for the global version. It might be a bit more 
complicated for the per-process version: there is state information for 
each context, and I am not sure whether the additional information for 
managing the counter software state could fit in the context 
information for a thread.
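
One reading of the running/stopped bookkeeping described above, as a minimal Python sketch. The class `SoftGatedCounter` and the `read_hw` callable are hypothetical stand-ins; the sketch assumes the hardware counter never stops and does not wrap between state changes.

```python
class SoftGatedCounter:
    """Software start/stop layered on a hardware counter that never stops."""

    def __init__(self, read_hw):
        self.read_hw = read_hw          # callable returning the raw count
        self.running = False
        self.last = read_hw()           # raw value at the last state change
        self.totals = {"running": 0, "stopped": 0}

    def _account(self):
        # Charge the events since the last state change to the current state.
        now = self.read_hw()
        key = "running" if self.running else "stopped"
        self.totals[key] += now - self.last
        self.last = now

    def start(self):
        self._account()
        self.running = True

    def stop(self):
        self._account()
        self.running = False

    def get(self):
        self._account()
        return self.totals["running"]
```

Only the two software totals and the last raw reading need to be stored per counter, which is the state that would have to fit in the per-thread context.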

>>
>> But this also opens up possibilities for more obscure "contexts" -
>> perhaps I want to start counting when a network packet is received and
>> stop when it is delivered to the thread.  Any probepoint you can do
>> today can become a start/stop point for the counters.
>>  
>>
> Yes, I can certainly see this benefit.  It gives you PAPI-level control 
> without having to modify source code.  My concern, however, was that if 
> you have multiple counters configured, then individual control of them 
> presents an extra level of difficulty.  But, as I've been thinking about 
> this a bit more, I think this could be done if you can guarantee that 
> the operation is not preempted or interrupted.  Then, the PMU can be 
> disabled, reloaded with any changes, and then re-enabled.  Then, any 
> counters that had been running before the operation -- and that were not 
> changed by the operation -- will be reloaded with their previous count 
> and continue running from where they left off.
> 
> -Maynard
> 
>>
>> Josh

Having the probe specify whether the performance counter is in the 
running or stopped state at setup is a good idea.

As far as when the performance counters are set up and torn down, it 
seems most reasonable to set them up before the first probe begin 
action and tear them down after the last probe end action. For 
sampling, this would mean explicitly stopping the sampling if you do 
not want any additional samples while the probe end action runs.

-Will


* RE: Proposed systemtap access to perfmon hardware
@ 2006-03-23 17:09 Stone, Joshua I
  0 siblings, 0 replies; 18+ messages in thread
From: Stone, Joshua I @ 2006-03-23 17:09 UTC (permalink / raw)
  To: William Cohen; +Cc: SystemTAP, Maynard Johnson

William Cohen wrote:
> As far as when the performance counters are set up and torn down it
> seems like it would be most reasonable to set them up before the first
> probe begin action and tear them down after the last probe end action.

I agree

> This would mean for sampling would need to have it stop sampling if
> don't want any additional samples while doing the probe end action.

This should be gated for you by using the session_state.  Begin probes
only run during STAP_SESSION_STARTING, normal probes (including perfmon
sampling) should only run during STAP_SESSION_RUNNING, and end probes
only run during STAP_SESSION_STOPPING.  If a probe is entered during a
state that doesn't match what it expects, it just returns without
taking any action.
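
The gating rule can be modelled in a few lines. This is a Python sketch, not the runtime's C code; `Session`, `fire`, and the enum are hypothetical names that mirror the STAP_SESSION_* states mentioned above.

```python
from enum import Enum


class SessionState(Enum):
    STARTING = 1
    RUNNING = 2
    STOPPING = 3


class Session:
    def __init__(self):
        self.state = SessionState.STARTING
        self.log = []                    # names of probes that actually ran

    def fire(self, name, expected):
        """Run the probe only if the session is in the state it expects."""
        if self.state is not expected:
            return False                 # mismatched state: return without acting
        self.log.append(name)
        return True
```

Under this model a sampling probe that fires after the session has moved to the stopping state is dropped, so no extra samples are recorded while end probes run.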

Josh


* RE: Proposed systemtap access to perfmon hardware
@ 2006-03-22 23:23 Stone, Joshua I
  0 siblings, 0 replies; 18+ messages in thread
From: Stone, Joshua I @ 2006-03-22 23:23 UTC (permalink / raw)
  To: fche; +Cc: systemtap

fche@redhat.com wrote:
> joshua wrote:
>>  [...]
>>> If they are started by default, where exactly are they running?
>>> Beginning of begin probe? End of begin probe?
>> 
>> I think the perfmon setup needs to happen before all begin probes, so
>> that the handle can be accessed within begin probes.  [...]
> 
> How important would that be?  At this time, we don't provide any
> ordering guarantees amongst begin/end probes.  If a begin probe would
> have to manipulate the handle, why not put that right into the
> perfctr.*.setup probe directly?

Perhaps it isn't that important - I didn't consider putting the work in
the ".setup" probe.  This is fine as long as the setup probe has the
guarantee that the counters haven't started yet (regardless of its
position in the begin stage)

It might be nice to have a similar shutdown/cleanup/whatever probe that
guarantees that the counters have been stopped.  This is superfluous if
we guarantee that the counters will be stopped before running any end
probes, but I haven't seen discussion on that point yet.  We should also
define when the handle is invalidated - this could be at the conclusion
of the cleanup probe, if it exists, or after all end probes have
completed.

Josh


* RE: Proposed systemtap access to perfmon hardware
@ 2006-03-22 19:09 Stone, Joshua I
  2006-03-22 20:04 ` Frank Ch. Eigler
  0 siblings, 1 reply; 18+ messages in thread
From: Stone, Joshua I @ 2006-03-22 19:09 UTC (permalink / raw)
  To: William Cohen; +Cc: SystemTAP

William Cohen wrote:
> It is an open question what the counter defaults are: do they start
> running by default or have to be explicitly started?

This could be solved with an extension on the probe declaration.  For
example, the sensible default is probably to start the counters, but we
could allow a ".paused" to override that default.

> If they are started by default, where exactly are they running?
> Beginning of begin probe? End of begin probe?

I think the perfmon setup needs to happen before all begin probes, so
that the handle can be accessed within begin probes.  I don't know that
it really matters when you actually start the counters, but I would lean
towards putting that after.

A side note about handles - it might be useful to add a language
semantic to make it easier to capture handles.  There are actions that
would make sense for handles of other probe types besides perfmon,
especially being able to dynamically enable/disable kprobes and timers.

Josh


* Re: Proposed systemtap access to perfmon hardware
  2006-03-22 19:09 Stone, Joshua I
@ 2006-03-22 20:04 ` Frank Ch. Eigler
  0 siblings, 0 replies; 18+ messages in thread
From: Frank Ch. Eigler @ 2006-03-22 20:04 UTC (permalink / raw)
  To: Stone, Joshua I; +Cc: systemtap

joshua wrote:

>  [...]
> > If they are started by default, where exactly are they running?
> > Beginning of begin probe? End of begin probe?
> 
> I think the perfmon setup needs to happen before all begin probes, so
> that the handle can be accessed within begin probes.  [...]

How important would that be?  At this time, we don't provide any
ordering guarantees amongst begin/end probes.  If a begin probe would
have to manipulate the handle, why not put that right into the
perfctr.*.setup probe directly?

> A side note about handles [...]  especially being able to
> dynamically enable/disable kprobes and timers.

Actual disarming of these kinds of probes is heavy-weight and may not
be safely done from within the confines of some generic
interrupt-disabled atomic probe handler.  For the perfctr case, we may
just directly poke at registers.

Programmatic control of swaths of probes is an interesting problem
though.

- FChE


* Proposed systemtap access to perfmon hardware
@ 2006-03-15 16:24 William Cohen
  2006-03-15 22:34 ` Frank Ch. Eigler
  2006-03-22  3:34 ` Maynard Johnson
  0 siblings, 2 replies; 18+ messages in thread
From: William Cohen @ 2006-03-15 16:24 UTC (permalink / raw)
  To: SystemTAP

[-- Attachment #1: Type: text/plain, Size: 201 bytes --]

I have written up material describing how I would think that systemtap 
could use the performance monitoring hardware. It is a work in progress, 
but I would appreciate people's comments on it.

-Will

[-- Attachment #2: stapperfmon.txt --]
[-- Type: text/plain, Size: 5928 bytes --]

Systemtap Performance Monitoring Hardware Support Proposal

March 15, 2006

Most modern processors have performance monitoring hardware that can
count events such as processor clock cycles, memory references, cache
misses, branches, and branch mispredictions.  The hardware counts can
be used directly to gauge the cost of operations, or the counts can be
used to trigger sampling to find out where these operations occur in
code.  SystemTap should have the ability to use this performance
monitoring hardware to indicate what the underlying causes of
performance problems are.

SYSTEMTAP PERFORMANCE MONITORING API

perfmon_allocate_counter:long (event_spec:string)

All the perfmon_allocate_counter() calls must be in the probe begin
(removing this restriction will be considered later). A string as
specified in the EVENT SPECIFICATION section describes the performance
counter event configuration. If the configuration is successful, an
event_handle in the form of a non-zero 64-bit value will be returned. A
zero value indicates that there was a problem with the counter
allocation. This event_handle will be used by other functions to
uniquely identify the counter being used. The counters are not set up
or running until the perfmon_create_context is performed.

perfmon_free_counter:long (event_handle:long)

All perfmon_free_counter() calls must be in the probe end (removing
this restriction will be considered later). The function returns the
event_handle for a successful free operation and zero for an
unsuccessful operation.

perfmon_create_context:long ()

The perfmon_create_context command sets up the performance monitoring
hardware for the allocated contexts and starts the counters running.
If successful, the function will return zero. If the operation is
unsuccessful, an error code will be returned. This function
should only be used in probe begin. (FIXME list error codes returned.)

perfmon_get_counter:long (event_handle:long)

The event_handle passed in indicates which counter to read. The value
is returned as a 64-bit long of the current counter value; the counter
could be either running or stopped.  The return value is undefined for
an invalid event_handle.

perfmon_start_counter:long (event_handle:long)

The event_handle passed in indicates which counter to start. The value
is returned as a 64-bit long of the current counter value.  The return
value is undefined for an invalid event_handle.

perfmon_stop_counter:long (event_handle:long)

The event_handle passed in indicates which counter to stop. The value
is returned as a 64-bit long of the current counter value.  The return
value is undefined for an invalid event_handle.

perfmon_handle_to_string:string (event_handle:long)

The perfmon_handle_to_string operation returns the string used by the
perfmon_allocate_counter to generate the handle.

probe kernel.perfmon.sample(event_handle:long) {/*body*/}

The kernel.perfmon.sample probe indicates the action to perform when
the counter specified by event_handle overflows. This could be
triggered at any time, so the context information is limited to the
same data available for an asynchronous timer probe.

The event_handle is a global variable in the instrumentation
script. Multiple probes for a particular global variable are allowed.
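
The handle semantics of the functions above (non-zero handle on success, zero on failure, free returning the handle it released) can be illustrated with a small Python model. `PerfmonModel` and its methods are hypothetical and touch no hardware; returning an empty string for an unknown handle is an assumption, since the proposal does not define that case.

```python
class PerfmonModel:
    """Toy model of the proposed handle semantics; no hardware is touched."""

    def __init__(self, known_events):
        self.known_events = set(known_events)
        self.specs = {}        # handle -> event_spec string
        self.next_handle = 1   # handles are non-zero; 0 signals failure

    def allocate_counter(self, event_spec):
        if event_spec not in self.known_events:
            return 0                       # configuration could not be set up
        handle = self.next_handle
        self.next_handle += 1
        self.specs[handle] = event_spec
        return handle

    def free_counter(self, handle):
        if handle not in self.specs:
            return 0                       # unsuccessful free
        del self.specs[handle]
        return handle                      # successful free returns the handle

    def handle_to_string(self, handle):
        # Assumption: unknown handles map to the empty string.
        return self.specs.get(handle, "")
```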

EVENT SPECIFICATION

The performance monitoring events are specified in strings. The
information must at the very least include the name of the event being
monitored by the counter.  Additional information would include an
event mask to specify subevents, whether to count in kernel or user
space, whether to keep track of counts on a per-thread or per-CPU
basis, and the interval for the sampling.

(FIXME more detail on the string compositions)

SYSTEMTAP PERFORMANCE HARDWARE ACCESS IMPLEMENTATION

SystemTap access to the performance monitoring hardware is planned to
be built on the perfmon2 kernel support. Perfmon2 provides reservation
of and access to the performance monitoring hardware on ia64, i386, and
PowerPC processors. The perfmon2 support is not yet in the upstream
kernels, but patches are available.

Outline where things are done.

	In Translator:
	   group all probe kernel.perfmon.sample() together

	In perfmon tapset:
	   perfmon_allocate_counter()
	   perfmon_free_counter()
	   perfmon_create_context()
	   perfmon_get_counter()
	   perfmon_start_counter()
	   perfmon_stop_counter()
	   perfmon_handle_to_string()

	On startup (probe begin):
	   if perfmon.sample used, register perfmon custom buffer mechanism
	   The following steps will need some work done in userspace (libpfm):
	   -translate each of the perfmon_allocate_counter into perfmon config
	   -set up the perfmon contexts (either per processor or per pid)
	   -activate the perfmon contexts

	On shutdown (probe end):
	   The following steps will need some work done in userspace (libpfm):
	   -destroy the perfmon contexts
	   -if perfmon.sample used, unregister perfmon custom buffer mechanism

FIXME more details on the proposed implementation.

SYSTEMTAP PERFMON ISSUES

-There are numerous constraints on event setup. It is possible to
 request a configuration that cannot be set up in the performance
 monitoring hardware.

-This mechanism does not provide access to other related information
 provided by the performance monitoring hardware, e.g. the performance
 monitoring registers storing the data address that caused a cache miss
 on ia64.

-The perfmon clones the context for new threads that have the perfmon
 context set up, but we probably do not want to attach to each
 existing thread and set up the context on it. That is going to be
 relatively expensive.

-Perfmon can either do global or per thread monitoring, but they
 cannot be mixed.

REFERENCES

Stephane Eranian, The perfmon2 interface specification
HP Laboratories, HPL-2004-200(R.1), February 7, 2005.
http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html


* Re: Proposed systemtap access to perfmon hardware
  2006-03-15 16:24 William Cohen
@ 2006-03-15 22:34 ` Frank Ch. Eigler
  2006-03-17 16:20   ` William Cohen
  2006-03-22  3:34 ` Maynard Johnson
  1 sibling, 1 reply; 18+ messages in thread
From: Frank Ch. Eigler @ 2006-03-15 22:34 UTC (permalink / raw)
  To: William Cohen; +Cc: systemtap

wcohen wrote:

> [...]  I have written up material describing how I would think that
> systemtap could use the performance monitoring hardware. [...]

It would be helpful to see a hypothetical script that would use the
proposed API.

Anyway, beyond such script language design issues, the hard part has
been the provision of *some* kernel-side API in terms of which this
stuff can be implemented.  How is that going?

> [...]
> perfmon_allocate_counter:long (event_spec:string)
> perfmon_free_counter:long (event_handle:long)
> perfmon_create_context:long ()
> probe kernel.perfmon.sample(event_handle:long) {/*body*/}

These sound rather like suitable lower level functions that the
translator could use under the covers, and not functions that are
wisely exposed at the script level.

Specifically, I would rather expose each particular event_spec source
as a first-class probe point construct:

# probe perfmon.sample("event_spec") { /* body */ }

This would entail calls to such alloc/free/create functions being
emitted in the probe (un)registration boilerplate.  "event_spec"
could perhaps be expanded into several probe point components, and
result in a periodic run of the handler much like timer.ms(N):

# probe perfmon.counter("tlbmiss").cpu(0).sample(1000) { /* ... */ }

That seems to leave just free-running counter operations:

> perfmon_get_counter:long (event_handle:long)
> perfmon_start_counter:long (event_handle:long)
> perfmon_stop_counter:long (event_handle:long)

One possible way to cast these into the language, and yet retain
automatic initialization/cleanup, might be this:

# probe perfmon.counter("tlbmiss").cpu(0).run { h = $handle }
# probe ANY { perfmon_{get,start,stop}_counter (h) }
# global h

What this would do is to have that perfmon.* probe handler run just
once (during initialization), supplying the script with the
system-assigned handle for this counter.  Then another probe (though
probably not a "begin" one) can use that handle value to manipulate
the counter.

- FChE


* Re: Proposed systemtap access to perfmon hardware
  2006-03-15 22:34 ` Frank Ch. Eigler
@ 2006-03-17 16:20   ` William Cohen
  2006-03-17 17:10     ` Bill Rugolsky Jr.
  2006-03-17 17:34     ` Frank Ch. Eigler
  0 siblings, 2 replies; 18+ messages in thread
From: William Cohen @ 2006-03-17 16:20 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

To get a feel for how the performance monitoring hardware support 
would work in SystemTap, I wrote some simple examples. Below are examples 
for computing IPC, average cycle count, and sampling within a function. 
The IPC and average cycle count examples need a bit of rework to work 
on SMP machines.

Let me know if there are comments or questions on the examples.

-Will

COMPUTING IPC

global cycles_h
global instr_retired_h
probe perfmon.event("cycles") {cycles_h = $handle;}
probe perfmon.event("intr_retired") {instr_retired_h = $handle;}

probe begin {print ("start probe");}
probe end
{
	factor=100;
	ipc = (factor*perfmon_get_counter(instr_retired_h))/
	    perfmon_get_counter(cycles_h);
	print ("ipc is %d.%d \n", ipc/factor, ipc % factor);
}




DETERMINING AVERAGE CYCLE COUNT FOR FUNCTION (AND CHILDREN)

global cycles_h

probe perfmon.event("cycles") {
       cycles_h = $handle;
       perfmon_stop_counter(cycles_h);
}

global count

probe kernel.function("blah"){
       ++count;
       perfmon_start_counter(cycles_h);
}

probe kernel.function("blah").return{
       perfmon_stop_counter(cycles_h);
}

probe begin {print ("start probe");}
probe end
{
	total_cycles=perfmon_stop_counter(cycles_h);
	print ("average count in blah %d\n", total_cycles/count);
}


SAMPLING WITHIN A FUNCTION (AND CHILDREN)

global cycles_h
global where_am_i

probe perfmon.event("cycles").sample(100000) {
       cycles_h = $handle;
       # record where sample occurred
       where_am_i[instruction_pointer()]++;
}

global count

probe kernel.function("blah"){
       ++count;
       perfmon_start_counter(cycles_h);
}

probe kernel.function("blah").return{
       perfmon_stop_counter(cycles_h);
}

probe begin
{
       # turn off the sampling
       perfmon_stop_counter(cycles_h);
       print("start probe");
}

probe end
{
	#write out the where_am_i entries
	print("address\tcount\n");
	foreach ([ip+] in where_am_i) {
		print("0x%x\t%d\n", ip, where_am_i[ip]);
	}
}


* Re: Proposed systemtap access to perfmon hardware
  2006-03-17 16:20   ` William Cohen
@ 2006-03-17 17:10     ` Bill Rugolsky Jr.
  2006-03-17 17:34     ` Frank Ch. Eigler
  1 sibling, 0 replies; 18+ messages in thread
From: Bill Rugolsky Jr. @ 2006-03-17 17:10 UTC (permalink / raw)
  To: William Cohen; +Cc: Frank Ch. Eigler, systemtap

On Fri, Mar 17, 2006 at 11:20:46AM -0500, William Cohen wrote:
> To try to get a feel on how the performance monitoring hardware support 
> would work in SystemTap I wrote some simple examples. Below are examples 
> for computing IPC, average cycle count, and sampling within a function. 
> The IPC and average cycle count function need a bit of rework to work 
> for a SMP machines.
> 
> Let me know if there are comments or questions on the examples.

A non-statistical application for systemtap perfctrs:

In light of a suggestion made by Alan Cox yesterday,

   http://lkml.org/lkml/2006/3/16/118

I'm hacking up some additions to Ingo Molnar's latency tracing patch
that will track retired insns as well as clock cycles, so I can see
where the processor is stalled on I/O, SMM traps, etc., within a given
execution path.

For the particular case that triggered the aforementioned thread, I know
the precise offending instruction, but in the general case a search
for the culprit might be necessary, and doing this with systemtap would
be convenient.

Regards,

	Bill Rugolsky


* Re: Proposed systemtap access to perfmon hardware
  2006-03-17 16:20   ` William Cohen
  2006-03-17 17:10     ` Bill Rugolsky Jr.
@ 2006-03-17 17:34     ` Frank Ch. Eigler
  2006-03-17 20:26       ` William Cohen
  1 sibling, 1 reply; 18+ messages in thread
From: Frank Ch. Eigler @ 2006-03-17 17:34 UTC (permalink / raw)
  To: systemtap


wcohen wrote:

> To try to get a feel on how the performance monitoring hardware
> support would work in SystemTap I wrote some simple examples. 

Nice work.  To flesh out the operational model (and please correct me
if I'm wrong): the way this stuff would all work is:

- The systemtap translator would be linked with libpfm from perfmon2.
  (libpfm license is friendly.)

- This library would be used at translation time to map perfmon.* probe
  point specifications to PMC register descriptions (pfmlib_output_param_t).
  (This will require telling the system the exact target cpu type for
  cross-instrumentation.)

- These descriptions would be emitted into the C code, for actual
  installation during module initialization.  For our first cut, since
  there appears to exist no kernel-side management API at the moment,
  the C code would directly manipulate the PMC registers.  (This means
  no coexistence for oprofile or other concurrent perfctr probing.
  C'est la vie.)

- The "sample" type perfmon probes would map to the same kind of
  dispatch/callback as the current "timer.profile": the probe handler
  should have valid pt_regs available.

- The free-running type perfmon probes, probably named
  "perfctr.SPEC.setup" or ".start" or ".begin" would map to a one-time
  initialization that passes a token (PMC counter number?)  to the
  handler.  Other probe handlers can then query/manipulate the
  free-running counter using that number via the start/stop/query
  functions.

Is that sufficiently detailed to begin an implementation?


> [...] print ("ipc is %d.%d \n", ipc/factor, ipc % factor);

(An aside: we should have a more compact notation for this.  We won't
support floating point numbers, but integers can be commonly scaled
like this.  Maybe printf("%.Nf", value), where N implies a
power-of-ten scaling factor, and printf("%*f", value, scale) for
general factors.)
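
The power-of-ten case can be sketched as a plain integer formatter. This is Python, and `format_scaled` is a hypothetical helper, not a proposed SystemTap builtin. It also pads the fractional part with zeros, which a bare "%d.%d" does not do: a scaled value of 105 with a factor of 100 would print as "1.5" rather than "1.05".

```python
def format_scaled(value, digits):
    """Format an integer that was scaled by 10**digits as a decimal string."""
    scale = 10 ** digits
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), scale)
    if digits == 0:
        return f"{sign}{whole}"
    # Zero-pad the fraction so that 105 at two digits yields "1.05".
    return f"{sign}{whole}.{frac:0{digits}d}"
```

Only integer arithmetic is used, which fits the constraint that the script language has no floating point.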


- FChE


* Re: Proposed systemtap access to perfmon hardware
  2006-03-17 17:34     ` Frank Ch. Eigler
@ 2006-03-17 20:26       ` William Cohen
  2006-03-20 17:27         ` Frank Ch. Eigler
  0 siblings, 1 reply; 18+ messages in thread
From: William Cohen @ 2006-03-17 20:26 UTC (permalink / raw)
  To: Frank Ch. Eigler; +Cc: systemtap

Frank Ch. Eigler wrote:
> wcohen wrote:
> 
> 
>>To try to get a feel on how the performance monitoring hardware
>>support would work in SystemTap I wrote some simple examples. 
> 
> 
> Nice work.  To flesh out the operational model (and please correct me
> if I'm wrong): the way this stuff would all work is:
> 
> - The systemtap translator would be linked with libpfm from perfmon2.
>   (libpfm license is friendly.)

The libpfm  library license is an MIT license, so it should be 
compatible with the systemtap licensing.

> - This library would be used at translation time to map perfmon.* probe
>   point specifications to PMC register descriptions (pfmlib_output_param_t).
>   (This will require telling the system the exact target cpu type for
>   cross-instrumentation.)

Yes, this complicates cross-kernel use (building the instrumentation on 
one system and running it on another). Different processor 
architectures could be used on each. Some performance monitoring systems 
such as PAPI have mappings for some generic names. This might help in 
some cases. However, there are some differences in computer architecture 
that just do not translate to the generic models.

> - These descriptions would be emitted into the C code, for actual
>   installation during module initialization.  For our first cut, since
>   there appears to exist no kernel-side management API at the moment,
>   the C code would directly manipulate the PMC registers.  (This means
>   no coexistence for oprofile or other concurrent perfctr probing.
>   C'est la vie.)

I would prefer to reuse other software to access the performance 
monitoring hardware. I don't want to generate yet another piece of 
software that uses the performance monitoring hardware. We want 64-bit 
values, but a number of the counters are much smaller than that 
(32-bit). On the Pentium 4, access to the performance counters is 
complicated, and I would prefer not to reinvent the code to access 
them. This mechanism will only work with the global setup; things like 
per-thread sampling would be unsupported. We also need to translate 
between the event name and the event number. The tables in OProfile and 
perfmon are getting pretty large, and we would have to keep all that 
information and catch any inability to map an event to a register.
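
As an illustration of the 64-bit point, a narrow wrapping counter can be widened in software by accumulating deltas. This is a Python sketch with hypothetical names (`WideCounter`, `read_hw`); it assumes the counter is read at least once per wrap period, and it is not how perfmon2 or OProfile necessarily do it.

```python
class WideCounter:
    """Extend a narrow wrapping hardware counter to a wider software total."""

    def __init__(self, read_hw, bits=32):
        self.read_hw = read_hw                 # callable returning the raw count
        self.mask = (1 << bits) - 1
        self.last = read_hw() & self.mask
        self.total = 0

    def read(self):
        now = self.read_hw() & self.mask
        # Modular subtraction stays correct across at most one wrap per read.
        self.total += (now - self.last) & self.mask
        self.last = now
        return self.total
```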

One advantage of generating the C code would be that it would work with 
existing RHEL4 kernel.

> - The "sample" type perfmon probes would map to the same kind of
>   dispatch/callback as the current "timer.profile": the probe handler
>   should have valid pt_regs available.

Yes, the pt_regs will be available to the sample type probe.

> - The free-running type perfmon probes, probably named
>   "perfctr.SPEC.setup" or ".start" or ".begin" would map to a one-time
>   initialization that passes a token (PMC counter number?)  to the
>   handler.  Other probe handlers can then query/manipulate the
>   free-running counter using that number via the start/stop/query
>   functions.
 >
> Is that sufficiently detailed to begin an implementation?

Pretty close. The one thing that isn't answered is the division of 
labor for the sampling probes: one-time setup vs. sample handler. We 
want to have some handle set in a global variable for the probe, but do 
not want to execute that every time a sample is collected. For the 
free-running probes it is pretty clear how to handle this.

>>[...] print ("ipc is %d.%d \n", ipc/factor, ipc % factor);
> 
> 
> (An aside: we should have a more compact notation for this.  We won't
> support floating point numbers, but integers can be commonly scaled
> like this.  Maybe printf("%.Nf", value), where N implies a
> power-of-ten scaling factor, and printf("%*f", value, scale) for
> general factors.)

Yes, some scaling mechanism would be nice in some cases. The IPC was 
pretty likely to be around one, so I put in the scaling to give a 
better picture of what is going on.

-Will


* Re: Proposed systemtap access to perfmon hardware
  2006-03-17 20:26       ` William Cohen
@ 2006-03-20 17:27         ` Frank Ch. Eigler
  0 siblings, 0 replies; 18+ messages in thread
From: Frank Ch. Eigler @ 2006-03-20 17:27 UTC (permalink / raw)
  To: William Cohen; +Cc: systemtap

Hi -

wcohen wrote:
> [...]
> > [...]
> > Is that sufficiently detailed to begin an implementation?
>
> Pretty close. The one thing that isn't answered is the division of
> the labor for the sampling probes, onetime setup vs sample handler.
> [...]

In other words, the issue is the desire to control sampling-event-type
counters, not just the free-running counters.  In this case, one might
use both ".setup" and ".sample" probes for the same SPEC:

# probe perfctr.SPEC.setup { h = $handle }
# probe perfctr.SPEC.sample(1234) { /* like timer.profile */ }
# probe ANY { ... perfctr_{start,stop,query} (h) ... }
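
To make the handle idea concrete, here is a toy model in Python of the
division of labor sketched above: a one-time setup hands out a handle,
and any later handler uses that handle to start, stop or query the
counter. It is a simulation only; CounterTable and its methods are
hypothetical names, and tick() stands in for hardware events happening.

```python
class CounterTable:
    """Toy stand-in for free-running counters addressed by a handle."""

    def __init__(self):
        self._count = {}    # handle -> simulated hardware count
        self._enabled = {}  # handle -> whether events are being counted

    def setup(self):
        """One-time initialization; returns the handle for later calls."""
        handle = len(self._count)
        self._count[handle] = 0
        self._enabled[handle] = False
        return handle

    def start(self, handle):
        self._enabled[handle] = True

    def stop(self, handle):
        self._enabled[handle] = False

    def query(self, handle):
        return self._count[handle]

    def tick(self, events):
        """Simulate 'events' hardware events on every enabled counter."""
        for handle in self._count:
            if self._enabled[handle]:
                self._count[handle] += events


counters = CounterTable()
h = counters.setup()   # like the ".setup" probe storing $handle in h
counters.tick(100)     # events before start are not counted
counters.start(h)      # e.g. on entry to a region of interest
counters.tick(40)
counters.stop(h)       # e.g. on return from that region
counters.tick(7)       # events after stop are not counted
print(counters.query(h))
```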

- FChE


* Re: Proposed systemtap access to perfmon hardware
  2006-03-15 16:24 William Cohen
  2006-03-15 22:34 ` Frank Ch. Eigler
@ 2006-03-22  3:34 ` Maynard Johnson
  2006-03-22 18:02   ` William Cohen
  2006-03-22 18:30   ` Frank Ch. Eigler
  1 sibling, 2 replies; 18+ messages in thread
From: Maynard Johnson @ 2006-03-22  3:34 UTC (permalink / raw)
  To: William Cohen; +Cc: SystemTAP

William Cohen wrote:

> [snip]
>
>perfmon_create_context:long ()
>
>The perfmon_create_context command sets up the performance monitoring
>hardware for the allocated contexts and starts the counters running.
>If successful, the function will return zero. If the operation is
>unsuccessful because an error code will be returned. This function
>should only be used in probe begin. (FIXME list error code returned.)
>  
>
I'm confused about the relationship between this function and 
perfmon_start_counter, since starting the counters is mentioned in 
both.  Could you explain at what point this function is invoked and what 
the purpose of the context is?  I'm not real familiar with the perfmon2 
interface, but just on the face of it, your context doesn't seem like a 
one-to-one fit with the way contexts are used in perfmon2.  In perfmon2, 
a context is created first, which is then passed in to the calls for 
setting up events, thereby associating those events with the context. 
Then 'start' uses the context to set up the PMU for all requested events 
and begin the counting.

>
>[snip]
>
>perfmon_start_counter:long (event_handle:long)
>
>The event_handle passed in indicates which counter to start. The value
>is returned as a 64-bit long of the current counter value.  The return
>value is undefined for an invalid event_handle.
>  
>
I think individually starting counters is problematic at a couple 
different levels.  On some architectures (like PowerPC64), you don't 
have fine-grained control over each counter.  Also, one usually wants 
all counters to begin counting at the same time.  Maybe I'm 
misinterpreting what the intention of this function is.

>[snip]
>

>EVENT SPECIFICATION
>
>The performance monitoring events are specified in strings. The
>information at the very least include the event name being monitored
>  
>
Will, you allude to this in a later posting, but I'll reiterate here.  
Should the event name be the native event name for the arch?  Or some 
generic name that is mapped to a native name by some mechanism?  Or 
either (as in PAPI)?

>by the counter.  Additional information would include a event mask to
>specify subevents, whether to count in kernel or user space, whether
>to keep track of counts on a per thread or per CPU basis, and the
>interval for the sampling.
>
>(FIXME more detail on the string compositions)
>
>
>SYSTEMTAP PERFORMANCE HARDWARE ACCESS IMPLEMENTATION
>
>The SystemTap access performance monitoring hardware is planned to be
>built on the perfmon2 kernel support. The perfom2 provides reservation
>and access to the performance monitoring hardware on ia64, i386, and
>PowerPC processors. The perfmon2 support is not yet in the upstream
>kernels, but patches are available.
>  
>
As a proof of concept, I agree that this is the best route.  Reinventing 
the wheel would be useless.  Maybe building this prototype might help 
with refining the perfmon2 interface.

>  
>
Regards,
-Maynard


* Re: Proposed systemtap access to perfmon hardware
  2006-03-22  3:34 ` Maynard Johnson
@ 2006-03-22 18:02   ` William Cohen
  2006-03-22 22:16     ` Maynard Johnson
  2006-03-22 18:30   ` Frank Ch. Eigler
  1 sibling, 1 reply; 18+ messages in thread
From: William Cohen @ 2006-03-22 18:02 UTC (permalink / raw)
  To: Maynard Johnson; +Cc: SystemTAP

Maynard Johnson wrote:
> William Cohen wrote:
> 
>> [snip]
>>
>> perfmon_create_context:long ()
>>
>> The perfmon_create_context command sets up the performance monitoring
>> hardware for the allocated contexts and starts the counters running.
>> If successful, the function will return zero. If the operation is
>> unsuccessful because an error code will be returned. This function
>> should only be used in probe begin. (FIXME list error code returned.)
>>  
>>
> I'm confused about the relationship between this function and 
> perfmon_start_counter, since starting the counters is mentioned in 
> both.  Could you explain at what point this function is invoked and what 
> the purpose of the context is?  I'm not real familiar with the perfmon2 
> interface, but just on the face of it, your context doesn't seem like a 
> one-to-one fit with the way contexts are used in perfmon2.  In perfmon2, 
> a context is created first, which is then passed in to the calls for 
> setting up events, thereby associating those events with the context. 
> Then 'start' uses the context to set up the PMU for all requested events 
> and begin the counting.

Yes, perfmon2 has a context that sets up all the performance monitoring
hardware registers. The perfmon2 start and stop operations control the
entire context.

Based on the feedback on the earlier proposal email, I have revised this
to use something like:

probe perfmon.event("blah") ...

All the probes using the perfmon hardware would be collected together
for the perfmon_create_context. The individual start and stop operations
would be allowed. It is an open question what the counters' default
state is: do they start running by default, or do they have to be
explicitly started? If they are started by default, where exactly do
they start running? At the beginning of the begin probe? At the end of
the begin probe?

>>
>> [snip]
>>
>> perfmon_start_counter:long (event_handle:long)
>>
>> The event_handle passed in indicates which counter to start. The value
>> is returned as a 64-bit long of the current counter value.  The return
>> value is undefined for an invalid event_handle.
>>  
>>
> I think individually starting counters is problematic at a couple 
> different levels.  On some architectures (like PowerPC64), you don't 
> have fine-grained control over each counter.  Also, one usually wants 
> all counters to begin counting at the same time.  Maybe I'm 
> misinterpreting what the intention of this function is.

I was thinking there are cases where one would want to start and stop
individual sampling and interval counting. Yes, starting and stopping
counters on some architectures can be a problem.  I was thinking of
cheating: not actually starting and stopping the counters, but rather
turning on and off the bits that enable counting in user and kernel
space. This would be done by finding which bits to twiddle in the
control register. However, maybe this won't work for ppc64. I will have
to review the ppc64 hardware manual to see whether this scheme would
work.
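
As a rough sketch of the bit-twiddling idea, assuming the per-counter
event-select layout used by x86 P6-family and AMD64 processors (USR in
bit 16, OS in bit 17, EN in bit 22): clearing both privilege-level bits
leaves the counter enabled but matching nothing, which behaves like a
stop for that counter only. The Python below only computes register
values; the pause and resume helpers are hypothetical names, and real
code would have to write the value back to the hardware from the kernel.

```python
# Assumed x86 P6/AMD64 event-select bit positions.
USR = 1 << 16  # count events while in user mode
OS = 1 << 17   # count events while in kernel mode
EN = 1 << 22   # counter enable


def pause(evtsel):
    """Clear USR and OS so the counter matches no privilege level."""
    return evtsel & ~(USR | OS)


def resume(evtsel, user=True, kernel=True):
    """Set the requested privilege-level bits again."""
    mask = (USR if user else 0) | (OS if kernel else 0)
    return evtsel | mask


# Event code 0xC0 with user and kernel counting enabled.
running = 0xC0 | USR | OS | EN
paused = pause(running)
print(hex(running), hex(paused), hex(resume(paused, kernel=False)))
```

This only gives per-counter control where each counter has its own
privilege-level bits, which is the open question for ppc64.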

>> [snip]
>>
> 
>> EVENT SPECIFICATION
>>
>> The performance monitoring events are specified in strings. The
>> information at the very least include the event name being monitored
>>  
>>
> Will, you allude to this in a later posting, but I'll reiterate here.  
> Should the event name be the native event name for the arch?  Or some 
> generic name that is mapped to a native name by some mechanism?  Or 
> either (as in PAPI)?

libpfm has some generic names for cycle counts. I expect that events 
will be both generic names and architecture specific. This will be a 
lookup in libpfm.

>> by the counter.  Additional information would include a event mask to
>> specify subevents, whether to count in kernel or user space, whether
>> to keep track of counts on a per thread or per CPU basis, and the
>> interval for the sampling.
>>
>> (FIXME more detail on the string compositions)
>>
>>
>> SYSTEMTAP PERFORMANCE HARDWARE ACCESS IMPLEMENTATION
>>
>> The SystemTap access performance monitoring hardware is planned to be
>> built on the perfmon2 kernel support. The perfom2 provides reservation
>> and access to the performance monitoring hardware on ia64, i386, and
>> PowerPC processors. The perfmon2 support is not yet in the upstream
>> kernels, but patches are available.
>>  
>>
> As a proof of concept, I agree that this is the best route.  Reinventing 
> the wheel would be useless.  Maybe building this prototype might help 
> with refining the perfmon2 interface.

I have been working on patching oprofile so that it uses the perfmon2 
interface. The work is being done on an amd64 machine. This should allow 
some examination of the mechanisms for setting up the events and 
sampling. It should be portable to perfmon2 for i386, ppc64, and ia64. I 
will make the patches available for comment.

The next step would be to prototype a similar operation for systemtap.

I am trying to avoid reinventing the wheel. I am also very concerned 
that raw access of the performance monitoring hardware will further 
increase the chances of multiple device drivers stepping on each other 
without knowing about it.

-Will


* Re: Proposed systemtap access to perfmon hardware
  2006-03-22 18:02   ` William Cohen
@ 2006-03-22 22:16     ` Maynard Johnson
  0 siblings, 0 replies; 18+ messages in thread
From: Maynard Johnson @ 2006-03-22 22:16 UTC (permalink / raw)
  To: William Cohen; +Cc: SystemTAP

William Cohen wrote:
> Maynard Johnson wrote:
> 
>> William Cohen wrote:
>>
>>> [snip]
>>>
>>> perfmon_create_context:long ()
>>>
>>> The perfmon_create_context command sets up the performance monitoring
>>> hardware for the allocated contexts and starts the counters running.
>>> If successful, the function will return zero. If the operation is
>>> unsuccessful because an error code will be returned. This function
>>> should only be used in probe begin. (FIXME list error code returned.)
>>>  
>>>
>> I'm confused about the relationship between this function and 
>> perfmon_start_counter, since starting the counters is mentioned in 
>> both.  Could you explain at what point this function is invoked and 
>> what the purpose of the context is?  I'm not real familiar with the 
>> perfmon2 interface, but just on the face of it, your context doesn't 
>> seem like a one-to-one fit with the way contexts are used in 
>> perfmon2.  In perfmon2, a context is created first, which is then 
>> passed in to the calls for setting up events, thereby associating 
>> those events with the context. Then 'start' uses the context to set up 
>> the PMU for all requested events and begin the counting.
> 
> 
> Yes, perfmon2 has a contexts that sets all the performance monitoring 
> hardware registers. The perfmon2 start and stop control the entire context.
> 
> Based on the feedback from earlier proposal email, revised to using 
> something like:
> 
> probe perfmon.event("blah") ...
> 
> All the probes using the perfmon hardware would be collected together 
> for the perfmon_create_context. 
This is good.
> The individual start and stop operations would be allowed. 
This is not so good.  Besides the fact that it may be difficult (or 
impossible) to do, I don't see it being all that useful.  But then, I'm 
a tool developer, not a performance analyst, so I could be missing the 
point.

> It is and open question what the counters default are;
> do they start running by default or have to be explicitly started. If 
> they are started by default, where exactly are they running? Beginning 
> of begin probe? End of begin probe?
> 
>>>
>>> [snip]
>>>
>>> perfmon_start_counter:long (event_handle:long)
>>>
>>> The event_handle passed in indicates which counter to start. The value
>>> is returned as a 64-bit long of the current counter value.  The return
>>> value is undefined for an invalid event_handle.
>>>  
>>>
>> I think individually starting counters is problematic at a couple 
>> different levels.  On some architectures (like PowerPC64), you don't 
>> have fine-grained control over each counter.  Also, one usually wants 
>> all counters to begin counting at the same time.  Maybe I'm 
>> misinterpreting what the intention of this function is.
> 
> 
> I was thinking there are cases where one would want to start and stop 
> individual sampling and interval counting. Yes, starting and stoping 
> counters on some architectures can be a problem.  I was thinking if 
> cheating and not actually starting and stopping the counters, but rather 
> turning on and off the bits that enabling counting in user and kernel 
> space. Do this by finding which bits to twiddle in the control register. 
Unfortunately, this isn't possible for ppc64.  The control bits you 
mention (for user/kernel domain) are used for all counters, so there's 
no fine-grained control there.  There are PMCxSEL bits for setting up 
each counter for what you want it to count (including "count nothing"), 
but changing these on the fly (i.e., without disabling the PMU) may not 
have the desired effect.  The documentation states that you should first 
disable the PMU before you change these bits, but it doesn't say what 
would happen if you didn't disable.

-Maynard
> However, maybe this won't work for ppc64. I will have to review the 
> ppc64 hardware manual to see that this scheme would work.
> 
>>> [snip]
>>>
>>
>>> EVENT SPECIFICATION
>>>
>>> The performance monitoring events are specified in strings. The
>>> information at the very least include the event name being monitored
>>>  
>>>
>> Will, you allude to this in a later posting, but I'll reiterate here.  
>> Should the event name be the native event name for the arch?  Or some 
>> generic name that is mapped to a native name by some mechanism?  Or 
>> either (as in PAPI)?
> 
> 
> libpfm has some generic names for cycle counts. I expect that events 
> will be both generic names and architecture specific. This will be a 
> lookup in libpfm.
> 
>>> by the counter.  Additional information would include a event mask to
>>> specify subevents, whether to count in kernel or user space, whether
>>> to keep track of counts on a per thread or per CPU basis, and the
>>> interval for the sampling.
>>>
>>> (FIXME more detail on the string compositions)
>>>
>>>
>>> SYSTEMTAP PERFORMANCE HARDWARE ACCESS IMPLEMENTATION
>>>
>>> The SystemTap access performance monitoring hardware is planned to be
>>> built on the perfmon2 kernel support. The perfom2 provides reservation
>>> and access to the performance monitoring hardware on ia64, i386, and
>>> PowerPC processors. The perfmon2 support is not yet in the upstream
>>> kernels, but patches are available.
>>>  
>>>
>> As a proof of concept, I agree that this is the best route.  
>> Reinventing the wheel would be useless.  Maybe building this prototype 
>> might help with refining the perfmon2 interface.
> 
> 
> I have been working on patching oprofile so that it uses the perfmon2 
> interface. The work is being done on an amd64 machine. This should allow 
> some examination of the mechanisms for setting up the events and 
> sampling. It should be portable to perfmon2 for i386, ppc64, and ia64. I 
> will make the patches available for comment.
> 
> Next step would be to protoype similar opertation for systemtap.
> 
> I am trying to avoid reinventing the wheel. I am also very concerned 
> that raw access of the performance monitoring hardware will further 
> increase the chances of multiple device drivers stepping on each other 
> without knowing about it.
> 
> -Will



* Re: Proposed systemtap access to perfmon hardware
  2006-03-22  3:34 ` Maynard Johnson
  2006-03-22 18:02   ` William Cohen
@ 2006-03-22 18:30   ` Frank Ch. Eigler
  1 sibling, 0 replies; 18+ messages in thread
From: Frank Ch. Eigler @ 2006-03-22 18:30 UTC (permalink / raw)
  To: Maynard Johnson; +Cc: systemtap

maynardj wrote:

> [...]

Several aspects of the script interface to the performance counters
were changed later in this thread: please check that too.

> >The performance monitoring events are specified in strings. The
> >information at the very least include the event name being monitored
> [...]
> Should the event name be the native event name for the arch?  Or some
> generic name that is mapped to a native name by some mechanism?  Or
> either (as in PAPI)?

It may be sufficient to use systemtap's general abstraction mechanisms
to map between generic and native event names, in much the same way as
the system-call tapset defines generic names ("syscall.read") in terms
of native functions ("kernel.function(...)").  This aliasing widget
may need some extension in order to deal with parameters or partial
matches; we'll know after someone constructs an informal
generic<->native event name dictionary.
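
An informal dictionary of that kind might start out as small as the
Python sketch below. The table entries and the resolve_event helper are
illustrative placeholders, not an agreed list of names: a generic name
is looked up per architecture, and anything not in the table is passed
through as a native name.

```python
# Hypothetical generic -> native event name table, keyed by architecture.
GENERIC_TO_NATIVE = {
    "cycles": {"i386": "CPU_CLK_UNHALTED", "ia64": "CPU_CYCLES"},
    "instructions": {"i386": "INST_RETIRED", "ia64": "IA64_INST_RETIRED"},
}


def resolve_event(name, arch):
    """Map a generic event name to a native one for the given arch."""
    per_arch = GENERIC_TO_NATIVE.get(name)
    if per_arch is None:
        return name  # not a generic name: assume it is already native
    if arch not in per_arch:
        raise LookupError("no native event for %r on %s" % (name, arch))
    return per_arch[arch]


print(resolve_event("cycles", "i386"))
print(resolve_event("cycles", "ia64"))
print(resolve_event("SOME_NATIVE_EVENT", "i386"))
```

Parameters such as unit masks, and partial matches, are the cases this
flat table does not handle.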

- FChE


end of thread, other threads:[~2006-03-23 17:09 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2006-03-22 23:46 Proposed systemtap access to perfmon hardware Stone, Joshua I
2006-03-23 12:54 ` Maynard Johnson
2006-03-23 14:46   ` William Cohen
  -- strict thread matches above, loose matches on Subject: below --
2006-03-23 17:09 Stone, Joshua I
2006-03-22 23:23 Stone, Joshua I
2006-03-22 19:09 Stone, Joshua I
2006-03-22 20:04 ` Frank Ch. Eigler
2006-03-15 16:24 William Cohen
2006-03-15 22:34 ` Frank Ch. Eigler
2006-03-17 16:20   ` William Cohen
2006-03-17 17:10     ` Bill Rugolsky Jr.
2006-03-17 17:34     ` Frank Ch. Eigler
2006-03-17 20:26       ` William Cohen
2006-03-20 17:27         ` Frank Ch. Eigler
2006-03-22  3:34 ` Maynard Johnson
2006-03-22 18:02   ` William Cohen
2006-03-22 22:16     ` Maynard Johnson
2006-03-22 18:30   ` Frank Ch. Eigler
