public inbox for systemtap@sourceware.org
* Infrastructure for tracking driver performance events
@ 2009-06-24 17:29 Ben Gamari
  2009-06-24 18:53 ` Josh Stone
  2009-06-25 12:56 ` Steven Rostedt
  0 siblings, 2 replies; 6+ messages in thread
From: Ben Gamari @ 2009-06-24 17:29 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, Stone, Joshua I, Robert Richter,
	anil.s.keshavamurthy, ananth, davem, mhiramat
  Cc: SystemTap, Eric Anholt, Chris Wilson, intel-gfx

[I apologize profusely to those of you who receive this twice. Sorry about that]


Hey all,

Now that GEM has been implemented and is beginning to stabilize in the
Intel graphics driver, work has begun on optimizing the driver and its
use of the hardware. While CPU-bound operations can easily be found
with a profiler, identifying GPU stalls has been substantially more
difficult.

One class of GPU stalls that can be easily identified occurs when the
driver needs to wait for the GPU to complete some work before proceeding
(waiting for the chip to free a hardware resource --- e.g. a fence
register for configuring tiling --- or complete some other type of
transaction --- e.g. flushing caches). To debug these stalls, it is
useful to know both what is causing the stall (i.e. call path) and why
the driver had to wait (e.g. waiting for GEM domain change, waiting for
fence, waiting for a cache flush, etc.).

I recently wrote a very simple patch to add accounting for these types
of stalls to the i915 driver[1], exposing a list of wait-event counts to
userspace through debugfs. While this is useful for giving a general
overview of the driver's performance, it does very little to expose
individual bottlenecks in the driver or userland components. It has been
suggested[2] that this wait-event tracking functionality would be far more
useful if we could provide stack backtraces all the way into user space
for each wait event.

I am investigating how this might be accomplished with existing kernel
infrastructure. At first, ftrace looked like a promising option, as the
sysprof profiler is driven by ftrace and provides exactly the type of
full system backtraces we need. We could probably even approximate the
desired result by calling one function when we begin waiting and another
when we finish, then using a script to look for these events. I haven't
looked into how we could get a usermode trace with
this approach, but it seems possible as sysprof already does it.
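
To sketch what I have in mind (written here as a systemtap script, one
of the options discussed below; the choice of wait function is purely
illustrative, not a proposal):

    global begin, waits

    probe module("i915").function("i915_wait_request") {
        begin[tid()] = gettimeofday_us()
    }

    probe module("i915").function("i915_wait_request").return {
        if (tid() in begin) {
            # record how long this process spent waiting
            waits[execname()] <<< gettimeofday_us() - begin[tid()]
            delete begin[tid()]
        }
    }

    probe end {
        foreach (name in waits)
            printf("%s: %d waits, %d us total\n",
                   name, @count(waits[name]), @sum(waits[name]))
    }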

While this approach would work, it has a few shortcomings:
1) Function graph tracing must be enabled on the entire machine to debug
   stalls
2) It is difficult to extract the kernel-mode callgraph, and there is
   no natural way to capture the usermode callgraph
3) A large amount of usermode support is necessary (which will likely be
   the case for any option; listed here for completeness)

Another option seems to be systemtap. It has already been documented[3]
that this option could provide both user-mode and kernel-mode
backtraces. The driver could provide a kernel marker at every potential
wait point (or a single marker in a function called at each wait point,
for that matter) which would be picked up by systemtap and processed in
usermode, calling ptrace to acquire a usermode backtrace. This approach
seems slightly cleaner, as it doesn't require tracing the entire machine
to catch what should (hopefully) be reasonably rare events.
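
For instance, the systemtap side of such a marker might look like the
following sketch (the marker name and its single "reason" string
argument are hypothetical):

    probe kernel.mark("i915_gem_wait") {
        # $arg1 assumed to be a string naming the reason for the wait
        printf("%s[%d] waiting: %s\n",
               execname(), pid(), kernel_string($arg1))
        print_backtrace()    # kernel-mode call path
    }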

Unfortunately, the systemtap approach described in [3] requires that
each process have an associated "driver" process to get a usermode
backtrace. It would be nice to avoid this requirement as there are
generally far more gpu clients than just the X server (i.e. direct
rendering clients) and tracking them all could get tricky.

These are the two options I have seen thus far. It seems like getting
this sort of information will be increasingly important as more and more
drivers move into kernel-space, and it is likely that the Intel
implementation will be a model for future drivers, so it would be nice
to implement it correctly the first time. Does anyone see an option
which I have missed?  Are there any thoughts on any new generic services
that the kernel might provide that might make this task easier? Any
comments, questions, or complaints would be greatly appreciated.

Thanks,

- Ben


[1] http://lists.freedesktop.org/archives/intel-gfx/2009-June/002938.html
[2] http://lists.freedesktop.org/archives/intel-gfx/2009-June/002979.html
[3] http://sourceware.org/ml/systemtap/2006-q4/msg00198.html


* Re: Infrastructure for tracking driver performance events
  2009-06-24 17:29 Infrastructure for tracking driver performance events Ben Gamari
@ 2009-06-24 18:53 ` Josh Stone
  2009-06-25 12:56 ` Steven Rostedt
  1 sibling, 0 replies; 6+ messages in thread
From: Josh Stone @ 2009-06-24 18:53 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, Robert Richter,
	anil.s.keshavamurthy, ananth, davem, mhiramat, SystemTap,
	Eric Anholt, Chris Wilson, intel-gfx

On 06/24/2009 10:29 AM, Ben Gamari wrote:
[...]
> I recently wrote a very simple patch to add accounting for these types
> of stalls to the i915 driver[1], exposing a list of wait-event counts to
> userspace through debugfs. While this is useful for giving a general
> overview of the driver's performance, it does very little to expose
> individual bottlenecks in the driver or userland components. It has been
> suggested[2] that this wait-event tracking functionality would be far more
> useful if we could provide stack backtraces all the way into user space
> for each wait event.
[...]
> Another option seems to be systemtap. It has already been documented[3]
> that this option could provide both user-mode and kernel-mode
> backtraces. The driver could provide a kernel marker at every potential
> wait point (or a single marker in a function called at each wait point,
> for that matter) which would be picked up by systemtap and processed in
> usermode, calling ptrace to acquire a usermode backtrace. This approach
> seems slightly cleaner, as it doesn't require tracing the entire machine
> to catch what should (hopefully) be reasonably rare events.
> 
> Unfortunately, the systemtap approach described in [3] requires that
> each process have an associated "driver" process to get a usermode
> backtrace. It would be nice to avoid this requirement as there are
> generally far more gpu clients than just the X server (i.e. direct
> rendering clients) and tracking them all could get tricky.
[...]
> [3] http://sourceware.org/ml/systemtap/2006-q4/msg00198.html

I have to say, I'm a bit surprised to see my hacky suggestion
resurrected from the archives.  :)  I would guess that approach would
add far too much overhead to be of use in diagnosing stalls, though.

However, I think we can do a lot better with systemtap these days.
We're growing the ability to do userspace backtraces[1] directly within
your systemtap script, which should be much less intrusive.

Please take a look at ubacktrace() and family in recent systemtap, and
let us know how you think they could be improved.
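
For a rough idea of the intended usage, a minimal sketch (the probed
function is only an illustrative guess at a suitable i915 wait point,
and stap needs unwind data for the processes involved):

    probe module("i915").function("i915_wait_request") {
        printf("=== %s[%d] ===\n", execname(), pid())
        print_backtrace()     # kernel-mode stack
        print_ubacktrace()    # user-mode stack, no "driver" process needed
    }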

Thanks,

Josh

[1] http://sourceware.org/ml/systemtap/2009-q2/msg00364.html


* Re: Infrastructure for tracking driver performance events
  2009-06-24 17:29 Infrastructure for tracking driver performance events Ben Gamari
  2009-06-24 18:53 ` Josh Stone
@ 2009-06-25 12:56 ` Steven Rostedt
  2009-06-25 13:25   ` Mark Wielaard
  1 sibling, 1 reply; 6+ messages in thread
From: Steven Rostedt @ 2009-06-25 12:56 UTC (permalink / raw)
  To: Ben Gamari
  Cc: linux-kernel, Stone, Joshua I, Robert Richter,
	anil.s.keshavamurthy, ananth, davem, mhiramat, SystemTap,
	Eric Anholt, Chris Wilson, intel-gfx



On Wed, 24 Jun 2009, Ben Gamari wrote:
> 
> I am investigating how this might be accomplished with existing kernel
> infrastructure. At first, ftrace looked like a promising option, as the
> sysprof profiler is driven by ftrace and provides exactly the type of
> full system backtraces we need. We could probably even approximate the
> desired result by calling one function when we begin waiting and another
> when we finish, then using a script to look for these events. I haven't
> looked into how we could get a usermode trace with
> this approach, but it seems possible as sysprof already does it.
> 
> While this approach would work, it has a few shortcomings:
> 1) Function graph tracing must be enabled on the entire machine to debug
>    stalls

You can filter which functions are traced, or write a list of functions
into set_graph_function to graph just those functions.

> 2) It is difficult to extract the kernel-mode callgraph, and there is
>    no natural way to capture the usermode callgraph

Do you just need a backtrace at some point, or a full user-mode graph?

> 3) A large amount of usermode support is necessary (which will likely be
>    the case for any option; listed here for completeness)
> 
> Another option seems to be systemtap. It has already been documented[3]
> that this option could provide both user-mode and kernel-mode
> backtraces. The driver could provide a kernel marker at every potential
> wait point (or a single marker in a function called at each wait point,
> for that matter) which would be picked up by systemtap and processed in
> usermode, calling ptrace to acquire a usermode backtrace. This approach
> seems slightly cleaner, as it doesn't require tracing the entire machine
> to catch what should (hopefully) be reasonably rare events.

Enabling the userstacktrace option will give you userspace stack traces
at event trace points. The catch is that the userspace binary must be
built with frame pointers.

-- Steve

> 
> Unfortunately, the systemtap approach described in [3] requires that
> each process have an associated "driver" process to get a usermode
> backtrace. It would be nice to avoid this requirement as there are
> generally far more gpu clients than just the X server (i.e. direct
> rendering clients) and tracking them all could get tricky.
> 
> These are the two options I have seen thus far. It seems like getting
> this sort of information will be increasingly important as more and more
> drivers move into kernel-space, and it is likely that the Intel
> implementation will be a model for future drivers, so it would be nice
> to implement it correctly the first time. Does anyone see an option
> which I have missed?  Are there any thoughts on any new generic services
> that the kernel might provide that might make this task easier? Any
> comments, questions, or complaints would be greatly appreciated.
> 


* Re: Infrastructure for tracking driver performance events
  2009-06-25 12:56 ` Steven Rostedt
@ 2009-06-25 13:25   ` Mark Wielaard
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Wielaard @ 2009-06-25 13:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ben Gamari, linux-kernel, Stone, Joshua I, Robert Richter,
	anil.s.keshavamurthy, ananth, davem, mhiramat, SystemTap,
	Eric Anholt, Chris Wilson, intel-gfx

Hi,

On Thu, 2009-06-25 at 08:55 -0400, Steven Rostedt wrote:
> On Wed, 24 Jun 2009, Ben Gamari wrote:
> > 3) A large amount of usermode support is necessary (which will likely be
> >    the case for any option; listed here for completeness)
> > 
> > Another option seems to be systemtap. It has already been documented[3]
> > that this option could provide both user-mode and kernel-mode
> > backtraces. The driver could provide a kernel marker at every potential
> > wait point (or a single marker in a function called at each wait point,
> > for that matter) which would be picked up by systemtap and processed in
> > usermode, calling ptrace to acquire a usermode backtrace. This approach
> > seems slightly cleaner, as it doesn't require tracing the entire machine
> > to catch what should (hopefully) be reasonably rare events.
> 
> Enabling the userstacktrace will give userspace stack traces at event
> trace points. The thing is that the userspace utility must be built with 
> frame pointers.

This isn't true for Systemtap. It can unwind through anything, since it
contains a DWARF unwinder that can produce backtraces as long as unwind
tables are available for the modules (executables, vdso, shared
libraries, etc.) one wants to unwind through. Systemtap currently gets
these in its "translation" phase, and at the moment you do need to list
them explicitly. There is work underway to make this more flexible and
automatic. Cross kernel-to-user-space backtraces also need some work
(systemtap can use the DWARF unwinder in-kernel as well, but some kernel
parts are missing unwind tables).
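
Concretely, "listing them explicitly" means naming the objects you want
to unwind through when the script is translated, something like the
following (paths illustrative; the exact options may still change as
this work settles):

    # stap -d /usr/bin/Xorg -d /usr/lib/libdrm.so.2 \
    #      -d /usr/lib/dri/i915_dri.so waits.stp
    probe module("i915").function("i915_wait_request") {
        print_ubacktrace()
    }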

Some systemtap bugs to track if you are interested in extending this
functionality:

= Prerequisites for more ubiquitous backtracing
sw#6961 backtrace from non-pt_regs probe context
http://sourceware.org/bugzilla/show_bug.cgi?id=6961
sw#10080 track vdso for process symbols/backtrace
http://sourceware.org/bugzilla/show_bug.cgi?id=10080
sw#10208 Support probing glibc synthesized syscall wrappers
http://sourceware.org/bugzilla/show_bug.cgi?id=10208

NOTE: the above still won't make cross kernel-to-userspace backtracing
fully work, since we cannot easily unwind through the kernel entry/exit
assembly code, which doesn't have DWARF unwind tables.

= Make user backtraces more convenient
sw#10228 Add more vma-tracking for user space symbol/backtraces
http://sourceware.org/bugzilla/show_bug.cgi?id=10228
sw#6580 revamp backtrace-related tapset functions
http://sourceware.org/bugzilla/show_bug.cgi?id=6580

Cheers,

Mark


* Re: Infrastructure for tracking driver performance events
  2009-06-24 17:09 Ben Gamari
@ 2009-06-24 19:58 ` Frank Ch. Eigler
  0 siblings, 0 replies; 6+ messages in thread
From: Frank Ch. Eigler @ 2009-06-24 19:58 UTC (permalink / raw)
  To: Ben Gamari
  Cc: Steven Rostedt, lkml, Robert Richter, anil.s.keshavamurthy,
	ananth, davem, mhiramat, SystemTap, Eric Anholt, Chris Wilson,
	intel-gfx


Hi, Ben -

> [...]  It has been suggested[2] that this wait-event tracking
> functionality would be far more useful if we could provide stack
> backtraces all the way into user space for each wait event.  [...]

Right.

> Another option seems to be systemtap. It has already been
> documented[3] that this option could provide both user-mode and
> kernel-mode backtraces. [...]  Unfortunately, the systemtap approach
> described in [3] requires that each process have an associated
> "driver" process [...]
> [3] http://sourceware.org/ml/systemtap/2006-q4/msg00198.html

The date on the message you're referencing gives a hint that it's
obsolete.  A newer status report on the relevant work (still
incomplete!) is <http://sourceware.org/ml/systemtap/2009-q2/msg00364.html>.

Systemtap now handles user-space backtraces without such "driver"
processes.  While frame-pointer-based heuristics are available,
systemtap also uses DWARF unwind data to compute more accurate
backtraces for participating user-space processes.  These may be
composed of an identified set of shared libraries / binaries, whose
unwind data is made available to systemtap-generated probe modules for
instant reference.


> [...]  Are there any thoughts on any new generic services that the
> kernel might provide that might make this task easier? [...]

One obstacle to robust backtracing has been the inconsistent presence
of DWARF .cfi directives in all the layers of assembly code involved
in the gap between blocked userspace and some random kernel function.


- FChE


