* Making the transport layer more robust - Power Management - follow-up
@ 2011-12-09  4:38 Turgis, Frederic
  2011-12-15 15:31 ` Mark Wielaard
  0 siblings, 1 reply; 3+ messages in thread
From: Turgis, Frederic @ 2011-12-09  4:38 UTC (permalink / raw)
  To: systemtap

Hi,

To summarize past discussions in http://sourceware.org/ml/systemtap/2011-q3/msg00272.html: we want to tune systemtap to have as few regular wake-ups as possible. This is important to us because we monitor low-power use cases on OMAP SoCs, and a "straight" systemtap run wakes up every 1 or 2 scheduler ticks.

- we looked at the code and worked around 4 causes of regular wake-ups: polling and timeouts in the userspace and kernel-space control/data channels

- in the "Making the transport layer more robust" thread, Mark presented a rework that replaces  userspace polling of control channel by a "select()" model if target allows. 1 less regular wake-up !

- in the same thread, Mark remarked that STP_RELAY_TIMER_INTERVAL and STP_CTL_TIMER_INTERVAL (the kernel polling intervals) are in fact tunables, so there is no need to modify the code:
   * to match our work-arounds, we then used -D STP_RELAY_TIMER_INTERVAL=128 -D STP_CTL_TIMER_INTERVAL=256 (full command line below)
   * regular wake-ups clearly occur less often, with no tracing issues. But our trace bandwidth is generally a few hundred KB/s at most, so we don't really need much robustness
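For reference, the complete invocation then looks roughly like this (the script name is a placeholder, the -D flags are the ones above):

   stap -v -D STP_RELAY_TIMER_INTERVAL=128 -D STP_CTL_TIMER_INTERVAL=256 myscript.stp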


Some more recent findings:
- while testing fixes for an ARM backtrace issue with Mark, I got the message "ctl_write_msg type=2 len=61 ENOMEM" several times at the beginning of the test (not root-caused yet). That means a lack of buffer space for msg type=2, which is OOB_DATA (error and warning messages). Test and trace data looked fine. The messages do not appear if I compile without -D STP_CTL_TIMER_INTERVAL=256.
So here is a consequence of our tuning; not a killer, but still worth noting ;-) I guess we could see the same on the data channel

- the last non-tunable wake-up is the timeout of the userspace data channel ppoll() call in reader_thread(). Without changes, we wake up every 200ms:
   * we currently set it to 5s. No issue so far
   * Mark (or someone else) suggested using bulkmode. Here are some findings:
      + bulkmode sets the timeout to NULL (or 10s if NEED_PPOLL is set), which solves the wake-up issue. I am just wondering why we have NULL in bulkmode and 200ms otherwise (current behaviour sketched below)
      + OMAP hotplugs cores, so core 1 is generally off at the beginning of a test. Therefore I get no trace for core 1 even if core 1 is used later. That makes bulkmode less usable than I thought (at least I still need to test with core 1 "on" at the beginning of the test to see the behaviour further)
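As far as I can tell, the current timeout selection boils down to something like this (a sketch; the bulkmode/NEED_PPOLL names are from the discussion, the surrounding code is assumed):

   #include <stddef.h>
   #include <time.h>

   /* Sketch of the current reader_thread() timeout choice. */
   static struct timespec *choose_timeout(int bulkmode, int need_ppoll)
   {
           static struct timespec tim;
           if (bulkmode) {
                   if (!need_ppoll)
                           return NULL;     /* block until data arrives */
                   tim.tv_sec = 10;         /* 10s fallback with NEED_PPOLL */
                   tim.tv_nsec = 0;
           } else {
                   tim.tv_sec = 0;          /* non-bulkmode: wake every 200ms */
                   tim.tv_nsec = 200000000;
           }
           return &tim;
   }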


That keeps the possibility of tuning the ppoll() timeout value in non-bulkmode interesting. I don't really know what the consequences of directly setting it to 1s or more would be, but a tunable would be a good trade-off that does not break the current behaviour.

Well, I think I gave myself a few actions to perform!


Regards
Fred

Frederic Turgis
OMAP Platform Business Unit - OMAP System Engineering - Platform Enablement - System Multimedia





* Re: Making the transport layer more robust - Power Management - follow-up
  2011-12-09  4:38 Making the transport layer more robust - Power Management - follow-up Turgis, Frederic
@ 2011-12-15 15:31 ` Mark Wielaard
  2011-12-16 15:37   ` Turgis, Frederic
  0 siblings, 1 reply; 3+ messages in thread
From: Mark Wielaard @ 2011-12-15 15:31 UTC (permalink / raw)
  To: Turgis, Frederic; +Cc: systemtap

On Thu, 2011-12-08 at 21:22 +0000, Turgis, Frederic wrote:
> - in same thread, Mark made the remark that STP_RELAY_TIMER_INTERVAL
> and STP_CTL_TIMER_INTERVAL (kernel pollings) are in fact tunables, no
> need to modify the code:
>    * to match our work-arounds, we used then -D
> STP_RELAY_TIMER_INTERVAL=128 -D STP_CTL_TIMER_INTERVAL=256
>    * regular wake-ups are clearly occurring less often, with no tracing
> issue. But our trace bandwidth is generally hundreds of KB/s max so we
> don't really need much robustness

I noticed these aren't documented anywhere. I propose to document them
as follows:

STP_RELAY_TIMER_INTERVAL How often the relay or ring buffers are checked
to see if readers need to be woken up to deliver new trace data. Timer
interval given in jiffies. Defaults to "((HZ + 99) / 100)" which is
every 10ms.

STP_CTL_TIMER_INTERVAL How often control messages (system, warn, exit,
etc.) are checked to see if control channel readers need to be woken up
to notify them. Timer interval given in jiffies. Defaults to "((HZ
+49)/50)" which is every 20ms.

Where should we add this documentation?

> Some more recent findings:
> - while testing fixes on some ARM backtrace issue with Mark, I got
> message "ctl_write_msg type=2 len=61 ENOMEM" several times at
> beginning of test (not root-caused yet). That means lack of trace
> buffer for msg type=2, which is OOB_DATA (error and warning messages).
> Test and trace data looked fine. Messages do not appear if I compile
> without -D STP_CTL_TIMER_INTERVAL=256.

Yes, that is kind of expected. The control messages really want to be
delivered and if you wait too long new control messages will not have
room to be added to the buffers.

Would it help you if we made the pool of reserved message buffers also
tunable? Currently STP_DEFAULT_BUFFERS is defined statically in either
runtime/transport/debugfs.c (40) or runtime/transport/procfs.c (256),
depending on which backend we use for the control channel.

Documentation would be something like:

STP_DEFAULT_BUFFERS Defines the number of buffers allocated for control
messages the module can store before they have to be read by stapio.
Defaults to 40 (8 pre-allocated one-time messages plus 32 dynamic
err/warning/system messages).
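
Making it tunable would essentially mean guarding the current per-backend constant, something like this (a sketch for the debugfs backend; procfs.c would default to 256):

   /* runtime/transport/debugfs.c (sketch) */
   #ifndef STP_DEFAULT_BUFFERS
   #define STP_DEFAULT_BUFFERS 40  /* 8 pre-allocated + 32 dynamic messages */
   #endif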

> - last non tunable wake-up is timeout of userspace data channel
> ppoll() call in reader_thread(). Without change, we wake-up every
> 200ms:
>    * we currently set it to 5s. No issue so far
>    * Mark (or someone else) suggested to use bulkmode. Here are some
> findings:
>       + bulkmode sets timeout to NULL (or 10s if NEED_PPOLL is set).
> It solves wake-up issue. I am just wondering why we have NULL in
> bulkmode and 200ms otherwise

That is probably because not all trace data backends really support
poll/select. The ring_buffer one seems to, but the relay one doesn't. So
we would need some way to detect whether the backend really supports
select/poll before we can drop the timeout entirely. If there isn't a
bug report about this, there probably should be. Will's recent periodic.stp
example showed that stap and the stap runtime are responsible for a
noticeable number of wakeups.
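
Such a check could then feed the timeout choice, along these lines (backend_supports_poll() is hypothetical, purely to illustrate the idea):

   #include <stddef.h>
   #include <time.h>

   /* Hypothetical: nonzero when the trace backend (e.g. ring_buffer)
      genuinely supports poll/select wakeups. */
   extern int backend_supports_poll(void);

   static struct timespec tim = { .tv_sec = 0, .tv_nsec = 200000000 };

   static struct timespec *reader_timeout(void)
   {
           /* No periodic wakeup needed when poll really works;
              otherwise keep the 200ms timeout as a safety net. */
           return backend_supports_poll() ? NULL : &tim;
   }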

>       + OMAP hotplugs core so generally core 1 is off at beginning of
> test. Therefore I don't get trace of core1 even if core1 is used
> later. Makes bulkmode less usable than I thought (at least I still
> need to test with core1 "on" at beginning of test to see further
> behaviour)

Could you file a bug report about the systemtap runtime not noticing new
cores coming online in bulk mode?

> That makes the possibility to tune ppoll timeout value in non bulkmode
> still interesting. I even don't really know what could be consequences
> of directly setting to 1s or more but tunable would be good trade-off
> that does not break current status.
> 
> Well, I think I gave myself few actions to perform !

Thanks for the feedback. Please let us know how tuning things
differently makes your life easier.

Cheers,

Mark


* RE: Making the transport layer more robust - Power Management - follow-up
  2011-12-15 15:31 ` Mark Wielaard
@ 2011-12-16 15:37   ` Turgis, Frederic
  0 siblings, 0 replies; 3+ messages in thread
From: Turgis, Frederic @ 2011-12-16 15:37 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: systemtap


> I noticed these aren't documented anywhere. I propose to document them as follows:

Well written. Maybe we should avoid converting into ms in the documentation, because that is very x86-centric; HZ is 128 on ARM (well, at least on OMAP), giving slightly different results ;-)
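For example, with HZ=128 the same formulas give STP_RELAY_TIMER_INTERVAL = (128 + 99) / 100 = 2 jiffies = 2/128 s ≈ 15.6ms, and STP_CTL_TIMER_INTERVAL = (128 + 49) / 50 = 3 jiffies = 3/128 s ≈ 23.4ms, instead of the 10ms/20ms you get with HZ=100.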


> Would it help you if we made the pool reserved memory buffers also tunable? Currently STP_DEFAULT_BUFFERS is defined staticly in either runtime/transport/debugfs.c (40) or runtime/transport/procfs.c (256) depending which backend we use for the control channel.

It would help, but currently my control buffers are flooded because the script produces warnings/errors that I didn't get in the past (some NULL pointer being backtraced). So my current solution is to fix the script. The control channel does not seem to need much bandwidth.


> That is probably because not all trace data backends really support poll/select. The ring_buffer one seems to, but the relay one doesn't.

I didn't know that. I am using relay (without the RELAY kernel flag the module misses some functions), and when I set the timeout to 5s I had the impression that I sometimes woke up before 5s when there were more traces. But I need to dig into it, and eventually choose ring_buffer.
I did my homework on this, increasing the trace bandwidth with more prints. Playing with STP_RELAY_TIMER_INTERVAL did not help much; I still had around the same number of transport failures. Maybe the bottleneck was more emptying the buffer than checking whether there is a buffer to empty. I shall couple further investigation with digging deeper into relay/ring_buffer.


>Could you file a bug report about the systemtap runtime not noticing new cores coming online for bulk mode?

Of course... I also did my homework there. If I force both CPUs online before the test starts, I get the second trace file. I can then force CPU1 offline and back online, and everything works fine. So this is really about a core coming online for the first time after the test has started, not about toggling during the test.


> Thanks for the feedback. Please let us know how tuning things differently make your life easier.

Currently, thanks to the tunables and previous changes, I only need to tune the reader_thread() timeout to make the Power Management team happy. I have coded "struct timespec tim = {.tv_sec=5, .tv_nsec=0}". If the timeout were expressed in nanoseconds, that value could be the tunable, with default=200000000 (sketch below).
I can also double-check that if needed.
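Something like this would do it (STP_READER_TIMEOUT_NS is a made-up name for the tunable; tv_nsec has to stay below one second, hence the split):

   #include <time.h>

   /* Hypothetical tunable: reader_thread() ppoll() timeout in ns.
      The default matches the current hard-coded 200ms. */
   #ifndef STP_READER_TIMEOUT_NS
   #define STP_READER_TIMEOUT_NS 200000000LL
   #endif

   static struct timespec tim = {
           .tv_sec  = STP_READER_TIMEOUT_NS / 1000000000LL,
           .tv_nsec = STP_READER_TIMEOUT_NS % 1000000000LL,
   };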

Regards
Fred


