* Re: Unified tracing buffer
  [not found] <33307c790809191433w246c0283l55a57c196664ce77@mail.gmail.com>

From: Masami Hiramatsu @ 2008-09-22 19:47 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

Hi Martin,

Martin Bligh wrote:
> During Kernel Summit and the Plumbers conference, Linus and others
> expressed a desire for a unified tracing buffer system for multiple
> tracing applications (e.g. ftrace, lttng, systemtap, blktrace, etc.)
> to use. This provides several advantages, including the ability to
> interleave data from multiple sources, not having to learn 200
> different tools, removing duplicated code/effort, etc.
>
> Several of us got together last night and tried to cut this down to
> the simplest usable system we could agree on (and nobody got hurt!).
> This will form version 1. I've sketched out a few enhancements we
> know we want, but have agreed to leave these until version 2. The
> answer to most questions about the below is "yes we know, we'll fix
> that in version 2" (or 3). Simplicity was the rule ...
>
> Sketch of design. Enjoy flaming me. Code will follow shortly.
>
>
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names and event id spaces. Event ids are 16 bit, dynamically
> allocated. A "one line of text" print function will be provided for
> each event, or use the default (probably a hex printf). We will
> provide a "flight data recorder" mode and a "spool to disk" mode.
>
> Circular buffer per cpu, protected by a per-cpu spinlock_irq.
> Word-aligned records.
> Variable record length; each record starts with a length field.
> Timestamps in a fixed timebase, monotonically increasing (across all
> CPUs).

I agree with integrating the tracing buffer mechanism, but I don't
think your proposal is the simplest one.

To simplify, I think a layered buffering mechanism is desirable:
- The lowest layer just provides named circular buffers (both per-cpu
  and a single system-wide buffer) and a read/write scheme.
- The next layer provides the user/kernel interface, including controls.
- The top layer defines packet (and event) formatting utilities.
- Additionally, it would be better to provide some library routines
  (timestamps, event-id synchronization, and so on).

Since this unified buffer will be used by different kinds of
tracers/loggers, I don't think all of them (and future tracers) want
to be tied down by an "event-id" + "parameter" format. So, sorry, I
disagree with the tracing buffer defining its *data format*; that's
just overkill for me.

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com

^ permalink raw reply [flat|nested] 47+ messages in thread
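The storage sketch quoted above (a circular buffer of word-aligned,
variable-length records, each starting with a length word) can be
modelled in a few lines of userspace C. This is only an illustrative
sketch: the names (`struct ring`, `ring_write`, `ring_read`) are made
up, and the per-cpu instantiation, spinlock_irq protection, and the
overwrite ("flight recorder") mode from the proposal are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4096u             /* must be a power of two */

struct ring {
    uint8_t buf[RING_SIZE];
    size_t  head;                   /* free-running write counter */
    size_t  tail;                   /* free-running read counter */
};

/* Round up to the machine word size, as the proposal requires. */
static size_t word_align(size_t n)
{
    return (n + sizeof(long) - 1) & ~(sizeof(long) - 1);
}

static void ring_copy_in(struct ring *r, size_t pos, const void *src, size_t len)
{
    const uint8_t *p = src;
    for (size_t i = 0; i < len; i++)
        r->buf[(pos + i) & (RING_SIZE - 1)] = p[i];
}

static void ring_copy_out(struct ring *r, size_t pos, void *dst, size_t len)
{
    uint8_t *p = dst;
    for (size_t i = 0; i < len; i++)
        p[i] = r->buf[(pos + i) & (RING_SIZE - 1)];
}

/* Write one record: a length word, then the payload. */
static int ring_write(struct ring *r, const void *data, uint32_t len)
{
    size_t need = word_align(sizeof(uint32_t) + len);

    if (need > RING_SIZE - (r->head - r->tail))
        return -1;  /* full; flight-recorder mode would overwrite instead */
    ring_copy_in(r, r->head, &len, sizeof(len));
    ring_copy_in(r, r->head + sizeof(len), data, len);
    r->head += need;
    return 0;
}

/* Read the oldest record into dst; returns its payload length, or -1. */
static int ring_read(struct ring *r, void *dst, uint32_t max)
{
    uint32_t len;

    if (r->head == r->tail)
        return -1;                  /* empty */
    ring_copy_out(r, r->tail, &len, sizeof(len));
    if (len > max)
        return -1;
    ring_copy_out(r, r->tail + sizeof(len), dst, len);
    r->tail += word_align(sizeof(uint32_t) + len);
    return (int)len;
}
```

In the real proposal there would be one such ring per cpu, writes
would take the per-cpu spinlock with interrupts disabled, and the
record header would also carry the timestamp and 16-bit event id.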
* Re: Unified tracing buffer

From: Martin Bligh @ 2008-09-22 20:14 UTC
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

> I agree with integrating the tracing buffer mechanism, but I don't
> think your proposal is the simplest one.
>
> To simplify, I think a layered buffering mechanism is desirable:
> - The lowest layer just provides named circular buffers (both per-cpu
>   and a single system-wide buffer) and a read/write scheme.
> - The next layer provides the user/kernel interface, including controls.
> - The top layer defines packet (and event) formatting utilities.
> - Additionally, it would be better to provide some library routines
>   (timestamps, event-id synchronization, and so on).
>
> Since this unified buffer will be used by different kinds of
> tracers/loggers, I don't think all of them (and future tracers) want
> to be tied down by an "event-id" + "parameter" format. So, sorry, I
> disagree with the tracing buffer defining its *data format*; that's
> just overkill for me.

I think you're right that we can layer this, and we didn't do a
particularly good job of splitting those things out. I'll try to pull
together another revision to reflect this and the other suggested
changes.

One thing that I think is still important is to have a unified
timestamp mechanism on everything, so that we can coordinate data from
different trace tools back together in userspace. I intend to keep
that at a lower level, but I think you're right that the event id,
etc., can move up into separate layers.
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-22 22:27 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

Hi Martin,

Martin Bligh wrote:
>> I agree with integrating the tracing buffer mechanism, but I don't
>> think your proposal is the simplest one.
>> [...]
>
> I think you're right that we can layer this, and we didn't do a
> particularly good job of splitting those things out. I'll try to pull
> together another revision to reflect this and the other suggested
> changes.

I'm happy to hear that. :-)

> One thing that I think is still important is to have a unified
> timestamp mechanism on everything, so that we can coordinate data
> from different trace tools back together in userspace. I intend to
> keep that at a lower level, but I think you're right that the event
> id, etc., can move up into separate layers.

I'm not so sure the unified 'timestamp' must be required by all
tracers. If you just need to merge and sort per-cpu data, you can use
an atomic sequence number for that. IMHO, the unified 'timestamp' had
better be optional, because some architectures can't support it. I
think providing a timestamp-callback function will help us here.

By the way, systemtap uses two modes:
- single-channel mode: all cpus share one buffer channel for writing
  and reading; each writer takes a spinlock and writes its probe-local
  data to the buffer.
- per-cpu buffer mode: we use an atomic sequence number for ordering
  data; if users don't need it (because they have their own
  timestamps), they can choose not to use that sequence number.

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
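The sequence-number idea can be sketched as follows (an illustrative
model, not systemtap's actual code): stamp every record with a value
taken from one global atomic counter at trace time, then merge the
per-cpu streams by that value afterwards. All names here are made up.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static atomic_uint_fast64_t event_seq;  /* one global counter */

struct record {
    uint64_t seq;                       /* global order */
    int      cpu;
    int      event_id;
};

/* Called at trace time on any cpu: one atomic increment per event. */
static void stamp(struct record *rec, int cpu, int event_id)
{
    rec->seq      = atomic_fetch_add(&event_seq, 1);
    rec->cpu      = cpu;
    rec->event_id = event_id;
}

/* Post-processing: merge two per-cpu streams back into global order. */
static size_t merge(const struct record *a, size_t na,
                    const struct record *b, size_t nb,
                    struct record *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb)
        out[k++] = (a[i].seq < b[j].seq) ? a[i++] : b[j++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

This gives exact ordering on any architecture with atomics, at the
cost of one cross-cpu cache-line bounce per event, which is the
expense Martin objects to below.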
* Re: Unified tracing buffer

From: Darren Hart @ 2008-09-22 23:12 UTC
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
    Frank Ch. Eigler, systemtap-ml

On Mon, Sep 22, 2008 at 3:25 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:
> I'm not so sure the unified 'timestamp' must be required by all
> tracers. If you just need to merge and sort per-cpu data, you can use
> an atomic sequence number for that. IMHO, the unified 'timestamp' had
> better be optional, because some architectures can't support it. I
> think providing a timestamp-callback function will help us here.

There have been several posts on the timestamp for events. From a
real-time perspective, this timestamp will be a very important
datapoint for each event, and the more accurate and higher-resolution
the better. Some thoughts:

o pretty-print resolution should definitely be nanoseconds (IMHO)
o internal storage should be "whatever is fastest", with the
  transformation to ns stored in the trace header (as I believe
  Mathieu mentioned)
o for archs where the clock isn't synchronized across CPUs, perhaps
  for now it would be adequate to record the per-cpu timestamps in the
  trace header and include the cpu id with each event as well

This is in keeping with the previous suggestion to collect the most
primitive data available without doing any transformation at trace
time.

-- 
Darren Hart
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 0:06 UTC
To: Darren Hart
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
    Frank Ch. Eigler, systemtap-ml

Hi Darren,

Darren Hart wrote:
> There have been several posts on the timestamp for events. From a
> real-time perspective, this timestamp will be a very important
> datapoint for each event, and the more accurate and higher-resolution
> the better. Some thoughts.

Sure, I know a precise timestamp is required by real-time-sensitive
tracers. However, there are other cases; for example, when debugging
we don't need timestamps at all, we just want to know the order of
events. :-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Martin Bligh @ 2008-09-22 23:17 UTC
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

> I'm not so sure the unified 'timestamp' must be required by all
> tracers. If you just need to merge and sort per-cpu data, you can use
> an atomic sequence number for that. IMHO, the unified 'timestamp' had
> better be optional, because some architectures can't support it. I
> think providing a timestamp-callback function will help us here.

An atomic sequence number is:

(a) far less meaningful than a timestamp for the user, and
(b) more expensive to compute in many cases.

I think we came up with a way to approximate this, using a callback
every ms or so to supply the higher-order bits, and a sequential
counter in the lower-order bits, for those broken platforms.

But perhaps it would be better if we started with a discussion of
which platforms can't do global timestamps, and why not? I know some
of them are fixable, but perhaps not all.
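The approximation Martin describes can be modelled like this, under an
assumed layout of 44 bits of coarse milliseconds (sampled by a
periodic callback) over 20 bits of per-tick counter; the layout and
all names here are hypothetical, not an existing kernel API.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SEQ_BITS 20                 /* low-order ordering bits */

static atomic_uint_fast64_t coarse_ms;     /* updated by a periodic timer */
static atomic_uint_fast32_t seq_in_tick;   /* reset on every timer tick */

/* Called from the (hypothetical) millisecond tick callback.
 * Note: resetting the counter races with concurrent stamp() calls; a
 * real implementation would merge both fields into one atomic word. */
static void tick(uint64_t ms)
{
    atomic_store(&coarse_ms, ms);
    atomic_store(&seq_in_tick, 0);
}

/* Per-event stamp: coarse time in the high bits, sequence in the low,
 * so events between ticks still sort correctly even without a TSC. */
static uint64_t stamp(void)
{
    uint64_t ms  = atomic_load(&coarse_ms);
    uint32_t seq = atomic_fetch_add(&seq_in_tick, 1);

    return (ms << SEQ_BITS) | (seq & ((1u << SEQ_BITS) - 1));
}
```

The result sorts like a timestamp (time is accurate to one tick, order
within a tick is exact) while only the low-order counter needs a
cross-cpu atomic.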
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 0:07 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Hi Martin,

Martin Bligh wrote:
> An atomic sequence number is:
>
> (a) far less meaningful than a timestamp for the user, and
> (b) more expensive to compute in many cases.

Sure, an atomic counter might be more expensive, but it is accurate
for ordering (and it can be used on almost all architectures). The
cost depends on the architecture and the system configuration. So I
think it is preferable to let users choose their timestamp rather than
fix it, for example by calling a callback when writing a log entry, as
follows:

write_log(struct buffer *buffer, char *data, int len)
{
	/* reserve logging space for the stamp and the body */
	char *buf = reserve(buffer, len + buffer->timestamp.len);

	/* write a timestamp; the callback returns the next free byte */
	buf = buffer->timestamp.write(buf);

	/* write the body */
	memcpy(buf, data, len);
}

And the unified buffer provides default timestamp.write callbacks:

char *timestamp_write(char *buf);  /* write an arch-specific timestamp */
char *seqnum_write(char *buf);     /* write a sequence number */

What would you think about that?

> I think we came up with a way to approximate this, using a callback
> every ms or so to supply the higher-order bits, and a sequential
> counter in the lower-order bits, for those broken platforms.

Sure, that will work.

> But perhaps it would be better if we started with a discussion of
> which platforms can't do global timestamps, and why not? I know some
> of them are fixable, but perhaps not all.

For example, my laptop (this machine, a Core2 Duo) doesn't return a
correct TSC. :-(

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Martin Bligh @ 2008-09-23 0:13 UTC
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

>> I think we came up with a way to approximate this, using a callback
>> every ms or so to supply the higher-order bits, and a sequential
>> counter in the lower-order bits, for those broken platforms.
>
> Sure, that will work.

OK, that'd fix 99% of it, even if we only get time within 1 ms or so
(but still good ordering).

>> But perhaps it would be better if we started with a discussion of
>> which platforms can't do global timestamps, and why not? I know some
>> of them are fixable, but perhaps not all.
>
> For example, my laptop (this machine, a Core2 Duo) doesn't return a
> correct TSC. :-(

Can you define "incorrect" for me (in this case)? We had similar
problems on some AMD platforms that we can fix by syncing the TSCs on
exit_idle, etc.
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 14:51 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Hi Martin,

Martin Bligh wrote:
>> For example, my laptop (this machine, a Core2 Duo) doesn't return a
>> correct TSC. :-(
>
> Can you define "incorrect" for me (in this case)?

On my laptop, the TSC is disabled at boot time:

$ dmesg | grep TSC
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to: check_tsc_sync_source failed.

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3998.45
clflush size	: 64
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3994.43
clflush size	: 64
power management:

Actually, I had measured the TSC drift and reported it to the
systemtap bugzilla:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19

Curiously, I tested another Core2 Duo laptop whose CPU is the same
model and the same stepping, but on that laptop I couldn't see any TSC
drift. So I think this might be a product-level issue and a rare
case...

> We had similar problems on some AMD platforms that we can fix by
> syncing the TSCs on exit_idle, etc.

Hmm, very interesting. :-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Mathieu Desnoyers @ 2008-09-23 15:09 UTC
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
[...]
> $ cat /proc/cpuinfo
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 15
> model name	: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
> stepping	: 6
> cpu MHz		: 1000.000
[...]

... @ 2.00GHz and "cpu MHz : 1000.000"; isn't that a bit odd? (same
for both cpus)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 15:32 UTC
To: Mathieu Desnoyers
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Mathieu Desnoyers wrote:
> ... @ 2.00GHz and "cpu MHz : 1000.000"; isn't that a bit odd? (same
> for both cpus)

2.00GHz is the maximum (model) frequency, and 'cpu MHz' is the current
frequency (yep, I'm using cpufreq right now). Anyway, when I measured
the TSC drift, I killed the cpuspeed service and fixed the frequency
at 2000 MHz. ;-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Linus Torvalds @ 2008-09-23 16:02 UTC
To: Masami Hiramatsu
Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>
> 2.00GHz is the maximum (model) frequency, and 'cpu MHz' is the
> current frequency (yep, I'm using cpufreq right now). Anyway, when I
> measured the TSC drift, I killed the cpuspeed service and fixed the
> frequency at 2000 MHz. ;-)

Ahh. I have an idea.

Maybe that thing does thermal throttling?

Fixing the frequency at the highest setting is actually one of the
worst things you can do, because if the device is thermally limited,
it will still do the whole throttling thing, but now it won't do it by
changing the frequency any more; it will do it by essentially forcing
the external frequency down.

And that is going to be *very* inefficient. You really, really don't
want that. Your performance will actually be _worse_ than if the CPU
went to a lower frequency. And it might explain the unreliable TSC
too, because I suspect the constant TSC is really constant only wrt
the bus clock to the CPU.

The thermal throttling thing is a "protect the CPU from overheating"
last-ditch effort, and because it doesn't lower the voltage, it isn't
at all as efficient at saving power (and thus cooling the CPU) as a
real frequency change would be.

And fixing the frequency to the highest value in a tight laptop
enclosure is the best way to force that behavior (in contrast, in a
desktop system with sufficient cooling it's usually not a problem at
all to just say "run at the highest frequency").

And btw, that also explains why you saw such *big* changes in
frequency: the throttling happens with a 1/8 duty cycle thing, iirc.

It's supposed to be very rare with Core 2. Thermal throttling was
quite common with the P4, and was the main overheating protection
initially. These days you should only see it on really badly designed
devices that simply don't have enough thermal cooling; but if the
design calls for mostly running at low frequency because it's some
thin-and-light notebook with insufficient cooling (or some
thick-and-heavy thing that is just total crap), and you force it to
always run at full speed, I can imagine it kicking in to protect the
CPU.

It's obviously also going to be much easier to see if the ambient
temperature is high. If you want to get the best performance, take one
of those compressed-air spray cans and spray the laptop with the can
held upside down (the can will generally tell you _not_ to do that,
because then you'll get the liquid itself rather than gas, but that's
what you want for cooling).

So if you can test this, try it with
 (a) cpufreq at a fixed _low_ value (to not cause overheating)
 (b) with the spray-can cooling the thing and cpufreq at a fixed high
     value
and see if the TSC is constant then.

		Linus
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 17:07 UTC
To: Linus Torvalds
Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Linus Torvalds wrote:
[...]
> So if you can test this, try it with
>  (a) cpufreq at a fixed _low_ value (to not cause overheating)
>  (b) with the spray-can cooling the thing and cpufreq at a fixed high
>      value
> and see if the TSC is constant then.

Hi Linus,

Thank you for your advice. I tested it again accordingly. I did:
- service cpuspeed stop
- echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
  and checked that
  /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is 1000000
- echo 1 > /proc/acpi/thermal_zone/THM/polling_frequency
- cooling with a spray can :)
- cat /proc/acpi/thermal_zone/THM/temperature
  temperature:             39 C

and ran the test:
---
p0: c:1107576, ns:990280 ratio:111
p0: c:1805640, ns:1008787 ratio:178
p0: c:1998324, ns:1000127 ratio:199
p0: c:946380, ns:990280 ratio:95
p0: c:871728, ns:1000267 ratio:87
p0: c:1807380, ns:1007949 ratio:179
p0: c:1784808, ns:1000127 ratio:178
p0: c:1768488, ns:991676 ratio:178
p0: c:1802292, ns:1008299 ratio:178
p0: c:1787088, ns:1000406 ratio:178
p0: c:1999176, ns:1000896 ratio:199
p0: c:881364, ns:991956 ratio:88
p0: c:1802712, ns:1008019 ratio:178
p0: c:1787088, ns:998590 ratio:178
---
This still doesn't seem so stable. :-(

After the test I checked the temperature again:
# cat /proc/acpi/thermal_zone/THM/temperature
temperature:             39 C

Hmm, 39 C is not so high. I wouldn't be surprised if this is an
individual product bug. Anyway, Linux itself currently works well on
this laptop with hpet. :-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
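For what it's worth, the ratio column in the numbers above is
consistent with cycles per 100 ns of wall time, i.e. roughly tenths of
the effective TSC frequency in MHz (~199 at the nominal 2 GHz, ~87
when the TSC has slowed). This is inferred from the posted numbers,
not taken from the actual test source:

```c
#include <stdint.h>

/* Assumed derivation of the ratio column: TSC cycles per 100 ns.
 * Integer division, so the result truncates like the output above. */
static uint64_t tsc_ratio(uint64_t cycles, uint64_t ns)
{
    return cycles * 100 / ns;
}
```

Read this way, the output swings between roughly full speed (199) and
under half speed (87) over ~1 ms intervals, which is exactly the kind
of duty-cycle behavior Linus describes.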
* Re: Unified tracing buffer

From: Thomas Gleixner @ 2008-09-23 17:33 UTC
To: Masami Hiramatsu
Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
    Linux Kernel Mailing List, Steven Rostedt, darren,
    Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
[...]
> This still doesn't seem so stable. :-(
>
> After the test I checked the temperature again:
> # cat /proc/acpi/thermal_zone/THM/temperature
> temperature:             39 C
>
> Hmm, 39 C is not so high. I wouldn't be surprised if this is an
> individual product bug. Anyway, Linux itself currently works well on
> this laptop with hpet. :-)

Do you have C-states enabled on that machine?

    ls /sys/devices/system/cpu/cpu0/cpuidle/

Does it have more than a state0 entry? If yes, please do:

    cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage

where X is the highest number in there.

    cat /proc/acpi/processor/CPU0/power

might be useful as well.

Thanks,

	tglx
* Re: Unified tracing buffer 2008-09-23 17:33 ` Thomas Gleixner @ 2008-09-23 19:03 ` Masami Hiramatsu 2008-09-23 19:37 ` Thomas Gleixner 0 siblings, 1 reply; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 19:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Thomas, Thomas Gleixner wrote: > > Do you have C-States enabled on that machine ? > > ls /sys/devices/system/cpu/cpu0/cpuidle/ > > has it more than a state0 entry ? Yes, there are state0 to state3 # ls /sys/devices/system/cpu/cpu0/cpuidle/ state0 state1 state2 state3 > If yes, please do: > > cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage > > where X is the highest number in there. # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage 171210 > > cat /proc/acpi/processor/CPU0/power > > might be useful as well. # cat /proc/acpi/processor/CPU0/power active state: C0 max_cstate: C8 bus master activity: 00000000 maximum allowed latency: 2000000000 usec states: C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] Could these help you? Thank you, > > Thanks, > > tglx -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:03 ` Masami Hiramatsu @ 2008-09-23 19:37 ` Thomas Gleixner 2008-09-23 19:39 ` Martin Bligh 2008-09-23 20:06 ` Masami Hiramatsu 0 siblings, 2 replies; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 19:37 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage > 171210 C3 stops the TSC. So depending on how many C3 entries you have on the different cores, your TSCs will drift apart. Some BIOSes do even a lousy job trying to fixup the TSCs on exit from C3, which makes things even worse. > C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] > C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] > C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] > > Could these help you? Yup, explains your TSC observation. Nothing we can do about. Broken by system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! Thanks, tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:37 ` Thomas Gleixner @ 2008-09-23 19:39 ` Martin Bligh 2008-09-23 19:42 ` Thomas Gleixner 2008-09-23 20:06 ` Masami Hiramatsu 1 sibling, 1 reply; 47+ messages in thread From: Martin Bligh @ 2008-09-23 19:39 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage >> 171210 > > C3 stops the TSC. So depending on how many C3 entries you have on the > different cores, your TSCs will drift apart. Some BIOSes do even a > lousy job trying to fixup the TSCs on exit from C3, which makes things > even worse. > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] >> >> Could these help you? > > Yup, explains your TSC observation. Nothing we can do about. Broken by > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! We have linux patches that sync the TSC on exit_idle. I'll see if I can get Michael to send them out. M. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:39 ` Martin Bligh @ 2008-09-23 19:42 ` Thomas Gleixner 2008-09-23 19:51 ` Martin Bligh 0 siblings, 1 reply; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 19:42 UTC (permalink / raw) To: Martin Bligh Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Martin Bligh wrote: > On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage > >> 171210 > > > > C3 stops the TSC. So depending on how many C3 entries you have on the > > different cores, your TSCs will drift apart. Some BIOSes do even a > > lousy job trying to fixup the TSCs on exit from C3, which makes things > > even worse. > > > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] > >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] > >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] > >> > >> Could these help you? > > > > Yup, explains your TSC observation. Nothing we can do about. Broken by > > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > > We have linux patches that sync the TSC on exit_idle. I'll see if I can get > Michael to send them out. Are you sure that they sync it precicely enough that there is no user space observable way of time going backwards between cores ? Thanks, tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:42 ` Thomas Gleixner @ 2008-09-23 19:51 ` Martin Bligh 2008-09-23 20:05 ` Thomas Gleixner 0 siblings, 1 reply; 47+ messages in thread From: Martin Bligh @ 2008-09-23 19:51 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml >> > Yup, explains your TSC observation. Nothing we can do about. Broken by >> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! >> >> We have linux patches that sync the TSC on exit_idle. I'll see if I can get >> Michael to send them out. > > Are you sure that they sync it precicely enough that there is no user > space observable way of time going backwards between cores ? I think the tolerance is about 500 cycles. If that's not sufficient, I guess we'll have to either live with some slight misordering (which people have pointed out is kind of inevitable anyway) on these broken machines? It was sufficient for what we were using it for, but maybe not for everyone. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:51 ` Martin Bligh @ 2008-09-23 20:05 ` Thomas Gleixner 2008-09-23 21:02 ` Martin Bligh 0 siblings, 1 reply; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 20:05 UTC (permalink / raw) To: Martin Bligh Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Martin Bligh wrote: > >> > Yup, explains your TSC observation. Nothing we can do about. Broken by > >> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > >> > >> We have linux patches that sync the TSC on exit_idle. I'll see if I can get > >> Michael to send them out. > > > > Are you sure that they sync it precicely enough that there is no user > > space observable way of time going backwards between cores ? > > I think the tolerance is about 500 cycles. If that's not sufficient, I guess > we'll have to either live with some slight misordering (which people have > pointed out is kind of inevitable anyway) on these broken machines? > It was sufficient for what we were using it for, but maybe not for everyone. Well, I dont care about the trace reordering at all. I care about user space visible time going backwards issues observed via the gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can migrate in less than that :) I guess you try this only for machines where the TSC runs with constant frequency, right ? Looking forward to your patches. Thanks, tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 20:05 ` Thomas Gleixner @ 2008-09-23 21:02 ` Martin Bligh 0 siblings, 0 replies; 47+ messages in thread From: Martin Bligh @ 2008-09-23 21:02 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml >> I think the tolerance is about 500 cycles. If that's not sufficient, I guess >> we'll have to either live with some slight misordering (which people have >> pointed out is kind of inevitable anyway) on these broken machines? >> It was sufficient for what we were using it for, but maybe not for everyone. > > Well, I dont care about the trace reordering at all. I care about user > space visible time going backwards issues observed via the > gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can > migrate in less than that :) Right, that's what we were interested in. > I guess you try this only for machines where the TSC runs with > constant frequency, right ? We don't do DVFS on the 'broken' TSC machines, just halt. But yes, it's selective ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:37 ` Thomas Gleixner 2008-09-23 19:39 ` Martin Bligh @ 2008-09-23 20:06 ` Masami Hiramatsu 2008-09-23 20:09 ` Thomas Gleixner 1 sibling, 1 reply; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 20:06 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Thomas Gleixner wrote: > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage >> 171210 > > C3 stops the TSC. So depending on how many C3 entries you have on the > different cores, your TSCs will drift apart. Some BIOSes do even a > lousy job trying to fixup the TSCs on exit from C3, which makes things > even worse. > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] >> >> Could these help you? > > Yup, explains your TSC observation. Nothing we can do about. Broken by > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! Thank you for analyzing! :-) Hmm, then could I fix that by fixing my dsdt...? Thanks again, > > Thanks, > > tglx -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 20:06 ` Masami Hiramatsu @ 2008-09-23 20:09 ` Thomas Gleixner 0 siblings, 0 replies; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 20:09 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > Thomas Gleixner wrote: > > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage > >> 171210 > > > > C3 stops the TSC. So depending on how many C3 entries you have on the > > different cores, your TSCs will drift apart. Some BIOSes do even a > > lousy job trying to fixup the TSCs on exit from C3, which makes things > > even worse. > > > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] > >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] > >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] > >> > >> Could these help you? > > > > Yup, explains your TSC observation. Nothing we can do about. Broken by > > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > > Thank you for analyzing! :-) > Hmm, then could I fix that by fixing my dsdt...? You can limit c-states so you dont do down to the C3 state, but there is a trade off vs. power saving. Lets wait for Martins magic TSC patches first :) tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:51 ` Masami Hiramatsu 2008-09-23 15:09 ` Mathieu Desnoyers @ 2008-09-23 15:48 ` Linus Torvalds 1 sibling, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 15:48 UTC (permalink / raw) To: Masami Hiramatsu Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > > $ dmesg | grep TSC > checking TSC synchronization [CPU#0 -> CPU#1]: > Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock. > Marking TSC unstable due to: check_tsc_sync_source failed. Hmm.. Very interesting. It smells of a non-stable TSC, but your Core2 Cpu shouldn't have that issue: > > $ cat /proc/cpuinfo > processor : 0 > vendor_id : GenuineIntel > cpu family : 6 > model : 15 > model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz > stepping : 6 > cpu MHz : 1000.000 > cache size : 4096 KB > flags : ... constant_tsc ... > > Actually, I had measured TSC drifting and reported to systemtap-bugzilla > http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19 I'd have assumed it was just some initial offset issue, but your bug-report seems to say that it really does change the TSC frequency when the CPU frequency changes. That should _not_ happen on a core2 CPU, afaik. I didn't even know it could be a setup issue, but it does really smell like your TSC frequency changes. Now, unstable TSC's are not uncommon per se, and most older Intel CPU's will do it, it's just that I thought it was fixed in Core2 (and later P4's for that matter). The rule *should* be that: - family = 15 (netburst), model 3 or higher has constant TSC - family = 6 (PPro), model 14 or higher (Core, Core 2) have constant TSCs. This is quite clearly documented: see Intel ia docs, vol 3B, 18.10 "Time-stamp counter". Very odd. I wonder what your laptop does to screw this up. 
I also suspect that since we already _noticed_ that the TSC isn't stable, we should also have then cleared the "constant TSC" bit. And we apparently didn't. Btw, your CPU looks quite odd in other respects too. Judging by your bugzilla entry, the TSC sometimes ticks at 2GHz (fine), sometimes at 1Ghz (also fine), and sometimes at 667/500MHz judging by the ratios you show for TSC/timer tick. And that last one is really odd, afaik most 2GHz Core 2 duo's will have a lower frequency of 1GHz. Is that some special low-power version, perhaps? Or maybe it isn't a speedstep-able CPU at all, and the system actually changes the *bus* frequency (and then the CPU frequency is some constant factor of that). If so, the system messes with the CPU in bad ways. And btw, I'm almost certain that what you see isn't actually any "CPU drift" in the sense that I strongly suspect that the TSC's for both cores will change frequency together. So because the TSC isn't stable, it's not a good time-source, but despite that it's not necessarily a bad way to compare events across cores. To actually have different CPU's TSC drift wrt each other, you almost have to have them in different clock domains. And that is *very* rare. It happens when the CPU's are on different boards, and sure if happens if the CPU's have non-constant TSCs with different frequencies, but neither of those should be very common at all. The latter is uncommon because it's almost unheard of of having multi-socket devices with old CPU's that also do frequency changes. Older multi-core CPU's tend to do frequency changes the whole chip at a time, and newer multi-core CPU's should all basically have a fixed TSC so even when they do frequency changes independently, the TSC should still be off the same clock on the die. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:07 ` Masami Hiramatsu 2008-09-23 0:13 ` Martin Bligh @ 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier ` (2 more replies) 1 sibling, 3 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 0:41 UTC (permalink / raw) To: Masami Hiramatsu Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Masami Hiramatsu wrote: > > Sure, atomic counter might be more expensive but accurate for ordering. Don't be silly. An atomic counter is no more accurate for ordering than anything else. Why? Because all it tells you is the ordering of the atomic increment, not of the caller. The atomic increment is not related to all the other ops that the code that you trace actually does in any shape or form, and so the ordering of the trace doesn't actually imply anything for the ordering of the operations you are tracing! Except for a single CPU, of course, but for that case you don't need a sequence number either, since the ordering is entirely determined by the ring buffer itself. So the counter will be more expensive (cross-cpu cache bouncing for EVERY SINGLE EVENT), less useful (no real meaning for people who DO want to have a timestamp), and it's really no more "ordered" than anything that bases itself on a TSC. The fact is, you cannot order operations based on log messages unless you have a lock around the whole caller - absolutely _no_ amount of locking or atomic accesses in the log itself will guarantee ordering of the upper layers. And sure, if you have locking at a higher layer, then a sequence number is sufficient, but on the other hand, so is a well-synchronized TSC. So personally, I think that the optimal solution is: - let each ring buffer be associated with a "gettimestamp()" function, so that everybody _can_ set it to something of their own. But default to something sane, namely a raw TSC thing. 
- Add synchronization events to the ring buffer often enough that you can make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply noticing when the upper 32 bits change, although you could possibly do it with a heartbeat too. - Similarly, add a synchronization event when the TSC frequency changes. - Make the synchronization packet contain the full 64-bit TSC base, in addition to TSC frequency info _and_ the timebase. - From those synchronization events, you should be able to get a very accurate timestamp *after* the fact from the raw TSC numbers (ie do all the scaling not when you gather the info, but when you present it), even if you only spent 32 bits of TSC info on 99% of all events (an just had a overflow log occasionally to get the rest of the info) - Most people will be _way_ happier with a timestamp that has enough precision to also show ordering (assuming that the caller holds a lock over the operation _including_ the tracing) than they would ever be with a sequence number. - people who really want to can consider the incrementing counter a TSC, but it will suck in so many ways that I bet it will not be very popular at all. But having the option to set a special timestamp function will give people the option (on a per-buffer level) to make the "TSC" be a simple incrementing 32-bit counter using xaddl and the upper bits incrementing from a timer, but keep that as a "ok, the TSC is really broken, or this architecture doesn't support any fast cycle counters at all, or I really don't care about time, just sequence, and I guarantee I have a single lock in all callers that makes things unambiguous" Note the "single lock" part. It's not enough that you make any trace thing under a lock. They must be under the _same_ lock for all relevant events for you to be able to say anything about ordering. And that's actually pretty rare for any complex behavior. The timestamping, btw, is likely the most important part of the whole logging thing. 
So we need to get it right. But by "right" I mean really really low-latency so that it's acceptable to everybody, real-time enough that you can tell how far apart events were, and precise enough that you really _can_ see ordering. The "raw TSC value with correction information" should be able to give you all of that. At least on x86. On some platforms, the TSC may not give you enough resolution to get reasonable guesses on event ordering. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:41 ` Linus Torvalds @ 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt ` (2 more replies) 2008-09-23 2:31 ` Mathieu Desnoyers 2008-09-23 3:07 ` Masami Hiramatsu 2 siblings, 3 replies; 47+ messages in thread From: Roland Dreier @ 2008-09-23 1:28 UTC (permalink / raw) To: Linus Torvalds Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! This reminds me of a naive question that occurred to me while we were discussing this at KS. Namely, what does "ordering" mean for events? An example I'm all too familiar with is the lack of ordering of MMIO on big SGI systems -- if you forget an mmiowb(), then two CPUs taking a spinlock and doing writel() inside the spinlock and then dropping the spinlock (which should be enough to "order" things) might see the writel() reach the final device "out of order" because the write has to travel through a routed system fabric. Just like Einstein said, it really seems to me that the order of things depends on your frame of reference. - R. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 1:28 ` Roland Dreier @ 2008-09-23 1:40 ` Steven Rostedt 2008-09-23 2:08 ` Mathieu Desnoyers 2008-09-23 3:29 ` Linus Torvalds 2 siblings, 0 replies; 47+ messages in thread From: Steven Rostedt @ 2008-09-23 1:40 UTC (permalink / raw) To: Roland Dreier Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Roland Dreier wrote: > > Because all it tells you is the ordering of the atomic increment, not of > > the caller. The atomic increment is not related to all the other ops that > > the code that you trace actually does in any shape or form, and so the > > ordering of the trace doesn't actually imply anything for the ordering of > > the operations you are tracing! > > This reminds me of a naive question that occurred to me while we were > discussing this at KS. Namely, what does "ordering" mean for events? > > An example I'm all too familiar with is the lack of ordering of MMIO on > big SGI systems -- if you forget an mmiowb(), then two CPUs taking a > spinlock and doing writel() inside the spinlock and then dropping the > spinlock (which should be enough to "order" things) might see the > writel() reach the final device "out of order" because the write has to > travel through a routed system fabric. > > Just like Einstein said, it really seems to me that the order of things > depends on your frame of reference. In my logdev tracer (see http://rostedt.homelinux.com/logdev) I used an atomic counter to keep "order". But what I would say to people what this order means, is that order is among multiple traces between multiple CPUS. That is if you have. CPU 1 CPU 2 trace_point_a trace_point_c trace_point_b trace_point_d If you see in the trace: trace_point_a trace_point_c You really do not know which happened first. 
Simply because trace_point_c could have been hit first, but for interrupts and nmis and what not, trace_point_a could have easily been recorded first. But to me, trace_points are more like memory barriers. If I see: trace_point_c trace_point_a trace_point_b trace_point_d I can assume that everything before trace_point_c happened before everything after trace_point_a, and that all before trace_point_b happened before trace_point_d. One can not assume that the trace points themselves are in order. But you can assume that the things outside the trace points are, like memory barriers. I have found lots of race conditions with my logdev, and it was due to this "memory barrier" likeness to be able to see the races. Unfortunately, if you are using an out of sync TSC, you lose even the memory barrier characteristic of the trace. -- Steve ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt @ 2008-09-23 2:08 ` Mathieu Desnoyers 2008-09-23 2:27 ` Darren Hart 2008-09-23 3:29 ` Linus Torvalds 2 siblings, 1 reply; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:08 UTC (permalink / raw) To: Roland Dreier Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Roland Dreier (rdreier@cisco.com) wrote: > > Because all it tells you is the ordering of the atomic increment, not of > > the caller. The atomic increment is not related to all the other ops that > > the code that you trace actually does in any shape or form, and so the > > ordering of the trace doesn't actually imply anything for the ordering of > > the operations you are tracing! > > This reminds me of a naive question that occurred to me while we were > discussing this at KS. Namely, what does "ordering" mean for events? > > An example I'm all too familiar with is the lack of ordering of MMIO on > big SGI systems -- if you forget an mmiowb(), then two CPUs taking a > spinlock and doing writel() inside the spinlock and then dropping the > spinlock (which should be enough to "order" things) might see the > writel() reach the final device "out of order" because the write has to > travel through a routed system fabric. > > Just like Einstein said, it really seems to me that the order of things > depends on your frame of reference. > > - R. > Exactly as Linus said, event ordering comes down to this : a choice between heavy locking around the real operation traced and the tracing statement itself (irq disable/spinlock) or the acknowledgement that the ordering is only insured across the actual tracing _instrumentation_. A worse case scenario would be to get an interrupt between the "real" operation (e.g. 
a memory or mmio write) and the tracing statement, be scheduled out, which would let a lot of stuff happen between the actual impact of the operation on kernel memory and the tracing statement itself. If we want to be _sure_ such a thing never happens, we would then have to pay the price of heavy locking, and that would not be pretty, especially when complex data structure modifications come into play. I don't really think anyone with a half-sane mind would want to slow down such critical kernel operations for the benefit of totally ordered tracing. However, in many cases where ordering matters, e.g. to instrument spinlocks themselves, if we put the instrumentation within the critical section rather than outside of it, then we benefit from the existing kernel locking (but only for events related to this specific spinlock). This is the same for many synchronization primitives, except for atomic operations, where we have to accept that the order will be imperfect. So only in the specific case of instrumentation of things like locking, where it is possible to ensure that instrumentation is synchronized with the instrumented operation, does it make a difference to choose the TSC (which implies a slight delta between the TSCs due to cache line delays at synchronization and delay due to TSC drift caused by temperature) over an atomic increment. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 2:08 ` Mathieu Desnoyers @ 2008-09-23 2:27 ` Darren Hart 2008-09-23 2:32 ` Mathieu Desnoyers 0 siblings, 1 reply; 47+ messages in thread From: Darren Hart @ 2008-09-23 2:27 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, Frank Ch. Eigler, systemtap-ml On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org> > So only in the specific case of instrumentation of things like locking, > where it is possible to insure that instrumentation is synchronized with > the instrumented operation, does it make a difference to choose the TSC > (which implies a slight delta between the TSCs due to cache line delays > at synchronization and delay due to TSCs drifts caused by temperature) > over an atomic increment. > Hrm, I think that overlooks the other reason to use a time-based counter over an atomic increment: you might care about time. Perhaps one might be less concerned with the actual order of tightly grouped events and more concerned with the actual time delta between more temporally distant events. In that case, using a clocksource would still be valuable. Admittedly the caller could embed that in their payload, but since we seem to agree we need some kind of counter, the time-based counter appears to be the most flexible. Thanks, -- Darren Hart ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 2:27 ` Darren Hart @ 2008-09-23 2:32 ` Mathieu Desnoyers 0 siblings, 0 replies; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:32 UTC (permalink / raw) To: Darren Hart Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, Frank Ch. Eigler, systemtap-ml * Darren Hart (darren@dvhart.com) wrote: > On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org> > > So only in the specific case of instrumentation of things like locking, > > where it is possible to insure that instrumentation is synchronized with > > the instrumented operation, does it make a difference to choose the TSC > > (which implies a slight delta between the TSCs due to cache line delays > > at synchronization and delay due to TSCs drifts caused by temperature) > > over an atomic increment. > > > > Hrm, i think that overlooks the other reason to use a time based counter over > an atomic increment: you might care about time. Perhaps one might be less > concerned with actual order tightly grouped events and more concerned with the > actual time delta between more temporally distant events. In that case, using > a clocksource would still be valuable. Although admitedtly the caller could > embed that in their payload, but since we seem to agree we need some kind of > counter, the time-based counter appears to be the most flexible. > > Thanks, > See my answer to Linus for a proposal on how to do both :) Mathieu > -- > Darren Hart > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt 2008-09-23 2:08 ` Mathieu Desnoyers @ 2008-09-23 3:29 ` Linus Torvalds 2008-09-23 3:42 ` Mathieu Desnoyers 2008-09-23 3:44 ` Steven Rostedt 2 siblings, 2 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 3:29 UTC (permalink / raw) To: Roland Dreier Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Roland Dreier wrote: > > Just like Einstein said, it really seems to me that the order of things > depends on your frame of reference. Heh. Yes. In general, there is no single ordering unless you actually use a serializing lock on all CPU's involved. And exactly as in the theory of relativity, two people on different CPU's can actually validly _disagree_ about the ordering of the same event. There are things that act as "light-cones" and are borders for what everybody can agree on, but basically, in the absence of explicit locks, it is very possible that no such thing as "ordering" may even exist. Now, an atomic increment on a single counter obviously does imply *one* certain ordering, but it really only defines the ordering of that counter itself. It does not at all necessarily imply any ordering on the events that go on around the counter in other unrelated cachelines. Which is exactly why even a global counter in no way orders "events" in general, unless those events have something else that does so. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 3:29 ` Linus Torvalds @ 2008-09-23 3:42 ` Mathieu Desnoyers 2008-09-23 4:06 ` Linus Torvalds 2008-09-23 3:44 ` Steven Rostedt 1 sibling, 1 reply; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 3:42 UTC (permalink / raw) To: Linus Torvalds Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Roland Dreier wrote: > > > > Just like Einstein said, it really seems to me that the order of things > > depends on your frame of reference. > > Heh. Yes. In general, there is no single ordering unless you actually use > a serializing lock on all CPU's involved. > > And exactly as in the theory of relativity, two people on different CPU's > can actually validly _disagree_ about the ordering of the same event. > There are things that act as "light-cones" and are borders for what > everybody can agree on, but basically, in the absence of explicit locks, > it is very possible that no such thing as "ordering" may even exist. > > Now, an atomic increment on a single counter obviously does imply *one* > certain ordering, but it really only defines the ordering of that counter > itself. It does not at all necessarily imply any ordering on the events > that go on around the counter in other unrelated cachelines. > > Which is exactly why even a global counter in no way orders "events" in > general, unless those events have something else that does so. > > Linus > Unless I am missing something, in the case where we use an atomic operation which implies memory barriers (cmpxchg and atomic_add_return do), one can be sure that all memory operations done before the barrier are completed at the barrier and that all memory ops following the barrier will happen after. Did you have something else in mind? 
Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
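Mathieu's point above (cmpxchg and atomic_add_return imply full barriers, so a global sequence counter is itself strongly ordered) can be sketched in userspace with C11 atomics standing in for the kernel primitives; this is a minimal sketch, and the names and event layout are illustrative, not a proposed API:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global event sequence counter shared by all writers.  The seq_cst
 * fetch-add models the kernel's atomic_add_return(), which implies
 * full memory barriers before and after the operation. */
static atomic_uint_fast64_t trace_seq;

struct trace_event {
    uint64_t seq;       /* globally ordered sequence number */
    uint32_t event_id;
};

/* Reserve the next sequence number.  Note Linus's caveat: the barrier
 * orders the counter itself, not the unrelated memory accesses around
 * the tracepoint. */
static uint64_t trace_next_seq(void)
{
    return atomic_fetch_add_explicit(&trace_seq, 1,
                                     memory_order_seq_cst);
}

static void trace_fill(struct trace_event *ev, uint32_t id)
{
    ev->seq = trace_next_seq();
    ev->event_id = id;
}
```

Two events filled in sequence always carry consecutive sequence numbers; across CPUs the counter still bounces its cache line on every single event, which is the cost objected to elsewhere in the thread.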
* Re: Unified tracing buffer 2008-09-23 3:42 ` Mathieu Desnoyers @ 2008-09-23 4:06 ` Linus Torvalds 0 siblings, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 4:06 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Mathieu Desnoyers wrote: > > Unless I am missing something, in the case we use an atomic operation > which implies memory barriers (cmpxchg and atomic_add_return does), one > can be sure that all memory operations done before the barrier are > completed at the barrier and that all memory ops following the barrier > will happen after. Sure (if you have a barrier - not all architectures will imply that for an increment). But that still doesn't mean a thing. You have two events (a) and (b), and you put trace-points on each. In your trace, you see (a) before (b) by comparing the numbers. But what does that mean? The actual event that you traced is not the trace-point - the trace-point is more like a fancy "printk". And the fact that one showed up before another in the trace buffer, doesn't mean that the events _around_ the trace happened in the same order. You can use the barriers to make a partial ordering, and if you have a separate tracepoint for entry into a region and exit, you can perhaps show that they were totally disjoint. Or maybe they were partially overlapping, and you'll never know exactly how they overlapped. Example: trace(..); do_X(); being executed on two different CPU's. In the trace, CPU#1 was before CPU#2. Does that mean that "do_X()" happened first on CPU#1? No. The only way to show that would be to put a lock around the whole trace _and_ operation X, ie spin_lock(..); trace(..); do_X(); spin_unlock(..); and now, if CPU#1 shows up in the trace first, then you know that do_X() really did happen first on CPU#1. 
Otherwise you basically know *nothing*, and the ordering of the trace events was totally and utterly meaningless. See? Trace events themselves may be ordered, but the point of the trace event is never to know the ordering of the trace itself - it's to know the ordering of the code we're interested in tracing. The ordering of the trace events themselves is irrelevant and not useful. And I'd rather see people _understand_ that, than if they think the ordering is somehow something they can trust. Btw, if you _do_ have locking, then you can also know that the "do_X()" operations will be essentially as far apart in some theoretical notion of "time" (let's imagine that we do have global time, even if we don't) as the cost of the trace operation and do_X() itself. So if we _do_ have locking (and thus a valid ordering that actually can matter), then the TSC doesn't even have to be synchronized on a cycle basis across CPU's - it just needs to be close enough that you can tell which one happened first (and with ordering, that's a valid thing to do). So you don't even need "perfect" synchronization, you just need something reasonably close, and you'll be able to see ordering from TSC counts without having that horrible bouncing cross-CPU thing that will impact performance a lot. Quite frankly, I suspect that anybody who wants to have a global counter might as well almost just have a global ring-buffer. The trace events aren't going to be CPU-local anyway if you need to always update a shared cacheline - and you might as well make the shared cacheline be the ring buffer head with a spinlock in it. That may not be _quite_ true, but it's probably close enough. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
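Linus's spin_lock example, the one case where trace order provably matches the order of do_X(), can be sketched in userspace with a minimal test-and-set lock standing in for the kernel spinlock; everything here is an illustrative stand-in, not kernel code:

```c
#include <stdatomic.h>

/* Minimal test-and-set spinlock standing in for the kernel's
 * spin_lock(); acquire/release ordering is what makes the trace
 * order meaningful. */
static atomic_flag trace_lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&trace_lock,
                                             memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&trace_lock, memory_order_release);
}

static int trace_log[16], trace_len;   /* what the trace records   */
static int do_x_order[16], do_x_len;   /* what do_X() actually did */

/* Because the lock covers both the tracepoint and the traced
 * operation, the order of entries in trace_log provably matches the
 * order in which do_X() ran. */
static void traced_do_x(int cpu)
{
    spin_lock();
    trace_log[trace_len++] = cpu;    /* trace(...) */
    do_x_order[do_x_len++] = cpu;    /* do_X()     */
    spin_unlock();
}
```

Without the lock, only the trace entries themselves are ordered; with it, whichever CPU appears first in trace_log is guaranteed to have run do_X() first.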
* Re: Unified tracing buffer 2008-09-23 3:29 ` Linus Torvalds 2008-09-23 3:42 ` Mathieu Desnoyers @ 2008-09-23 3:44 ` Steven Rostedt 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:20 ` Linus Torvalds 1 sibling, 2 replies; 47+ messages in thread From: Steven Rostedt @ 2008-09-23 3:44 UTC (permalink / raw) To: Linus Torvalds Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Linus Torvalds wrote: > > > On Mon, 22 Sep 2008, Roland Dreier wrote: > > > > Just like Einstein said, it really seems to me that the order of things > > depends on your frame of reference. > > Heh. Yes. In general, there is no single ordering unless you actually use > a serializing lock on all CPU's involved. > > And exactly as in the theory of relativity, two people on different CPU's > can actually validly _disagree_ about the ordering of the same event. > There are things that act as "light-cones" and are borders for what > everybody can agree on, but basically, in the absence of explicit locks, > it is very possible that no such thing as "ordering" may even exist. > > Now, an atomic increment on a single counter obviously does imply *one* > certain ordering, but it really only defines the ordering of that counter > itself. It does not at all necessarily imply any ordering on the events > that go on around the counter in other unrelated cachelines. > > Which is exactly why even a global counter in no way orders "events" in > general, unless those events have something else that does so. Hmm, I've been pretty spoiled by x86 mostly ordering things correctly, and the non-x86 boxes I've used have mostly been UP. But, with that, with a global atomic counter, and the following trace: cpu 0: trace_point_a cpu 1: trace_point_c cpu 0: trace_point_b cpu 1: trace_point_d Could the event a really come after event d, even though we already hit event b? 
But I guess you are stating the fact that what the computer does internally, no one really knows. Without the help of real memory barriers, ordering of memory accesses is mostly determined by tarot cards. But basically, the perceived result of assembly commands is supposed to be accurate at the user level. The traces that I've used not only show the order (or perceived order) of events, but also the output of the corrupted data when the race happens. Usually with the two together, you can pretty much guarantee that the traced events actually did occur in the order presented. But without some perceived accurate ordering, even when you do see the corrupted data, the events can easily be misleading, even on an arch that usually orders the events as seen by the user. -- Steve ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 3:44 ` Steven Rostedt @ 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:18 ` Martin Bligh 2008-09-23 10:54 ` Steven Rostedt 1 sibling, 2 replies; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 4:12 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Roland Dreier, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml Steven Rostedt wrote: > But, with that, with a global atomic counter, and the following trace: > > cpu 0: trace_point_a > cpu 1: trace_point_c > cpu 0: trace_point_b > cpu 1: trace_point_d > > > Could the event a really come after event d, even though we already hit > event b? yes, if event c is an interrupt event :-). cpu 0 cpu 1 hit event d hit event a log event a irq event c log event c hit event b log event b log event d so, I think if we really need to order events, we have to stop irq right after hitting an event. Anyway, in most cases I think it works, but it is only about as accurate as a synchronized TSC, if the hardware supports one. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 4:12 ` Masami Hiramatsu @ 2008-09-23 4:18 ` Martin Bligh 2008-09-23 15:25 ` Masami Hiramatsu 2008-09-23 10:54 ` Steven Rostedt 1 sibling, 1 reply; 47+ messages in thread From: Martin Bligh @ 2008-09-23 4:18 UTC (permalink / raw) To: Masami Hiramatsu Cc: Steven Rostedt, Linus Torvalds, Roland Dreier, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml > yes, if event c is an interrupt event :-). > > cpu 0 cpu 1 > hit event d > hit event a > log event a > irq event c > log event c > hit event b > log event b > log event d > > so, I think if we really need to order events, we have to stop > irq right after hitting an event. How could you fix that in any practical way? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 4:18 ` Martin Bligh @ 2008-09-23 15:25 ` Masami Hiramatsu 0 siblings, 0 replies; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 15:25 UTC (permalink / raw) To: Martin Bligh Cc: Steven Rostedt, Linus Torvalds, Roland Dreier, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml Hi Martin, Martin Bligh wrote: >> log event c >> hit event b >> log event b >> log event d >> >> so, I think if we really need to order events, we have to stop >> irq right after hitting an event. > > How could you fix that in any practical way? In that case, I think it's easy to know that event c is an irq-related event by checking the source code. :-) And practically, in most cases, I think we can presume what happened from event arguments and subsequent events. Anyway, since there must be a delay between hitting an event (and, as Linus said, the event actually happening) and logging it, I think the buffering mechanism itself should focus on ensuring only the writing and reading order. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:18 ` Martin Bligh @ 2008-09-23 10:54 ` Steven Rostedt 1 sibling, 0 replies; 47+ messages in thread From: Steven Rostedt @ 2008-09-23 10:54 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Roland Dreier, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > Steven Rostedt wrote: > > But, with that, with a global atomic counter, and the following trace: > > > > cpu 0: trace_point_a > > cpu 1: trace_point_c > > cpu 0: trace_point_b > > cpu 1: trace_point_d > > > > > > Could the event a really come after event d, even though we already hit > > event b? > > yes, if event c is an interrupt event :-). > > cpu 0 cpu 1 > hit event d > hit event a > log event a > irq event c > log event c heh, This is assuming that event c is in an IRQ handler. Since I control where event c is, I can prevent that. I'm talking about the CPU doing something funny that would have c come after d. But I didn't specify exactly what the events were, so I'll accept that explanation ;-) -- Steve > hit event b > log event b > log event d > > so, I think if we really need to order events, we have to stop > irq right after hitting an event. > > Anyway, in most case, I think it works, but as accurate as > synchronized-TSC if hardware supports it. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 3:44 ` Steven Rostedt 2008-09-23 4:12 ` Masami Hiramatsu @ 2008-09-23 4:20 ` Linus Torvalds 2008-09-23 14:13 ` Mathieu Desnoyers 1 sibling, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 4:20 UTC (permalink / raw) To: Steven Rostedt Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Steven Rostedt wrote: > > But, with that, with a global atomic counter, and the following trace: > > cpu 0: trace_point_a > cpu 1: trace_point_c > cpu 0: trace_point_b > cpu 1: trace_point_d > > Could the event a really come after event d, even though we already hit > event b? Each tracepoint will basically give a partial ordering (if you make it so, of course - and on x86 it's hard to avoid it). And with many trace-points, you can narrow down ordering if you're lucky. But say that you have code like CPU#1 CPU#2 trace_a trace_c .. .. trace_b trace_d and since each CPU itself is obviously strictly ordered, you a priori know that a < b, and c < d. But your trace buffer can look many different ways: - a -> b -> c -> d c -> d -> a -> b Now you do know that what happened between c and d must all have happened entirely after/before the things that happened between a and b, and there is no overlap. This is only assuming the x86 full memory barrier from a "lock xadd" of course, but those are the semantics you'd get on x86. On others, the ordering might not be that strong. - a -> c -> b -> d a -> c -> d -> b With these trace point orderings, you really don't know anything at all about the order of any access that happened in between. CPU#1 might have gone first. Or not. Or partially. You simply do not know. > But I guess you are stating the fact that what the computer does > internally, no one really knows. Without the help of real memory barriers, > ording of memory accesses is mostly determined by tarot cards. 
Well, x86 defines a memory order. But what I'm trying to explain is that memory order still doesn't actually specify what happens to the code that actually does tracing! The trace is only going to show the order of the tracepoints, not the _other_ memory accesses. So you'll have *some* information, but it's very partial. And the thing is, all those other memory accesses are the ones that do all the real work. You'll know they happened _somewhere_ between two tracepoints, but not much more than that. This is why timestamps aren't really any worse than sequence numbers in all practical matters. They'll get you close enough that you can consider them equivalent to a cache-coherent counter, just one that you don't have to take a cache miss for, and that increments on its own! Quite a lot of CPU's have nice, dependable, TSC's that run at constant frequency. And quite a lot of traces care a _lot_ about real time. When you do IO tracing, the problem is almost never about lock ordering or anything like that. You want to see how long a request took. You don't care AT ALL how many tracepoints were in between the beginning and end, you care about how many microseconds there were! Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
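The "scale at presentation time" idea above, logging raw unscaled cycle counts and converting to wall-clock units only when the trace is displayed, can be sketched as follows; the 3 GHz frequency is an assumed example value, not something probed from hardware:

```c
#include <stdint.h>

/* Raw, unscaled cycle counts are logged at trace time; conversion to
 * wall-clock units happens only when the trace is presented.  The
 * frequency below is an assumed example value for a constant-rate
 * TSC, not something read from the hardware. */
#define TSC_HZ 3000000000ULL  /* assumed 3 GHz */

static uint64_t cycles_to_us(uint64_t cycles)
{
    return cycles * 1000000ULL / TSC_HZ;
}

/* Elapsed microseconds between two raw timestamps, e.g. the start and
 * end of an I/O request, which is what the trace consumer actually
 * cares about here. */
static uint64_t delta_us(uint64_t start, uint64_t end)
{
    return cycles_to_us(end - start);
}
```

A real implementation would take the frequency from the synchronization events discussed later in the thread rather than a compile-time constant, so that frequency changes between sync points can be accounted for.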
* Re: Unified tracing buffer 2008-09-23 4:20 ` Linus Torvalds @ 2008-09-23 14:13 ` Mathieu Desnoyers 0 siblings, 0 replies; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 14:13 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Steven Rostedt wrote: > > > > But, with that, with a global atomic counter, and the following trace: > > > > cpu 0: trace_point_a > > cpu 1: trace_point_c > > cpu 0: trace_point_b > > cpu 1: trace_point_d > > > > Could the event a really come after event d, even though we already hit > > event b? > > Each tracepoint will basically give a partial ordering (if you make it so, > of course - and on x86 it's hard to avoid it). > > And with many trace-points, you can narrow down ordering if you're lucky. > > But say that you have code like > > CPU#1 CPU#2 > > trace_a trace_c > .. .. > trace_b trace_d > > and since each CPU itself is obviously strictly ordered, you a priori know > that a < b, and c < d. But your trace buffer can look many different ways: > > - a -> b -> c -> d > c -> d -> a -> b > > Now you do know that what happened between c and d must all have > happened entirely after/before the things that happened between > a and b, and there is no overlap. > > This is only assuming the x86 full memory barrier from a "lock xadd" of > course, but those are the semantics you'd get on x86. On others, the > ordering might not be that strong. > Hrm, Documentation/atomic_ops.txt states that : "Unlike the above routines, it is required that explicit memory barriers are performed before and after the operation. It must be done such that all memory operations before and after the atomic operation calls are strongly ordered with respect to the atomic operation itself." 
So on architectures with weaker ordering, the kernel atomic ops already require that explicit smp_mb() be inserted before and after the atomic increment. The same applies to cmpxchg. Therefore I think it's ok, given the semantic provided by these two atomic operations, to assume they imply a smp_mb() for any given architecture. If not, then the architecture-specific implementation would be broken wrt the semantic. > - a -> c -> b -> d > a -> c -> d -> b > > With these trace point orderings, you really don't know anything at all > about the order of any access that happened in between. CPU#1 might > have gone first. Or not. Or partially. You simply do not know. > Yep. If two "real kernel" events happen to belong to the same overlapping time window, there is not much we can know about their order. Adding tracing statements before and after traced kernel operations could help to make this window as small as possible, but I doubt it's worth the performance penalty and event duplication (and increased trace size). Mathieu > > But I guess you are stating the fact that what the computer does > > internally, no one really knows. Without the help of real memory barriers, > > ording of memory accesses is mostly determined by tarot cards. > > Well, x86 defines a memory order. But what I'm trying to explain is that > memory order still doesn't actually specify what happens to the code that > actually does tracing! The trace is only going to show the order of the > tracepoints, not the _other_ memory accesses. So you'll have *some* > information, but it's very partial. > > And the thing is, all those other memory accesses are the ones that do all > the real work. You'll know they happened _somewhere_ between two > tracepoints, but not much more than that. > > This is why timestamps aren't really any worse than sequence numbers in > all practical matters. 
They'll get you close enough that you can consider > them equivalent to a cache-coherent counter, just one that you don't have > to take a cache miss for, and that increments on its own! > > Quite a lot of CPU's have nice, dependable, TSC's that run at constant > frequency. > > And quite a lot of traces care a _lot_ about real time. When you do IO > tracing, the problem is almost never about lock ordering or anything like > that. You want to see how long a request took. You don't care AT ALL how > many tracepoints were in between the beginning and end, you care about how > many microseconds there were! > > Linus > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier @ 2008-09-23 2:31 ` Mathieu Desnoyers 2008-09-23 3:07 ` Masami Hiramatsu 2 siblings, 0 replies; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:31 UTC (permalink / raw) To: Linus Torvalds Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Masami Hiramatsu wrote: > > > > Sure, atomic counter might be more expensive but accurate for ordering. > > Don't be silly. > > An atomic counter is no more accurate for ordering than anything else. > > Why? > > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! > > Except for a single CPU, of course, but for that case you don't need a > sequence number either, since the ordering is entirely determined by the > ring buffer itself. > > So the counter will be more expensive (cross-cpu cache bouncing for EVERY > SINGLE EVENT), less useful (no real meaning for people who DO want to have > a timestamp), and it's really no more "ordered" than anything that bases > itself on a TSC. > > The fact is, you cannot order operations based on log messages unless you > have a lock around the whole caller - absolutely _no_ amount of locking or > atomic accesses in the log itself will guarantee ordering of the upper > layers. > > And sure, if you have locking at a higher layer, then a sequence number is > sufficient, but on the other hand, so is a well-synchronized TSC. 
> > So personally, I think that the optimal solution is: > > - let each ring buffer be associated with a "gettimestamp()" function, so > that everybody _can_ set it to something of their own. But default to > something sane, namely a raw TSC thing. > > - Add synchronization events to the ring buffer often enough that you can > make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply > noticing when the upper 32 bits change, although you could possibly do > it with a heartbeat too. > Hi Linus, I totally agree with all of the above, > - Similarly, add a synchronization event when the TSC frequency changes. > > - Make the synchronization packet contain the full 64-bit TSC base, in > addition to TSC frequency info _and_ the timebase. > If possible, I would recommend staying as far away as possible from using frequency change events to support broken frequency scaling. First, some architectures, like AMD, do not trigger an event for every frequency change on the CPU (if my memory is right, I think the southbridge temperature throttling does not generate any event), and we also have to consider the delay between the frequency change and the moment the instrumentation would be called. Each frequency change event would therefore increase the overall TSC imprecision, and I doubt we really want to rely on this for our time base. I welcome people to prove me wrong or paranoid by providing results proving the correct synchronicity of such an approach over time. :-) I explain below, however, how LTTng deals with architectures I consider to fall into the "broken" category. 
> - From those synchronization events, you should be able to get a very > accurate timestamp *after* the fact from the raw TSC numbers (ie do all > the scaling not when you gather the info, but when you present it), > even if you only spent 32 bits of TSC info on 99% of all events (an > just had a overflow log occasionally to get the rest of the info) > > - Most people will be _way_ happier with a timestamp that has enough > precision to also show ordering (assuming that the caller holds a > lock over the operation _including_ the tracing) than they would ever > be with a sequence number. > One gettimestamp() we could think of, to satisfy both people expecting tracing to perform a memory barrier and people expecting a timestamp, would be to write timestamps taken from synchronized TSCs into a variable with a cmpxchg, which would succeed only if the TSC value is higher than the value present in this variable. That would give both the memory barrier behavior and the timestamping. Given the cache-line bouncing implied, I would only recommend activating this when really needed, e.g. when debugging race issues. I currently use this scheme in LTTng to deal with broken x86 architectures with non-synchronized TSCs. Basically, all CPUs lagging behind others take the TSC value read from memory and add 200 cycles (or however many cycles it takes to actually read the TSC) to the last cycle counter value written to memory by the previous time base read. I also plan to execute a heartbeat on every CPU someday to give an upper bound on the imprecision of such TSCs so it could be used for some sort of performance measurement. > - people who really want to can consider the incrementing counter a TSC, > but it will suck in so many ways that I bet it will not be very popular > at all. 
But having the option to set a special timestamp function will > give people the option (on a per-buffer level) to make the "TSC" be a > simple incrementing 32-bit counter using xaddl and the upper bits > incrementing from a timer, but keep that as a "ok, the TSC is really > broken, or this architecture doesn't support any fast cycle counters at > all, or I really don't care about time, just sequence, and I guarantee > I have a single lock in all callers that makes things unambiguous" > I currently use the scheme you propose here on architectures lacking TSC. I hook in the timer interrupt to increment the MSBs. I doubt anyone will ever go the way of locking every caller, but yeah, why not. Also note that getting only 32-bits TSCs on MIPS makes things a bit less simple, but I did an RCU-style adaptation layer which generates a full 64-bits TSC from a 32-bits time base. It's currently in the -lttng tree. > Note the "single lock" part. It's not enough that you make any trace thing > under a lock. They must be under the _same_ lock for all relevant events > for you to be able to say anything about ordering. And that's actually > pretty rare for any complex behavior. > > The timestamping, btw, is likely the most important part of the whole > logging thing. So we need to get it right. But by "right" I mean really > really low-latency so that it's acceptable to everybody, real-time enough > that you can tell how far apart events were, and precise enough that you > really _can_ see ordering. > > The "raw TSC value with correction information" should be able to give you > all of that. At least on x86. On some platforms, the TSC may not give you > enough resolution to get reasonable guesses on event ordering. > Mathieu > Linus > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
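Mathieu's cmpxchg-based trace clock from the message above can be sketched with C11 atomics in place of the kernel's cmpxchg; the 200-cycle nudge is the example cost he cites for a TSC read, and the function takes the locally read TSC as a parameter to keep the sketch portable:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Last timestamp handed out, shared across all CPUs. */
static _Atomic uint64_t last_tsc;

/* Assumed cost of reading the TSC, used to nudge a lagging CPU just
 * past the last published value (the "200 cycles" in the description
 * above). */
#define TSC_READ_COST 200

/* Return a timestamp that is monotonically increasing across CPUs:
 * if the locally read TSC is behind the last published value, step
 * slightly past it instead.  The compare-and-swap doubles as a full
 * barrier, standing in for the kernel's cmpxchg. */
static uint64_t trace_clock_read(uint64_t local_tsc)
{
    uint64_t last, ret;

    do {
        last = atomic_load(&last_tsc);
        ret = (local_tsc > last) ? local_tsc : last + TSC_READ_COST;
    } while (!atomic_compare_exchange_weak(&last_tsc, &last, ret));

    return ret;
}
```

This trades the cache-line bounce warned about above for a guaranteed cross-CPU ordering, which is why the message only recommends enabling it when debugging races.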
* Re: Unified tracing buffer 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier 2008-09-23 2:31 ` Mathieu Desnoyers @ 2008-09-23 3:07 ` Masami Hiramatsu 2 siblings, 0 replies; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 3:07 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Linus, Linus Torvalds wrote: > > On Mon, 22 Sep 2008, Masami Hiramatsu wrote: >> Sure, atomic counter might be more expensive but accurate for ordering. > > Don't be silly. > > An atomic counter is no more accurate for ordering than anything else. > > Why? > > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! > > Except for a single CPU, of course, but for that case you don't need a > sequence number either, since the ordering is entirely determined by the > ring buffer itself. > > So the counter will be more expensive (cross-cpu cache bouncing for EVERY > SINGLE EVENT), less useful (no real meaning for people who DO want to have > a timestamp), and it's really no more "ordered" than anything that bases > itself on a TSC. > > The fact is, you cannot order operations based on log messages unless you > have a lock around the whole caller - absolutely _no_ amount of locking or > atomic accesses in the log itself will guarantee ordering of the upper > layers. Indeed. If TSC(or similar time counter) can provide synchronized-time, I don't have any comment on that(AFAIK, latest x86 and ia64 can provide it). # I might be a bit nervous about Broken TSC... 
> And sure, if you have locking at a higher layer, then a sequence number is > sufficient, but on the other hand, so is a well-synchronized TSC. > > So personally, I think that the optimal solution is: > > - let each ring buffer be associated with a "gettimestamp()" function, so > that everybody _can_ set it to something of their own. But default to > something sane, namely a raw TSC thing. I agree, default to TSC is enough. > - Add synchronization events to the ring buffer often enough that you can > make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply > noticing when the upper 32 bits change, although you could possibly do > it with a heartbeat too. > > - Similarly, add a synchronization event when the TSC frequency changes. > > - Make the synchronization packet contain the full 64-bit TSC base, in > addition to TSC frequency info _and_ the timebase. > > - From those synchronization events, you should be able to get a very > accurate timestamp *after* the fact from the raw TSC numbers (ie do all > the scaling not when you gather the info, but when you present it), > even if you only spent 32 bits of TSC info on 99% of all events (an > just had a overflow log occasionally to get the rest of the info) > > - Most people will be _way_ happier with a timestamp that has enough > precision to also show ordering (assuming that the caller holds a > lock over the operation _including_ the tracing) than they would ever > be with a sequence number. > > - people who really want to can consider the incrementing counter a TSC, > but it will suck in so many ways that I bet it will not be very popular > at all. 
But having the option to set a special timestamp function will > give people the option (on a per-buffer level) to make the "TSC" be a > simple incrementing 32-bit counter using xaddl and the upper bits > incrementing from a timer, but keep that as a "ok, the TSC is really > broken, or this architecture doesn't support any fast cycle counters at > all, or I really don't care about time, just sequence, and I guarantee > I have a single lock in all callers that makes things unambiguous" Thank you very much for giving me a good idea! I agree with you. > Note the "single lock" part. It's not enough that you make any trace thing > under a lock. They must be under the _same_ lock for all relevant events > for you to be able to say anything about ordering. And that's actually > pretty rare for any complex behavior. > > The timestamping, btw, is likely the most important part of the whole > logging thing. So we need to get it right. But by "right" I mean really > really low-latency so that it's acceptable to everybody, real-time enough > that you can tell how far apart events were, and precise enough that you > really _can_ see ordering. > > The "raw TSC value with correction information" should be able to give you > all of that. At least on x86. On some platforms, the TSC may not give you > enough resolution to get reasonable guesses on event ordering. > > Linus -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-22 22:27 ` Masami Hiramatsu 2008-09-22 23:12 ` Darren Hart 2008-09-22 23:17 ` Martin Bligh @ 2008-09-23 14:38 ` KOSAKI Motohiro 2008-09-23 15:04 ` Frank Ch. Eigler 2008-09-23 15:23 ` Masami Hiramatsu 2 siblings, 2 replies; 47+ messages in thread From: KOSAKI Motohiro @ 2008-09-23 14:38 UTC (permalink / raw) To: Masami Hiramatsu Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml > By the way, systemtap uses two modes; > > - single-channel mode > In this mode, all CPUs share one buffer channel to write and read. > Each writer takes a spinlock and writes probe-local data to the buffer. > > - per-cpu buffer mode > In this mode, we use an atomic sequence number for ordering data. If > the user doesn't need it (because they have their own timestamps), they can > choose not to use that seq-number. I can't imagine the merit of the single-channel mode. Could you please explain it? Is it because some architectures don't have a fine-grained timestamp? If so, could you explain which architectures lack it? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:38 ` KOSAKI Motohiro @ 2008-09-23 15:04 ` Frank Ch. Eigler 2008-09-23 15:23 ` Masami Hiramatsu 1 sibling, 0 replies; 47+ messages in thread From: Frank Ch. Eigler @ 2008-09-23 15:04 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, systemtap-ml Hi - On Tue, Sep 23, 2008 at 11:36:26PM +0900, KOSAKI Motohiro wrote: > > By the way, systemtap uses two modes; > > > > - single-channel mode > > In this mode, all CPUs share one buffer channel to write and read. > > Each writer takes a spinlock and writes probe-local data to the buffer. > > - per-cpu buffer mode [...] > > I can't imagine the merit of the single-channel mode. > Could you please explain it? It could be a way of saving some memory and merging hassle for low-throughput data. (Remember that systemtap enables in-situ analysis of events, so that often only brief final results need be sent out.) If timestamp-wise cross-cpu merging can be done on demand by the hypothetical future buffer widget, then little reason remains not to use it. - FChE ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:38 ` KOSAKI Motohiro 2008-09-23 15:04 ` Frank Ch. Eigler @ 2008-09-23 15:23 ` Masami Hiramatsu 2008-09-23 18:04 ` KOSAKI Motohiro 1 sibling, 1 reply; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 15:23 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml Hi Motohiro, KOSAKI Motohiro wrote: >> By the way, systemtap uses two modes; >> >> - single-channel mode >> In this mode, all CPUs share one buffer channel to write and read. >> Each writer takes a spinlock and writes probe-local data to the buffer. >> >> - per-cpu buffer mode >> In this mode, we use an atomic sequence number for ordering data. If >> the user doesn't need it (because they have their own timestamps), they can >> choose not to use that seq-number. > > I can't imagine the merit of the single-channel mode. > Could you please explain it? Actually, single-channel mode is for infrequent event tracing. At least in the systemtap case, sometimes we just want to collect data and watch it periodically (like 'top'). Or just monitor errors as an additional printk. In these cases, overhead is not so important. I think the main reason for using single-channel mode is the simplicity of the userspace reader. We can use 'cat' or 'tail' to read the buffer on-line. I'm not sure how much overhead an ftrace-like buffer-merging routine has, but if the kernel provides an interface which gives us a single merged buffer image (like the ftrace buffer), we are glad to use it. :-) > Is it because some architectures don't have a fine-grained timestamp? > If so, could you explain which architectures lack it? I heard that get_cycles always returns 0 on some CPUs (ARM, MIPS, etc.)... (I think some of their platforms support variants of get_cycles) Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. 
Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 15:23 ` Masami Hiramatsu @ 2008-09-23 18:04 ` KOSAKI Motohiro 2008-09-23 18:29 ` Martin Bligh 0 siblings, 1 reply; 47+ messages in thread From: KOSAKI Motohiro @ 2008-09-23 18:04 UTC (permalink / raw) To: Masami Hiramatsu Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml > > I can't imagine the merit of the single-channel mode. > > Could you please explain it? > > Actually, single-channel mode is for infrequent event tracing. > At least in the systemtap case, sometimes we just want to collect data > and watch it periodically (like 'top'). Or just monitor > errors as an additional printk. In these cases, overhead is not so > important. > > I think the main reason for using single-channel mode is the simplicity of > the userspace reader. We can use 'cat' or 'tail' to read the buffer on-line. > I'm not sure how much overhead an ftrace-like buffer-merging routine has, > but if the kernel provides an interface which gives us a single merged buffer > image (like the ftrace buffer), we are glad to use it. :-) Yup, I also think it is better. Thanks. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 18:04 ` KOSAKI Motohiro @ 2008-09-23 18:29 ` Martin Bligh 0 siblings, 0 replies; 47+ messages in thread From: Martin Bligh @ 2008-09-23 18:29 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Masami Hiramatsu, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml >> I think the main reason for using single-channel mode is the simplicity of >> the userspace reader. We can use 'cat' or 'tail' to read the buffer on-line. >> I'm not sure how much overhead an ftrace-like buffer-merging routine has, >> but if the kernel provides an interface which gives us a single merged buffer >> image (like the ftrace buffer), we are glad to use it. :-) > > Yup, I also think it is better. That was the plan, yes. Merge sort is cheap ;-) ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2008-09-23 21:02 UTC | newest] Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <33307c790809191433w246c0283l55a57c196664ce77@mail.gmail.com> 2008-09-22 19:47 ` Unified tracing buffer Masami Hiramatsu 2008-09-22 20:14 ` Martin Bligh 2008-09-22 22:27 ` Masami Hiramatsu 2008-09-22 23:12 ` Darren Hart 2008-09-23 0:06 ` Masami Hiramatsu 2008-09-22 23:17 ` Martin Bligh 2008-09-23 0:07 ` Masami Hiramatsu 2008-09-23 0:13 ` Martin Bligh 2008-09-23 14:51 ` Masami Hiramatsu 2008-09-23 15:09 ` Mathieu Desnoyers 2008-09-23 15:32 ` Masami Hiramatsu 2008-09-23 16:02 ` Linus Torvalds 2008-09-23 17:07 ` Masami Hiramatsu 2008-09-23 17:33 ` Thomas Gleixner 2008-09-23 19:03 ` Masami Hiramatsu 2008-09-23 19:37 ` Thomas Gleixner 2008-09-23 19:39 ` Martin Bligh 2008-09-23 19:42 ` Thomas Gleixner 2008-09-23 19:51 ` Martin Bligh 2008-09-23 20:05 ` Thomas Gleixner 2008-09-23 21:02 ` Martin Bligh 2008-09-23 20:06 ` Masami Hiramatsu 2008-09-23 20:09 ` Thomas Gleixner 2008-09-23 15:48 ` Linus Torvalds 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt 2008-09-23 2:08 ` Mathieu Desnoyers 2008-09-23 2:27 ` Darren Hart 2008-09-23 2:32 ` Mathieu Desnoyers 2008-09-23 3:29 ` Linus Torvalds 2008-09-23 3:42 ` Mathieu Desnoyers 2008-09-23 4:06 ` Linus Torvalds 2008-09-23 3:44 ` Steven Rostedt 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:18 ` Martin Bligh 2008-09-23 15:25 ` Masami Hiramatsu 2008-09-23 10:54 ` Steven Rostedt 2008-09-23 4:20 ` Linus Torvalds 2008-09-23 14:13 ` Mathieu Desnoyers 2008-09-23 2:31 ` Mathieu Desnoyers 2008-09-23 3:07 ` Masami Hiramatsu 2008-09-23 14:38 ` KOSAKI Motohiro 2008-09-23 15:04 ` Frank Ch. Eigler 2008-09-23 15:23 ` Masami Hiramatsu 2008-09-23 18:04 ` KOSAKI Motohiro 2008-09-23 18:29 ` Martin Bligh