* Re: Unified tracing buffer
  [not found] <33307c790809191433w246c0283l55a57c196664ce77@mail.gmail.com>

From: Masami Hiramatsu @ 2008-09-22 19:47 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

Hi Martin,

Martin Bligh wrote:
> During Kernel Summit and the Plumbers conference, Linus and others
> expressed a desire for a unified tracing buffer system for multiple
> tracing applications (e.g. ftrace, lttng, systemtap, blktrace, etc.)
> to use. This provides several advantages, including the ability to
> interleave data from multiple sources, not having to learn 200
> different tools, removing duplicated code/effort, etc.
>
> Several of us got together last night and tried to cut this down to
> the simplest usable system we could agree on (and nobody got hurt!).
> This will form version 1. I've sketched out a few enhancements we
> know we want, but have agreed to leave these until version 2. The
> answer to most questions about the below is "yes we know, we'll fix
> that in version 2" (or 3). Simplicity was the rule ...
>
> Sketch of design. Enjoy flaming me. Code will follow shortly.
>
>
> STORAGE
> -------
>
> We will support multiple buffers for different tracing systems, with
> separate names and event id spaces. Event ids are 16 bit, dynamically
> allocated. A "one line of text" print function will be provided for
> each event, or use the default (probably a hex printf). We will
> provide a "flight data recorder" mode and a "spool to disk" mode.
>
> Circular buffer per cpu, protected by a per-cpu spinlock_irq.
> Word-aligned records.
> Variable record length; each record starts with a length field.
> Timestamps in a fixed timebase, monotonically increasing (across all
> CPUs).

I agree with integrating the tracing buffer mechanism, but I don't
think your proposal is the simplest one.

To simplify, I think a layered buffering mechanism is desirable:
- The lowest layer just provides named circular buffers (both per-cpu
  and a single system-wide buffer) and a read/write scheme.
- The next layer provides the user/kernel interface, including controls.
- The top layer defines packet (and event) formatting utilities.
- Additionally, it would be better to provide some library routines
  (timestamps, event-id synchronization, and so on).

Since this unified buffer will be used by different kinds of
tracers/loggers, I don't think all of them (and future tracers) want
to be tied down by an "event-id" + "parameter" format. So, sorry, I
disagree with the tracing buffer defining its *data format*; that's
just overkill for me.

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com

^ permalink raw reply [flat|nested] 47+ messages in thread
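The storage sketch quoted above (a circular buffer of word-aligned,
variable-length records, each starting with a length word) can be
modelled in a few lines of userspace C. This is only an illustrative
sketch: the names (`struct ring`, `ring_write`, `ring_read`) are made
up, and the per-cpu instantiation, spinlock_irq protection, and the
overwrite ("flight recorder") mode from the proposal are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4096u             /* must be a power of two */

struct ring {
    uint8_t buf[RING_SIZE];
    size_t  head;                   /* free-running write counter */
    size_t  tail;                   /* free-running read counter */
};

/* Round up to the machine word size, as the proposal requires. */
static size_t word_align(size_t n)
{
    return (n + sizeof(long) - 1) & ~(sizeof(long) - 1);
}

static void ring_copy_in(struct ring *r, size_t pos, const void *src, size_t len)
{
    const uint8_t *p = src;
    for (size_t i = 0; i < len; i++)
        r->buf[(pos + i) & (RING_SIZE - 1)] = p[i];
}

static void ring_copy_out(struct ring *r, size_t pos, void *dst, size_t len)
{
    uint8_t *p = dst;
    for (size_t i = 0; i < len; i++)
        p[i] = r->buf[(pos + i) & (RING_SIZE - 1)];
}

/* Write one record: a length word, then the payload. */
static int ring_write(struct ring *r, const void *data, uint32_t len)
{
    size_t need = word_align(sizeof(uint32_t) + len);

    if (need > RING_SIZE - (r->head - r->tail))
        return -1;  /* full; flight-recorder mode would overwrite instead */
    ring_copy_in(r, r->head, &len, sizeof(len));
    ring_copy_in(r, r->head + sizeof(len), data, len);
    r->head += need;
    return 0;
}

/* Read the oldest record into dst; returns its payload length, or -1. */
static int ring_read(struct ring *r, void *dst, uint32_t max)
{
    uint32_t len;

    if (r->head == r->tail)
        return -1;                  /* empty */
    ring_copy_out(r, r->tail, &len, sizeof(len));
    if (len > max)
        return -1;
    ring_copy_out(r, r->tail + sizeof(len), dst, len);
    r->tail += word_align(sizeof(uint32_t) + len);
    return (int)len;
}
```

In the real proposal there would be one such ring per cpu, writes
would take the per-cpu spinlock with interrupts disabled, and the
record header would also carry the timestamp and 16-bit event id.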
* Re: Unified tracing buffer

From: Martin Bligh @ 2008-09-22 20:14 UTC
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

> I agree with integrating the tracing buffer mechanism, but I don't
> think your proposal is the simplest one.
>
> To simplify, I think a layered buffering mechanism is desirable:
> - The lowest layer just provides named circular buffers (both per-cpu
>   and a single system-wide buffer) and a read/write scheme.
> - The next layer provides the user/kernel interface, including controls.
> - The top layer defines packet (and event) formatting utilities.
> - Additionally, it would be better to provide some library routines
>   (timestamps, event-id synchronization, and so on).
>
> Since this unified buffer will be used by different kinds of
> tracers/loggers, I don't think all of them (and future tracers) want
> to be tied down by an "event-id" + "parameter" format. So, sorry, I
> disagree with the tracing buffer defining its *data format*; that's
> just overkill for me.

I think you're right that we can layer this, and we didn't do a
particularly good job of splitting those things out. I'll try to pull
together another revision to reflect this and the other suggested
changes.

One thing that I think is still important is to have a unified
timestamp mechanism on everything, so that we can coordinate data from
different trace tools back together in userspace. I intend to keep
that at a lower level, but I think you're right that the event id,
etc., can move up into separate layers.
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-22 22:27 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

Hi Martin,

Martin Bligh wrote:
>> I agree with integrating the tracing buffer mechanism, but I don't
>> think your proposal is the simplest one.
>> [...]
>
> I think you're right that we can layer this, and we didn't do a
> particularly good job of splitting those things out. I'll try to pull
> together another revision to reflect this and the other suggested
> changes.

I'm happy to hear that. :-)

> One thing that I think is still important is to have a unified
> timestamp mechanism on everything, so that we can coordinate data
> from different trace tools back together in userspace. I intend to
> keep that at a lower level, but I think you're right that the event
> id, etc., can move up into separate layers.

I'm not so sure the unified 'timestamp' must be required by all
tracers. If you just need to merge and sort per-cpu data, you can use
an atomic sequence number for that. IMHO, the unified 'timestamp' had
better be optional, because some architectures can't support it. I
think providing a timestamp-callback function will help us here.

By the way, systemtap uses two modes:
- single-channel mode: all cpus share one buffer channel for writing
  and reading; each writer takes a spinlock and writes its probe-local
  data to the buffer.
- per-cpu buffer mode: we use an atomic sequence number for ordering
  data; if users don't need it (because they have their own
  timestamps), they can choose not to use that sequence number.

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
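The sequence-number idea can be sketched as follows (an illustrative
model, not systemtap's actual code): stamp every record with a value
taken from one global atomic counter at trace time, then merge the
per-cpu streams by that value afterwards. All names here are made up.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static atomic_uint_fast64_t event_seq;  /* one global counter */

struct record {
    uint64_t seq;                       /* global order */
    int      cpu;
    int      event_id;
};

/* Called at trace time on any cpu: one atomic increment per event. */
static void stamp(struct record *rec, int cpu, int event_id)
{
    rec->seq      = atomic_fetch_add(&event_seq, 1);
    rec->cpu      = cpu;
    rec->event_id = event_id;
}

/* Post-processing: merge two per-cpu streams back into global order. */
static size_t merge(const struct record *a, size_t na,
                    const struct record *b, size_t nb,
                    struct record *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb)
        out[k++] = (a[i].seq < b[j].seq) ? a[i++] : b[j++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

This gives exact ordering on any architecture with atomics, at the
cost of one cross-cpu cache-line bounce per event, which is the
expense Martin objects to below.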
* Re: Unified tracing buffer

From: Darren Hart @ 2008-09-22 23:12 UTC
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
    Frank Ch. Eigler, systemtap-ml

On Mon, Sep 22, 2008 at 3:25 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:
> I'm not so sure the unified 'timestamp' must be required by all
> tracers. If you just need to merge and sort per-cpu data, you can use
> an atomic sequence number for that. IMHO, the unified 'timestamp' had
> better be optional, because some architectures can't support it. I
> think providing a timestamp-callback function will help us here.

There have been several posts on the timestamp for events. From a
real-time perspective, this timestamp will be a very important
datapoint for each event, and the more accurate and higher-resolution
the better. Some thoughts:

o pretty-print resolution should definitely be nanoseconds (IMHO)
o internal storage should be "whatever is fastest", with the
  transformation to ns stored in the trace header (as I believe
  Mathieu mentioned)
o for archs where the clock isn't synchronized across CPUs, perhaps
  for now it would be adequate to record the per-cpu timestamps in the
  trace header and include the cpu id with each event as well

This is in keeping with the previous suggestion to collect the most
primitive data available without doing any transformation at trace
time.

-- 
Darren Hart
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 0:06 UTC
To: Darren Hart
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od,
    Frank Ch. Eigler, systemtap-ml

Hi Darren,

Darren Hart wrote:
> There have been several posts on the timestamp for events. From a
> real-time perspective, this timestamp will be a very important
> datapoint for each event, and the more accurate and higher-resolution
> the better. Some thoughts.

Sure, I know a precise timestamp is required by real-time-sensitive
tracers. However, there are other cases; for example, when debugging
we don't need timestamps at all, we just want to know the order of
events. :-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Martin Bligh @ 2008-09-22 23:17 UTC
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml

> I'm not so sure the unified 'timestamp' must be required by all
> tracers. If you just need to merge and sort per-cpu data, you can use
> an atomic sequence number for that. IMHO, the unified 'timestamp' had
> better be optional, because some architectures can't support it. I
> think providing a timestamp-callback function will help us here.

An atomic sequence number is:

(a) far less meaningful than a timestamp for the user, and
(b) more expensive to compute in many cases.

I think we came up with a way to approximate this, using a callback
every ms or so to supply the higher-order bits, and a sequential
counter in the lower-order bits, for those broken platforms.

But perhaps it would be better if we started with a discussion of
which platforms can't do global timestamps, and why not? I know some
of them are fixable, but perhaps not all.
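The approximation Martin describes can be modelled like this, under an
assumed layout of 44 bits of coarse milliseconds (sampled by a
periodic callback) over 20 bits of per-tick counter; the layout and
all names here are hypothetical, not an existing kernel API.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SEQ_BITS 20                 /* low-order ordering bits */

static atomic_uint_fast64_t coarse_ms;     /* updated by a periodic timer */
static atomic_uint_fast32_t seq_in_tick;   /* reset on every timer tick */

/* Called from the (hypothetical) millisecond tick callback.
 * Note: resetting the counter races with concurrent stamp() calls; a
 * real implementation would merge both fields into one atomic word. */
static void tick(uint64_t ms)
{
    atomic_store(&coarse_ms, ms);
    atomic_store(&seq_in_tick, 0);
}

/* Per-event stamp: coarse time in the high bits, sequence in the low,
 * so events between ticks still sort correctly even without a TSC. */
static uint64_t stamp(void)
{
    uint64_t ms  = atomic_load(&coarse_ms);
    uint32_t seq = atomic_fetch_add(&seq_in_tick, 1);

    return (ms << SEQ_BITS) | (seq & ((1u << SEQ_BITS) - 1));
}
```

The result sorts like a timestamp (time is accurate to one tick, order
within a tick is exact) while only the low-order counter needs a
cross-cpu atomic.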
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 0:07 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Hi Martin,

Martin Bligh wrote:
> An atomic sequence number is:
>
> (a) far less meaningful than a timestamp for the user, and
> (b) more expensive to compute in many cases.

Sure, an atomic counter might be more expensive, but it is accurate
for ordering (and it can be used on almost all architectures). The
cost depends on the architecture and the system configuration. So I
think it is preferable to let users choose their timestamp rather than
fix it, for example by calling a callback when writing a log entry, as
follows:

write_log(struct buffer *buffer, char *data, int len)
{
	/* reserve logging space for the stamp and the body */
	char *buf = reserve(buffer, len + buffer->timestamp.len);

	/* write a timestamp; the callback returns the next free byte */
	buf = buffer->timestamp.write(buf);

	/* write the body */
	memcpy(buf, data, len);
}

And the unified buffer provides default timestamp.write callbacks:

char *timestamp_write(char *buf);  /* write an arch-specific timestamp */
char *seqnum_write(char *buf);     /* write a sequence number */

What would you think about that?

> I think we came up with a way to approximate this, using a callback
> every ms or so to supply the higher-order bits, and a sequential
> counter in the lower-order bits, for those broken platforms.

Sure, that will work.

> But perhaps it would be better if we started with a discussion of
> which platforms can't do global timestamps, and why not? I know some
> of them are fixable, but perhaps not all.

For example, my laptop (this machine, a Core2 Duo) doesn't return a
correct TSC. :-(

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Martin Bligh @ 2008-09-23 0:13 UTC
To: Masami Hiramatsu
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

>> I think we came up with a way to approximate this, using a callback
>> every ms or so to supply the higher-order bits, and a sequential
>> counter in the lower-order bits, for those broken platforms.
>
> Sure, that will work.

OK, that'd fix 99% of it, even if we only get time within 1 ms or so
(but still good ordering).

>> But perhaps it would be better if we started with a discussion of
>> which platforms can't do global timestamps, and why not? I know some
>> of them are fixable, but perhaps not all.
>
> For example, my laptop (this machine, a Core2 Duo) doesn't return a
> correct TSC. :-(

Can you define "incorrect" for me (in this case)? We had similar
problems on some AMD platforms that we can fix by syncing the TSCs on
exit_idle, etc.
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 14:51 UTC
To: Martin Bligh
Cc: Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
    Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Hi Martin,

Martin Bligh wrote:
>> For example, my laptop (this machine, a Core2 Duo) doesn't return a
>> correct TSC. :-(
>
> Can you define "incorrect" for me (in this case)?

On my laptop, the TSC is disabled at boot time:

$ dmesg | grep TSC
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to: check_tsc_sync_source failed.

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3998.45
clflush size	: 64
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3994.43
clflush size	: 64
power management:

Actually, I had measured the TSC drift and reported it to the
systemtap bugzilla:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19

Curiously, I tested another Core2 Duo laptop whose CPU is the same
model and the same stepping, but on that laptop I couldn't see any TSC
drift. So I think this might be a product-level issue and a rare
case...

> We had similar problems on some AMD platforms that we can fix by
> syncing the TSCs on exit_idle, etc.

Hmm, very interesting. :-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Mathieu Desnoyers @ 2008-09-23 15:09 UTC
To: Masami Hiramatsu
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
[...]
> $ cat /proc/cpuinfo
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 15
> model name	: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
> stepping	: 6
> cpu MHz		: 1000.000
[...]

... @ 2.00GHz and "cpu MHz : 1000.000"; isn't that a bit odd? (same
for both cpus)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 15:32 UTC
To: Mathieu Desnoyers
Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Mathieu Desnoyers wrote:
> ... @ 2.00GHz and "cpu MHz : 1000.000"; isn't that a bit odd? (same
> for both cpus)

2.00GHz is the maximum (model) frequency, and 'cpu MHz' is the current
frequency (yep, I'm using cpufreq right now). Anyway, when I measured
the TSC drift, I killed the cpuspeed service and fixed the frequency
at 2000 MHz. ;-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
* Re: Unified tracing buffer

From: Linus Torvalds @ 2008-09-23 16:02 UTC
To: Masami Hiramatsu
Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
>
> 2.00GHz is the maximum (model) frequency, and 'cpu MHz' is the
> current frequency (yep, I'm using cpufreq right now). Anyway, when I
> measured the TSC drift, I killed the cpuspeed service and fixed the
> frequency at 2000 MHz. ;-)

Ahh. I have an idea.

Maybe that thing does thermal throttling?

Fixing the frequency at the highest setting is actually one of the
worst things you can do, because if the device is thermally limited,
it will still do the whole throttling thing, but now it won't do it by
changing the frequency any more; it will do it by essentially forcing
the external frequency down.

And that is going to be *very* inefficient. You really, really don't
want that. Your performance will actually be _worse_ than if the CPU
went to a lower frequency. And it might explain the unreliable TSC
too, because I suspect the constant TSC is really constant only wrt
the bus clock to the CPU.

The thermal throttling thing is a "protect the CPU from overheating"
last-ditch effort, and because it doesn't lower the voltage, it isn't
at all as efficient at saving power (and thus cooling the CPU) as a
real frequency change would be.

And fixing the frequency to the highest value in a tight laptop
enclosure is the best way to force that behavior (in contrast, in a
desktop system with sufficient cooling it's usually not a problem at
all to just say "run at the highest frequency").

And btw, that also explains why you saw such *big* changes in
frequency: the throttling happens with a 1/8 duty cycle thing, iirc.

It's supposed to be very rare with Core 2. Thermal throttling was
quite common with the P4, and was the main overheating protection
initially. These days you should only see it on really badly designed
devices that simply don't have enough thermal cooling; but if the
design calls for mostly running at low frequency because it's some
thin-and-light notebook with insufficient cooling (or some
thick-and-heavy thing that is just total crap), and you force it to
always run at full speed, I can imagine it kicking in to protect the
CPU.

It's obviously also going to be much easier to see if the ambient
temperature is high. If you want to get the best performance, take one
of those compressed-air spray cans and spray the laptop with the can
held upside down (the can will generally tell you _not_ to do that,
because then you'll get the liquid itself rather than gas, but that's
what you want for cooling).

So if you can test this, try it with
 (a) cpufreq at a fixed _low_ value (to not cause overheating)
 (b) with the spray-can cooling the thing and cpufreq at a fixed high
     value
and see if the TSC is constant then.

		Linus
* Re: Unified tracing buffer

From: Masami Hiramatsu @ 2008-09-23 17:07 UTC
To: Linus Torvalds
Cc: Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List,
    Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler,
    systemtap-ml

Linus Torvalds wrote:
[...]
> So if you can test this, try it with
>  (a) cpufreq at a fixed _low_ value (to not cause overheating)
>  (b) with the spray-can cooling the thing and cpufreq at a fixed high
>      value
> and see if the TSC is constant then.

Hi Linus,

Thank you for your advice. I tested it again accordingly. I did:
- service cpuspeed stop
- echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
  and checked that
  /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is 1000000
- echo 1 > /proc/acpi/thermal_zone/THM/polling_frequency
- cooling with a spray can :)
- cat /proc/acpi/thermal_zone/THM/temperature
  temperature:             39 C

and ran the test:
---
p0: c:1107576, ns:990280 ratio:111
p0: c:1805640, ns:1008787 ratio:178
p0: c:1998324, ns:1000127 ratio:199
p0: c:946380, ns:990280 ratio:95
p0: c:871728, ns:1000267 ratio:87
p0: c:1807380, ns:1007949 ratio:179
p0: c:1784808, ns:1000127 ratio:178
p0: c:1768488, ns:991676 ratio:178
p0: c:1802292, ns:1008299 ratio:178
p0: c:1787088, ns:1000406 ratio:178
p0: c:1999176, ns:1000896 ratio:199
p0: c:881364, ns:991956 ratio:88
p0: c:1802712, ns:1008019 ratio:178
p0: c:1787088, ns:998590 ratio:178
---
This still doesn't seem so stable. :-(

After the test I checked the temperature again:
# cat /proc/acpi/thermal_zone/THM/temperature
temperature:             39 C

Hmm, 39 C is not so high. I wouldn't be surprised if this is an
individual product bug. Anyway, Linux itself currently works well on
this laptop with hpet. :-)

Thank you,

-- 
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
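For what it's worth, the ratio column in the numbers above is
consistent with cycles per 100 ns of wall time, i.e. roughly tenths of
the effective TSC frequency in MHz (~199 at the nominal 2 GHz, ~87
when the TSC has slowed). This is inferred from the posted numbers,
not taken from the actual test source:

```c
#include <stdint.h>

/* Assumed derivation of the ratio column: TSC cycles per 100 ns.
 * Integer division, so the result truncates like the output above. */
static uint64_t tsc_ratio(uint64_t cycles, uint64_t ns)
{
    return cycles * 100 / ns;
}
```

Read this way, the output swings between roughly full speed (199) and
under half speed (87) over ~1 ms intervals, which is exactly the kind
of duty-cycle behavior Linus describes.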
* Re: Unified tracing buffer

From: Thomas Gleixner @ 2008-09-23 17:33 UTC
To: Masami Hiramatsu
Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh,
    Linux Kernel Mailing List, Steven Rostedt, darren,
    Frank Ch. Eigler, systemtap-ml

On Tue, 23 Sep 2008, Masami Hiramatsu wrote:
[...]
> This still doesn't seem so stable. :-(
>
> After the test I checked the temperature again:
> # cat /proc/acpi/thermal_zone/THM/temperature
> temperature:             39 C
>
> Hmm, 39 C is not so high. I wouldn't be surprised if this is an
> individual product bug. Anyway, Linux itself currently works well on
> this laptop with hpet. :-)

Do you have C-states enabled on that machine?

    ls /sys/devices/system/cpu/cpu0/cpuidle/

Does it have more than a state0 entry? If yes, please do:

    cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage

where X is the highest number in there.

    cat /proc/acpi/processor/CPU0/power

might be useful as well.

Thanks,

	tglx
* Re: Unified tracing buffer 2008-09-23 17:33 ` Thomas Gleixner @ 2008-09-23 19:03 ` Masami Hiramatsu 2008-09-23 19:37 ` Thomas Gleixner 0 siblings, 1 reply; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 19:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Thomas, Thomas Gleixner wrote: > > Do you have C-States enabled on that machine ? > > ls /sys/devices/system/cpu/cpu0/cpuidle/ > > has it more than a state0 entry ? Yes, there are state0 to state3 # ls /sys/devices/system/cpu/cpu0/cpuidle/ state0 state1 state2 state3 > If yes, please do: > > cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage > > where X is the highest number in there. # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage 171210 > > cat /proc/acpi/processor/CPU0/power > > might be useful as well. # cat /proc/acpi/processor/CPU0/power active state: C0 max_cstate: C8 bus master activity: 00000000 maximum allowed latency: 2000000000 usec states: C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] Could these help you? Thank you, > > Thanks, > > tglx -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:03 ` Masami Hiramatsu @ 2008-09-23 19:37 ` Thomas Gleixner 2008-09-23 19:39 ` Martin Bligh 2008-09-23 20:06 ` Masami Hiramatsu 0 siblings, 2 replies; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 19:37 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage > 171210 C3 stops the TSC. So depending on how many C3 entries you have on the different cores, your TSCs will drift apart. Some BIOSes do even a lousy job trying to fixup the TSCs on exit from C3, which makes things even worse. > C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] > C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] > C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] > > Could these help you? Yup, explains your TSC observation. Nothing we can do about. Broken by system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! Thanks, tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:37 ` Thomas Gleixner @ 2008-09-23 19:39 ` Martin Bligh 2008-09-23 19:42 ` Thomas Gleixner 2008-09-23 20:06 ` Masami Hiramatsu 1 sibling, 1 reply; 47+ messages in thread From: Martin Bligh @ 2008-09-23 19:39 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage >> 171210 > > C3 stops the TSC. So depending on how many C3 entries you have on the > different cores, your TSCs will drift apart. Some BIOSes do even a > lousy job trying to fixup the TSCs on exit from C3, which makes things > even worse. > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] >> >> Could these help you? > > Yup, explains your TSC observation. Nothing we can do about. Broken by > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! We have linux patches that sync the TSC on exit_idle. I'll see if I can get Michael to send them out. M. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:39 ` Martin Bligh @ 2008-09-23 19:42 ` Thomas Gleixner 2008-09-23 19:51 ` Martin Bligh 0 siblings, 1 reply; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 19:42 UTC (permalink / raw) To: Martin Bligh Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Martin Bligh wrote: > On Tue, Sep 23, 2008 at 12:36 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage > >> 171210 > > > > C3 stops the TSC. So depending on how many C3 entries you have on the > > different cores, your TSCs will drift apart. Some BIOSes do even a > > lousy job trying to fixup the TSCs on exit from C3, which makes things > > even worse. > > > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] > >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] > >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] > >> > >> Could these help you? > > > > Yup, explains your TSC observation. Nothing we can do about. Broken by > > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > > We have linux patches that sync the TSC on exit_idle. I'll see if I can get > Michael to send them out. Are you sure that they sync it precicely enough that there is no user space observable way of time going backwards between cores ? Thanks, tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:42 ` Thomas Gleixner @ 2008-09-23 19:51 ` Martin Bligh 2008-09-23 20:05 ` Thomas Gleixner 0 siblings, 1 reply; 47+ messages in thread From: Martin Bligh @ 2008-09-23 19:51 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml >> > Yup, explains your TSC observation. Nothing we can do about. Broken by >> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! >> >> We have linux patches that sync the TSC on exit_idle. I'll see if I can get >> Michael to send them out. > > Are you sure that they sync it precicely enough that there is no user > space observable way of time going backwards between cores ? I think the tolerance is about 500 cycles. If that's not sufficient, I guess we'll have to either live with some slight misordering (which people have pointed out is kind of inevitable anyway) on these broken machines? It was sufficient for what we were using it for, but maybe not for everyone. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:51 ` Martin Bligh @ 2008-09-23 20:05 ` Thomas Gleixner 2008-09-23 21:02 ` Martin Bligh 0 siblings, 1 reply; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 20:05 UTC (permalink / raw) To: Martin Bligh Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Martin Bligh wrote: > >> > Yup, explains your TSC observation. Nothing we can do about. Broken by > >> > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > >> > >> We have linux patches that sync the TSC on exit_idle. I'll see if I can get > >> Michael to send them out. > > > > Are you sure that they sync it precicely enough that there is no user > > space observable way of time going backwards between cores ? > > I think the tolerance is about 500 cycles. If that's not sufficient, I guess > we'll have to either live with some slight misordering (which people have > pointed out is kind of inevitable anyway) on these broken machines? > It was sufficient for what we were using it for, but maybe not for everyone. Well, I dont care about the trace reordering at all. I care about user space visible time going backwards issues observed via the gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can migrate in less than that :) I guess you try this only for machines where the TSC runs with constant frequency, right ? Looking forward to your patches. Thanks, tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 20:05 ` Thomas Gleixner @ 2008-09-23 21:02 ` Martin Bligh 0 siblings, 0 replies; 47+ messages in thread From: Martin Bligh @ 2008-09-23 21:02 UTC (permalink / raw) To: Thomas Gleixner Cc: Masami Hiramatsu, Linus Torvalds, Mathieu Desnoyers, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml >> I think the tolerance is about 500 cycles. If that's not sufficient, I guess >> we'll have to either live with some slight misordering (which people have >> pointed out is kind of inevitable anyway) on these broken machines? >> It was sufficient for what we were using it for, but maybe not for everyone. > > Well, I dont care about the trace reordering at all. I care about user > space visible time going backwards issues observed via the > gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can > migrate in less than that :) Right, that's what we were interested in. > I guess you try this only for machines where the TSC runs with > constant frequency, right ? We don't do DVFS on the 'broken' TSC machines, just halt. But yes, it's selective ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 19:37 ` Thomas Gleixner 2008-09-23 19:39 ` Martin Bligh @ 2008-09-23 20:06 ` Masami Hiramatsu 2008-09-23 20:09 ` Thomas Gleixner 1 sibling, 1 reply; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 20:06 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Thomas Gleixner wrote: > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage >> 171210 > > C3 stops the TSC. So depending on how many C3 entries you have on the > different cores, your TSCs will drift apart. Some BIOSes do even a > lousy job trying to fixup the TSCs on exit from C3, which makes things > even worse. > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] >> >> Could these help you? > > Yup, explains your TSC observation. Nothing we can do about. Broken by > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! Thank you for analyzing! :-) Hmm, then could I fix that by fixing my dsdt...? Thanks again, > > Thanks, > > tglx -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 20:06 ` Masami Hiramatsu @ 2008-09-23 20:09 ` Thomas Gleixner 0 siblings, 0 replies; 47+ messages in thread From: Thomas Gleixner @ 2008-09-23 20:09 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Mathieu Desnoyers, Martin Bligh, Linux Kernel Mailing List, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > Thomas Gleixner wrote: > > On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > >> # cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage > >> 171210 > > > > C3 stops the TSC. So depending on how many C3 entries you have on the > > different cores, your TSCs will drift apart. Some BIOSes do even a > > lousy job trying to fixup the TSCs on exit from C3, which makes things > > even worse. > > > >> C1: type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000] > >> C2: type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003] > >> C3: type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636] > >> > >> Could these help you? > > > > Yup, explains your TSC observation. Nothing we can do about. Broken by > > system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI ! > > Thank you for analyzing! :-) > Hmm, then could I fix that by fixing my dsdt...? You can limit c-states so you dont do down to the C3 state, but there is a trade off vs. power saving. Lets wait for Martins magic TSC patches first :) tglx ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:51 ` Masami Hiramatsu 2008-09-23 15:09 ` Mathieu Desnoyers @ 2008-09-23 15:48 ` Linus Torvalds 1 sibling, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 15:48 UTC (permalink / raw) To: Masami Hiramatsu Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > > $ dmesg | grep TSC > checking TSC synchronization [CPU#0 -> CPU#1]: > Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock. > Marking TSC unstable due to: check_tsc_sync_source failed. Hmm.. Very interesting. It smells of a non-stable TSC, but your Core2 Cpu shouldn't have that issue: > > $ cat /proc/cpuinfo > processor : 0 > vendor_id : GenuineIntel > cpu family : 6 > model : 15 > model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz > stepping : 6 > cpu MHz : 1000.000 > cache size : 4096 KB > flags : ... constant_tsc ... > > Actually, I had measured TSC drifting and reported to systemtap-bugzilla > http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19 I'd have assumed it was just some initial offset issue, but your bug-report seems to say that it really does change the TSC frequency when the CPU frequency changes. That should _not_ happen on a core2 CPU, afaik. I didn't even know it could be a setup issue, but it does really smell like your TSC frequency changes. Now, unstable TSC's are not uncommon per se, and most older Intel CPU's will do it, it's just that I thought it was fixed in Core2 (and later P4's for that matter). The rule *should* be that: - family = 15 (netburst), model 3 or higher has constant TSC - family = 6 (PPro), model 14 or higher (Core, Core 2) have constant TSCs. This is quite clearly documented: see Intel ia docs, vol 3B, 18.10 "Time-stamp counter". Very odd. I wonder what your laptop does to screw this up. 
I also suspect that since we already _noticed_ that the TSC isn't stable, we should also have then cleared the "constant TSC" bit. And we apparently didn't. Btw, your CPU looks quite odd in other respects too. Judging by your bugzilla entry, the TSC sometimes ticks at 2GHz (fine), sometimes at 1Ghz (also fine), and sometimes at 667/500MHz judging by the ratios you show for TSC/timer tick. And that last one is really odd, afaik most 2GHz Core 2 duo's will have a lower frequency of 1GHz. Is that some special low-power version, perhaps? Or maybe it isn't a speedstep-able CPU at all, and the system actually changes the *bus* frequency (and then the CPU frequency is some constant factor of that). If so, the system messes with the CPU in bad ways. And btw, I'm almost certain that what you see isn't actually any "CPU drift" in the sense that I strongly suspect that the TSC's for both cores will change frequency together. So because the TSC isn't stable, it's not a good time-source, but despite that it's not necessarily a bad way to compare events across cores. To actually have different CPU's TSC drift wrt each other, you almost have to have them in different clock domains. And that is *very* rare. It happens when the CPU's are on different boards, and sure if happens if the CPU's have non-constant TSCs with different frequencies, but neither of those should be very common at all. The latter is uncommon because it's almost unheard of of having multi-socket devices with old CPU's that also do frequency changes. Older multi-core CPU's tend to do frequency changes the whole chip at a time, and newer multi-core CPU's should all basically have a fixed TSC so even when they do frequency changes independently, the TSC should still be off the same clock on the die. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:07 ` Masami Hiramatsu 2008-09-23 0:13 ` Martin Bligh @ 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier ` (2 more replies) 1 sibling, 3 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 0:41 UTC (permalink / raw) To: Masami Hiramatsu Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Masami Hiramatsu wrote: > > Sure, atomic counter might be more expensive but accurate for ordering. Don't be silly. An atomic counter is no more accurate for ordering than anything else. Why? Because all it tells you is the ordering of the atomic increment, not of the caller. The atomic increment is not related to all the other ops that the code that you trace actually does in any shape or form, and so the ordering of the trace doesn't actually imply anything for the ordering of the operations you are tracing! Except for a single CPU, of course, but for that case you don't need a sequence number either, since the ordering is entirely determined by the ring buffer itself. So the counter will be more expensive (cross-cpu cache bouncing for EVERY SINGLE EVENT), less useful (no real meaning for people who DO want to have a timestamp), and it's really no more "ordered" than anything that bases itself on a TSC. The fact is, you cannot order operations based on log messages unless you have a lock around the whole caller - absolutely _no_ amount of locking or atomic accesses in the log itself will guarantee ordering of the upper layers. And sure, if you have locking at a higher layer, then a sequence number is sufficient, but on the other hand, so is a well-synchronized TSC. So personally, I think that the optimal solution is: - let each ring buffer be associated with a "gettimestamp()" function, so that everybody _can_ set it to something of their own. But default to something sane, namely a raw TSC thing. 
- Add synchronization events to the ring buffer often enough that you can make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply noticing when the upper 32 bits change, although you could possibly do it with a heartbeat too. - Similarly, add a synchronization event when the TSC frequency changes. - Make the synchronization packet contain the full 64-bit TSC base, in addition to TSC frequency info _and_ the timebase. - From those synchronization events, you should be able to get a very accurate timestamp *after* the fact from the raw TSC numbers (ie do all the scaling not when you gather the info, but when you present it), even if you only spent 32 bits of TSC info on 99% of all events (an just had a overflow log occasionally to get the rest of the info) - Most people will be _way_ happier with a timestamp that has enough precision to also show ordering (assuming that the caller holds a lock over the operation _including_ the tracing) than they would ever be with a sequence number. - people who really want to can consider the incrementing counter a TSC, but it will suck in so many ways that I bet it will not be very popular at all. But having the option to set a special timestamp function will give people the option (on a per-buffer level) to make the "TSC" be a simple incrementing 32-bit counter using xaddl and the upper bits incrementing from a timer, but keep that as a "ok, the TSC is really broken, or this architecture doesn't support any fast cycle counters at all, or I really don't care about time, just sequence, and I guarantee I have a single lock in all callers that makes things unambiguous" Note the "single lock" part. It's not enough that you make any trace thing under a lock. They must be under the _same_ lock for all relevant events for you to be able to say anything about ordering. And that's actually pretty rare for any complex behavior. The timestamping, btw, is likely the most important part of the whole logging thing. 
So we need to get it right. But by "right" I mean really really low-latency so that it's acceptable to everybody, real-time enough that you can tell how far apart events were, and precise enough that you really _can_ see ordering. The "raw TSC value with correction information" should be able to give you all of that. At least on x86. On some platforms, the TSC may not give you enough resolution to get reasonable guesses on event ordering. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:41 ` Linus Torvalds @ 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt ` (2 more replies) 2008-09-23 2:31 ` Mathieu Desnoyers 2008-09-23 3:07 ` Masami Hiramatsu 2 siblings, 3 replies; 47+ messages in thread From: Roland Dreier @ 2008-09-23 1:28 UTC (permalink / raw) To: Linus Torvalds Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! This reminds me of a naive question that occurred to me while we were discussing this at KS. Namely, what does "ordering" mean for events? An example I'm all too familiar with is the lack of ordering of MMIO on big SGI systems -- if you forget an mmiowb(), then two CPUs taking a spinlock and doing writel() inside the spinlock and then dropping the spinlock (which should be enough to "order" things) might see the writel() reach the final device "out of order" because the write has to travel through a routed system fabric. Just like Einstein said, it really seems to me that the order of things depends on your frame of reference. - R. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 1:28 ` Roland Dreier @ 2008-09-23 1:40 ` Steven Rostedt 2008-09-23 2:08 ` Mathieu Desnoyers 2008-09-23 3:29 ` Linus Torvalds 2 siblings, 0 replies; 47+ messages in thread From: Steven Rostedt @ 2008-09-23 1:40 UTC (permalink / raw) To: Roland Dreier Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Roland Dreier wrote: > > Because all it tells you is the ordering of the atomic increment, not of > > the caller. The atomic increment is not related to all the other ops that > > the code that you trace actually does in any shape or form, and so the > > ordering of the trace doesn't actually imply anything for the ordering of > > the operations you are tracing! > > This reminds me of a naive question that occurred to me while we were > discussing this at KS. Namely, what does "ordering" mean for events? > > An example I'm all too familiar with is the lack of ordering of MMIO on > big SGI systems -- if you forget an mmiowb(), then two CPUs taking a > spinlock and doing writel() inside the spinlock and then dropping the > spinlock (which should be enough to "order" things) might see the > writel() reach the final device "out of order" because the write has to > travel through a routed system fabric. > > Just like Einstein said, it really seems to me that the order of things > depends on your frame of reference. In my logdev tracer (see http://rostedt.homelinux.com/logdev) I used an atomic counter to keep "order". But what I would say to people what this order means, is that order is among multiple traces between multiple CPUS. That is if you have. CPU 1 CPU 2 trace_point_a trace_point_c trace_point_b trace_point_d If you see in the trace: trace_point_a trace_point_c You really do not know which happened first. 
Simply because trace_point_c could have been hit first, but for interrupts and nmis and what not, trace_point_a could have easily been recorded first. But to me, trace_points are more like memory barriers. If I see: trace_point_c trace_point_a trace_point_b trace_point_d I can assume that everything before trace_point_c happened before everything after trace_point_a, and that all before trace_point_b happened before trace_point_d. One can not assume that the trace points themselves are in order. But you can assume that the things outside the trace points are, like memory barriers. I have found lots of race conditions with my logdev, and it was due to this "memory barrier" likeness to be able to see the races. Unfortunately, if you are using an out of sync TSC, you lose even the memory barrier characteristic of the trace. -- Steve ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt @ 2008-09-23 2:08 ` Mathieu Desnoyers 2008-09-23 2:27 ` Darren Hart 2008-09-23 3:29 ` Linus Torvalds 2 siblings, 1 reply; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:08 UTC (permalink / raw) To: Roland Dreier Cc: Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Roland Dreier (rdreier@cisco.com) wrote: > > Because all it tells you is the ordering of the atomic increment, not of > > the caller. The atomic increment is not related to all the other ops that > > the code that you trace actually does in any shape or form, and so the > > ordering of the trace doesn't actually imply anything for the ordering of > > the operations you are tracing! > > This reminds me of a naive question that occurred to me while we were > discussing this at KS. Namely, what does "ordering" mean for events? > > An example I'm all too familiar with is the lack of ordering of MMIO on > big SGI systems -- if you forget an mmiowb(), then two CPUs taking a > spinlock and doing writel() inside the spinlock and then dropping the > spinlock (which should be enough to "order" things) might see the > writel() reach the final device "out of order" because the write has to > travel through a routed system fabric. > > Just like Einstein said, it really seems to me that the order of things > depends on your frame of reference. > > - R. > Exactly as Linus said, event ordering comes down to this : a choice between heavy locking around the real operation traced and the tracing statement itself (irq disable/spinlock) or the acknowledgement that the ordering is only insured across the actual tracing _instrumentation_. A worse case scenario would be to get an interrupt between the "real" operation (e.g. 
a memory or mmio write) and the tracing statement, be scheduled out, which would let a lot of stuff happen between the actual impact of the operation on kernel memory and the tracing statement itself. If we want to be _sure_ such a thing never happens, we would then have to pay the price of heavy locking, and that would not be pretty, especially when complex data structure modifications come into play. I don't really think anyone with a half-sane mind would want to slow down such critical kernel operations for the benefit of totally ordered tracing. However, in many cases where ordering matters, e.g. to instrument spinlocks themselves, if we put the instrumentation within the critical section rather than outside of it, then we benefit from the existing kernel locking (but only for events related to this specific spinlock). This is the same for many synchronization primitives, except for atomic operations, where we have to accept that the order will be imperfect. So only in the specific case of instrumentation of things like locking, where it is possible to ensure that instrumentation is synchronized with the instrumented operation, does it make a difference to choose the TSC (which implies a slight delta between the TSCs due to cache line delays at synchronization and delay due to TSC drift caused by temperature) over an atomic increment. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 2:08 ` Mathieu Desnoyers @ 2008-09-23 2:27 ` Darren Hart 2008-09-23 2:32 ` Mathieu Desnoyers 0 siblings, 1 reply; 47+ messages in thread From: Darren Hart @ 2008-09-23 2:27 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, Frank Ch. Eigler, systemtap-ml On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org> > So only in the specific case of instrumentation of things like locking, > where it is possible to insure that instrumentation is synchronized with > the instrumented operation, does it make a difference to choose the TSC > (which implies a slight delta between the TSCs due to cache line delays > at synchronization and delay due to TSCs drifts caused by temperature) > over an atomic increment. > Hrm, I think that overlooks the other reason to use a time-based counter over an atomic increment: you might care about time. Perhaps one might be less concerned with the actual order of tightly grouped events and more concerned with the actual time delta between more temporally distant events. In that case, using a clocksource would still be valuable. Admittedly the caller could embed that in their payload, but since we seem to agree we need some kind of counter, the time-based counter appears to be the most flexible. Thanks, -- Darren Hart ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 2:27 ` Darren Hart @ 2008-09-23 2:32 ` Mathieu Desnoyers 0 siblings, 0 replies; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:32 UTC (permalink / raw) To: Darren Hart Cc: Roland Dreier, Linus Torvalds, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, Frank Ch. Eigler, systemtap-ml * Darren Hart (darren@dvhart.com) wrote: > On Mon, Sep 22, 2008 at 7:02 PM, Mathieu Desnoyers <compudj@krystal.dyndns.org> > > So only in the specific case of instrumentation of things like locking, > > where it is possible to insure that instrumentation is synchronized with > > the instrumented operation, does it make a difference to choose the TSC > > (which implies a slight delta between the TSCs due to cache line delays > > at synchronization and delay due to TSCs drifts caused by temperature) > > over an atomic increment. > > > > Hrm, i think that overlooks the other reason to use a time based counter over > an atomic increment: you might care about time. Perhaps one might be less > concerned with actual order tightly grouped events and more concerned with the > actual time delta between more temporally distant events. In that case, using > a clocksource would still be valuable. Although admitedtly the caller could > embed that in their payload, but since we seem to agree we need some kind of > counter, the time-based counter appears to be the most flexible. > > Thanks, > See my answer to Linus for a proposal on how to do both :) Mathieu > -- > Darren Hart > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt 2008-09-23 2:08 ` Mathieu Desnoyers @ 2008-09-23 3:29 ` Linus Torvalds 2008-09-23 3:42 ` Mathieu Desnoyers 2008-09-23 3:44 ` Steven Rostedt 2 siblings, 2 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 3:29 UTC (permalink / raw) To: Roland Dreier Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Roland Dreier wrote: > > Just like Einstein said, it really seems to me that the order of things > depends on your frame of reference. Heh. Yes. In general, there is no single ordering unless you actually use a serializing lock on all CPU's involved. And exactly as in the theory of relativity, two people on different CPU's can actually validly _disagree_ about the ordering of the same event. There are things that act as "light-cones" and are borders for what everybody can agree on, but basically, in the absence of explicit locks, it is very possible that no such thing as "ordering" may even exist. Now, an atomic increment on a single counter obviously does imply *one* certain ordering, but it really only defines the ordering of that counter itself. It does not at all necessarily imply any ordering on the events that go on around the counter in other unrelated cachelines. Which is exactly why even a global counter in no way orders "events" in general, unless those events have something else that does so. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 3:29 ` Linus Torvalds @ 2008-09-23 3:42 ` Mathieu Desnoyers 2008-09-23 4:06 ` Linus Torvalds 2008-09-23 3:44 ` Steven Rostedt 1 sibling, 1 reply; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 3:42 UTC (permalink / raw) To: Linus Torvalds Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Roland Dreier wrote: > > > > Just like Einstein said, it really seems to me that the order of things > > depends on your frame of reference. > > Heh. Yes. In general, there is no single ordering unless you actually use > a serializing lock on all CPU's involved. > > And exactly as in the theory of relativity, two people on different CPU's > can actually validly _disagree_ about the ordering of the same event. > There are things that act as "light-cones" and are borders for what > everybody can agree on, but basically, in the absence of explicit locks, > it is very possible that no such thing as "ordering" may even exist. > > Now, an atomic increment on a single counter obviously does imply *one* > certain ordering, but it really only defines the ordering of that counter > itself. It does not at all necessarily imply any ordering on the events > that go on around the counter in other unrelated cachelines. > > Which is exactly why even a global counter in no way orders "events" in > general, unless those events have something else that does so. > > Linus > Unless I am missing something, in the case where we use an atomic operation which implies memory barriers (cmpxchg and atomic_add_return do), one can be sure that all memory operations done before the barrier are completed at the barrier and that all memory ops following the barrier will happen after. Did you have something else in mind? 
Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
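Mathieu's point above (cmpxchg and atomic_add_return imply full barriers, so a global sequence counter is itself strongly ordered) can be sketched in userspace with C11 atomics standing in for the kernel primitives; this is a minimal sketch, and the names and event layout are illustrative, not a proposed API:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Global event sequence counter shared by all writers.  The seq_cst
 * fetch-add models the kernel's atomic_add_return(), which implies
 * full memory barriers before and after the operation. */
static atomic_uint_fast64_t trace_seq;

struct trace_event {
    uint64_t seq;       /* globally ordered sequence number */
    uint32_t event_id;
};

/* Reserve the next sequence number.  Note Linus's caveat: the barrier
 * orders the counter itself, not the unrelated memory accesses around
 * the tracepoint. */
static uint64_t trace_next_seq(void)
{
    return atomic_fetch_add_explicit(&trace_seq, 1,
                                     memory_order_seq_cst);
}

static void trace_fill(struct trace_event *ev, uint32_t id)
{
    ev->seq = trace_next_seq();
    ev->event_id = id;
}
```

Two events filled in sequence always carry consecutive sequence numbers; across CPUs the counter still bounces its cache line on every single event, which is the cost objected to elsewhere in the thread.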
* Re: Unified tracing buffer 2008-09-23 3:42 ` Mathieu Desnoyers @ 2008-09-23 4:06 ` Linus Torvalds 0 siblings, 0 replies; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 4:06 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Mathieu Desnoyers wrote: > > Unless I am missing something, in the case we use an atomic operation > which implies memory barriers (cmpxchg and atomic_add_return does), one > can be sure that all memory operations done before the barrier are > completed at the barrier and that all memory ops following the barrier > will happen after. Sure (if you have a barrier - not all architectures will imply that for an increment). But that still doesn't mean a thing. You have two events (a) and (b), and you put trace-points on each. In your trace, you see (a) before (b) by comparing the numbers. But what does that mean? The actual event that you traced is not the trace-point - the trace-point is more like a fancy "printk". And the fact that one showed up before another in the trace buffer, doesn't mean that the events _around_ the trace happened in the same order. You can use the barriers to make a partial ordering, and if you have a separate tracepoint for entry into a region and exit, you can perhaps show that they were totally disjoint. Or maybe they were partially overlapping, and you'll never know exactly how they overlapped. Example: trace(..); do_X(); being executed on two different CPU's. In the trace, CPU#1 was before CPU#2. Does that mean that "do_X()" happened first on CPU#1? No. The only way to show that would be to put a lock around the whole trace _and_ operation X, ie spin_lock(..); trace(..); do_X(); spin_unlock(..); and now, if CPU#1 shows up in the trace first, then you know that do_X() really did happen first on CPU#1. 
Otherwise you basically know *nothing*, and the ordering of the trace events was totally and utterly meaningless. See? Trace events themselves may be ordered, but the point of the trace event is never to know the ordering of the trace itself - it's to know the ordering of the code we're interested in tracing. The ordering of the trace events themselves is irrelevant and not useful. And I'd rather see people _understand_ that, than if they think the ordering is somehow something they can trust. Btw, if you _do_ have locking, then you can also know that the "do_X()" operations will be essentially as far apart in some theoretical notion of "time" (let's imagine that we do have global time, even if we don't) as the cost of the trace operation and do_X() itself. So if we _do_ have locking (and thus a valid ordering that actually can matter), then the TSC doesn't even have to be synchronized on a cycle basis across CPU's - it just needs to be close enough that you can tell which one happened first (and with ordering, that's a valid thing to do). So you don't even need "perfect" synchronization, you just need something reasonably close, and you'll be able to see ordering from TSC counts without having that horrible bouncing cross-CPU thing that will impact performance a lot. Quite frankly, I suspect that anybody who wants to have a global counter might as well almost just have a global ring-buffer. The trace events aren't going to be CPU-local anyway if you need to always update a shared cacheline - and you might as well make the shared cacheline be the ring buffer head with a spinlock in it. That may not be _quite_ true, but it's probably close enough. Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
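Linus's spin_lock example, the one case where trace order provably matches the order of do_X(), can be sketched in userspace with a minimal test-and-set lock standing in for the kernel spinlock; everything here is an illustrative stand-in, not kernel code:

```c
#include <stdatomic.h>

/* Minimal test-and-set spinlock standing in for the kernel's
 * spin_lock(); acquire/release ordering is what makes the trace
 * order meaningful. */
static atomic_flag trace_lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&trace_lock,
                                             memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&trace_lock, memory_order_release);
}

static int trace_log[16], trace_len;   /* what the trace records   */
static int do_x_order[16], do_x_len;   /* what do_X() actually did */

/* Because the lock covers both the tracepoint and the traced
 * operation, the order of entries in trace_log provably matches the
 * order in which do_X() ran. */
static void traced_do_x(int cpu)
{
    spin_lock();
    trace_log[trace_len++] = cpu;    /* trace(...) */
    do_x_order[do_x_len++] = cpu;    /* do_X()     */
    spin_unlock();
}
```

Without the lock, only the trace entries themselves are ordered; with it, whichever CPU appears first in trace_log is guaranteed to have run do_X() first.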
* Re: Unified tracing buffer 2008-09-23 3:29 ` Linus Torvalds 2008-09-23 3:42 ` Mathieu Desnoyers @ 2008-09-23 3:44 ` Steven Rostedt 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:20 ` Linus Torvalds 1 sibling, 2 replies; 47+ messages in thread From: Steven Rostedt @ 2008-09-23 3:44 UTC (permalink / raw) To: Linus Torvalds Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Linus Torvalds wrote: > > > On Mon, 22 Sep 2008, Roland Dreier wrote: > > > > Just like Einstein said, it really seems to me that the order of things > > depends on your frame of reference. > > Heh. Yes. In general, there is no single ordering unless you actually use > a serializing lock on all CPU's involved. > > And exactly as in the theory of relativity, two people on different CPU's > can actually validly _disagree_ about the ordering of the same event. > There are things that act as "light-cones" and are borders for what > everybody can agree on, but basically, in the absence of explicit locks, > it is very possible that no such thing as "ordering" may even exist. > > Now, an atomic increment on a single counter obviously does imply *one* > certain ordering, but it really only defines the ordering of that counter > itself. It does not at all necessarily imply any ordering on the events > that go on around the counter in other unrelated cachelines. > > Which is exactly why even a global counter in no way orders "events" in > general, unless those events have something else that does so. Hmm, I've been pretty spoiled by x86 mostly ordering things correctly, and the non-x86 boxes I've used have mostly been UP. But, with that, with a global atomic counter, and the following trace: cpu 0: trace_point_a cpu 1: trace_point_c cpu 0: trace_point_b cpu 1: trace_point_d Could the event a really come after event d, even though we already hit event b? 
But I guess you are stating the fact that what the computer does internally, no one really knows. Without the help of real memory barriers, ordering of memory accesses is mostly determined by tarot cards. But basically, the perceived result of assembly commands is supposed to be accurate at the user level. The traces that I've used not only show the order (or perceived order) of events, but also the output of the corrupted data when the race happens. Usually with the two together, you can pretty much guarantee that the traced events actually did occur in the order presented. But without some perceived accurate ordering, even when you do see the corrupted data, the events can easily be misleading, even on an arch that usually orders the events as seen by the user. -- Steve ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 3:44 ` Steven Rostedt @ 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:18 ` Martin Bligh 2008-09-23 10:54 ` Steven Rostedt 1 sibling, 2 replies; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 4:12 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Roland Dreier, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml Steven Rostedt wrote: > But, with that, with a global atomic counter, and the following trace: > > cpu 0: trace_point_a > cpu 1: trace_point_c > cpu 0: trace_point_b > cpu 1: trace_point_d > > > Could the event a really come after event d, even though we already hit > event b? yes, if event c is an interrupt event :-). cpu 0 cpu 1 hit event d hit event a log event a irq event c log event c hit event b log event b log event d so, I think if we really need to order events, we have to stop irq right after hitting an event. Anyway, in most cases I think it works, but it is only about as accurate as a synchronized TSC, if the hardware supports one. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 4:12 ` Masami Hiramatsu @ 2008-09-23 4:18 ` Martin Bligh 2008-09-23 15:25 ` Masami Hiramatsu 2008-09-23 10:54 ` Steven Rostedt 1 sibling, 1 reply; 47+ messages in thread From: Martin Bligh @ 2008-09-23 4:18 UTC (permalink / raw) To: Masami Hiramatsu Cc: Steven Rostedt, Linus Torvalds, Roland Dreier, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml > yes, if event c is an interrupt event :-). > > cpu 0 cpu 1 > hit event d > hit event a > log event a > irq event c > log event c > hit event b > log event b > log event d > > so, I think if we really need to order events, we have to stop > irq right after hitting an event. How could you fix that in any practical way? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 4:18 ` Martin Bligh @ 2008-09-23 15:25 ` Masami Hiramatsu 0 siblings, 0 replies; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 15:25 UTC (permalink / raw) To: Martin Bligh Cc: Steven Rostedt, Linus Torvalds, Roland Dreier, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml Hi Martin, Martin Bligh wrote: >> log event c >> hit event b >> log event b >> log event d >> >> so, I think if we really need to order events, we have to stop >> irq right after hitting an event. > > How could you fix that in any practical way? In that case, I think it's easy to know that event c is an irq-related event by checking the source code. :-) And practically, in most cases, I think we can presume what happened from event arguments and subsequent events. Anyway, since there must be a delay between hitting an event (and, as Linus said, the event actually happening) and logging it, I think the buffering mechanism itself should focus on ensuring only the writing and reading order. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:18 ` Martin Bligh @ 2008-09-23 10:54 ` Steven Rostedt 1 sibling, 0 replies; 47+ messages in thread From: Steven Rostedt @ 2008-09-23 10:54 UTC (permalink / raw) To: Masami Hiramatsu Cc: Linus Torvalds, Roland Dreier, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Tue, 23 Sep 2008, Masami Hiramatsu wrote: > Steven Rostedt wrote: > > But, with that, with a global atomic counter, and the following trace: > > > > cpu 0: trace_point_a > > cpu 1: trace_point_c > > cpu 0: trace_point_b > > cpu 1: trace_point_d > > > > > > Could the event a really come after event d, even though we already hit > > event b? > > yes, if event c is an interrupt event :-). > > cpu 0 cpu 1 > hit event d > hit event a > log event a > irq event c > log event c heh, This is assuming that event c is in an IRQ handler. Since I control where event c is, I can prevent that. I'm talking about the CPU doing something funny that would have c come after d. But I didn't specify exactly what the events were, so I'll accept that explanation ;-) -- Steve > hit event b > log event b > log event d > > so, I think if we really need to order events, we have to stop > irq right after hitting an event. > > Anyway, in most case, I think it works, but as accurate as > synchronized-TSC if hardware supports it. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 3:44 ` Steven Rostedt 2008-09-23 4:12 ` Masami Hiramatsu @ 2008-09-23 4:20 ` Linus Torvalds 2008-09-23 14:13 ` Mathieu Desnoyers 1 sibling, 1 reply; 47+ messages in thread From: Linus Torvalds @ 2008-09-23 4:20 UTC (permalink / raw) To: Steven Rostedt Cc: Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, darren, Frank Ch. Eigler, systemtap-ml On Mon, 22 Sep 2008, Steven Rostedt wrote: > > But, with that, with a global atomic counter, and the following trace: > > cpu 0: trace_point_a > cpu 1: trace_point_c > cpu 0: trace_point_b > cpu 1: trace_point_d > > Could the event a really come after event d, even though we already hit > event b? Each tracepoint will basically give a partial ordering (if you make it so, of course - and on x86 it's hard to avoid it). And with many trace-points, you can narrow down ordering if you're lucky. But say that you have code like CPU#1 CPU#2 trace_a trace_c .. .. trace_b trace_d and since each CPU itself is obviously strictly ordered, you a priori know that a < b, and c < d. But your trace buffer can look many different ways: - a -> b -> c -> d c -> d -> a -> b Now you do know that what happened between c and d must all have happened entirely after/before the things that happened between a and b, and there is no overlap. This is only assuming the x86 full memory barrier from a "lock xadd" of course, but those are the semantics you'd get on x86. On others, the ordering might not be that strong. - a -> c -> b -> d a -> c -> d -> b With these trace point orderings, you really don't know anything at all about the order of any access that happened in between. CPU#1 might have gone first. Or not. Or partially. You simply do not know. > But I guess you are stating the fact that what the computer does > internally, no one really knows. Without the help of real memory barriers, > ording of memory accesses is mostly determined by tarot cards. 
Well, x86 defines a memory order. But what I'm trying to explain is that memory order still doesn't actually specify what happens to the code that actually does tracing! The trace is only going to show the order of the tracepoints, not the _other_ memory accesses. So you'll have *some* information, but it's very partial. And the thing is, all those other memory accesses are the ones that do all the real work. You'll know they happened _somewhere_ between two tracepoints, but not much more than that. This is why timestamps aren't really any worse than sequence numbers in all practical matters. They'll get you close enough that you can consider them equivalent to a cache-coherent counter, just one that you don't have to take a cache miss for, and that increments on its own! Quite a lot of CPU's have nice, dependable, TSC's that run at constant frequency. And quite a lot of traces care a _lot_ about real time. When you do IO tracing, the problem is almost never about lock ordering or anything like that. You want to see how long a request took. You don't care AT ALL how many tracepoints were in between the beginning and end, you care about how many microseconds there were! Linus ^ permalink raw reply [flat|nested] 47+ messages in thread
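The "scale at presentation time" idea above, logging raw unscaled cycle counts and converting to wall-clock units only when the trace is displayed, can be sketched as follows; the 3 GHz frequency is an assumed example value, not something probed from hardware:

```c
#include <stdint.h>

/* Raw, unscaled cycle counts are logged at trace time; conversion to
 * wall-clock units happens only when the trace is presented.  The
 * frequency below is an assumed example value for a constant-rate
 * TSC, not something read from the hardware. */
#define TSC_HZ 3000000000ULL  /* assumed 3 GHz */

static uint64_t cycles_to_us(uint64_t cycles)
{
    return cycles * 1000000ULL / TSC_HZ;
}

/* Elapsed microseconds between two raw timestamps, e.g. the start and
 * end of an I/O request, which is what the trace consumer actually
 * cares about here. */
static uint64_t delta_us(uint64_t start, uint64_t end)
{
    return cycles_to_us(end - start);
}
```

A real implementation would take the frequency from the synchronization events discussed later in the thread rather than a compile-time constant, so that frequency changes between sync points can be accounted for.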
* Re: Unified tracing buffer 2008-09-23 4:20 ` Linus Torvalds @ 2008-09-23 14:13 ` Mathieu Desnoyers 0 siblings, 0 replies; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 14:13 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, Roland Dreier, Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Steven Rostedt wrote: > > > > But, with that, with a global atomic counter, and the following trace: > > > > cpu 0: trace_point_a > > cpu 1: trace_point_c > > cpu 0: trace_point_b > > cpu 1: trace_point_d > > > > Could the event a really come after event d, even though we already hit > > event b? > > Each tracepoint will basically give a partial ordering (if you make it so, > of course - and on x86 it's hard to avoid it). > > And with many trace-points, you can narrow down ordering if you're lucky. > > But say that you have code like > > CPU#1 CPU#2 > > trace_a trace_c > .. .. > trace_b trace_d > > and since each CPU itself is obviously strictly ordered, you a priori know > that a < b, and c < d. But your trace buffer can look many different ways: > > - a -> b -> c -> d > c -> d -> a -> b > > Now you do know that what happened between c and d must all have > happened entirely after/before the things that happened between > a and b, and there is no overlap. > > This is only assuming the x86 full memory barrier from a "lock xadd" of > course, but those are the semantics you'd get on x86. On others, the > ordering might not be that strong. > Hrm, Documentation/atomic_ops.txt states that : "Unlike the above routines, it is required that explicit memory barriers are performed before and after the operation. It must be done such that all memory operations before and after the atomic operation calls are strongly ordered with respect to the atomic operation itself." 
So on architectures with weaker ordering, the kernel atomic ops already require that explicit smp_mb() be inserted before and after the atomic increment. The same applies to cmpxchg. Therefore I think it's ok, given the semantic provided by these two atomic operations, to assume they imply a smp_mb() for any given architecture. If not, then the architecture-specific implementation would be broken wrt the semantic. > - a -> c -> b -> d > a -> c -> d -> b > > With these trace point orderings, you really don't know anything at all > about the order of any access that happened in between. CPU#1 might > have gone first. Or not. Or partially. You simply do not know. > Yep. If two "real kernel" events happen to belong to the same overlapping time window, there is not much we can know about their order. Adding tracing statements before and after traced kernel operations could help to make this window as small as possible, but I doubt it's worth the performance penalty and event duplication (and increased trace size). Mathieu > > But I guess you are stating the fact that what the computer does > > internally, no one really knows. Without the help of real memory barriers, > > ording of memory accesses is mostly determined by tarot cards. > > Well, x86 defines a memory order. But what I'm trying to explain is that > memory order still doesn't actually specify what happens to the code that > actually does tracing! The trace is only going to show the order of the > tracepoints, not the _other_ memory accesses. So you'll have *some* > information, but it's very partial. > > And the thing is, all those other memory accesses are the ones that do all > the real work. You'll know they happened _somewhere_ between two > tracepoints, but not much more than that. > > This is why timestamps aren't really any worse than sequence numbers in > all practical matters. 
They'll get you close enough that you can consider > them equivalent to a cache-coherent counter, just one that you don't have > to take a cache miss for, and that increments on its own! > > Quite a lot of CPU's have nice, dependable, TSC's that run at constant > frequency. > > And quite a lot of traces care a _lot_ about real time. When you do IO > tracing, the problem is almost never about lock ordering or anything like > that. You want to see how long a request took. You don't care AT ALL how > many tracepoints were in between the beginning and end, you care about how > many microseconds there were! > > Linus > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier @ 2008-09-23 2:31 ` Mathieu Desnoyers 2008-09-23 3:07 ` Masami Hiramatsu 2 siblings, 0 replies; 47+ messages in thread From: Mathieu Desnoyers @ 2008-09-23 2:31 UTC (permalink / raw) To: Linus Torvalds Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > > On Mon, 22 Sep 2008, Masami Hiramatsu wrote: > > > > Sure, atomic counter might be more expensive but accurate for ordering. > > Don't be silly. > > An atomic counter is no more accurate for ordering than anything else. > > Why? > > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! > > Except for a single CPU, of course, but for that case you don't need a > sequence number either, since the ordering is entirely determined by the > ring buffer itself. > > So the counter will be more expensive (cross-cpu cache bouncing for EVERY > SINGLE EVENT), less useful (no real meaning for people who DO want to have > a timestamp), and it's really no more "ordered" than anything that bases > itself on a TSC. > > The fact is, you cannot order operations based on log messages unless you > have a lock around the whole caller - absolutely _no_ amount of locking or > atomic accesses in the log itself will guarantee ordering of the upper > layers. > > And sure, if you have locking at a higher layer, then a sequence number is > sufficient, but on the other hand, so is a well-synchronized TSC. 
> > So personally, I think that the optimal solution is: > > - let each ring buffer be associated with a "gettimestamp()" function, so > that everybody _can_ set it to something of their own. But default to > something sane, namely a raw TSC thing. > > - Add synchronization events to the ring buffer often enough that you can > make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply > noticing when the upper 32 bits change, although you could possibly do > it with a heartbeat too. > Hi Linus, I totally agree with all of the above, > - Similarly, add a synchronization event when the TSC frequency changes. > > - Make the synchronization packet contain the full 64-bit TSC base, in > addition to TSC frequency info _and_ the timebase. > If possible, I would recommend staying as far away as possible from using frequency change events to support broken frequency scaling. First, some architectures, like AMD, do not trigger an event for every frequency change on the CPU (if my memory is right, I think the southbridge temperature throttling does not generate any event), and we also have to consider the delay between the frequency change and the moment the instrumentation would be called. Each frequency change event would therefore increase the overall TSC imprecision, and I doubt we really want to rely on this for our time base. I welcome people to prove me wrong or paranoid by providing results proving the correct synchronicity of such an approach over time. :-) I explain below, however, how LTTng deals with architectures I consider to fall into the "broken" category. 
> - From those synchronization events, you should be able to get a very > accurate timestamp *after* the fact from the raw TSC numbers (ie do all > the scaling not when you gather the info, but when you present it), > even if you only spent 32 bits of TSC info on 99% of all events (an > just had a overflow log occasionally to get the rest of the info) > > - Most people will be _way_ happier with a timestamp that has enough > precision to also show ordering (assuming that the caller holds a > lock over the operation _including_ the tracing) than they would ever > be with a sequence number. > One gettimestamp() we could think of, to satisfy both people expecting tracing to perform a memory barrier and people expecting a timestamp, would be to write timestamps taken from synchronized TSCs into a variable with a cmpxchg, which would succeed only if the TSC value is higher than the value present in this variable. That would give both the memory barrier behavior and the timestamping. Given the cache-line bouncing implied, I would only recommend activating this when really needed, e.g. when debugging race issues. I currently use this scheme in LTTng to deal with broken x86 architectures with non-synchronized TSCs. Basically, all CPUs lagging behind others take the TSC value read from memory and add 200 cycles (or however many cycles it takes to actually read the TSC) to the last cycle counter value written to memory by the previous time base read. I also plan to execute a heartbeat on every CPU someday to give an upper bound on the imprecision of such TSCs so it could be used for some sort of performance measurement. > - people who really want to can consider the incrementing counter a TSC, > but it will suck in so many ways that I bet it will not be very popular > at all. 
But having the option to set a special timestamp function will > give people the option (on a per-buffer level) to make the "TSC" be a > simple incrementing 32-bit counter using xaddl and the upper bits > incrementing from a timer, but keep that as a "ok, the TSC is really > broken, or this architecture doesn't support any fast cycle counters at > all, or I really don't care about time, just sequence, and I guarantee > I have a single lock in all callers that makes things unambiguous" > I currently use the scheme you propose here on architectures lacking TSC. I hook in the timer interrupt to increment the MSBs. I doubt anyone will ever go the way of locking every caller, but yeah, why not. Also note that getting only 32-bits TSCs on MIPS makes things a bit less simple, but I did an RCU-style adaptation layer which generates a full 64-bits TSC from a 32-bits time base. It's currently in the -lttng tree. > Note the "single lock" part. It's not enough that you make any trace thing > under a lock. They must be under the _same_ lock for all relevant events > for you to be able to say anything about ordering. And that's actually > pretty rare for any complex behavior. > > The timestamping, btw, is likely the most important part of the whole > logging thing. So we need to get it right. But by "right" I mean really > really low-latency so that it's acceptable to everybody, real-time enough > that you can tell how far apart events were, and precise enough that you > really _can_ see ordering. > > The "raw TSC value with correction information" should be able to give you > all of that. At least on x86. On some platforms, the TSC may not give you > enough resolution to get reasonable guesses on event ordering. > Mathieu > Linus > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 47+ messages in thread
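Mathieu's cmpxchg-based trace clock from the message above can be sketched with C11 atomics in place of the kernel's cmpxchg; the 200-cycle nudge is the example cost he cites for a TSC read, and the function takes the locally read TSC as a parameter to keep the sketch portable:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Last timestamp handed out, shared across all CPUs. */
static _Atomic uint64_t last_tsc;

/* Assumed cost of reading the TSC, used to nudge a lagging CPU just
 * past the last published value (the "200 cycles" in the description
 * above). */
#define TSC_READ_COST 200

/* Return a timestamp that is monotonically increasing across CPUs:
 * if the locally read TSC is behind the last published value, step
 * slightly past it instead.  The compare-and-swap doubles as a full
 * barrier, standing in for the kernel's cmpxchg. */
static uint64_t trace_clock_read(uint64_t local_tsc)
{
    uint64_t last, ret;

    do {
        last = atomic_load(&last_tsc);
        ret = (local_tsc > last) ? local_tsc : last + TSC_READ_COST;
    } while (!atomic_compare_exchange_weak(&last_tsc, &last, ret));

    return ret;
}
```

This trades the cache-line bounce warned about above for a guaranteed cross-CPU ordering, which is why the message only recommends enabling it when debugging races.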
* Re: Unified tracing buffer 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier 2008-09-23 2:31 ` Mathieu Desnoyers @ 2008-09-23 3:07 ` Masami Hiramatsu 2 siblings, 0 replies; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 3:07 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Bligh, Linux Kernel Mailing List, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, darren, Frank Ch. Eigler, systemtap-ml Hi Linus, Linus Torvalds wrote: > > On Mon, 22 Sep 2008, Masami Hiramatsu wrote: >> Sure, atomic counter might be more expensive but accurate for ordering. > > Don't be silly. > > An atomic counter is no more accurate for ordering than anything else. > > Why? > > Because all it tells you is the ordering of the atomic increment, not of > the caller. The atomic increment is not related to all the other ops that > the code that you trace actually does in any shape or form, and so the > ordering of the trace doesn't actually imply anything for the ordering of > the operations you are tracing! > > Except for a single CPU, of course, but for that case you don't need a > sequence number either, since the ordering is entirely determined by the > ring buffer itself. > > So the counter will be more expensive (cross-cpu cache bouncing for EVERY > SINGLE EVENT), less useful (no real meaning for people who DO want to have > a timestamp), and it's really no more "ordered" than anything that bases > itself on a TSC. > > The fact is, you cannot order operations based on log messages unless you > have a lock around the whole caller - absolutely _no_ amount of locking or > atomic accesses in the log itself will guarantee ordering of the upper > layers. Indeed. If TSC(or similar time counter) can provide synchronized-time, I don't have any comment on that(AFAIK, latest x86 and ia64 can provide it). # I might be a bit nervous about Broken TSC... 
> And sure, if you have locking at a higher layer, then a sequence number is > sufficient, but on the other hand, so is a well-synchronized TSC. > > So personally, I think that the optimal solution is: > > - let each ring buffer be associated with a "gettimestamp()" function, so > that everybody _can_ set it to something of their own. But default to > something sane, namely a raw TSC thing. I agree, default to TSC is enough. > - Add synchronization events to the ring buffer often enough that you can > make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply > noticing when the upper 32 bits change, although you could possibly do > it with a heartbeat too. > > - Similarly, add a synchronization event when the TSC frequency changes. > > - Make the synchronization packet contain the full 64-bit TSC base, in > addition to TSC frequency info _and_ the timebase. > > - From those synchronization events, you should be able to get a very > accurate timestamp *after* the fact from the raw TSC numbers (ie do all > the scaling not when you gather the info, but when you present it), > even if you only spent 32 bits of TSC info on 99% of all events (an > just had a overflow log occasionally to get the rest of the info) > > - Most people will be _way_ happier with a timestamp that has enough > precision to also show ordering (assuming that the caller holds a > lock over the operation _including_ the tracing) than they would ever > be with a sequence number. > > - people who really want to can consider the incrementing counter a TSC, > but it will suck in so many ways that I bet it will not be very popular > at all. 
But having the option to set a special timestamp function will > give people the option (on a per-buffer level) to make the "TSC" be a > simple incrementing 32-bit counter using xaddl and the upper bits > incrementing from a timer, but keep that as a "ok, the TSC is really > broken, or this architecture doesn't support any fast cycle counters at > all, or I really don't care about time, just sequence, and I guarantee > I have a single lock in all callers that makes things unambiguous" Thank you very much for giving me a good idea! I agree with you. > Note the "single lock" part. It's not enough that you make any trace thing > under a lock. They must be under the _same_ lock for all relevant events > for you to be able to say anything about ordering. And that's actually > pretty rare for any complex behavior. > > The timestamping, btw, is likely the most important part of the whole > logging thing. So we need to get it right. But by "right" I mean really > really low-latency so that it's acceptable to everybody, real-time enough > that you can tell how far apart events were, and precise enough that you > really _can_ see ordering. > > The "raw TSC value with correction information" should be able to give you > all of that. At least on x86. On some platforms, the TSC may not give you > enough resolution to get reasonable guesses on event ordering. > > Linus -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-22 22:27 ` Masami Hiramatsu 2008-09-22 23:12 ` Darren Hart 2008-09-22 23:17 ` Martin Bligh @ 2008-09-23 14:38 ` KOSAKI Motohiro 2008-09-23 15:04 ` Frank Ch. Eigler 2008-09-23 15:23 ` Masami Hiramatsu 2 siblings, 2 replies; 47+ messages in thread From: KOSAKI Motohiro @ 2008-09-23 14:38 UTC (permalink / raw) To: Masami Hiramatsu Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml > By the way, systemtap uses two modes; > > - single-channel mode > In this mode, all CPUs share one buffer channel to write and read. > Each writer takes a spinlock and writes probe-local data to the buffer. > > - per-cpu buffer mode > In this mode, we use an atomic sequence number for ordering data. If > the user doesn't need it (because they have their own timestamps), they can > choose not to use that seq-number. I can't imagine the merit of the single-channel mode. Could you please explain it? Is it because some architectures don't have a fine-grained timestamp? If so, could you explain which architectures lack it? ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:38 ` KOSAKI Motohiro @ 2008-09-23 15:04 ` Frank Ch. Eigler 2008-09-23 15:23 ` Masami Hiramatsu 1 sibling, 0 replies; 47+ messages in thread From: Frank Ch. Eigler @ 2008-09-23 15:04 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Masami Hiramatsu, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, systemtap-ml Hi - On Tue, Sep 23, 2008 at 11:36:26PM +0900, KOSAKI Motohiro wrote: > > By the way, systemtap uses two modes; > > > > - single-channel mode > > In this mode, all CPUs share one buffer channel to write and read. > > Each writer takes a spinlock and writes probe-local data to the buffer. > > - per-cpu buffer mode [...] > > I can't imagine the merit of the single-channel mode. > Could you please explain it? It could be a way of saving some memory and merging hassle for low-throughput data. (Remember that systemtap enables in-situ analysis of events, so that often only brief final results need be sent out.) If timestamp-wise cross-cpu merging can be done on demand by the hypothetical future buffer widget, then little reason remains not to use it. - FChE ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 14:38 ` KOSAKI Motohiro 2008-09-23 15:04 ` Frank Ch. Eigler @ 2008-09-23 15:23 ` Masami Hiramatsu 2008-09-23 18:04 ` KOSAKI Motohiro 1 sibling, 1 reply; 47+ messages in thread From: Masami Hiramatsu @ 2008-09-23 15:23 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml Hi Motohiro, KOSAKI Motohiro wrote: >> By the way, systemtap uses two modes; >> >> - single-channel mode >> In this mode, all CPUs share one buffer channel to write and read. >> Each writer takes a spinlock and writes probe-local data to the buffer. >> >> - per-cpu buffer mode >> In this mode, we use an atomic sequence number for ordering data. If >> the user doesn't need it (because they have their own timestamps), they can >> choose not to use that seq-number. > > I can't imagine the merit of the single-channel mode. > Could you please explain it? Actually, single-channel mode is for infrequent event tracing. At least in the systemtap case, sometimes we just want to collect data and watch it periodically (like 'top'). Or just monitor errors as an additional printk. In these cases, overhead is not so important. I think the main reason for using single-channel mode is the simplicity of the userspace reader. We can use 'cat' or 'tail' to read the buffer on-line. I'm not sure how much overhead an ftrace-like buffer-merging routine has, but if the kernel provides an interface which gives us a single merged buffer image (like the ftrace buffer), we are glad to use it. :-) > Is it because some architectures don't have a fine-grained timestamp? > If so, could you explain which architectures lack it? I heard that get_cycles always returns 0 on some CPUs (ARM, MIPS, etc.)... (I think some of their platforms support variants of get_cycles) Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. 
Software Solutions Division e-mail: mhiramat@redhat.com ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 15:23 ` Masami Hiramatsu @ 2008-09-23 18:04 ` KOSAKI Motohiro 2008-09-23 18:29 ` Martin Bligh 0 siblings, 1 reply; 47+ messages in thread From: KOSAKI Motohiro @ 2008-09-23 18:04 UTC (permalink / raw) To: Masami Hiramatsu Cc: kosaki.motohiro, Martin Bligh, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml > > I can't imagine the merit of the single-channel mode. > > Could you please explain it? > > Actually, single-channel mode is for infrequent event tracing. > At least in the systemtap case, sometimes we just want to collect data > and watch it periodically (like 'top'). Or just monitor > errors as an additional printk. In these cases, overhead is not so > important. > > I think the main reason for using single-channel mode is the simplicity of > the userspace reader. We can use 'cat' or 'tail' to read the buffer on-line. > I'm not sure how much overhead an ftrace-like buffer-merging routine has, > but if the kernel provides an interface which gives us a single merged buffer > image (like the ftrace buffer), we are glad to use it. :-) Yup, I also think it is better. Thanks. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Unified tracing buffer 2008-09-23 18:04 ` KOSAKI Motohiro @ 2008-09-23 18:29 ` Martin Bligh 0 siblings, 0 replies; 47+ messages in thread From: Martin Bligh @ 2008-09-23 18:29 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Masami Hiramatsu, Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner, Mathieu Desnoyers, Steven Rostedt, od, Frank Ch. Eigler, systemtap-ml >> I think the main reason for using single-channel mode is the simplicity of >> the userspace reader. We can use 'cat' or 'tail' to read the buffer on-line. >> I'm not sure how much overhead an ftrace-like buffer-merging routine has, >> but if the kernel provides an interface which gives us a single merged buffer >> image (like the ftrace buffer), we are glad to use it. :-) > > Yup, I also think it is better. That was the plan, yes. Merge sort is cheap ;-) ^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2008-09-23 21:02 UTC | newest] Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <33307c790809191433w246c0283l55a57c196664ce77@mail.gmail.com> 2008-09-22 19:47 ` Unified tracing buffer Masami Hiramatsu 2008-09-22 20:14 ` Martin Bligh 2008-09-22 22:27 ` Masami Hiramatsu 2008-09-22 23:12 ` Darren Hart 2008-09-23 0:06 ` Masami Hiramatsu 2008-09-22 23:17 ` Martin Bligh 2008-09-23 0:07 ` Masami Hiramatsu 2008-09-23 0:13 ` Martin Bligh 2008-09-23 14:51 ` Masami Hiramatsu 2008-09-23 15:09 ` Mathieu Desnoyers 2008-09-23 15:32 ` Masami Hiramatsu 2008-09-23 16:02 ` Linus Torvalds 2008-09-23 17:07 ` Masami Hiramatsu 2008-09-23 17:33 ` Thomas Gleixner 2008-09-23 19:03 ` Masami Hiramatsu 2008-09-23 19:37 ` Thomas Gleixner 2008-09-23 19:39 ` Martin Bligh 2008-09-23 19:42 ` Thomas Gleixner 2008-09-23 19:51 ` Martin Bligh 2008-09-23 20:05 ` Thomas Gleixner 2008-09-23 21:02 ` Martin Bligh 2008-09-23 20:06 ` Masami Hiramatsu 2008-09-23 20:09 ` Thomas Gleixner 2008-09-23 15:48 ` Linus Torvalds 2008-09-23 0:41 ` Linus Torvalds 2008-09-23 1:28 ` Roland Dreier 2008-09-23 1:40 ` Steven Rostedt 2008-09-23 2:08 ` Mathieu Desnoyers 2008-09-23 2:27 ` Darren Hart 2008-09-23 2:32 ` Mathieu Desnoyers 2008-09-23 3:29 ` Linus Torvalds 2008-09-23 3:42 ` Mathieu Desnoyers 2008-09-23 4:06 ` Linus Torvalds 2008-09-23 3:44 ` Steven Rostedt 2008-09-23 4:12 ` Masami Hiramatsu 2008-09-23 4:18 ` Martin Bligh 2008-09-23 15:25 ` Masami Hiramatsu 2008-09-23 10:54 ` Steven Rostedt 2008-09-23 4:20 ` Linus Torvalds 2008-09-23 14:13 ` Mathieu Desnoyers 2008-09-23 2:31 ` Mathieu Desnoyers 2008-09-23 3:07 ` Masami Hiramatsu 2008-09-23 14:38 ` KOSAKI Motohiro 2008-09-23 15:04 ` Frank Ch. Eigler 2008-09-23 15:23 ` Masami Hiramatsu 2008-09-23 18:04 ` KOSAKI Motohiro 2008-09-23 18:29 ` Martin Bligh