From: Masami Hiramatsu <mhiramat@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Bligh <mbligh@google.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mathieu Desnoyers <compudj@krystal.dyndns.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	darren@dvhart.com,
	"Frank Ch. Eigler" <fche@redhat.com>,
	systemtap-ml <systemtap@sources.redhat.com>
Subject: Re: Unified tracing buffer
Date: Tue, 23 Sep 2008 03:07:00 -0000
Message-ID: <48D85D21.7060801@redhat.com>
In-Reply-To: <alpine.LFD.1.10.0809221718100.3265@nehalem.linux-foundation.org>

Hi Linus,

Linus Torvalds wrote:
> 
> On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
>> Sure, atomic counter might be more expensive but accurate for ordering.
> 
> Don't be silly.
> 
> An atomic counter is no more accurate for ordering than anything else.
> 
> Why?
> 
> Because all it tells you is the ordering of the atomic increment, not of 
> the caller. The atomic increment is not related to all the other ops that 
> the code that you trace actually does in any shape or form, and so the 
> ordering of the trace doesn't actually imply anything for the ordering of 
> the operations you are tracing!
> 
> Except for a single CPU, of course, but for that case you don't need a 
> sequence number either, since the ordering is entirely determined by the 
> ring buffer itself.
> 
> So the counter will be more expensive (cross-cpu cache bouncing for EVERY 
> SINGLE EVENT), less useful (no real meaning for people who DO want to have 
> a timestamp), and it's really no more "ordered" than anything that bases 
> itself on a TSC.
> 
> The fact is, you cannot order operations based on log messages unless you 
> have a lock around the whole caller - absolutely _no_ amount of locking or 
> atomic accesses in the log itself will guarantee ordering of the upper 
> layers.

Indeed.
If the TSC (or a similar time counter) can provide synchronized time, I
have no objection (AFAIK, the latest x86 and ia64 processors can provide it).
# I might be a bit nervous about a broken TSC, though...

> And sure, if you have locking at a higher layer, then a sequence number is 
> sufficient, but on the other hand, so is a well-synchronized TSC.
> 
> So personally, I think that the optimal solution is:
> 
>  - let each ring buffer be associated with a "gettimestamp()" function, so 
>    that everybody _can_ set it to something of their own. But default to 
>    something sane, namely a raw TSC thing.

I agree; defaulting to the TSC is enough.

>  - Add synchronization events to the ring buffer often enough that you can 
>    make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply 
>    noticing when the upper 32 bits change, although you could possibly do 
>    it with a heartbeat too.
> 
>  - Similarly, add a synchronization event when the TSC frequency changes.
> 
>  - Make the synchronization packet contain the full 64-bit TSC base, in 
>    addition to TSC frequency info _and_ the timebase.
> 
>  - From those synchronization events, you should be able to get a very 
>    accurate timestamp *after* the fact from the raw TSC numbers (ie do all 
>    the scaling not when you gather the info, but when you present it), 
>    even if you only spent 32 bits of TSC info on 99% of all events (and 
>    just had an overflow log occasionally to get the rest of the info)
> 
>  - Most people will be _way_ happier with a timestamp that has enough 
>    precision to also show ordering (assuming that the caller holds a 
>    lock over the operation _including_ the tracing) than they would ever 
>    be with a sequence number.
> 
>  - people who really want to can consider the incrementing counter a TSC, 
>    but it will suck in so many ways that I bet it will not be very popular 
>    at all. But having the option to set a special timestamp function will
>    give people the option (on a per-buffer level) to make the "TSC" be a 
>    simple incrementing 32-bit counter using xaddl and the upper bits 
>    incrementing from a timer, but keep that as a "ok, the TSC is really 
>    broken, or this architecture doesn't support any fast cycle counters at 
>    all, or I really don't care about time, just sequence, and I guarantee 
>    I have a single lock in all callers that makes things unambiguous"

Thank you very much for giving me a good idea!
I agree with you.
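
To make the per-buffer timestamp idea concrete, here is a minimal
user-space sketch in C. It is only an illustration, not code from any
kernel tree: the names ring_buffer, gettimestamp_fn, raw_tsc,
seq_timestamp, seq_timer_tick and ring_buffer_set_clock are all
hypothetical, and the non-x86 branch of raw_tsc simply falls back to
C11 timespec_get() as a stand-in for "some other fast counter".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-buffer timestamp hook. */
typedef uint64_t (*gettimestamp_fn)(void *ctx);

struct ring_buffer {
	gettimestamp_fn gettimestamp;	/* never NULL after init */
	void *ts_ctx;			/* opaque state for the hook */
};

/* Default clock: the raw, unscaled cycle counter. */
static uint64_t raw_tsc(void *ctx)
{
	(void)ctx;
#if defined(__x86_64__) || defined(__i386__)
	return __builtin_ia32_rdtsc();	/* GCC/Clang builtin for RDTSC */
#else
	/* Assumption: no cycle counter here, so use wall-clock ns. */
	struct timespec ts;
	timespec_get(&ts, TIME_UTC);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/*
 * Optional "sequence number as TSC": the low 32 bits come from an
 * atomic increment, the high 32 bits are bumped by a periodic timer.
 * Every event touches the same shared counter, so this is expected to
 * bounce a cache line between CPUs.
 */
struct seq_clock {
	_Atomic uint32_t low;
	_Atomic uint32_t high;
};

static uint64_t seq_timestamp(void *ctx)
{
	struct seq_clock *c = ctx;
	uint32_t lo = atomic_fetch_add(&c->low, 1);
	uint32_t hi = atomic_load(&c->high);

	return ((uint64_t)hi << 32) | lo;
}

/* Would be called from a timer; it only bumps the upper half. */
static void seq_timer_tick(struct seq_clock *c)
{
	atomic_fetch_add(&c->high, 1);
}

static void ring_buffer_init(struct ring_buffer *rb)
{
	rb->gettimestamp = raw_tsc;	/* sane default */
	rb->ts_ctx = NULL;
}

static void ring_buffer_set_clock(struct ring_buffer *rb,
				  gettimestamp_fn fn, void *ctx)
{
	assert(fn != NULL);
	rb->gettimestamp = fn;
	rb->ts_ctx = ctx;
}

static uint64_t ring_buffer_stamp(struct ring_buffer *rb)
{
	return rb->gettimestamp(rb->ts_ctx);
}
```

A buffer that needs nothing special calls ring_buffer_init() and gets
the raw TSC; a buffer on hardware with a broken TSC could call
ring_buffer_set_clock(&rb, seq_timestamp, &clk) instead. As noted in the
quoted text, the values from seq_timestamp only say something about
event ordering if all callers hold the same lock around the traced
operation and the trace call.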

> Note the "single lock" part. It's not enough that you make any trace thing 
> under a lock. They must be under the _same_ lock for all relevant events 
> for you to be able to say anything about ordering. And that's actually 
> pretty rare for any complex behavior.
> 
> The timestamping, btw, is likely the most important part of the whole 
> logging thing. So we need to get it right. But by "right" I mean really 
> really low-latency so that it's acceptable to everybody, real-time enough 
> that you can tell how far apart events were, and precise enough that you 
> really _can_ see ordering.
> 
> The "raw TSC value with correction information" should be able to give you 
> all of that. At least on x86. On some platforms, the TSC may not give you 
> enough resolution to get reasonable guesses on event ordering.
> 
> 			Linus
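
The "raw TSC value with correction information" scheme can be sketched
in user-space C roughly as follows. This is an illustration under
stated assumptions, not an actual ring buffer implementation: the
record layout (struct rec), the fixed 64-entry array, and the names
trace_event, rec_to_ns and cycles_to_ns are all hypothetical. Events
carry only the low 32 bits of the raw counter; a sync record with the
full 64-bit base, the frequency in kHz and a nanosecond timebase is
written whenever the upper 32 bits or the frequency differ from the
last sync. In this sketch a frequency change is only noticed at the
next event, and the timebase simply starts at 0 at the first sync.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_EVENT, REC_SYNC };

/* Hypothetical record layout; a real buffer would pack this tightly. */
struct rec {
	enum rec_type type;
	uint32_t tsc_lo;	/* REC_EVENT: low 32 bits of the raw TSC */
	uint64_t tsc_base;	/* REC_SYNC: full 64-bit raw TSC */
	uint64_t ns_base;	/* REC_SYNC: time in ns at tsc_base */
	uint32_t khz;		/* REC_SYNC: TSC frequency in kHz */
};

/* Clock parameters taken from the most recent sync record. */
struct clock_state {
	uint64_t tsc_base;
	uint64_t ns_base;
	uint32_t khz;		/* 0 means "no sync seen yet" */
};

struct trace_buf {
	struct rec recs[64];
	size_t n;
	struct clock_state clk;
};

/* ns = cycles * 10^6 / kHz, split so the multiply cannot overflow. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t khz)
{
	return cycles / khz * 1000000u + cycles % khz * 1000000u / khz;
}

/* Writer side: no scaling per event, only a 32-bit store. */
static void trace_event(struct trace_buf *b, uint64_t tsc, uint32_t khz)
{
	int need_sync = b->clk.khz == 0 || khz != b->clk.khz ||
			(tsc >> 32) != (b->clk.tsc_base >> 32);

	assert(b->n + 2 <= sizeof(b->recs) / sizeof(b->recs[0]));
	if (need_sync) {
		/* Extend the timebase using the old frequency. */
		uint64_t ns = b->clk.khz == 0 ? 0 : b->clk.ns_base +
			cycles_to_ns(tsc - b->clk.tsc_base, b->clk.khz);

		b->clk = (struct clock_state){ tsc, ns, khz };
		b->recs[b->n++] = (struct rec){ .type = REC_SYNC,
			.tsc_base = tsc, .ns_base = ns, .khz = khz };
	}
	b->recs[b->n++] = (struct rec){ .type = REC_EVENT,
					.tsc_lo = (uint32_t)tsc };
}

/* Reader side: all scaling happens here, after the fact. */
static uint64_t rec_to_ns(struct clock_state *st, const struct rec *r)
{
	uint64_t tsc;

	if (r->type == REC_SYNC) {
		*st = (struct clock_state){ r->tsc_base, r->ns_base, r->khz };
		return r->ns_base;
	}
	assert(st->khz != 0);	/* an event must follow a sync record */
	/* The upper 32 bits are those of the last sync, by construction. */
	tsc = (st->tsc_base & ~0xffffffffULL) | r->tsc_lo;
	return st->ns_base + cycles_to_ns(tsc - st->tsc_base, st->khz);
}
```

The design choice mirrored here is that the hot path stores only 32 bits
and does no arithmetic beyond a compare, while the reader walks the
records in order and rebuilds full timestamps from the sync records.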

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com
