Date: Sat, 09 Jan 2010 02:46:00 -0000
From: "K.Prasad" <prasad@linux.vnet.ibm.com>
Reply-To: prasad@linux.vnet.ibm.com
To: Roland McGrath
Cc: Prerna Saxena, systemtap@sourceware.org
Subject: Re: [RFC] Systemtap translator support for hardware breakpoints on
Message-ID: <20100109024616.GA6810@in.ibm.com>
References: <4B459CC8.2030402@linux.vnet.ibm.com> <20100108015301.72EF67300@magilla.sf.frob.com> <20100109011140.GB3486@in.ibm.com> <20100109014457.EF791CC@magilla.sf.frob.com>
In-Reply-To: <20100109014457.EF791CC@magilla.sf.frob.com>
Mailing-List: contact systemtap-help@sourceware.org; run by ezmlm
On Fri, Jan 08, 2010 at 05:44:57PM -0800, Roland McGrath wrote:
> > If my understanding is correct, this is a suggestion that demands an
> > 'overcommit' feature (ability to accept requests more than the
> > available debug registers) in hw-breakpoints, right?
>
> Actually that's exactly not what I was talking about.  It's an
> interesting subject to consider (I was the one who originally
> proposed such complexity for hw_breakpoint at its inception).
>
> But all I was talking about here is what a stap module can do when an
> "allocate it now and forever" call fails.  (This might be at script
> startup time, or might be later during a module's lifetime if the
> putative script-driven dynamic registration feature were used.)
>

Yes, I remember reading through the discussions you had with Alan
Stern, as a result of which his code originally had the over-commit and
prioritisation features; they were subsequently removed during a
re-design demanded by the community/maintainer. Post perf-events
integration, over-commit is available in the form of 'un-pinned'
requests.

> > In its new form (post perf-events integration), hw-breakpoints can
> > indeed accept new requests that far exceed the number of underlying
> > debug registers. This can be achieved by making an 'un-pinned'
> > breakpoint request, where every such request gets a chance to use the
> > debug register in a round-robin fashion (all this is provided by the
> > perf-events infrastructure anyway).
>
> I don't understand what "round-robin" means for breakpoints.  When
> you let the kernel execute normally again after registration, either
> my breakpoint is enabled or it isn't.  There is no meaningful sense
> in which you can "time-share" a hardware breakpoint slot.  (That is,
> except for doing per-task registrations, which is a different
> semantics entirely.)
As I said before, it is a 'feature' 'acquired' by breakpoints by virtue
of using the perf-events layer underneath. One can think of it as
remotely useful for profiling reads/writes to a large number of
addresses simultaneously, to 'statistically' indicate a hit-count for
them (reference: LKML message-id 1248856132.6987.3034.camel@twins and
the related thread); it would, however, be disastrous to use in a
debugging role.

> > Presently, the breakpoint infrastructure does not provide callbacks
> > that can be invoked whenever an 'un-pinned' breakpoint request is
> > scheduled-in/out (analogous to .enabled and .disabled). We could
> > pursue support for the same (of course, that would require a good
> > in-kernel user to convince the community!).
>
> I am having trouble imagining what any kind of "scheduled-in/out"
> could possibly be useful to do at all if you don't notify me about
> whether or not my breakpoint is in place.  A "best effort"
> breakpoint, that might be caught and might not be and who knows
> whether it's really installed, is just not useful.  I must be
> missing the essence of what you mean.
>

Unfortunately, what you've described above as a "best effort"
breakpoint is exactly what an 'un-pinned' breakpoint now is. Again, it
is a carry-over from perf-events, where that behaviour suits the other
performance monitoring counters (provided by the PMU on Intel x86
processors) just fine. There is no in-kernel user of 'un-pinned'
breakpoint requests at present; perhaps when such a need arises, a
notifier/callback implementation during sched-in/out will follow.

> The thing I had talked about before was each hw_breakpoint
> registration having a priority number.  When another registration
> comes along with a higher priority, it can boot yours and make a
> callback so you know it's been stolen.  Conversely, when a competing
> registration goes away, the highest-priority registration-in-waiting
> gets installed and gets a callback to tell you it's now active.
> (I guess really there could just be a single notifier list that gets
> called when a slot becomes free, so the suitors can try again to see
> whose priority wins.  Whatever.)  But you've said this is not what
> it does now.
>

Yes, priority-based hw-breakpoint registration (and the queueing of new
requests) is no longer there.

> Anyway, that kind of dynamicism is not what I was talking about here.
> If we had it, the script-language features might look rather similar
> and so work how you're talking about here.  i.e., the .enabled and
> .disabled probes firing due to an hw_breakpoint layer callback that is
> "spontaneous" to the stap module's eyes, i.e. driven by the ebb and
> flow of external demand on the scarce shared resource.
>
> In the examples I gave, the .unavailable sub-probe in the simplest
> form is just a translation-time (tapset sugar) way to do some implicit
> script-level stuff at 'probe begin' time, but conditional on whether
> the associated hw_breakpoint registration at startup succeeded.
>
> In the example with dynamic (i.e. runtime script-controlled)
> registration, the .begin and .end sub-probes are just tapset sugar for
> some script-level stuff to do implicitly when script code uses "enable
> watch_foo" and "disable watch_foo".  For that use, you might have a
> .unavailable sub-probe that runs when "enable" fails, or you might
> just have "enable" return a testable success/failure to the script
> code or whatnot.
>

While the hw-breakpoint APIs could originally be invoked from most
contexts, I'm unsure whether that still holds post perf-events
integration. As Prerna stated, it needs to be verified whether the
(un)register_<> breakpoint interfaces can be successfully invoked from
a kprobe handler context.

> Given all that, I can imagine wanting the tie-in for some kind of
> "breakpoint scheduling" to be that "enable" has "enable-now-or-fail"
> and "enable-now-or-later" variants.
> Then in the "now or later" variant, those same .begin and .end probes
> might get run during the lifetime of a script (due to an
> hw_breakpoint layer callback) rather than only at "enable foo" and
> "disable foo" time.
>

The kernel hw-breakpoint API supports only an "enable-now-or-fail"
variant, and I think the script cannot implement "enable-now-or-later"
without kernel support (to be notified of a vacated debug register).
This means that all 'pinned' breakpoint requests beyond HBP_NUM
(HBP_NUM = the number of debug registers per CPU) would always fail and
must fall back to the .unavailable probe. The failure can happen even
earlier if breakpoint requests pre-exist.

So, is the .unavailable probe's role just to warn the user that the
request has failed? Or do you see more uses for it (that aren't
obvious to me)?

Thanks,
K.Prasad
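P.S. To make sure I've understood the proposed sugar, here is roughly
how I imagine a script using it would look. The probe-point spelling is
illustrative only (borrowed from the shape of Prerna's RFC), the watched
address is a placeholder, and .unavailable is the hypothetical sub-probe
under discussion -- none of this is an existing tapset:

```
/* Hypothetical sketch -- not existing stap syntax */
probe kernel.data(0xc06c69b0).write
{
        printf("address written by %s\n", execname())
}

probe kernel.data(0xc06c69b0).write.unavailable
{
        /* runs at startup if all HBP_NUM debug registers
           were already taken and registration failed */
        printf("watchpoint could not be armed\n")
        exit()
}
```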