From: Roland McGrath
To: Phil Muldoon
Cc: Frysk Hackers
Subject: Re: Optimizing watchpoints
In-Reply-To: Phil Muldoon's message of Friday, 28 September 2007 22:20:54 +0100 <46FD7036.2010500@redhat.com>
Date: Mon, 01 Oct 2007 01:25:00 -0000

You've brought up two issues, which I think each deserve their own separate thread of discussion.

The second thread is about indirection or, generally speaking, dynamic specification of watchpoint addresses. That is a worthy and interesting subject, but I don't think you need to worry about it now. For considering the watchpoint implementation per se, we can just talk about "a watchpoint" as a request to watch a given fixed address.

There are two aspects of specification that will poke through to the lowest-level implementation layers: how you can specify the address range, and whose actions you want to watch.

For the latter, that means an individual thread or a group of threads that share a set of watchpoints. Right now, the only possible implementation is to set each watchpoint individually on each thread. But it is likely that future facilities will be able to share some low-level resources and interface overhead by treating an arbitrary subset of threads in the same address space uniformly. It is also likely to matter whether the chosen subset is in fact the whole set of threads in that address space, and whether a thread has only the breakpoints shared with its siblings in a chosen subset or has those plus additional private breakpoints of its own. So it's worthwhile to think about how the structure for keeping track of watchpoints (and other kinds of low-level breakpoints) can reflect those groupings of threads from the high-level semantic control plane down to the lowest-level implementation, where the most important sharing can occur.
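To make that shape concrete, here is a minimal sketch in C of what a request might look like by the time it reaches the low level. The names (watch_request, thread_set) and the fields are mine, purely illustrative, not anything in frysk today:

    /* A watchpoint request as it might percolate down from the
       semantic layer: a fixed address range, the kind of access to
       trap, and which threads it applies to.  The opaque thread_set
       is the hook for the sharing discussed above: one set per group
       of threads sharing watchpoints, possibly the whole address
       space, with any per-thread private breakpoints layered on as a
       second, private set.  */

    enum watch_kind { WATCH_WRITE, WATCH_READWRITE, WATCH_EXEC };

    struct thread_set;              /* a group of threads sharing watchpoints */

    struct watch_request {
        unsigned long addr;         /* start of the range to watch */
        unsigned long len;          /* length in bytes, unconstrained here */
        enum watch_kind kind;
        struct thread_set *threads; /* whose actions to watch */
    };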
For the address range specification, the low-level implementation pokes up to constrain the granularity at which you can usefully specify what you want to watch. On the most common machines, it's only naturally-aligned chunks whose size is some machine-dependent subset of 1, 2, 4, 8 bytes (i.e. 1, 2, 4 for 32-bit x86 kernels; 1, 2, 4, 8 for x86_64 kernels; and 8 for powerpc and ia64). Future facilities will probably add page size to that set (using a software VM facility rather than hardware features).

So the first step has to be turning the semantic request of a given address range into one or more (aligned address, size) pairs (or, if you prefer, (address, mask) pairs) of sizes supported at the low level; a sketch of that decomposition appears below. Then you maintain the active set in that form, and identify the redundancies as they go in. Only when the duplicate-free active set changes do you poke the low level.

There is one final aspect of organization to consider. At the lowest level, there is a fixed-size hardware resource of watchpoint slots. When you set them with ptrace, the operating system just context-switches them for each thread in the most straightforward way, so the hardware resource is yours to decide how to allocate. However, this is not what we expect to see in future facilities. The coming model is that hardware watchpoints are a shared resource, managed and virtualized to a certain degree by the operating system. The debugger may be one among several noncooperating users of this resource, for both per-thread and system-wide uses. Rather than having the hardware slots to allocate as you choose, you will specify what you want in a slot, and a priority, and can get dynamic feedback about the availability of a slot for your priority. (For compatibility, ptrace itself will use that facility to virtualize the demands made by PTRACE_SET_DEBUGREG and the like. ptrace uses a known priority number that is fairly high, so that some system-wide or other background tracing would have to knowingly intend to interfere with traditional user application use by choosing an even higher priority.)

In the long run, the way to look at this is not as a set of resources to be allocated, but as a bag of tricks, each with different tradeoffs of constraints, costs, and dynamic resource contention.

At one extreme you have single-step, i.e. software watchpoints implemented by storing the old value, stepping an instruction, and checking whether the value in memory changed. This has few constraints on specification (only that you can't distinguish stored-same from no-store, and it's not a mechanism for data read breakpoints). It has no resource contention issues at all. It is inordinately expensive in CPU time (though a straightforward in-kernel implementation could easily be orders of magnitude faster than the traditional debugger experience of implementing this).

Hardware watchpoints have some precise constraints and they compete for a very limited dynamic resource, but they are extremely cheap in CPU time.

In the future we might have the option of VM tricks. Those have their own constraints (page granularity on addresses), they consume an important dynamic resource that is finite but often not too constrained in practice (page tables), and they have a somewhat complex balance of cost.

When I talk about cost, I mean roughly the setup overhead, plus the exception overhead of a hit, plus the overhead of taking and sorting out any false positive hits (due to the granularity of the facility you choose being coarser than what you're actually aiming for). For cases like VM tricks, this takes in the complex set of factors affecting scaling and secondary overhead (page table locking, TLB, cache, etc.).
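To make that first decomposition step concrete, here is a minimal sketch assuming the x86_64 size set; the function name and output format are just for illustration:

    #include <stdio.h>

    /* Supported watchpoint sizes, largest first: 1, 2, 4, 8 as on
       x86_64.  A 32-bit x86 kernel would drop the 8; powerpc would
       have just 8.  */
    static const unsigned long sizes[] = { 8, 4, 2, 1 };

    /* Greedily cover [addr, addr+len) with naturally-aligned chunks
       of supported sizes.  Always terminates, since a 1-byte chunk
       is legal at any address.  */
    static void decompose(unsigned long addr, unsigned long len)
    {
        unsigned long end = addr + len;
        while (addr < end) {
            size_t i;
            for (i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
                if (addr % sizes[i] == 0 && addr + sizes[i] <= end)
                    break;
            printf("watch addr=%#lx size=%lu\n", addr, sizes[i]);
            addr += sizes[i];
        }
    }

    int main(void)
    {
        decompose(0x1003, 11);  /* an 11-byte object at an odd address */
        return 0;
    }

On that 11-byte example this emits four low-level pairs (one 1-byte, two 4-byte, one 2-byte) for a single semantic request, which is exactly the fan-out the duplicate-tracking active set has to absorb.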
To satisfy a set of requests percolated down from the higher levels, you take those requirements, your bag of tricks, and each trick's particular tradeoffs for each case, and figure out dynamically what to do. For tricks that depend on resources with priority allocations, i.e. hardware watchpoints, you have to rate the cost-effectiveness of the next-best trick you could use instead, somehow scaled by how badly you want to get this done properly, to come up with the right priority number to use, so as best to express decent global tradeoffs among the competing needs in the system.

In a simple example, you can do a one-byte data write watchpoint on powerpc, but only as a two-tier operation. You set the hardware to catch stores to the aligned 8-byte word containing the target byte address. You get a fault before the store happens, but you only know it will store to one or more of those 8 bytes. So you can save the old 8 bytes, disable the hardware watchpoint, single-step, and compare the new 8 bytes to the old to decide which smaller-granularity data-change watchpoints to report as hit. (A sketch of this scheme appears below.) This is a pretty good tradeoff: while the total overhead of a hit is at least twice that of the optimal case (an aligned 8-byte watchpoint), the likely rate of false positives is quite low. (Compared to "sorry, just can't do it", that's really quite good.)

In an oft-imagined future example, VM tricks give an effectively unlimited resource in how many watchpoints (of that kind) you can have installed simultaneously. Even if you can have only one word-granularity watchpoint, or four, or none, you can set lots of page-granularity watchpoints. You can dynamically track the rate of false positives engendered by those (page faults you have to examine and let go), potentially feeding a complex estimate of the cost of maintaining particular watchpoints in the page tables. You can install many page-level watchpoints to start, then respond to their hits and to the dynamic feedback of hardware watchpoint slot pressure to choose the hottest watchpoints and give them the hardware slots. In the absence of hardware slots, this might always lead to single-stepping over stores to watched pages, or to declaring the situation too costly and alerting the user that we had to give up.

The trivial case is that you request at low level exactly one watchpoint for what the user needs, with normal ptrace-equivalent priority. If the slot is (or becomes) unavailable, you tell the user "no can do", end of story. (In today's implementation based on ptrace, this is what the low level always has to boil down to.)

This view of the whole bag of tricks should be unified across all sorts of low-level breakpoint resources, i.e. instruction breakpoints as well as data read/write breakpoints (aka watchpoints). Normal breakpoint insertion and related techniques are more of the tricks in the bag; like some of the hardware features mentioned earlier, they are very constrained in their kind of specification (an exact instruction address). The most common case is one with no watchpoints at all, but just a few instruction breakpoints. On x86 and ia64, the same hardware features used for data breakpoints work for these, and they are orders of magnitude less costly (and far less complicated) than even the sexiest hypothetical breakpoint-assistance techniques. When the pressure for those resources allows, it's always optimal to use hardware breakpoints in preference to memory insertion.
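Here is a minimal sketch of that two-tier powerpc scheme, assuming a 64-bit target stopped at the hardware fault before the store happens, and using the PTRACE_SET_DEBUGREG request mentioned above (a powerpc-specific request, from <asm/ptrace.h> there). The helper name and its arguments are my invention; error handling is omitted, and re-arming the cleared watchpoint for the next store is left to the caller:

    #include <string.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <asm/ptrace.h>     /* PTRACE_SET_DEBUGREG on powerpc */

    /* The thread has faulted on a store to the aligned 8-byte word
       at WADDR; the user really asked about LEN bytes at offset OFF
       within that word (off + len <= sizeof (long)).  Returns
       nonzero if the watched bytes actually changed.  */
    static int subword_hit(pid_t pid, unsigned long waddr,
                           unsigned int off, unsigned int len)
    {
        long old, cur;
        unsigned char oldb[sizeof (long)], curb[sizeof (long)];

        /* Save the old contents of the watched word.  */
        old = ptrace(PTRACE_PEEKDATA, pid, (void *) waddr, NULL);

        /* Disable the hardware watchpoint (clear the debug register)
           so the faulting store can complete, then step over it.  */
        ptrace(PTRACE_SET_DEBUGREG, pid, NULL, NULL);
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
        waitpid(pid, NULL, 0);

        /* Compare only the bytes the user asked about; a change
           elsewhere in the word is a false positive we swallow.  */
        cur = ptrace(PTRACE_PEEKDATA, pid, (void *) waddr, NULL);
        memcpy(oldb, &old, sizeof old);
        memcpy(curb, &cur, sizeof cur);
        return memcmp(oldb + off, curb + off, len) != 0;
    }

Thanks,
Roland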