From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 11814 invoked by alias); 1 Aug 2005 08:44:18 -0000
Mailing-List: contact systemtap-help@sources.redhat.com; run by ezmlm
Precedence: bulk
Sender: systemtap-owner@sources.redhat.com
Received: (qmail 11794 invoked by uid 22791); 1 Aug 2005 08:44:13 -0000
In-Reply-To: <20050731220304.GJ3726@bragg.suse.de>
Subject: Re: Hitachi djprobe mechanism
To: Andi Kleen
Cc: Andi Kleen, Mathieu Desnoyers, Masami Hiramatsu, Karim Yaghmour,
    Masami Hiramatsu, michel.dagenais@polymtl.ca, Roland McGrath,
    Satoshi Oshima, sugita@sdl.hitachi.co.jp, systemtap@sources.redhat.com
X-Mailer: Lotus Notes Release 6.5.1IBM February 19, 2004
From: Richard J Moore
Date: Mon, 01 Aug 2005 08:44:00 -0000
X-MIMETrack: Serialize by Router on D06ML065/06/M/IBM(Release 6.53HF247 | January 6, 2005) at 01/08/2005 09:44:09
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
X-SW-Source: 2005-q3/txt/msg00208.txt.bz2

There is another issue to consider when looking into using probes other
than int3:

Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
practice of modifying code on one processor where another has prefetched
the unmodified version of the code. Intel states that unpredictable
general protection faults may result if a synchronizing instruction
(iret, int, int3, cpuid, etc.) is not executed on the second processor
before it executes the prefetched out-of-date copy of the instruction.

When we became aware of this I had a long discussion with Intel's
microarchitecture guys. It turns out that the reason for this erratum
(which incidentally Intel does not intend to fix) is that the trace
cache - the stream of micro-ops resulting from instruction
interpretation - cannot be guaranteed to be valid.
Reading between the lines, I assume this issue arises because of
optimization done in the trace cache, where it is no longer possible to
identify the original instruction boundaries. If the CPU discovers that
the trace cache has been invalidated because of unsynchronized
cross-modification, then instruction execution will be aborted with a
GPF. Further discussion with Intel revealed that replacing the first
opcode byte with an int3 would not be subject to this erratum.

So, is cmpxchg reliable? One has to guarantee more than mere atomicity.

--
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072


From:    Andi Kleen
Date:    31/07/2005 23:03
To:      Mathieu Desnoyers
cc:      Andi Kleen, Karim Yaghmour, Masami Hiramatsu, Masami Hiramatsu,
         Roland McGrath, Richard J Moore/UK/IBM@IBMGB,
         systemtap@sources.redhat.com, sugita@sdl.hitachi.co.jp,
         Satoshi Oshima, michel.dagenais@polymtl.ca
Subject: Re: Hitachi djprobe mechanism

On Sat, Jul 30, 2005 at 12:47:47PM -0400, Mathieu Desnoyers wrote:
> * Andi Kleen (ak@suse.de) wrote:
> > > As I see it, the write in memory is atomic, but not the instruction
> > > fetching. In that case, the reader would see an inconsistent last
> > > jmp address byte.
> >
> > Yes, you're right. cmpxchg only helps when the replaced instruction
> > is >= the new instruction. For smaller instructions only an IPI to
> > stop all CPUs works.
> >
>
> It was not exactly the point of my comment. If we try to overwrite an
> existing instruction, without any marker, two cases may show up:
>
> * the instruction to replace is >= the jmp instruction (5 bytes)
>
> It has been suggested that using cmpxchg8b would solve this problem.
> cmpxchg8b does indeed commit 8 bytes of data to memory atomically, even
> on 32-bit architectures.
>
> My question is related to the instruction we want to replace: how is it
> read by the CPU?
> If it's 5 bytes in size, it has to be read in two chunks by the CPU on
> a 32-bit arch. Does the CPU lock the memory bus between those two
> reads?

A 32-bit ISA has nothing to do with how the CPU fetches instructions
("32-bit" x86s usually have a much wider memory interface). In general
these things are done on cache lines of between 32 and 128 bytes,
depending on the CPU.

Of course cache lines can be crossed by instructions, but the CPU
should handle that atomically. However, there is no guarantee for that
in the architecture afaik, so you cannot really rely on it. If, let's
say, the 386 had this behaviour then it is probably safe to assume later
x86s implement it too for compatibility (modulo bugs).

In practice it's more complicated. The CPU fetches the instruction into
its pipeline some time before actually executing it, then snoops the
bus for any modifications of it, and cancels and re-executes the
instruction if needed. However, when you look at CPU errata sheets you
will find quite a lot of bugs in this area, so I would not really rely
on frequent patching for production.

I think just using the IPI is much simpler and easier.

> * the instruction to replace is < the jmp instruction (4 bytes or less)
>
> If our goal is to overwrite code which has not been surrounded by a
> marker, an IPI wouldn't save us here. The marker is necessary in order
> to disable interrupts and make the IPI meaningful.

You lost me here.

> >
> > Actually there may be tricks possible to first int3 (or equivalent
> > single byte replacement on other archs) the second instruction,
> > then the first, then wait for an RCU period for all CPUs to reach
> > quiescence and then write the longer jump. But an IPI is probably
> > easier because it doesn't need a full disassembler for this and
> > setting probes should not be performance critical.
> >
>
> Well, in fact, there is still a problem. (oh no, not again!)
> ;) RCU does require the reader to disable preemption, otherwise there
> is no guarantee that it won't be scheduled out in the middle of the
> critical section, and RCU only guarantees that a non-schedulable
> reader will have finished by the time the RCU period is over.
>
> How do you plan to disable involuntary preemption around the
> instructions you want to overwrite?

One way would be to just search the task list for any tasks blocked
with an IP inside the patched region. If there are any, wait again for
another quiescent period.

-Andi
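
Editor's sketch of the ordering discussed in this thread: write int3 over
the first opcode byte, wait for every CPU, check that no blocked task
would resume in the middle of the region, then write the jump tail and
finally the jump opcode. This is a userspace simulation on a byte buffer,
not kernel code. The names patch_jmp, wait_for_all_cpus and
ip_inside_patch are hypothetical, instruction pointers are modelled as
buffer offsets, and treating an IP that sits exactly on the first byte as
safe is an assumption of this sketch rather than something stated in the
thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC /* x86 one-byte breakpoint */
#define JMP_OPCODE  0xE9 /* x86 jmp rel32 */
#define JMP_LEN     5    /* opcode byte + 32-bit displacement */

/* Hypothetical stand-in for "IPI all CPUs" or "wait for a quiescent
 * period". In this simulation it only counts how often it was called. */
static int sync_calls;
static void wait_for_all_cpus(void) { sync_calls++; }

/* Assumption: a task is only at risk if its saved IP points past the
 * first byte but still inside the bytes being rewritten. */
static int ip_inside_patch(size_t ip, size_t site, size_t len)
{
    return ip > site && ip < site + len;
}

/* Replace the bytes at code[site..site+4] with "jmp target".
 * Returns 0 on success, or -1 if a blocked task would resume inside the
 * region; in that case the int3 stays in place and the caller is
 * expected to wait for another quiescent period and retry. */
static int patch_jmp(uint8_t *code, size_t code_len, size_t site,
                     size_t target, const size_t *blocked_ips, size_t n_ips)
{
    /* rel32 is relative to the end of the jmp; unsigned wraparound
     * yields the two's-complement encoding for backward jumps. */
    uint32_t rel = (uint32_t)(target - (site + JMP_LEN));
    size_t i;

    assert(site + JMP_LEN <= code_len);

    /* Step 1: single-byte write, the case said to be safe above. */
    code[site] = INT3_OPCODE;
    wait_for_all_cpus();

    /* Step 2: scan the (simulated) task list for IPs inside the region. */
    for (i = 0; i < n_ips; i++)
        if (ip_inside_patch(blocked_ips[i], site, JMP_LEN))
            return -1;

    /* Step 3: write the displacement, little-endian, behind the int3. */
    for (i = 0; i < 4; i++)
        code[site + 1 + i] = (uint8_t)(rel >> (8 * i));
    wait_for_all_cpus();

    /* Step 4: single-byte write that turns the int3 into the jmp. */
    code[site] = JMP_OPCODE;
    return 0;
}
```

In a real kernel the int3 handler would also have to do something
sensible for CPUs that hit the breakpoint while patching is in progress;
that part is left out here.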