* Re: boosting with preemptable kernel
       [not found] <20061117235807.GB11523@urbana.css.mot.com>
@ 2006-11-21 6:49 ` Masami Hiramatsu
  2007-01-03 22:39   ` Quentin Barnes
  0 siblings, 1 reply; 5+ messages in thread
From: Masami Hiramatsu @ 2006-11-21 6:49 UTC
To: Quentin Barnes; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

Hi Quentin,

Actually, I discussed a similar idea internally with Satoshi.
And we decided the current approach is better than that, because:

Trampoline approach
- depends on the architecture; it is not a general solution.
- increases overhead by executing trampoline code.
- is useful ONLY for booster, but is NOT useful for djprobe.

Garbage Collector (and Safety check routine)
- does NOT depend on the architecture.
- does NOT increase overhead when probing.
- is useful not only for booster but also for djprobe.

Thus, I posted the GC patch, and it was merged into the -mm tree.
http://sources.redhat.com/ml/systemtap/2006-q4/msg00453.html

And, I know there is a weak point in the GC. It takes
time to freeze processes. But I tried to reduce the
frequency of executing the GC by using the "dirty" flag and
some conditions. I also think we can control when it
works by introducing a commit_kprobes() interface.

> (If you'd like, you can copy any or all of this post to the
> systemtap mailing list.)

Thanks, I have Cc'd this to the SystemTap ML.

Best regards,

Quentin Barnes wrote:
> I've been working with someone else for a few weeks on an ARM
> kprobes port. We initially started independently unaware of each
> other. Once we found each other, we've been merging our approaches
> and efforts.
>
> In blending our two approaches, we have a possible design
> alternative that offers the speed of boosting that should work
> under preemptable and MP kernels. (I say "should" because we
> haven't done a live test yet, but we think it is workable.)
>
> I read your post ("[RFC][PATCH][kprobe] enabling booster on the
> preemptible kernel, take 2") to fix the current design limitation in
> boosting on preemptable and MP kernels. I think this approach as
> laid out isn't the best approach. It seems too expensive to solve
> the problem. I wanted to bounce the idea of our approach off you
> and see what you think of it.
>
> Instead of returning from kprobe_handler(), freeing locks, and
> restarting the instruction in the instruction slot under its
> own context, the alternate approach runs the instruction at
> kprobe_handler() invocation time before the function returns from
> the exception. That way preemption is still held disabled and
> the resource lock is still held on the slot when the instruction
> runs. To run the instruction before returning requires enough of
> the context from the "regs" parameter to be restored by a
> trampoline, the instruction in the instruction slot to be jumped to
> and then back to another trampoline that saves any updates to the
> register state back into "regs". It's a very simple design that
> solves the problem cleanly by eliminating any races since no locks
> (preemption or instruction slot) are released in the interim. The
> only drawback I see is that it requires some assembly in the
> architecture dependent implementation code for the two trampolines.
>
> Has an approach like the above been discussed? Is there a
> reason to prefer the garbage collection approach over an approach
> like the above?
>
> Now the above approach doesn't handle all instructions including
> those that read or write the PC. Those are handled by another very
> similar approach except it simulates execution of the instruction
> rather than vectoring through the trampolines.
>
> (If you'd like, you can copy any or all of this post to the
> systemtap mailing list.)
>
> Quentin Barnes
>

-- 
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
* Re: boosting with preemptable kernel
  2006-11-21 6:49 ` boosting with preemptable kernel Masami Hiramatsu
@ 2007-01-03 22:39   ` Quentin Barnes
  2007-01-14 3:13     ` Masami Hiramatsu
  0 siblings, 1 reply; 5+ messages in thread
From: Quentin Barnes @ 2007-01-03 22:39 UTC
To: Masami Hiramatsu; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

>Hi Quentin,
>
>Actually, I discussed a similar idea internally with Satoshi.
>And we decided the current approach is better than that, because:
>
>Trampoline approach
>- depends on the architecture; it is not a general solution.
>- increases overhead by executing trampoline code.
>- is useful ONLY for booster, but is NOT useful for djprobe.
>
>Garbage Collector (and Safety check routine)
>- does NOT depend on the architecture.
>- does NOT increase overhead when probing.
>- is useful not only for booster but also for djprobe.
>
>Thus, I posted the GC patch, and it was merged into the -mm tree.
>http://sources.redhat.com/ml/systemtap/2006-q4/msg00453.html
>
>And, I know there is a weak point in the GC. It takes
>time to freeze processes. But I tried to reduce the
>frequency of executing the GC by using the "dirty" flag and
>some conditions. I also think we can control when it
>works by introducing a commit_kprobes() interface.
>
>> (If you'd like, you can copy any or all of this post to the
>> systemtap mailing list.)
>
>Thanks, I have Cc'd this to the SystemTap ML.

I didn't see your reply until about three weeks after you wrote it.
I was out on Thanksgiving break when it was sent and I missed it.
I still waited to reply to your note until I got my ARM kprobes
work further along.

I'll go ahead and reply to the mailing list too. This is my first
note to the list, so I hope it gets through.

You're right, a trampoline approach is architecture-specific, but
so is much of the kprobes support code already, especially when
having to deal with jprobes and kretprobes often requiring arch
support code to be written in assembly. But I also understand the
less we have that's architecture dependent or in assembly the
better.

I disagree that such an approach increases overhead by any
significant amount and is only useful for booster code, but more on
this later down in the mail at the end.

I really disagree with the garbage collector approach. On some
of our projects we use the Linux kernel in a soft real-time
environment. A GC introduces sporadic indefinite postponement
problems. It doesn't matter at all how often it runs -- once is too
much. The basic reason for having a preemptive kernel is to reduce
indefinite postponement. The GC is most disruptive on preemptive
kernels working against the very reason for using the preemptive
feature to begin with. (It's so disruptive on preemptive kernels
because it calls freeze_processes() affecting the entire system. On
non-preemptive kernels it uses the passive synchronize_sched().)

Aside from RT related impact issues the GC causes, it also degrades
poorly on SMP systems. It causes the entire system across all CPUs
to be forced to stop executing and all be held in an idle state
simultaneously. The reason for having multiple CPUs is so that a
thread tying up one CPU doesn't impact the performance (much) of
the rest of the system. Now we have a single thread being able to
impact the performance across all CPUs system-wide simultaneously.
(If my understanding of the GC implementation is wrong, please
correct me. It's only from my limited understanding of reading and
following the code, not from using it.)

What I would recommend is that the GC and its hooks be able to be
conditionally enabled or disabled with an ifdef, probably in the
arch kprobes.h file like insn slot and kretprobes are now. That way
architectures that don't need it because they have their own
alternate approach can choose not to use it. Would that be a
reasonable suggestion?

About your last bullet mentioning djprobes, the ARM implementation
of kprobes I'm working on does have support for kretprobes and
jprobes, but not djprobes. I have read your post from Oct '05 on
djprobes and understand it in a general way, but haven't yet wrapped
my mind fully around it. Could you explain why such an approach
doesn't work for djprobes on x86? Do you imagine that assertion
would hold for other non-x86 architectures as well?

Quentin

=-=-=

Below is some background on my ARM kprobes working design. Reading
it might be helpful when discussing the above issues. I pulled the
text out and put it down here so people wouldn't be forced to wade
through it in following the above discussion.

The basic implementation of kprobes on ARM has been complex,
regardless of approach chosen, due to the inherent nature of the
processor's design.

For some background information on the ARM architecture, the
current processors have no single-step mode and have no "next PC"
register, so there's no way to regain control of the processor after
executing instructions that modify the PC. All branches, calls,
and other writes to the PC must be decoded and detected at
arch_prepare_kprobes() time and simulated by software at
kprobe_handler() time. Also any instruction run from the execute
slot that reads the PC's value would give a "wrong" result. They
must also be detected at preparation time and have their results
adjusted at execution time.

Most popular ARM instructions are capable of legally having R15
(the PC) as either a source (read) or destination (write) register,
so regardless of approach, the logic to decode many of the ARM
instructions already has to be written. Extending it to decode and
handle the additional instructions that aren't capable of reading or
writing the PC wasn't that much additional work.

At kprobe_handler() time to execute the kprobed instruction and
return, my ARM design does not have a single generic trampoline.
I originally tried this, but found it too wasteful for my tastes
in the general case to do correctly. (Still, it wasn't that bad.
The whole thing was about 35 machine instructions with 14 written
in assembly, but it did do a fair amount of often unnecessary data
slinging though.)

What I do instead is to have individual trampolines tuned for
specific classes of instructions. The ARM is a RISCish
architecture. Its instructions fall into easily decoded classes of
instructions and with its registers in specific fields within the
instruction. The individually tuned trampolines avoid unnecessary
loading up and saving back of the pre-exception register state.

The rewritten instructions saved in the insn slots at preparation
time use preassigned fixed registers (R0, R1, etc.). Each tuned
trampoline assigned at preparation time loads up the necessary
register state into the specific fixed registers for its class,
executes the rewritten instruction, and writes back the changed
register state.

At kprobe_handler() time, the kprobe handler calls the insn slot
handler by dereferencing its pointer saved in the
"arch_specific_insn" structure at preparation time. The handler
loads up its zero to four registers (depending on the class of the
instruction it's written for) from the pt_regs save area and then
writes back zero to four registers. The entire execution handler
(all pre- and post-instruction work, loading and saving register
state, and the rewritten instruction) generally takes between 10 and
20 ARM instructions in total to run. That number of instructions is
just in the noise for that of the kprobe handler. Although
technically it does increase the execution time of the
kprobe_handler(), the increase is so small that it cannot in any way
be considered significant.

Also with my approach, no classic boosting is ever necessary. All
instructions are effectively already "boosted" (i.e.: no secondary
breakpoint exception is ever needed or taken.) All kprobed
instructions always complete before returning from the primary
kprobe exception. The kprobe exception always returns to the
instruction to be executed next after the kprobed instruction.

This design also works exactly the same way on SMP and single
processors, as well as on preemptive and non-preemptive kernels, so
no ifdef'ing is necessary.

This work may sound like a lot, but it's really not. All decoding
and instruction rewriting software for all the ARM instructions plus
all the individually tuned trampolines when compiled fits in less
than 4KB of text space with no data space at all. That doesn't
even account for having simpler kprobes arch support code in
arch/arm/kernel/kprobes.c. That's smaller since it never has to
handle re-entering itself via a secondary breakpoint, nor does it
have to support classic boost mode.

I don't consider myself a Linux kernel internals expert though. In
my approach, I may have made an invalid assumption about the
kernel's behavior somewhere, possibly in boundary cases. This is my
biggest concern. As my testing moves forward, I'll see if anything
turns up.

I could go into more detail of the design, but I'm sure I've bored
and confused enough people already.

>Best regards,
>
>--
>Masami HIRAMATSU
>Linux Technology Center
>Hitachi, Ltd., Systems Development Laboratory
>E-mail: masami.hiramatsu.pt@hitachi.com
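The per-class execution handlers described in the background notes above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the ARM port's code: `struct sim_regs` stands in for `pt_regs`, `struct sim_insn` stands in for the per-probe `arch_specific_insn` state, and the "rewritten instruction" is modelled as a plain C function pointer because a real insn slot cannot be executed portably from user space. All `sim_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved exception context (pt_regs). */
struct sim_regs {
    uint32_t r[16];                       /* r[15] models the saved PC */
};

struct sim_insn;
typedef void (*sim_handler_t)(const struct sim_insn *, struct sim_regs *);

/* Hypothetical stand-in for the per-probe state filled in at
 * preparation time (the thread's "arch_specific_insn"). */
struct sim_insn {
    sim_handler_t handler;                /* handler tuned for the instruction class */
    uint32_t (*op)(uint32_t, uint32_t);   /* models the rewritten instruction */
    unsigned rd, rn, rm;                  /* register fields decoded from the original */
};

/* Two example "rewritten instructions". */
uint32_t sim_op_add(uint32_t a, uint32_t b) { return a + b; }
uint32_t sim_op_mov(uint32_t a, uint32_t b) { (void)a; return b; }

/* Class "rd = op(rn, rm)": load two registers, write one back. */
void sim_handle_rd_rn_rm(const struct sim_insn *insn, struct sim_regs *regs)
{
    uint32_t a = regs->r[insn->rn];
    uint32_t b = regs->r[insn->rm];
    regs->r[insn->rd] = insn->op(a, b);
}

/* Class "rd = op(rm)": load one register, write one back. */
void sim_handle_rd_rm(const struct sim_insn *insn, struct sim_regs *regs)
{
    regs->r[insn->rd] = insn->op(0, regs->r[insn->rm]);
}

/* Models the probe hit: the instruction's effects complete before the
 * handler returns, and the saved PC is advanced to the next instruction,
 * so no second breakpoint is needed. Instructions touching the PC are
 * excluded here; per the thread they are simulated by separate code. */
void sim_kprobe_handler(const struct sim_insn *insn, struct sim_regs *regs)
{
    assert(insn->rd != 15 && insn->rn != 15 && insn->rm != 15);
    insn->handler(insn, regs);
    regs->r[15] += 4;                     /* ARM instructions are 4 bytes wide */
}
```

With r1 = 40, r2 = 2 and a probe modelling `add r0, r1, r2` at 0x8000, one call to `sim_kprobe_handler()` leaves r0 = 42 and the saved PC at 0x8004. Selecting the handler once at preparation time keeps the probe-hit path to a single indirect call plus the few register moves that class actually needs.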
* Re: boosting with preemptable kernel
  2007-01-03 22:39   ` Quentin Barnes
@ 2007-01-14 3:13     ` Masami Hiramatsu
  2007-01-27 6:38       ` Quentin Barnes
  0 siblings, 1 reply; 5+ messages in thread
From: Masami Hiramatsu @ 2007-01-14 3:13 UTC
To: Quentin Barnes; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

Hi Quentin,

Quentin Barnes wrote:
> I disagree that such an approach increases overhead by any
> significant amount and is only useful for booster code, but more on
> this later down in the mail at the end.
>
> I really disagree with the garbage collector approach. On some
> of our projects we use the Linux kernel in a soft real-time
> environment. A GC introduces sporadic indefinite postponement
> problems. It doesn't matter at all how often it runs -- once is too
> much. The basic reason for having a preemptive kernel is to reduce
> indefinite postponement. The GC is most disruptive on preemptive
> kernels working against the very reason for using the preemptive
> feature to begin with. (It's so disruptive on preemptive kernels
> because it calls freeze_processes() affecting the entire system. On
> non-preemptive kernels it uses the passive synchronize_sched().)

Indeed.

> Aside from RT related impact issues the GC causes, it also degrades
> poorly on SMP systems. It causes the entire system across all CPUs
> to be forced to stop executing and all be held in an idle state
> simultaneously. The reason for having multiple CPUs is so that a
> thread tying up one CPU doesn't impact the performance (much) of
> the rest of the system. Now we have a single thread being able to
> impact the performance across all CPUs system-wide simultaneously.
> (If my understanding of the GC implementation is wrong, please
> correct me. It's only from my limited understanding of reading and
> following the code, not from using it.)

I think you are a bit misreading the GC implementation.
First, this GC is never invoked when the kprobe is hit.
This GC may be invoked when you register/unregister a kprobe.
I think these operations are not frequently done and it will
be done when the module is loading/unloading.

Next, this GC will be invoked if there ARE some garbage (dirty)
slots when you get/free an insn slot. Thus, if your kprobe cleans
its slot up before releasing it, your kprobe NEVER invokes the GC.

> What I would recommend is that the GC and its hooks be able to be
> conditionally enabled or disabled with an ifdef, probably in the
> arch kprobes.h file like insn slot and kretprobes are now. That way
> architectures that don't need it because they have their own
> alternate approach can choose not to use it. Would that be a
> reasonable suggestion?

Currently, only the slots used by boosted kprobes are dirty.
So, on the ARM architecture, if you'd like to disable the GC,
you just call free_insn_slot with dirty=0, as below.
 free_insn_slot(insn_slot, 0);

On i386, you can prohibit boosting kprobes by specifying a void
post_handler to the kprobes. And then, the GC will not work any more.

If you feel a strong need for disabling the GC at compile time,
I can add a compile flag for disabling the GC.
Or, could you write the trampoline routine and the
compile flag for switching the trampoline and the GC?

> About your last bullet mentioning djprobes, the ARM implementation
> of kprobes I'm working on does have support for kretprobes and
> jprobes, but not djprobes. I have read your post from Oct '05 on
> djprobes and understand it in a general way, but haven't yet wrapped
> my mind fully around it. Could you explain why such an approach
> doesn't work for djprobes on x86? Do you imagine that assertion
> would hold for other non-x86 architectures as well?

Djprobe has to rewrite multiple instructions on i386 (CISC), because
the size of the long jump instruction is bigger than many instructions.
Before rewriting those instructions, we must ensure no other processes
are running/sleeping on the target instructions which will be rewritten
by the long jump. This case cannot be helped by the trampoline routine.

Best regards,

-- 
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
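The dirty-slot accounting described in this message can be modelled in a few lines of user-space C. This is a rough sketch, not the code from the -mm patch: the pool, the `sim_` names and the policy of collecting only when no free slot is left are all assumptions made for illustration. What it tries to show is the point made above: a slot freed with dirty=0 never adds to the garbage count, so the expensive collection step is never reached.

```c
#include <assert.h>

#define SIM_NSLOTS 4

enum sim_state { SIM_FREE, SIM_USED, SIM_DIRTY };

/* Hypothetical model of an insn slot pool. */
struct sim_pool {
    enum sim_state state[SIM_NSLOTS];
    int garbage_slots;   /* models the dirty-slot counter */
    int gc_runs;         /* counts how often the costly step ran */
};

/* Models the costly step. In the kernel this is where the system is
 * synchronised (frozen, on preemptive kernels) before slots are reclaimed. */
void sim_collect_garbage(struct sim_pool *p)
{
    for (int i = 0; i < SIM_NSLOTS; i++)
        if (p->state[i] == SIM_DIRTY)
            p->state[i] = SIM_FREE;
    p->garbage_slots = 0;
    p->gc_runs++;
}

/* Returns a slot index, or -1 if the pool is exhausted. */
int sim_get_insn_slot(struct sim_pool *p)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < SIM_NSLOTS; i++) {
            if (p->state[i] == SIM_FREE) {
                p->state[i] = SIM_USED;
                return i;
            }
        }
        /* No free slot: collect only if there is garbage to reclaim. */
        if (pass == 0 && p->garbage_slots > 0)
            sim_collect_garbage(p);
        else
            break;
    }
    return -1;
}

/* dirty != 0 means the slot may still be in use by a preempted task and
 * must go through collection; dirty == 0 frees it immediately. */
void sim_free_insn_slot(struct sim_pool *p, int slot, int dirty)
{
    assert(slot >= 0 && slot < SIM_NSLOTS && p->state[slot] == SIM_USED);
    if (dirty) {
        p->state[slot] = SIM_DIRTY;
        p->garbage_slots++;
    } else {
        p->state[slot] = SIM_FREE;
    }
}
```

In this model, any number of get/free cycles with dirty=0 leaves `gc_runs` at zero, while filling the pool, freeing every slot with dirty=1 and then requesting one more slot triggers exactly one collection.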
* Re: boosting with preemptable kernel
  2007-01-14 3:13     ` Masami Hiramatsu
@ 2007-01-27 6:38       ` Quentin Barnes
  2007-01-30 5:05         ` Masami Hiramatsu
  0 siblings, 1 reply; 5+ messages in thread
From: Quentin Barnes @ 2007-01-27 6:38 UTC
To: Masami Hiramatsu; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

>> Aside from RT related impact issues the GC causes, it also degrades
>> poorly on SMP systems. It causes the entire system across all CPUs
>> to be forced to stop executing and all be held in an idle state
>> simultaneously. The reason for having multiple CPUs is so that a
>> thread tying up one CPU doesn't impact the performance (much) of
>> the rest of the system. Now we have a single thread being able to
>> impact the performance across all CPUs system-wide simultaneously.
>> (If my understanding of the GC implementation is wrong, please
>> correct me. It's only from my limited understanding of reading and
>> following the code, not from using it.)
>
>I think you are a bit misreading the GC implementation.
>First, this GC is never invoked when the kprobe is hit.
>This GC may be invoked when you register/unregister a kprobe.
>I think these operations are not frequently done and it will
>be done when the module is loading/unloading.

I had followed all that.

>Next, this GC will be invoked if there ARE some garbage (dirty)
>slots when you get/free an insn slot. Thus, if your kprobe cleans
>its slot up before releasing it, your kprobe NEVER invokes the GC.

Ah, I did not follow this. I see now. The variable
"kprobe_garbage_slots" keeps track of the dirty count to avoid
invoking the GC when the count is zero.

>> What I would recommend is that the GC and its hooks be able to be
>> conditionally enabled or disabled with an ifdef, probably in the
>> arch kprobes.h file like insn slot and kretprobes are now. That way
>> architectures that don't need it because they have their own
>> alternate approach can choose not to use it. Would that be a
>> reasonable suggestion?
>
>Currently, only the slots used by boosted kprobes are dirty.
>So, on the ARM architecture, if you'd like to disable the GC,
>you just call free_insn_slot with dirty=0, as below.
> free_insn_slot(insn_slot, 0);

Yes, when the ARM code is updated to have the second parameter to
free_insn_slot(), it will always be 0.

>On i386, you can prohibit boosting kprobes by specifying a void
>post_handler to the kprobes. And then, the GC will not work any more.

Having a post-handler defeats boosting which then "defeats" the GC.
For our platform use though, we're ARM only.

>If you feel a strong need for disabling the GC at compile time,
>I can add a compile flag for disabling the GC.

Since the GC will be inactive for ARM with only a trivial execution
charge to check and maintain the "kprobe_garbage_slots" variable,
it's not that big a deal. The amount of dead code is pretty minor
too. At this point I'd say it's not worth putting a compile time
flag around.

>Or, could you write the trampoline routine and the
>compile flag for switching the trampoline and the GC?

In my approach, there's no single "trampoline" in the classic
sense. The kprobe'd instruction's effects always complete before
kprobe_handler() returns, so there is no system state to hold and
manage across re-entry for the same kprobe's continued processing.
This allows kprobe_handler() and its associated management
functions and saved data state to be greatly simplified and
reduced. There's no way to just "throw a switch" to put all that
complexity back. It would almost have to be two completely
different implementations.

>> About your last bullet mentioning djprobes, the ARM implementation
>> of kprobes I'm working on does have support for kretprobes and
>> jprobes, but not djprobes. I have read your post from Oct '05 on
>> djprobes and understand it in a general way, but haven't yet wrapped
>> my mind fully around it. Could you explain why such an approach
>> doesn't work for djprobes on x86? Do you imagine that assertion
>> would hold for other non-x86 architectures as well?
>
>Djprobe has to rewrite multiple instructions on i386 (CISC), because
>the size of the long jump instruction is bigger than many instructions.
>Before rewriting those instructions, we must ensure no other processes
>are running/sleeping on the target instructions which will be rewritten
>by the long jump. This case cannot be helped by the trampoline routine.

Yes, that's a very, very messy problem on the x86. Up to five(?)
instructions could be overwritten by the long jump instruction.

I'm still not sure though I see why a trampoline approach wouldn't
work for x86. It would just have to iterate the trampoline up to
five times. But I still don't have the model clear enough in my
head yet. Maybe it will become clearer over time.

As long as the kernel text address space stays under 32MB, an
ARM djprobe implementation would be a one-for-one instruction
replacement.

I'm still absorbing the djprobes explanation (djprobe-20051031.txt)
and perusing the patch you sent out Nov 21. Sorry if the following
question has already been discussed. If it has, just point me to
it. Is there a reason djprobes needs its own, separate interface?
Could it just use the kprobes registration service and have the
kprobes code decide whether to implement a given probe as a kprobe
with an exception or djprobe with a direct jump? Or is this a long
term goal after shaking out the djprobes model?

>Best regards,
>
>--
>Masami HIRAMATSU
>Linux Technology Center
>Hitachi, Ltd., Systems Development Laboratory
>E-mail: masami.hiramatsu.pt@hitachi.com

Quentin
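The 32MB figure in this message matches the reach of the ARM `B` instruction, which carries a signed 24-bit word offset relative to the instruction address plus 8. The sketch below is an illustration only, not code from any djprobe patch: `sim_arm_encode_branch` is a hypothetical helper that checks whether a single unconditional branch placed at `pc` can reach `target`, and produces the encoding if it can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to encode an unconditional ARM "B target" located at address pc.
 * Returns false when the target is misaligned or out of range. */
bool sim_arm_encode_branch(uint32_t pc, uint32_t target, uint32_t *insn)
{
    /* The offset is measured from pc + 8 because of the ARM pipeline. */
    int64_t offset = (int64_t)target - ((int64_t)pc + 8);

    if (offset % 4 != 0)
        return false;                 /* ARM instructions are word aligned */

    /* imm24 is a signed word count: byte range is [-2^25, 2^25 - 4]. */
    if (offset < -(1 << 25) || offset > (1 << 25) - 4)
        return false;

    /* Condition AL (0xE), opcode bits 101, link bit clear, then imm24. */
    *insn = 0xEA000000u | ((uint32_t)(offset / 4) & 0x00FFFFFFu);
    return true;
}
```

A branch to itself encodes as 0xEAFFFFFE, and a target exactly 2^25 bytes beyond `pc + 8` is rejected. Whether a one-for-one replacement is possible for a given probe therefore depends on where the jump destination is placed relative to the probed address.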
* Re: boosting with preemptable kernel
  2007-01-27 6:38       ` Quentin Barnes
@ 2007-01-30 5:05         ` Masami Hiramatsu
  0 siblings, 0 replies; 5+ messages in thread
From: Masami Hiramatsu @ 2007-01-30 5:05 UTC
To: Quentin Barnes; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

Hi Quentin,

Quentin Barnes wrote:
>> Djprobe has to rewrite multiple instructions on i386 (CISC), because
>> the size of the long jump instruction is bigger than many instructions.
>> Before rewriting those instructions, we must ensure no other processes
>> are running/sleeping on the target instructions which will be rewritten
>> by the long jump. This case cannot be helped by the trampoline routine.
>
> Yes, that's a very, very messy problem on the x86. Up to five(?)
> instructions could be overwritten by the long jump instruction.
>
> I'm still not sure though I see why a trampoline approach wouldn't
> work for x86. It would just have to iterate the trampoline up to
> five times. But I still don't have the model clear enough in my
> head yet. Maybe it will become clearer over time.

A similar idea was discussed with Karim in the thread below:
http://sourceware.org/ml/systemtap/2006-q3/msg00725.html

> As long as the kernel text address space stays under 32MB, an
> ARM djprobe implementation would be a one-for-one instruction
> replacement.

Unfortunately, the djprobe allocates its stub buffer from the
module area. Is it included in the 32MB area on ARM?

> I'm still absorbing the djprobes explanation (djprobe-20051031.txt)
> and perusing the patch you sent out Nov 21. Sorry if the following
> question has already been discussed. If it has, just point me to
> it. Is there a reason djprobes needs its own, separate interface?
> Could it just use the kprobes registration service and have the
> kprobes code decide whether to implement a given probe as a kprobe
> with an exception or djprobe with a direct jump? Or is this a long
> term goal after shaking out the djprobes model?

Before I sent the latest patch, we discussed that.
The patch I sent on Nov 21 integrated the djprobe's
interface into the kprobe's interface.
If the user sets the length of the instructions (which will be
replaced by a long jump) in the length member of the kprobe data
structure, that kprobe will be optimized (become a djprobe) after
invoking commit_kprobes().

Best regards,

-- 
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
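The "length of the instructions which will be replaced by a long jump" can be worked out from the lengths of the instructions at the probe address: an i386 `jmp rel32` is 5 bytes (opcode 0xE9 plus a 32-bit displacement), so the replaced region must cover at least 5 bytes and end on an instruction boundary. The helper below is a hypothetical sketch of that arithmetic only. It assumes the instruction lengths are supplied by some instruction decoder, and it is not taken from the djprobe patch.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_JMP_LEN 5   /* i386 "jmp rel32": 0xE9 + 32-bit displacement */

/* insn_len[] holds the byte lengths of the consecutive instructions that
 * start at the probe address. Returns the number of bytes of whole
 * instructions the jump will replace and stores how many instructions
 * that is in *count; returns 0 if the listed instructions are too short. */
size_t sim_replaced_length(const unsigned char *insn_len, size_t n,
                           size_t *count)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        total += insn_len[i];
        if (total >= SIM_JMP_LEN) {
            *count = i + 1;
            return total;
        }
    }
    *count = 0;
    return 0;
}
```

Five one-byte instructions give a length of 5 covering five instructions, which is the worst case mentioned in the quoted text; a 2-byte instruction followed by a 6-byte one gives a length of 8 covering two instructions.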