Re: boosting with preemptable kernel

public inbox for systemtap@sourceware.org

* Re: boosting with preemptable kernel
       [not found] <20061117235807.GB11523@urbana.css.mot.com>
@ 2006-11-21  6:49 ` Masami Hiramatsu
  2007-01-03 22:39   ` Quentin Barnes
  0 siblings, 1 reply; 5+ messages in thread
From: Masami Hiramatsu @ 2006-11-21  6:49 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

Hi Quentin,

Actually, I discussed a similar idea internally with Satoshi.
And we decided the current approach is better than that, because,

Trampoline approach
- depends on the architecture; not a general solution.
- increases overhead by executing trampoline code.
- is useful ONLY for booster, but is NOT useful for djprobe.

Garbage Collector (and Safety check routine)
- does NOT depend on the architecture.
- does NOT increase overhead when probing.
- is useful for not only booster but also djprobe.

Thus, I posted the GC patch, and it was merged into the -mm tree.
http://sources.redhat.com/ml/systemtap/2006-q4/msg00453.html

And, I know there is a weak point in the GC: it takes
time to freeze processes. But I tried to reduce how
often the GC runs by using the "dirty" flag and
some conditions. I also think we can control when it
runs by introducing a commit_kprobes() interface.

> (If you'd like, you can copy any or all of this post to the
> systemtap mailing list.)

Thanks, I have Cc'd this to the SystemTap ML.

Best regards,

Quentin Barnes wrote:
> I've been working with someone else for a few weeks on an ARM
> kprobes port.  We initially started independently unaware of each
> other.  Once we found each other, we've been merging our approaches
> and efforts.
>
> In blending our two approaches, we have a possible design
> alternative that offers the speed of boosting and that should work
> under preemptable and MP kernels.  (I say "should" because we
> haven't done a live test yet, but we think it is workable.)
>
> I read your post ("[RFC][PATCH][kprobe] enabling booster on the
> preemptible kernel, take 2") to fix the current design limitation in
> boosting on preemptable and MP kernels.  I think this approach as
> laid out isn't the best approach.  It seems too expensive to solve
> the problem.  I wanted to bounce the idea of our approach off you and
> see what you think of it.
>
> Instead of returning from kprobe_handler(), freeing locks, and
> restarting the instruction in the instruction slot under its
> own context, the alternate approach runs the instruction at
> kprobe_handler() invocation time before the function returns from
> the exception.  That way preemption is still held disabled and
> the resource lock is still held on the slot when the instruction
> runs.  To run the instruction before returning requires that enough of
> the context from the "regs" parameter be restored by a trampoline,
> the instruction in the instruction slot be jumped to, and control then
> return to another trampoline that saves any updates to the register
> state back into "regs".  It's a very simple design that solves the
> problem cleanly by eliminating any races since no locks (preemption or
> instruction slot) are released in the interim.  The only drawback I
> see is that it requires some assembly in the architecture dependent
> implementation code for the two trampolines.
>
> Has an approach like the above been discussed?  Is there a
> reason to prefer the garbage collection approach over an approach
> like the above?
>
> Now the above approach doesn't handle every instruction; it excludes
> those that read or write the PC.  Those are handled by another very
> similar approach except it simulates execution of the instruction
> rather than vectoring through the trampolines.
>
> (If you'd like, you can copy any or all of this post to the
> systemtap mailing list.)
>
> Quentin Barnes
>
>


-- 
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com





* Re: boosting with preemptable kernel
  2006-11-21  6:49 ` boosting with preemptable kernel Masami Hiramatsu
@ 2007-01-03 22:39   ` Quentin Barnes
  2007-01-14  3:13     ` Masami Hiramatsu
  0 siblings, 1 reply; 5+ messages in thread
From: Quentin Barnes @ 2007-01-03 22:39 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

>Hi Quentin,
>
>Actually, I discussed a similar idea internally with Satoshi.
>And we decided the current approach is better than that, because,
>
>Trampoline approach
>- depends on the architecture; not a general solution.
>- increases overhead by executing trampoline code.
>- is useful ONLY for booster, but is NOT useful for djprobe.
>
>Garbage Collector (and Safety check routine)
>- does NOT depend on the architecture.
>- does NOT increase overhead when probing.
>- is useful for not only booster but also djprobe.
>
>Thus, I posted the GC patch, and it was merged into the -mm tree.
>http://sources.redhat.com/ml/systemtap/2006-q4/msg00453.html
>
>And, I know there is a weak point in the GC: it takes
>time to freeze processes. But I tried to reduce how
>often the GC runs by using the "dirty" flag and
>some conditions. I also think we can control when it
>runs by introducing a commit_kprobes() interface.
>
>> (If you'd like, you can copy any or all of this post to the
>> systemtap mailing list.)
>
>Thanks, I have Cc'd this to the SystemTap ML.

I didn't see your reply until about three weeks after you wrote it.
I was out on Thanksgiving break when it was sent and I missed it.  I
still waited to reply to your note until I got my ARM kprobes work
further along.  I'll go ahead and reply to the mailing list too.
This is my first note to the list, so I hope it gets through.

You're right, a trampoline approach is architecture-specific, but so
is much of the kprobes support code already, especially when having
to deal with jprobes and kretprobes often requiring arch support code
to be written in assembly.  But I also understand the less we have
that's architecture dependent or in assembly the better.

I disagree that such an approach increases overhead by any
significant amount and is only useful for booster code, but more on
this later down in the mail at the end.

I really disagree with the garbage collector approach.  On some
of our projects we use the Linux kernel in a soft real-time
environment.  A GC introduces sporadic indefinite postponement
problems.  It doesn't matter at all how often it runs -- once is too
much.  The basic reason for having a preemptive kernel is to reduce
indefinite postponement.  The GC is most disruptive on preemptive
kernels working against the very reason for using the preemptive
feature to begin with.  (It's so disruptive on preemptive kernels
because it calls freeze_processes() affecting the entire system.  On
non-preemptive kernels it uses the passive synchronize_sched().)

Aside from RT related impact issues the GC causes, it also degrades
poorly on SMP systems.  It causes the entire system across all CPUs
to be forced to stop executing and all held into an idle state
simultaneously.  The reason for having multiple CPUs is so that a
thread tying up one CPU doesn't impact the performance (much) of
the rest of the system.  Now we have a single thread being able to
impact the performance across all CPUs system-wide simultaneously.
(If my understanding of the GC implementation is wrong, please
correct me.  It's only from my limited understanding of reading and
following the code, not from using it.)

What I would recommend is that the GC and its hooks be able to be
conditionally enabled or disabled with an ifdef, probably in the
arch kprobes.h file like insn slot and kretprobes are now.  That way
architectures that don't need it, because they have their own alternate
approach, can choose not to use it.  Would that be a reasonable
suggestion?

About your last bullet mentioning djprobes, the ARM implementation
of kprobes I'm working on does have support for kretprobes and
jprobes, but not djprobes.  I have read your post from Oct '05 on
djprobes and understand it in a general way, but haven't yet wrapped
my mind fully around it.  Could you explain why such an approach
doesn't work for djprobes on x86?  Do you imagine that assertion
would hold for other non-x86 architectures as well?

Quentin

=-=-=
Below is some background on my ARM kprobes working design.  Reading
it might be helpful when discussing the above issues.  I pulled the
text out and put it down here so people wouldn't be forced to wade
through it in following the above discussion.

The basic implementation of kprobes on ARM has been complex,
regardless of approach chosen, due to the inherent nature of the
processor's design.  For some background information on the ARM
architecture, the current processors have no single-step mode and
have no "next PC" register, so there's no way to regain control of
the processor after executing instructions that modify the PC.  All
branches, calls, and other writes to the PC must be decoded and
detected at arch_prepare_kprobes() time and simulated by software at
kprobe_handler() time.  Also any instruction run from the execute
slot that reads the PC's value would give a "wrong" result.  Such
instructions must also be detected at preparation time and have their
results adjusted at execution time.

Most popular ARM instructions are capable of legally having R15
(the PC) as either a source (read) or destination (write) register,
so regardless of approach, the logic to decode many of the ARM
instructions already has to be written.  Extending it to decode and
handle the additional instructions that aren't capable of reading or
writing the PC wasn't that much additional work.
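
To make the preparation-time check concrete, here is a minimal sketch
in C of spotting R15 in one instruction class.  It is not code from the
ARM kprobes port.  It assumes the standard ARM data-processing encoding
with an immediate-shifted register operand (bits 27:25 == 000, bit 4 ==
0) and ignores the miscellaneous encodings that share that space.  The
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Register fields of an ARM data-processing instruction. */
#define ARM_RN(insn) (((insn) >> 16) & 0xfu)  /* first source operand */
#define ARM_RD(insn) (((insn) >> 12) & 0xfu)  /* destination */
#define ARM_RM(insn) ((insn) & 0xfu)          /* second source operand */
#define ARM_OPCODE(insn) (((insn) >> 21) & 0xfu)
#define ARM_PC 15u

/* True for the one class this sketch handles:
 * data-processing, register operand, shift by immediate. */
static bool is_dp_reg_imm_shift(uint32_t insn)
{
    return (insn & 0x0e000010u) == 0;
}

/* Does the instruction read R15 as a source? */
static bool dp_reads_pc(uint32_t insn)
{
    unsigned op = ARM_OPCODE(insn);
    /* MOV (0xd) and MVN (0xf) ignore the Rn field. */
    bool uses_rn = (op != 0xdu && op != 0xfu);

    assert(is_dp_reg_imm_shift(insn));
    return ARM_RM(insn) == ARM_PC || (uses_rn && ARM_RN(insn) == ARM_PC);
}

/* Does the instruction write R15 as its destination? */
static bool dp_writes_pc(uint32_t insn)
{
    unsigned op = ARM_OPCODE(insn);
    /* TST, TEQ, CMP, CMN (0x8..0xb) only set flags. */
    bool writes_rd = (op < 0x8u || op > 0xbu);

    assert(is_dp_reg_imm_shift(insn));
    return writes_rd && ARM_RD(insn) == ARM_PC;
}
```

For example, 0xE08F0001 (add r0, pc, r1) reads the PC and would need its
result adjusted, while 0xE1A0F00E (mov pc, lr) writes the PC and would
need to be simulated.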

At kprobe_handler() time to execute the kprobed instruction and
return, my ARM design does not have a single generic trampoline.
I originally tried this, but found it too wasteful for my tastes
in the general case to do correctly.  (Still, it wasn't that bad.
The whole thing was about 35 machine instructions with 14 written
in assembly, but it did do a fair amount of often unnecessary
data slinging though.)  What I do instead is to have individual
trampolines tuned for specific classes of instructions.  The ARM is
a RISCish architecture.  Its instructions fall into easily decoded
classes, with their registers in specific fields within the
instruction.

The individually tuned trampolines avoid unnecessary loading up and
saving back of the pre-exception register state.  The rewritten
instructions saved in the insn slots at preparation time use
preassigned fixed registers (R0, R1, etc.).  Each tuned trampoline
assigned at preparation time loads up the necessary register state
into the specific fixed registers for its class, executes the
rewritten instruction, and writes back the changed register state.

At kprobe_handler() time, the kprobe handler calls the insn
slot handler by dereferencing its pointer saved in the
"arch_specific_insn" structure at preparation time.  The handler
loads up its zero to four registers (depending on the class of
the instruction it's written for) from the pt_regs save area and
then writes back zero to four registers.  The entire execution
handler (all pre- and post-instruction work, loading and saving
register state, and the rewritten instruction) generally takes
between 10 and 20 ARM instructions in total to run.  That number of
instructions is just in the noise for that of the kprobe handler.
Although technically it does increase the execution time of the
kprobe_handler(), the increase is so small that it cannot in any way
be considered significant.
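
The flow described above can be modelled in user space roughly as
follows.  This is only a sketch under stated assumptions, not the ARM
port's code: struct regs_model stands in for pt_regs, struct insn_model
for "arch_specific_insn", and a plain C addition stands in for running
the rewritten instruction from the insn slot.  All names are
hypothetical, and only one instruction class (ADD Rd, Rn, Rm with no
shift and no flag update) is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the saved register state (r0..r15). */
struct regs_model {
    uint32_t r[16];
};

struct insn_model;
typedef void (*insn_handler_t)(const struct insn_model *,
                               struct regs_model *);

/* Stand-in for the per-probe data filled in at preparation time. */
struct insn_model {
    uint32_t insn;           /* original instruction word */
    insn_handler_t handler;  /* handler tuned for the insn's class */
};

/* Handler for the class "ADD Rd, Rn, Rm": loads two registers from
 * the save area, writes one back. */
static void handle_add_rd_rn_rm(const struct insn_model *p,
                                struct regs_model *regs)
{
    unsigned rn = (p->insn >> 16) & 0xfu;
    unsigned rd = (p->insn >> 12) & 0xfu;
    unsigned rm = p->insn & 0xfu;
    uint32_t fixed0 = regs->r[rn];  /* load into "fixed" registers */
    uint32_t fixed1 = regs->r[rm];

    fixed0 = fixed0 + fixed1;       /* stands in for the slot insn */
    regs->r[rd] = fixed0;           /* write back only what changed */
}

/* Preparation time: classify the instruction and pick its handler. */
static void prepare_model(struct insn_model *p, uint32_t insn)
{
    /* Only ADD (register, no shift, S == 0) is supported here. */
    assert((insn & 0x0ff00ff0u) == 0x00800000u);
    p->insn = insn;
    p->handler = handle_add_rd_rn_rm;
}

/* Probe-hit time: the instruction's effects complete before the
 * handler returns, then execution resumes at the next instruction. */
static void kprobe_handler_model(const struct insn_model *p,
                                 struct regs_model *regs)
{
    p->handler(p, regs);
    regs->r[15] += 4;
}
```

With r1 = 5, r2 = 7 and the probed instruction 0xE0810002 (add r0, r1,
r2), the model leaves r0 = 12 and advances the saved PC by 4, without
any second exception.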

Also with my approach, no classic boosting is ever necessary.
All instructions are effectively already "boosted" (i.e.: no
secondary breakpoint exception is ever needed or taken.)  All
kprobed instructions always complete before returning from the
primary kprobe exception.  The kprobe exception always returns to
the instruction to be executed next after the kprobed instruction.
This design also works exactly the same way on SMP and single
processors, as well as on preemptive and non-preemptive kernels, so
no ifdef'ing is necessary.

This work may sound like a lot, but it's really not.  All decoding
and instruction rewriting software for all the ARM instructions
plus all the individually tuned trampolines when compiled fits
in less than 4KB of text space with no data space at all.  That
doesn't even account for having simpler kprobes arch support code
in arch/arm/kernel/kprobes.c.  That's smaller since it never
has to handle re-entering itself via a secondary
breakpoint, nor support classic boost mode.

I don't consider myself a Linux kernel internals expert though.
In my approach, I may have made an invalid assumption about the
kernel's behavior somewhere, possibly in boundary cases.  This is my
biggest concern.  As my testing moves forward, I'll see if anything
turns up.

I could go into more detail of the design, but I'm sure I've bored
and confused enough people already.

>Best regards,
>
>-- 
>Masami HIRAMATSU
>Linux Technology Center
>Hitachi, Ltd., Systems Development Laboratory
>E-mail: masami.hiramatsu.pt@hitachi.com


* Re: boosting with preemptable kernel
  2007-01-03 22:39   ` Quentin Barnes
@ 2007-01-14  3:13     ` Masami Hiramatsu
  2007-01-27  6:38       ` Quentin Barnes
  0 siblings, 1 reply; 5+ messages in thread
From: Masami Hiramatsu @ 2007-01-14  3:13 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

Hi Quentin,

Quentin Barnes wrote:
> I disagree that such an approach increases overhead by any
> significant amount and is only useful for booster code, but more on
> this later down in the mail at the end.
>
> I really disagree with the garbage collector approach.  On some
> of our projects we use the Linux kernel in a soft real-time
> environment.  A GC introduces sporadic indefinite postponement
> problems.  It doesn't matter at all how often it runs -- once is too
> much.  The basic reason for having a preemptive kernel is to reduce
> indefinite postponement.  The GC is most disruptive on preemptive
> kernels working against the very reason for using the preemptive
> feature to begin with.  (It's so disruptive on preemptive kernels
> because it calls freeze_processes() affecting the entire system.  On
> non-preemptive kernels it uses the passive synchronize_sched().)

Indeed.

> Aside from RT related impact issues the GC causes, it also degrades
> poorly on SMP systems.  It causes the entire system across all CPUs
> to be forced to stop executing and all held into an idle state
> simultaneously.  The reason for having multiple CPUs is so that a
> thread tying up one CPU doesn't impact the performance (much) of
> the rest of the system.  Now we have a single thread being able to
> impact the performance across all CPUs system-wide simultaneously.
> (If my understanding of the GC implementation is wrong, please
> correct me.  It's only from my limited understanding of reading and
> following the code, not from using it.)

I think you are a bit misreading the GC implementation.
First, this GC is never invoked when the kprobe is hit.
This GC may be invoked when you register/unregister a kprobe.
I think these operations are infrequent and are typically
done when a module is loaded/unloaded.

Next, this GC will be invoked only if there ARE some garbage (dirty)
slots when you get/free an insn slot. Thus, if your kprobe cleans
its slot up before releasing it, your kprobe NEVER invokes the GC.
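
As an illustration of that rule, here is a small user-space model in C.
It is a sketch, not the kernel implementation: the names, the fixed pool
size, and the policy of collecting at the start of a "get" are
assumptions made for the example.  The only point it shows is that the
collector never runs while the count of dirty slots is zero.

```c
#include <assert.h>
#include <stddef.h>

enum slot_state { SLOT_FREE, SLOT_USED, SLOT_DIRTY };

#define NSLOTS 4

static enum slot_state slots[NSLOTS];
static int garbage_slots;  /* number of dirty slots awaiting the GC */
static int gc_runs;        /* how many times the collector ran */

/* Model of the collector.  The real one must first make sure no task
 * can still be executing in a dirty slot, which is the costly part. */
static void model_collect_garbage(void)
{
    gc_runs++;
    for (size_t i = 0; i < NSLOTS; i++)
        if (slots[i] == SLOT_DIRTY)
            slots[i] = SLOT_FREE;
    garbage_slots = 0;
}

/* Returns a slot index, or -1 if the pool is exhausted. */
static int model_get_slot(void)
{
    if (garbage_slots > 0)  /* collect only when something is dirty */
        model_collect_garbage();
    for (size_t i = 0; i < NSLOTS; i++) {
        if (slots[i] == SLOT_FREE) {
            slots[i] = SLOT_USED;
            return (int)i;
        }
    }
    return -1;
}

/* dirty != 0: the slot may still be in use by a preempted task, so it
 * is only marked and left for the collector.  dirty == 0: the slot is
 * known to be clean and is released immediately. */
static void model_free_slot(int idx, int dirty)
{
    assert(idx >= 0 && idx < NSLOTS && slots[idx] == SLOT_USED);
    if (dirty) {
        slots[idx] = SLOT_DIRTY;
        garbage_slots++;
    } else {
        slots[idx] = SLOT_FREE;
    }
}
```

A caller that always frees with dirty == 0 never triggers
model_collect_garbage(); a single dirty free causes exactly one
collection on the next get.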

> What I would recommend is that the GC and its hooks be able to be
> conditionally enabled or disabled with an ifdef, probably in the
> arch kprobes.h file like insn slot and kretprobes are now.  That way
> for architectures that don't need it by having their own alternate
> approach can choose not to use it.  Would that be a reasonable
> suggestion?

Currently, only the slots used by boosted kprobes are dirty.
So, on the ARM architecture, if you'd like to disable the GC,
you just call free_insn_slot with dirty=0, as below.
 free_insn_slot(insn_slot, 0);

On i386, you can prohibit boosting kprobes by specifying an empty
post_handler for the kprobes.  Then the GC will no longer run.

If you feel a strong need for disabling the GC at compile time,
I can add a compile flag for disabling it.  Or,
could you write the trampoline routine and a
compile flag for switching between the trampoline and the GC?

> About your last bullet mentioning djprobes, the ARM implementation
> of kprobes I'm working on does have support for kretprobes and
> jprobes, but not djprobes.  I have read your post from Oct '05 on
> djprobes and understand it in a general way, but haven't yet wrapped
> my mind fully around it.  Could you explain why such an approach
> doesn't work for djprobes on x86?  Do you imagine that assertion
> would hold for other non-x86 architectures as well?

Djprobe has to rewrite multiple instructions on i386 (CISC), because
the long jump instruction is longer than many instructions.
Before rewriting those instructions, we must ensure no other processes
are running/sleeping on the target instructions which will be rewritten
by the long jump. This case cannot be helped by the trampoline routine.
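
To show why more than one instruction is affected, here is a minimal
sketch in C.  It assumes the long jump is the 5-byte "jmp rel32" form
(opcode 0xE9) and that instruction lengths come from some decoder that
is not shown.  The function names are hypothetical and this is not
djprobe code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5  /* 1 opcode byte + 4 displacement bytes */

/* Count the whole instructions, starting at the probe point, that a
 * 5-byte jump overwrites.  lens[] holds their lengths in bytes.
 * Stores the number of bytes spanned in *bytes.  Returns 0 if the
 * given instructions do not add up to 5 bytes. */
static size_t insns_covered_by_jump(const uint8_t *lens, size_t n,
                                    size_t *bytes)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        assert(lens[i] >= 1 && lens[i] <= 15);  /* x86 length limits */
        total += lens[i];
        if (total >= JMP_REL32_LEN) {
            *bytes = total;
            return i + 1;
        }
    }
    *bytes = total;
    return 0;
}

/* Encode "jmp rel32" placed at address 'from' and targeting 'to'.
 * The displacement is relative to the end of the jump instruction. */
static void encode_jmp_rel32(uint32_t from, uint32_t to,
                             uint8_t out[JMP_REL32_LEN])
{
    uint32_t rel = to - (from + JMP_REL32_LEN);  /* wraps mod 2^32 */

    out[0] = 0xE9;
    out[1] = (uint8_t)(rel & 0xff);          /* little-endian */
    out[2] = (uint8_t)((rel >> 8) & 0xff);
    out[3] = (uint8_t)((rel >> 16) & 0xff);
    out[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

With instruction lengths of 1, 2 and 3 bytes at the probe point, the
jump overwrites all three instructions (6 bytes), so a task preempted
at the second or third one would resume in the middle of the jump.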

Best regards,

-- 
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com


* Re: boosting with preemptable kernel
  2007-01-14  3:13     ` Masami Hiramatsu
@ 2007-01-27  6:38       ` Quentin Barnes
  2007-01-30  5:05         ` Masami Hiramatsu
  0 siblings, 1 reply; 5+ messages in thread
From: Quentin Barnes @ 2007-01-27  6:38 UTC (permalink / raw)
  To: Masami Hiramatsu; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

>> Aside from RT related impact issues the GC causes, it also degrades
>> poorly on SMP systems.  It causes the entire system across all CPUs
>> to be forced to stop executing and all held into an idle state
>> simultaneously.  The reason for having multiple CPUs is so that a
>> thread tying up one CPU doesn't impact the performance (much) of
>> the rest of the system.  Now we have a single thread being able to
>> impact the performance across all CPUs system-wide simultaneously.
>> (If my understanding of the GC implementation is wrong, please
>> correct me.  It's only from my limited understanding of reading and
>> following the code, not from using it.)
>
>I think you are a bit misreading the GC implementation.
>First, this GC is never invoked when the kprobe is hit.
>This GC may be invoked when you register/unregister a kprobe.
>I think these operations are infrequent and are typically
>done when a module is loaded/unloaded.

I had followed all that.

>Next, this GC will be invoked only if there ARE some garbage (dirty)
>slots when you get/free an insn slot. Thus, if your kprobe cleans
>its slot up before releasing it, your kprobe NEVER invokes the GC.

Ah, I did not follow this.  I see now.  The variable
"kprobe_garbage_slots" keeps track of the dirty count to avoid
invoking the GC when the count is zero.

>> What I would recommend is that the GC and its hooks be able to be
>> conditionally enabled or disabled with an ifdef, probably in the
>> arch kprobes.h file like insn slot and kretprobes are now.  That way
>> for architectures that don't need it by having their own alternate
>> approach can choose not to use it.  Would that be a reasonable
>> suggestion?
>
>Currently, only the slots used by boosted kprobes are dirty.
>So, on the ARM architecture, if you'd like to disable the GC,
>you just call free_insn_slot with dirty=0, as below.
> free_insn_slot(insn_slot, 0);

Yes, when the ARM code is updated to have the second parameter
to free_insn_slot(), it will always be 0.

>On i386, you can prohibit boosting kprobes by specifying an empty
>post_handler for the kprobes.  Then the GC will no longer run.

Having a post-handler defeats boosting which then "defeats" the GC.

For our platform use though, we're ARM only.

>If you feel a strong need for disabling the GC at compile time,
>I can add a compile flag for disabling it.

Since the GC will be inactive for ARM with only a trivial execution
charge to check and maintain the "kprobe_garbage_slots" variable,
it's not that big a deal.  The amount of dead code is pretty minor
too.  At this point I'd say it's not worth putting a compile time
flag around.

>Or,
>could you write the trampoline routine and a
>compile flag for switching between the trampoline and the GC?

In my approach, there's no single "trampoline" in the classic sense.

The kprobe'd instruction's effects always complete before
kprobe_handler() returns, so there is no system state to hold and
manage across re-entry for the same kprobe's continued processing.
This allows kprobe_handler() and its associated management functions
and saved data state to be greatly simplified and reduced.  There's
no way to just "throw a switch" to put all that complexity back.  It
would almost have to be two completely different implementations.

>> About your last bullet mentioning djprobes, the ARM implementation
>> of kprobes I'm working on does have support for kretprobes and
>> jprobes, but not djprobes.  I have read your post from Oct '05 on
>> djprobes and understand it in a general way, but haven't yet wrapped
>> my mind fully around it.  Could you explain why such an approach
>> doesn't work for djprobes on x86?  Do you imagine that assertion
>> would hold for other non-x86 architectures as well?
>
>Djprobe has to rewrite multiple instructions on i386 (CISC), because
>the long jump instruction is longer than many instructions.
>Before rewriting those instructions, we must ensure no other processes
>are running/sleeping on the target instructions which will be rewritten
>by the long jump. This case cannot be helped by the trampoline routine.

Yes, that's a very, very messy problem on the x86.  Up to five(?)
instructions could be overwritten by the long jump instruction.

I'm still not sure, though, that I see why a trampoline approach wouldn't
work for x86.  It would just have to iterate the trampoline up to
five times.  But I still don't have the model clear enough in my
head yet.  Maybe it will become clearer over time.

As long as the kernel text address space stays under 32MB, an
ARM djprobe implementation would be a one-for-one instruction
replacement.
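
The range limit comes from the ARM branch encoding: B carries a signed
24-bit word offset, measured from the branch address plus 8, which
reaches roughly +/-32MB.  A minimal sketch in C of checking the range
and building the branch word follows.  The function name is
hypothetical, and it assumes ARM (not Thumb) state and an unconditional
branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to encode an unconditional ARM "B" at address 'from' that
 * branches to 'to'.  Returns false if the target is out of range,
 * in which case a single-instruction replacement is not possible. */
static bool arm_encode_b(uint32_t from, uint32_t to, uint32_t *insn)
{
    /* The offset is relative to the branch address plus 8. */
    int64_t off = (int64_t)to - ((int64_t)from + 8);

    assert((from & 3u) == 0 && (to & 3u) == 0);  /* word aligned */

    /* Signed 24-bit word offset: -2^25 .. 2^25 - 4 bytes. */
    if (off < -(INT64_C(1) << 25) || off > (INT64_C(1) << 25) - 4)
        return false;

    /* cond = 1110 (always), bits 27:24 = 1010 (B), offset in words. */
    *insn = 0xEA000000u | ((uint32_t)(off / 4) & 0x00ffffffu);
    return true;
}
```

For instance, a branch from 0x8000 to 0x8010 encodes as 0xEA000002,
while a target 64MB away is rejected.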

I'm still absorbing the djprobes explanation (djprobe-20051031.txt)
and perusing the patch you sent out Nov 21.  Sorry if the following
question has already been discussed.  If it has, just point me to
it.  Is there a reason djprobes needs its own, separate interface?
Could it just use the kprobes registration service and have the
kprobes code decide whether to implement a given probe as a kprobe
with an exception or djprobe with a direct jump?  Or is this a long
term goal after shaking out the djprobes model?

>Best regards,
>
>-- 
>Masami HIRAMATSU
>Linux Technology Center
>Hitachi, Ltd., Systems Development Laboratory
>E-mail: masami.hiramatsu.pt@hitachi.com

Quentin


* Re: boosting with preemptable kernel
  2007-01-27  6:38       ` Quentin Barnes
@ 2007-01-30  5:05         ` Masami Hiramatsu
  0 siblings, 0 replies; 5+ messages in thread
From: Masami Hiramatsu @ 2007-01-30  5:05 UTC (permalink / raw)
  To: Quentin Barnes; +Cc: Satoshi Oshima, Hideo Aoki, Yumiko Sugita, SystemTAP

Hi Quentin,

Quentin Barnes wrote:
>> Djprobe has to rewrite multiple instructions on i386 (CISC), because
>> the long jump instruction is longer than many instructions.
>> Before rewriting those instructions, we must ensure no other processes
>> are running/sleeping on the target instructions which will be rewritten
>> by the long jump. This case cannot be helped by the trampoline routine.
> 
> Yes, that's a very, very messy problem on the x86.  Up to five(?)
> instructions could be overwritten by the long jump instruction.
> 
> I'm still not sure, though, that I see why a trampoline approach wouldn't
> work for x86.  It would just have to iterate the trampoline up to
> five times.  But I still don't have the model clear enough in my
> head yet.  Maybe it will become clearer over time.

A similar idea was discussed with Karim in the thread below:
http://sourceware.org/ml/systemtap/2006-q3/msg00725.html

> As long as the kernel text address space stays under 32MB, an
> ARM djprobe implementation would be a one-for-one instruction
> replacement.

Unfortunately, djprobe allocates its stub buffer from the
module area. Is that included in the 32MB area on ARM?

> I'm still absorbing the djprobes explanation (djprobe-20051031.txt)
> and perusing the patch you sent out Nov 21.  Sorry if the following
> question has already been discussed.  If it has, just point me to
> it.  Is there a reason djprobes needs its own, separate interface?
> Could it just use the kprobes registration service and have the
> kprobes code decide whether to implement a given probe as a kprobe
> with an exception or djprobe with a direct jump?  Or is this a long
> term goal after shaking out the djprobes model?

We discussed that before I sent the latest patch.
The patch I sent on Nov 21 integrated the djprobe interface
into the kprobe interface. If the user sets the length of the
instructions (which will be replaced by a long jump) in the length
member of the kprobe data structure, that kprobe will be optimized
(become a djprobe) after commit_kprobes() is invoked.

Best regards,

-- 
Masami HIRAMATSU
Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com



