public inbox for systemtap@sourceware.org
* breakpoint assistance: single-step out of line
@ 2007-03-14 19:27 Roland McGrath
  2007-03-16  1:06 ` Jim Keniston
  2007-03-16 14:09 ` Frank Ch. Eigler
  0 siblings, 2 replies; 10+ messages in thread
From: Roland McGrath @ 2007-03-14 19:27 UTC
  To: systemtap

The method of single-stepping over an out of line copy of the instruction
clobbered by breakpoint insertion has been proven by kprobes.  The
complexities are mitigated in that implementation by the constrained
context of the kernel and the fixed subset of possible machine code known
to validly occur in any kernel or module text.

There are two core problem areas in implementing single-step out of line
for user mode code.  These are where to store the out of line copies, and
arch issues with instruction semantics.


Starting with arch issues, I'll talk about the only ones I know in detail,
which are x86 and x86_64.  kprobes has done the basic work here.  For the
user mode context, on the one hand the risks of munging an instruction's
behavior are confined to the user address space in question, but on the
other hand we have to deal robustly with the full range of instructions
that can be executed on the processor in user mode.

Instruction decoding needs to be robust, not presume the canonical subset
of encodings normally produced by the compiler, as used in the kernel.  On
machines other than x86, this tends to be quite simple.  On x86, it means
parsing all the instruction prefixes correctly and so forth.  I think the
parsing should be done at breakpoint insertion time, caching just a few
bits saying what fixup strategy we need to use after the step.  If we can't
positively decode the instruction enough to be confident that we know how
to fix it up, refuse to insert the breakpoint.  (If it's an invalid
instruction, you don't need a breakpoint because you'll get a trap anyway.)

The instructions of concern are those that refer to the PC.
On 32-bit x86, these are only the few control flow instructions.
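
To make the "parse at insertion time, cache a few bits" idea concrete, here
is a minimal sketch in C.  It is not the kprobes or uprobes code: the enum
names and classify_insn() are hypothetical, and the opcode subset covers
only a handful of 32-bit x86 control flow instructions.  A real decoder
would also have to determine the full instruction length and handle
two-byte opcodes before it could claim to have decoded anything
positively.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup classes cached per probepoint at insertion time. */
enum fixup {
    FIXUP_REJECT,     /* not decoded confidently: refuse the breakpoint */
    FIXUP_REBASE_PC,  /* after the step, move PC from the copy back to the original */
    FIXUP_CALL_REL,   /* rebase PC and rewrite the pushed return address */
    FIXUP_CALL_ABS,   /* PC is already absolute; rewrite the return address only */
    FIXUP_ABS_PC      /* ret or indirect jmp: PC is already absolute */
};

static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x26: case 0x2e: case 0x36: case 0x3e: /* es, cs, ss, ds */
    case 0x64: case 0x65:                       /* fs, gs */
    case 0x66: case 0x67:                       /* operand size, address size */
    case 0xf0: case 0xf2: case 0xf3:            /* lock, repne, rep */
        return 1;
    default:
        return 0;
    }
}

enum fixup classify_insn(const uint8_t *insn, size_t len)
{
    size_t i = 0;

    /* Skip any run of legacy prefixes, in any order. */
    while (i < len && is_legacy_prefix(insn[i]))
        i++;
    /* An x86 instruction is at most 15 bytes, so an opcode must follow. */
    if (i >= len || i > 14)
        return FIXUP_REJECT;

    switch (insn[i]) {
    case 0xe8:                                  /* call rel32 */
        return FIXUP_CALL_REL;
    case 0xc2: case 0xc3: case 0xca: case 0xcb: /* ret, retf */
    case 0xea:                                  /* jmp far ptr */
        return FIXUP_ABS_PC;
    case 0xff:                                  /* group 5, selected by ModRM.reg */
        if (i + 1 >= len)
            return FIXUP_REJECT;
        switch ((insn[i + 1] >> 3) & 7) {
        case 2: case 3: return FIXUP_CALL_ABS;  /* indirect call */
        case 4: case 5: return FIXUP_ABS_PC;    /* indirect jmp */
        case 7:         return FIXUP_REJECT;    /* undefined encoding */
        default:        return FIXUP_REBASE_PC; /* inc, dec, push */
        }
    default:
        return FIXUP_REBASE_PC;
    }
}
```

The post-step path then only has to switch on the cached enum value
instead of looking at the instruction bytes again.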

On x86_64, there is also %rip-relative addressing.  We cannot presume
addresses are within the same 4GB window so that the displacement can just
be adjusted, as we do in the kernel.  However, we can use some other
tricks.  The only instruction that computes a %rip-relative address as a
result is lea.  It is not difficult to recognize that one and just emulate
it outright; there are only a few variations of address-size, data-size,
and output register.  It's not much easier to fix it up after the step.

Unless I'm overlooking something, all other %rip-relative uses are implicit
in the effective address for a memory access.  For these, we can use the fs
or gs segment prefix on the copied instruction, and adjust the displacement
and the fs or gs base value to come up with the original target address.
In the unlikely event that the instruction already uses the fs or gs
prefix, just adjust the appropriate base value and use the instruction as
it is.  Otherwise, insert a gs prefix in the copied instruction, and set
the gs base to the difference between the address of the copy (after the
inserted prefix) and the breakpoint address.  It is a little costly to set
the fs or gs base value and reset it after the step, much more than setting
a register in the trap frame; but it's probably not too bad.
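
The base arithmetic can be checked with a small sketch in C.  The function
names are hypothetical, the displacement is left unchanged, and the
inserted prefix is assumed to be exactly one byte, so the copied
instruction is one byte longer than the original.  All arithmetic wraps
modulo 2^64, which is what lets the base act as a signed offset.

```c
#include <stdint.h>

/* Address named by a %rip-relative operand: the address of the next
 * instruction plus the sign-extended 32-bit displacement. */
uint64_t rip_rel_target(uint64_t insn_addr, uint64_t insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}

/* gs base to install for the step when a gs prefix is inserted into the
 * copy: breakpoint address minus the address of the copy after the
 * inserted prefix. */
uint64_t gs_base_for_copy(uint64_t bkpt_addr, uint64_t copy_addr)
{
    return bkpt_addr - (copy_addr + 1);
}

/* The instruction already carries an fs or gs prefix: no byte is
 * inserted, so the existing base is only shifted by the distance
 * between the breakpoint and the copy. */
uint64_t seg_base_adjusted(uint64_t old_base, uint64_t bkpt_addr,
                           uint64_t copy_addr)
{
    return old_base + (bkpt_addr - copy_addr);
}
```

For a 7-byte instruction at 0x400123 with displacement -0x20, the original
target is 0x40010a.  Stepping the 8-byte copy with the base from
gs_base_for_copy() yields the same address wherever the copy lives.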


Next we come to the problem of where to store copied instructions for
stepping.  The idea of stealing a stack page for this is a non-starter.
For both security and robustness, it's never acceptable to introduce a user
mapping that is both writable and executable, even temporarily.  We need to
use an otherwise unused page in the address space that will be
read/execute only for the user; we can write to it only from kernel mode.

In some meeting notes I've seen mention of "do what the vdso does".  I
don't know what this referred to specifically.  There are two things this
might mean, and those are the two main options I see.  What the i386 vDSO
used to do (CONFIG_COMPAT_VDSO), what the ia64 vDSO does, and what the
x86-64 vsyscall page does (not a vDSO but similar), is the fixmap area.
What the i386 vDSO, the ia32 vDSO on x86_64, and the powerpc vDSO do,
is insert a vma.

The fixmap area is a region of address space that shares some page tables
across all tasks in the system.  The advantages are that it has no vm setup
cost since it is done once at boot, and that it is completely outside the
range of virtual addresses the user task can map normally and so does not
perturb any mapping behavior or appear in /proc/PID/maps or via
access_process_vm or such things that might have unintended side effects on
the user process.  On 32-bit x86, a disadvantage is that when NX page
protection is not available (older CPUs or non-PAE kernel builds), the
exec-shield approximation of NX protection via segmentation is defeated by
having an executable page high in the address space; this can be worked
around on the exec-shield kernel with some extra effort.  Other machines
may not already have an analogous region of reserved address space where a
page can be made user-readable/executable.  Other potential disadvantages
are the fixed amount of space (chosen at compile-time or boot-time, with
some small limit on the number of pages available), and the security
implications of global pages visible to all users on the system.  The
limited size might mean that slots need to be assigned only momentarily
while doing the step, meaning fresh icache flushing every time.  Then you'd
ideally use only one slot per CPU, but that needs some work to be right
given preemption.  The briefness of this window may mitigate the security
concerns, but still there are a few bytes of information about a traced
thread leaking to anyone in the system who wants to try to see them.  The
setup every time necessitated by the fixed space is costly, but on the
other hand its CPU use scales linearly with more breakpoints and more
occurrences and its memory use stays constant, compared to open-ended
allocation scaling with the number of breakpoints.

Inserting a vma means essentially doing an mmap from inside the kernel.
Both the advantages and the disadvantages of this stem from its normalcy.
Any stray mmap/munmap/mprotect call from the user might wind up clobbering
this mapping.  It appears in /proc/PID/maps and will become known to other
debugging facilities tracing the process, so they will think it's a normal
user allocation; it might appear in core dumps.  This might have other bad
effects on processes that look at their own maps file to see what heap
pages there are, which some GC libraries or suchlike might well do.  The
mapping also has subtler effects perturbing the process's own mapping
behavior, which could introduce anomalies or even break some programs that
need a lot of control over their address space.  The advantages are that
it's straightforward to implement and easy to be sure that it does the
right thing vis a vis paging and so forth, it provides the option of using
an open-ended amount of storage to optimize the use of many breakpoints,
and it's wholly private to the user address space in question.

A third option I didn't mention before is doing something in the page
tables behind the vm system's back (this is distinct from, and somewhat
simpler than, the fancy VM ideas like per-thread page tables).  I don't know enough
about this to comment in detail.  The attraction is that it would avoid
some of the interactions I just mentioned with vma's, and might have lower
overhead to set up.  It might be difficult to make this do reasonable
things about paging and such.  This is probably not a good bet, but I don't
know much about it.

The fixmap is somewhat attractive at least for x86, x86-64, and ia64.  It's
nice that it doesn't interact with the normal user address range and set of
visible mappings.  The overhead of resetting and icache flushing an
instruction slot on every use is less than the uprobes prototype using a
stack page already has.  I don't know if the performance of that will be
good enough in the long run, or if priming a slot once and using it
repeatedly will perform enough better that we care about this overhead.

The vma is the most straightforward thing to implement, and is generic
across machines.  It makes sense to implement this first generically and
then experiment later with the fixmap approach as an arch-specific
alternative.  The stack randomization done on at least x86/x86-64 means
that there is normally a good little stretch of address space free above
the stack vma (the top part of which holds environ and auxv).  (Just try
"tail -1 /proc/self/maps" a few times.)  This area is unlikely to conflict
with address space the user's own mappings would ever have considered.
Allocating at one page above the end of the stack vma (leaving a red zone)
seems good.  I'm really more concerned about things monitoring the
mappings.  Perhaps we could add a VM_* flag that says to omit the vma from
listings, but I don't know how that would be received by kernel people, let
alone a flag to disallow user munmap/mmap/mprotect calls to change a mapping.
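
As a user-space illustration of the placement only (the kernel side would
go through its own vma interfaces, which are not shown here), the following
C sketch parses one /proc/PID/maps line and computes an address one page
above the end of the stack vma.  The function names are hypothetical, and
it assumes the stack line carries the "[stack]" annotation.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills *start and *end if the maps line describes the
 * stack vma, 0 otherwise. */
int parse_stack_line(const char *line,
                     unsigned long long *start, unsigned long long *end)
{
    if (strstr(line, "[stack]") == NULL)
        return 0;
    return sscanf(line, "%llx-%llx", start, end) == 2;
}

/* Candidate start address for the slot area: skip one unmapped page
 * above the stack vma so that it serves as a red zone. */
unsigned long long slot_area_hint(unsigned long long stack_end,
                                  unsigned long long page_size)
{
    return stack_end + page_size;
}
```

Feeding each line of /proc/self/maps to parse_stack_line() and passing the
resulting end address to slot_area_hint() gives the address discussed
above.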

I can go into further detail on how I envision implementing the vma and/or
fixmap plans if it is not clear.


Thanks,
Roland

* Re: breakpoint assistance: single-step out of line
  2007-03-14 19:27 breakpoint assistance: single-step out of line Roland McGrath
@ 2007-03-16  1:06 ` Jim Keniston
  2007-03-29  4:40   ` Roland McGrath
  2007-03-16 14:09 ` Frank Ch. Eigler
  1 sibling, 1 reply; 10+ messages in thread
From: Jim Keniston @ 2007-03-16  1:06 UTC
  To: Roland McGrath; +Cc: systemtap

On Sun, 2007-03-04 at 13:38 -0800, Roland McGrath wrote:
> The method of single-stepping over an out of line copy of the instruction
> clobbered by breakpoint insertion has been proven by kprobes.  The
> complexities are mitigated in that implementation by the constrained
> context of the kernel and the fixed subset of possible machine code known
> to validly occur in any kernel or module text.
> 
> There are two core problem areas in implementing single-step out of line
> for user mode code.  These are where to store the out of line copies, and
> arch issues with instruction semantics.
> 
> 
> Starting with arch issues, I'll talk about the only ones I know in detail,
> which are x86 and x86_64.  kprobes has done the basic work here.  For the
> user mode context, on the one hand the risks of munging an instruction's
> behavior are confined to the user address space in question, but on the
> other hand we have to deal robustly with the full range of instructions
> that can be executed on the processor in user mode.

Yes.

> 
> Instruction decoding needs to be robust, not presume the canonical subset
> of encodings normally produced by the compiler, as used in the kernel.  On
> machines other than x86, this tends to be quite simple.  On x86, it means
> parsing all the instruction prefixes correctly and so forth.  I think the
> parsing should be done at breakpoint insertion time, caching just a few
> bits saying what fixup strategy we need to use after the step.

I guess that depends on how complicated the switch(opcode) { ... } code
in uprobe_resume_execution() gets.  Parsing the instruction at
probe-insertion time is essential for x86_64, at least partly because of
rip-relative addressing, as you discuss below.

> If we can't
> positively decode the instruction enough to be confident that we know how
> to fix it up, refuse to insert the breakpoint.

Yes.

> (If it's an invalid
> instruction, you don't need a breakpoint because you'll get a trap anyway.)
> 
> The instructions of concern are those that refer to the PC.
> On 32-bit x86, these are only the few control flow instructions.
> 
> On x86_64, there is also %rip-relative addressing.  We cannot presume
> addresses are within the same 4GB window so that the displacement can just
> be adjusted, as we do in the kernel.

Yes.

> However, we can use some other
> tricks.  The only instruction that computes a %rip-relative address as a
> result is lea.  It is not difficult to recognize that one and just emulate
> it outright; there are only a few variations of address-size, data-size,
> and output register.  It's not much easier to fix it up after the step.
> 
> Unless I'm overlooking something, all other %rip-relative uses are implicit
> in the effective address for a memory access.  

I think that's the case, but we haven't done a thorough review of the
instruction list.

> For these, we can use the fs
> or gs segment prefix on the copied instruction, and adjust the displacement
> and the fs or gs base value to come up with the original target address.
> In the unlikely event that the instruction already uses the fs or gs
> prefix, just adjust the appropriate base value and use the instruction as
> it is.  Otherwise, insert a gs prefix in the copied instruction, and set
> the gs base to the difference between the address of the copy (after the
> inserted prefix) and the breakpoint address.  It is a little costly to set
> the fs or gs base value and reset it after the step, much more than setting
> a register in the trap frame; but it's probably not too bad.

The approach we had in mind was to change the rip-relative instruction
to an indirect instruction where the target address is in a scratch
register (one not accessed by the original instruction).  Save the value
of the scratch register, load in the target address, single-step, and
restore the scratch register's real value.  This isn't coded yet.

> 
> 
> Next we come to the problem of where to store copied instructions for
> stepping.  The idea of stealing a stack page for this is a non-starter.
> For both security and robustness, it's never acceptable to introduce a user
> mapping that is both writable and executable, even temporarily.  We need to
> use an otherwise unused page in the address space, that will be
> read/execute only for the user, we can write to it only from kernel mode.

As it turns out, this approach isn't very portable, either.  The s390
and powerpc compilers regularly generate code that accesses data beyond
the top-of-stack, so it's tough to find a "safe" page in the stack vma.

> 
> In some meeting notes I've seen mention of "do what the vdso does".  I
> don't know what this referred to specifically.

We were thinking in terms of a per-process page that's automatically set
up at exec time.  There's no dso involved in our approach, but the
"vdso" reference has been hard to kill.

> There are two things this
> might mean, and those are the two main options I see.  What the i386 vDSO
> used to do (CONFIG_COMPAT_VDSO), what the ia64 vDSO does, and what the
> x86-64 vsyscall page does (not a vDSO but similar), is the fixmap area.
> What the i386 vDSO, the ia32 vDSO on x86_64, and the powerpc vDSO do,
> is insert a vma.

It's the latter.

> 
> The fixmap area is a region of address space that shares some page tables
> across all tasks in the system.  The advantages are that it has no vm setup
> cost since it is done once at boot, and that it is completely outside the
> range of virtual addresses the user task can map normally and so does not
> perturb any mapping behavior or appear in /proc/PID/maps or via
> access_process_vm or such things that might have unintended side effects on
> the user process.  On 32-bit x86, a disadvantage is that when NX page
> protection is not available (older CPUs or non-PAE kernel builds), the
> exec-shield approximation of NX protection via segmentation is defeated by
> having an executable page high in the address space; this can be worked
> around on the exec-shield kernel with some extra effort.  Other machines
> may not already have an analogous region of reserved address space where a
> page can be made user-readable/executable.  Other potential disadvantages
> are the fixed amount of space (chosen at compile-time or boot-time, with
> some small limit on the number of pages available), and the security
> implications of global pages visible to all users on the system.  The
> limited size might mean that slots need to be assigned only momentarily
> while doing the step, meaning fresh icache flushing every time.  Then you'd
> ideally use only one slot per CPU, but that needs some work to be right
> given preemption.  The briefness of this window may mitigate the security
> concerns, but still there are a few bytes of information about a traced
> thread leaking to anyone in the system who wants to try to see them.  The
> setup every time necessitated by the fixed space is costly, but on the
> other hand its CPU use scales linearly with more breakpoints and more
> occurrences and its memory use stays constant, compared to open-ended
> allocation scaling with the number of breakpoints.

We haven't seriously considered the above approach.

> 
> Inserting a vma means essentially doing an mmap from inside the kernel.
> Both the advantages and the disadvantages of this stem from its normalcy.
> Any stray mmap/munmap/mprotect call from the user might wind up clobbering
> this mapping.  

Good point.

> It appears in /proc/PID/maps and will become known to other
> debugging facilities tracing the process, so they will think it's a normal
> user allocation; it might appear in core dumps.  This might have other bad
> effects on processes that look at their own maps file to see what heap
> pages there are, which some GC libraries or suchlike might well do.  The
> mapping also has subtler effects perturbing the process's own mapping
> behavior, which could introduce anomalies or even break some programs that
> need a lot of control over their address space.  The advantages are that
> it's straightforward to implement and easy to be sure that it does the
> right thing vis a vis paging and so forth, it provides the option of using
> an open-ended amount of storage to optimize the use of many breakpoints,
> and it's wholly private to the user address space in question.

Our current approach uses a fixed-size area (1 page for now) that's
allocated at exec time.  Instruction slots are allocated to probepoints
as they are hit, and a probepoint owns the slot until another probepoint
steals it.  For x86[_64], 1 page gives us 256 slots.  We would see
thrashing due to slot starvation only if the process is hitting more
than 256 different probepoints in a short time span.

We're still debugging this approach.
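
The slot ownership scheme can be sketched in C as follows.  This is not the
posted uprobes code: the structure and function names are hypothetical, the
16-byte slot size is inferred from 256 slots per 4096-byte page, and
round-robin victim selection is an assumption, since the text does not say
how the slot to steal is chosen.

```c
#include <stddef.h>

#define AREA_BYTES 4096
#define SLOT_BYTES 16
#define NR_SLOTS   (AREA_BYTES / SLOT_BYTES)   /* 256 */

struct probepoint {
    int slot;                       /* index of the owned slot, or -1 */
};

struct slot_area {
    struct probepoint *owner[NR_SLOTS];
    unsigned next_victim;           /* round-robin cursor (assumed policy) */
};

/* Called when a probepoint is hit.  Returns the slot to step in. */
int slot_acquire(struct slot_area *a, struct probepoint *pp)
{
    unsigned s;

    if (pp->slot >= 0)
        return pp->slot;            /* still primed: no copy, no icache flush */

    s = a->next_victim;
    a->next_victim = (s + 1) % NR_SLOTS;
    if (a->owner[s] != NULL)
        a->owner[s]->slot = -1;     /* steal the slot from its previous owner */
    a->owner[s] = pp;
    pp->slot = (int)s;
    /* Real code would copy the instruction into the slot and flush the
     * icache at this point. */
    return pp->slot;
}
```

With this policy a probepoint pays the copy and flush cost again only
after 256 other probepoints have taken slots since its last hit.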

> 
> A third option I didn't mention before is doing something in the page
> tables behind the vm system's back (this as distinct, and somewhat simpler
> than, the fancy VM ideas like per-thread page tables).  I don't know enough
> about this to comment in detail.  The attraction is that it would avoid
> some of the interactions I just mentioned with vma's, and might have lower
> overhead to set up.  It might be difficult to make this do reasonable
> things about paging and such.  This is probably not a good bet, but I don't
> know much about it.

I haven't thought about the above approach.

> 
> The fixmap is somewhat attractive at least for x86, x86-64, and ia64.  It's
> nice that it doesn't interact with the normal user address range and set of
> visible mappings.  The overhead of resetting and icache flushing an
> instruction slot on every use is less than the uprobes prototype using a
> stack page already has.  I don't know if the performance of that will be
> good enough in the long run, or if priming a slot once and using it
> repeatedly will perform enough better that we care about this overhead.

We picked per-probepoint multiplexing because of icache issues (and
because it seems best for single-threaded apps), but it turned out to be
no more complex than per-thread or per-cpu muxing.

> 
> The vma is the most straightforward thing to implement, and is generic
> across machines.  It makes sense to implement this first generically

Oh, good.  Glad we got that right.

> and
> then experiment later with the fixmap approach as an arch-specific
> alternative.  The stack randomization done on at least x86/x86-64 means
> that there is normally a good little stretch of address space free above
> the stack vma (the top part of which holds environ and auxv).  (Just try
> "tail -1 /proc/self/maps" a few times.)  This area is unlikely to conflict
> with address space the user's own mappings would ever have considered.
> Allocating at one page above the end of the stack vma (leaving a red zone)
> seems good.

Sounds good, although I personally don't know the incantation for
putting the vma there.  Any help would be appreciated.

> I'm really more concerned about things monitoring the
> mappings.  Perhaps we could add a VM_* flag that says to omit the vma from
> listings, but I don't know how that would be received by kernel people, let
> alone a flag to disallow user munmap/mmap/mprotect calls to change a mapping.

I'm not so worried about the visibility of the area in /proc/*/maps and
such; protecting it from munmap & friends seems more of a concern.

> 
> I can go into further detail on how I envision implementing the vma and/or
> fixmap plans if it is not clear.
> 
We hope to post the above-described implementation Real Soon Now.  Given
your interest, maybe we will even if we don't have it firing on all
cylinders.

> 
> Thanks,
> Roland

Thanks again for your ideas and interest.
Jim

* Re: breakpoint assistance: single-step out of line
  2007-03-14 19:27 breakpoint assistance: single-step out of line Roland McGrath
  2007-03-16  1:06 ` Jim Keniston
@ 2007-03-16 14:09 ` Frank Ch. Eigler
  2007-03-16 18:54   ` Jim Keniston
  2007-03-20  8:08   ` Roland McGrath
  1 sibling, 2 replies; 10+ messages in thread
From: Frank Ch. Eigler @ 2007-03-16 14:09 UTC
  To: Roland McGrath; +Cc: systemtap


Roland McGrath <roland@redhat.com> writes:

> The method of single-stepping over an out of line copy of the
> instruction clobbered by breakpoint insertion has been proven by
> kprobes.  The complexities are mitigated in that implementation by
> the constrained context of the kernel and the fixed subset of
> possible machine code known to validly occur in any kernel or module
> text.

Another important aspect is that userspace may be hostile.  Beyond
just containing oddball instruction sequences, it may deliberately
rewrite its own .text, or otherwise interfere with probing in order to
produce crashes or security breaches.

- FChE

* Re: breakpoint assistance: single-step out of line
  2007-03-16 14:09 ` Frank Ch. Eigler
@ 2007-03-16 18:54   ` Jim Keniston
  2007-03-16 19:02     ` Frank Ch. Eigler
  2007-03-16 21:00     ` Jim Keniston
  2007-03-20  8:08   ` Roland McGrath
  1 sibling, 2 replies; 10+ messages in thread
From: Jim Keniston @ 2007-03-16 18:54 UTC
  To: Frank Ch. Eigler; +Cc: Roland McGrath, systemtap

On Fri, 2007-03-16 at 10:09 -0400, Frank Ch. Eigler wrote:
> Roland McGrath <roland@redhat.com> writes:
> 
> > The method of single-stepping over an out of line copy of the
> > instruction clobbered by breakpoint insertion has been proven by
> > kprobes.  The complexities are mitigated in that implementation by
> > the constrained context of the kernel and the fixed subset of
> > possible machine code known to validly occur in any kernel or module
> > text.
> 
> Another important aspect is that userspace may be hostile.  Beyond
> just containing oddball instruction sequences, it may deliberately
> rewrite its own .text, or otherwise interfere with probing in order to
> produce crashes or security breaches.

Under what circumstances can a user program rewrite its own text?

> 
> - FChE

Jim

* Re: breakpoint assistance: single-step out of line
  2007-03-16 18:54   ` Jim Keniston
@ 2007-03-16 19:02     ` Frank Ch. Eigler
  2007-03-16 21:00     ` Jim Keniston
  1 sibling, 0 replies; 10+ messages in thread
From: Frank Ch. Eigler @ 2007-03-16 19:02 UTC
  To: Jim Keniston; +Cc: systemtap

Hi -

On Fri, Mar 16, 2007 at 10:54:15AM -0700, Jim Keniston wrote:
> [...]
> Under what circumstances can a user program rewrite its own text?

After an mprotect?  Or someone else doing a ptrace write?

- FChE

* Re: breakpoint assistance: single-step out of line
  2007-03-16 18:54   ` Jim Keniston
  2007-03-16 19:02     ` Frank Ch. Eigler
@ 2007-03-16 21:00     ` Jim Keniston
  2007-03-16 21:56       ` Frank Ch. Eigler
  1 sibling, 1 reply; 10+ messages in thread
From: Jim Keniston @ 2007-03-16 21:00 UTC
  To: Frank Ch. Eigler; +Cc: Roland McGrath, systemtap

On Fri, 2007-03-16 at 10:54 -0700, Jim Keniston wrote:
> On Fri, 2007-03-16 at 10:09 -0400, Frank Ch. Eigler wrote:
> > Roland McGrath <roland@redhat.com> writes:
> > 
> > > The method of single-stepping over an out of line copy of the
> > > instruction clobbered by breakpoint insertion has been proven by
> > > kprobes.  The complexities are mitigated in that implementation by
> > > the constrained context of the kernel and the fixed subset of
> > > possible machine code known to validly occur in any kernel or module
> > > text.
> > 
> > Another important aspect is that userspace may be hostile.  Beyond
> > just containing oddball instruction sequences, it may deliberately
> > rewrite its own .text, or otherwise interfere with probing in order to
> > produce crashes or security breaches.
> 
> Under what circumstances can a user program rewrite its own text?

Frank answered:
> After an mprotect?

Indeed.  I had to try it to believe it.

> Or someone else doing a ptrace write?

Yes.

OK, here's my next dumb question.  How do you envision a user process
exploiting uprobes to mess up anything but itself (or its ptraced child)
in a novel way?

It could insert breakpoint instructions, but uprobes would let the
resulting SIGTRAPs pass through because they don't match known
probepoints.

It could remove breakpoint instructions inserted by uprobes, but then
the probepoints would never be hit and uprobes would never get involved.

It could scribble on the SSOL instruction slots or the uretprobe
trampoline, with the result that the wrong USER-mode instruction gets
executed, but I don't see how that's anything new that the kernel can't
handle.

I'm not arguing (confidently, anyway) that there are no such exploits.
If you can think of any, I'd like to know them.

Thanks.
Jim

* Re: breakpoint assistance: single-step out of line
  2007-03-16 21:00     ` Jim Keniston
@ 2007-03-16 21:56       ` Frank Ch. Eigler
  0 siblings, 0 replies; 10+ messages in thread
From: Frank Ch. Eigler @ 2007-03-16 21:56 UTC
  To: systemtap

Jim Keniston <jkenisto@us.ibm.com> writes:

> [...]
> > Under what circumstances can a user program rewrite its own text?
> 
> Frank answered:
> > After an mprotect?
> Indeed.  I had to try it to believe it.

Likewise!

> [...]  OK, here's my next dumb question.  How do you envision a user
> process exploiting uprobes to mess up anything but itself (or its
> ptraced child) in a novel way?

I don't have a specific scenario in mind.  One just needs to distrust
all the data coming from user space.

For example, the instructions being disassembled for out-of-line
single-stepping must be carefully analyzed, so it cannot hit shady
corner cases.  The single-stepping must be done in minimum-privilege
state.  The restoration of the instruction byte under the breakpoint
might need to assert that it is unchanged, or perhaps outright block
its attempted change somehow.
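
The "assert that it is unchanged" check could look like the following C
sketch.  The names are hypothetical, and a plain pointer stands in for
whatever mechanism the kernel uses to read and write the probed process's
text.

```c
#include <stdint.h>

#define BREAKPOINT_INSN 0xcc        /* x86 int3 */

struct probepoint_text {
    uint8_t saved_opcode;           /* original byte displaced by the breakpoint */
};

/* Restore the original byte only if the breakpoint is still in place.
 * Returns 0 on success, or -1 if the text was changed behind our back,
 * in which case nothing is written. */
int remove_breakpoint(uint8_t *text, const struct probepoint_text *pp)
{
    if (*text != BREAKPOINT_INSN)
        return -1;
    *text = pp->saved_opcode;
    return 0;
}
```

On failure the caller can report an orderly error for the probe rather
than overwrite whatever the process or a ptracer put there.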

This one is an old shangri-la saw, but it may be desirable to block
visibility of the breakpoint itself, to make a systemtap session
relatively invisible.

- FChE

* Re: breakpoint assistance: single-step out of line
  2007-03-16 14:09 ` Frank Ch. Eigler
  2007-03-16 18:54   ` Jim Keniston
@ 2007-03-20  8:08   ` Roland McGrath
  1 sibling, 0 replies; 10+ messages in thread
From: Roland McGrath @ 2007-03-20  8:08 UTC
  To: Frank Ch. Eigler; +Cc: systemtap

> Another important aspect is that userspace may be hostile.  Beyond
> just containing oddball instruction sequences, it may deliberately
> rewrite its own .text, or otherwise interfere with probing in order to
> produce crashes or security breaches.

Indeed.  Another way to say what I meant to express about this is that the
kernel implementation must be robust to any potential user-mode activity
and protect the kernel and other user address spaces from any side effects.
(That's what a kernel is for, after all.)  At worst, some pathological or
hostile case that might scramble things so they work differently than they
would have in the absence of the single-stepping procedure, should be able
to scramble or crash only the probed user address space.  Anything less is
a dangerous bug unacceptable for any production system.  The ideal we
strive for is that at worst, a probe will have to be backed out and produce
some orderly error at the script level, but not at all disrupt the normal
(pathological) behavior of the user task.

> This one is an old shangri-la saw, but it may be desirable to block
> visibility of the breakpoint itself, to make a systemtap session
> relatively invisible.

In the aforementioned "fancy VM tricks" shangri-la, you get this too.  That
is, the process memory as seen by debuggers and core dumps and such will
have original pages rather than breakpoint-touched ones.  With the final
demise of segmentation, I don't think there is any way at all in the
hardware to prevent normal memory reads by the process itself from seeing
the breakpoint.  Of course you can mess around for the one single-stepped
copied instruction, but for every other instruction you are SOL.


Thanks,
Roland

* Re: breakpoint assistance: single-step out of line
  2007-03-16  1:06 ` Jim Keniston
@ 2007-03-29  4:40   ` Roland McGrath
  2007-03-30 23:53     ` Jim Keniston
  0 siblings, 1 reply; 10+ messages in thread
From: Roland McGrath @ 2007-03-29  4:40 UTC
  To: Jim Keniston; +Cc: systemtap

> > Instruction decoding needs to be robust, not presume the canonical subset
> > of encodings normally produced by the compiler, as used in the kernel.  On
> > machines other than x86, this tends to be quite simple.  On x86, it means
> > parsing all the instruction prefixes correctly and so forth.  I think the
> > parsing should be done at breakpoint insertion time, caching just a few
> > bits saying what fixup strategy we need to use after the step.
> 
> I guess that depends on how complicated the switch(opcode) { ... } code
> in uprobe_resume_execution() gets.  Parsing the instruction at
> probe-insertion time is essential for x86_64, at least partly because of
> rip-relative addressing, as you discuss below.

The parsing will always be more costly than checking a few bits.  You're
always going to be doing it at insertion time anyway; there's no reason not
to cache the results of that across the board.

> The approach we had in mind was to change the rip-relative instruction
> to an indirect instruction where the target address is in a scratch
> register (one not accessed by the original instruction).  Save the value
> of the scratch register, load in the target address, single-step, and
> restore the scratch register's real value.  This isn't coded yet.

This requires decoding the instruction in more detail than we've done
before, to be sure of what register is free.  Maybe that isn't really all
that hard, but I'm not sure--I think there are a lot of cases to cover to
be sure what the target register is.  By contrast, the segment prefix is
very easy to parse.  The normal register fiddling is probably more
efficient than the fs/gs fiddling, but unless the difference is drastic I
think keeping the instruction decoder simpler is the overall win.
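
To illustrate how little is needed for the prefix side, here is a
sketch of a prefix scan for 64-bit mode plus the ModRM test for
rip-relative addressing (mod == 00, r/m == 101).  The struct and
function names are hypothetical; opcode decoding, which determines
where the ModRM byte actually sits, is left out, and a REX byte is
only recognized in its usual place directly before the opcode:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what the prefix scan found. */
struct insn_prefixes {
	size_t  nbytes;		/* legacy + REX prefix bytes consumed */
	uint8_t segment;	/* last segment-override byte seen, or 0 */
	uint8_t rex;		/* REX byte (0x40-0x4f), or 0 if absent */
};

static bool is_segment_prefix(uint8_t b)
{
	/* es, cs, ss, ds, fs, gs overrides */
	return b == 0x26 || b == 0x2e || b == 0x36 || b == 0x3e ||
	       b == 0x64 || b == 0x65;
}

static bool is_other_legacy_prefix(uint8_t b)
{
	/* operand-size, address-size, lock, repne, rep */
	return b == 0x66 || b == 0x67 || b == 0xf0 || b == 0xf2 || b == 0xf3;
}

/* Returns false if the bytes run out before any opcode byte is reached. */
static bool scan_prefixes(const uint8_t *insn, size_t len,
			  struct insn_prefixes *out)
{
	size_t i = 0;

	out->segment = 0;
	out->rex = 0;
	while (i < len && (is_segment_prefix(insn[i]) ||
			   is_other_legacy_prefix(insn[i]))) {
		if (is_segment_prefix(insn[i]))
			out->segment = insn[i];
		i++;
	}
	if (i < len && (insn[i] & 0xf0) == 0x40)
		out->rex = insn[i++];
	out->nbytes = i;
	return i < len;
}

/* In 64-bit mode, mod == 00 with r/m == 101 means disp32(%rip). */
static bool modrm_is_rip_relative(uint8_t modrm)
{
	return (modrm & 0xc7) == 0x05;
}
```

Working out which registers an arbitrary instruction leaves untouched
would need the full opcode map on top of this, which is the extra
decoding cost discussed above.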

> We were thinking in terms of a per-process page that's automatically set
> up at exec time.  There's no dso involved in our approach, but the
> "vdso" reference has been hard to kill.
[...]
> Our current approach uses a fixed-size area (1 page for now) that's
> allocated at exec time.  
[...]
> Sounds good, although I personally don't know the incantation for
> putting the vma there.  Any help would be appreciated.
[...]
> I'm not so worried about the visibility of the area in /proc/*/maps and
> such; protecting it from munmap & friends seems more of a concern.

You can't do anything about that without either not using a proper vma, or
adding some new VM_* flag and making the kernel enforce that normal user
calls can't change it (which might as well be the flag that says to hide it
in /proc too).

Also note that ptrace or suchlike can always come and modify the page too,
even if the user has not made it writable.  So whatever bits you store on
that page, you must always use with care in the kernel.  Scrambling that
page must not be able to produce any bad effects in the kernel, nothing
worse than a scrambled user context in a thread in that address space.

The patch you posted is a non-starter.  I think I get now why you keep
thinking "vdso"--you mean an unaccounted mapping of an unaccounted page
that will never be paged out.  This is several kinds of bad, I won't go
into the details of why.  I'm sorry I wasn't more explicit about how to
keep it simple.  For not even trying to do any special hiding magic, it
didn't occur to me that you'd do anything but this:

	#define SLOT_SIZE		...
	#define	SLOT_AREA_SIZE		PAGE_ALIGN(NR_CPUS * SLOT_SIZE)

		struct mm_struct *mm = current->mm;
		unsigned long addr;

		down_write(&mm->mmap_sem);
		/*
		 * Find the end of the top mapping and skip a page.
		 * If there is no space for SLOT_AREA_SIZE above
		 * that, mmap will ignore our address hint.
		 */
		addr = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct,
				vm_rb)->vm_end + PAGE_SIZE;

		addr = do_mmap_pgoff(NULL, addr, SLOT_AREA_SIZE, PROT_EXEC,
				     MAP_PRIVATE|MAP_ANONYMOUS, 0);
		if (addr &~ PAGE_MASK)
			... -errno = addr ...;
		up_write(&mm->mmap_sem);

If we think up useful tweaks to make the vma more special, add (before
up_write):

		vma = find_vma(mm, addr);
		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; // or whatever

Also, doing preemptive allocation at exec time does not wash with me.  Your
version has an extra unpageable page per process as well, but with a normal
allocation it's still a gratuitous vma per process.  Most processes will
never be probed.  I don't think this universal overhead is warranted.
Allocating on demand at first probe insertion makes sense to me.  Using the
top address area means it's unlikely you'll ever interfere with normal
mappings anyway, and if somehow none is available at insertion time, then
tough, you don't insert.
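
The on-demand policy can be sketched in user space, assuming mmap(2)
stands in for do_mmap_pgoff() and a hypothetical probe_state struct
stands in for whatever per-process state the probe code keeps.  This
is an analogue for illustration, not kernel code:

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical per-process probe state; slot_area is NULL until needed. */
struct probe_state {
	void  *slot_area;
	size_t slot_area_size;
};

/*
 * Return the slot area, mapping it on the first call only.  Returns
 * NULL if no mapping is available, in which case the caller simply
 * refuses to insert the probe.
 */
static void *slot_area_get(struct probe_state *st, size_t size)
{
	void *p;

	if (st->slot_area)
		return st->slot_area;

	p = mmap(NULL, size, PROT_EXEC,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	st->slot_area = p;
	st->slot_area_size = size;
	return p;
}
```

A process that is never probed never calls slot_area_get() and so
never pays for the extra mapping.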

Sorry, I really thought the vma was the trivial part of this and not the
interesting one.  I'd like to see the robust instruction decoding work.


Thanks,
Roland


* Re: breakpoint assistance: single-step out of line
  2007-03-29  4:40   ` Roland McGrath
@ 2007-03-30 23:53     ` Jim Keniston
  0 siblings, 0 replies; 10+ messages in thread
From: Jim Keniston @ 2007-03-30 23:53 UTC (permalink / raw)
  To: Roland McGrath; +Cc: systemtap

On Wed, 2007-03-28 at 21:40 -0700, Roland McGrath wrote:
> ...
> I'm sorry I wasn't more explicit about how to
> keep it simple.  For not even trying to do any special hiding magic, it
> didn't occur to me that you'd do anything but this:
> 
> 	#define SLOT_SIZE		...
> 	#define	SLOT_AREA_SIZE		PAGE_ALIGN(NR_CPUS * SLOT_SIZE)

We've considered per-CPU slots.  Can you guarantee that the probed
thread can't migrate to another CPU (or be preempted) between the time
we store the instruction in the slot, return from our report_signal
callback (which handles the breakpoint trap), and single-step the
instruction?  (Assume that our callback doesn't sleep in that interval.)
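
For reference, the per-CPU layout implied by the quoted SLOT_AREA_SIZE
works out as in the sketch below.  SLOT_SIZE is left unspecified in
the quoted code, so the slot size is a parameter here; the 4096-byte
page size and the function names are assumptions made for this
example only:

```c
#include <stddef.h>

#define EX_PAGE_SIZE	4096UL	/* assumed page size for this example */
#define EX_PAGE_ALIGN(x) \
	(((x) + EX_PAGE_SIZE - 1) & ~(EX_PAGE_SIZE - 1))

/* Size of the whole area: one slot per CPU, rounded up to full pages. */
static unsigned long slot_area_size(unsigned long nr_cpus,
				    unsigned long slot_size)
{
	return EX_PAGE_ALIGN(nr_cpus * slot_size);
}

/* Address of the slot belonging to a given CPU. */
static unsigned long slot_addr(unsigned long base, unsigned long cpu,
			       unsigned long slot_size)
{
	return base + cpu * slot_size;
}
```

The slot address depends only on the CPU number, so the copy is safe
only if the thread stays on that CPU, without another thread running
there, from the store until the single-step completes.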
> 
> 		struct mm_struct *mm = current->mm;
> 		unsigned long addr;
> 
> 		down_write(&mm->mmap_sem);
> 		/*
> 		 * Find the end of the top mapping and skip a page.
> 		 * If there is no space for SLOT_AREA_SIZE above
> 		 * that, mmap will ignore our address hint.
> 		 */
> 		addr = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct,
> 				vm_rb)->vm_end + PAGE_SIZE;
> 
> 		addr = do_mmap_pgoff(NULL, addr, SLOT_AREA_SIZE, PROT_EXEC,
> 				     MAP_PRIVATE|MAP_ANONYMOUS, 0);
> 		if (addr &~ PAGE_MASK)
> 			... -errno = addr ...;
> 		up_write(&mm->mmap_sem);

We'll try this.  I looked at do_mmap_pgoff() before, and my eyes glazed
over after about 200 lines.  At least I understand what Prasanna's code
does (though I don't know all the implications of what it DOESN'T do).

do_mmap_pgoff() has to be run by the probed process, so it can't be run
by the process that registers the first probe (typically insmod).  We
could do it while handling the first probe hit, or when we quiesce the
process for the first probepoint insertion.  Any preference?  I've been
thinking about NOT quiescing for the general case of probepoint
insertion/removal for i386, x86_64, powerpc, ...

> 
> If we think up useful tweaks to make the vma more special, add (before
> up_write):
> 
> 		vma = find_vma(mm, addr);
> 		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; // or whatever
> 
> Also, doing preemptive allocation at exec time does not wash with me.  Your
> version has an extra unpageable page per process as well, 

Yeah, fixed that.

> but with a normal
> allocation it's still a gratuitous vma per process.  Most processes will
> never be probed.  I don't think this universal overhead is warranted.
> Allocating on demand at first probe insertion makes sense to me.  Using the
> top address area means it's unlikely you'll ever interfere with normal
> mappings anyway, and if somehow none is available at insertion time, then
> tough, you don't insert.
> 
> Sorry, I really thought the vma was the trivial part of this and not the
> interesting one.  I'd like to see the robust instruction decoding work.
> 
> 
> Thanks,
> Roland

Jim


