* breakpoint assistance: single-step out of line
From: Roland McGrath @ 2007-03-14 19:27 UTC
To: systemtap

The method of single-stepping over an out of line copy of the instruction
clobbered by breakpoint insertion has been proven by kprobes. The
complexities are mitigated in that implementation by the constrained
context of the kernel and the fixed subset of possible machine code known
to validly occur in any kernel or module text.

There are two core problem areas in implementing single-step out of line
for user mode code. These are where to store the out of line copies, and
arch issues with instruction semantics.

Starting with arch issues, I'll talk about the only ones I know in detail,
which are x86 and x86_64. kprobes has done the basic work here. For the
user mode context, on the one hand the risks of munging an instruction's
behavior are confined to the user address space in question, but on the
other hand we have to deal robustly with the full range of instructions
that can be executed on the processor in user mode.

Instruction decoding needs to be robust, not presume the canonical subset
of encodings normally produced by the compiler, as used in the kernel. On
machines other than x86, this tends to be quite simple. On x86, it means
parsing all the instruction prefixes correctly and so forth. I think the
parsing should be done at breakpoint insertion time, caching just a few
bits saying what fixup strategy we need to use after the step. If we can't
positively decode the instruction enough to be confident that we know how
to fix it up, refuse to insert the breakpoint. (If it's an invalid
instruction, you don't need a breakpoint because you'll get a trap anyway.)

The instructions of concern are those that refer to the PC.
On 32-bit x86, these are only the few control flow instructions.
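To make the insertion-time parsing concrete, here is a minimal sketch in C of skipping the 32-bit x86 legacy prefixes and caching a fixup class. The enum, the name classify_insn32, and the deliberately tiny opcode allowlist are assumptions made for illustration; this is not the kprobes or uprobes code. Anything the sketch cannot place with confidence is refused, in the spirit of the proposal above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup classes cached at breakpoint-insertion time. */
enum fixup {
    FIXUP_REFUSE,     /* not positively decoded: do not insert a breakpoint */
    FIXUP_ADJUST_IP,  /* after the step, shift ip from the copy to the original */
    FIXUP_CALL_REL,   /* as above, and also correct the pushed return address */
    FIXUP_ABS_IP      /* ip is already absolute after the step (ret) */
};

/* Legacy prefixes of 32-bit x86; they may appear in any order. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:             /* lock, repne, rep */
    case 0x2E: case 0x36: case 0x3E: case 0x26:  /* cs, ss, ds, es */
    case 0x64: case 0x65:                        /* fs, gs */
    case 0x66: case 0x67:                        /* operand size, address size */
        return 1;
    default:
        return 0;
    }
}

/* Classify the instruction starting at insn[0]; avail is the byte count we may read. */
enum fixup classify_insn32(const uint8_t *insn, size_t avail)
{
    size_t i = 0;
    int saw_lock = 0;
    uint8_t op;

    while (i < avail && is_legacy_prefix(insn[i])) {
        if (insn[i] == 0xF0)
            saw_lock = 1;
        i++;
    }
    /* Only prefixes were found, or the instruction would exceed 15 bytes. */
    if (i >= avail || i >= 15)
        return FIXUP_REFUSE;

    op = insn[i];

    /* Control flow: this sketch accepts only the unprefixed forms. */
    if (op == 0xE8)                               /* call rel32 */
        return i == 0 ? FIXUP_CALL_REL : FIXUP_REFUSE;
    if (op == 0xC3 || op == 0xC2)                 /* ret, ret imm16 */
        return i == 0 ? FIXUP_ABS_IP : FIXUP_REFUSE;
    if (op == 0xE9 || op == 0xEB || (op >= 0x70 && op <= 0x7F))
        return i == 0 ? FIXUP_ADJUST_IP : FIXUP_REFUSE;  /* jmp/jcc relative */

    /* A few plain instructions: nop and push/pop of a register. */
    if (op == 0x90 || (op >= 0x50 && op <= 0x5F))
        return saw_lock ? FIXUP_REFUSE : FIXUP_ADJUST_IP;

    return FIXUP_REFUSE;  /* everything else is outside the allowlist */
}
```

The returned class is the "few bits" that would be stored with the breakpoint, so the post-step path never has to look at the instruction bytes again. A real decoder would have to cover the whole opcode map rather than this handful of cases.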
On x86_64, there is also %rip-relative addressing. We cannot presume addresses are within the same 4GB window so that the displacement can just be adjusted, as we do in the kernel. However, we can use some other tricks. The only instruction that computes a %rip-relative address as a result is lea. It is not difficult to recognize that one and just emulate it outright; there are only a few variations of address-size, data-size, and output register. It's not much easier to fix it up after the step. Unless I'm overlooking something, all other %rip-relative uses are implicit in the effective address for a memory access. For these, we can use the fs or gs segment prefix on the copied instruction, and adjust the displacement and the fs or gs base value to come up with the original target address. In the unlikely event that the instruction already uses the fs or gs prefix, just adjust the appropriate base value and use the instruction as it is. Otherwise, insert a gs prefix in the copied instruction, and set the gs base to the difference between the address of the copy (after the inserted prefix) and the breakpoint address. It is a little costly to set the fs or gs base value and reset it after the step, much more than setting a register in the trap frame; but it's probably not too bad. Next we come to the problem of where to store copied instructions for stepping. The idea of stealing a stack page for this is a non-starter. For both security and robustness, it's never acceptable to introduce a user mapping that is both writable and executable, even temporarily. We need to use an otherwise unused page in the address space, that will be read/execute only for the user, we can write to it only from kernel mode. In some meeting notes I've seen mention of "do what the vdso does". I don't know what this referred to specifically. There are two things this might mean, and those are the two main options I see. 
What the i386 vDSO used to do (CONFIG_COMPAT_VDSO), what the ia64 vDSO does, and what the x86-64 vsyscall page does (not a vDSO but similar), is the fixmap area. What the i386 vDSO, the ia32 vDSO on x86_64, and the powerpc vDSO do, is insert a vma. The fixmap area is a region of address space that shares some page tables across all tasks in the system. The advantages are that it has no vm setup cost since it is done once at boot, and that it is completely outside the range of virtual addresses the user task can map normally and so does not perturb any mapping behavior or appear in /proc/PID/maps or via access_process_vm or such things that might have unintended side effects on the user process. On 32-bit x86, a disadvantage is that when NX page protection is not available (older CPUs or non-PAE kernel builds), the exec-shield approximation of NX protection via segmentation is defeated by having an executable page high in the address space; this can be worked around on the exec-shield kernel with some extra effort. Other machines may not already have an analogous region of reserved address space where a page can be made user-readable/executable. Other potential disadvantages are the fixed amount of space (chosen at compile-time or boot-time, with some small limit on the number of pages available), and the security implications of global pages visible to all users on the system. The limited size might mean that slots need to be assigned only momentarily while doing the step, meaning fresh icache flushing every time. Then you'd ideally use only one slot per CPU, but that needs some work to be right given preemption. The briefness of this window may mitigate the security concerns, but still there are a few bytes of information about a traced thread leaking to anyone in the system who wants to try to see them. 
The setup every time necessitated by the fixed space is costly, but on the other hand its CPU use scales linearly with more breakpoints and more occurrences and its memory use stays constant, compared to open-ended allocation scaling with the number of breakpoints. Inserting a vma means essentially doing an mmap from inside the kernel. Both the advantages and the disadvantages of this stem from its normalcy. Any stray mmap/munmap/mprotect call from the user might wind up clobbering this mapping. It appears in /proc/PID/maps and will become known to other debugging facilities tracing the process, so they will think it's a normal user allocation; it might appear in core dumps. This might have other bad effects on processes that look at their own maps file to see what heap pages there are, which some GC libraries or suchlike might well do. The mapping also has subtler effects perturbing the process's own mapping behavior, which could introduce anomalies or even break some programs that need a lot of control over their address space. The advantages are that it's straightforward to implement and easy to be sure that it does the right thing vis a vis paging and so forth, it provides the option of using an open-ended amount of storage to optimize the use of many breakpoints, and it's wholly private to the user address space in question. A third option I didn't mention before is doing something in the page tables behind the vm system's back (this as distinct, and somewhat simpler than, the fancy VM ideas like per-thread page tables). I don't know enough about this to comment in detail. The attraction is that it would avoid some of the interactions I just mentioned with vma's, and might have lower overhead to set up. It might be difficult to make this do reasonable things about paging and such. This is probably not a good bet, but I don't know much about it. The fixmap is somewhat attractive at least for x86, x86-64, and ia64. 
It's nice that it doesn't interact with the normal user address range and set of visible mappings. The overhead of resetting and icache flushing an instruction slot on every use is less than the uprobes prototype using a stack page already has. I don't know if the performance of that will be good enough in the long run, or if priming a slot once and using it repeatedly will perform enough better that we care about this overhead. The vma is the most straightforward thing to implement, and is generic across machines. It makes sense to implement this first generically and then experiment later with the fixmap approach as an arch-specific alternative. The stack randomization done on at least x86/x86-64 means that there is normally a good little stretch of address space free above the stack vma (the top part of which holds environ and auxv). (Just try "tail -1 /proc/self/maps" a few times.) This area is unlikely to conflict with address space the user's own mappings would ever have considered. Allocating at one page above the end of the stack vma (leaving a red zone) seems good. I'm really more concerned about things monitoring the mappings. Perhaps we could add a VM_* flag that says to omit the vma from listings, but I don't know how that would be received by kernel people, let alone a flag to disallow user munmap/mmap/mprotect calls to change a mapping. I can go into further detail on how I envision implementing the vma and/or fixmap plans if it is not clear. Thanks, Roland ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: breakpoint assistance: single-step out of line 2007-03-14 19:27 breakpoint assistance: single-step out of line Roland McGrath @ 2007-03-16 1:06 ` Jim Keniston 2007-03-29 4:40 ` Roland McGrath 2007-03-16 14:09 ` Frank Ch. Eigler 1 sibling, 1 reply; 10+ messages in thread From: Jim Keniston @ 2007-03-16 1:06 UTC (permalink / raw) To: Roland McGrath; +Cc: systemtap On Sun, 2007-03-04 at 13:38 -0800, Roland McGrath wrote: > The method of single-stepping over an out of line copy of the instruction > clobbered by breakpoint insertion has been proven by kprobes. The > complexities are mitigated in that implementation by the constrained > context of the kernel and the fixed subset of possible machine code known > to validly occur in any kernel or module text. > > There are two core problem areas in implementing single-step out of line > for user mode code. These are where to store the out of line copies, and > arch issues with instruction semantics. > > > Starting with arch issues, I'll talk about the only ones I know in detail, > which are x86 and x86_64. kprobes has done the basic work here. For the > user mode context, on the one hand the risks of munging an instruction's > behavior are confined to the user address space in question, but on the > other hand we have to deal robustly with the full range of instructions > that can be executed on the processor in user mode. Yes. > > Instruction decoding needs to be robust, not presume the canonical subset > of encodings normally produced by the compiler, as used in the kernel. On > machines other than x86, this tends to be quite simple. On x86, it means > parsing all the instruction prefixes correctly and so forth. I think the > parsing should be done at breakpoint insertion time, caching just a few > bits saying what fixup strategy we need to use after the step. I guess that depends on how complicated the switch(opcode) { ... } code in uprobe_resume_execution() gets. 
Parsing the instruction at probe-insertion time is essential for x86_64, at least partly because of rip-relative addressing, as you discuss below. > If we can't > positively decode the instruction enough to be confident that we know how > to fix it up, refuse to insert the breakpoint. Yes. > (If it's an invalid > instruction, you don't need a breakpoint because you'll get a trap anyway.) > > The instructions of concern are those that refer to the PC. > On 32-bit x86, these are only the few control flow instructions. > > On x86_64, there is also %rip-relative addressing. We cannot presume > addresses are within the same 4GB window so that the displacement can just > be adjusted, as we do in the kernel. Yes. > However, we can use some other > tricks. The only instruction that computes a %rip-relative address as a > result is lea. It is not difficult to recognize that one and just emulate > it outright; there are only a few variations of address-size, data-size, > and output register. It's not much easier to fix it up after the step. > > Unless I'm overlooking something, all other %rip-relative uses are implicit > in the effective address for a memory access. I think that's the case, but we haven't done a thorough review of the instruction list. > For these, we can use the fs > or gs segment prefix on the copied instruction, and adjust the displacement > and the fs or gs base value to come up with the original target address. > In the unlikely event that the instruction already uses the fs or gs > prefix, just adjust the appropriate base value and use the instruction as > it is. Otherwise, insert a gs prefix in the copied instruction, and set > the gs base to the difference between the address of the copy (after the > inserted prefix) and the breakpoint address. It is a little costly to set > the fs or gs base value and reset it after the step, much more than setting > a register in the trap frame; but it's probably not too bad. 
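The gs-base arithmetic in the quoted paragraph can be checked with a small sketch in C. The struct and function names are hypothetical, and the sketch assumes the displacement is left unchanged, that a one-byte gs prefix is inserted at the very front of the copy, and that the segment base and the %rip-relative address are summed modulo 2^64. Under those assumptions the base that makes the copy hit the original target is the breakpoint address minus the address of the copy after the inserted prefix.

```c
#include <stdint.h>

/* Hypothetical description of a %rip-relative instruction at a probepoint. */
struct rip_rel_insn {
    uint64_t orig_addr;  /* breakpoint address (start of the original insn) */
    uint8_t  len;        /* length of the original instruction in bytes */
    int32_t  disp;       /* its signed 32-bit displacement */
};

/* Address the original instruction accesses: next-instruction rip + disp. */
static uint64_t original_target(const struct rip_rel_insn *i)
{
    return i->orig_addr + i->len + (uint64_t)(int64_t)i->disp;
}

/* gs base to load while stepping a copy placed at copy_addr. */
static uint64_t gs_base_for_copy(const struct rip_rel_insn *i, uint64_t copy_addr)
{
    /* copy_addr + 1 is the address of the copy after the inserted prefix. */
    return i->orig_addr - (copy_addr + 1);
}

/* Address the prefixed copy accesses: gs base + next-instruction rip + disp. */
static uint64_t copy_target(const struct rip_rel_insn *i, uint64_t copy_addr,
                            uint64_t gs_base)
{
    uint64_t next_rip = copy_addr + 1 + i->len;  /* copy is one byte longer */
    return gs_base + next_rip + (uint64_t)(int64_t)i->disp;
}
```

For any copy address, copy_target() with the base from gs_base_for_copy() equals original_target(), even when the copy is more than 4GB away from the original, which is the case a plain displacement adjustment cannot handle.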
The approach we had in mind was to change the rip-relative instruction to an indirect instruction where the target address is in a scratch register (one not accessed by the original instruction). Save the value of the scratch register, load in the target address, single-step, and restore the scratch register's real value. This isn't coded yet. > > > Next we come to the problem of where to store copied instructions for > stepping. The idea of stealing a stack page for this is a non-starter. > For both security and robustness, it's never acceptable to introduce a user > mapping that is both writable and executable, even temporarily. We need to > use an otherwise unused page in the address space, that will be > read/execute only for the user, we can write to it only from kernel mode. As it turns out, this approach isn't very portable, either. The s390 and powerpc compilers regularly generate code that accesses data beyond the top-of-stack, so it's tough to find a "safe" page in the stack vma. > > In some meeting notes I've seen mention of "do what the vdso does". I > don't know what this referred to specifically. We were thinking in terms of a per-process page that's automatically set up at exec time. There's no dso involved in our approach, but the "vdso" reference has been hard to kill. > There are two things this > might mean, and those are the two main options I see. What the i386 vDSO > used to do (CONFIG_COMPAT_VDSO), what the ia64 vDSO does, and what the > x86-64 vsyscall page does (not a vDSO but similar), is the fixmap area. > What the i386 vDSO, the ia32 vDSO on x86_64, and the powerpc vDSO do, > is insert a vma. It's the latter. > > The fixmap area is a region of address space that shares some page tables > across all tasks in the system. 
The advantages are that it has no vm setup > cost since it is done once at boot, and that it is completely outside the > range of virtual addresses the user task can map normally and so does not > perturb any mapping behavior or appear in /proc/PID/maps or via > access_process_vm or such things that might have unintended side effects on > the user process. On 32-bit x86, a disadvantage is that when NX page > protection is not available (older CPUs or non-PAE kernel builds), the > exec-shield approximation of NX protection via segmentation is defeated by > having an executable page high in the address space; this can be worked > around on the exec-shield kernel with some extra effort. Other machines > may not already have an analogous region of reserved address space where a > page can be made user-readable/executable. Other potential disadvantages > are the fixed amount of space (chosen at compile-time or boot-time, with > some small limit on the number of pages available), and the security > implications of global pages visible to all users on the system. The > limited size might mean that slots need to be assigned only momentarily > while doing the step, meaning fresh icache flushing every time. Then you'd > ideally use only one slot per CPU, but that needs some work to be right > given preemption. The briefness of this window may mitigate the security > concerns, but still there are a few bytes of information about a traced > thread leaking to anyone in the system who wants to try to see them. The > setup every time necessitated by the fixed space is costly, but on the > other hand its CPU use scales linearly with more breakpoints and more > occurrences and its memory use stays constant, compared to open-ended > allocation scaling with the number of breakpoints. We haven't seriously considered the above approach. > > Inserting a vma means essentially doing an mmap from inside the kernel. > Both the advantages and the disadvantages of this stem from its normalcy. 
> Any stray mmap/munmap/mprotect call from the user might wind up clobbering
> this mapping.

Good point.

> It appears in /proc/PID/maps and will become known to other
> debugging facilities tracing the process, so they will think it's a normal
> user allocation; it might appear in core dumps. This might have other bad
> effects on processes that look at their own maps file to see what heap
> pages there are, which some GC libraries or suchlike might well do. The
> mapping also has subtler effects perturbing the process's own mapping
> behavior, which could introduce anomalies or even break some programs that
> need a lot of control over their address space. The advantages are that
> it's straightforward to implement and easy to be sure that it does the
> right thing vis a vis paging and so forth, it provides the option of using
> an open-ended amount of storage to optimize the use of many breakpoints,
> and it's wholly private to the user address space in question.

Our current approach uses a fixed-size area (1 page for now) that's
allocated at exec time. Instruction slots are allocated to probepoints
as they are hit, and a probepoint owns the slot until another probepoint
steals it. For x86[_64], 1 page gives us 256 slots. We would see
thrashing due to slot starvation only if the process is hitting more
than 256 different probepoints in a short time span.

We're still debugging this approach.

>
> A third option I didn't mention before is doing something in the page
> tables behind the vm system's back (this as distinct, and somewhat simpler
> than, the fancy VM ideas like per-thread page tables). I don't know enough
> about this to comment in detail. The attraction is that it would avoid
> some of the interactions I just mentioned with vma's, and might have lower
> overhead to set up. It might be difficult to make this do reasonable
> things about paging and such. This is probably not a good bet, but I don't
> know much about it.
I haven't thought about the above approach. > > The fixmap is somewhat attractive at least for x86, x86-64, and ia64. It's > nice that it doesn't interact with the normal user address range and set of > visible mappings. The overhead of resetting and icache flushing an > instruction slot on every use is less than the uprobes prototype using a > stack page already has. I don't know if the performance of that will be > good enough in the long run, or if priming a slot once and using it > repeatedly will perform enough better that we care about this overhead. We picked per-probepoint multiplexing because of icache issues (and because it seems best for single-threaded apps), but it turned out to be no more complex than per-thread or per-cpu muxing. > > The vma is the most straightforward thing to implement, and is generic > across machines. It makes sense to implement this first generically Oh, good. Glad we got that right. > and > then experiment later with the fixmap approach as an arch-specific > alternative. The stack randomization done on at least x86/x86-64 means > that there is normally a good little stretch of address space free above > the stack vma (the top part of which holds environ and auxv). (Just try > "tail -1 /proc/self/maps" a few times.) This area is unlikely to conflict > with address space the user's own mappings would ever have considered. > Allocating at one page above the end of the stack vma (leaving a red zone) > seems good. Sounds good, although I personally don't know the incantation for putting the vma there. Any help would be appreciated. > I'm really more concerned about things monitoring the > mappings. Perhaps we could add a VM_* flag that says to omit the vma from > listings, but I don't know how that would be received by kernel people, let > alone a flag to disallow user munmap/mmap/mprotect calls to change a mapping. 
I'm not so worried about the visibility of the area in /proc/*/maps and
such; protecting it from munmap & friends seems more of a concern.

>
> I can go into further detail on how I envision implementing the vma and/or
> fixmap plans if it is not clear.
>

We hope to post the above-described implementation Real Soon Now. Given
your interest, maybe we will even if we don't have it firing on all
cylinders.

>
> Thanks,
> Roland

Thanks again for your ideas and interest.
Jim
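The slot scheme described in this message (one page, 256 slots, a probepoint keeps its slot until another probepoint steals it) can be sketched in C as follows. All names are hypothetical. The 4096-byte page size and the round-robin choice of victim are assumptions; the message does not say how the victim is picked. With those numbers each slot is 16 bytes, enough for the 15-byte maximum x86 instruction.

```c
#include <stddef.h>

#define PAGE_BYTES 4096u                      /* assumed page size */
#define SLOT_BYTES 16u                        /* 4096 / 256 */
#define NR_SLOTS   (PAGE_BYTES / SLOT_BYTES)  /* 256 slots per page */

struct probepoint {
    int slot;  /* index of the owned slot, or -1 if it owns none */
};

struct slot_area {
    struct probepoint *owner[NR_SLOTS];  /* current owner of each slot */
    unsigned next_victim;                /* round-robin cursor */
};

void slot_area_init(struct slot_area *a)
{
    for (unsigned s = 0; s < NR_SLOTS; s++)
        a->owner[s] = NULL;
    a->next_victim = 0;
}

/* Called when a probepoint is hit: keep its slot, or take the next one. */
int slot_acquire(struct slot_area *a, struct probepoint *pp)
{
    unsigned s;

    if (pp->slot >= 0)
        return pp->slot;  /* still primed, no copy or icache flush needed */

    s = a->next_victim;
    a->next_victim = (s + 1) % NR_SLOTS;
    if (a->owner[s] != NULL)
        a->owner[s]->slot = -1;  /* steal: the old owner must re-acquire */
    a->owner[s] = pp;
    pp->slot = (int)s;
    return pp->slot;  /* caller copies the instruction into this slot */
}

/* User address of a slot, given the base address of the slot page. */
unsigned long slot_addr(unsigned long area_base, int slot)
{
    return area_base + (unsigned long)slot * SLOT_BYTES;
}
```

A probepoint that is hit repeatedly returns early from slot_acquire() and reuses its primed slot, which is the icache argument for per-probepoint multiplexing. Stealing starts only once more than 256 distinct probepoints have been hit.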
* Re: breakpoint assistance: single-step out of line
From: Roland McGrath @ 2007-03-29 4:40 UTC
To: Jim Keniston; +Cc: systemtap

> > Instruction decoding needs to be robust, not presume the canonical subset
> > of encodings normally produced by the compiler, as used in the kernel. On
> > machines other than x86, this tends to be quite simple. On x86, it means
> > parsing all the instruction prefixes correctly and so forth. I think the
> > parsing should be done at breakpoint insertion time, caching just a few
> > bits saying what fixup strategy we need to use after the step.
>
> I guess that depends on how complicated the switch(opcode) { ... } code
> in uprobe_resume_execution() gets. Parsing the instruction at
> probe-insertion time is essential for x86_64, at least partly because of
> rip-relative addressing, as you discuss below.

The parsing will always be more costly than checking a few bits. You're
always going to be doing it at insertion time anyway; there's no reason not
to cache the results of that across the board.

> The approach we had in mind was to change the rip-relative instruction
> to an indirect instruction where the target address is in a scratch
> register (one not accessed by the original instruction). Save the value
> of the scratch register, load in the target address, single-step, and
> restore the scratch register's real value. This isn't coded yet.

This requires decoding the instruction in more detail than we've done
before, to be sure of what register is free. Maybe that isn't really all
that hard, but I'm not sure--I think there are a lot of cases to be sure
what the target register is. By contrast, the segment prefix is very easy
to parse. The normal register fiddling is probably more efficient than the
fs/gs fiddling, but unless it's drastic I think keeping the instruction
decoder simpler is the overall win.

> We were thinking in terms of a per-process page that's automatically set
> up at exec time. There's no dso involved in our approach, but the
> "vdso" reference has been hard to kill.
[...]
> Our current approach uses a fixed-size area (1 page for now) that's
> allocated at exec time.
[...]
> Sounds good, although I personally don't know the incantation for
> putting the vma there. Any help would be appreciated.
[...]
> I'm not so worried about the visibility of the area in /proc/*/maps and
> such; protecting it from munmap & friends seems more of a concern.

You can't do anything about that without either not using a proper vma, or
adding some new VM_* flag and making the kernel enforce that normal user
calls can't change it (which might as well be the flag that says to hide it
in /proc too).

Also note that ptrace or suchlike can always come and modify the page too,
even if the user has not made it writable. So whatever bits you store on
that page, you must always use with care in the kernel. Scrambling that
page must not be able to produce any bad effects in the kernel, nothing
worse than a scrambled user context in a thread in that address space.

The patch you posted is a non-starter. I think I get now why you keep
thinking "vdso"--you mean an unaccounted mapping of an unaccounted page
that will never be paged out. This is several kinds of bad, I won't go
into the details of why. I'm sorry I wasn't more explicit about how to
keep it simple. For not even trying to do any special hiding magic, it
didn't occur to me that you'd do anything but this:

	#define SLOT_SIZE	...
	#define SLOT_AREA_SIZE	PAGE_ALIGN(NR_CPUS * SLOT_SIZE)

	struct mm_struct *mm = current->mm;
	unsigned long addr;

	down_write(&mm->mmap_sem);
	/*
	 * Find the end of the top mapping and skip a page.
	 * If there is no space for SLOT_AREA_SIZE above
	 * that, mmap will ignore our address hint.
	 */
	addr = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct,
			vm_rb)->vm_end + PAGE_SIZE;

	addr = do_mmap_pgoff(NULL, addr, SLOT_AREA_SIZE, PROT_EXEC,
			     MAP_PRIVATE|MAP_ANONYMOUS, 0);
	if (addr &~ PAGE_MASK)
		... -errno = addr ...;
	up_write(&mm->mmap_sem);

If we think up useful tweaks to make the vma more special, add (before
up_write):

	vma = find_vma(mm, addr);
	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; // or whatever

Also, doing preemptive allocation at exec time does not wash with me. Your
version has an extra unpageable page per process as well, but with a normal
allocation it's still a gratuitous vma per process. Most processes will
never be probed. I don't think this universal overhead is warranted.
Allocating on demand at first probe insertion makes sense to me. Using the
top address area means it's unlikely you'll ever interfere with normal
mappings anyway, and if somehow none is available at insertion time, then
tough, you don't insert.

Sorry, I really thought the vma was the trivial part of this and not the
interesting one. I'd like to see the robust instruction decoding work.


Thanks,
Roland
* Re: breakpoint assistance: single-step out of line
From: Jim Keniston @ 2007-03-30 23:53 UTC
To: Roland McGrath; +Cc: systemtap

On Wed, 2007-03-28 at 21:40 -0700, Roland McGrath wrote:
> ...
> I'm sorry I wasn't more explicit about how to
> keep it simple. For not even trying to do any special hiding magic, it
> didn't occur to me that you'd do anything but this:
>
> 	#define SLOT_SIZE	...
> 	#define SLOT_AREA_SIZE	PAGE_ALIGN(NR_CPUS * SLOT_SIZE)

We've considered per-CPU slots. Can you guarantee that the probed
thread can't migrate to another CPU (or be preempted) between the time
we store the instruction in the slot, return from our report_signal
callback (which handles the breakpoint trap), and single-step the
instruction? (Assume that our callback doesn't sleep in that interval.)

>
> 	struct mm_struct *mm = current->mm;
> 	unsigned long addr;
>
> 	down_write(&mm->mmap_sem);
> 	/*
> 	 * Find the end of the top mapping and skip a page.
> 	 * If there is no space for SLOT_AREA_SIZE above
> 	 * that, mmap will ignore our address hint.
> 	 */
> 	addr = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct,
> 			vm_rb)->vm_end + PAGE_SIZE;
>
> 	addr = do_mmap_pgoff(NULL, addr, SLOT_AREA_SIZE, PROT_EXEC,
> 			     MAP_PRIVATE|MAP_ANONYMOUS, 0);
> 	if (addr &~ PAGE_MASK)
> 		... -errno = addr ...;
> 	up_write(&mm->mmap_sem);

We'll try this. I looked at do_mmap_pgoff() before, and my eyes glazed
over after about 200 lines. At least I understand what Prasanna's code
does (though I don't know all the implications of what it DOESN'T do).

do_mmap_pgoff() has to be run by the probed process, so it can't be run
by the process that registers the first probe (typically insmod). We
could do it while handling the first probe hit, or when we quiesce the
process for the first probepoint insertion. Any preference? I've been
thinking about NOT quiescing for the general case of probepoint
insertion/removal for i386, x86_64, powerpc, ...

>
> If we think up useful tweaks to make the vma more special, add (before
> up_write):
>
> 	vma = find_vma(mm, addr);
> 	vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; // or whatever
>
> Also, doing preemptive allocation at exec time does not wash with me. Your
> version has an extra unpageable page per process as well,

Yeah, fixed that.

> but with a normal
> allocation it's still a gratuitous vma per process. Most processes will
> never be probed. I don't think this universal overhead is warranted.
> Allocating on demand at first probe insertion makes sense to me. Using the
> top address area means it's unlikely you'll ever interfere with normal
> mappings anyway, and if somehow none is available at insertion time, then
> tough, you don't insert.
>
> Sorry, I really thought the vma was the trivial part of this and not the
> interesting one. I'd like to see the robust instruction decoding work.
>
>
> Thanks,
> Roland

Jim
* Re: breakpoint assistance: single-step out of line
From: Frank Ch. Eigler @ 2007-03-16 14:09 UTC
To: Roland McGrath; +Cc: systemtap

Roland McGrath <roland@redhat.com> writes:

> The method of single-stepping over an out of line copy of the
> instruction clobbered by breakpoint insertion has been proven by
> kprobes. The complexities are mitigated in that implementation by
> the constrained context of the kernel and the fixed subset of
> possible machine code known to validly occur in any kernel or module
> text.

Another important aspect is that userspace may be hostile. Beyond
just containing oddball instruction sequences, it may deliberately
rewrite its own .text, or otherwise interfere with probing in order to
produce crashes or security breaches.

- FChE
* Re: breakpoint assistance: single-step out of line
From: Jim Keniston @ 2007-03-16 18:54 UTC
To: Frank Ch. Eigler; +Cc: Roland McGrath, systemtap

On Fri, 2007-03-16 at 10:09 -0400, Frank Ch. Eigler wrote:
> Roland McGrath <roland@redhat.com> writes:
>
> > The method of single-stepping over an out of line copy of the
> > instruction clobbered by breakpoint insertion has been proven by
> > kprobes. The complexities are mitigated in that implementation by
> > the constrained context of the kernel and the fixed subset of
> > possible machine code known to validly occur in any kernel or module
> > text.
>
> Another important aspect is that userspace may be hostile. Beyond
> just containing oddball instruction sequences, it may deliberately
> rewrite its own .text, or otherwise interfere with probing in order to
> produce crashes or security breaches.

Under what circumstances can a user program rewrite its own text?

>
> - FChE

Jim
* Re: breakpoint assistance: single-step out of line
From: Frank Ch. Eigler @ 2007-03-16 19:02 UTC
To: Jim Keniston; +Cc: systemtap

Hi -

On Fri, Mar 16, 2007 at 10:54:15AM -0700, Jim Keniston wrote:
> [...]
> Under what circumstances can a user program rewrite its own text?

After an mprotect? Or someone else doing a ptrace write?

- FChE
* Re: breakpoint assistance: single-step out of line
From: Jim Keniston @ 2007-03-16 21:00 UTC
To: Frank Ch. Eigler; +Cc: Roland McGrath, systemtap

On Fri, 2007-03-16 at 10:54 -0700, Jim Keniston wrote:
> On Fri, 2007-03-16 at 10:09 -0400, Frank Ch. Eigler wrote:
> > Roland McGrath <roland@redhat.com> writes:
> >
> > > The method of single-stepping over an out of line copy of the instruction clobbered by breakpoint insertion has been proven by kprobes. The complexities are mitigated in that implementation by the constrained context of the kernel and the fixed subset of possible machine code known to validly occur in any kernel or module text.
> >
> > Another important aspect is that userspace may be hostile. Beyond just containing oddball instruction sequences, it may deliberately rewrite its own .text, or otherwise interfere with probing in order to produce crashes or security breaches.
>
> Under what circumstances can a user program rewrite its own text?

Frank answered:
> After an mprotect?

Indeed. I had to try it to believe it.

> Or someone else doing a ptrace write?

Yes.

OK, here's my next dumb question. How do you envision a user process exploiting uprobes to mess up anything but itself (or its ptraced child) in a novel way?

It could insert breakpoint instructions, but uprobes would let the resulting SIGTRAPs pass through because they don't match known probepoints. It could remove breakpoint instructions inserted by uprobes, but then the probepoints would never be hit and uprobes would never get involved. It could scribble on the SSOL instruction slots or the uretprobe trampoline, with the result that the wrong USER-mode instruction gets executed, but I don't see how that's anything new that the kernel can't handle.
I'm not arguing (confidently, anyway) that there are no such exploits. If you can think of any, I'd like to know them. Thanks.

Jim
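The pass-through behavior described in this message, where a breakpoint trap that matches no known probepoint is left to become an ordinary SIGTRAP, could look roughly like the following. This is a sketch under assumptions, not the uprobes code: the sorted-table layout and the names `probe_table`, `probe_table_contains`, and `classify_trap` are hypothetical, and the `- 1` adjustment reflects x86, where the saved PC points just past the one-byte int3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What to do with a breakpoint trap taken in user mode. */
enum trap_action {
    TRAP_HANDLE_PROBE,  /* address is a registered probepoint */
    TRAP_PASS_SIGTRAP   /* not ours: deliver SIGTRAP as usual */
};

/* Hypothetical per-process table of probed addresses, sorted ascending. */
struct probe_table {
    const uintptr_t *addrs;
    size_t count;
};

/* Binary search for addr in the sorted table; returns 1 if found, else 0. */
static int probe_table_contains(const struct probe_table *t, uintptr_t addr)
{
    assert(t != NULL && (t->count == 0 || t->addrs != NULL));

    size_t lo = 0, hi = t->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (t->addrs[mid] == addr)
            return 1;
        if (t->addrs[mid] < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/*
 * Decide whether a trap belongs to a known probepoint. trap_pc is the
 * saved user PC; on x86 it is one byte past the int3 that trapped.
 */
static enum trap_action classify_trap(const struct probe_table *t,
                                      uintptr_t trap_pc)
{
    uintptr_t bp_addr = trap_pc - 1;
    return probe_table_contains(t, bp_addr) ? TRAP_HANDLE_PROBE
                                            : TRAP_PASS_SIGTRAP;
}
```

With probepoints registered at 0x1000, 0x2000, and 0x3000, a trap whose saved PC is 0x2001 is classified as a probe hit, while a breakpoint the process planted itself at any other address falls through as a plain SIGTRAP.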
* Re: breakpoint assistance: single-step out of line
From: Frank Ch. Eigler @ 2007-03-16 21:56 UTC
To: systemtap

Jim Keniston <jkenisto@us.ibm.com> writes:

> [...]
> > Under what circumstances can a user program rewrite its own text?
>
> Frank answered:
> > After an mprotect?
> Indeed. I had to try it to believe it.

Likewise!

> [...] OK, here's my next dumb question. How do you envision a user process exploiting uprobes to mess up anything but itself (or its ptraced child) in a novel way?

I don't have a specific scenario in mind. One just needs to distrust all the data coming from user space. For example, the instructions being disassembled for out-of-line single-stepping must be carefully analyzed, so it cannot hit shady corner cases. The single-stepping must be done in minimum-privilege state. The restoration of the instruction byte under the breakpoint might need to assert that it is unchanged, or perhaps outright block its attempted change somehow.

This one is an old shangri-la saw, but it may be desirable to block visibility of the breakpoint itself, to make a systemtap session relatively invisible.

- FChE
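The suggestion of asserting that the byte under the breakpoint is unchanged before restoring it can be sketched as follows. This is a user-space model over a plain buffer, not kernel code: `struct probepoint`, `insert_breakpoint`, and `remove_breakpoint` are hypothetical names, and 0xCC is the x86 int3 opcode.

```c
#include <assert.h>
#include <stddef.h>

#define BREAKPOINT_OPCODE 0xCC  /* x86 int3 */

/* Hypothetical record of one inserted breakpoint. */
struct probepoint {
    size_t offset;             /* position of the breakpoint in the text */
    unsigned char saved_byte;  /* original byte that the breakpoint replaced */
};

/* Save the original byte and write the breakpoint. Returns 0, or -1 if
 * offset is out of range. */
static int insert_breakpoint(unsigned char *text, size_t len, size_t offset,
                             struct probepoint *pp)
{
    assert(text != NULL && pp != NULL);
    if (offset >= len)
        return -1;
    pp->offset = offset;
    pp->saved_byte = text[offset];
    text[offset] = BREAKPOINT_OPCODE;
    return 0;
}

/*
 * Restore the original byte, but only if the breakpoint is still in place.
 * If something else has rewritten that byte in the meantime, leave it alone
 * and return -1 so the caller can report the probe as disturbed.
 */
static int remove_breakpoint(unsigned char *text, size_t len,
                             const struct probepoint *pp)
{
    assert(text != NULL && pp != NULL);
    if (pp->offset >= len)
        return -1;
    if (text[pp->offset] != BREAKPOINT_OPCODE)
        return -1;
    text[pp->offset] = pp->saved_byte;
    return 0;
}
```

Refusing to write when the check fails avoids stamping a stale byte over whatever the process put there. The alternative mentioned in the message, blocking the change outright, would need cooperation from the memory-management side and is not modeled here.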
* Re: breakpoint assistance: single-step out of line
From: Roland McGrath @ 2007-03-20 8:08 UTC
To: Frank Ch. Eigler; +Cc: systemtap

> Another important aspect is that userspace may be hostile. Beyond just containing oddball instruction sequences, it may deliberately rewrite its own .text, or otherwise interfere with probing in order to produce crashes or security breaches.

Indeed. Another way to say what I meant to express about this is that the kernel implementation must be robust to any potential user-mode activity and protect the kernel and other user address spaces from any side effects. (That's what a kernel is for, after all.) At worst, some pathological or hostile case that might scramble things so they work differently than they would have in the absence of the single-stepping procedure, should be able to scramble or crash only the probed user address space. Anything less is a dangerous bug unacceptable for any production system. The ideal we strive for is that at worst, a probe will have to be backed out and produce some orderly error at the script level, but not at all disrupt the normal (pathological) behavior of the user task.

> This one is an old shangri-la saw, but it may be desirable to block visibility of the breakpoint itself, to make a systemtap session relatively invisible.

In the aforementioned "fancy VM tricks" shangri-la, you get this too. That is, the process memory as seen by debuggers and core dumps and such will have original pages rather than breakpoint-touched ones. With the final demise of segmentation, I don't think there is any way at all in the hardware to prevent normal memory reads by the process itself from seeing the breakpoint. Of course you can mess around for the one single-stepped copied instruction, but for every other instruction you are SOL.
Thanks,
Roland