public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* RE: [3/3] Userspace probes prototype-take2
@ 2006-02-20  3:32 Zhang, Yanmin
  2006-02-20  5:07 ` Prasanna S Panchamukhi
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Yanmin @ 2006-02-20  3:32 UTC (permalink / raw)
  To: prasanna; +Cc: systemtap

>>-----Original Message-----
>>From: Zhang, Yanmin
>>Sent: 2006年2月20日 11:16
>>To: Zhang, Yanmin; prasanna@in.ibm.com; systemtap@sources.redhat.com
>>Subject: RE: [3/3] Userspace probes prototype-take2
>>
>>I lost an important comment. The patch is not aware of signal processing. After kernel prepares the single-step-inst on the stack, if
>>a signal is delivered to the thread, kernel will save some states into stack and switch to signal handler function, so single-step-inst
>>on the stack might be erased.
>>
>>>>-----Original Message-----
>>>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Zhang, Yanmin
>>>>Sent: 2006年2月17日 17:20
>>>>To: prasanna@in.ibm.com; systemtap@sources.redhat.com
>>>>Subject: RE: [3/3] Userspace probes prototype-take2
>>>>
>>>>2 main issues:
>>>>1) task switch caused by external interrupt when single-step;
[YM] I think we could resolve this problem. Kernel probe has some differences from uprobe. One of them is that we couldn't estimate if kernel probe happens in process context, or interrupt context, while uprobe always happens in process context (user space). So from some points of view, uprobe could be simplified from kernel probe.
a) Don't use kcb (kprobe_ctlblk) if uprobe is triggered. Create new functions, kprobe__handler_user, kprobe_fault_handler_user and other handlers. In the new functions, instead of kcb, we could use uprobe_page being allocated dynamically. Considering signal action handler (possible uprobe nested), a thread might have a list of uprobe_page.
b) Delete current_uprobe;



>>>>2) multi-thread:
[YM] We could resolve this problem.
a) Don't call replace_orignal_insn in function uprobe_single_step. It might cause a race condition.
b) Delete copy_insn_on_new_page;
c) Merge copy_insn_onstack and copy_insn_onexpstack. The single-step-insn address could be esp-sizeof(long long)-MAX_INSN_SIZE*sizeof(kprobe_opcode_t). 
d) If the stack couldn't be expanded, just kill the thread. It's reasonable because the stack is used up.


>>>>
>>>>See below inline comments.
>>>>
>>>>Yanmin
>>>>
>>>>>>-----Original Message-----
>>>>>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Prasanna S Panchamukhi
>>>>>>Sent: 2006年2月8日 22:14
>>>>>>To: systemtap@sources.redhat.com
>>>>>>Subject: Re: [3/3] Userspace probes prototype-take2
>>>>>>
>>>>>>
>>>>>>This patch handles the executing the registered callback
>>>>>>functions when probes is hit.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [3/3] Userspace probes prototype-take2
  2006-02-20  3:32 [3/3] Userspace probes prototype-take2 Zhang, Yanmin
@ 2006-02-20  5:07 ` Prasanna S Panchamukhi
  0 siblings, 0 replies; 9+ messages in thread
From: Prasanna S Panchamukhi @ 2006-02-20  5:07 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: systemtap

Yanmin,

On Mon, Feb 20, 2006 at 11:32:31AM +0800, Zhang, Yanmin wrote:
> >>-----Original Message-----
> >>From: Zhang, Yanmin
> >>Sent: 2006年2月20日 11:16
> >>To: Zhang, Yanmin; prasanna@in.ibm.com; systemtap@sources.redhat.com
> >>Subject: RE: [3/3] Userspace probes prototype-take2
> >>
> >>I lost an important comment. The patch is not aware of signal processing. After kernel prepares the single-step-inst on the stack, if
> >>a signal is delivered to the thread, kernel will save some states into stack and switch to signal handler function, so single-step-inst
> >>on the stack might be erased.
> >>
> >>>>-----Original Message-----
> >>>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Zhang, Yanmin
> >>>>Sent: 2006年2月17日 17:20
> >>>>To: prasanna@in.ibm.com; systemtap@sources.redhat.com
> >>>>Subject: RE: [3/3] Userspace probes prototype-take2
> >>>>
> >>>>2 main issues:
> >>>>1) task switch caused by external interrupt when single-step;
> [YM] I think we could resolve this problem. Kernel probe has some differences from uprobe. One of them is that we couldn't estimate if kernel probe happens in process context, or interrupt context, while uprobe always happens in process context (user space). So from some points of view, uprobe could be simplified from kernel probe.
> a) Don't use kcb (kprobe_ctlblk) if uprobe is triggered. Create new functions, kprobe__handler_user, kprobe_fault_handler_user and other handlers. In the new functions, instead of kcb, we could use uprobe_page being allocated dynamically
.
Yes, I am trying to seperate out kprobes_handlers and uprobe_handlers
since user space probe handlers can preempt and might sleep. also given
that we might preempt, we cannot reuse the kprobe_handlers() that use rcu.
My next take will address this issues.

> Considering signal action handler (possible uprobe nested), a thread might have a list of uprobe_page.

Yes, reentrancy in this situation also need to be handled.

> b) Delete current_uprobe;

> 
> 
> 
> >>>>2) multi-thread:
> [YM] We could resolve this problem.
> a) Don't call replace_orignal_insn in function uprobe_single_step. It might cause a race condition.
> b) Delete copy_insn_on_new_page;
> c) Merge copy_insn_onstack and copy_insn_onexpstack. The single-step-insn address could be esp-sizeof(long long)-MAX_INSN_SIZE*sizeof(kprobe_opcode_t). 
This can be done.
> d) If the stack couldn't be expanded, just kill the thread. It's reasonable because the stack is used up.

We need to take a closer look at this issue

> 
> 
> >>>>
> >>>>See below inline comments.
> >>>>
> >>>>Yanmin
> >>>>
> >>>>>>-----Original Message-----
> >>>>>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Prasanna S Panchamukhi
> >>>>>>Sent: 2006年2月8日 22:14
> >>>>>>To: systemtap@sources.redhat.com
> >>>>>>Subject: Re: [3/3] Userspace probes prototype-take2
> >>>>>>
> >>>>>>
> >>>>>>This patch handles the executing the registered callback
> >>>>>>functions when probes is hit.

-- 
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: prasanna@in.ibm.com
Ph: 91-80-51776329

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [3/3] Userspace probes prototype-take2
@ 2006-02-20  5:48 Zhang, Yanmin
  0 siblings, 0 replies; 9+ messages in thread
From: Zhang, Yanmin @ 2006-02-20  5:48 UTC (permalink / raw)
  To: prasanna; +Cc: systemtap

>>-----Original Message-----
>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Prasanna S Panchamukhi
>>Sent: 2006年2月20日 13:38
>>To: Zhang, Yanmin
>>Cc: systemtap@sources.redhat.com
>>Subject: Re: [3/3] Userspace probes prototype-take2
>>
>>Yanmin,
>>
>>
>>Please see my comments inline below.
>>
>>Thanks
>>Prasanna
>>> >>
>>> >>2. This patch works only with PREEMPT config option disabled, to work
>>> >>in PREEMPT enabled condition handlers must be re-written and must
>>> >>be seperated out from kernel probes allowing preemption.
>>> One of my old comments is an external device interrupt might happen when cpu is single-stepping the original instruction, then the
>>task might be switched to another cpu. If we disable irq when exiting to user space to single step the instruction, kernel might switch
>>the task off just on the exit kernel path. 1) uprobe_page; 2) kprobe_ctlblk, These 2 resources shouldn't be pre cpu, or we need get
>>another approach. How could you resolve the task switch issue?
>>
>>My new design does not use the kprobe handlers and per cpu kprobe data
>>structures itself  so that task switch issue will be resolved.
>>We register a separte set of uprobe handlers and use uprobe data structure.
>>Also now we will be handling uprobes serially and synchronize using some lock/mutex, but later on we have scale it up for better performance.
>>
>>> >>+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
>>> >>+			struct pt_regs *regs, struct vm_area_struct *vma)
>>> >>+{
>>> >>+	unsigned long addr, *vaddr, stack_addr = regs->esp;
>>> >>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>> >>+	struct uprobe_page *upage;
>>> >>+	struct page *page;
>>> >>+	pte_t *pte;
>>> >>+
>>> >>+
>>> >>+	if (vma->vm_flags & VM_GROWSDOWN) {
>>> >>+		if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
>>> >>+			return -ENOMEM;
>>> >>+
>>> >>+		addr = vma->vm_start;
>>> >>+	} else if (vma->vm_flags & VM_GROWSUP) {
>>> >>+		if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
>>> >>+			return -ENOMEM;
>>> >>+
>>> >>+		addr = vma->vm_end - size;
>>> >>+	} else
>>> >>+		return -EFAULT;
>>> >>+
>>> The multi-thread case is not resolved here. One of typical multi-thread model is that the all threads share the same vma and every
>>thread has 8-k stack.
>>
>>>If 2 threads trigger uprobe (although might be not the same uprobe) at the same time, one thread might erase single-step instruction
>>of another.
>>
>>Do these threads share the same stack pages?
[YM] No. They share the same vma. And copy_insm_onstack might happen to fail on the 2 threads at the same time.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [3/3] Userspace probes prototype-take2
@ 2006-02-20  5:48 Zhang, Yanmin
  0 siblings, 0 replies; 9+ messages in thread
From: Zhang, Yanmin @ 2006-02-20  5:48 UTC (permalink / raw)
  To: prasanna; +Cc: systemtap

>>-----Original Message-----
>>From: Prasanna S Panchamukhi [mailto:prasanna@in.ibm.com]
>>Sent: 2006年2月20日 12:52
>>To: Zhang, Yanmin
>>Cc: systemtap@sources.redhat.com
>>Subject: Re: [3/3] Userspace probes prototype-take2
>>
>>Yanmin,
>>
>>On Mon, Feb 20, 2006 at 11:15:31AM +0800, Zhang, Yanmin wrote:
>>> I lost an important comment. The patch is not aware of signal processing. After kernel prepares the single-step-inst on the stack,
>>if a signal is delivered to the thread, kernel will save some states into stack and switch to signal handler function, so single-step-inst
>>on the stack might be erased.
>>
>>AFAIK this problem can be addressed in the following ways.
>>
>>1. Leave the sufficient stack space for the kernel to deliver the
>>signals and then copy the instructions on the stack.
[YM] signal action handler itself could be nested again, so it looks not a good approach.


>>
>>2. Synchronize usage of stack between signal processing and user space probes.
[YM] This approach looks not good. Another issue if doing so is that the single-step-insn itself might change esp.


>>
>>3. Block the signal processing by disabling interrupts and preemption from
>>the time we copy the instruction on the stack untill we single step on the
>>original instruction. Or even wait for the signal processing to be
>>complete and then setup the stack for single stepping on the original
>>instructions and single step.
[YM] We could do not check signal when exiting from kernel if trying to single-step. It's easy to be implemented, but community might argue. I prefer this one. For example, add a new flag, TIF_UPROBING, at thread_info->flags. And check it at appropriate time.


>>
>>Your suggestion are welcome to provide better solutions to this problem.
>>
>>Thanks
>>Prasanna

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [3/3] Userspace probes prototype-take2
  2006-02-17  9:19 Zhang, Yanmin
@ 2006-02-20  5:36 ` Prasanna S Panchamukhi
  0 siblings, 0 replies; 9+ messages in thread
From: Prasanna S Panchamukhi @ 2006-02-20  5:36 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: systemtap

Yanmin,


Please see my comments inline below.

Thanks
Prasanna
> >>
> >>2. This patch works only with PREEMPT config option disabled, to work
> >>in PREEMPT enabled condition handlers must be re-written and must
> >>be seperated out from kernel probes allowing preemption.
> One of my old comments is an external device interrupt might happen when cpu is single-stepping the original instruction, then the task might be switched to another cpu. If we disable irq when exiting to user space to single step the instruction, kernel might switch the task off just on the exit kernel path. 1) uprobe_page; 2) kprobe_ctlblk, These 2 resources shouldn't be pre cpu, or we need get another approach. How could you resolve the task switch issue?

My new design does not use the kprobe handlers and per cpu kprobe data
structures itself  so that task switch issue will be resolved.
We register a separte set of uprobe handlers and use uprobe data structure.
Also now we will be handling uprobes serially and synchronize using some lock/mutex, but later on we have scale it up for better performance.

> >>+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
> >>+			struct pt_regs *regs, struct vm_area_struct *vma)
> >>+{
> >>+	unsigned long addr, *vaddr, stack_addr = regs->esp;
> >>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> >>+	struct uprobe_page *upage;
> >>+	struct page *page;
> >>+	pte_t *pte;
> >>+
> >>+
> >>+	if (vma->vm_flags & VM_GROWSDOWN) {
> >>+		if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
> >>+			return -ENOMEM;
> >>+
> >>+		addr = vma->vm_start;
> >>+	} else if (vma->vm_flags & VM_GROWSUP) {
> >>+		if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
> >>+			return -ENOMEM;
> >>+
> >>+		addr = vma->vm_end - size;
> >>+	} else
> >>+		return -EFAULT;
> >>+
> The multi-thread case is not resolved here. One of typical multi-thread model is that the all threads share the same vma and every thread has 8-k stack. 

>If 2 threads trigger uprobe (although might be not the same uprobe) at the same time, one thread might erase single-step instruction of another.

Do these threads share the same stack pages?


-- 
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: prasanna@in.ibm.com
Ph: 91-80-51776329

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [3/3] Userspace probes prototype-take2
  2006-02-20  3:16 Zhang, Yanmin
@ 2006-02-20  4:51 ` Prasanna S Panchamukhi
  0 siblings, 0 replies; 9+ messages in thread
From: Prasanna S Panchamukhi @ 2006-02-20  4:51 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: systemtap

Yanmin,

On Mon, Feb 20, 2006 at 11:15:31AM +0800, Zhang, Yanmin wrote:
> I lost an important comment. The patch is not aware of signal processing. After kernel prepares the single-step-inst on the stack, if a signal is delivered to the thread, kernel will save some states into stack and switch to signal handler function, so single-step-inst on the stack might be erased.

AFAIK this problem can be addressed in the following ways.

1. Leave the sufficient stack space for the kernel to deliver the
signals and then copy the instructions on the stack.

2. Synchronize usage of stack between signal processing and user space probes.

3. Block the signal processing by disabling interrupts and preemption from 
the time we copy the instruction on the stack untill we single step on the 
original instruction. Or even wait for the signal processing to be
complete and then setup the stack for single stepping on the original
instructions and single step. 

Your suggestion are welcome to provide better solutions to this problem.

Thanks
Prasanna

> 
> >>-----Original Message-----
> >>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Zhang, Yanmin
> >>Sent: 2006年2月17日 17:20
> >>To: prasanna@in.ibm.com; systemtap@sources.redhat.com
> >>Subject: RE: [3/3] Userspace probes prototype-take2
> >>
> >>2 main issues:
> >>1) task switch caused by external interrupt when single-step;
> >>2) multi-thread:
> >>
> >>See below inline comments.
> >>
> >>Yanmin
> >>
> >>>>-----Original Message-----
> >>>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Prasanna S Panchamukhi
> >>>>Sent: 2006年2月8日 22:14
> >>>>To: systemtap@sources.redhat.com
> >>>>Subject: Re: [3/3] Userspace probes prototype-take2
> >>>>
> >>>>
> >>>>This patch handles the executing the registered callback
> >>>>functions when probes is hit.
> >>>>
> >>>>	Each userspace probe is uniquely identified by the
> >>>>combination of inode and offset, hence during registeration the inode
> >>>>and offset combination is added to kprobes hash table. Initially when
> >>>>breakpoint instruction is hit, the kprobes hash table is looked up
> >>>>for matching inode and offset. The pre_handlers are called in sequence
> >>>>if multiple probes are registered. The original instruction is single
> >>>>stepped out-of-line similar to kernel probes. In kernel space probes,
> >>>>single stepping out-of-line is achieved by copying the instruction on
> >>>>to some location within kernel address space and then single step
> >>>>from that location. But for userspace probes, instruction copied
> >>>>into kernel address space cannot be single stepped, hence the
> >>>>instruction should be copied to user address space. The solution is
> >>>>to find free space in the current process address space and then copy
> >>>>the original instruction and single step that instruction.
> >>>>
> >>>>User processes use stack space to store local variables, agruments and
> >>>>return values. Normally the stack space either below or above the
> >>>>stack pointer indicates the free stack space. If the stack grows
> >>>>downwards, the stack space below the stack pointer indicates the
> >>>>unused stack free space and if the stack grows upwards, the stack
> >>>>space above the stack pointer indicates the unused stack free space.
> >>>>
> >>>>The instruction to be single stepped can modify the stack space, hence
> >>>>before using the unused stack free space, sufficient stack space
> >>>>should be left. The instruction is copied to the bottom of the page
> >>>>and check is made such that the copied instruction does not cross the
> >>>>page boundry. The copied instruction is then single stepped.
> >>>>Several architectures does not allow the instruction to be executed
> >>>>from the stack location, since no-exec bit is set for the stack pages.
> >>>>In those architectures, the page table entry corresponding to the
> >>>>stack page is identified and the no-exec bit is unset making the
> >>>>instruction on that stack page to be executed.
> >>>>
> >>>>There are situations where even the unused free stack space is not
> >>>>enough for the user instruction to be copied and single stepped. In
> >>>>such situations, the virtual memory area(vma) can be expanded beyond
> >>>>the current stack vma. This expaneded stack can be used to copy the
> >>>>original instruction and single step out-of-line.
> >>>>
> >>>>Even if the vma cannot be extended then the instruction much be
> >>>>executed inline, by replacing the breakpoint instruction with original
> >>>>instruction.
> >>>>
> >>>>TODO list
> >>>>--------
> >>>>1. This patch is not stable yet, should work for most conditions.
> >>>>
> >>>>2. This patch works only with PREEMPT config option disabled, to work
> >>>>in PREEMPT enabled condition handlers must be re-written and must
> >>>>be seperated out from kernel probes allowing preemption.
> >>One of my old comments is an external device interrupt might happen when cpu is single-stepping the original instruction, then the task
> >>might be switched to another cpu. If we disable irq when exiting to user space to single step the instruction, kernel might switch the
> >>task off just on the exit kernel path. 1) uprobe_page; 2) kprobe_ctlblk, These 2 resources shouldn't be pre cpu, or we need get another
> >>approach. How could you resolve the task switch issue?
> >>
> >>
> >>
> >>>>
> >>>>3. Insert probes on copy-on-write pages. Tracks all COW pages for the
> >>>>page containing the specified probe point and inserts/removes all the
> >>>>probe points for that page.
> >>>>
> >>>>4. Optimize the insertion of probes through readpage hooks. Identify
> >>>>all the probes to be inserted on the read page and insert them at
> >>>>once.
> >>>>
> >>>>5. Resume exectution should handle setting of proper eip and eflags
> >>>>for special instructions similar to kernel probes.
> >>>>
> >>>>6. Single stepping out-of-line expands the stack if there is no
> >>>>enough stack space to copy the original instruction. Expansion of
> >>>>stack should be shrinked back to the original size after single
> >>>>stepping or the expanded stack should be reused for single stepping
> >>>>out-of-line for other probes.
> >>>>
> >>>>7. A wrapper routines to calculate the offset from the probed file
> >>>>beginning. In case of dynamic shared library, the offset is
> >>>>calculated by substracting the address of the probe point from the
> >>>>beginning of the file mapped address.
> >>>>
> >>>>8. Handing of page faults while inthe kprobes_handler() and while
> >>>>single stepping.
> >>>>
> >>>>9. Accessing user space pages not present in memory, from the
> >>>>registered callback routines.
> >>>>
> >>>>Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
> >>>>
> >>>>
> >>>> arch/i386/kernel/kprobes.c |  460 +++++++++++++++++++++++++++++++++++++++++++--
> >>>> include/asm-i386/kprobes.h |   13 +
> >>>> include/linux/kprobes.h    |    7
> >>>> kernel/kprobes.c           |    3
> >>>> 4 files changed, 468 insertions(+), 15 deletions(-)
> >>>>
> >>>>diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line arch/i386/kernel/kprobes.c
> >>>>--- linux-2.6.16-rc1-mm5/arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
> >>>>+++ linux-2.6.16-rc1-mm5-prasanna/arch/i386/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
> >>>>@@ -30,6 +30,7 @@
> >>>>
> >>>> #include <linux/config.h>
> >>>> #include <linux/kprobes.h>
> >>>>+#include <linux/hash.h>
> >>>> #include <linux/ptrace.h>
> >>>> #include <linux/preempt.h>
> >>>> #include <asm/cacheflush.h>
> >>>>@@ -38,8 +39,12 @@
> >>>>
> >>>> void jprobe_return_end(void);
> >>>>
> >>>>+static struct uprobe_page *uprobe_page;
> >>>>+static struct hlist_head uprobe_page_table[KPROBE_TABLE_SIZE];
> >>>> DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
> >>>> DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
> >>>>+DEFINE_PER_CPU(struct uprobe *, current_uprobe) = NULL;
> >>>>+DEFINE_PER_CPU(unsigned long, singlestep_addr);
> >>>>
> >>>> /* insert a jmp code */
> >>>> static inline void set_jmp_op(void *from, void *to)
> >>>>@@ -125,6 +130,23 @@ void __kprobes arch_disarm_kprobe(struct
> >>>> 			   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
> >>>> }
> >>>>
> >>>>+void __kprobes arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address)
> >>>>+{
> >>>>+	*address = p->opcode;
> >>>>+}
> >>>>+
> >>>>+void __kprobes arch_arm_uprobe(unsigned long *address)
> >>>>+{
> >>>>+	*(kprobe_opcode_t *)address = BREAKPOINT_INSTRUCTION;
> >>>>+}
> >>>>+
> >>>>+void __kprobes arch_copy_uprobe(struct kprobe *p, unsigned long *address)
> >>>>+{
> >>>>+	memcpy(p->ainsn.insn, (kprobe_opcode_t *)address,
> >>>>+				MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
> >>>>+	p->opcode = *(kprobe_opcode_t *)address;
> >>>>+}
> >>>>+
> >>>> static inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)
> >>>> {
> >>>> 	kcb->prev_kprobe.kp = kprobe_running();
> >>>>@@ -151,15 +173,326 @@ static inline void set_current_kprobe(st
> >>>> 		kcb->kprobe_saved_eflags &= ~IF_MASK;
> >>>> }
> >>>>
> >>>>+struct uprobe_page __kprobes *get_upage_current(struct task_struct *tsk)
> >>>>+{
> >>>>+	struct hlist_head *head;
> >>>>+	struct hlist_node *node;
> >>>>+	struct uprobe_page *upage;
> >>>>+
> >>>>+	head = &uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)];
> >>>>+	hlist_for_each_entry(upage, node, head, hlist) {
> >>>>+		if (upage->tsk == tsk)
> >>>>+			return upage;
> >>>>+        }
> >>>>+	return NULL;
> >>>>+}
> >>>>+
> >>>>+struct uprobe_page __kprobes *get_upage_free(struct task_struct *tsk)
> >>>>+{
> >>>>+	int cpu;
> >>>>+
> >>>>+	for_each_cpu(cpu) {
> >>>>+		struct uprobe_page *upage;
> >>>>+		upage = per_cpu_ptr(uprobe_page, cpu);
> >>>>+		if (upage->status & UPROBE_PAGE_FREE)
> >>>>+			return upage;
> >>>>+	}
> >>>>+	return NULL;
> >>>>+}
> >>>>+
> >>>>+/**
> >>>>+ * This routines get the pte of the page containing the specified address.
> >>>>+ */
> >>>>+static pte_t  __kprobes *get_uprobe_pte(unsigned long address)
> >>>>+{
> >>>>+	pgd_t *pgd;
> >>>>+	pud_t *pud;
> >>>>+	pmd_t *pmd;
> >>>>+	pte_t *pte = NULL;
> >>>>+
> >>>>+	pgd = pgd_offset(current->mm, address);
> >>>>+	if (!pgd)
> >>>>+		goto out;
> >>>>+
> >>>>+	pud = pud_offset(pgd, address);
> >>>>+	if (!pud)
> >>>>+		goto out;
> >>>>+
> >>>>+	pmd = pmd_offset(pud, address);
> >>>>+	if (!pmd)
> >>>>+		goto out;
> >>>>+
> >>>>+	pte = pte_alloc_map(current->mm, pmd, address);
> >>>>+
> >>>>+out:
> >>>>+	return pte;
> >>>>+}
> >>>>+
> >>>>+/**
> >>>>+ *  This routine check for space in the current process's stack address space.
> >>>>+ *  If enough address space is found, it just maps a new page and copies the
> >>>>+ *  new instruction on that page for single stepping out-of-line.
> >>>>+ */
> >>>>+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
> >>>>+			struct pt_regs *regs, struct vm_area_struct *vma)
> >>>>+{
> >>>>+	unsigned long addr, *vaddr, stack_addr = regs->esp;
> >>>>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> >>>>+	struct uprobe_page *upage;
> >>>>+	struct page *page;
> >>>>+	pte_t *pte;
> >>>>+
> >>>>+
> >>>>+	if (vma->vm_flags & VM_GROWSDOWN) {
> >>>>+		if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
> >>>>+			return -ENOMEM;
> >>>>+
> >>>>+		addr = vma->vm_start;
> >>>>+	} else if (vma->vm_flags & VM_GROWSUP) {
> >>>>+		if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
> >>>>+			return -ENOMEM;
> >>>>+
> >>>>+		addr = vma->vm_end - size;
> >>>>+	} else
> >>>>+		return -EFAULT;
> >>>>+
> >>The multi-thread case is not resolved here. One of typical multi-thread model is that the all threads share the same vma and every thread
> >>has 8-k stack. If 2 threads trigger uprobe (although might be not the same uprobe) at the same time, one thread might erase single-step
> >>instruction of another.
> >>
> >>
> >>
> >>>>+	preempt_enable_no_resched();
> >>>>+
> >>>>+	pte = get_uprobe_pte(addr);
> >>>>+	preempt_disable();
> >>>>+	if (!pte)
> >>>>+		return -EFAULT;
> >>>>+
> >>>>+	upage = get_upage_free(current);
> >>>>+	upage->status &= ~UPROBE_PAGE_FREE;
> >>>>+	upage->tsk = current;
> >>>>+	INIT_HLIST_NODE(&upage->hlist);
> >>>>+	hlist_add_head(&upage->hlist,
> >>>>+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
> >>>>+
> >>>>+	upage->orig_pte = pte;
> >>>>+	upage->orig_pte_val =  pte_val(*pte);
> >>>>+	set_pte(pte, (*(upage->alias_pte)));
> >>>>+
> >>>>+	page = pte_page(*pte);
> >>>>+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
> >>>>+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
> >>>>+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
> >>>>+	kunmap_atomic(vaddr, KM_USER1);
> >>>>+	regs->eip = addr;
> >>So the temp page, upage->alias_addr, replaces the original one on the stack. If the replaced instruction is to operate stack, such like
> >>"push eax", the result might be on the new page. After the single step, the pte is restored to the original page which doesn't have
> >>the value of eax.
> >>
> >>
> >>
> >>>>+
> >>>>+	return 0;
> >>>>+}
> >>>>+
> >>>>+/**
> >>>>+ * This routine expands the stack beyond the present process address space
> >>>>+ * and copies the instruction to that location, so that processor can
> >>>>+ * single step out-of-line.
> >>>>+ */
> >>>>+static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
> >>>>+			struct pt_regs *regs, struct vm_area_struct *vma)
> >>It has the same issues like function copy_insn_on_new_page.
> >>
> >>
> >>>>+{
> >>>>+	unsigned long addr, *vaddr, vm_addr;
> >>>>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> >>>>+	struct vm_area_struct *new_vma;
> >>>>+	struct uprobe_page *upage;
> >>>>+	struct mm_struct *mm = current->mm;
> >>>>+	struct page *page;
> >>>>+	pte_t *pte;
> >>>>+
> >>>>+
> >>>>+	if (vma->vm_flags & VM_GROWSDOWN)
> >>>>+		vm_addr = vma->vm_start - size;
> >>>>+	else if (vma->vm_flags & VM_GROWSUP)
> >>>>+		vm_addr = vma->vm_end + size;
> >>>>+	else
> >>>>+		return -EFAULT;
> >>>>+
> >>>>+	preempt_enable_no_resched();
> >>>>+
> >>>>+	/* TODO: do we need to expand stack if extend_vma fails? */
> >>>>+	new_vma = find_extend_vma(mm, vm_addr);
> >>>>+	preempt_disable();
> >>>>+	if (!new_vma)
> >>>>+		return -ENOMEM;
> >>>>+
> >>>>+	/*
> >>>>+	 * TODO: Expanding stack for every probe is not a good idea, stack must
> >>>>+	 * either be shrunk to its original size after single stepping or the
> >>>>+	 * expanded stack should be kept track of, for the probed application,
> >>>>+	 * so it can be reused to single step out-of-line
> >>>>+	 */
> >>>>+	if (new_vma->vm_flags & VM_GROWSDOWN)
> >>>>+		addr = new_vma->vm_start;
> >>>>+	else
> >>>>+		addr = new_vma->vm_end - size;
> >>>>+
> >>>>+	preempt_enable_no_resched();
> >>>>+	pte = get_uprobe_pte(addr);
> >>>>+	preempt_disable();
> >>>>+	if (!pte)
> >>>>+		return -EFAULT;
> >>>>+
> >>>>+	upage = get_upage_free(current);
> >>>>+	upage->status &= ~UPROBE_PAGE_FREE;
> >>>>+	upage->tsk = current;
> >>>>+	INIT_HLIST_NODE(&upage->hlist);
> >>>>+	hlist_add_head(&upage->hlist,
> >>>>+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
> >>>>+	upage->orig_pte = pte;
> >>>>+	upage->orig_pte_val =  pte_val(*pte);
> >>>>+	set_pte(pte, (*(upage->alias_pte)));
> >>>>+
> >>>>+	page = pte_page(*pte);
> >>>>+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
> >>>>+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
> >>>>+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
> >>>>+	kunmap_atomic(vaddr, KM_USER1);
> >>>>+	regs->eip = addr;
> >>>>+
> >>>>+	return  0;
> >>>>+}
> >>>>+
> >>>>+/**
> >>>>+ * This routine checks for stack free space below the stack pointer and
> >>>>+ * then copies the instructions at that location so that the processor can
> >>>>+ * single step out-of-line. If there is no enough stack space or if
> >>>>+ * copy_to_user fails or if the vma is invalid, it returns error.
> >>>>+ */
> >>>>+static int __kprobes copy_insn_onstack(struct uprobe *uprobe,
> >>>>+			struct pt_regs *regs, unsigned long flags)
> >>>>+{
> >>>>+	unsigned long page_addr, stack_addr = regs->esp;
> >>>>+	int  size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> >>>>+	unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
> >>>>+
> >>>>+	if (flags & VM_GROWSDOWN) {
> >>>>+		page_addr = stack_addr & PAGE_MASK;
> >>>>+
> >>>>+		if (((stack_addr - sizeof(long long))) < (page_addr + size))
> >>>>+			return -ENOMEM;
> >>>>+
> >>>>+		if (__copy_to_user_inatomic((unsigned long *)page_addr, source,
> >>>>+									size))
> >>>>+			return -EFAULT;
> >>>>+
> >>>>+		regs->eip = page_addr;
> >>>>+	} else if (flags & VM_GROWSUP) {
> >>>>+		page_addr = stack_addr & PAGE_MASK;
> >>>>+
> >>>>+		if (page_addr == stack_addr)
> >>>>+			return -ENOMEM;
> >>>>+		else
> >>>>+			page_addr += PAGE_SIZE;
> >>>>+
> >>>>+		if ((page_addr - size) < (stack_addr + sizeof(long long)))
> >>>>+			return -ENOMEM;
> >>>>+
> >>>>+		if (__copy_to_user_inatomic((unsigned long *)(page_addr - size),
> >>>>+								source, size))
> >>>>+			return -EFAULT;
> >>>>+
> >>>>+		regs->eip = page_addr - size;
> >>>>+	} else
> >>>>+		return -EINVAL;
> >>>>+
> >>>>+	return 0;
> >>>>+}
> >>>>+
> >>>>+/**
> >>>>+ * This routines get the page containing the probe, maps it and
> >>>>+ * replaced the instruction at the probed address with specified
> >>>>+ * opcode.
> >>>>+ */
> >>>>+void __kprobes replace_original_insn(struct uprobe *uprobe,
> >>>>+				struct pt_regs *regs, kprobe_opcode_t opcode)
> >>>>+{
> >>>>+	kprobe_opcode_t *addr;
> >>>>+	struct page *page;
> >>>>+
> >>>>+	page = find_get_page(uprobe->inode->i_mapping,
> >>>>+					uprobe->offset >> PAGE_CACHE_SHIFT);
> >>>>+	lock_page(page);
> >>>>+
> >>>>+	addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER0);
> >>>>+	addr = (kprobe_opcode_t *)((unsigned long)addr +
> >>>>+				 (unsigned long)(uprobe->offset & ~PAGE_MASK));
> >>>>+	*addr = opcode;
> >>>>+	/*TODO: flush vma ? */
> >>>>+	kunmap_atomic(addr, KM_USER0);
> >>>>+
> >>>>+	unlock_page(page);
> >>>>+
> >>>>+	page_cache_release(page);
> >>>>+	regs->eip = (unsigned long)uprobe->kp.addr;
> >>>>+}
> >>>>+
> >>>>+/**
> >>>>+ * This routine provides the functionality of single stepping out of line.
> >>>>+ * If single stepping out-of-line cannot be achieved, it replaces with
> >>>>+ * the original instruction allowing it to single step inline.
> >>>>+ */
> >>>>+static inline int uprobe_single_step(struct kprobe *p, struct pt_regs *regs)
> >>>>+{
> >>>>+	unsigned long stack_addr = regs->esp, flags;
> >>>>+	struct vm_area_struct *vma = NULL;
> >>>>+	struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
> >>>>+	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
> >>>>+	int err = 0;
> >>>>+
> >>>>+	down_read(&current->mm->mmap_sem);
> >>>>+
> >>>>+	vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
> >>>>+	if (!vma) {
> >>>>+		/* TODO: Need better error reporting? */
> >>>>+		printk("No vma found\n");
> >>>>+		up_read(&current->mm->mmap_sem);
> >>>>+		return -ENOENT;
> >>>>+	}
> >>>>+	flags = vma->vm_flags;
> >>>>+	up_read(&current->mm->mmap_sem);
> >>>>+
> >>>>+	kcb->kprobe_status |= UPROBE_SS_STACK;
> >>>>+	err = copy_insn_onstack(uprobe, regs, flags);
> >>>>+
> >>>>+	down_write(&current->mm->mmap_sem);
> >>>>+
> >>>>+	if (err) {
> >>>>+		kcb->kprobe_status |= UPROBE_SS_NEW_STACK;
> >>>>+		err = copy_insn_on_new_page(uprobe, regs, vma);
> >>>>+	}
> >>>>+	if (err) {
> >>>>+		kcb->kprobe_status |= UPROBE_SS_EXPSTACK;
> >>>>+		err = copy_insn_onexpstack(uprobe, regs, vma);
> >>>>+	}
> >>>>+
> >>>>+	up_write(&current->mm->mmap_sem);
> >>>>+
> >>>>+	if (err) {
> >>>>+		kcb->kprobe_status |= UPROBE_SS_INLINE;
> >>>>+		replace_original_insn(uprobe, regs, uprobe->kp.opcode);
> >>>>+	}
> >>>>+
> >>>>+	 __get_cpu_var(singlestep_addr) = regs->eip;
> >>>>+
> >>>>+
> >>>>+	return 0;
> >>>>+}
> >>>>+
> >>>> static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
> >>>> {
> >>>> 	regs->eflags |= TF_MASK;
> >>>> 	regs->eflags &= ~IF_MASK;
> >>>> 	/*single step inline if the instruction is an int3*/
> >>>>+
> >>>> 	if (p->opcode == BREAKPOINT_INSTRUCTION)
> >>>> 		regs->eip = (unsigned long)p->addr;
> >>>>-	else
> >>>>-		regs->eip = (unsigned long)&p->ainsn.insn;
> >>>>+	else {
> >>>>+		if (!kernel_text_address((unsigned long)p->addr))
> >>>>+			uprobe_single_step(p, regs);
> >>>>+		else
> >>>>+			regs->eip = (unsigned long)&p->ainsn.insn;
> >>>>+	}
> >>>> }
> >>>>
> >>>> /* Called with kretprobe_lock held */
> >>>>@@ -194,6 +527,7 @@ static int __kprobes kprobe_handler(stru
> >>>> 	kprobe_opcode_t *addr = NULL;
> >>>> 	unsigned long *lp;
> >>>> 	struct kprobe_ctlblk *kcb;
> >>>>+	unsigned seg = regs->xcs & 0xffff;
> >>>> #ifdef CONFIG_PREEMPT
> >>>> 	unsigned pre_preempt_count = preempt_count();
> >>>> #endif /* CONFIG_PREEMPT */
> >>>>@@ -208,14 +542,21 @@ static int __kprobes kprobe_handler(stru
> >>>> 	/* Check if the application is using LDT entry for its code segment and
> >>>> 	 * calculate the address by reading the base address from the LDT entry.
> >>>> 	 */
> >>>>-	if ((regs->xcs & 4) && (current->mm)) {
> >>>>+
> >>>>+	if (regs->eflags & VM_MASK)
> >>>>+		addr = (kprobe_opcode_t *)(((seg << 4) + regs->eip -
> >>>>+			sizeof(kprobe_opcode_t)) & 0xffff);
> >>>>+	else if ((regs->xcs & 4) && (current->mm)) {
> >>>>+		local_irq_enable();
> >>>>+		down(&current->mm->context.sem);
> >>>> 		lp = (unsigned long *) ((unsigned long)((regs->xcs >> 3) * 8)
> >>>> 					+ (char *) current->mm->context.ldt);
> >>>> 		addr = (kprobe_opcode_t *) (get_desc_base(lp) + regs->eip -
> >>>> 						sizeof(kprobe_opcode_t));
> >>>>-	} else {
> >>>>+		up(&current->mm->context.sem);
> >>>>+		local_irq_disable();
> >>>>+	} else
> >>>> 		addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
> >>>>-	}
> >>>> 	/* Check we're not actually recursing */
> >>>> 	if (kprobe_running()) {
> >>>> 		p = get_kprobe(addr);
> >>>>@@ -235,7 +576,6 @@ static int __kprobes kprobe_handler(stru
> >>>> 			save_previous_kprobe(kcb);
> >>>> 			set_current_kprobe(p, regs, kcb);
> >>>> 			kprobes_inc_nmissed_count(p);
> >>>>-			prepare_singlestep(p, regs);
> >>>> 			kcb->kprobe_status = KPROBE_REENTER;
> >>>> 			return 1;
> >>>> 		} else {
> >>>>@@ -307,8 +647,8 @@ static int __kprobes kprobe_handler(stru
> >>>> 	}
> >>>>
> >>>> ss_probe:
> >>>>-	prepare_singlestep(p, regs);
> >>>> 	kcb->kprobe_status = KPROBE_HIT_SS;
> >>>>+	prepare_singlestep(p, regs);
> >>>> 	return 1;
> >>>>
> >>>> no_kprobe:
> >>>>@@ -498,6 +838,33 @@ no_change:
> >>>> 	return;
> >>>> }
> >>>>
> >>>>+static void __kprobes resume_execution_user(struct uprobe *uprobe,
> >>>>+				struct pt_regs *regs, struct kprobe_ctlblk *kcb)
> >>>>+{
> >>>>+	unsigned long delta;
> >>>>+	struct uprobe_page *upage;
> >>>>+
> >>>>+	/*
> >>>>+	 * TODO :need to fixup special instructions as done with kernel probes.
> >>>>+	 */
> >>>>+	delta = regs->eip - __get_cpu_var(singlestep_addr);
> >>>>+	regs->eip = (unsigned long)(uprobe->kp.addr + delta);
> >>>>+
> >>>>+	if ((kcb->kprobe_status & UPROBE_SS_EXPSTACK) ||
> >>>>+			(kcb->kprobe_status & UPROBE_SS_NEW_STACK)) {
> >>>>+		upage = get_upage_current(current);
> >>>>+		set_pte(upage->orig_pte, __pte(upage->orig_pte_val));
> >>>>+		pte_unmap(upage->orig_pte);
> >>>>+
> >>>>+		upage->status = UPROBE_PAGE_FREE;
> >>>>+		hlist_del(&upage->hlist);
> >>>>+
> >>>>+	} else if (kcb->kprobe_status & UPROBE_SS_INLINE)
> >>>>+		replace_original_insn(uprobe, regs,
> >>>>+				(kprobe_opcode_t)BREAKPOINT_INSTRUCTION);
> >>>>+	regs->eflags &= ~TF_MASK;
> >>>>+}
> >>>>+
> >>>> /*
> >>>>  * Interrupts are disabled on entry as trap1 is an interrupt gate and they
> >>>>  * remain disabled thoroughout this function.
> >>>>@@ -510,16 +877,19 @@ static inline int post_kprobe_handler(st
> >>>> 	if (!cur)
> >>>> 		return 0;
> >>>>
> >>>>-	if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) {
> >>>>-		kcb->kprobe_status = KPROBE_HIT_SSDONE;
> >>>>+	if (!(kcb->kprobe_status & KPROBE_REENTER) && cur->post_handler) {
> >>>>+		kcb->kprobe_status |= KPROBE_HIT_SSDONE;
> >>>> 		cur->post_handler(cur, regs, 0);
> >>>> 	}
> >>>>
> >>>>-	resume_execution(cur, regs, kcb);
> >>>>+	if (!kernel_text_address((unsigned long)cur->addr))
> >>>>+		resume_execution_user(__get_cpu_var(current_uprobe), regs, kcb);
> >>>>+	else
> >>>>+		resume_execution(cur, regs, kcb);
> >>>> 	regs->eflags |= kcb->kprobe_saved_eflags;
> >>>>
> >>>> 	/*Restore back the original saved kprobes variables and continue. */
> >>>>-	if (kcb->kprobe_status == KPROBE_REENTER) {
> >>>>+	if (kcb->kprobe_status & KPROBE_REENTER) {
> >>>> 		restore_previous_kprobe(kcb);
> >>>> 		goto out;
> >>>> 	}
> >>>>@@ -547,7 +917,13 @@ static inline int kprobe_fault_handler(s
> >>>> 		return 1;
> >>>>
> >>>> 	if (kcb->kprobe_status & KPROBE_HIT_SS) {
> >>>>-		resume_execution(cur, regs, kcb);
> >>>>+		if (!kernel_text_address((unsigned long)cur->addr)) {
> >>>>+			struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
> >>>>+			/* TODO: Proper handling of all instruction */
> >>>>+			replace_original_insn(uprobe, regs, uprobe->kp.opcode);
> >>>>+			regs->eflags &= ~TF_MASK;
> >>>>+		} else
> >>>>+			resume_execution(cur, regs, kcb);
> >>>> 		regs->eflags |= kcb->kprobe_old_eflags;
> >>>>
> >>>> 		reset_current_kprobe();
> >>>>@@ -654,7 +1030,67 @@ int __kprobes longjmp_break_handler(stru
> >>>> 	return 0;
> >>>> }
> >>>>
> >>>>+static void free_alias(void)
> >>>>+{
> >>>>+	int cpu;
> >>>>+
> >>>>+	for_each_cpu(cpu) {
> >>>>+		struct uprobe_page *upage;
> >>>>+		upage = per_cpu_ptr(uprobe_page, cpu);
> >>>>+
> >>>>+		if (upage->alias_addr) {
> >>>>+			set_pte(upage->alias_pte, __pte(upage->alias_pte_val));
> >>>>+			kfree(upage->alias_addr);
> >>>>+		}
> >>>>+		upage->alias_pte = 0;
> >>>>+	}
> >>>>+	free_percpu(uprobe_page);
> >>>>+	return;
> >>>>+}
> >>>>+
> >>>>+static int alloc_alias(void)
> >>>>+{
> >>>>+	int cpu;
> >>>>+
> >>>>+	uprobe_page = __alloc_percpu(sizeof(struct uprobe_page));
> >>[YM] Do here codes try to resolve the problem of task switch at single-step? If so, the per cpu data also might be used up although
> >>get_upage_free will go through all uprobe_page of all cpus. I suggest to allocate a series of uprobe_page, and allocate again when they
> >>are used up.
> >>
> >>
> >>
> >>
> >>>>+
> >>>>+	for_each_cpu(cpu) {
> >>>>+		struct uprobe_page *upage;
> >>>>+		upage = per_cpu_ptr(uprobe_page, cpu);
> >>>>+		upage->alias_addr = kmalloc(PAGE_SIZE, GFP_USER);
> >>[YM] Does kmalloc(PAGE_SIZE...) imply the result is aligned to page? How about using alloc_page?
> >>
> >>
> >>>>+		if (!upage->alias_addr) {
> >>>>+			free_alias();
> >>>>+			return -ENOMEM;
> >>>>+		}
> >>>>+		upage->alias_pte = lookup_address(
> >>>>+					(unsigned long)upage->alias_addr);
> >>>>+		upage->alias_pte_val = pte_val(*upage->alias_pte);
> >>>>+		if (upage->alias_pte) {
> >>[YM] If kmalloc returns a non-NULL address, upage->alias_pte is not equal to NULL. So delete above checking?
> >>
> >>
> >>>>+			upage->status = UPROBE_PAGE_FREE;
> >>>>+			set_pte(upage->alias_pte,
> >>>>+						pte_mkdirty(*upage->alias_pte));
> >>>>+			set_pte(upage->alias_pte,
> >>>>+						pte_mkexec(*upage->alias_pte));
> >>>>+			set_pte(upage->alias_pte,
> >>>>+						 pte_mkwrite(*upage->alias_pte));
> >>>>+			set_pte(upage->alias_pte,
> >>>>+						pte_mkyoung(*upage->alias_pte));
> >>>>+		}
> >>>>+	}
> >>>>+	return 0;
> >>>>+}
> >>>>+
> >>>> int __init arch_init_kprobes(void)
> >>>> {
> >>>>+	int ret = 0;
> >>>>+	/*
> >>>>+	 * user space probes requires a page to copy the original instruction
> >>>>+	 * so that it can single step if there is no free stack space, allocate
> >>>>+	 * per cpu page.
> >>>>+	 */
> >>>>+
> >>>>+	if ((ret = alloc_alias()))
> >>>>+		return ret;
> >>>>+
> >>>> 	return 0;
> >>>> }
> >>>>diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/asm-i386/kprobes.h
> >>>>--- linux-2.6.16-rc1-mm5/include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
> >>>>+++ linux-2.6.16-rc1-mm5-prasanna/include/asm-i386/kprobes.h	2006-02-08 19:26:10.000000000 +0530
> >>>>@@ -42,6 +42,7 @@ typedef u8 kprobe_opcode_t;
> >>>> #define JPROBE_ENTRY(pentry)	(kprobe_opcode_t *)pentry
> >>>> #define ARCH_SUPPORTS_KRETPROBES
> >>>> #define arch_remove_kprobe(p)	do {} while (0)
> >>>>+#define UPROBE_PAGE_FREE 0x00000001
> >>>>
> >>>> void kretprobe_trampoline(void);
> >>>>
> >>>>@@ -74,6 +75,18 @@ struct kprobe_ctlblk {
> >>>> 	struct prev_kprobe prev_kprobe;
> >>>> };
> >>>>
> >>>>+/* per cpu uprobe page structure */
> >>>>+struct uprobe_page {
> >>>>+	struct hlist_node hlist;
> >>>>+	pte_t *alias_pte;
> >>>>+	pte_t *orig_pte;
> >>>>+	unsigned long orig_pte_val;
> >>>>+	unsigned long alias_pte_val;
> >>[YM] I think the patch doesn't support CONFIG_X86_PAE, because if CONFIG_X86_PAE=y, pte_t becomes 64 bits.
> >>How about changing above 2 members' type to pte_t directly?
> >>
> >>
> >>
> >>>>+	void *alias_addr;
> >>>>+	struct task_struct *tsk;
> >>>>+	unsigned long status;
> >>>>+};
> >>>>+
> >>>> /* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
> >>>>  * if necessary, before executing the original int3/1 (trap) handler.
> >>>>  */
> >>>>diff -puN include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/linux/kprobes.h
> >>>>--- linux-2.6.16-rc1-mm5/include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
> >>>>+++ linux-2.6.16-rc1-mm5-prasanna/include/linux/kprobes.h	2006-02-08 19:26:10.000000000 +0530
> >>>>@@ -45,11 +45,18 @@
> >>>> #ifdef CONFIG_KPROBES
> >>>> #include <asm/kprobes.h>
> >>>>
> >>>>+#define KPROBE_HASH_BITS 6
> >>>>+#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> >>>>+
> >>>> /* kprobe_status settings */
> >>>> #define KPROBE_HIT_ACTIVE	0x00000001
> >>>> #define KPROBE_HIT_SS		0x00000002
> >>>> #define KPROBE_REENTER		0x00000004
> >>>> #define KPROBE_HIT_SSDONE	0x00000008
> >>>>+#define UPROBE_SS_STACK		0x00000010
> >>>>+#define UPROBE_SS_EXPSTACK	0x00000020
> >>>>+#define UPROBE_SS_INLINE	0x00000040
> >>>>+#define UPROBE_SS_NEW_STACK	0x00000080
> >>>>
> >>>> /* Attach to insert probes on any functions which should be ignored*/
> >>>> #define __kprobes	__attribute__((__section__(".kprobes.text")))
> >>>>diff -puN kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line kernel/kprobes.c
> >>>>--- linux-2.6.16-rc1-mm5/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:10.000000000 +0530
> >>>>+++ linux-2.6.16-rc1-mm5-prasanna/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
> >>>>@@ -42,9 +42,6 @@
> >>>> #include <asm/errno.h>
> >>>> #include <asm/kdebug.h>
> >>>>
> >>>>-#define KPROBE_HASH_BITS 6
> >>>>-#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> >>>>-
> >>>> static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
> >>>> static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
> >>>> static struct list_head uprobe_module_list;
> >>>>
> >>>>_
> >>>>--
> >>>>Prasanna S Panchamukhi
> >>>>Linux Technology Center
> >>>>India Software Labs, IBM Bangalore
> >>>>Email: prasanna@in.ibm.com
> >>>>Ph: 91-80-51776329

-- 
Thanks & Regards
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: prasanna@in.ibm.com
Ph: 91-80-51776329

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [3/3] Userspace probes prototype-take2
@ 2006-02-20  3:16 Zhang, Yanmin
  2006-02-20  4:51 ` Prasanna S Panchamukhi
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Yanmin @ 2006-02-20  3:16 UTC (permalink / raw)
  To: Zhang, Yanmin, prasanna, systemtap

I lost an important comment. The patch is not aware of signal processing. After kernel prepares the single-step-inst on the stack, if a signal is delivered to the thread, kernel will save some states into stack and switch to signal handler function, so single-step-inst on the stack might be erased.

>>-----Original Message-----
>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Zhang, Yanmin
>>Sent: 2006年2月17日 17:20
>>To: prasanna@in.ibm.com; systemtap@sources.redhat.com
>>Subject: RE: [3/3] Userspace probes prototype-take2
>>
>>2 main issues:
>>1) task switch caused by external interrupt when single-step;
>>2) multi-thread:
>>
>>See below inline comments.
>>
>>Yanmin
>>
>>>>-----Original Message-----
>>>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Prasanna S Panchamukhi
>>>>Sent: 2006年2月8日 22:14
>>>>To: systemtap@sources.redhat.com
>>>>Subject: Re: [3/3] Userspace probes prototype-take2
>>>>
>>>>
>>>>This patch handles the executing the registered callback
>>>>functions when probes is hit.
>>>>
>>>>	Each userspace probe is uniquely identified by the
>>>>combination of inode and offset, hence during registeration the inode
>>>>and offset combination is added to kprobes hash table. Initially when
>>>>breakpoint instruction is hit, the kprobes hash table is looked up
>>>>for matching inode and offset. The pre_handlers are called in sequence
>>>>if multiple probes are registered. The original instruction is single
>>>>stepped out-of-line similar to kernel probes. In kernel space probes,
>>>>single stepping out-of-line is achieved by copying the instruction on
>>>>to some location within kernel address space and then single step
>>>>from that location. But for userspace probes, instruction copied
>>>>into kernel address space cannot be single stepped, hence the
>>>>instruction should be copied to user address space. The solution is
>>>>to find free space in the current process address space and then copy
>>>>the original instruction and single step that instruction.
>>>>
>>>>User processes use stack space to store local variables, agruments and
>>>>return values. Normally the stack space either below or above the
>>>>stack pointer indicates the free stack space. If the stack grows
>>>>downwards, the stack space below the stack pointer indicates the
>>>>unused stack free space and if the stack grows upwards, the stack
>>>>space above the stack pointer indicates the unused stack free space.
>>>>
>>>>The instruction to be single stepped can modify the stack space, hence
>>>>before using the unused stack free space, sufficient stack space
>>>>should be left. The instruction is copied to the bottom of the page
>>>>and check is made such that the copied instruction does not cross the
>>>>page boundry. The copied instruction is then single stepped.
>>>>Several architectures does not allow the instruction to be executed
>>>>from the stack location, since no-exec bit is set for the stack pages.
>>>>In those architectures, the page table entry corresponding to the
>>>>stack page is identified and the no-exec bit is unset making the
>>>>instruction on that stack page to be executed.
>>>>
>>>>There are situations where even the unused free stack space is not
>>>>enough for the user instruction to be copied and single stepped. In
>>>>such situations, the virtual memory area(vma) can be expanded beyond
>>>>the current stack vma. This expaneded stack can be used to copy the
>>>>original instruction and single step out-of-line.
>>>>
>>>>Even if the vma cannot be extended then the instruction much be
>>>>executed inline, by replacing the breakpoint instruction with original
>>>>instruction.
>>>>
>>>>TODO list
>>>>--------
>>>>1. This patch is not stable yet, should work for most conditions.
>>>>
>>>>2. This patch works only with PREEMPT config option disabled, to work
>>>>in PREEMPT enabled condition handlers must be re-written and must
>>>>be seperated out from kernel probes allowing preemption.
>>One of my old comments is an external device interrupt might happen when cpu is single-stepping the original instruction, then the task
>>might be switched to another cpu. If we disable irq when exiting to user space to single step the instruction, kernel might switch the
>>task off just on the exit kernel path. 1) uprobe_page; 2) kprobe_ctlblk, These 2 resources shouldn't be pre cpu, or we need get another
>>approach. How could you resolve the task switch issue?
>>
>>
>>
>>>>
>>>>3. Insert probes on copy-on-write pages. Tracks all COW pages for the
>>>>page containing the specified probe point and inserts/removes all the
>>>>probe points for that page.
>>>>
>>>>4. Optimize the insertion of probes through readpage hooks. Identify
>>>>all the probes to be inserted on the read page and insert them at
>>>>once.
>>>>
>>>>5. Resume exectution should handle setting of proper eip and eflags
>>>>for special instructions similar to kernel probes.
>>>>
>>>>6. Single stepping out-of-line expands the stack if there is no
>>>>enough stack space to copy the original instruction. Expansion of
>>>>stack should be shrinked back to the original size after single
>>>>stepping or the expanded stack should be reused for single stepping
>>>>out-of-line for other probes.
>>>>
>>>>7. A wrapper routines to calculate the offset from the probed file
>>>>beginning. In case of dynamic shared library, the offset is
>>>>calculated by substracting the address of the probe point from the
>>>>beginning of the file mapped address.
>>>>
>>>>8. Handing of page faults while inthe kprobes_handler() and while
>>>>single stepping.
>>>>
>>>>9. Accessing user space pages not present in memory, from the
>>>>registered callback routines.
>>>>
>>>>Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
>>>>
>>>>
>>>> arch/i386/kernel/kprobes.c |  460 +++++++++++++++++++++++++++++++++++++++++++--
>>>> include/asm-i386/kprobes.h |   13 +
>>>> include/linux/kprobes.h    |    7
>>>> kernel/kprobes.c           |    3
>>>> 4 files changed, 468 insertions(+), 15 deletions(-)
>>>>
>>>>diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line arch/i386/kernel/kprobes.c
>>>>--- linux-2.6.16-rc1-mm5/arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
>>>>+++ linux-2.6.16-rc1-mm5-prasanna/arch/i386/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
>>>>@@ -30,6 +30,7 @@
>>>>
>>>> #include <linux/config.h>
>>>> #include <linux/kprobes.h>
>>>>+#include <linux/hash.h>
>>>> #include <linux/ptrace.h>
>>>> #include <linux/preempt.h>
>>>> #include <asm/cacheflush.h>
>>>>@@ -38,8 +39,12 @@
>>>>
>>>> void jprobe_return_end(void);
>>>>
>>>>+static struct uprobe_page *uprobe_page;
>>>>+static struct hlist_head uprobe_page_table[KPROBE_TABLE_SIZE];
>>>> DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
>>>> DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>>>+DEFINE_PER_CPU(struct uprobe *, current_uprobe) = NULL;
>>>>+DEFINE_PER_CPU(unsigned long, singlestep_addr);
>>>>
>>>> /* insert a jmp code */
>>>> static inline void set_jmp_op(void *from, void *to)
>>>>@@ -125,6 +130,23 @@ void __kprobes arch_disarm_kprobe(struct
>>>> 			   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
>>>> }
>>>>
>>>>+void __kprobes arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address)
>>>>+{
>>>>+	*address = p->opcode;
>>>>+}
>>>>+
>>>>+void __kprobes arch_arm_uprobe(unsigned long *address)
>>>>+{
>>>>+	*(kprobe_opcode_t *)address = BREAKPOINT_INSTRUCTION;
>>>>+}
>>>>+
>>>>+void __kprobes arch_copy_uprobe(struct kprobe *p, unsigned long *address)
>>>>+{
>>>>+	memcpy(p->ainsn.insn, (kprobe_opcode_t *)address,
>>>>+				MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
>>>>+	p->opcode = *(kprobe_opcode_t *)address;
>>>>+}
>>>>+
>>>> static inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)
>>>> {
>>>> 	kcb->prev_kprobe.kp = kprobe_running();
>>>>@@ -151,15 +173,326 @@ static inline void set_current_kprobe(st
>>>> 		kcb->kprobe_saved_eflags &= ~IF_MASK;
>>>> }
>>>>
>>>>+struct uprobe_page __kprobes *get_upage_current(struct task_struct *tsk)
>>>>+{
>>>>+	struct hlist_head *head;
>>>>+	struct hlist_node *node;
>>>>+	struct uprobe_page *upage;
>>>>+
>>>>+	head = &uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)];
>>>>+	hlist_for_each_entry(upage, node, head, hlist) {
>>>>+		if (upage->tsk == tsk)
>>>>+			return upage;
>>>>+        }
>>>>+	return NULL;
>>>>+}
>>>>+
>>>>+struct uprobe_page __kprobes *get_upage_free(struct task_struct *tsk)
>>>>+{
>>>>+	int cpu;
>>>>+
>>>>+	for_each_cpu(cpu) {
>>>>+		struct uprobe_page *upage;
>>>>+		upage = per_cpu_ptr(uprobe_page, cpu);
>>>>+		if (upage->status & UPROBE_PAGE_FREE)
>>>>+			return upage;
>>>>+	}
>>>>+	return NULL;
>>>>+}
>>>>+
>>>>+/**
>>>>+ * This routines get the pte of the page containing the specified address.
>>>>+ */
>>>>+static pte_t  __kprobes *get_uprobe_pte(unsigned long address)
>>>>+{
>>>>+	pgd_t *pgd;
>>>>+	pud_t *pud;
>>>>+	pmd_t *pmd;
>>>>+	pte_t *pte = NULL;
>>>>+
>>>>+	pgd = pgd_offset(current->mm, address);
>>>>+	if (!pgd)
>>>>+		goto out;
>>>>+
>>>>+	pud = pud_offset(pgd, address);
>>>>+	if (!pud)
>>>>+		goto out;
>>>>+
>>>>+	pmd = pmd_offset(pud, address);
>>>>+	if (!pmd)
>>>>+		goto out;
>>>>+
>>>>+	pte = pte_alloc_map(current->mm, pmd, address);
>>>>+
>>>>+out:
>>>>+	return pte;
>>>>+}
>>>>+
>>>>+/**
>>>>+ *  This routine check for space in the current process's stack address space.
>>>>+ *  If enough address space is found, it just maps a new page and copies the
>>>>+ *  new instruction on that page for single stepping out-of-line.
>>>>+ */
>>>>+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
>>>>+			struct pt_regs *regs, struct vm_area_struct *vma)
>>>>+{
>>>>+	unsigned long addr, *vaddr, stack_addr = regs->esp;
>>>>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>>>+	struct uprobe_page *upage;
>>>>+	struct page *page;
>>>>+	pte_t *pte;
>>>>+
>>>>+
>>>>+	if (vma->vm_flags & VM_GROWSDOWN) {
>>>>+		if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
>>>>+			return -ENOMEM;
>>>>+
>>>>+		addr = vma->vm_start;
>>>>+	} else if (vma->vm_flags & VM_GROWSUP) {
>>>>+		if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
>>>>+			return -ENOMEM;
>>>>+
>>>>+		addr = vma->vm_end - size;
>>>>+	} else
>>>>+		return -EFAULT;
>>>>+
>>The multi-thread case is not resolved here. One of typical multi-thread model is that the all threads share the same vma and every thread
>>has 8-k stack. If 2 threads trigger uprobe (although might be not the same uprobe) at the same time, one thread might erase single-step
>>instruction of another.
>>
>>
>>
>>>>+	preempt_enable_no_resched();
>>>>+
>>>>+	pte = get_uprobe_pte(addr);
>>>>+	preempt_disable();
>>>>+	if (!pte)
>>>>+		return -EFAULT;
>>>>+
>>>>+	upage = get_upage_free(current);
>>>>+	upage->status &= ~UPROBE_PAGE_FREE;
>>>>+	upage->tsk = current;
>>>>+	INIT_HLIST_NODE(&upage->hlist);
>>>>+	hlist_add_head(&upage->hlist,
>>>>+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
>>>>+
>>>>+	upage->orig_pte = pte;
>>>>+	upage->orig_pte_val =  pte_val(*pte);
>>>>+	set_pte(pte, (*(upage->alias_pte)));
>>>>+
>>>>+	page = pte_page(*pte);
>>>>+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
>>>>+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
>>>>+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
>>>>+	kunmap_atomic(vaddr, KM_USER1);
>>>>+	regs->eip = addr;
>>So the temp page, upage->alias_addr, replaces the original one on the stack. If the replaced instruction is to operate stack, such like
>>"push eax", the result might be on the new page. After the single step, the pte is restored to the original page which doesn't have
>>the value of eax.
>>
>>
>>
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>>+/**
>>>>+ * This routine expands the stack beyond the present process address space
>>>>+ * and copies the instruction to that location, so that processor can
>>>>+ * single step out-of-line.
>>>>+ */
>>>>+static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
>>>>+			struct pt_regs *regs, struct vm_area_struct *vma)
>>It has the same issues like function copy_insn_on_new_page.
>>
>>
>>>>+{
>>>>+	unsigned long addr, *vaddr, vm_addr;
>>>>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>>>+	struct vm_area_struct *new_vma;
>>>>+	struct uprobe_page *upage;
>>>>+	struct mm_struct *mm = current->mm;
>>>>+	struct page *page;
>>>>+	pte_t *pte;
>>>>+
>>>>+
>>>>+	if (vma->vm_flags & VM_GROWSDOWN)
>>>>+		vm_addr = vma->vm_start - size;
>>>>+	else if (vma->vm_flags & VM_GROWSUP)
>>>>+		vm_addr = vma->vm_end + size;
>>>>+	else
>>>>+		return -EFAULT;
>>>>+
>>>>+	preempt_enable_no_resched();
>>>>+
>>>>+	/* TODO: do we need to expand stack if extend_vma fails? */
>>>>+	new_vma = find_extend_vma(mm, vm_addr);
>>>>+	preempt_disable();
>>>>+	if (!new_vma)
>>>>+		return -ENOMEM;
>>>>+
>>>>+	/*
>>>>+	 * TODO: Expanding stack for every probe is not a good idea, stack must
>>>>+	 * either be shrunk to its original size after single stepping or the
>>>>+	 * expanded stack should be kept track of, for the probed application,
>>>>+	 * so it can be reused to single step out-of-line
>>>>+	 */
>>>>+	if (new_vma->vm_flags & VM_GROWSDOWN)
>>>>+		addr = new_vma->vm_start;
>>>>+	else
>>>>+		addr = new_vma->vm_end - size;
>>>>+
>>>>+	preempt_enable_no_resched();
>>>>+	pte = get_uprobe_pte(addr);
>>>>+	preempt_disable();
>>>>+	if (!pte)
>>>>+		return -EFAULT;
>>>>+
>>>>+	upage = get_upage_free(current);
>>>>+	upage->status &= ~UPROBE_PAGE_FREE;
>>>>+	upage->tsk = current;
>>>>+	INIT_HLIST_NODE(&upage->hlist);
>>>>+	hlist_add_head(&upage->hlist,
>>>>+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
>>>>+	upage->orig_pte = pte;
>>>>+	upage->orig_pte_val =  pte_val(*pte);
>>>>+	set_pte(pte, (*(upage->alias_pte)));
>>>>+
>>>>+	page = pte_page(*pte);
>>>>+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
>>>>+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
>>>>+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
>>>>+	kunmap_atomic(vaddr, KM_USER1);
>>>>+	regs->eip = addr;
>>>>+
>>>>+	return  0;
>>>>+}
>>>>+
>>>>+/**
>>>>+ * This routine checks for stack free space below the stack pointer and
>>>>+ * then copies the instructions at that location so that the processor can
>>>>+ * single step out-of-line. If there is no enough stack space or if
>>>>+ * copy_to_user fails or if the vma is invalid, it returns error.
>>>>+ */
>>>>+static int __kprobes copy_insn_onstack(struct uprobe *uprobe,
>>>>+			struct pt_regs *regs, unsigned long flags)
>>>>+{
>>>>+	unsigned long page_addr, stack_addr = regs->esp;
>>>>+	int  size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>>>+	unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
>>>>+
>>>>+	if (flags & VM_GROWSDOWN) {
>>>>+		page_addr = stack_addr & PAGE_MASK;
>>>>+
>>>>+		if (((stack_addr - sizeof(long long))) < (page_addr + size))
>>>>+			return -ENOMEM;
>>>>+
>>>>+		if (__copy_to_user_inatomic((unsigned long *)page_addr, source,
>>>>+									size))
>>>>+			return -EFAULT;
>>>>+
>>>>+		regs->eip = page_addr;
>>>>+	} else if (flags & VM_GROWSUP) {
>>>>+		page_addr = stack_addr & PAGE_MASK;
>>>>+
>>>>+		if (page_addr == stack_addr)
>>>>+			return -ENOMEM;
>>>>+		else
>>>>+			page_addr += PAGE_SIZE;
>>>>+
>>>>+		if ((page_addr - size) < (stack_addr + sizeof(long long)))
>>>>+			return -ENOMEM;
>>>>+
>>>>+		if (__copy_to_user_inatomic((unsigned long *)(page_addr - size),
>>>>+								source, size))
>>>>+			return -EFAULT;
>>>>+
>>>>+		regs->eip = page_addr - size;
>>>>+	} else
>>>>+		return -EINVAL;
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>>+/**
>>>>+ * This routines get the page containing the probe, maps it and
>>>>+ * replaced the instruction at the probed address with specified
>>>>+ * opcode.
>>>>+ */
>>>>+void __kprobes replace_original_insn(struct uprobe *uprobe,
>>>>+				struct pt_regs *regs, kprobe_opcode_t opcode)
>>>>+{
>>>>+	kprobe_opcode_t *addr;
>>>>+	struct page *page;
>>>>+
>>>>+	page = find_get_page(uprobe->inode->i_mapping,
>>>>+					uprobe->offset >> PAGE_CACHE_SHIFT);
>>>>+	lock_page(page);
>>>>+
>>>>+	addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER0);
>>>>+	addr = (kprobe_opcode_t *)((unsigned long)addr +
>>>>+				 (unsigned long)(uprobe->offset & ~PAGE_MASK));
>>>>+	*addr = opcode;
>>>>+	/*TODO: flush vma ? */
>>>>+	kunmap_atomic(addr, KM_USER0);
>>>>+
>>>>+	unlock_page(page);
>>>>+
>>>>+	page_cache_release(page);
>>>>+	regs->eip = (unsigned long)uprobe->kp.addr;
>>>>+}
>>>>+
>>>>+/**
>>>>+ * This routine provides the functionality of single stepping out of line.
>>>>+ * If single stepping out-of-line cannot be achieved, it replaces with
>>>>+ * the original instruction allowing it to single step inline.
>>>>+ */
>>>>+static inline int uprobe_single_step(struct kprobe *p, struct pt_regs *regs)
>>>>+{
>>>>+	unsigned long stack_addr = regs->esp, flags;
>>>>+	struct vm_area_struct *vma = NULL;
>>>>+	struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
>>>>+	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
>>>>+	int err = 0;
>>>>+
>>>>+	down_read(&current->mm->mmap_sem);
>>>>+
>>>>+	vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
>>>>+	if (!vma) {
>>>>+		/* TODO: Need better error reporting? */
>>>>+		printk("No vma found\n");
>>>>+		up_read(&current->mm->mmap_sem);
>>>>+		return -ENOENT;
>>>>+	}
>>>>+	flags = vma->vm_flags;
>>>>+	up_read(&current->mm->mmap_sem);
>>>>+
>>>>+	kcb->kprobe_status |= UPROBE_SS_STACK;
>>>>+	err = copy_insn_onstack(uprobe, regs, flags);
>>>>+
>>>>+	down_write(&current->mm->mmap_sem);
>>>>+
>>>>+	if (err) {
>>>>+		kcb->kprobe_status |= UPROBE_SS_NEW_STACK;
>>>>+		err = copy_insn_on_new_page(uprobe, regs, vma);
>>>>+	}
>>>>+	if (err) {
>>>>+		kcb->kprobe_status |= UPROBE_SS_EXPSTACK;
>>>>+		err = copy_insn_onexpstack(uprobe, regs, vma);
>>>>+	}
>>>>+
>>>>+	up_write(&current->mm->mmap_sem);
>>>>+
>>>>+	if (err) {
>>>>+		kcb->kprobe_status |= UPROBE_SS_INLINE;
>>>>+		replace_original_insn(uprobe, regs, uprobe->kp.opcode);
>>>>+	}
>>>>+
>>>>+	 __get_cpu_var(singlestep_addr) = regs->eip;
>>>>+
>>>>+
>>>>+	return 0;
>>>>+}
>>>>+
>>>> static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
>>>> {
>>>> 	regs->eflags |= TF_MASK;
>>>> 	regs->eflags &= ~IF_MASK;
>>>> 	/*single step inline if the instruction is an int3*/
>>>>+
>>>> 	if (p->opcode == BREAKPOINT_INSTRUCTION)
>>>> 		regs->eip = (unsigned long)p->addr;
>>>>-	else
>>>>-		regs->eip = (unsigned long)&p->ainsn.insn;
>>>>+	else {
>>>>+		if (!kernel_text_address((unsigned long)p->addr))
>>>>+			uprobe_single_step(p, regs);
>>>>+		else
>>>>+			regs->eip = (unsigned long)&p->ainsn.insn;
>>>>+	}
>>>> }
>>>>
>>>> /* Called with kretprobe_lock held */
>>>>@@ -194,6 +527,7 @@ static int __kprobes kprobe_handler(stru
>>>> 	kprobe_opcode_t *addr = NULL;
>>>> 	unsigned long *lp;
>>>> 	struct kprobe_ctlblk *kcb;
>>>>+	unsigned seg = regs->xcs & 0xffff;
>>>> #ifdef CONFIG_PREEMPT
>>>> 	unsigned pre_preempt_count = preempt_count();
>>>> #endif /* CONFIG_PREEMPT */
>>>>@@ -208,14 +542,21 @@ static int __kprobes kprobe_handler(stru
>>>> 	/* Check if the application is using LDT entry for its code segment and
>>>> 	 * calculate the address by reading the base address from the LDT entry.
>>>> 	 */
>>>>-	if ((regs->xcs & 4) && (current->mm)) {
>>>>+
>>>>+	if (regs->eflags & VM_MASK)
>>>>+		addr = (kprobe_opcode_t *)(((seg << 4) + regs->eip -
>>>>+			sizeof(kprobe_opcode_t)) & 0xffff);
>>>>+	else if ((regs->xcs & 4) && (current->mm)) {
>>>>+		local_irq_enable();
>>>>+		down(&current->mm->context.sem);
>>>> 		lp = (unsigned long *) ((unsigned long)((regs->xcs >> 3) * 8)
>>>> 					+ (char *) current->mm->context.ldt);
>>>> 		addr = (kprobe_opcode_t *) (get_desc_base(lp) + regs->eip -
>>>> 						sizeof(kprobe_opcode_t));
>>>>-	} else {
>>>>+		up(&current->mm->context.sem);
>>>>+		local_irq_disable();
>>>>+	} else
>>>> 		addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
>>>>-	}
>>>> 	/* Check we're not actually recursing */
>>>> 	if (kprobe_running()) {
>>>> 		p = get_kprobe(addr);
>>>>@@ -235,7 +576,6 @@ static int __kprobes kprobe_handler(stru
>>>> 			save_previous_kprobe(kcb);
>>>> 			set_current_kprobe(p, regs, kcb);
>>>> 			kprobes_inc_nmissed_count(p);
>>>>-			prepare_singlestep(p, regs);
>>>> 			kcb->kprobe_status = KPROBE_REENTER;
>>>> 			return 1;
>>>> 		} else {
>>>>@@ -307,8 +647,8 @@ static int __kprobes kprobe_handler(stru
>>>> 	}
>>>>
>>>> ss_probe:
>>>>-	prepare_singlestep(p, regs);
>>>> 	kcb->kprobe_status = KPROBE_HIT_SS;
>>>>+	prepare_singlestep(p, regs);
>>>> 	return 1;
>>>>
>>>> no_kprobe:
>>>>@@ -498,6 +838,33 @@ no_change:
>>>> 	return;
>>>> }
>>>>
>>>>+static void __kprobes resume_execution_user(struct uprobe *uprobe,
>>>>+				struct pt_regs *regs, struct kprobe_ctlblk *kcb)
>>>>+{
>>>>+	unsigned long delta;
>>>>+	struct uprobe_page *upage;
>>>>+
>>>>+	/*
>>>>+	 * TODO :need to fixup special instructions as done with kernel probes.
>>>>+	 */
>>>>+	delta = regs->eip - __get_cpu_var(singlestep_addr);
>>>>+	regs->eip = (unsigned long)(uprobe->kp.addr + delta);
>>>>+
>>>>+	if ((kcb->kprobe_status & UPROBE_SS_EXPSTACK) ||
>>>>+			(kcb->kprobe_status & UPROBE_SS_NEW_STACK)) {
>>>>+		upage = get_upage_current(current);
>>>>+		set_pte(upage->orig_pte, __pte(upage->orig_pte_val));
>>>>+		pte_unmap(upage->orig_pte);
>>>>+
>>>>+		upage->status = UPROBE_PAGE_FREE;
>>>>+		hlist_del(&upage->hlist);
>>>>+
>>>>+	} else if (kcb->kprobe_status & UPROBE_SS_INLINE)
>>>>+		replace_original_insn(uprobe, regs,
>>>>+				(kprobe_opcode_t)BREAKPOINT_INSTRUCTION);
>>>>+	regs->eflags &= ~TF_MASK;
>>>>+}
>>>>+
>>>> /*
>>>>  * Interrupts are disabled on entry as trap1 is an interrupt gate and they
>>>>  * remain disabled thoroughout this function.
>>>>@@ -510,16 +877,19 @@ static inline int post_kprobe_handler(st
>>>> 	if (!cur)
>>>> 		return 0;
>>>>
>>>>-	if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) {
>>>>-		kcb->kprobe_status = KPROBE_HIT_SSDONE;
>>>>+	if (!(kcb->kprobe_status & KPROBE_REENTER) && cur->post_handler) {
>>>>+		kcb->kprobe_status |= KPROBE_HIT_SSDONE;
>>>> 		cur->post_handler(cur, regs, 0);
>>>> 	}
>>>>
>>>>-	resume_execution(cur, regs, kcb);
>>>>+	if (!kernel_text_address((unsigned long)cur->addr))
>>>>+		resume_execution_user(__get_cpu_var(current_uprobe), regs, kcb);
>>>>+	else
>>>>+		resume_execution(cur, regs, kcb);
>>>> 	regs->eflags |= kcb->kprobe_saved_eflags;
>>>>
>>>> 	/*Restore back the original saved kprobes variables and continue. */
>>>>-	if (kcb->kprobe_status == KPROBE_REENTER) {
>>>>+	if (kcb->kprobe_status & KPROBE_REENTER) {
>>>> 		restore_previous_kprobe(kcb);
>>>> 		goto out;
>>>> 	}
>>>>@@ -547,7 +917,13 @@ static inline int kprobe_fault_handler(s
>>>> 		return 1;
>>>>
>>>> 	if (kcb->kprobe_status & KPROBE_HIT_SS) {
>>>>-		resume_execution(cur, regs, kcb);
>>>>+		if (!kernel_text_address((unsigned long)cur->addr)) {
>>>>+			struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
>>>>+			/* TODO: Proper handling of all instruction */
>>>>+			replace_original_insn(uprobe, regs, uprobe->kp.opcode);
>>>>+			regs->eflags &= ~TF_MASK;
>>>>+		} else
>>>>+			resume_execution(cur, regs, kcb);
>>>> 		regs->eflags |= kcb->kprobe_old_eflags;
>>>>
>>>> 		reset_current_kprobe();
>>>>@@ -654,7 +1030,67 @@ int __kprobes longjmp_break_handler(stru
>>>> 	return 0;
>>>> }
>>>>
>>>>+static void free_alias(void)
>>>>+{
>>>>+	int cpu;
>>>>+
>>>>+	for_each_cpu(cpu) {
>>>>+		struct uprobe_page *upage;
>>>>+		upage = per_cpu_ptr(uprobe_page, cpu);
>>>>+
>>>>+		if (upage->alias_addr) {
>>>>+			set_pte(upage->alias_pte, __pte(upage->alias_pte_val));
>>>>+			kfree(upage->alias_addr);
>>>>+		}
>>>>+		upage->alias_pte = 0;
>>>>+	}
>>>>+	free_percpu(uprobe_page);
>>>>+	return;
>>>>+}
>>>>+
>>>>+static int alloc_alias(void)
>>>>+{
>>>>+	int cpu;
>>>>+
>>>>+	uprobe_page = __alloc_percpu(sizeof(struct uprobe_page));
>>[YM] Do here codes try to resolve the problem of task switch at single-step? If so, the per cpu data also might be used up although
>>get_upage_free will go through all uprobe_page of all cpus. I suggest to allocate a series of uprobe_page, and allocate again when they
>>are used up.
>>
>>
>>
>>
>>>>+
>>>>+	for_each_cpu(cpu) {
>>>>+		struct uprobe_page *upage;
>>>>+		upage = per_cpu_ptr(uprobe_page, cpu);
>>>>+		upage->alias_addr = kmalloc(PAGE_SIZE, GFP_USER);
>>[YM] Does kmalloc(PAGE_SIZE...) imply the result is aligned to page? How about using alloc_page?
>>
>>
>>>>+		if (!upage->alias_addr) {
>>>>+			free_alias();
>>>>+			return -ENOMEM;
>>>>+		}
>>>>+		upage->alias_pte = lookup_address(
>>>>+					(unsigned long)upage->alias_addr);
>>>>+		upage->alias_pte_val = pte_val(*upage->alias_pte);
>>>>+		if (upage->alias_pte) {
>>[YM] If kmalloc returns a non-NULL address, upage->alias_pte is not equal to NULL. So delete above checking?
>>
>>
>>>>+			upage->status = UPROBE_PAGE_FREE;
>>>>+			set_pte(upage->alias_pte,
>>>>+						pte_mkdirty(*upage->alias_pte));
>>>>+			set_pte(upage->alias_pte,
>>>>+						pte_mkexec(*upage->alias_pte));
>>>>+			set_pte(upage->alias_pte,
>>>>+						 pte_mkwrite(*upage->alias_pte));
>>>>+			set_pte(upage->alias_pte,
>>>>+						pte_mkyoung(*upage->alias_pte));
>>>>+		}
>>>>+	}
>>>>+	return 0;
>>>>+}
>>>>+
>>>> int __init arch_init_kprobes(void)
>>>> {
>>>>+	int ret = 0;
>>>>+	/*
>>>>+	 * user space probes requires a page to copy the original instruction
>>>>+	 * so that it can single step if there is no free stack space, allocate
>>>>+	 * per cpu page.
>>>>+	 */
>>>>+
>>>>+	if ((ret = alloc_alias()))
>>>>+		return ret;
>>>>+
>>>> 	return 0;
>>>> }
>>>>diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/asm-i386/kprobes.h
>>>>--- linux-2.6.16-rc1-mm5/include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
>>>>+++ linux-2.6.16-rc1-mm5-prasanna/include/asm-i386/kprobes.h	2006-02-08 19:26:10.000000000 +0530
>>>>@@ -42,6 +42,7 @@ typedef u8 kprobe_opcode_t;
>>>> #define JPROBE_ENTRY(pentry)	(kprobe_opcode_t *)pentry
>>>> #define ARCH_SUPPORTS_KRETPROBES
>>>> #define arch_remove_kprobe(p)	do {} while (0)
>>>>+#define UPROBE_PAGE_FREE 0x00000001
>>>>
>>>> void kretprobe_trampoline(void);
>>>>
>>>>@@ -74,6 +75,18 @@ struct kprobe_ctlblk {
>>>> 	struct prev_kprobe prev_kprobe;
>>>> };
>>>>
>>>>+/* per cpu uprobe page structure */
>>>>+struct uprobe_page {
>>>>+	struct hlist_node hlist;
>>>>+	pte_t *alias_pte;
>>>>+	pte_t *orig_pte;
>>>>+	unsigned long orig_pte_val;
>>>>+	unsigned long alias_pte_val;
>>[YM] I think the patch doesn't support CONFIG_X86_PAE, because if CONFIG_X86_PAE=y, pte_t becomes 64 bits.
>>How about changing above 2 members' type to pte_t directly?
>>
>>
>>
>>>>+	void *alias_addr;
>>>>+	struct task_struct *tsk;
>>>>+	unsigned long status;
>>>>+};
>>>>+
>>>> /* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
>>>>  * if necessary, before executing the original int3/1 (trap) handler.
>>>>  */
>>>>diff -puN include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/linux/kprobes.h
>>>>--- linux-2.6.16-rc1-mm5/include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
>>>>+++ linux-2.6.16-rc1-mm5-prasanna/include/linux/kprobes.h	2006-02-08 19:26:10.000000000 +0530
>>>>@@ -45,11 +45,18 @@
>>>> #ifdef CONFIG_KPROBES
>>>> #include <asm/kprobes.h>
>>>>
>>>>+#define KPROBE_HASH_BITS 6
>>>>+#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
>>>>+
>>>> /* kprobe_status settings */
>>>> #define KPROBE_HIT_ACTIVE	0x00000001
>>>> #define KPROBE_HIT_SS		0x00000002
>>>> #define KPROBE_REENTER		0x00000004
>>>> #define KPROBE_HIT_SSDONE	0x00000008
>>>>+#define UPROBE_SS_STACK		0x00000010
>>>>+#define UPROBE_SS_EXPSTACK	0x00000020
>>>>+#define UPROBE_SS_INLINE	0x00000040
>>>>+#define UPROBE_SS_NEW_STACK	0x00000080
>>>>
>>>> /* Attach to insert probes on any functions which should be ignored*/
>>>> #define __kprobes	__attribute__((__section__(".kprobes.text")))
>>>>diff -puN kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line kernel/kprobes.c
>>>>--- linux-2.6.16-rc1-mm5/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:10.000000000 +0530
>>>>+++ linux-2.6.16-rc1-mm5-prasanna/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
>>>>@@ -42,9 +42,6 @@
>>>> #include <asm/errno.h>
>>>> #include <asm/kdebug.h>
>>>>
>>>>-#define KPROBE_HASH_BITS 6
>>>>-#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
>>>>-
>>>> static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
>>>> static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
>>>> static struct list_head uprobe_module_list;
>>>>
>>>>_
>>>>--
>>>>Prasanna S Panchamukhi
>>>>Linux Technology Center
>>>>India Software Labs, IBM Bangalore
>>>>Email: prasanna@in.ibm.com
>>>>Ph: 91-80-51776329

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [3/3] Userspace probes prototype-take2
@ 2006-02-17  9:19 Zhang, Yanmin
  2006-02-20  5:36 ` Prasanna S Panchamukhi
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Yanmin @ 2006-02-17  9:19 UTC (permalink / raw)
  To: prasanna, systemtap

2 main issues:
1) task switch caused by external interrupt when single-step;
2) multi-thread:

See below inline comments.

Yanmin

>>-----Original Message-----
>>From: systemtap-owner@sourceware.org [mailto:systemtap-owner@sourceware.org] On Behalf Of Prasanna S Panchamukhi
>>Sent: 2006年2月8日 22:14
>>To: systemtap@sources.redhat.com
>>Subject: Re: [3/3] Userspace probes prototype-take2
>>
>>
>>This patch handles the executing the registered callback
>>functions when probes is hit.
>>
>>	Each userspace probe is uniquely identified by the
>>combination of inode and offset, hence during registeration the inode
>>and offset combination is added to kprobes hash table. Initially when
>>breakpoint instruction is hit, the kprobes hash table is looked up
>>for matching inode and offset. The pre_handlers are called in sequence
>>if multiple probes are registered. The original instruction is single
>>stepped out-of-line similar to kernel probes. In kernel space probes,
>>single stepping out-of-line is achieved by copying the instruction on
>>to some location within kernel address space and then single step
>>from that location. But for userspace probes, instruction copied
>>into kernel address space cannot be single stepped, hence the
>>instruction should be copied to user address space. The solution is
>>to find free space in the current process address space and then copy
>>the original instruction and single step that instruction.
>>
>>User processes use stack space to store local variables, agruments and
>>return values. Normally the stack space either below or above the
>>stack pointer indicates the free stack space. If the stack grows
>>downwards, the stack space below the stack pointer indicates the
>>unused stack free space and if the stack grows upwards, the stack
>>space above the stack pointer indicates the unused stack free space.
>>
>>The instruction to be single stepped can modify the stack space, hence
>>before using the unused stack free space, sufficient stack space
>>should be left. The instruction is copied to the bottom of the page
>>and check is made such that the copied instruction does not cross the
>>page boundry. The copied instruction is then single stepped.
>>Several architectures does not allow the instruction to be executed
>>from the stack location, since no-exec bit is set for the stack pages.
>>In those architectures, the page table entry corresponding to the
>>stack page is identified and the no-exec bit is unset making the
>>instruction on that stack page to be executed.
>>
>>There are situations where even the unused free stack space is not
>>enough for the user instruction to be copied and single stepped. In
>>such situations, the virtual memory area(vma) can be expanded beyond
>>the current stack vma. This expaneded stack can be used to copy the
>>original instruction and single step out-of-line.
>>
>>Even if the vma cannot be extended then the instruction much be
>>executed inline, by replacing the breakpoint instruction with original
>>instruction.
>>
>>TODO list
>>--------
>>1. This patch is not stable yet, should work for most conditions.
>>
>>2. This patch works only with PREEMPT config option disabled, to work
>>in PREEMPT enabled condition handlers must be re-written and must
>>be seperated out from kernel probes allowing preemption.
One of my old comments is an external device interrupt might happen when cpu is single-stepping the original instruction, then the task might be switched to another cpu. If we disable irq when exiting to user space to single step the instruction, kernel might switch the task off just on the exit kernel path. 1) uprobe_page; 2) kprobe_ctlblk, These 2 resources shouldn't be pre cpu, or we need get another approach. How could you resolve the task switch issue?



>>
>>3. Insert probes on copy-on-write pages. Tracks all COW pages for the
>>page containing the specified probe point and inserts/removes all the
>>probe points for that page.
>>
>>4. Optimize the insertion of probes through readpage hooks. Identify
>>all the probes to be inserted on the read page and insert them at
>>once.
>>
>>5. Resume exectution should handle setting of proper eip and eflags
>>for special instructions similar to kernel probes.
>>
>>6. Single stepping out-of-line expands the stack if there is no
>>enough stack space to copy the original instruction. Expansion of
>>stack should be shrinked back to the original size after single
>>stepping or the expanded stack should be reused for single stepping
>>out-of-line for other probes.
>>
>>7. A wrapper routines to calculate the offset from the probed file
>>beginning. In case of dynamic shared library, the offset is
>>calculated by substracting the address of the probe point from the
>>beginning of the file mapped address.
>>
>>8. Handing of page faults while inthe kprobes_handler() and while
>>single stepping.
>>
>>9. Accessing user space pages not present in memory, from the
>>registered callback routines.
>>
>>Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
>>
>>
>> arch/i386/kernel/kprobes.c |  460 +++++++++++++++++++++++++++++++++++++++++++--
>> include/asm-i386/kprobes.h |   13 +
>> include/linux/kprobes.h    |    7
>> kernel/kprobes.c           |    3
>> 4 files changed, 468 insertions(+), 15 deletions(-)
>>
>>diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line arch/i386/kernel/kprobes.c
>>--- linux-2.6.16-rc1-mm5/arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
>>+++ linux-2.6.16-rc1-mm5-prasanna/arch/i386/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
>>@@ -30,6 +30,7 @@
>>
>> #include <linux/config.h>
>> #include <linux/kprobes.h>
>>+#include <linux/hash.h>
>> #include <linux/ptrace.h>
>> #include <linux/preempt.h>
>> #include <asm/cacheflush.h>
>>@@ -38,8 +39,12 @@
>>
>> void jprobe_return_end(void);
>>
>>+static struct uprobe_page *uprobe_page;
>>+static struct hlist_head uprobe_page_table[KPROBE_TABLE_SIZE];
>> DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
>> DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>+DEFINE_PER_CPU(struct uprobe *, current_uprobe) = NULL;
>>+DEFINE_PER_CPU(unsigned long, singlestep_addr);
>>
>> /* insert a jmp code */
>> static inline void set_jmp_op(void *from, void *to)
>>@@ -125,6 +130,23 @@ void __kprobes arch_disarm_kprobe(struct
>> 			   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
>> }
>>
>>+void __kprobes arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address)
>>+{
>>+	*address = p->opcode;
>>+}
>>+
>>+void __kprobes arch_arm_uprobe(unsigned long *address)
>>+{
>>+	*(kprobe_opcode_t *)address = BREAKPOINT_INSTRUCTION;
>>+}
>>+
>>+void __kprobes arch_copy_uprobe(struct kprobe *p, unsigned long *address)
>>+{
>>+	memcpy(p->ainsn.insn, (kprobe_opcode_t *)address,
>>+				MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
>>+	p->opcode = *(kprobe_opcode_t *)address;
>>+}
>>+
>> static inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)
>> {
>> 	kcb->prev_kprobe.kp = kprobe_running();
>>@@ -151,15 +173,326 @@ static inline void set_current_kprobe(st
>> 		kcb->kprobe_saved_eflags &= ~IF_MASK;
>> }
>>
>>+struct uprobe_page __kprobes *get_upage_current(struct task_struct *tsk)
>>+{
>>+	struct hlist_head *head;
>>+	struct hlist_node *node;
>>+	struct uprobe_page *upage;
>>+
>>+	head = &uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)];
>>+	hlist_for_each_entry(upage, node, head, hlist) {
>>+		if (upage->tsk == tsk)
>>+			return upage;
>>+        }
>>+	return NULL;
>>+}
>>+
>>+struct uprobe_page __kprobes *get_upage_free(struct task_struct *tsk)
>>+{
>>+	int cpu;
>>+
>>+	for_each_cpu(cpu) {
>>+		struct uprobe_page *upage;
>>+		upage = per_cpu_ptr(uprobe_page, cpu);
>>+		if (upage->status & UPROBE_PAGE_FREE)
>>+			return upage;
>>+	}
>>+	return NULL;
>>+}
>>+
>>+/**
>>+ * This routines get the pte of the page containing the specified address.
>>+ */
>>+static pte_t  __kprobes *get_uprobe_pte(unsigned long address)
>>+{
>>+	pgd_t *pgd;
>>+	pud_t *pud;
>>+	pmd_t *pmd;
>>+	pte_t *pte = NULL;
>>+
>>+	pgd = pgd_offset(current->mm, address);
>>+	if (!pgd)
>>+		goto out;
>>+
>>+	pud = pud_offset(pgd, address);
>>+	if (!pud)
>>+		goto out;
>>+
>>+	pmd = pmd_offset(pud, address);
>>+	if (!pmd)
>>+		goto out;
>>+
>>+	pte = pte_alloc_map(current->mm, pmd, address);
>>+
>>+out:
>>+	return pte;
>>+}
>>+
>>+/**
>>+ *  This routine check for space in the current process's stack address space.
>>+ *  If enough address space is found, it just maps a new page and copies the
>>+ *  new instruction on that page for single stepping out-of-line.
>>+ */
>>+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
>>+			struct pt_regs *regs, struct vm_area_struct *vma)
>>+{
>>+	unsigned long addr, *vaddr, stack_addr = regs->esp;
>>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>+	struct uprobe_page *upage;
>>+	struct page *page;
>>+	pte_t *pte;
>>+
>>+
>>+	if (vma->vm_flags & VM_GROWSDOWN) {
>>+		if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
>>+			return -ENOMEM;
>>+
>>+		addr = vma->vm_start;
>>+	} else if (vma->vm_flags & VM_GROWSUP) {
>>+		if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
>>+			return -ENOMEM;
>>+
>>+		addr = vma->vm_end - size;
>>+	} else
>>+		return -EFAULT;
>>+
The multi-thread case is not resolved here. One of typical multi-thread model is that the all threads share the same vma and every thread has 8-k stack. If 2 threads trigger uprobe (although might be not the same uprobe) at the same time, one thread might erase single-step instruction of another.



>>+	preempt_enable_no_resched();
>>+
>>+	pte = get_uprobe_pte(addr);
>>+	preempt_disable();
>>+	if (!pte)
>>+		return -EFAULT;
>>+
>>+	upage = get_upage_free(current);
>>+	upage->status &= ~UPROBE_PAGE_FREE;
>>+	upage->tsk = current;
>>+	INIT_HLIST_NODE(&upage->hlist);
>>+	hlist_add_head(&upage->hlist,
>>+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
>>+
>>+	upage->orig_pte = pte;
>>+	upage->orig_pte_val =  pte_val(*pte);
>>+	set_pte(pte, (*(upage->alias_pte)));
>>+
>>+	page = pte_page(*pte);
>>+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
>>+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
>>+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
>>+	kunmap_atomic(vaddr, KM_USER1);
>>+	regs->eip = addr;
So the temp page, upage->alias_addr, replaces the original one on the stack. If the replaced instruction is to operate stack, such like "push eax", the result might be on the new page. After the single step, the pte is restored to the original page which doesn't have the value of eax.



>>+
>>+	return 0;
>>+}
>>+
>>+/**
>>+ * This routine expands the stack beyond the present process address space
>>+ * and copies the instruction to that location, so that processor can
>>+ * single step out-of-line.
>>+ */
>>+static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
>>+			struct pt_regs *regs, struct vm_area_struct *vma)
It has the same issues like function copy_insn_on_new_page.


>>+{
>>+	unsigned long addr, *vaddr, vm_addr;
>>+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>+	struct vm_area_struct *new_vma;
>>+	struct uprobe_page *upage;
>>+	struct mm_struct *mm = current->mm;
>>+	struct page *page;
>>+	pte_t *pte;
>>+
>>+
>>+	if (vma->vm_flags & VM_GROWSDOWN)
>>+		vm_addr = vma->vm_start - size;
>>+	else if (vma->vm_flags & VM_GROWSUP)
>>+		vm_addr = vma->vm_end + size;
>>+	else
>>+		return -EFAULT;
>>+
>>+	preempt_enable_no_resched();
>>+
>>+	/* TODO: do we need to expand stack if extend_vma fails? */
>>+	new_vma = find_extend_vma(mm, vm_addr);
>>+	preempt_disable();
>>+	if (!new_vma)
>>+		return -ENOMEM;
>>+
>>+	/*
>>+	 * TODO: Expanding stack for every probe is not a good idea, stack must
>>+	 * either be shrunk to its original size after single stepping or the
>>+	 * expanded stack should be kept track of, for the probed application,
>>+	 * so it can be reused to single step out-of-line
>>+	 */
>>+	if (new_vma->vm_flags & VM_GROWSDOWN)
>>+		addr = new_vma->vm_start;
>>+	else
>>+		addr = new_vma->vm_end - size;
>>+
>>+	preempt_enable_no_resched();
>>+	pte = get_uprobe_pte(addr);
>>+	preempt_disable();
>>+	if (!pte)
>>+		return -EFAULT;
>>+
>>+	upage = get_upage_free(current);
>>+	upage->status &= ~UPROBE_PAGE_FREE;
>>+	upage->tsk = current;
>>+	INIT_HLIST_NODE(&upage->hlist);
>>+	hlist_add_head(&upage->hlist,
>>+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
>>+	upage->orig_pte = pte;
>>+	upage->orig_pte_val =  pte_val(*pte);
>>+	set_pte(pte, (*(upage->alias_pte)));
>>+
>>+	page = pte_page(*pte);
>>+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
>>+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
>>+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
>>+	kunmap_atomic(vaddr, KM_USER1);
>>+	regs->eip = addr;
>>+
>>+	return  0;
>>+}
>>+
>>+/**
>>+ * This routine checks for stack free space below the stack pointer and
>>+ * then copies the instructions at that location so that the processor can
>>+ * single step out-of-line. If there is no enough stack space or if
>>+ * copy_to_user fails or if the vma is invalid, it returns error.
>>+ */
>>+static int __kprobes copy_insn_onstack(struct uprobe *uprobe,
>>+			struct pt_regs *regs, unsigned long flags)
>>+{
>>+	unsigned long page_addr, stack_addr = regs->esp;
>>+	int  size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
>>+	unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
>>+
>>+	if (flags & VM_GROWSDOWN) {
>>+		page_addr = stack_addr & PAGE_MASK;
>>+
>>+		if (((stack_addr - sizeof(long long))) < (page_addr + size))
>>+			return -ENOMEM;
>>+
>>+		if (__copy_to_user_inatomic((unsigned long *)page_addr, source,
>>+									size))
>>+			return -EFAULT;
>>+
>>+		regs->eip = page_addr;
>>+	} else if (flags & VM_GROWSUP) {
>>+		page_addr = stack_addr & PAGE_MASK;
>>+
>>+		if (page_addr == stack_addr)
>>+			return -ENOMEM;
>>+		else
>>+			page_addr += PAGE_SIZE;
>>+
>>+		if ((page_addr - size) < (stack_addr + sizeof(long long)))
>>+			return -ENOMEM;
>>+
>>+		if (__copy_to_user_inatomic((unsigned long *)(page_addr - size),
>>+								source, size))
>>+			return -EFAULT;
>>+
>>+		regs->eip = page_addr - size;
>>+	} else
>>+		return -EINVAL;
>>+
>>+	return 0;
>>+}
>>+
>>+/**
>>+ * This routines get the page containing the probe, maps it and
>>+ * replaced the instruction at the probed address with specified
>>+ * opcode.
>>+ */
>>+void __kprobes replace_original_insn(struct uprobe *uprobe,
>>+				struct pt_regs *regs, kprobe_opcode_t opcode)
>>+{
>>+	kprobe_opcode_t *addr;
>>+	struct page *page;
>>+
>>+	page = find_get_page(uprobe->inode->i_mapping,
>>+					uprobe->offset >> PAGE_CACHE_SHIFT);
>>+	lock_page(page);
>>+
>>+	addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER0);
>>+	addr = (kprobe_opcode_t *)((unsigned long)addr +
>>+				 (unsigned long)(uprobe->offset & ~PAGE_MASK));
>>+	*addr = opcode;
>>+	/*TODO: flush vma ? */
>>+	kunmap_atomic(addr, KM_USER0);
>>+
>>+	unlock_page(page);
>>+
>>+	page_cache_release(page);
>>+	regs->eip = (unsigned long)uprobe->kp.addr;
>>+}
>>+
>>+/**
>>+ * This routine provides the functionality of single stepping out of line.
>>+ * If single stepping out-of-line cannot be achieved, it replaces with
>>+ * the original instruction allowing it to single step inline.
>>+ */
>>+static inline int uprobe_single_step(struct kprobe *p, struct pt_regs *regs)
>>+{
>>+	unsigned long stack_addr = regs->esp, flags;
>>+	struct vm_area_struct *vma = NULL;
>>+	struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
>>+	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
>>+	int err = 0;
>>+
>>+	down_read(&current->mm->mmap_sem);
>>+
>>+	vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
>>+	if (!vma) {
>>+		/* TODO: Need better error reporting? */
>>+		printk("No vma found\n");
>>+		up_read(&current->mm->mmap_sem);
>>+		return -ENOENT;
>>+	}
>>+	flags = vma->vm_flags;
>>+	up_read(&current->mm->mmap_sem);
>>+
>>+	kcb->kprobe_status |= UPROBE_SS_STACK;
>>+	err = copy_insn_onstack(uprobe, regs, flags);
>>+
>>+	down_write(&current->mm->mmap_sem);
>>+
>>+	if (err) {
>>+		kcb->kprobe_status |= UPROBE_SS_NEW_STACK;
>>+		err = copy_insn_on_new_page(uprobe, regs, vma);
>>+	}
>>+	if (err) {
>>+		kcb->kprobe_status |= UPROBE_SS_EXPSTACK;
>>+		err = copy_insn_onexpstack(uprobe, regs, vma);
>>+	}
>>+
>>+	up_write(&current->mm->mmap_sem);
>>+
>>+	if (err) {
>>+		kcb->kprobe_status |= UPROBE_SS_INLINE;
>>+		replace_original_insn(uprobe, regs, uprobe->kp.opcode);
>>+	}
>>+
>>+	 __get_cpu_var(singlestep_addr) = regs->eip;
>>+
>>+
>>+	return 0;
>>+}
>>+
>> static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
>> {
>> 	regs->eflags |= TF_MASK;
>> 	regs->eflags &= ~IF_MASK;
>> 	/*single step inline if the instruction is an int3*/
>>+
>> 	if (p->opcode == BREAKPOINT_INSTRUCTION)
>> 		regs->eip = (unsigned long)p->addr;
>>-	else
>>-		regs->eip = (unsigned long)&p->ainsn.insn;
>>+	else {
>>+		if (!kernel_text_address((unsigned long)p->addr))
>>+			uprobe_single_step(p, regs);
>>+		else
>>+			regs->eip = (unsigned long)&p->ainsn.insn;
>>+	}
>> }
>>
>> /* Called with kretprobe_lock held */
>>@@ -194,6 +527,7 @@ static int __kprobes kprobe_handler(stru
>> 	kprobe_opcode_t *addr = NULL;
>> 	unsigned long *lp;
>> 	struct kprobe_ctlblk *kcb;
>>+	unsigned seg = regs->xcs & 0xffff;
>> #ifdef CONFIG_PREEMPT
>> 	unsigned pre_preempt_count = preempt_count();
>> #endif /* CONFIG_PREEMPT */
>>@@ -208,14 +542,21 @@ static int __kprobes kprobe_handler(stru
>> 	/* Check if the application is using LDT entry for its code segment and
>> 	 * calculate the address by reading the base address from the LDT entry.
>> 	 */
>>-	if ((regs->xcs & 4) && (current->mm)) {
>>+
>>+	if (regs->eflags & VM_MASK)
>>+		addr = (kprobe_opcode_t *)(((seg << 4) + regs->eip -
>>+			sizeof(kprobe_opcode_t)) & 0xffff);
>>+	else if ((regs->xcs & 4) && (current->mm)) {
>>+		local_irq_enable();
>>+		down(&current->mm->context.sem);
>> 		lp = (unsigned long *) ((unsigned long)((regs->xcs >> 3) * 8)
>> 					+ (char *) current->mm->context.ldt);
>> 		addr = (kprobe_opcode_t *) (get_desc_base(lp) + regs->eip -
>> 						sizeof(kprobe_opcode_t));
>>-	} else {
>>+		up(&current->mm->context.sem);
>>+		local_irq_disable();
>>+	} else
>> 		addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
>>-	}
>> 	/* Check we're not actually recursing */
>> 	if (kprobe_running()) {
>> 		p = get_kprobe(addr);
>>@@ -235,7 +576,6 @@ static int __kprobes kprobe_handler(stru
>> 			save_previous_kprobe(kcb);
>> 			set_current_kprobe(p, regs, kcb);
>> 			kprobes_inc_nmissed_count(p);
>>-			prepare_singlestep(p, regs);
>> 			kcb->kprobe_status = KPROBE_REENTER;
>> 			return 1;
>> 		} else {
>>@@ -307,8 +647,8 @@ static int __kprobes kprobe_handler(stru
>> 	}
>>
>> ss_probe:
>>-	prepare_singlestep(p, regs);
>> 	kcb->kprobe_status = KPROBE_HIT_SS;
>>+	prepare_singlestep(p, regs);
>> 	return 1;
>>
>> no_kprobe:
>>@@ -498,6 +838,33 @@ no_change:
>> 	return;
>> }
>>
>>+static void __kprobes resume_execution_user(struct uprobe *uprobe,
>>+				struct pt_regs *regs, struct kprobe_ctlblk *kcb)
>>+{
>>+	unsigned long delta;
>>+	struct uprobe_page *upage;
>>+
>>+	/*
>>+	 * TODO :need to fixup special instructions as done with kernel probes.
>>+	 */
>>+	delta = regs->eip - __get_cpu_var(singlestep_addr);
>>+	regs->eip = (unsigned long)(uprobe->kp.addr + delta);
>>+
>>+	if ((kcb->kprobe_status & UPROBE_SS_EXPSTACK) ||
>>+			(kcb->kprobe_status & UPROBE_SS_NEW_STACK)) {
>>+		upage = get_upage_current(current);
>>+		set_pte(upage->orig_pte, __pte(upage->orig_pte_val));
>>+		pte_unmap(upage->orig_pte);
>>+
>>+		upage->status = UPROBE_PAGE_FREE;
>>+		hlist_del(&upage->hlist);
>>+
>>+	} else if (kcb->kprobe_status & UPROBE_SS_INLINE)
>>+		replace_original_insn(uprobe, regs,
>>+				(kprobe_opcode_t)BREAKPOINT_INSTRUCTION);
>>+	regs->eflags &= ~TF_MASK;
>>+}
>>+
>> /*
>>  * Interrupts are disabled on entry as trap1 is an interrupt gate and they
>>  * remain disabled thoroughout this function.
>>@@ -510,16 +877,19 @@ static inline int post_kprobe_handler(st
>> 	if (!cur)
>> 		return 0;
>>
>>-	if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) {
>>-		kcb->kprobe_status = KPROBE_HIT_SSDONE;
>>+	if (!(kcb->kprobe_status & KPROBE_REENTER) && cur->post_handler) {
>>+		kcb->kprobe_status |= KPROBE_HIT_SSDONE;
>> 		cur->post_handler(cur, regs, 0);
>> 	}
>>
>>-	resume_execution(cur, regs, kcb);
>>+	if (!kernel_text_address((unsigned long)cur->addr))
>>+		resume_execution_user(__get_cpu_var(current_uprobe), regs, kcb);
>>+	else
>>+		resume_execution(cur, regs, kcb);
>> 	regs->eflags |= kcb->kprobe_saved_eflags;
>>
>> 	/*Restore back the original saved kprobes variables and continue. */
>>-	if (kcb->kprobe_status == KPROBE_REENTER) {
>>+	if (kcb->kprobe_status & KPROBE_REENTER) {
>> 		restore_previous_kprobe(kcb);
>> 		goto out;
>> 	}
>>@@ -547,7 +917,13 @@ static inline int kprobe_fault_handler(s
>> 		return 1;
>>
>> 	if (kcb->kprobe_status & KPROBE_HIT_SS) {
>>-		resume_execution(cur, regs, kcb);
>>+		if (!kernel_text_address((unsigned long)cur->addr)) {
>>+			struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
>>+			/* TODO: Proper handling of all instruction */
>>+			replace_original_insn(uprobe, regs, uprobe->kp.opcode);
>>+			regs->eflags &= ~TF_MASK;
>>+		} else
>>+			resume_execution(cur, regs, kcb);
>> 		regs->eflags |= kcb->kprobe_old_eflags;
>>
>> 		reset_current_kprobe();
>>@@ -654,7 +1030,67 @@ int __kprobes longjmp_break_handler(stru
>> 	return 0;
>> }
>>
>>+static void free_alias(void)
>>+{
>>+	int cpu;
>>+
>>+	for_each_cpu(cpu) {
>>+		struct uprobe_page *upage;
>>+		upage = per_cpu_ptr(uprobe_page, cpu);
>>+
>>+		if (upage->alias_addr) {
>>+			set_pte(upage->alias_pte, __pte(upage->alias_pte_val));
>>+			kfree(upage->alias_addr);
>>+		}
>>+		upage->alias_pte = 0;
>>+	}
>>+	free_percpu(uprobe_page);
>>+	return;
>>+}
>>+
>>+static int alloc_alias(void)
>>+{
>>+	int cpu;
>>+
>>+	uprobe_page = __alloc_percpu(sizeof(struct uprobe_page));
[YM] Do here codes try to resolve the problem of task switch at single-step? If so, the per cpu data also might be used up although get_upage_free will go through all uprobe_page of all cpus. I suggest to allocate a series of uprobe_page, and allocate again when they are used up.




>>+
>>+	for_each_cpu(cpu) {
>>+		struct uprobe_page *upage;
>>+		upage = per_cpu_ptr(uprobe_page, cpu);
>>+		upage->alias_addr = kmalloc(PAGE_SIZE, GFP_USER);
[YM] Does kmalloc(PAGE_SIZE...) imply the result is aligned to page? How about using alloc_page?


>>+		if (!upage->alias_addr) {
>>+			free_alias();
>>+			return -ENOMEM;
>>+		}
>>+		upage->alias_pte = lookup_address(
>>+					(unsigned long)upage->alias_addr);
>>+		upage->alias_pte_val = pte_val(*upage->alias_pte);
>>+		if (upage->alias_pte) {
[YM] If kmalloc returns a non-NULL address, upage->alias_pte is not equal to NULL. So delete above checking?


>>+			upage->status = UPROBE_PAGE_FREE;
>>+			set_pte(upage->alias_pte,
>>+						pte_mkdirty(*upage->alias_pte));
>>+			set_pte(upage->alias_pte,
>>+						pte_mkexec(*upage->alias_pte));
>>+			set_pte(upage->alias_pte,
>>+						 pte_mkwrite(*upage->alias_pte));
>>+			set_pte(upage->alias_pte,
>>+						pte_mkyoung(*upage->alias_pte));
>>+		}
>>+	}
>>+	return 0;
>>+}
>>+
>> int __init arch_init_kprobes(void)
>> {
>>+	int ret = 0;
>>+	/*
>>+	 * user space probes requires a page to copy the original instruction
>>+	 * so that it can single step if there is no free stack space, allocate
>>+	 * per cpu page.
>>+	 */
>>+
>>+	if ((ret = alloc_alias()))
>>+		return ret;
>>+
>> 	return 0;
>> }
>>diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/asm-i386/kprobes.h
>>--- linux-2.6.16-rc1-mm5/include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
>>+++ linux-2.6.16-rc1-mm5-prasanna/include/asm-i386/kprobes.h	2006-02-08 19:26:10.000000000 +0530
>>@@ -42,6 +42,7 @@ typedef u8 kprobe_opcode_t;
>> #define JPROBE_ENTRY(pentry)	(kprobe_opcode_t *)pentry
>> #define ARCH_SUPPORTS_KRETPROBES
>> #define arch_remove_kprobe(p)	do {} while (0)
>>+#define UPROBE_PAGE_FREE 0x00000001
>>
>> void kretprobe_trampoline(void);
>>
>>@@ -74,6 +75,18 @@ struct kprobe_ctlblk {
>> 	struct prev_kprobe prev_kprobe;
>> };
>>
>>+/* per cpu uprobe page structure */
>>+struct uprobe_page {
>>+	struct hlist_node hlist;
>>+	pte_t *alias_pte;
>>+	pte_t *orig_pte;
>>+	unsigned long orig_pte_val;
>>+	unsigned long alias_pte_val;
[YM] I think the patch doesn't support CONFIG_X86_PAE, because if CONFIG_X86_PAE=y, pte_t becomes 64 bits.
How about changing above 2 members' type to pte_t directly?



>>+	void *alias_addr;
>>+	struct task_struct *tsk;
>>+	unsigned long status;
>>+};
>>+
>> /* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
>>  * if necessary, before executing the original int3/1 (trap) handler.
>>  */
>>diff -puN include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/linux/kprobes.h
>>--- linux-2.6.16-rc1-mm5/include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
>>+++ linux-2.6.16-rc1-mm5-prasanna/include/linux/kprobes.h	2006-02-08 19:26:10.000000000 +0530
>>@@ -45,11 +45,18 @@
>> #ifdef CONFIG_KPROBES
>> #include <asm/kprobes.h>
>>
>>+#define KPROBE_HASH_BITS 6
>>+#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
>>+
>> /* kprobe_status settings */
>> #define KPROBE_HIT_ACTIVE	0x00000001
>> #define KPROBE_HIT_SS		0x00000002
>> #define KPROBE_REENTER		0x00000004
>> #define KPROBE_HIT_SSDONE	0x00000008
>>+#define UPROBE_SS_STACK		0x00000010
>>+#define UPROBE_SS_EXPSTACK	0x00000020
>>+#define UPROBE_SS_INLINE	0x00000040
>>+#define UPROBE_SS_NEW_STACK	0x00000080
>>
>> /* Attach to insert probes on any functions which should be ignored*/
>> #define __kprobes	__attribute__((__section__(".kprobes.text")))
>>diff -puN kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line kernel/kprobes.c
>>--- linux-2.6.16-rc1-mm5/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:10.000000000 +0530
>>+++ linux-2.6.16-rc1-mm5-prasanna/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
>>@@ -42,9 +42,6 @@
>> #include <asm/errno.h>
>> #include <asm/kdebug.h>
>>
>>-#define KPROBE_HASH_BITS 6
>>-#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
>>-
>> static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
>> static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
>> static struct list_head uprobe_module_list;
>>
>>_
>>--
>>Prasanna S Panchamukhi
>>Linux Technology Center
>>India Software Labs, IBM Bangalore
>>Email: prasanna@in.ibm.com
>>Ph: 91-80-51776329

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [3/3] Userspace probes prototype-take2
  2006-02-08 14:12 ` [2/3] " Prasanna S Panchamukhi
@ 2006-02-08 14:13   ` Prasanna S Panchamukhi
  0 siblings, 0 replies; 9+ messages in thread
From: Prasanna S Panchamukhi @ 2006-02-08 14:13 UTC (permalink / raw)
  To: systemtap


This patch handles the executing the registered callback
functions when probes is hit.

	Each userspace probe is uniquely identified by the
combination of inode and offset, hence during registeration the inode
and offset combination is added to kprobes hash table. Initially when
breakpoint instruction is hit, the kprobes hash table is looked up
for matching inode and offset. The pre_handlers are called in sequence
if multiple probes are registered. The original instruction is single
stepped out-of-line similar to kernel probes. In kernel space probes,
single stepping out-of-line is achieved by copying the instruction on
to some location within kernel address space and then single step
from that location. But for userspace probes, instruction copied
into kernel address space cannot be single stepped, hence the
instruction should be copied to user address space. The solution is
to find free space in the current process address space and then copy
the original instruction and single step that instruction.

User processes use stack space to store local variables, agruments and
return values. Normally the stack space either below or above the
stack pointer indicates the free stack space. If the stack grows
downwards, the stack space below the stack pointer indicates the
unused stack free space and if the stack grows upwards, the stack
space above the stack pointer indicates the unused stack free space.

The instruction to be single stepped can modify the stack space, hence
before using the unused stack free space, sufficient stack space
should be left. The instruction is copied to the bottom of the page
and check is made such that the copied instruction does not cross the
page boundry. The copied instruction is then single stepped.  
Several architectures does not allow the instruction to be executed
from the stack location, since no-exec bit is set for the stack pages.
In those architectures, the page table entry corresponding to the
stack page is identified and the no-exec bit is unset making the
instruction on that stack page to be executed.

There are situations where even the unused free stack space is not
enough for the user instruction to be copied and single stepped. In
such situations, the virtual memory area(vma) can be expanded beyond
the current stack vma. This expaneded stack can be used to copy the
original instruction and single step out-of-line.

Even if the vma cannot be extended then the instruction much be
executed inline, by replacing the breakpoint instruction with original
instruction.

TODO list
--------
1. This patch is not stable yet, should work for most conditions.

2. This patch works only with PREEMPT config option disabled, to work
in PREEMPT enabled condition handlers must be re-written and must
be seperated out from kernel probes allowing preemption.

3. Insert probes on copy-on-write pages. Tracks all COW pages for the
page containing the specified probe point and inserts/removes all the
probe points for that page.

4. Optimize the insertion of probes through readpage hooks. Identify
all the probes to be inserted on the read page and insert them at
once.

5. Resume exectution should handle setting of proper eip and eflags
for special instructions similar to kernel probes.

6. Single stepping out-of-line expands the stack if there is no
enough stack space to copy the original instruction. Expansion of
stack should be shrinked back to the original size after single
stepping or the expanded stack should be reused for single stepping
out-of-line for other probes.

7. A wrapper routines to calculate the offset from the probed file
beginning. In case of dynamic shared library, the offset is
calculated by substracting the address of the probe point from the
beginning of the file mapped address.

8. Handing of page faults while inthe kprobes_handler() and while
single stepping.

9. Accessing user space pages not present in memory, from the
registered callback routines.

Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>


 arch/i386/kernel/kprobes.c |  460 +++++++++++++++++++++++++++++++++++++++++++--
 include/asm-i386/kprobes.h |   13 +
 include/linux/kprobes.h    |    7 
 kernel/kprobes.c           |    3 
 4 files changed, 468 insertions(+), 15 deletions(-)

diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line arch/i386/kernel/kprobes.c
--- linux-2.6.16-rc1-mm5/arch/i386/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
+++ linux-2.6.16-rc1-mm5-prasanna/arch/i386/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
@@ -30,6 +30,7 @@
 
 #include <linux/config.h>
 #include <linux/kprobes.h>
+#include <linux/hash.h>
 #include <linux/ptrace.h>
 #include <linux/preempt.h>
 #include <asm/cacheflush.h>
@@ -38,8 +39,12 @@
 
 void jprobe_return_end(void);
 
+static struct uprobe_page *uprobe_page;
+static struct hlist_head uprobe_page_table[KPROBE_TABLE_SIZE];
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
+DEFINE_PER_CPU(struct uprobe *, current_uprobe) = NULL;
+DEFINE_PER_CPU(unsigned long, singlestep_addr);
 
 /* insert a jmp code */
 static inline void set_jmp_op(void *from, void *to)
@@ -125,6 +130,23 @@ void __kprobes arch_disarm_kprobe(struct
 			   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
 }
 
+void __kprobes arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address)
+{
+	*address = p->opcode;
+}
+
+void __kprobes arch_arm_uprobe(unsigned long *address)
+{
+	*(kprobe_opcode_t *)address = BREAKPOINT_INSTRUCTION;
+}
+
+void __kprobes arch_copy_uprobe(struct kprobe *p, unsigned long *address)
+{
+	memcpy(p->ainsn.insn, (kprobe_opcode_t *)address,
+				MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+	p->opcode = *(kprobe_opcode_t *)address;
+}
+
 static inline void save_previous_kprobe(struct kprobe_ctlblk *kcb)
 {
 	kcb->prev_kprobe.kp = kprobe_running();
@@ -151,15 +173,326 @@ static inline void set_current_kprobe(st
 		kcb->kprobe_saved_eflags &= ~IF_MASK;
 }
 
+struct uprobe_page __kprobes *get_upage_current(struct task_struct *tsk)
+{
+	struct hlist_head *head;
+	struct hlist_node *node;
+	struct uprobe_page *upage;
+
+	head = &uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)];
+	hlist_for_each_entry(upage, node, head, hlist) {
+		if (upage->tsk == tsk)
+			return upage;
+        }
+	return NULL;
+}
+
+struct uprobe_page __kprobes *get_upage_free(struct task_struct *tsk)
+{
+	int cpu;
+
+	for_each_cpu(cpu) {
+		struct uprobe_page *upage;
+		upage = per_cpu_ptr(uprobe_page, cpu);
+		if (upage->status & UPROBE_PAGE_FREE)
+			return upage;
+	}
+	return NULL;
+}
+
+/**
+ * This routines get the pte of the page containing the specified address.
+ */
+static pte_t  __kprobes *get_uprobe_pte(unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte = NULL;
+
+	pgd = pgd_offset(current->mm, address);
+	if (!pgd)
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud)
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd)
+		goto out;
+
+	pte = pte_alloc_map(current->mm, pmd, address);
+
+out:
+	return pte;
+}
+
+/**
+ *  This routine check for space in the current process's stack address space.
+ *  If enough address space is found, it just maps a new page and copies the
+ *  new instruction on that page for single stepping out-of-line.
+ */
+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
+			struct pt_regs *regs, struct vm_area_struct *vma)
+{
+	unsigned long addr, *vaddr, stack_addr = regs->esp;
+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
+	struct uprobe_page *upage;
+	struct page *page;
+	pte_t *pte;
+
+
+	if (vma->vm_flags & VM_GROWSDOWN) {
+		if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
+			return -ENOMEM;
+
+		addr = vma->vm_start;
+	} else if (vma->vm_flags & VM_GROWSUP) {
+		if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
+			return -ENOMEM;
+
+		addr = vma->vm_end - size;
+	} else
+		return -EFAULT;
+
+	preempt_enable_no_resched();
+
+	pte = get_uprobe_pte(addr);
+	preempt_disable();
+	if (!pte)
+		return -EFAULT;
+
+	upage = get_upage_free(current);
+	upage->status &= ~UPROBE_PAGE_FREE;
+	upage->tsk = current;
+	INIT_HLIST_NODE(&upage->hlist);
+	hlist_add_head(&upage->hlist,
+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
+
+	upage->orig_pte = pte;
+	upage->orig_pte_val =  pte_val(*pte);
+	set_pte(pte, (*(upage->alias_pte)));
+
+	page = pte_page(*pte);
+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
+	kunmap_atomic(vaddr, KM_USER1);
+	regs->eip = addr;
+
+	return 0;
+}
+
+/**
+ * This routine expands the stack beyond the present process address space
+ * and copies the instruction to that location, so that processor can
+ * single step out-of-line.
+ */
+static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
+			struct pt_regs *regs, struct vm_area_struct *vma)
+{
+	unsigned long addr, *vaddr, vm_addr;
+	int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
+	struct vm_area_struct *new_vma;
+	struct uprobe_page *upage;
+	struct mm_struct *mm = current->mm;
+	struct page *page;
+	pte_t *pte;
+
+
+	if (vma->vm_flags & VM_GROWSDOWN)
+		vm_addr = vma->vm_start - size;
+	else if (vma->vm_flags & VM_GROWSUP)
+		vm_addr = vma->vm_end + size;
+	else
+		return -EFAULT;
+
+	preempt_enable_no_resched();
+
+	/* TODO: do we need to expand stack if extend_vma fails? */
+	new_vma = find_extend_vma(mm, vm_addr);
+	preempt_disable();
+	if (!new_vma)
+		return -ENOMEM;
+
+	/*
+	 * TODO: Expanding stack for every probe is not a good idea, stack must
+	 * either be shrunk to its original size after single stepping or the
+	 * expanded stack should be kept track of, for the probed application,
+	 * so it can be reused to single step out-of-line
+	 */
+	if (new_vma->vm_flags & VM_GROWSDOWN)
+		addr = new_vma->vm_start;
+	else
+		addr = new_vma->vm_end - size;
+
+	preempt_enable_no_resched();
+	pte = get_uprobe_pte(addr);
+	preempt_disable();
+	if (!pte)
+		return -EFAULT;
+
+	upage = get_upage_free(current);
+	upage->status &= ~UPROBE_PAGE_FREE;
+	upage->tsk = current;
+	INIT_HLIST_NODE(&upage->hlist);
+	hlist_add_head(&upage->hlist,
+		&uprobe_page_table[hash_ptr(current, KPROBE_HASH_BITS)]);
+	upage->orig_pte = pte;
+	upage->orig_pte_val =  pte_val(*pte);
+	set_pte(pte, (*(upage->alias_pte)));
+
+	page = pte_page(*pte);
+	vaddr = (unsigned long *)kmap_atomic(page, KM_USER1);
+	vaddr = (unsigned long *)((unsigned long)vaddr + (addr & ~PAGE_MASK));
+	memcpy(vaddr, (unsigned long *)uprobe->kp.ainsn.insn, size);
+	kunmap_atomic(vaddr, KM_USER1);
+	regs->eip = addr;
+
+	return  0;
+}
+
+/**
+ * This routine checks for stack free space below the stack pointer and
+ * then copies the instructions at that location so that the processor can
+ * single step out-of-line. If there is no enough stack space or if
+ * copy_to_user fails or if the vma is invalid, it returns error.
+ */
+static int __kprobes copy_insn_onstack(struct uprobe *uprobe,
+			struct pt_regs *regs, unsigned long flags)
+{
+	unsigned long page_addr, stack_addr = regs->esp;
+	int  size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
+	unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
+
+	if (flags & VM_GROWSDOWN) {
+		page_addr = stack_addr & PAGE_MASK;
+
+		if (((stack_addr - sizeof(long long))) < (page_addr + size))
+			return -ENOMEM;
+
+		if (__copy_to_user_inatomic((unsigned long *)page_addr, source,
+									size))
+			return -EFAULT;
+
+		regs->eip = page_addr;
+	} else if (flags & VM_GROWSUP) {
+		page_addr = stack_addr & PAGE_MASK;
+
+		if (page_addr == stack_addr)
+			return -ENOMEM;
+		else
+			page_addr += PAGE_SIZE;
+
+		if ((page_addr - size) < (stack_addr + sizeof(long long)))
+			return -ENOMEM;
+
+		if (__copy_to_user_inatomic((unsigned long *)(page_addr - size),
+								source, size))
+			return -EFAULT;
+
+		regs->eip = page_addr - size;
+	} else
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * This routines get the page containing the probe, maps it and
+ * replaced the instruction at the probed address with specified
+ * opcode.
+ */
+void __kprobes replace_original_insn(struct uprobe *uprobe,
+				struct pt_regs *regs, kprobe_opcode_t opcode)
+{
+	kprobe_opcode_t *addr;
+	struct page *page;
+
+	page = find_get_page(uprobe->inode->i_mapping,
+					uprobe->offset >> PAGE_CACHE_SHIFT);
+	lock_page(page);
+
+	addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER0);
+	addr = (kprobe_opcode_t *)((unsigned long)addr +
+				 (unsigned long)(uprobe->offset & ~PAGE_MASK));
+	*addr = opcode;
+	/*TODO: flush vma ? */
+	kunmap_atomic(addr, KM_USER0);
+
+	unlock_page(page);
+
+	page_cache_release(page);
+	regs->eip = (unsigned long)uprobe->kp.addr;
+}
+
+/**
+ * This routine provides the functionality of single stepping out of line.
+ * If single stepping out-of-line cannot be achieved, it replaces with
+ * the original instruction allowing it to single step inline.
+ */
+static inline int uprobe_single_step(struct kprobe *p, struct pt_regs *regs)
+{
+	unsigned long stack_addr = regs->esp, flags;
+	struct vm_area_struct *vma = NULL;
+	struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
+	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+	int err = 0;
+
+	down_read(&current->mm->mmap_sem);
+
+	vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
+	if (!vma) {
+		/* TODO: Need better error reporting? */
+		printk("No vma found\n");
+		up_read(&current->mm->mmap_sem);
+		return -ENOENT;
+	}
+	flags = vma->vm_flags;
+	up_read(&current->mm->mmap_sem);
+
+	kcb->kprobe_status |= UPROBE_SS_STACK;
+	err = copy_insn_onstack(uprobe, regs, flags);
+
+	down_write(&current->mm->mmap_sem);
+
+	if (err) {
+		kcb->kprobe_status |= UPROBE_SS_NEW_STACK;
+		err = copy_insn_on_new_page(uprobe, regs, vma);
+	}
+	if (err) {
+		kcb->kprobe_status |= UPROBE_SS_EXPSTACK;
+		err = copy_insn_onexpstack(uprobe, regs, vma);
+	}
+
+	up_write(&current->mm->mmap_sem);
+
+	if (err) {
+		kcb->kprobe_status |= UPROBE_SS_INLINE;
+		replace_original_insn(uprobe, regs, uprobe->kp.opcode);
+	}
+
+	 __get_cpu_var(singlestep_addr) = regs->eip;
+
+
+	return 0;
+}
+
 static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
 {
 	regs->eflags |= TF_MASK;
 	regs->eflags &= ~IF_MASK;
 	/*single step inline if the instruction is an int3*/
+
 	if (p->opcode == BREAKPOINT_INSTRUCTION)
 		regs->eip = (unsigned long)p->addr;
-	else
-		regs->eip = (unsigned long)&p->ainsn.insn;
+	else {
+		if (!kernel_text_address((unsigned long)p->addr))
+			uprobe_single_step(p, regs);
+		else
+			regs->eip = (unsigned long)&p->ainsn.insn;
+	}
 }
 
 /* Called with kretprobe_lock held */
@@ -194,6 +527,7 @@ static int __kprobes kprobe_handler(stru
 	kprobe_opcode_t *addr = NULL;
 	unsigned long *lp;
 	struct kprobe_ctlblk *kcb;
+	unsigned seg = regs->xcs & 0xffff;
 #ifdef CONFIG_PREEMPT
 	unsigned pre_preempt_count = preempt_count();
 #endif /* CONFIG_PREEMPT */
@@ -208,14 +542,21 @@ static int __kprobes kprobe_handler(stru
 	/* Check if the application is using LDT entry for its code segment and
 	 * calculate the address by reading the base address from the LDT entry.
 	 */
-	if ((regs->xcs & 4) && (current->mm)) {
+
+	if (regs->eflags & VM_MASK)
+		addr = (kprobe_opcode_t *)(((seg << 4) + regs->eip -
+			sizeof(kprobe_opcode_t)) & 0xffff);
+	else if ((regs->xcs & 4) && (current->mm)) {
+		local_irq_enable();
+		down(&current->mm->context.sem);
 		lp = (unsigned long *) ((unsigned long)((regs->xcs >> 3) * 8)
 					+ (char *) current->mm->context.ldt);
 		addr = (kprobe_opcode_t *) (get_desc_base(lp) + regs->eip -
 						sizeof(kprobe_opcode_t));
-	} else {
+		up(&current->mm->context.sem);
+		local_irq_disable();
+	} else
 		addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
-	}
 	/* Check we're not actually recursing */
 	if (kprobe_running()) {
 		p = get_kprobe(addr);
@@ -235,7 +576,6 @@ static int __kprobes kprobe_handler(stru
 			save_previous_kprobe(kcb);
 			set_current_kprobe(p, regs, kcb);
 			kprobes_inc_nmissed_count(p);
-			prepare_singlestep(p, regs);
 			kcb->kprobe_status = KPROBE_REENTER;
 			return 1;
 		} else {
@@ -307,8 +647,8 @@ static int __kprobes kprobe_handler(stru
 	}
 
 ss_probe:
-	prepare_singlestep(p, regs);
 	kcb->kprobe_status = KPROBE_HIT_SS;
+	prepare_singlestep(p, regs);
 	return 1;
 
 no_kprobe:
@@ -498,6 +838,33 @@ no_change:
 	return;
 }
 
+static void __kprobes resume_execution_user(struct uprobe *uprobe,
+				struct pt_regs *regs, struct kprobe_ctlblk *kcb)
+{
+	unsigned long delta;
+	struct uprobe_page *upage;
+
+	/*
+	 * TODO :need to fixup special instructions as done with kernel probes.
+	 */
+	delta = regs->eip - __get_cpu_var(singlestep_addr);
+	regs->eip = (unsigned long)(uprobe->kp.addr + delta);
+
+	if ((kcb->kprobe_status & UPROBE_SS_EXPSTACK) ||
+			(kcb->kprobe_status & UPROBE_SS_NEW_STACK)) {
+		upage = get_upage_current(current);
+		set_pte(upage->orig_pte, __pte(upage->orig_pte_val));
+		pte_unmap(upage->orig_pte);
+
+		upage->status = UPROBE_PAGE_FREE;
+		hlist_del(&upage->hlist);
+
+	} else if (kcb->kprobe_status & UPROBE_SS_INLINE)
+		replace_original_insn(uprobe, regs,
+				(kprobe_opcode_t)BREAKPOINT_INSTRUCTION);
+	regs->eflags &= ~TF_MASK;
+}
+
 /*
  * Interrupts are disabled on entry as trap1 is an interrupt gate and they
  * remain disabled thoroughout this function.
@@ -510,16 +877,19 @@ static inline int post_kprobe_handler(st
 	if (!cur)
 		return 0;
 
-	if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) {
-		kcb->kprobe_status = KPROBE_HIT_SSDONE;
+	if (!(kcb->kprobe_status & KPROBE_REENTER) && cur->post_handler) {
+		kcb->kprobe_status |= KPROBE_HIT_SSDONE;
 		cur->post_handler(cur, regs, 0);
 	}
 
-	resume_execution(cur, regs, kcb);
+	if (!kernel_text_address((unsigned long)cur->addr))
+		resume_execution_user(__get_cpu_var(current_uprobe), regs, kcb);
+	else
+		resume_execution(cur, regs, kcb);
 	regs->eflags |= kcb->kprobe_saved_eflags;
 
 	/*Restore back the original saved kprobes variables and continue. */
-	if (kcb->kprobe_status == KPROBE_REENTER) {
+	if (kcb->kprobe_status & KPROBE_REENTER) {
 		restore_previous_kprobe(kcb);
 		goto out;
 	}
@@ -547,7 +917,13 @@ static inline int kprobe_fault_handler(s
 		return 1;
 
 	if (kcb->kprobe_status & KPROBE_HIT_SS) {
-		resume_execution(cur, regs, kcb);
+		if (!kernel_text_address((unsigned long)cur->addr)) {
+			struct uprobe *uprobe =  __get_cpu_var(current_uprobe);
+			/* TODO: Proper handling of all instruction */
+			replace_original_insn(uprobe, regs, uprobe->kp.opcode);
+			regs->eflags &= ~TF_MASK;
+		} else
+			resume_execution(cur, regs, kcb);
 		regs->eflags |= kcb->kprobe_old_eflags;
 
 		reset_current_kprobe();
@@ -654,7 +1030,67 @@ int __kprobes longjmp_break_handler(stru
 	return 0;
 }
 
+static void free_alias(void)
+{
+	int cpu;
+
+	for_each_cpu(cpu) {
+		struct uprobe_page *upage;
+		upage = per_cpu_ptr(uprobe_page, cpu);
+
+		if (upage->alias_addr) {
+			set_pte(upage->alias_pte, __pte(upage->alias_pte_val));
+			kfree(upage->alias_addr);
+		}
+		upage->alias_pte = 0;
+	}
+	free_percpu(uprobe_page);
+	return;
+}
+
+static int alloc_alias(void)
+{
+	int cpu;
+
+	uprobe_page = __alloc_percpu(sizeof(struct uprobe_page));
+
+	for_each_cpu(cpu) {
+		struct uprobe_page *upage;
+		upage = per_cpu_ptr(uprobe_page, cpu);
+		upage->alias_addr = kmalloc(PAGE_SIZE, GFP_USER);
+		if (!upage->alias_addr) {
+			free_alias();
+			return -ENOMEM;
+		}
+		upage->alias_pte = lookup_address(
+					(unsigned long)upage->alias_addr);
+		upage->alias_pte_val = pte_val(*upage->alias_pte);
+		if (upage->alias_pte) {
+			upage->status = UPROBE_PAGE_FREE;
+			set_pte(upage->alias_pte,
+						pte_mkdirty(*upage->alias_pte));
+			set_pte(upage->alias_pte,
+						pte_mkexec(*upage->alias_pte));
+			set_pte(upage->alias_pte,
+						 pte_mkwrite(*upage->alias_pte));
+			set_pte(upage->alias_pte,
+						pte_mkyoung(*upage->alias_pte));
+		}
+	}
+	return 0;
+}
+
 int __init arch_init_kprobes(void)
 {
+	int ret = 0;
+	/*
+	 * user space probes requires a page to copy the original instruction
+	 * so that it can single step if there is no free stack space, allocate
+	 * per cpu page.
+	 */
+
+	if ((ret = alloc_alias()))
+		return ret;
+
 	return 0;
 }
diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/asm-i386/kprobes.h
--- linux-2.6.16-rc1-mm5/include/asm-i386/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
+++ linux-2.6.16-rc1-mm5-prasanna/include/asm-i386/kprobes.h	2006-02-08 19:26:10.000000000 +0530
@@ -42,6 +42,7 @@ typedef u8 kprobe_opcode_t;
 #define JPROBE_ENTRY(pentry)	(kprobe_opcode_t *)pentry
 #define ARCH_SUPPORTS_KRETPROBES
 #define arch_remove_kprobe(p)	do {} while (0)
+#define UPROBE_PAGE_FREE 0x00000001
 
 void kretprobe_trampoline(void);
 
@@ -74,6 +75,18 @@ struct kprobe_ctlblk {
 	struct prev_kprobe prev_kprobe;
 };
 
+/* per cpu uprobe page structure */
+struct uprobe_page {
+	struct hlist_node hlist;
+	pte_t *alias_pte;
+	pte_t *orig_pte;
+	unsigned long orig_pte_val;
+	unsigned long alias_pte_val;
+	void *alias_addr;
+	struct task_struct *tsk;
+	unsigned long status;
+};
+
 /* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
  * if necessary, before executing the original int3/1 (trap) handler.
  */
diff -puN include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line include/linux/kprobes.h
--- linux-2.6.16-rc1-mm5/include/linux/kprobes.h~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:09.000000000 +0530
+++ linux-2.6.16-rc1-mm5-prasanna/include/linux/kprobes.h	2006-02-08 19:26:10.000000000 +0530
@@ -45,11 +45,18 @@
 #ifdef CONFIG_KPROBES
 #include <asm/kprobes.h>
 
+#define KPROBE_HASH_BITS 6
+#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
+
 /* kprobe_status settings */
 #define KPROBE_HIT_ACTIVE	0x00000001
 #define KPROBE_HIT_SS		0x00000002
 #define KPROBE_REENTER		0x00000004
 #define KPROBE_HIT_SSDONE	0x00000008
+#define UPROBE_SS_STACK		0x00000010
+#define UPROBE_SS_EXPSTACK	0x00000020
+#define UPROBE_SS_INLINE	0x00000040
+#define UPROBE_SS_NEW_STACK	0x00000080
 
 /* Attach to insert probes on any functions which should be ignored*/
 #define __kprobes	__attribute__((__section__(".kprobes.text")))
diff -puN kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line kernel/kprobes.c
--- linux-2.6.16-rc1-mm5/kernel/kprobes.c~kprobes_userspace_probes_ss_out-of-line	2006-02-08 19:26:10.000000000 +0530
+++ linux-2.6.16-rc1-mm5-prasanna/kernel/kprobes.c	2006-02-08 19:26:10.000000000 +0530
@@ -42,9 +42,6 @@
 #include <asm/errno.h>
 #include <asm/kdebug.h>
 
-#define KPROBE_HASH_BITS 6
-#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
-
 static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
 static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
 static struct list_head uprobe_module_list;

_
-- 
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: prasanna@in.ibm.com
Ph: 91-80-51776329

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-02-20  5:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-20  3:32 [3/3] Userspace probes prototype-take2 Zhang, Yanmin
2006-02-20  5:07 ` Prasanna S Panchamukhi
  -- strict thread matches above, loose matches on Subject: below --
2006-02-20  5:48 Zhang, Yanmin
2006-02-20  5:48 Zhang, Yanmin
2006-02-20  3:16 Zhang, Yanmin
2006-02-20  4:51 ` Prasanna S Panchamukhi
2006-02-17  9:19 Zhang, Yanmin
2006-02-20  5:36 ` Prasanna S Panchamukhi
2006-02-08 14:10 [1/3] " Prasanna S Panchamukhi
2006-02-08 14:12 ` [2/3] " Prasanna S Panchamukhi
2006-02-08 14:13   ` [3/3] " Prasanna S Panchamukhi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).