* s390x help needed - kernel read faults @ 2011-10-17 18:42 David Smith [not found] ` <20111018052305.GA22831@in.ibm.com> 0 siblings, 1 reply; 8+ messages in thread From: David Smith @ 2011-10-17 18:42 UTC (permalink / raw) To: Systemtap List I've been looking at BZ639338 (<https://bugzilla.redhat.com/show_bug.cgi?id=639338>). Basically the testsuite is getting lots of kernel read faults on RHEL6 (2.6.32) that it doesn't get on RHEL5 (2.6.18). This reminded me of PR12341 "deref()/store_deref() variations between arches" (<http://sourceware.org/bugzilla/show_bug.cgi?id=12341>). I hadn't found the time to see if the s390x could use the kernel's own get_user()/put_user() macros instead of having custom s390x asm code in loc2c-runtime.h. However, replacing systemtap's custom s390x asm code didn't fix the kernel read faults - it fixed some but others popped up. That work did find PR13289 "%m/%M printf formatting operators access memory", so it wasn't wasted at all. I've learned a lot about the s390x platform that I didn't know before. The one most relevant here is that the s390x has 3 different address spaces - "home", "primary", and "secondary" with some instructions that only operate on certain address spaces. In RHEL5 (2.6.18) kernels, replacing our custom asm code with the stock __get_user()/__put_user() showed similar results as RHEL6. After looking at it a bit, I realized the main difference between the stock RHEL5 __get_user() and our version is that we use the 'mvc' instruction and the stock __get_user() code uses the 'mvcp' instruction. The 'mvcp' instruction only works on addresses in the primary address space (where user data is stored on 2.6.18 kernels). Systemtap's deref() macro is used to read both user data and kernel data, but the stock RHEL5 __get_user() function will only read user data. In RHEL6 (2.6.32) kernels, things are more complicated. The stock __get_user()/__put_user() macros can end up using one of 3 sets of code (uaccess_std.c, uaccess_pt.c, uaccess_mvcos.c), based on the system's hardware and the size of the data request. In addition, which address space will be used for kernel data vs user data appears to also differ based on the system's hardware. But, once again the stock __get_user()/__put_user() code only works on true user data, not kernel data. At this point my s390x knowledge is exhausted. I need help from people who actually understand this platform to be able to fix deref()/store_deref() to properly work on kernel data and user data in newer kernels. Hopefully, we can use as much stock kernel code as possible, to avoid this problem popping up again in the future. Thanks. -- David Smith dsmith@redhat.com Red Hat http://www.redhat.com 256.217.0141 (direct) 256.837.0057 (fax) ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20111018052305.GA22831@in.ibm.com>]
[parent not found: <20111018081753.GA2578@osiris.boeblingen.de.ibm.com>]
[parent not found: <4E9D8577.9030705@redhat.com>]
[parent not found: <20111028124058.GA2475@osiris.boeblingen.de.ibm.com>]
* Re: s390x help needed - kernel read faults [not found] ` <20111028124058.GA2475@osiris.boeblingen.de.ibm.com> @ 2011-10-28 18:43 ` David Smith 2011-10-31 10:29 ` Heiko Carstens 0 siblings, 1 reply; 8+ messages in thread From: David Smith @ 2011-10-28 18:43 UTC (permalink / raw) To: Heiko Carstens; +Cc: Ananth N Mavinakayanahalli, Systemtap List (Background for the systemtap list. I sent out a plea for s390x help, and in some private email Ananth sent Heiko my way.) On 10/28/2011 07:40 AM, Heiko Carstens wrote: > On Tue, Oct 18, 2011 at 08:56:07AM -0500, David Smith wrote: > > Sorry for late reply.. I was a bit busy. But anyway... > >>> If you would give me a pointer to all the relavant sources/repositories >>> I might have a look at it. However it will probably take until next week >>> before I have time to look into it. >> >> Thanks! You can get the entire systemtap source tree with: >> >> git clone git://sourceware.org/git/systemtap.git >> >> The gitweb viewer is at: >> >> http://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=summary >> >> The particular file to look at is in runtime/loc2c-runtime.h. That file >> has all the arch-specific stuff in it, but you shouldn't have any >> problem finding the s390x specific code. >> >> In case you lost this in my first email message, the tricky part here is >> that the code has to work for both user data and kernel data. Otherwise >> I could just call the stock kernel routines. > > The current code won't work for user space or kernel space. The problem in > kernel space is that the enable write protection for the kernel image, hence > every store will result in a fault. > You can avoid that by using the probe_kernel_write() function which will > bybass the write protection (on s390). I don't doubt writes are broken. While writes will need to work, I'd guess 85%-90% of the time we will call this code to do reads (deref()), not writes (store_deref()). So, I'd like to initially focus on reads. > Also in the current code I do not see how address spaces are switched, so > it looks to me like all operations are always done in the kernel address > space. > > I didn't see however the tricky part you mentioned (user space vs kernel > space). How does the code distinguish if an operation has to be done on > the kernel space address range or user space address range? Sigh. That's the thing - we don't distinguish. The assumption was made that calling a function to access user memory would also work for kernel memory. In RHEL5-era kernels (2.6.18), that assumption worked for us with the code that's checked in (originally done by David Wilder from IBM). Something has changed in the way the s390x handles memory spaces since then. >> P.S.: Is it OK to copy this private email back to the list? > > Sure. Thanks, I've done this. -- David Smith dsmith@redhat.com Red Hat http://www.redhat.com 256.217.0141 (direct) 256.837.0057 (fax) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: s390x help needed - kernel read faults 2011-10-28 18:43 ` David Smith @ 2011-10-31 10:29 ` Heiko Carstens 2011-11-07 16:03 ` David Smith 0 siblings, 1 reply; 8+ messages in thread From: Heiko Carstens @ 2011-10-31 10:29 UTC (permalink / raw) To: David Smith; +Cc: Ananth N Mavinakayanahalli, Systemtap List > > You can avoid that by using the probe_kernel_write() function which will > > bybass the write protection (on s390). > > I don't doubt writes are broken. While writes will need to work, I'd > guess 85%-90% of the time we will call this code to do reads (deref()), > not writes (store_deref()). So, I'd like to initially focus on reads. > > > Also in the current code I do not see how address spaces are switched, so > > it looks to me like all operations are always done in the kernel address > > space. > > > > I didn't see however the tricky part you mentioned (user space vs kernel > > space). How does the code distinguish if an operation has to be done on > > the kernel space address range or user space address range? > > Sigh. That's the thing - we don't distinguish. The assumption was made > that calling a function to access user memory would also work for kernel > memory. Sure it does, _if_ you call set_fs(KERNEL_DS) in advance of any of uaccess functions. Of course you know that already. What I don't understand is, why you do not use the set_fs() mechanism and the in-kernel uaccess functions. Instead you provide your own functions. Looking alone at the address that needs to be read or written to it's simply impossible to tell if it's a user space or kernel space address. Just must specify it. In addition all the different address space layouts on s390 will make your life much more difficult. Hence.. use the in-kernel functions, everything else will break again. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: s390x help needed - kernel read faults 2011-10-31 10:29 ` Heiko Carstens @ 2011-11-07 16:03 ` David Smith 2011-11-07 17:18 ` Mark Wielaard 0 siblings, 1 reply; 8+ messages in thread From: David Smith @ 2011-11-07 16:03 UTC (permalink / raw) To: Heiko Carstens; +Cc: Ananth N Mavinakayanahalli, Systemtap List On 10/31/2011 05:29 AM, Heiko Carstens wrote: >>> You can avoid that by using the probe_kernel_write() function which will >>> bybass the write protection (on s390). >> >> I don't doubt writes are broken. While writes will need to work, I'd >> guess 85%-90% of the time we will call this code to do reads (deref()), >> not writes (store_deref()). So, I'd like to initially focus on reads. >> >>> Also in the current code I do not see how address spaces are switched, so >>> it looks to me like all operations are always done in the kernel address >>> space. >>> >>> I didn't see however the tricky part you mentioned (user space vs kernel >>> space). How does the code distinguish if an operation has to be done on >>> the kernel space address range or user space address range? >> >> Sigh. That's the thing - we don't distinguish. The assumption was made >> that calling a function to access user memory would also work for kernel >> memory. > > Sure it does, _if_ you call set_fs(KERNEL_DS) in advance of any of uaccess > functions. Of course you know that already. > > What I don't understand is, why you do not use the set_fs() mechanism and > the in-kernel uaccess functions. Instead you provide your own functions. > > Looking alone at the address that needs to be read or written to it's simply > impossible to tell if it's a user space or kernel space address. Just must > specify it. > In addition all the different address space layouts on s390 will make your > life much more difficult. Hence.. use the in-kernel functions, everything > else will break again. Sorry for not responding sooner. I'm not sure why we provided our own functions, that decision was made a long time ago. If we can't look at the address and know whether it is a user space or kernel space address, then I don't see much choice than to break up our memory accesses and require the callers to know whether they are accessing kernel space or user space. Thanks for the advice. -- David Smith dsmith@redhat.com Red Hat http://www.redhat.com 256.217.0141 (direct) 256.837.0057 (fax) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: s390x help needed - kernel read faults 2011-11-07 16:03 ` David Smith @ 2011-11-07 17:18 ` Mark Wielaard 2011-11-08 12:18 ` Heiko Carstens 2011-11-26 1:50 ` Mark Wielaard 0 siblings, 2 replies; 8+ messages in thread From: Mark Wielaard @ 2011-11-07 17:18 UTC (permalink / raw) To: David Smith; +Cc: Heiko Carstens, Ananth N Mavinakayanahalli, Systemtap List On Mon, Nov 07, 2011 at 10:03:19AM -0600, David Smith wrote: > I'm not sure why we provided our own functions, that decision was made a > long time ago. If we can't look at the address and know whether it is a > user space or kernel space address, then I don't see much choice than to > break up our memory accesses and require the callers to know whether > they are accessing kernel space or user space. Yes, it seems not that hard to track fully, we mostly do know already. The only place where it would be helpful to determine whether or not a address is valid for either kernel or user space would be in the unwinder. There we walk the stack and stop and/or switch from kernel to user stack walking based on whether the PC address is a valid kernel or user space address. Is there really no way to tell the difference on s390x from kernel space? Thanks, Mark ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: s390x help needed - kernel read faults 2011-11-07 17:18 ` Mark Wielaard @ 2011-11-08 12:18 ` Heiko Carstens 2011-11-26 1:50 ` Mark Wielaard 1 sibling, 0 replies; 8+ messages in thread From: Heiko Carstens @ 2011-11-08 12:18 UTC (permalink / raw) To: Mark Wielaard; +Cc: David Smith, Ananth N Mavinakayanahalli, Systemtap List On Mon, Nov 07, 2011 at 06:17:37PM +0100, Mark Wielaard wrote: > On Mon, Nov 07, 2011 at 10:03:19AM -0600, David Smith wrote: > > I'm not sure why we provided our own functions, that decision was made a > > long time ago. If we can't look at the address and know whether it is a > > user space or kernel space address, then I don't see much choice than to > > break up our memory accesses and require the callers to know whether > > they are accessing kernel space or user space. > > Yes, it seems not that hard to track fully, we mostly do know already. > The only place where it would be helpful to determine whether or not > a address is valid for either kernel or user space would be in the > unwinder. There we walk the stack and stop and/or switch from kernel > to user stack walking based on whether the PC address is a valid > kernel or user space address. Is there really no way to tell the > difference on s390x from kernel space? If all you have is an address, then no, you cannot tell if it belongs to user space or kernel space. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: s390x help needed - kernel read faults 2011-11-07 17:18 ` Mark Wielaard 2011-11-08 12:18 ` Heiko Carstens @ 2011-11-26 1:50 ` Mark Wielaard 2011-11-28 11:20 ` Mark Wielaard 1 sibling, 1 reply; 8+ messages in thread From: Mark Wielaard @ 2011-11-26 1:50 UTC (permalink / raw) To: David Smith; +Cc: Heiko Carstens, Ananth N Mavinakayanahalli, Systemtap List On Mon, 2011-11-07 at 18:17 +0100, Mark Wielaard wrote: > On Mon, Nov 07, 2011 at 10:03:19AM -0600, David Smith wrote: > > I'm not sure why we provided our own functions, that decision was made a > > long time ago. If we can't look at the address and know whether it is a > > user space or kernel space address, then I don't see much choice than to > > break up our memory accesses and require the callers to know whether > > they are accessing kernel space or user space. > > Yes, it seems not that hard to track fully, we mostly do know already. So I split our deref and store_deref into uderef/kderef and store_u/kderef that are now used when user or kernel addresses are accessed. For most architectures these are equal. But for s390x I replaced our custom accessor functions with with standard s390 kernel __get_user() and __put_user() functions and then wrap those in get_fs() set_fs() calls to switch between user/kernel space addresses. See below. Tested against 2.6.18 (rhel5) and 2.6.32 (rhel6) kernels. It improves the make installcheck results a lot. There is still some work to do for the user space dwarf unwinder. Cheers, Mark === runtime/loc2c-runtime.h s390 deref functions === #elif defined (__s390__) || defined (__s390x__) /* Use same __get_user() and __put_user() for both user and kernel addresses, but make sure set_fs() is called appropriately first. */ #define uderef(size, addr) ({ \ u8 _b; u16 _w; u32 _l; u64 _q; \ uintptr_t _a = (uintptr_t) addr; \ intptr_t _v = 0; \ int _bad = 0; \ mm_segment_t _oldfs = get_fs(); \ set_fs (USER_DS); \ switch (size) { \ case 1: _bad = __get_user(_b, (u8 *)(_a)); _v = _b; break; \ case 2: _bad = __get_user(_w, (u16 *)(_a)); _v = _w; break; \ case 4: _bad = __get_user(_l, (u32 *)(_a)); _v = _l; break; \ case 8: _bad = __get_user(_q, (u64 *)(_a)); _v = _q; break; \ default: __get_user_bad(); \ } \ set_fs (_oldfs); \ if (_bad) \ DEREF_FAULT(addr); \ _v; \ }) #define store_uderef(size, addr, value) ({ \ int _bad = 0; \ mm_segment_t _oldfs = get_fs(); \ set_fs (USER_DS); \ switch (size) { \ case 1: _bad = __put_user(((u8)(value)), ((u8 *)(addr))); break; \ case 2: _bad = __put_user(((u16)(value)), ((u16 *)(addr))); break; \ case 4: _bad = __put_user(((u32)(value)), ((u32 *)(addr))); break; \ case 8: _bad = __put_user(((u64)(value)), ((u64 *)(addr))); break; \ default: __put_user_bad(); \ } \ set_fs (_oldfs); \ if (_bad) \ STORE_DEREF_FAULT(addr); \ }) #define kderef(size, addr) ({ \ u8 _b; u16 _w; u32 _l; u64 _q; \ uintptr_t _a = (uintptr_t) addr; \ intptr_t _v = 0; \ int _bad = 0; \ mm_segment_t _oldfs = get_fs(); \ set_fs (KERNEL_DS); \ switch (size) { \ case 1: _bad = __get_user(_b, (u8 *)(_a)); _v = _b; break; \ case 2: _bad = __get_user(_w, (u16 *)(_a)); _v = _w; break; \ case 4: _bad = __get_user(_l, (u32 *)(_a)); _v = _l; break; \ case 8: _bad = __get_user(_q, (u64 *)(_a)); _v = _q; break; \ default: __get_user_bad(); \ } \ set_fs (_oldfs); \ if (_bad) \ DEREF_FAULT(addr); \ _v; \ }) #define store_kderef(size, addr, value) ({ \ int _bad = 0; \ mm_segment_t _oldfs = get_fs(); \ set_fs (KERNEL_DS); \ switch (size) { \ case 1: _bad = __put_user(((u8)(value)), ((u8 *)(addr))); break; \ case 2: _bad = __put_user(((u16)(value)), ((u16 *)(addr))); break; \ case 4: _bad = __put_user(((u32)(value)), ((u32 *)(addr))); break; \ case 8: _bad = __put_user(((u64)(value)), ((u64 *)(addr))); break; \ default: __put_user_bad(); \ } \ set_fs (_oldfs); \ if (_bad) \ STORE_DEREF_FAULT(addr); \ }) #endif /* (s390) || (s390x) */ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: s390x help needed - kernel read faults 2011-11-26 1:50 ` Mark Wielaard @ 2011-11-28 11:20 ` Mark Wielaard 0 siblings, 0 replies; 8+ messages in thread From: Mark Wielaard @ 2011-11-28 11:20 UTC (permalink / raw) To: David Smith; +Cc: Heiko Carstens, Ananth N Mavinakayanahalli, Systemtap List On Sat, 2011-11-26 at 02:50 +0100, Mark Wielaard wrote: > So I split our deref and store_deref into uderef/kderef and > store_u/kderef that are now used when user or kernel addresses are > accessed. For most architectures these are equal. But for s390x I > replaced our custom accessor functions with with standard s390 kernel > __get_user() and __put_user() functions and then wrap those in get_fs() > set_fs() calls to switch between user/kernel space addresses. See below. > > Tested against 2.6.18 (rhel5) and 2.6.32 (rhel6) kernels. It improves > the make installcheck results a lot. > > There is still some work to do for the user space dwarf unwinder. This has been fixed, the unwinder now also splits its user/kernel address view completely. There was also a generic bug in the vma tracker that could in theory hit on any architecture. There is still an issue with PIE executables on s390x that I haven't tracked down yet. But excluding PIE executables both user and kernel backtracing and symbol tracking seems to be on par with other architectures now. Cheers, Mark commit 3c6dd54efb6de5b4d55e073c8d5884440aece135 Author: Mark Wielaard <mjw@redhat.com> Date: Sat Nov 26 22:21:22 2011 +0100 Pass through user flag into all of unwinder. runtime/unwind.c relied on passing of struct task being not NULL to see whether to access user space. In practice tsk was always either NULL or current. And not all bits of the unwinder checked it. Now we just pass a user flag and push it down to every function that needs it. commit bde342310dd6b1a7e247a0dc40089b25f045552a Author: Mark Wielaard <mjw@redhat.com> Date: Mon Nov 28 12:06:39 2011 +0100 _stp_vma_mmap_cb: Really mean that we are only interested in the first load. There was a chance our test for whether this was the first load of the whole module that is executable could miss a second load, which would override the original start address. Fix that by explicitly checking whether we already registered this module or not. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2011-11-28 11:20 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-10-17 18:42 s390x help needed - kernel read faults David Smith [not found] ` <20111018052305.GA22831@in.ibm.com> [not found] ` <20111018081753.GA2578@osiris.boeblingen.de.ibm.com> [not found] ` <4E9D8577.9030705@redhat.com> [not found] ` <20111028124058.GA2475@osiris.boeblingen.de.ibm.com> 2011-10-28 18:43 ` David Smith 2011-10-31 10:29 ` Heiko Carstens 2011-11-07 16:03 ` David Smith 2011-11-07 17:18 ` Mark Wielaard 2011-11-08 12:18 ` Heiko Carstens 2011-11-26 1:50 ` Mark Wielaard 2011-11-28 11:20 ` Mark Wielaard
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).