public inbox for systemtap@sourceware.org
 help / color / mirror / Atom feed
* s390x help needed - kernel read faults
@ 2011-10-17 18:42 David Smith
       [not found] ` <20111018052305.GA22831@in.ibm.com>
  0 siblings, 1 reply; 8+ messages in thread
From: David Smith @ 2011-10-17 18:42 UTC (permalink / raw)
  To: Systemtap List

I've been looking at BZ639338
(<https://bugzilla.redhat.com/show_bug.cgi?id=639338>).  Basically the
testsuite is getting lots of kernel read faults on RHEL6 (2.6.32) that
it doesn't get on RHEL5 (2.6.18).

This reminded me of PR12341 "deref()/store_deref() variations between
arches" (<http://sourceware.org/bugzilla/show_bug.cgi?id=12341>).  I
hadn't found the time to see if the s390x could use the kernel's own
get_user()/put_user() macros instead of having custom s390x asm code in
loc2c-runtime.h.

However, replacing systemtap's custom s390x asm code didn't fix the
kernel read faults - it fixed some but others popped up.  That work did
find PR13289 "%m/%M printf formatting operators access memory", so it
wasn't wasted at all.

I've learned a lot about the s390x platform that I didn't know before.
The one most relevant here is that the s390x has 3 different address
spaces - "home", "primary", and "secondary" with some instructions that
only operate on certain address spaces.

In RHEL5 (2.6.18) kernels, replacing our custom asm code with the stock
__get_user()/__put_user() showed similar results as RHEL6.  After
looking at it a bit, I realized the main difference between the stock
RHEL5 __get_user() and our version is that we use the 'mvc' instruction
and the stock __get_user() code uses the 'mvcp' instruction.  The 'mvcp'
instruction only works on addresses in the primary address space (where
user data is stored on 2.6.18 kernels).  Systemtap's deref() macro is
used to read both user data and kernel data, but the stock RHEL5
__get_user() function will only read user data.

In RHEL6 (2.6.32) kernels, things are more complicated.  The stock
__get_user()/__put_user() macros can end up using one of 3 sets of code
(uaccess_std.c, uaccess_pt.c, uaccess_mvcos.c), based on the system's
hardware and the size of the data request.  In addition, which address
space will be used for kernel data vs user data appears to also differ
based on the system's hardware.  But, once again the stock
__get_user()/__put_user() code only works on true user data, not kernel
data.

At this point my s390x knowledge is exhausted.  I need help from people
who actually understand this platform to be able to fix
deref()/store_deref() to properly work on kernel data and user data in
newer kernels.  Hopefully, we can use as much stock kernel code as
possible, to avoid this problem popping up again in the future.

Thanks.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
       [not found]       ` <20111028124058.GA2475@osiris.boeblingen.de.ibm.com>
@ 2011-10-28 18:43         ` David Smith
  2011-10-31 10:29           ` Heiko Carstens
  0 siblings, 1 reply; 8+ messages in thread
From: David Smith @ 2011-10-28 18:43 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Ananth N Mavinakayanahalli, Systemtap List

(Background for the systemtap list.  I sent out a plea for s390x help,
and in some private email Ananth sent Heiko my way.)

On 10/28/2011 07:40 AM, Heiko Carstens wrote:

> On Tue, Oct 18, 2011 at 08:56:07AM -0500, David Smith wrote:
> 
> Sorry for late reply.. I was a bit busy. But anyway...
> 
>>> If you would give me a pointer to all the relavant sources/repositories
>>> I might have a look at it. However it will probably take until next week
>>> before I have time to look into it.
>>
>> Thanks!  You can get the entire systemtap source tree with:
>>
>> git clone git://sourceware.org/git/systemtap.git
>>
>> The gitweb viewer is at:
>>
>> http://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=summary
>>
>> The particular file to look at is in runtime/loc2c-runtime.h.  That file
>> has all the arch-specific stuff in it, but you shouldn't have any
>> problem finding the s390x specific code.
>>
>> In case you lost this in my first email message, the tricky part here is
>> that the code has to work for both user data and kernel data.  Otherwise
>> I could just call the stock kernel routines.
> 
> The current code won't work for user space or kernel space. The problem in
> kernel space is that the enable write protection for the kernel image, hence
> every store will result in a fault.
> You can avoid that by using the probe_kernel_write() function which will
> bybass the write protection (on s390).

I don't doubt writes are broken.  While writes will need to work, I'd
guess 85%-90% of the time we will call this code to do reads (deref()),
not writes (store_deref()).  So, I'd like to initially focus on reads.

> Also in the current code I do not see how address spaces are switched, so
> it looks to me like all operations are always done in the kernel address
> space.
> 
> I didn't see however the tricky part you mentioned (user space vs kernel
> space). How does the code distinguish if an operation has to be done on
> the kernel space address range or user space address range?

Sigh.  That's the thing - we don't distinguish.  The assumption was made
that calling a function to access user memory would also work for kernel
memory.  In RHEL5-era kernels (2.6.18), that assumption worked for us
with the code that's checked in (originally done by David Wilder from
IBM).  Something has changed in the way the s390x handles memory spaces
since then.

>> P.S.: Is it OK to copy this private email back to the list?
> 
> Sure.


Thanks, I've done this.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
  2011-10-28 18:43         ` David Smith
@ 2011-10-31 10:29           ` Heiko Carstens
  2011-11-07 16:03             ` David Smith
  0 siblings, 1 reply; 8+ messages in thread
From: Heiko Carstens @ 2011-10-31 10:29 UTC (permalink / raw)
  To: David Smith; +Cc: Ananth N Mavinakayanahalli, Systemtap List

> > You can avoid that by using the probe_kernel_write() function which will
> > bybass the write protection (on s390).
> 
> I don't doubt writes are broken.  While writes will need to work, I'd
> guess 85%-90% of the time we will call this code to do reads (deref()),
> not writes (store_deref()).  So, I'd like to initially focus on reads.
> 
> > Also in the current code I do not see how address spaces are switched, so
> > it looks to me like all operations are always done in the kernel address
> > space.
> > 
> > I didn't see however the tricky part you mentioned (user space vs kernel
> > space). How does the code distinguish if an operation has to be done on
> > the kernel space address range or user space address range?
> 
> Sigh.  That's the thing - we don't distinguish.  The assumption was made
> that calling a function to access user memory would also work for kernel
> memory.

Sure it does, _if_ you call set_fs(KERNEL_DS) in advance of any of uaccess
functions. Of course you know that already.

What I don't understand is, why you do not use the set_fs() mechanism and
the in-kernel uaccess functions. Instead you provide your own functions.

Looking alone at the address that needs to be read or written to it's simply
impossible to tell if it's a user space or kernel space address. Just must
specify it.
In addition all the different address space layouts on s390 will make your
life much more difficult. Hence.. use the in-kernel functions, everything
else will break again.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
  2011-10-31 10:29           ` Heiko Carstens
@ 2011-11-07 16:03             ` David Smith
  2011-11-07 17:18               ` Mark Wielaard
  0 siblings, 1 reply; 8+ messages in thread
From: David Smith @ 2011-11-07 16:03 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Ananth N Mavinakayanahalli, Systemtap List

On 10/31/2011 05:29 AM, Heiko Carstens wrote:

>>> You can avoid that by using the probe_kernel_write() function which will
>>> bybass the write protection (on s390).
>>
>> I don't doubt writes are broken.  While writes will need to work, I'd
>> guess 85%-90% of the time we will call this code to do reads (deref()),
>> not writes (store_deref()).  So, I'd like to initially focus on reads.
>>
>>> Also in the current code I do not see how address spaces are switched, so
>>> it looks to me like all operations are always done in the kernel address
>>> space.
>>>
>>> I didn't see however the tricky part you mentioned (user space vs kernel
>>> space). How does the code distinguish if an operation has to be done on
>>> the kernel space address range or user space address range?
>>
>> Sigh.  That's the thing - we don't distinguish.  The assumption was made
>> that calling a function to access user memory would also work for kernel
>> memory.
> 
> Sure it does, _if_ you call set_fs(KERNEL_DS) in advance of any of uaccess
> functions. Of course you know that already.
> 
> What I don't understand is, why you do not use the set_fs() mechanism and
> the in-kernel uaccess functions. Instead you provide your own functions.
> 
> Looking alone at the address that needs to be read or written to it's simply
> impossible to tell if it's a user space or kernel space address. Just must
> specify it.
> In addition all the different address space layouts on s390 will make your
> life much more difficult. Hence.. use the in-kernel functions, everything
> else will break again.


Sorry for not responding sooner.

I'm not sure why we provided our own functions, that decision was made a
long time ago.  If we can't look at the address and know whether it is a
user space or kernel space address, then I don't see much choice than to
break up our memory accesses and require the callers to know whether
they are accessing kernel space or user space.

Thanks for the advice.

-- 
David Smith
dsmith@redhat.com
Red Hat
http://www.redhat.com
256.217.0141 (direct)
256.837.0057 (fax)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
  2011-11-07 16:03             ` David Smith
@ 2011-11-07 17:18               ` Mark Wielaard
  2011-11-08 12:18                 ` Heiko Carstens
  2011-11-26  1:50                 ` Mark Wielaard
  0 siblings, 2 replies; 8+ messages in thread
From: Mark Wielaard @ 2011-11-07 17:18 UTC (permalink / raw)
  To: David Smith; +Cc: Heiko Carstens, Ananth N Mavinakayanahalli, Systemtap List

On Mon, Nov 07, 2011 at 10:03:19AM -0600, David Smith wrote:
> I'm not sure why we provided our own functions, that decision was made a
> long time ago.  If we can't look at the address and know whether it is a
> user space or kernel space address, then I don't see much choice than to
> break up our memory accesses and require the callers to know whether
> they are accessing kernel space or user space.

Yes, it seems not that hard to track fully, we mostly do know already.
The only place where it would be helpful to determine whether or not
a address is valid for either kernel or user space would be in the
unwinder. There we walk the stack and stop and/or switch from kernel
to user stack walking based on whether the PC address is a valid
kernel or user space address. Is there really no way to tell the
difference on s390x from kernel space?

Thanks,

Mark

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
  2011-11-07 17:18               ` Mark Wielaard
@ 2011-11-08 12:18                 ` Heiko Carstens
  2011-11-26  1:50                 ` Mark Wielaard
  1 sibling, 0 replies; 8+ messages in thread
From: Heiko Carstens @ 2011-11-08 12:18 UTC (permalink / raw)
  To: Mark Wielaard; +Cc: David Smith, Ananth N Mavinakayanahalli, Systemtap List

On Mon, Nov 07, 2011 at 06:17:37PM +0100, Mark Wielaard wrote:
> On Mon, Nov 07, 2011 at 10:03:19AM -0600, David Smith wrote:
> > I'm not sure why we provided our own functions, that decision was made a
> > long time ago.  If we can't look at the address and know whether it is a
> > user space or kernel space address, then I don't see much choice than to
> > break up our memory accesses and require the callers to know whether
> > they are accessing kernel space or user space.
> 
> Yes, it seems not that hard to track fully, we mostly do know already.
> The only place where it would be helpful to determine whether or not
> a address is valid for either kernel or user space would be in the
> unwinder. There we walk the stack and stop and/or switch from kernel
> to user stack walking based on whether the PC address is a valid
> kernel or user space address. Is there really no way to tell the
> difference on s390x from kernel space?

If all you have is an address, then no, you cannot tell if it belongs to
user space or kernel space.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
  2011-11-07 17:18               ` Mark Wielaard
  2011-11-08 12:18                 ` Heiko Carstens
@ 2011-11-26  1:50                 ` Mark Wielaard
  2011-11-28 11:20                   ` Mark Wielaard
  1 sibling, 1 reply; 8+ messages in thread
From: Mark Wielaard @ 2011-11-26  1:50 UTC (permalink / raw)
  To: David Smith; +Cc: Heiko Carstens, Ananth N Mavinakayanahalli, Systemtap List

On Mon, 2011-11-07 at 18:17 +0100, Mark Wielaard wrote:
> On Mon, Nov 07, 2011 at 10:03:19AM -0600, David Smith wrote:
> > I'm not sure why we provided our own functions, that decision was made a
> > long time ago.  If we can't look at the address and know whether it is a
> > user space or kernel space address, then I don't see much choice than to
> > break up our memory accesses and require the callers to know whether
> > they are accessing kernel space or user space.
> 
> Yes, it seems not that hard to track fully, we mostly do know already.

So I split our deref and store_deref into uderef/kderef and
store_u/kderef that are now used when user or kernel addresses are
accessed. For most architectures these are equal. But for s390x I
replaced our custom accessor functions with with standard s390 kernel
__get_user() and __put_user() functions and then wrap those in get_fs()
set_fs() calls to switch between user/kernel space addresses. See below.

Tested against 2.6.18 (rhel5) and 2.6.32 (rhel6) kernels. It improves
the make installcheck results a lot.

There is still some work to do for the user space dwarf unwinder.

Cheers,

Mark

=== runtime/loc2c-runtime.h s390 deref functions ===

#elif defined (__s390__) || defined (__s390x__)

/* Use same __get_user() and __put_user() for both user and kernel
   addresses, but make sure set_fs() is called appropriately first. */

#define uderef(size, addr) ({ \
    u8 _b; u16 _w; u32 _l; u64 _q; \
    uintptr_t _a = (uintptr_t) addr; \
    intptr_t _v = 0; \
    int _bad = 0; \
    mm_segment_t _oldfs = get_fs(); \
    set_fs (USER_DS); \
    switch (size) { \
      case 1: _bad = __get_user(_b, (u8 *)(_a)); _v = _b; break; \
      case 2: _bad = __get_user(_w, (u16 *)(_a)); _v = _w; break; \
      case 4: _bad = __get_user(_l, (u32 *)(_a)); _v = _l; break; \
      case 8: _bad = __get_user(_q, (u64 *)(_a)); _v = _q; break; \
      default: __get_user_bad(); \
    } \
    set_fs (_oldfs); \
    if (_bad) \
      DEREF_FAULT(addr); \
    _v; \
  })

#define store_uderef(size, addr, value) ({ \
    int _bad = 0; \
    mm_segment_t _oldfs = get_fs(); \
    set_fs (USER_DS); \
    switch (size) {              \
      case 1: _bad = __put_user(((u8)(value)), ((u8 *)(addr))); break; \
      case 2: _bad = __put_user(((u16)(value)), ((u16 *)(addr))); break; \
      case 4: _bad = __put_user(((u32)(value)), ((u32 *)(addr))); break; \
      case 8: _bad = __put_user(((u64)(value)), ((u64 *)(addr))); break; \
      default: __put_user_bad(); \
    } \
    set_fs (_oldfs); \
    if (_bad) \
        STORE_DEREF_FAULT(addr); \
  })

#define kderef(size, addr) ({ \
    u8 _b; u16 _w; u32 _l; u64 _q; \
    uintptr_t _a = (uintptr_t) addr; \
    intptr_t _v = 0; \
    int _bad = 0; \
    mm_segment_t _oldfs = get_fs(); \
    set_fs (KERNEL_DS); \
    switch (size) { \
      case 1: _bad = __get_user(_b, (u8 *)(_a)); _v = _b; break; \
      case 2: _bad = __get_user(_w, (u16 *)(_a)); _v = _w; break; \
      case 4: _bad = __get_user(_l, (u32 *)(_a)); _v = _l; break; \
      case 8: _bad = __get_user(_q, (u64 *)(_a)); _v = _q; break; \
      default: __get_user_bad(); \
    } \
    set_fs (_oldfs); \
    if (_bad) \
      DEREF_FAULT(addr); \
    _v; \
  })

#define store_kderef(size, addr, value) ({ \
    int _bad = 0; \
    mm_segment_t _oldfs = get_fs(); \
    set_fs (KERNEL_DS); \
    switch (size) { \
      case 1: _bad = __put_user(((u8)(value)), ((u8 *)(addr))); break; \
      case 2: _bad = __put_user(((u16)(value)), ((u16 *)(addr))); break; \
      case 4: _bad = __put_user(((u32)(value)), ((u32 *)(addr))); break; \
      case 8: _bad = __put_user(((u64)(value)), ((u64 *)(addr))); break; \
      default: __put_user_bad(); \
    } \
    set_fs (_oldfs); \
    if (_bad) \
        STORE_DEREF_FAULT(addr); \
  })

#endif /* (s390) || (s390x) */

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: s390x help needed - kernel read faults
  2011-11-26  1:50                 ` Mark Wielaard
@ 2011-11-28 11:20                   ` Mark Wielaard
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Wielaard @ 2011-11-28 11:20 UTC (permalink / raw)
  To: David Smith; +Cc: Heiko Carstens, Ananth N Mavinakayanahalli, Systemtap List

On Sat, 2011-11-26 at 02:50 +0100, Mark Wielaard wrote:
> So I split our deref and store_deref into uderef/kderef and
> store_u/kderef that are now used when user or kernel addresses are
> accessed. For most architectures these are equal. But for s390x I
> replaced our custom accessor functions with with standard s390 kernel
> __get_user() and __put_user() functions and then wrap those in get_fs()
> set_fs() calls to switch between user/kernel space addresses. See below.
> 
> Tested against 2.6.18 (rhel5) and 2.6.32 (rhel6) kernels. It improves
> the make installcheck results a lot.
> 
> There is still some work to do for the user space dwarf unwinder.

This has been fixed, the unwinder now also splits its user/kernel
address view completely. There was also a generic bug in the vma tracker
that could in theory hit on any architecture. There is still an issue
with PIE executables on s390x that I haven't tracked down yet. But
excluding PIE executables both user and kernel backtracing and symbol
tracking seems to be on par with other architectures now.

Cheers,

Mark

commit 3c6dd54efb6de5b4d55e073c8d5884440aece135
Author: Mark Wielaard <mjw@redhat.com>
Date:   Sat Nov 26 22:21:22 2011 +0100

Pass through user flag into all of unwinder.
    
runtime/unwind.c relied on passing of struct task being not NULL
to see whether to access user space. In practice tsk was always
either NULL or current. And not all bits of the unwinder checked
it. Now we just pass a user flag and push it down to every
function that needs it.


commit bde342310dd6b1a7e247a0dc40089b25f045552a
Author: Mark Wielaard <mjw@redhat.com>
Date:   Mon Nov 28 12:06:39 2011 +0100

_stp_vma_mmap_cb: Really mean that we are only interested in the first
load.
    
There was a chance our test for whether this was the first load of the
whole module that is executable could miss a second load, which would
override the original start address. Fix that by explicitly checking
whether we already registered this module or not.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-11-28 11:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-17 18:42 s390x help needed - kernel read faults David Smith
     [not found] ` <20111018052305.GA22831@in.ibm.com>
     [not found]   ` <20111018081753.GA2578@osiris.boeblingen.de.ibm.com>
     [not found]     ` <4E9D8577.9030705@redhat.com>
     [not found]       ` <20111028124058.GA2475@osiris.boeblingen.de.ibm.com>
2011-10-28 18:43         ` David Smith
2011-10-31 10:29           ` Heiko Carstens
2011-11-07 16:03             ` David Smith
2011-11-07 17:18               ` Mark Wielaard
2011-11-08 12:18                 ` Heiko Carstens
2011-11-26  1:50                 ` Mark Wielaard
2011-11-28 11:20                   ` Mark Wielaard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).