RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
@ 2022-09-16 14:36 Mathieu Desnoyers
  2022-09-16 14:44 ` Mathieu Desnoyers
  2022-09-16 15:29 ` Florian Weimer
  0 siblings, 2 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-09-16 14:36 UTC (permalink / raw)
  To: Florian Weimer, carlos, libc-alpha, szabolcs.nagy, libc-coord

Hi Florian,

I wanted to clarify by email what we each have in mind with respect
to exposing the RSEQ feature set available to the outside world
through libc symbols.

I have 3 different possible approaches in mind, shown below with
3 examples:

#include <stdint.h>

#undef likely
#define likely(x)       __builtin_expect(!!(x), 1)
#undef __aligned
#define __aligned(x)    __attribute__((__aligned__(x)))
#undef offsetof
#define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
#undef sizeof_field
#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
#undef offsetofend
#define offsetofend(TYPE, MEMBER) \
         (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))

#define __RSEQ_FLAG_FEATURE_EXTENDED    0x2

#define __RSEQ_FLAG_FEATURE_VM_VCPU_ID  0x4

typedef uint32_t __u32;
typedef uint64_t __u64;

/* Original: size=32 bytes */

struct rseq_orig {
   uint32_t cpu_id_start;
   uint32_t cpu_id;
   uint64_t rseq_cs;
   uint32_t flags;
   uint32_t padding[3];
} __aligned(32);

/* Extended */

struct rseq_ext {
   uint32_t cpu_id_start;
   uint32_t cpu_id;
   uint64_t rseq_cs;
   uint32_t flags;
   /* New */
   uint32_t node_id;
   uint32_t vm_vcpu_id;
   uint32_t padding[1];
} __aligned(32);

unsigned int __rseq_flags;
unsigned int __rseq_size;
unsigned int __rseq_feature_size;

/* A) Check extended feature flag and size. One mask and two comparisons. */
void fA(void)
{
         if (likely((__rseq_flags & __RSEQ_FLAG_FEATURE_EXTENDED)
                    && __rseq_size >= offsetofend(struct rseq_ext, vm_vcpu_id))) {
                 /* Use rseq with vcpu_id. */
                 asm volatile ("ud2\n\t");
         } else {
                 /* Fallback. */
                 asm volatile ("int3\n\t");
         }
}

/*
  * B) Check rseq feature size. Feature number only limited by size of
  * uint32_t. One comparison.
  */
void fB(void)
{
         if (likely(__rseq_feature_size >= offsetofend(struct rseq_ext, vm_vcpu_id))) {
                 /* Use rseq with vcpu_id. */
                 asm volatile ("ud2\n\t");
         } else {
                 /* Fallback. */
                 asm volatile ("int3\n\t");
         }
}

/*
  * C) Check only rseq flags. 32 features at most. One mask and one
  * comparison.
  */

void fC(void)
{
         if (likely(__rseq_flags & __RSEQ_FLAG_FEATURE_VM_VCPU_ID)) {
                 /* Use rseq with vcpu_id. */
                 asm volatile ("ud2\n\t");
         } else {
                 /* Fallback. */
                 asm volatile ("int3\n\t");
         }

Here is the resulting objdump:


rseq-flags.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <fA>:
    0:   f6 05 00 00 00 00 02    testb  $0x2,0x0(%rip)        # 7 <fA+0x7>
    7:   74 0f                   je     18 <fA+0x18>
    9:   83 3d 00 00 00 00 1b    cmpl   $0x1b,0x0(%rip)        # 10 <fA+0x10>
   10:   76 06                   jbe    18 <fA+0x18>
   12:   0f 0b                   ud2
   14:   c3                      retq
   15:   0f 1f 00                nopl   (%rax)
   18:   cc                      int3
   19:   c3                      retq
   1a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

0000000000000020 <fB>:
   20:   83 3d 00 00 00 00 1b    cmpl   $0x1b,0x0(%rip)        # 27 <fB+0x7>
   27:   76 07                   jbe    30 <fB+0x10>
   29:   0f 0b                   ud2
   2b:   c3                      retq
   2c:   0f 1f 40 00             nopl   0x0(%rax)
   30:   cc                      int3
   31:   c3                      retq
   32:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
   39:   00 00 00 00
   3d:   0f 1f 00                nopl   (%rax)

0000000000000040 <fC>:
   40:   f6 05 00 00 00 00 04    testb  $0x4,0x0(%rip)        # 47 <fC+0x7>
   47:   74 07                   je     50 <fC+0x10>
   49:   0f 0b                   ud2
   4b:   c3                      retq
   4c:   0f 1f 40 00             nopl   0x0(%rax)
   50:   cc                      int3
   51:   c3                      retq

I can think of 4 approaches that applications will use to detect
availability of their specific rseq feature for each rseq critical
section:

1) Dynamically check whether the feature is implemented at runtime
    with conditional branches. Those using this approach will probably
    not want to have the overhead of the two comparisons in approach (A)
    above. Applications and libraries should probably use their own copy
    of the glibc symbols for speed purposes.

2) Implement the entire function as IFUNC and select whether a rseq or
    non-rseq implementation should be used at C startup. The tradeoff
    here is code size vs speed, and using IFUNC for things like malloc
    may add additional constraints on the startup order.

3) Code rewrite (dynamic code patching) between rseq and non-rseq code.
    This may be frowned upon in the security area and may not always be
    possible depending on the context.

3) JIT compilation of specialized rseq vs non-rseq code. Not generally
    available in C.

I suspect that glibc may rely on approaches 1+2 depending on the
situation, and many applications may use approach (1) for simplicity
reasons.

Ideally I would like to keep approach (1) fast, so I'd prefer to
keep the check to one single conditional branch. This eliminates
approach (A) and leaves approaches (B) and (C). Approach (B) has
the advantage of not limiting us to 32 features, but its downside
is that we need to introduce a new __rseq_feature_size symbol to
the libc ABI. Approach (C) has the advantage of using __rseq_flags
which is already exposed, but limits us to 32 features.

Did you have in mind an approach like (A), (B) or (C) for exposing
the rseq feature set or something else entirely ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-16 14:36 RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size Mathieu Desnoyers
@ 2022-09-16 14:44 ` Mathieu Desnoyers
  2022-09-16 15:29 ` Florian Weimer
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-09-16 14:44 UTC (permalink / raw)
  To: Florian Weimer, carlos, libc-alpha, szabolcs.nagy, libc-coord

On 2022-09-16 16:36, Mathieu Desnoyers wrote:
> Hi Florian,
> 
> I wanted to clarify by email what we each have in mind with respect
> to exposing the RSEQ feature set available to the outside world
> through libc symbols.
> 
> I have 3 different possible approaches in mind, shown below with
> 3 examples:
> 
> #include <stdint.h>
> 
> #undef likely
> #define likely(x)       __builtin_expect(!!(x), 1)
> #undef __aligned
> #define __aligned(x)    __attribute__((__aligned__(x)))
> #undef offsetof
> #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
> #undef sizeof_field
> #define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
> #undef offsetofend
> #define offsetofend(TYPE, MEMBER) \
>          (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))
> 
> #define __RSEQ_FLAG_FEATURE_EXTENDED    0x2
> 
> #define __RSEQ_FLAG_FEATURE_VM_VCPU_ID  0x4
> 
> typedef uint32_t __u32;
> typedef uint64_t __u64;
> 
> /* Original: size=32 bytes */
> 
> struct rseq_orig {
>    uint32_t cpu_id_start;
>    uint32_t cpu_id;
>    uint64_t rseq_cs;
>    uint32_t flags;
>    uint32_t padding[3];
> } __aligned(32);
> 
> /* Extended */
> 
> struct rseq_ext {
>    uint32_t cpu_id_start;
>    uint32_t cpu_id;
>    uint64_t rseq_cs;
>    uint32_t flags;
>    /* New */
>    uint32_t node_id;
>    uint32_t vm_vcpu_id;
>    uint32_t padding[1];
> } __aligned(32);
> 
> unsigned int __rseq_flags;
> unsigned int __rseq_size;
> unsigned int __rseq_feature_size;
> 
> /* A) Check extended feature flag and size. One mask and two 
> comparisons. */
> void fA(void)
> {
>          if (likely((__rseq_flags & __RSEQ_FLAG_FEATURE_EXTENDED)
>                     && __rseq_size >= offsetofend(struct rseq_ext, 
> vm_vcpu_id))) {
>                  /* Use rseq with vcpu_id. */
>                  asm volatile ("ud2\n\t");
>          } else {
>                  /* Fallback. */
>                  asm volatile ("int3\n\t");
>          }
> }
> 
> /*
>   * B) Check rseq feature size. Feature number only limited by size of
>   * uint32_t. One comparison.
>   */
> void fB(void)
> {
>          if (likely(__rseq_feature_size >= offsetofend(struct rseq_ext, 
> vm_vcpu_id))) {
>                  /* Use rseq with vcpu_id. */
>                  asm volatile ("ud2\n\t");
>          } else {
>                  /* Fallback. */
>                  asm volatile ("int3\n\t");
>          }
> }
> 
> /*
>   * C) Check only rseq flags. 32 features at most. One mask and one
>   * comparison.
>   */
> 
> void fC(void)
> {
>          if (likely(__rseq_flags & __RSEQ_FLAG_FEATURE_VM_VCPU_ID)) {
>                  /* Use rseq with vcpu_id. */
>                  asm volatile ("ud2\n\t");
>          } else {
>                  /* Fallback. */
>                  asm volatile ("int3\n\t");
>          }
> 
> Here is the resulting objdump:
> 
> 
> rseq-flags.o:     file format elf64-x86-64
> 
> 
> Disassembly of section .text:
> 
> 0000000000000000 <fA>:
>     0:   f6 05 00 00 00 00 02    testb  $0x2,0x0(%rip)        # 7 <fA+0x7>
>     7:   74 0f                   je     18 <fA+0x18>
>     9:   83 3d 00 00 00 00 1b    cmpl   $0x1b,0x0(%rip)        # 10 
> <fA+0x10>
>    10:   76 06                   jbe    18 <fA+0x18>
>    12:   0f 0b                   ud2
>    14:   c3                      retq
>    15:   0f 1f 00                nopl   (%rax)
>    18:   cc                      int3
>    19:   c3                      retq
>    1a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
> 
> 0000000000000020 <fB>:
>    20:   83 3d 00 00 00 00 1b    cmpl   $0x1b,0x0(%rip)        # 27 
> <fB+0x7>
>    27:   76 07                   jbe    30 <fB+0x10>
>    29:   0f 0b                   ud2
>    2b:   c3                      retq
>    2c:   0f 1f 40 00             nopl   0x0(%rax)
>    30:   cc                      int3
>    31:   c3                      retq
>    32:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
>    39:   00 00 00 00
>    3d:   0f 1f 00                nopl   (%rax)
> 
> 0000000000000040 <fC>:
>    40:   f6 05 00 00 00 00 04    testb  $0x4,0x0(%rip)        # 47 <fC+0x7>
>    47:   74 07                   je     50 <fC+0x10>
>    49:   0f 0b                   ud2
>    4b:   c3                      retq
>    4c:   0f 1f 40 00             nopl   0x0(%rax)
>    50:   cc                      int3
>    51:   c3                      retq
> 
> I can think of 4 approaches that applications will use to detect
> availability of their specific rseq feature for each rseq critical
> section:
> 
> 1) Dynamically check whether the feature is implemented at runtime
>     with conditional branches. Those using this approach will probably
>     not want to have the overhead of the two comparisons in approach (A)
>     above. Applications and libraries should probably use their own copy
>     of the glibc symbols for speed purposes.
> 
> 2) Implement the entire function as IFUNC and select whether a rseq or
>     non-rseq implementation should be used at C startup. The tradeoff
>     here is code size vs speed, and using IFUNC for things like malloc
>     may add additional constraints on the startup order.
> 
> 3) Code rewrite (dynamic code patching) between rseq and non-rseq code.
>     This may be frowned upon in the security area and may not always be
>     possible depending on the context.
> 
> 3) JIT compilation of specialized rseq vs non-rseq code. Not generally
>     available in C.
> 
> I suspect that glibc may rely on approaches 1+2 depending on the
> situation, and many applications may use approach (1) for simplicity
> reasons.
> 
> Ideally I would like to keep approach (1) fast, so I'd prefer to
> keep the check to one single conditional branch. This eliminates
> approach (A) and leaves approaches (B) and (C). Approach (B) has
> the advantage of not limiting us to 32 features, but its downside
> is that we need to introduce a new __rseq_feature_size symbol to
> the libc ABI. Approach (C) has the advantage of using __rseq_flags
> which is already exposed, but limits us to 32 features.
> 
> Did you have in mind an approach like (A), (B) or (C) for exposing
> the rseq feature set or something else entirely ?

One more detail: Approach (C) using __rseq_flags would allow us to 
eventually deprecate features if need be by clearing the bit associated 
with the deprecated feature and leaving the field unpopulated in the 
rseq structure. I'm not certain whether this is something we want to be 
able to do or not, but it may be a nice property of the flags approach.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-16 14:36 RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size Mathieu Desnoyers
  2022-09-16 14:44 ` Mathieu Desnoyers
@ 2022-09-16 15:29 ` Florian Weimer
  2022-09-16 15:41   ` [libc-coord] " Chris Kennelly
  1 sibling, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2022-09-16 15:29 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: carlos, libc-alpha, szabolcs.nagy, libc-coord

* Mathieu Desnoyers:

> /*
>   * C) Check only rseq flags. 32 features at most. One mask and one
>   * comparison.
>   */
>
> void fC(void)
> {
>          if (likely(__rseq_flags & __RSEQ_FLAG_FEATURE_VM_VCPU_ID)) {
>                  /* Use rseq with vcpu_id. */
>                  asm volatile ("ud2\n\t");
>          } else {
>                  /* Fallback. */
>                  asm volatile ("int3\n\t");
>          }

I think it has to be this because we cannot lower __rseq_flags below
32 now, not if rseq is active.

If you don't find a better use fot the remaining 32 bits of padding,
maybe put the PID or TID there, so that we can create a
system-call-less version of getpid/gettid.  So the flag would just say
that the padding is now completely used.

Going forward, we can use the size increasing above 32 as a support
indicator.

> I can think of 4 approaches that applications will use to detect
> availability of their specific rseq feature for each rseq critical
> section:
>
> 1) Dynamically check whether the feature is implemented at runtime
>     with conditional branches. Those using this approach will probably
>     not want to have the overhead of the two comparisons in approach (A)
>     above. Applications and libraries should probably use their own copy
>     of the glibc symbols for speed purposes.
>
> 2) Implement the entire function as IFUNC and select whether a rseq or
>     non-rseq implementation should be used at C startup. The tradeoff
>     here is code size vs speed, and using IFUNC for things like malloc
>     may add additional constraints on the startup order.
>
> 3) Code rewrite (dynamic code patching) between rseq and non-rseq code.
>     This may be frowned upon in the security area and may not always be
>     possible depending on the context.
>
> 3) JIT compilation of specialized rseq vs non-rseq code. Not generally
>     available in C.
>
> I suspect that glibc may rely on approaches 1+2 depending on the
> situation, and many applications may use approach (1) for simplicity
> reasons.

If the kernel does not currently overwrite the padding, glibc can do
its own per-thread initialization there to support its malloc
implementation (because the padding is undefined today from an
application perspective).  That is, we would initialize these
invisible vCPU IDs the same way we assign arenas today.  That would
cover this specific malloc use case only, of course.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [libc-coord] Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-16 15:29 ` Florian Weimer
@ 2022-09-16 15:41   ` Chris Kennelly
  2022-09-16 21:32     ` Florian Weimer
  2022-09-20 11:51     ` Szabolcs Nagy
  0 siblings, 2 replies; 9+ messages in thread
From: Chris Kennelly @ 2022-09-16 15:41 UTC (permalink / raw)
  To: libc-coord; +Cc: Mathieu Desnoyers, carlos, libc-alpha, szabolcs.nagy

[-- Attachment #1: Type: text/plain, Size: 3116 bytes --]

On Fri, Sep 16, 2022 at 11:29 AM Florian Weimer <fw@deneb.enyo.de> wrote:

> * Mathieu Desnoyers:
>
> > /*
> >   * C) Check only rseq flags. 32 features at most. One mask and one
> >   * comparison.
> >   */
> >
> > void fC(void)
> > {
> >          if (likely(__rseq_flags & __RSEQ_FLAG_FEATURE_VM_VCPU_ID)) {
> >                  /* Use rseq with vcpu_id. */
> >                  asm volatile ("ud2\n\t");
> >          } else {
> >                  /* Fallback. */
> >                  asm volatile ("int3\n\t");
> >          }
>
> I think it has to be this because we cannot lower __rseq_flags below
> 32 now, not if rseq is active.
>
> If you don't find a better use fot the remaining 32 bits of padding,
> maybe put the PID or TID there, so that we can create a
> system-call-less version of getpid/gettid.  So the flag would just say
> that the padding is now completely used.
>
> Going forward, we can use the size increasing above 32 as a support
> indicator.
>
> > I can think of 4 approaches that applications will use to detect
> > availability of their specific rseq feature for each rseq critical
> > section:
> >
> > 1) Dynamically check whether the feature is implemented at runtime
> >     with conditional branches. Those using this approach will probably
> >     not want to have the overhead of the two comparisons in approach (A)
> >     above. Applications and libraries should probably use their own copy
> >     of the glibc symbols for speed purposes.
>

TCMalloc, which has an implementation of this, uses an offset to adjust
which field it reads (cpu_id versus vcpu_id).


> >
> > 2) Implement the entire function as IFUNC and select whether a rseq or
> >     non-rseq implementation should be used at C startup. The tradeoff
> >     here is code size vs speed, and using IFUNC for things like malloc
> >     may add additional constraints on the startup order.
>

IFUNC has significant performance overheads as well.  For frequently used
code (like memcpy), avoiding them has been an optimization for us (
https://research.google/pubs/pub50338/) even if it leaves some nominal
microbenchmark performance on the table.


>
> > 3) Code rewrite (dynamic code patching) between rseq and non-rseq code.
> >     This may be frowned upon in the security area and may not always be
> >     possible depending on the context.
> >
> > 3) JIT compilation of specialized rseq vs non-rseq code. Not generally
> >     available in C.
> >
> > I suspect that glibc may rely on approaches 1+2 depending on the
> > situation, and many applications may use approach (1) for simplicity
> > reasons.
>
> If the kernel does not currently overwrite the padding, glibc can do
> its own per-thread initialization there to support its malloc
> implementation (because the padding is undefined today from an
> application perspective).  That is, we would initialize these
> invisible vCPU IDs the same way we assign arenas today.  That would
> cover this specific malloc use case only, of course.
>

If a user program updates to a new kernel before glibc does, would it be
able to easily take advantage of it?

Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [libc-coord] Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-16 15:41   ` [libc-coord] " Chris Kennelly
@ 2022-09-16 21:32     ` Florian Weimer
  2022-09-17 11:51       ` Mathieu Desnoyers
  2022-09-20 11:51     ` Szabolcs Nagy
  1 sibling, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2022-09-16 21:32 UTC (permalink / raw)
  To: Chris Kennelly
  Cc: libc-coord, Mathieu Desnoyers, carlos, libc-alpha, szabolcs.nagy

* Chris Kennelly:

>> If the kernel does not currently overwrite the padding, glibc can do
>> its own per-thread initialization there to support its malloc
>> implementation (because the padding is undefined today from an
>> application perspective).  That is, we would initialize these
>> invisible vCPU IDs the same way we assign arenas today.  That would
>> cover this specific malloc use case only, of course.

> If a user program updates to a new kernel before glibc does, would it be
> able to easily take advantage of it?

No, as far as I understand it, there is presently no signaling from
kernel to applications that bypasses the rseq area registration.  So
the only thing you could do is to unregister and re-register with a
compatible value.  And that is of course quite undefined and assumes
that you can do this early enough during the life-time of each thread.

But if we have the extension handshake, I'll expect us to backport it
quite widely, after some time to verify that it works with CRIU etc.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [libc-coord] Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-16 21:32     ` Florian Weimer
@ 2022-09-17 11:51       ` Mathieu Desnoyers
  2022-09-17 14:45         ` Florian Weimer
  0 siblings, 1 reply; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-09-17 11:51 UTC (permalink / raw)
  To: Florian Weimer, Chris Kennelly
  Cc: libc-coord, carlos, libc-alpha, szabolcs.nagy

On 2022-09-16 23:32, Florian Weimer wrote:
> * Chris Kennelly:
> 
>>> If the kernel does not currently overwrite the padding, glibc can do
>>> its own per-thread initialization there to support its malloc
>>> implementation (because the padding is undefined today from an
>>> application perspective).  That is, we would initialize these
>>> invisible vCPU IDs the same way we assign arenas today.  That would
>>> cover this specific malloc use case only, of course.
> 
>> If a user program updates to a new kernel before glibc does, would it be
>> able to easily take advantage of it?
> 
> No, as far as I understand it, there is presently no signaling from
> kernel to applications that bypasses the rseq area registration.  So
> the only thing you could do is to unregister and re-register with a
> compatible value.  And that is of course quite undefined and assumes
> that you can do this early enough during the life-time of each thread.
> 
> But if we have the extension handshake, I'll expect us to backport it
> quite widely, after some time to verify that it works with CRIU etc.

I don't think this is what Chris is asking here.

I think the requirement here is to make sure that the extensibility 
scheme we come up with will allow to extend struct rseq simply by 
upgrading the kernel, without any need to upgrade glibc. (that's indeed 
a requirement of mine). So a new application and a new kernel can use a 
newly available extended field, even with an old glibc.

Let me bring an example of what I think would be a *bad* way to do 
things, just to show how we can shoot ourselves in the foot if we don't 
consider evolution of this ABI carefully.

Let's assume we expose a "rseq_feature_size" integer through 
getauxval(). This allows the kernel to tell glibc about the memory size 
required to hold all the rseq features. This is information that we 
_need_ to expose from the kernel to glibc. So if glibc decides to expose 
each new features through __rseq_flags bits (e.g. one bit per feature), 
then we run into a situation where for every new feature exposed by the 
kernel, glibc needs to know the mapping from feature size to feature bit 
before it can expose them to the rest of user-space, which goes against 
the requirement that we should be able to extend rseq features by simply 
upgrading the kernel, without needing to upgrade glibc as well every time.

So considering that the kernel needs to let glibc know how much memory 
to allocate for struct rseq, a getauxval() "rseq_feature_size" is 
needed. One approach we could consider to allow extending rseq features 
without upgrading glibc would be to expose an additional 
"rseq_feature_flags" getauxval(), which could then be used by glibc to 
populate its __rseq_flags symbol without prior knowledge of the 
feature-set. This could accommodate 32 features before we need to expose 
an additional __rseq_flags2 symbol.

Exposing a feature flag from the kernel through getauxval() would have 
the advantage to allow the kernel to "disable" some features in the 
future, e.g. if we want to deprecate a field. This comes with its own 
complexity though, as user-space could then not rely that when a feature 
is present, all prior feature fields are necessarily present, which 
therefore makes the testing matrix more complex. I personally don't see 
a need to deprecate rseq fields, but it might just be a lack of 
imagination on my part.

If we want to keep the kernel ABI as simple as we can, then we just 
expose the rseq feature size (and required alignment), and don't expose 
any rseq feature flags. This in turn means that glibc would have to 
somehow expose the rseq feature size in its ABI. If glibc decides 
against exposing this rseq feature size symbol, then it would be up to 
the application to combine information about __rseq_size and 
getauxval(rseq feature size) to figure out which fields are actually 
populated. It would "work", but chances are that some users will get it 
wrong. It seems simpler for a user to simply do:

if (__rseq_feature_size >= offsetofend(struct rseq, vm_vcpu_id))

to validate whether a given field is indeed populated.

The rseq feature size approach would scale to very large feature 
numbers. It would *not* allow deprecation of fields after they are 
published, but I see this as a gain in simplicity for users of the ABI, 
even though we lose a knob as kernel developers.

I think it's important that we consider both the kernel and libc ABIs if 
we want to make sure that we can extend the feature-set without having a 
mandatory glibc upgrade in the way every time we add a rseq feature.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [libc-coord] Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-17 11:51       ` Mathieu Desnoyers
@ 2022-09-17 14:45         ` Florian Weimer
  2022-09-17 15:25           ` Mathieu Desnoyers
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2022-09-17 14:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Chris Kennelly, libc-coord, carlos, libc-alpha, szabolcs.nagy

* Mathieu Desnoyers:

> On 2022-09-16 23:32, Florian Weimer wrote:
>> * Chris Kennelly:
>> 
>>>> If the kernel does not currently overwrite the padding, glibc can do
>>>> its own per-thread initialization there to support its malloc
>>>> implementation (because the padding is undefined today from an
>>>> application perspective).  That is, we would initialize these
>>>> invisible vCPU IDs the same way we assign arenas today.  That would
>>>> cover this specific malloc use case only, of course.
>> 
>>> If a user program updates to a new kernel before glibc does, would it be
>>> able to easily take advantage of it?
>> 
>> No, as far as I understand it, there is presently no signaling from
>> kernel to applications that bypasses the rseq area registration.  So
>> the only thing you could do is to unregister and re-register with a
>> compatible value.  And that is of course quite undefined and assumes
>> that you can do this early enough during the life-time of each thread.
>> 
>> But if we have the extension handshake, I'll expect us to backport it
>> quite widely, after some time to verify that it works with CRIU etc.
>
> I don't think this is what Chris is asking here.
>
> I think the requirement here is to make sure that the extensibility 
> scheme we come up with will allow to extend struct rseq simply by 
> upgrading the kernel, without any need to upgrade glibc. (that's indeed 
> a requirement of mine). So a new application and a new kernel can use a 
> newly available extended field, even with an old glibc.

I took it for granted that we'd like to get libc out of the picture
for future changes, so I interpreted Chris' question in the context of
the initial switch (i.e., enabling rseq features that need
extensibility on currently released glibc, without upgrading glibc).

> If we want to keep the kernel ABI as simple as we can, then we just 
> expose the rseq feature size (and required alignment), and don't expose 
> any rseq feature flags. This in turn means that glibc would have to 
> somehow expose the rseq feature size in its ABI. If glibc decides 
> against exposing this rseq feature size symbol, then it would be up to 
> the application to combine information about __rseq_size and 
> getauxval(rseq feature size) to figure out which fields are actually 
> populated. It would "work", but chances are that some users will get it 
> wrong. It seems simpler for a user to simply do:
>
> if (__rseq_feature_size >= offsetofend(struct rseq, vm_vcpu_id))
>
> to validate whether a given field is indeed populated.
>
> The rseq feature size approach would scale to very large feature 
> numbers. It would *not* allow deprecation of fields after they are 
> published, but I see this as a gain in simplicity for users of the ABI, 
> even though we lose a knob as kernel developers.

I think glibc can register rseq with a new flag once it sees
AT_RSEQ_FEATURE_SIZE in the auxiliary vector (even if it's 32).  That
flag would naturally end up in __rseq_flags.  For future extensions
__rseq_size should work directly.

But as I said, we better use all the padding at once during the first
step.  Or we could add even more stuff to move past the current 32,
then we wouldn't need the flag dance. 8-)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [libc-coord] Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-17 14:45         ` Florian Weimer
@ 2022-09-17 15:25           ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-09-17 15:25 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Chris Kennelly, libc-coord, carlos, libc-alpha, szabolcs.nagy

On 2022-09-17 16:45, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> On 2022-09-16 23:32, Florian Weimer wrote:
>>> * Chris Kennelly:
>>>
>>>>> If the kernel does not currently overwrite the padding, glibc can do
>>>>> its own per-thread initialization there to support its malloc
>>>>> implementation (because the padding is undefined today from an
>>>>> application perspective).  That is, we would initialize these
>>>>> invisible vCPU IDs the same way we assign arenas today.  That would
>>>>> cover this specific malloc use case only, of course.
>>>
>>>> If a user program updates to a new kernel before glibc does, would it be
>>>> able to easily take advantage of it?
>>>
>>> No, as far as I understand it, there is presently no signaling from
>>> kernel to applications that bypasses the rseq area registration.  So
>>> the only thing you could do is to unregister and re-register with a
>>> compatible value.  And that is of course quite undefined and assumes
>>> that you can do this early enough during the life-time of each thread.
>>>
>>> But if we have the extension handshake, I'll expect us to backport it
>>> quite widely, after some time to verify that it works with CRIU etc.
>>
>> I don't think this is what Chris is asking here.
>>
>> I think the requirement here is to make sure that the extensibility
>> scheme we come up with will allow to extend struct rseq simply by
>> upgrading the kernel, without any need to upgrade glibc. (that's indeed
>> a requirement of mine). So a new application and a new kernel can use a
>> newly available extended field, even with an old glibc.
> 
> I took it for granted that we'd like to get libc out of the picture
> for future changes, so I interpreted Chris' question in the context of
> the initial switch (i.e., enabling rseq features that need
> extensibility on currently released glibc, without upgrading glibc).

I think it makes sense to require that glibc needs to be upgraded to a 
new version before applications can use the feature fields beyond the 
initial 32 bytes. What I care about is that after we have an extensible 
scheme in glibc, then there is no need to upgrade glibc afterwards when 
additional features appear.

> 
>> If we want to keep the kernel ABI as simple as we can, then we just
>> expose the rseq feature size (and required alignment), and don't expose
>> any rseq feature flags. This in turn means that glibc would have to
>> somehow expose the rseq feature size in its ABI. If glibc decides
>> against exposing this rseq feature size symbol, then it would be up to
>> the application to combine information about __rseq_size and
>> getauxval(rseq feature size) to figure out which fields are actually
>> populated. It would "work", but chances are that some users will get it
>> wrong. It seems simpler for a user to simply do:
>>
>> if (__rseq_feature_size >= offsetofend(struct rseq, vm_vcpu_id))
>>
>> to validate whether a given field is indeed populated.
>>
>> The rseq feature size approach would scale to very large feature
>> numbers. It would *not* allow deprecation of fields after they are
>> published, but I see this as a gain in simplicity for users of the ABI,
>> even though we lose a knob as kernel developers.
> 
> I think glibc can register rseq with a new flag once it sees
> AT_RSEQ_FEATURE_SIZE in the auxiliary vector (even if it's 32).

I understand that you propose adding a rseq registration flag to be 
passed to the rseq system call when glibc supports extended rseq. I 
wonder whether it's useful at all. We implicitly know the relevant 
information through the rseq_len parameter of the rseq syscall.

If rseq_len==32, this means glibc allocated a 32-byte struct rseq area 
(either because it was the original structure size, or because 
getauxval() actually reported that at the supported feature size is <= 
32). The kernel is therefore free to use all fields below 32 bytes. This 
basically mean that we get 3 4-byte word fields available for user-space 
use right away without a glibc upgrade. The application just needs to 
use getauxval() to know whether the kernel populates those fields or not.

If rseq_len > 32, this means glibc used the getauxval() information to 
know the area size. Then the kernel can validate that the size is large 
enough to contain all features it supports upon rseq registration.

That
> flag would naturally end up in __rseq_flags.

I'm still unsure that we need to pass a flag to the kernel at all.

   For future extensions
> __rseq_size should work directly.

So for extensions in the last 3 4-byte words of struct rseq (currently 
padding), applications would have to check both __rseq_flags and compare 
the offset of the end of the field with __rseq_size.

Then for fields beyond the 32 first bytes, just checking the __rseq_size 
would suffice. However, we should be careful that the semantic of 
__rseq_size should *not* include padding anymore, but rather end exactly 
after the last supported feature field.

> 
> But as I said, we better use all the padding at once during the first
> step.  Or we could add even more stuff to move past the current 32,
> then we wouldn't need the flag dance. 8-)

Adding semi-useful information in those words means we put a hard 
requirement on users to upgrade their glibc before they can access the 
more important feature fields we would rather like to make available first.

So with your proposal, the application-level users would be expected to 
do something like this:

static uint32_t local_rseq_feature_size;

if (__rseq_flags & RSEQ_FLAG_EXTENDED)
	local_rseq_feature_size = __rseq_size;
else
	local_rseq_feature_size = 20; /* after last orig. field */

and then use:

if (local_rseq_feature_size >= offsetofend(struct rseq, field))

as check for feature availability. I just find it more likely that users 
may get it wrong compared to having the rseq feature size already 
available in a new libc __rseq_feature_size symbol. On the upside there 
is then no need to export an additional symbol.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [libc-coord] Re: RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size
  2022-09-16 15:41   ` [libc-coord] " Chris Kennelly
  2022-09-16 21:32     ` Florian Weimer
@ 2022-09-20 11:51     ` Szabolcs Nagy
  1 sibling, 0 replies; 9+ messages in thread
From: Szabolcs Nagy @ 2022-09-20 11:51 UTC (permalink / raw)
  To: Chris Kennelly; +Cc: libc-coord, Mathieu Desnoyers, carlos, libc-alpha

The 09/16/2022 11:41, Chris Kennelly wrote:
> > >
> > > 2) Implement the entire function as IFUNC and select whether a rseq or
> > >     non-rseq implementation should be used at C startup. The tradeoff
> > >     here is code size vs speed, and using IFUNC for things like malloc
> > >     may add additional constraints on the startup order.
> >
> 
> IFUNC has significant performance overheads as well.  For frequently used
> code (like memcpy), avoiding them has been an optimization for us (
> https://research.google/pubs/pub50338/) even if it leaves some nominal
> microbenchmark performance on the table.

only in static linked exe. which is not the important case for distros
that want the ability to update the libc with security fixes.

glibc is not optimized for static linking, if that's a requirement that
should be a separate discussion (it's not just about ifunc).

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-09-20 11:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16 14:36 RSEQ symbols: __rseq_size, __rseq_flags vs __rseq_feature_size Mathieu Desnoyers
2022-09-16 14:44 ` Mathieu Desnoyers
2022-09-16 15:29 ` Florian Weimer
2022-09-16 15:41   ` [libc-coord] " Chris Kennelly
2022-09-16 21:32     ` Florian Weimer
2022-09-17 11:51       ` Mathieu Desnoyers
2022-09-17 14:45         ` Florian Weimer
2022-09-17 15:25           ` Mathieu Desnoyers
2022-09-20 11:51     ` Szabolcs Nagy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).