[PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863]

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863]
@ 2022-12-14  0:11 Noah Goldstein
  2022-12-14  2:41 ` Carlos O'Donell
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Noah Goldstein @ 2022-12-14  0:11 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos

In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
are concurrently modified as `memcmp` runs, there can be a SIG11 in
`L(ret_nonzero_vec_end_0)` because the sequential logic assumes
that `(rdx - 32 + rax)` is a positive 32-bit integer.

To be clear, this "fix" does not mean this usage of `memcmp` is
supported. `memcmp` is incorrect when the values of `a` and/or `b`
are modified while its running, and that incorrectness may manifest
itself as a SIG-11. That being said, if we can make the results
less dramatic with no cost to regular uses cases, there is no harm
in doing so.

The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
`addq %rdx, %rax`. The 1-extra byte of code size from using the
64-bit instruction doesn't contribute to overall code size as the
next target is aligned and has multiple bytes of `nop` padding
before it. As well all the logic between the add and `ret` still
fits in the same fetch block, so the cost of this change is
basically zero.

The sequential logic makes the assume behind the following code:
```
    /*
     * rsi = a
     * rdi = b
     * rdx = len - 32
     */
    /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
    in this case, this check is also assume to cover a[0:(31 - len)]
    and b[0:(31 - len)].  */
	movups	(%rsi), %xmm0
	movups	(%rdi), %xmm1
	PCMPEQ	%xmm0, %xmm1
	pmovmskb %xmm1, %eax
	subl	%ecx, %eax
	jnz	L(END_NEQ)

    /* cmp a[len-16:len-1] and b[len-16:len-1].  */
    movups	16(%rsi, %rdx), %xmm0
	movups	16(%rdi, %rdx), %xmm1
	PCMPEQ	%xmm0, %xmm1
	pmovmskb %xmm1, %eax
	subl	%ecx, %eax
	jnz	L(END_NEQ2)
    ret

L(END2):
    /* Position first mismatch.  */
    bsfl %eax, %eax

    /* BUG IS FROM THIS. The sequential version is able to assume this
    value is a positive 32-bit value because first check included
    bytes in range a[0:(31 - len)], b[0:(31 - len)] so `eax` must be
    greater than `31 - len` so the minimum value of `edx` + `eax` is
    `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
    `a` or `b` could have been changed so a mismatch in `eax` less or
    equal than `(31 - len)` is possible (the new low bound in `(16 -
    len)`. This can result in a negative 32-bit signed integer, which
    when non-sign extended to 64-bits is a random large value out of
    bounds. */
    addl %edx, %eax

    /* Crash here because 32-bit negative number in `eax` non-sign
    extends to out of bounds 64-bit offset.  */
    movzbl 16(%rdi, %rax), %ecx
    movzbl 16(%rsi, %rax), %eax
```

This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
`addq %rdx, %rax`). This prevent the 32-bit non-sign extension
and since `eax` still a low bound of `16 - len` the `rdx + rax`
is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
fixed offset of `16` in the memory access this must be inbounds.
---
 sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
index afd450d020..34e60e567d 100644
--- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
+++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
@@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
 	setg	%dl
 	leal	-1(%rdx, %rdx), %eax
 #  else
-	addl	%edx, %eax
+	addq	%rdx, %rax
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
 	subl	%ecx, %eax
-- 
2.34.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  0:11 [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863] Noah Goldstein
@ 2022-12-14  2:41 ` Carlos O'Donell
  2022-12-14  9:04   ` Andreas Schwab
  2022-12-14  3:25 ` [PATCH v2] x86: Prevent SIGSEGV " Noah Goldstein
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Carlos O'Donell @ 2022-12-14  2:41 UTC (permalink / raw)
  To: Noah Goldstein, libc-alpha; +Cc: hjl.tools, carlos

Please post a v2. Thanks!

Subject: x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]

Replaces SIG11 with SIGSEGV (the documented name of the signal).

On 12/13/22 19:11, Noah Goldstein via Libc-alpha wrote:
> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> are concurrently modified as `memcmp` runs, there can be a SIG11 in

s/SIG11/SIGSEGV/g

> `L(ret_nonzero_vec_end_0)` because the sequential logic assumes
> that `(rdx - 32 + rax)` is a positive 32-bit integer.
> 
> To be clear, this "fix" does not mean this usage of `memcmp` is
> supported. `memcmp` is incorrect when the values of `a` and/or `b`
> are modified while its running, and that incorrectness may manifest
> itself as a SIG-11. That being said, if we can make the results

s/SIG-11/SIGSEGV/g

> less dramatic with no cost to regular uses cases, there is no harm
> in doing so.

I agree that a user focused change like this is going to be a balance between
keeping it working for an unsupported use case versus the cost to the library.
Given that you've found a low-cost way to support the incorrect but idiomatic
use case then I have no sustained objections to this patch. However, this won't
be the last we hear of this as we continue down the path of optimizing against
a well defined memory model.

> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> 64-bit instruction doesn't contribute to overall code size as the
> next target is aligned and has multiple bytes of `nop` padding
> before it. As well all the logic between the add and `ret` still
> fits in the same fetch block, so the cost of this change is
> basically zero.

OK.
 
> The sequential logic makes the assume behind the following code:

Suggest:
The relevant sequential logic can be seen in the following code:

> ```
>     /*
>      * rsi = a
>      * rdi = b
>      * rdx = len - 32
>      */
>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>     in this case, this check is also assume to cover a[0:(31 - len)]

s/assume/assumed/g

>     and b[0:(31 - len)].  */
> 	movups	(%rsi), %xmm0
> 	movups	(%rdi), %xmm1
> 	PCMPEQ	%xmm0, %xmm1
> 	pmovmskb %xmm1, %eax
> 	subl	%ecx, %eax
> 	jnz	L(END_NEQ)
> 
>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>     movups	16(%rsi, %rdx), %xmm0
> 	movups	16(%rdi, %rdx), %xmm1
> 	PCMPEQ	%xmm0, %xmm1
> 	pmovmskb %xmm1, %eax
> 	subl	%ecx, %eax
> 	jnz	L(END_NEQ2)
>     ret
> 
> L(END2):
>     /* Position first mismatch.  */
>     bsfl %eax, %eax
> 
>     /* BUG IS FROM THIS. The sequential version is able to assume this

s/BUG IS FROM THIS. //g

>     value is a positive 32-bit value because first check included

s/because first/because the first/g

>     bytes in range a[0:(31 - len)], b[0:(31 - len)] so `eax` must be

s/,/ and/g

>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>     `a` or `b` could have been changed so a mismatch in `eax` less or
>     equal than `(31 - len)` is possible (the new low bound in `(16 -

s/in/is/g

>     len)`. This can result in a negative 32-bit signed integer, which
>     when non-sign extended to 64-bits is a random large value out of

s/out of/that is out of/g

>     bounds. */
>     addl %edx, %eax
> 
>     /* Crash here because 32-bit negative number in `eax` non-sign
>     extends to out of bounds 64-bit offset.  */
>     movzbl 16(%rdi, %rax), %ecx
>     movzbl 16(%rsi, %rax), %eax
> ```
> 
> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> `addq %rdx, %rax`). This prevent the 32-bit non-sign extension

s/prevent/prevents/g

> and since `eax` still a low bound of `16 - len` the `rdx + rax`

s/still/is still/g

> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> fixed offset of `16` in the memory access this must be inbounds.

s/inbounds/in bounds/g

> ---
>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index afd450d020..34e60e567d 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>  	setg	%dl
>  	leal	-1(%rdx, %rdx), %eax
>  #  else
> -	addl	%edx, %eax
> +	addq	%rdx, %rax

OK. 64-bit addq.

>  	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>  	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>  	subl	%ecx, %eax

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  0:11 [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863] Noah Goldstein
  2022-12-14  2:41 ` Carlos O'Donell
@ 2022-12-14  3:25 ` Noah Goldstein
  2022-12-14  4:42   ` Carlos O'Donell
  2022-12-14  7:36 ` [PATCH v3] " Noah Goldstein
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Noah Goldstein @ 2022-12-14  3:25 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos

In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
are concurrently modified as `memcmp` runs, there can be a SIGSEGV
in `L(ret_nonzero_vec_end_0)` because the sequential logic
assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.

To be clear, this "fix" does not mean this usage of `memcmp` is
supported. `memcmp` is incorrect when the values of `a` and/or `b`
are modified while its running, and that incorrectness may manifest
itself as a SIGSEGV. That being said, if we can make the results
less dramatic with no cost to regular use cases, there is no harm
in doing so.

The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
`addq %rdx, %rax`. The 1-extra byte of code size from using the
64-bit instruction doesn't contribute to overall code size as the
next target is aligned and has multiple bytes of `nop` padding
before it. As well all the logic between the add and `ret` still
fits in the same fetch block, so the cost of this change is
basically zero.

The relevant sequential logic can be seen in the following
pseudo-code:
```
    /*
     * rsi = a
     * rdi = b
     * rdx = len - 32
     */
    /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
    in this case, this check is also assumed to cover a[0:(31 - len)]
    and b[0:(31 - len)].  */
	movups	(%rsi), %xmm0
	movups	(%rdi), %xmm1
	PCMPEQ	%xmm0, %xmm1
	pmovmskb %xmm1, %eax
	subl	%ecx, %eax
	jnz	L(END_NEQ)

    /* cmp a[len-16:len-1] and b[len-16:len-1].  */
    movups	16(%rsi, %rdx), %xmm0
	movups	16(%rdi, %rdx), %xmm1
	PCMPEQ	%xmm0, %xmm1
	pmovmskb %xmm1, %eax
	subl	%ecx, %eax
	jnz	L(END_NEQ2)
    ret

L(END2):
    /* Position first mismatch.  */
    bsfl %eax, %eax

    /* The sequential version is able to assume this value is a
    positive 32-bit value because the first check included bytes in
    range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
    greater than `31 - len` so the minimum value of `edx` + `eax` is
    `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
    `a` or `b` could have been changed so a mismatch in `eax` less or
    equal than `(31 - len)` is possible (the new low bound is `(16 -
    len)`. This can result in a negative 32-bit signed integer, which
    when non-sign extended to 64-bits is a random large value this is
    out of bounds. */ addl %edx, %eax

    /* Crash here because 32-bit negative number in `eax` non-sign
    extends to out of bounds 64-bit offset.  */
    movzbl 16(%rdi, %rax), %ecx
    movzbl 16(%rsi, %rax), %eax
```

This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
`addq %rdx, %rax`). This prevents the 32-bit non-sign extension
and since `eax` is still a low bound of `16 - len` the `rdx + rax`
is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
fixed offset of `16` in the memory access this must be in bounds.
---
 sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
index afd450d020..34e60e567d 100644
--- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
+++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
@@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
 	setg	%dl
 	leal	-1(%rdx, %rdx), %eax
 #  else
-	addl	%edx, %eax
+	addq	%rdx, %rax
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
 	subl	%ecx, %eax
-- 
2.34.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  3:25 ` [PATCH v2] x86: Prevent SIGSEGV " Noah Goldstein
@ 2022-12-14  4:42   ` Carlos O'Donell
  0 siblings, 0 replies; 16+ messages in thread
From: Carlos O'Donell @ 2022-12-14  4:42 UTC (permalink / raw)
  To: Noah Goldstein, libc-alpha; +Cc: hjl.tools, carlos

OK, last wordsmith pass, I promise.

Please post a v3 and I'll ACK that.

On 12/13/22 22:25, Noah Goldstein via Libc-alpha wrote:
> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> in `L(ret_nonzero_vec_end_0)` because the sequential logic
> assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
> 
> To be clear, this "fix" does not mean this usage of `memcmp` is
> supported. `memcmp` is incorrect when the values of `a` and/or `b`
> are modified while its running, and that incorrectness may manifest
> itself as a SIGSEGV. That being said, if we can make the results
> less dramatic with no cost to regular use cases, there is no harm
> in doing so.

Suggest:

To be clear, this change does not mean the usage of `memcmp` is supported.
The program behaviour is undefined (UB) in the presence of data races, and
`memcmp` is incorrect when the values of `a` and/or `b` are modified
concurrently (data race). This UB may manifest itself as a SIGSEGV. That
being said, if we can allow the idiomatic use cases, like those in yottadb
with opportunistic concurrency control (OCC), to execute without a SIGSEGV,
at no cost to regular use cases, then we can aim to minimize harm to those
existing users.

> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> 64-bit instruction doesn't contribute to overall code size as the
> next target is aligned and has multiple bytes of `nop` padding
> before it. As well all the logic between the add and `ret` still
> fits in the same fetch block, so the cost of this change is
> basically zero.

OK.
 
> The relevant sequential logic can be seen in the following
> pseudo-code:
> ```
>     /*
>      * rsi = a
>      * rdi = b
>      * rdx = len - 32
>      */
>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>     in this case, this check is also assumed to cover a[0:(31 - len)]
>     and b[0:(31 - len)].  */
> 	movups	(%rsi), %xmm0
> 	movups	(%rdi), %xmm1
> 	PCMPEQ	%xmm0, %xmm1
> 	pmovmskb %xmm1, %eax
> 	subl	%ecx, %eax
> 	jnz	L(END_NEQ)
> 
>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>     movups	16(%rsi, %rdx), %xmm0
> 	movups	16(%rdi, %rdx), %xmm1
> 	PCMPEQ	%xmm0, %xmm1
> 	pmovmskb %xmm1, %eax
> 	subl	%ecx, %eax
> 	jnz	L(END_NEQ2)
>     ret
> 
> L(END2):
>     /* Position first mismatch.  */
>     bsfl %eax, %eax
> 
>     /* The sequential version is able to assume this value is a
>     positive 32-bit value because the first check included bytes in
>     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>     `a` or `b` could have been changed so a mismatch in `eax` less or
>     equal than `(31 - len)` is possible (the new low bound is `(16 -
>     len)`. This can result in a negative 32-bit signed integer, which
>     when non-sign extended to 64-bits is a random large value this is

s/this/that/g

>     out of bounds. */ addl %edx, %eax

Please put the addl on a new line.

> 
>     /* Crash here because 32-bit negative number in `eax` non-sign
>     extends to out of bounds 64-bit offset.  */
>     movzbl 16(%rdi, %rax), %ecx
>     movzbl 16(%rsi, %rax), %eax
> ```
> 
> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> `addq %rdx, %rax`). This prevents the 32-bit non-sign extension
> and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> fixed offset of `16` in the memory access this must be in bounds.
> ---
>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index afd450d020..34e60e567d 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>  	setg	%dl
>  	leal	-1(%rdx, %rdx), %eax
>  #  else
> -	addl	%edx, %eax
> +	addq	%rdx, %rax

OK.

>  	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>  	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>  	subl	%ecx, %eax

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  0:11 [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863] Noah Goldstein
  2022-12-14  2:41 ` Carlos O'Donell
  2022-12-14  3:25 ` [PATCH v2] x86: Prevent SIGSEGV " Noah Goldstein
@ 2022-12-14  7:36 ` Noah Goldstein
  2022-12-14 16:09   ` H.J. Lu
  2022-12-14 16:57   ` Carlos O'Donell
  2022-12-14 16:07 ` [PATCH v1] x86: Prevent SIG11 " H.J. Lu
  2022-12-14 18:52 ` [PATCH v4] x86: Prevent SIGSEGV " Noah Goldstein
  4 siblings, 2 replies; 16+ messages in thread
From: Noah Goldstein @ 2022-12-14  7:36 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos

In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
are concurrently modified as `memcmp` runs, there can be a SIGSEGV
in `L(ret_nonzero_vec_end_0)` because the sequential logic
assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.

To be clear, this change does not mean the usage of `memcmp` is
supported.  The program behaviour is undefined (UB) in the
presence of data races, and `memcmp` is incorrect when the values
of `a` and/or `b` are modified concurrently (data race). This UB
may manifest itself as a SIGSEGV. That being said, if we can
allow the idiomatic use cases, like those in yottadb with
opportunistic concurrency control (OCC), to execute without a
SIGSEGV, at no cost to regular use cases, then we can aim to
minimize harm to those existing users.

The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
`addq %rdx, %rax`. The 1-extra byte of code size from using the
64-bit instruction doesn't contribute to overall code size as the
next target is aligned and has multiple bytes of `nop` padding
before it. As well all the logic between the add and `ret` still
fits in the same fetch block, so the cost of this change is
basically zero.

The relevant sequential logic can be seen in the following
pseudo-code:
```
    /*
     * rsi = a
     * rdi = b
     * rdx = len - 32
     */
    /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
    in this case, this check is also assumed to cover a[0:(31 - len)]
    and b[0:(31 - len)].  */
    movups  (%rsi), %xmm0
    movups  (%rdi), %xmm1
    PCMPEQ  %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    subl    %ecx, %eax
    jnz L(END_NEQ)

    /* cmp a[len-16:len-1] and b[len-16:len-1].  */
    movups  16(%rsi, %rdx), %xmm0
    movups  16(%rdi, %rdx), %xmm1
    PCMPEQ  %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    subl    %ecx, %eax
    jnz L(END_NEQ2)
    ret

L(END2):
    /* Position first mismatch.  */
    bsfl    %eax, %eax

    /* The sequential version is able to assume this value is a
    positive 32-bit value because the first check included bytes in
    range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
    greater than `31 - len` so the minimum value of `edx` + `eax` is
    `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
    `a` or `b` could have been changed so a mismatch in `eax` less or
    equal than `(31 - len)` is possible (the new low bound is `(16 -
    len)`. This can result in a negative 32-bit signed integer, which
    when zero extended to 64-bits is a random large value this out
    out of bounds. */
    addl %edx, %eax

    /* Crash here because 32-bit negative number in `eax` zero
    extends to out of bounds 64-bit offset.  */
    movzbl  16(%rdi, %rax), %ecx
    movzbl  16(%rsi, %rax), %eax
```

This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
`addq %rdx, %rax`). This prevents the 32-bit zero extension
and since `eax` is still a low bound of `16 - len` the `rdx + rax`
is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
fixed offset of `16` in the memory access this must be in bounds.
---
 sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
index afd450d020..34e60e567d 100644
--- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
+++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
@@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
 	setg	%dl
 	leal	-1(%rdx, %rdx), %eax
 #  else
-	addl	%edx, %eax
+	addq	%rdx, %rax
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
 	subl	%ecx, %eax
-- 
2.34.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  2:41 ` Carlos O'Donell
@ 2022-12-14  9:04   ` Andreas Schwab
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Schwab @ 2022-12-14  9:04 UTC (permalink / raw)
  To: Carlos O'Donell via Libc-alpha
  Cc: Noah Goldstein, Carlos O'Donell, hjl.tools, carlos

On Dez 13 2022, Carlos O'Donell via Libc-alpha wrote:

> Please post a v2. Thanks!
>
> Subject: x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
>
> Replaces SIG11 with SIGSEGV (the documented name of the signal).

Even better: out-of-bounds access (which can manifest in multitude of
different ways).

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  0:11 [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863] Noah Goldstein
                   ` (2 preceding siblings ...)
  2022-12-14  7:36 ` [PATCH v3] " Noah Goldstein
@ 2022-12-14 16:07 ` H.J. Lu
  2022-12-14 18:52 ` [PATCH v4] x86: Prevent SIGSEGV " Noah Goldstein
  4 siblings, 0 replies; 16+ messages in thread
From: H.J. Lu @ 2022-12-14 16:07 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, carlos

On Tue, Dec 13, 2022 at 4:12 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> are concurrently modified as `memcmp` runs, there can be a SIG11 in
> `L(ret_nonzero_vec_end_0)` because the sequential logic assumes
> that `(rdx - 32 + rax)` is a positive 32-bit integer.
>
> To be clear, this "fix" does not mean this usage of `memcmp` is
> supported. `memcmp` is incorrect when the values of `a` and/or `b`
> are modified while its running, and that incorrectness may manifest
> itself as a SIG-11. That being said, if we can make the results
> less dramatic with no cost to regular uses cases, there is no harm
> in doing so.
>
> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> 64-bit instruction doesn't contribute to overall code size as the
> next target is aligned and has multiple bytes of `nop` padding
> before it. As well all the logic between the add and `ret` still
> fits in the same fetch block, so the cost of this change is
> basically zero.
>
> The sequential logic makes the assume behind the following code:
> ```
>     /*
>      * rsi = a
>      * rdi = b
>      * rdx = len - 32
>      */
>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>     in this case, this check is also assume to cover a[0:(31 - len)]
>     and b[0:(31 - len)].  */
>         movups  (%rsi), %xmm0
>         movups  (%rdi), %xmm1
>         PCMPEQ  %xmm0, %xmm1
>         pmovmskb %xmm1, %eax
>         subl    %ecx, %eax
>         jnz     L(END_NEQ)
>
>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>     movups      16(%rsi, %rdx), %xmm0
>         movups  16(%rdi, %rdx), %xmm1
>         PCMPEQ  %xmm0, %xmm1
>         pmovmskb %xmm1, %eax
>         subl    %ecx, %eax
>         jnz     L(END_NEQ2)
>     ret
>
> L(END2):
>     /* Position first mismatch.  */
>     bsfl %eax, %eax
>
>     /* BUG IS FROM THIS. The sequential version is able to assume this
>     value is a positive 32-bit value because first check included
>     bytes in range a[0:(31 - len)], b[0:(31 - len)] so `eax` must be
>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>     `a` or `b` could have been changed so a mismatch in `eax` less or
>     equal than `(31 - len)` is possible (the new low bound in `(16 -
>     len)`. This can result in a negative 32-bit signed integer, which
>     when non-sign extended to 64-bits is a random large value out of
>     bounds. */
>     addl %edx, %eax
>
>     /* Crash here because 32-bit negative number in `eax` non-sign
>     extends to out of bounds 64-bit offset.  */
>     movzbl 16(%rdi, %rax), %ecx
>     movzbl 16(%rsi, %rax), %eax
> ```
>
> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> `addq %rdx, %rax`). This prevent the 32-bit non-sign extension
> and since `eax` still a low bound of `16 - len` the `rdx + rax`
> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> fixed offset of `16` in the memory access this must be inbounds.
> ---
>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index afd450d020..34e60e567d 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>         setg    %dl
>         leal    -1(%rdx, %rdx), %eax
>  #  else
> -       addl    %edx, %eax
> +       addq    %rdx, %rax

Please add some comments here and also include the testcase.

>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>         subl    %ecx, %eax
> --
> 2.34.1
>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  7:36 ` [PATCH v3] " Noah Goldstein
@ 2022-12-14 16:09   ` H.J. Lu
  2022-12-14 16:57     ` Carlos O'Donell
  2022-12-14 18:52     ` Noah Goldstein
  2022-12-14 16:57   ` Carlos O'Donell
  1 sibling, 2 replies; 16+ messages in thread
From: H.J. Lu @ 2022-12-14 16:09 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, carlos

On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> in `L(ret_nonzero_vec_end_0)` because the sequential logic
> assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
>
> To be clear, this change does not mean the usage of `memcmp` is
> supported.  The program behaviour is undefined (UB) in the
> presence of data races, and `memcmp` is incorrect when the values
> of `a` and/or `b` are modified concurrently (data race). This UB
> may manifest itself as a SIGSEGV. That being said, if we can
> allow the idiomatic use cases, like those in yottadb with
> opportunistic concurrency control (OCC), to execute without a
> SIGSEGV, at no cost to regular use cases, then we can aim to
> minimize harm to those existing users.
>
> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> 64-bit instruction doesn't contribute to overall code size as the
> next target is aligned and has multiple bytes of `nop` padding
> before it. As well all the logic between the add and `ret` still
> fits in the same fetch block, so the cost of this change is
> basically zero.
>
> The relevant sequential logic can be seen in the following
> pseudo-code:
> ```
>     /*
>      * rsi = a
>      * rdi = b
>      * rdx = len - 32
>      */
>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>     in this case, this check is also assumed to cover a[0:(31 - len)]
>     and b[0:(31 - len)].  */
>     movups  (%rsi), %xmm0
>     movups  (%rdi), %xmm1
>     PCMPEQ  %xmm0, %xmm1
>     pmovmskb %xmm1, %eax
>     subl    %ecx, %eax
>     jnz L(END_NEQ)
>
>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>     movups  16(%rsi, %rdx), %xmm0
>     movups  16(%rdi, %rdx), %xmm1
>     PCMPEQ  %xmm0, %xmm1
>     pmovmskb %xmm1, %eax
>     subl    %ecx, %eax
>     jnz L(END_NEQ2)
>     ret
>
> L(END2):
>     /* Position first mismatch.  */
>     bsfl    %eax, %eax
>
>     /* The sequential version is able to assume this value is a
>     positive 32-bit value because the first check included bytes in
>     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>     `a` or `b` could have been changed so a mismatch in `eax` less or
>     equal than `(31 - len)` is possible (the new low bound is `(16 -
>     len)`. This can result in a negative 32-bit signed integer, which
>     when zero extended to 64-bits is a random large value this out
>     out of bounds. */
>     addl %edx, %eax
>
>     /* Crash here because 32-bit negative number in `eax` zero
>     extends to out of bounds 64-bit offset.  */
>     movzbl  16(%rdi, %rax), %ecx
>     movzbl  16(%rsi, %rax), %eax
> ```
>
> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> `addq %rdx, %rax`). This prevents the 32-bit zero extension
> and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> fixed offset of `16` in the memory access this must be in bounds.
> ---
>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index afd450d020..34e60e567d 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>         setg    %dl
>         leal    -1(%rdx, %rdx), %eax
>  #  else
> -       addl    %edx, %eax
> +       addq    %rdx, %rax

Please add some comments here and a testcase.

>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>         subl    %ecx, %eax
> --
> 2.34.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  7:36 ` [PATCH v3] " Noah Goldstein
  2022-12-14 16:09   ` H.J. Lu
@ 2022-12-14 16:57   ` Carlos O'Donell
  1 sibling, 0 replies; 16+ messages in thread
From: Carlos O'Donell @ 2022-12-14 16:57 UTC (permalink / raw)
  To: Noah Goldstein, libc-alpha; +Cc: hjl.tools, carlos

On 12/14/22 02:36, Noah Goldstein via Libc-alpha wrote:
> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> in `L(ret_nonzero_vec_end_0)` because the sequential logic
> assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
> 
> To be clear, this change does not mean the usage of `memcmp` is
> supported.  The program behaviour is undefined (UB) in the
> presence of data races, and `memcmp` is incorrect when the values
> of `a` and/or `b` are modified concurrently (data race). This UB
> may manifest itself as a SIGSEGV. That being said, if we can
> allow the idiomatic use cases, like those in yottadb with
> opportunistic concurrency control (OCC), to execute without a
> SIGSEGV, at no cost to regular use cases, then we can aim to
> minimize harm to those existing users.
> 
> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> 64-bit instruction doesn't contribute to overall code size as the
> next target is aligned and has multiple bytes of `nop` padding
> before it. As well all the logic between the add and `ret` still
> fits in the same fetch block, so the cost of this change is
> basically zero.
> 
> The relevant sequential logic can be seen in the following
> pseudo-code:
> ```
>     /*
>      * rsi = a
>      * rdi = b
>      * rdx = len - 32
>      */
>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>     in this case, this check is also assumed to cover a[0:(31 - len)]
>     and b[0:(31 - len)].  */
>     movups  (%rsi), %xmm0
>     movups  (%rdi), %xmm1
>     PCMPEQ  %xmm0, %xmm1
>     pmovmskb %xmm1, %eax
>     subl    %ecx, %eax
>     jnz L(END_NEQ)
> 
>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>     movups  16(%rsi, %rdx), %xmm0
>     movups  16(%rdi, %rdx), %xmm1
>     PCMPEQ  %xmm0, %xmm1
>     pmovmskb %xmm1, %eax
>     subl    %ecx, %eax
>     jnz L(END_NEQ2)
>     ret
> 
> L(END2):
>     /* Position first mismatch.  */
>     bsfl    %eax, %eax
> 
>     /* The sequential version is able to assume this value is a
>     positive 32-bit value because the first check included bytes in
>     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>     `a` or `b` could have been changed so a mismatch in `eax` less or
>     equal than `(31 - len)` is possible (the new low bound is `(16 -
>     len)`. This can result in a negative 32-bit signed integer, which
>     when zero extended to 64-bits is a random large value this out
>     out of bounds. */
>     addl %edx, %eax
> 
>     /* Crash here because 32-bit negative number in `eax` zero
>     extends to out of bounds 64-bit offset.  */
>     movzbl  16(%rdi, %rax), %ecx
>     movzbl  16(%rsi, %rax), %eax
> ```
> 
> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> `addq %rdx, %rax`). This prevents the 32-bit zero extension
> and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> fixed offset of `16` in the memory access this must be in bounds.
> ---

v3 LGTM.

I commented downthread that I do not suggest a test case for this because
such a test case would be testing for a specific behaviour in the UB case.
As of today we do not have consensus that under UB we want the string or
memory operations have a specific kind of behaviour. My opinion is that
requiring a specific behaviour e.g. no SIGSEGV, would limit the algorithmic
choices available, and that is too heavy a cost to pay without further
strong justification.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index afd450d020..34e60e567d 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>  	setg	%dl
>  	leal	-1(%rdx, %rdx), %eax
>  #  else
> -	addl	%edx, %eax
> +	addq	%rdx, %rax
>  	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>  	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>  	subl	%ecx, %eax

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14 16:09   ` H.J. Lu
@ 2022-12-14 16:57     ` Carlos O'Donell
  2022-12-14 17:18       ` Noah Goldstein
  2022-12-14 18:52     ` Noah Goldstein
  1 sibling, 1 reply; 16+ messages in thread
From: Carlos O'Donell @ 2022-12-14 16:57 UTC (permalink / raw)
  To: H.J. Lu, Noah Goldstein; +Cc: libc-alpha, carlos

On 12/14/22 11:09, H.J. Lu via Libc-alpha wrote:
> On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
>> are concurrently modified as `memcmp` runs, there can be a SIGSEGV
>> in `L(ret_nonzero_vec_end_0)` because the sequential logic
>> assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
>>
>> To be clear, this change does not mean the usage of `memcmp` is
>> supported.  The program behaviour is undefined (UB) in the
>> presence of data races, and `memcmp` is incorrect when the values
>> of `a` and/or `b` are modified concurrently (data race). This UB
>> may manifest itself as a SIGSEGV. That being said, if we can
>> allow the idiomatic use cases, like those in yottadb with
>> opportunistic concurrency control (OCC), to execute without a
>> SIGSEGV, at no cost to regular use cases, then we can aim to
>> minimize harm to those existing users.
>>
>> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
>> `addq %rdx, %rax`. The 1-extra byte of code size from using the
>> 64-bit instruction doesn't contribute to overall code size as the
>> next target is aligned and has multiple bytes of `nop` padding
>> before it. As well all the logic between the add and `ret` still
>> fits in the same fetch block, so the cost of this change is
>> basically zero.
>>
>> The relevant sequential logic can be seen in the following
>> pseudo-code:
>> ```
>>     /*
>>      * rsi = a
>>      * rdi = b
>>      * rdx = len - 32
>>      */
>>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>>     in this case, this check is also assumed to cover a[0:(31 - len)]
>>     and b[0:(31 - len)].  */
>>     movups  (%rsi), %xmm0
>>     movups  (%rdi), %xmm1
>>     PCMPEQ  %xmm0, %xmm1
>>     pmovmskb %xmm1, %eax
>>     subl    %ecx, %eax
>>     jnz L(END_NEQ)
>>
>>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>>     movups  16(%rsi, %rdx), %xmm0
>>     movups  16(%rdi, %rdx), %xmm1
>>     PCMPEQ  %xmm0, %xmm1
>>     pmovmskb %xmm1, %eax
>>     subl    %ecx, %eax
>>     jnz L(END_NEQ2)
>>     ret
>>
>> L(END2):
>>     /* Position first mismatch.  */
>>     bsfl    %eax, %eax
>>
>>     /* The sequential version is able to assume this value is a
>>     positive 32-bit value because the first check included bytes in
>>     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
>>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>>     `a` or `b` could have been changed so a mismatch in `eax` less or
>>     equal than `(31 - len)` is possible (the new low bound is `(16 -
>>     len)`. This can result in a negative 32-bit signed integer, which
>>     when zero extended to 64-bits is a random large value this out
>>     out of bounds. */
>>     addl %edx, %eax
>>
>>     /* Crash here because 32-bit negative number in `eax` zero
>>     extends to out of bounds 64-bit offset.  */
>>     movzbl  16(%rdi, %rax), %ecx
>>     movzbl  16(%rsi, %rax), %eax
>> ```
>>
>> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
>> `addq %rdx, %rax`). This prevents the 32-bit zero extension
>> and since `eax` is still a low bound of `16 - len` the `rdx + rax`
>> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
>> fixed offset of `16` in the memory access this must be in bounds.
>> ---
>>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
>> index afd450d020..34e60e567d 100644
>> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
>> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
>> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>>         setg    %dl
>>         leal    -1(%rdx, %rdx), %eax
>>  #  else
>> -       addl    %edx, %eax
>> +       addq    %rdx, %rax
> 
> Please add some comments here and a testcase.

I do not recommend a test case.

I would accept Noah's v3 as-is.

Even with a two or three thread test you'd have to run for long enough to statistically
trigger the change at the right time, and that means spending developer CPU resources
to test that a specific IFUNC works under UB conditions. This slows down development
cycle time for a use case we don't actually support.

I support Noah making this change because it doesn't have a performance impact and
I'm empathetic to fixing a developer reported issue. I wouldn't go any further than that.

My strong suggestion to yottadb is that they need to write their own hand-crafted memcmp
to ensure the entire algorithm operates under the *must-only-read-bytes-once* conditions.
If you read bytes *twice* due to overlapped loads you may have problems if you read
inconsistent values where you didn't expect them. The algorithm limitations are quite
severe IMO, and any mix of C and asm here could result in the compiler exposing visible
UB to the application author.
 
>>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>>         subl    %ecx, %eax
>> --
>> 2.34.1
>>
> 
> 

-- 
Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14 16:57     ` Carlos O'Donell
@ 2022-12-14 17:18       ` Noah Goldstein
  0 siblings, 0 replies; 16+ messages in thread
From: Noah Goldstein @ 2022-12-14 17:18 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: H.J. Lu, libc-alpha, carlos

On Wed, Dec 14, 2022 at 8:57 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 12/14/22 11:09, H.J. Lu via Libc-alpha wrote:
> > On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> >> are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> >> in `L(ret_nonzero_vec_end_0)` because the sequential logic
> >> assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
> >>
> >> To be clear, this change does not mean the usage of `memcmp` is
> >> supported.  The program behaviour is undefined (UB) in the
> >> presence of data races, and `memcmp` is incorrect when the values
> >> of `a` and/or `b` are modified concurrently (data race). This UB
> >> may manifest itself as a SIGSEGV. That being said, if we can
> >> allow the idiomatic use cases, like those in yottadb with
> >> opportunistic concurrency control (OCC), to execute without a
> >> SIGSEGV, at no cost to regular use cases, then we can aim to
> >> minimize harm to those existing users.
> >>
> >> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> >> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> >> 64-bit instruction doesn't contribute to overall code size as the
> >> next target is aligned and has multiple bytes of `nop` padding
> >> before it. As well all the logic between the add and `ret` still
> >> fits in the same fetch block, so the cost of this change is
> >> basically zero.
> >>
> >> The relevant sequential logic can be seen in the following
> >> pseudo-code:
> >> ```
> >>     /*
> >>      * rsi = a
> >>      * rdi = b
> >>      * rdx = len - 32
> >>      */
> >>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
> >>     in this case, this check is also assumed to cover a[0:(31 - len)]
> >>     and b[0:(31 - len)].  */
> >>     movups  (%rsi), %xmm0
> >>     movups  (%rdi), %xmm1
> >>     PCMPEQ  %xmm0, %xmm1
> >>     pmovmskb %xmm1, %eax
> >>     subl    %ecx, %eax
> >>     jnz L(END_NEQ)
> >>
> >>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
> >>     movups  16(%rsi, %rdx), %xmm0
> >>     movups  16(%rdi, %rdx), %xmm1
> >>     PCMPEQ  %xmm0, %xmm1
> >>     pmovmskb %xmm1, %eax
> >>     subl    %ecx, %eax
> >>     jnz L(END_NEQ2)
> >>     ret
> >>
> >> L(END2):
> >>     /* Position first mismatch.  */
> >>     bsfl    %eax, %eax
> >>
> >>     /* The sequential version is able to assume this value is a
> >>     positive 32-bit value because the first check included bytes in
> >>     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
> >>     greater than `31 - len` so the minimum value of `edx` + `eax` is
> >>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
> >>     `a` or `b` could have been changed so a mismatch in `eax` less or
> >>     equal than `(31 - len)` is possible (the new low bound is `(16 -
> >>     len)`. This can result in a negative 32-bit signed integer, which
> >>     when zero extended to 64-bits is a random large value this out
> >>     out of bounds. */
> >>     addl %edx, %eax
> >>
> >>     /* Crash here because 32-bit negative number in `eax` zero
> >>     extends to out of bounds 64-bit offset.  */
> >>     movzbl  16(%rdi, %rax), %ecx
> >>     movzbl  16(%rsi, %rax), %eax
> >> ```
> >>
> >> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> >> `addq %rdx, %rax`). This prevents the 32-bit zero extension
> >> and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> >> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> >> fixed offset of `16` in the memory access this must be in bounds.
> >> ---
> >>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> >> index afd450d020..34e60e567d 100644
> >> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> >> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> >> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
> >>         setg    %dl
> >>         leal    -1(%rdx, %rdx), %eax
> >>  #  else
> >> -       addl    %edx, %eax
> >> +       addq    %rdx, %rax
> >
> > Please add some comments here and a testcase.
>
> I do not recommend a test case.
>
> I would accept Noah's v3 as-is.


I can add a comment around the 'addq'
but agree a test case doesn't make sense
because at the moment its failure / success
is pretty arbitrary.
>
> Even with a two or three thread test you'd have to run for long enough to statistically
> trigger the change at the right time, and that means spending developer CPU resources
> to test that a specific IFUNC works under UB conditions. This slows down development
> cycle time for a use case we don't actually support.
>
> I support Noah making this change because it doesn't have a performance impact and
> I'm empathetic to fixing a developer reported issue. I wouldn't go any further than that.
>
> My strong suggestion to yottadb is that they need to write their own hand-crafted memcmp
> to ensure the entire algorithm operates under the *must-only-read-bytes-once* conditions.
> If you read bytes *twice* due to overlapped loads you may have problems if you read
> inconsistent values where you didn't expect them. The algorithm limitations are quite
> severe IMO, and any mix of C and asm here could result in the compiler exposing visible
> UB to the application author.
>
> >>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
> >>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
> >>         subl    %ecx, %eax
> >> --
> >> 2.34.1
> >>
> >
> >
>
> --
> Cheers,
> Carlos.
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14  0:11 [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863] Noah Goldstein
                   ` (3 preceding siblings ...)
  2022-12-14 16:07 ` [PATCH v1] x86: Prevent SIG11 " H.J. Lu
@ 2022-12-14 18:52 ` Noah Goldstein
  2022-12-14 21:35   ` H.J. Lu
  4 siblings, 1 reply; 16+ messages in thread
From: Noah Goldstein @ 2022-12-14 18:52 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos

In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
are concurrently modified as `memcmp` runs, there can be a SIGSEGV
in `L(ret_nonzero_vec_end_0)` because the sequential logic
assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.

To be clear, this change does not mean the usage of `memcmp` is
supported.  The program behaviour is undefined (UB) in the
presence of data races, and `memcmp` is incorrect when the values
of `a` and/or `b` are modified concurrently (data race). This UB
may manifest itself as a SIGSEGV. That being said, if we can
allow the idiomatic use cases, like those in yottadb with
opportunistic concurrency control (OCC), to execute without a
SIGSEGV, at no cost to regular use cases, then we can aim to
minimize harm to those existing users.

The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
`addq %rdx, %rax`. The 1-extra byte of code size from using the
64-bit instruction doesn't contribute to overall code size as the
next target is aligned and has multiple bytes of `nop` padding
before it. As well all the logic between the add and `ret` still
fits in the same fetch block, so the cost of this change is
basically zero.

The relevant sequential logic can be seen in the following
pseudo-code:
```
    /*
     * rsi = a
     * rdi = b
     * rdx = len - 32
     */
    /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
    in this case, this check is also assumed to cover a[0:(31 - len)]
    and b[0:(31 - len)].  */
    movups  (%rsi), %xmm0
    movups  (%rdi), %xmm1
    PCMPEQ  %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    subl    %ecx, %eax
    jnz L(END_NEQ)

    /* cmp a[len-16:len-1] and b[len-16:len-1].  */
    movups  16(%rsi, %rdx), %xmm0
    movups  16(%rdi, %rdx), %xmm1
    PCMPEQ  %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    subl    %ecx, %eax
    jnz L(END_NEQ2)
    ret

L(END2):
    /* Position first mismatch.  */
    bsfl    %eax, %eax

    /* The sequential version is able to assume this value is a
    positive 32-bit value because the first check included bytes in
    range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
    greater than `31 - len` so the minimum value of `edx` + `eax` is
    `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
    `a` or `b` could have been changed so a mismatch in `eax` less or
    equal than `(31 - len)` is possible (the new low bound is `(16 -
    len)`. This can result in a negative 32-bit signed integer, which
    when zero extended to 64-bits is a random large value this out
    out of bounds. */
    addl %edx, %eax

    /* Crash here because 32-bit negative number in `eax` zero
    extends to out of bounds 64-bit offset.  */
    movzbl  16(%rdi, %rax), %ecx
    movzbl  16(%rsi, %rax), %eax
```

This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
`addq %rdx, %rax`). This prevents the 32-bit zero extension
and since `eax` is still a low bound of `16 - len` the `rdx + rax`
is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
fixed offset of `16` in the memory access this must be in bounds.
---
 sysdeps/x86_64/multiarch/memcmp-sse2.S | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
index afd450d020..51bc9344f0 100644
--- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
+++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
@@ -308,7 +308,17 @@ L(ret_nonzero_vec_end_0):
 	setg	%dl
 	leal	-1(%rdx, %rdx), %eax
 #  else
-	addl	%edx, %eax
+	/* Use `addq` instead of `addl` here so that even if `rax` + `rdx`
+       is negative value of the sum will be usable as a 64-bit offset
+       (negative 32-bit numbers zero-extend to a large and often
+       out-of-bounds 64-bit offsets).  Note that `rax` + `rdx` >= 0 is
+       an invariant when `memcmp` is used correctly, but if the input
+       strings `rsi`/`rdi` are concurrently modified as the function
+       runs (there is a Data-Race) it is possible for `rax` + `rdx` to
+       be negative.  Given that there is virtually no extra to cost
+       using `addq` instead of `addl` we may as well protect the
+       data-race case.  */
+	addq	%rdx, %rax
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
 	subl	%ecx, %eax
-- 
2.34.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14 16:09   ` H.J. Lu
  2022-12-14 16:57     ` Carlos O'Donell
@ 2022-12-14 18:52     ` Noah Goldstein
  1 sibling, 0 replies; 16+ messages in thread
From: Noah Goldstein @ 2022-12-14 18:52 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha, carlos

On Wed, Dec 14, 2022 at 8:10 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> > are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> > in `L(ret_nonzero_vec_end_0)` because the sequential logic
> > assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
> >
> > To be clear, this change does not mean the usage of `memcmp` is
> > supported.  The program behaviour is undefined (UB) in the
> > presence of data races, and `memcmp` is incorrect when the values
> > of `a` and/or `b` are modified concurrently (data race). This UB
> > may manifest itself as a SIGSEGV. That being said, if we can
> > allow the idiomatic use cases, like those in yottadb with
> > opportunistic concurrency control (OCC), to execute without a
> > SIGSEGV, at no cost to regular use cases, then we can aim to
> > minimize harm to those existing users.
> >
> > The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> > `addq %rdx, %rax`. The 1-extra byte of code size from using the
> > 64-bit instruction doesn't contribute to overall code size as the
> > next target is aligned and has multiple bytes of `nop` padding
> > before it. As well all the logic between the add and `ret` still
> > fits in the same fetch block, so the cost of this change is
> > basically zero.
> >
> > The relevant sequential logic can be seen in the following
> > pseudo-code:
> > ```
> >     /*
> >      * rsi = a
> >      * rdi = b
> >      * rdx = len - 32
> >      */
> >     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
> >     in this case, this check is also assumed to cover a[0:(31 - len)]
> >     and b[0:(31 - len)].  */
> >     movups  (%rsi), %xmm0
> >     movups  (%rdi), %xmm1
> >     PCMPEQ  %xmm0, %xmm1
> >     pmovmskb %xmm1, %eax
> >     subl    %ecx, %eax
> >     jnz L(END_NEQ)
> >
> >     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
> >     movups  16(%rsi, %rdx), %xmm0
> >     movups  16(%rdi, %rdx), %xmm1
> >     PCMPEQ  %xmm0, %xmm1
> >     pmovmskb %xmm1, %eax
> >     subl    %ecx, %eax
> >     jnz L(END_NEQ2)
> >     ret
> >
> > L(END2):
> >     /* Position first mismatch.  */
> >     bsfl    %eax, %eax
> >
> >     /* The sequential version is able to assume this value is a
> >     positive 32-bit value because the first check included bytes in
> >     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
> >     greater than `31 - len` so the minimum value of `edx` + `eax` is
> >     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
> >     `a` or `b` could have been changed so a mismatch in `eax` less or
> >     equal than `(31 - len)` is possible (the new low bound is `(16 -
> >     len)`. This can result in a negative 32-bit signed integer, which
> >     when zero extended to 64-bits is a random large value this out
> >     out of bounds. */
> >     addl %edx, %eax
> >
> >     /* Crash here because 32-bit negative number in `eax` zero
> >     extends to out of bounds 64-bit offset.  */
> >     movzbl  16(%rdi, %rax), %ecx
> >     movzbl  16(%rsi, %rax), %eax
> > ```
> >
> > This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> > `addq %rdx, %rax`). This prevents the 32-bit zero extension
> > and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> > is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> > fixed offset of `16` in the memory access this must be in bounds.
> > ---
> >  sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > index afd450d020..34e60e567d 100644
> > --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
> >         setg    %dl
> >         leal    -1(%rdx, %rdx), %eax
> >  #  else
> > -       addl    %edx, %eax
> > +       addq    %rdx, %rax
>
> Please add some comments here and a testcase.

Added comments.
>
> >         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
> >         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
> >         subl    %ecx, %eax
> > --
> > 2.34.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14 18:52 ` [PATCH v4] x86: Prevent SIGSEGV " Noah Goldstein
@ 2022-12-14 21:35   ` H.J. Lu
  2023-01-10 22:02     ` Sunil Pandey
  0 siblings, 1 reply; 16+ messages in thread
From: H.J. Lu @ 2022-12-14 21:35 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, carlos

On Wed, Dec 14, 2022 at 10:52 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> in `L(ret_nonzero_vec_end_0)` because the sequential logic
> assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
>
> To be clear, this change does not mean the usage of `memcmp` is
> supported.  The program behaviour is undefined (UB) in the
> presence of data races, and `memcmp` is incorrect when the values
> of `a` and/or `b` are modified concurrently (data race). This UB
> may manifest itself as a SIGSEGV. That being said, if we can
> allow the idiomatic use cases, like those in yottadb with
> opportunistic concurrency control (OCC), to execute without a
> SIGSEGV, at no cost to regular use cases, then we can aim to
> minimize harm to those existing users.
>
> The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> `addq %rdx, %rax`. The 1-extra byte of code size from using the
> 64-bit instruction doesn't contribute to overall code size as the
> next target is aligned and has multiple bytes of `nop` padding
> before it. As well all the logic between the add and `ret` still
> fits in the same fetch block, so the cost of this change is
> basically zero.
>
> The relevant sequential logic can be seen in the following
> pseudo-code:
> ```
>     /*
>      * rsi = a
>      * rdi = b
>      * rdx = len - 32
>      */
>     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
>     in this case, this check is also assumed to cover a[0:(31 - len)]
>     and b[0:(31 - len)].  */
>     movups  (%rsi), %xmm0
>     movups  (%rdi), %xmm1
>     PCMPEQ  %xmm0, %xmm1
>     pmovmskb %xmm1, %eax
>     subl    %ecx, %eax
>     jnz L(END_NEQ)
>
>     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
>     movups  16(%rsi, %rdx), %xmm0
>     movups  16(%rdi, %rdx), %xmm1
>     PCMPEQ  %xmm0, %xmm1
>     pmovmskb %xmm1, %eax
>     subl    %ecx, %eax
>     jnz L(END_NEQ2)
>     ret
>
> L(END2):
>     /* Position first mismatch.  */
>     bsfl    %eax, %eax
>
>     /* The sequential version is able to assume this value is a
>     positive 32-bit value because the first check included bytes in
>     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
>     greater than `31 - len` so the minimum value of `edx` + `eax` is
>     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
>     `a` or `b` could have been changed so a mismatch in `eax` less or
>     equal than `(31 - len)` is possible (the new low bound is `(16 -
>     len)`. This can result in a negative 32-bit signed integer, which
>     when zero extended to 64-bits is a random large value this out
>     out of bounds. */
>     addl %edx, %eax
>
>     /* Crash here because 32-bit negative number in `eax` zero
>     extends to out of bounds 64-bit offset.  */
>     movzbl  16(%rdi, %rax), %ecx
>     movzbl  16(%rsi, %rax), %eax
> ```
>
> This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> `addq %rdx, %rax`). This prevents the 32-bit zero extension
> and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> fixed offset of `16` in the memory access this must be in bounds.
> ---
>  sysdeps/x86_64/multiarch/memcmp-sse2.S | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> index afd450d020..51bc9344f0 100644
> --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> @@ -308,7 +308,17 @@ L(ret_nonzero_vec_end_0):
>         setg    %dl
>         leal    -1(%rdx, %rdx), %eax
>  #  else
> -       addl    %edx, %eax
> +       /* Use `addq` instead of `addl` here so that even if `rax` + `rdx`
> +       is negative value of the sum will be usable as a 64-bit offset
> +       (negative 32-bit numbers zero-extend to a large and often
> +       out-of-bounds 64-bit offsets).  Note that `rax` + `rdx` >= 0 is
> +       an invariant when `memcmp` is used correctly, but if the input
> +       strings `rsi`/`rdi` are concurrently modified as the function
> +       runs (there is a Data-Race) it is possible for `rax` + `rdx` to
> +       be negative.  Given that there is virtually no extra to cost
> +       using `addq` instead of `addl` we may as well protect the
> +       data-race case.  */
> +       addq    %rdx, %rax
>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>         subl    %ecx, %eax
> --
> 2.34.1
>

LGTM.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2022-12-14 21:35   ` H.J. Lu
@ 2023-01-10 22:02     ` Sunil Pandey
  2023-01-10 23:02       ` Noah Goldstein
  0 siblings, 1 reply; 16+ messages in thread
From: Sunil Pandey @ 2023-01-10 22:02 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, libc-alpha, carlos

On Wed, Dec 14, 2022 at 1:36 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Dec 14, 2022 at 10:52 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> > are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> > in `L(ret_nonzero_vec_end_0)` because the sequential logic
> > assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
> >
> > To be clear, this change does not mean the usage of `memcmp` is
> > supported.  The program behaviour is undefined (UB) in the
> > presence of data races, and `memcmp` is incorrect when the values
> > of `a` and/or `b` are modified concurrently (data race). This UB
> > may manifest itself as a SIGSEGV. That being said, if we can
> > allow the idiomatic use cases, like those in yottadb with
> > opportunistic concurrency control (OCC), to execute without a
> > SIGSEGV, at no cost to regular use cases, then we can aim to
> > minimize harm to those existing users.
> >
> > The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> > `addq %rdx, %rax`. The 1-extra byte of code size from using the
> > 64-bit instruction doesn't contribute to overall code size as the
> > next target is aligned and has multiple bytes of `nop` padding
> > before it. As well all the logic between the add and `ret` still
> > fits in the same fetch block, so the cost of this change is
> > basically zero.
> >
> > The relevant sequential logic can be seen in the following
> > pseudo-code:
> > ```
> >     /*
> >      * rsi = a
> >      * rdi = b
> >      * rdx = len - 32
> >      */
> >     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
> >     in this case, this check is also assumed to cover a[0:(31 - len)]
> >     and b[0:(31 - len)].  */
> >     movups  (%rsi), %xmm0
> >     movups  (%rdi), %xmm1
> >     PCMPEQ  %xmm0, %xmm1
> >     pmovmskb %xmm1, %eax
> >     subl    %ecx, %eax
> >     jnz L(END_NEQ)
> >
> >     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
> >     movups  16(%rsi, %rdx), %xmm0
> >     movups  16(%rdi, %rdx), %xmm1
> >     PCMPEQ  %xmm0, %xmm1
> >     pmovmskb %xmm1, %eax
> >     subl    %ecx, %eax
> >     jnz L(END_NEQ2)
> >     ret
> >
> > L(END2):
> >     /* Position first mismatch.  */
> >     bsfl    %eax, %eax
> >
> >     /* The sequential version is able to assume this value is a
> >     positive 32-bit value because the first check included bytes in
> >     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
> >     greater than `31 - len` so the minimum value of `edx` + `eax` is
> >     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
> >     `a` or `b` could have been changed so a mismatch in `eax` less or
> >     equal than `(31 - len)` is possible (the new low bound is `(16 -
> >     len)`. This can result in a negative 32-bit signed integer, which
> >     when zero extended to 64-bits is a random large value this out
> >     out of bounds. */
> >     addl %edx, %eax
> >
> >     /* Crash here because 32-bit negative number in `eax` zero
> >     extends to out of bounds 64-bit offset.  */
> >     movzbl  16(%rdi, %rax), %ecx
> >     movzbl  16(%rsi, %rax), %eax
> > ```
> >
> > This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> > `addq %rdx, %rax`). This prevents the 32-bit zero extension
> > and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> > is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> > fixed offset of `16` in the memory access this must be in bounds.
> > ---
> >  sysdeps/x86_64/multiarch/memcmp-sse2.S | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > index afd450d020..51bc9344f0 100644
> > --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > @@ -308,7 +308,17 @@ L(ret_nonzero_vec_end_0):
> >         setg    %dl
> >         leal    -1(%rdx, %rdx), %eax
> >  #  else
> > -       addl    %edx, %eax
> > +       /* Use `addq` instead of `addl` here so that even if `rax` + `rdx`
> > +       is negative value of the sum will be usable as a 64-bit offset
> > +       (negative 32-bit numbers zero-extend to a large and often
> > +       out-of-bounds 64-bit offsets).  Note that `rax` + `rdx` >= 0 is
> > +       an invariant when `memcmp` is used correctly, but if the input
> > +       strings `rsi`/`rdi` are concurrently modified as the function
> > +       runs (there is a Data-Race) it is possible for `rax` + `rdx` to
> > +       be negative.  Given that there is virtually no extra to cost
> > +       using `addq` instead of `addl` we may as well protect the
> > +       data-race case.  */
> > +       addq    %rdx, %rax
> >         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
> >         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
> >         subl    %ecx, %eax
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]
  2023-01-10 22:02     ` Sunil Pandey
@ 2023-01-10 23:02       ` Noah Goldstein
  0 siblings, 0 replies; 16+ messages in thread
From: Noah Goldstein @ 2023-01-10 23:02 UTC (permalink / raw)
  To: Sunil Pandey; +Cc: H.J. Lu, Libc-stable Mailing List, libc-alpha, carlos

On Tue, Jan 10, 2023 at 2:03 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> On Wed, Dec 14, 2022 at 1:36 PM H.J. Lu via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On Wed, Dec 14, 2022 at 10:52 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
> > > are concurrently modified as `memcmp` runs, there can be a SIGSEGV
> > > in `L(ret_nonzero_vec_end_0)` because the sequential logic
> > > assumes that `(rdx - 32 + rax)` is a positive 32-bit integer.
> > >
> > > To be clear, this change does not mean the usage of `memcmp` is
> > > supported.  The program behaviour is undefined (UB) in the
> > > presence of data races, and `memcmp` is incorrect when the values
> > > of `a` and/or `b` are modified concurrently (data race). This UB
> > > may manifest itself as a SIGSEGV. That being said, if we can
> > > allow the idiomatic use cases, like those in yottadb with
> > > opportunistic concurrency control (OCC), to execute without a
> > > SIGSEGV, at no cost to regular use cases, then we can aim to
> > > minimize harm to those existing users.
> > >
> > > The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
> > > `addq %rdx, %rax`. The 1-extra byte of code size from using the
> > > 64-bit instruction doesn't contribute to overall code size as the
> > > next target is aligned and has multiple bytes of `nop` padding
> > > before it. As well all the logic between the add and `ret` still
> > > fits in the same fetch block, so the cost of this change is
> > > basically zero.
> > >
> > > The relevant sequential logic can be seen in the following
> > > pseudo-code:
> > > ```
> > >     /*
> > >      * rsi = a
> > >      * rdi = b
> > >      * rdx = len - 32
> > >      */
> > >     /* cmp a[0:15] and b[0:15]. Since length is known to be [17, 32]
> > >     in this case, this check is also assumed to cover a[0:(31 - len)]
> > >     and b[0:(31 - len)].  */
> > >     movups  (%rsi), %xmm0
> > >     movups  (%rdi), %xmm1
> > >     PCMPEQ  %xmm0, %xmm1
> > >     pmovmskb %xmm1, %eax
> > >     subl    %ecx, %eax
> > >     jnz L(END_NEQ)
> > >
> > >     /* cmp a[len-16:len-1] and b[len-16:len-1].  */
> > >     movups  16(%rsi, %rdx), %xmm0
> > >     movups  16(%rdi, %rdx), %xmm1
> > >     PCMPEQ  %xmm0, %xmm1
> > >     pmovmskb %xmm1, %eax
> > >     subl    %ecx, %eax
> > >     jnz L(END_NEQ2)
> > >     ret
> > >
> > > L(END2):
> > >     /* Position first mismatch.  */
> > >     bsfl    %eax, %eax
> > >
> > >     /* The sequential version is able to assume this value is a
> > >     positive 32-bit value because the first check included bytes in
> > >     range a[0:(31 - len)] and b[0:(31 - len)] so `eax` must be
> > >     greater than `31 - len` so the minimum value of `edx` + `eax` is
> > >     `(len - 32) + (32 - len) >= 0`. In the concurrent case, however,
> > >     `a` or `b` could have been changed so a mismatch in `eax` less or
> > >     equal than `(31 - len)` is possible (the new low bound is `(16 -
> > >     len)`. This can result in a negative 32-bit signed integer, which
> > >     when zero extended to 64-bits is a random large value this out
> > >     out of bounds. */
> > >     addl %edx, %eax
> > >
> > >     /* Crash here because 32-bit negative number in `eax` zero
> > >     extends to out of bounds 64-bit offset.  */
> > >     movzbl  16(%rdi, %rax), %ecx
> > >     movzbl  16(%rsi, %rax), %eax
> > > ```
> > >
> > > This fix is quite simple, just make the `addl %edx, %eax` 64 bit (i.e
> > > `addq %rdx, %rax`). This prevents the 32-bit zero extension
> > > and since `eax` is still a low bound of `16 - len` the `rdx + rax`
> > > is bound by `(len - 32) - (16 - len) >= -16`. Since we have a
> > > fixed offset of `16` in the memory access this must be in bounds.
> > > ---
> > >  sysdeps/x86_64/multiarch/memcmp-sse2.S | 12 +++++++++++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > > index afd450d020..51bc9344f0 100644
> > > --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > > +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
> > > @@ -308,7 +308,17 @@ L(ret_nonzero_vec_end_0):
> > >         setg    %dl
> > >         leal    -1(%rdx, %rdx), %eax
> > >  #  else
> > > -       addl    %edx, %eax
> > > +       /* Use `addq` instead of `addl` here so that even if `rax` + `rdx`
> > > +       is negative value of the sum will be usable as a 64-bit offset
> > > +       (negative 32-bit numbers zero-extend to a large and often
> > > +       out-of-bounds 64-bit offsets).  Note that `rax` + `rdx` >= 0 is
> > > +       an invariant when `memcmp` is used correctly, but if the input
> > > +       strings `rsi`/`rdi` are concurrently modified as the function
> > > +       runs (there is a Data-Race) it is possible for `rax` + `rdx` to
> > > +       be negative.  Given that there is virtually no extra to cost
> > > +       using `addq` instead of `addl` we may as well protect the
> > > +       data-race case.  */
> > > +       addq    %rdx, %rax
> > >         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
> > >         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
> > >         subl    %ecx, %eax
> > > --
> > > 2.34.1
> > >
> >
> > LGTM.
> >
> > Thanks.
> >
> > --
> > H.J.
>
> I would like to backport this patch to release branches.
> Any comments or objections?

Fine by me.
>
> --Sunil

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-01-10 23:02 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-14  0:11 [PATCH v1] x86: Prevent SIG11 in memcmp-sse2 when data is concurrently modified [BZ #29863] Noah Goldstein
2022-12-14  2:41 ` Carlos O'Donell
2022-12-14  9:04   ` Andreas Schwab
2022-12-14  3:25 ` [PATCH v2] x86: Prevent SIGSEGV " Noah Goldstein
2022-12-14  4:42   ` Carlos O'Donell
2022-12-14  7:36 ` [PATCH v3] " Noah Goldstein
2022-12-14 16:09   ` H.J. Lu
2022-12-14 16:57     ` Carlos O'Donell
2022-12-14 17:18       ` Noah Goldstein
2022-12-14 18:52     ` Noah Goldstein
2022-12-14 16:57   ` Carlos O'Donell
2022-12-14 16:07 ` [PATCH v1] x86: Prevent SIG11 " H.J. Lu
2022-12-14 18:52 ` [PATCH v4] x86: Prevent SIGSEGV " Noah Goldstein
2022-12-14 21:35   ` H.J. Lu
2023-01-10 22:02     ` Sunil Pandey
2023-01-10 23:02       ` Noah Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).