Re: “Undefined behavior” considered harmful (was Re: Bug 29863 - Segmentation fault in memcmp-sse2.S…)

From: Florian Weimer <fw@deneb.enyo.de>
To: Zack Weinberg via Libc-alpha <libc-alpha@sourceware.org>
Cc: Carlos O'Donell <carlos@redhat.com>,  Zack Weinberg <zack@owlfolio.org>
Subject: Re: “Undefined behavior” considered harmful (was Re: Bug 29863 - Segmentation fault in memcmp-sse2.S…)
Date: Fri, 30 Dec 2022 16:09:49 +0100
Message-ID: <874jtd3p36.fsf@mid.deneb.enyo.de>
In-Reply-To: <ypikr0wigg4d.wl-zack@owlfolio.org> (Zack Weinberg via Libc-alpha's message of "Thu, 29 Dec 2022 14:32:50 -0500")

* Zack Weinberg via Libc-alpha:

> The original issue with memcmp() seems to have been resolved, but I’d
> like to start a broader discussion about the C library’s
> responsibilities in the face of a program that’s, to some degree,
> incorrect.
>
> On Tue, 13 Dec 2022 23:16:19 -0500, Carlos O'Donell wrote:
>> We are talking about the C language, and when you write
>> "unspecified" in that context it means the language *does* have
>> something to say about the behaviour but does not pick one or other
>> of the available behaviours. This is not the case, the language very
>> clearly says this is undefined behaviour, so it says nothing about
>> what should happen.
>> 
>> My understanding was that you were trying to ascribe more
>> determinism to the operation of memcpy under the presence of data
>> races than could be granted by UB.
>> 
>> I strongly disagree to ascribing more determinism than UB.
>
> Everything I’m about to say stems from the following three premises:
>
> 1. The C standard uses “undefined behavior” far more liberally than it
>    ought to.  In many cases of existing UB the committee could define
>    the behavior (possibly as implementation-defined or unspecified)
>    without any actual negative consequences.  It seems the committee
>    *is* moving in this direction as of C2x, for instance by dropping
>    the allowances for non-twos-complement signed arithmetic, but they
>    could and should go a lot farther down that road.

I'm not sure if this is a priority.  It's a lot of work, and most
aspects are probably not technical in the end.  For example, Peter
Sewell's group did a lot of work on the (non-concurrent) C memory
model, both investigative and descriptive, and it's been hard to
integrate this into the standard, as far as I know.

From a licensing perspective, I would very much prefer if anything
that approaches an executable specification were outside the ISO
framework.  It would also enable exploring different directions based
on interests (some might prefer semantics more along the lines of
-fwrapv -fno-strict-aliasing, for example).
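
To make the -fwrapv case concrete, here is a minimal sketch.  The
function names are hypothetical and exist only for illustration.  The
first function is meaningful only if signed overflow wraps, which is
what -fwrapv asks GCC and Clang to provide; the second does not depend
on that.

```c
#include <limits.h>
#include <stdbool.h>

/* Overflow check that relies on wraparound.  Under ISO C, x + 1
   overflows when x == INT_MAX, which is undefined, so a compiler may
   fold the comparison to false.  With -fwrapv the addition wraps and
   the comparison is true for INT_MAX.  */
bool next_overflows_wrapping(int x)
{
    return x + 1 < x;
}

/* Equivalent check that never overflows, so it means the same thing
   with or without -fwrapv.  */
bool next_overflows_portable(int x)
{
    return x == INT_MAX;
}
```

Code written in the first style is an example of a program that
already assumes the alternative semantics.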

> 2. The remaining cases of UB are those where we still want to say that
>    the program is *incorrect* if it does these things, but we don’t
>    want to require the compiler to diagnose the incorrectness (usually
>    because detection would be intractable).  *Even in these cases*,
>    the current concept of “undefined behavior,” licensing the
>    implementation to do *anything*, is troublesome.  The standard
>    should replace the concept entirely, with something analogous to,
>    say, the ARM ARM’s concept of “constrained unpredictable” behavior:

Hmm.  The Ada RM has bounded errors, which seem similar.  Maybe these
concepts are quite common?

> 3. Implementations can and should work in advance of the C committee
>    on restricting the consequences of UB as outlined in (1) and (2).
>    That is, the present state of the C standard should not stop _us_
>    from specifying the behavior of GNU libc in cases where clause 7
>    leaves behavior “undefined”.

That seems reasonable.  Yet in many cases, we still struggle to find
consensus even within our smaller group.

> The earlier thread provides a concrete example of a type 2 change that
> I think is desirable: instead of 5.1.2.4p25 simply saying that if a
> data race occurs, the behavior is undefined, it should say that if a
> data race occurs, all calculations data- or control-dependent on the
> conflicting expressions produce unpredictable results, where
> “unpredictable result” is approximately the same thing as the current
> “unspecified value” but might wind up needing to have a slightly
> different definition.  [Notice that this is already a significant
> constraint on what actually happens, since unpredictability can no
> longer propagate backwards in time.]

I think this would have serious implications for any optimizations
involving multiple memory accesses.  With undefined data races and no
relevant writes from the current thread, you can assume that the
memory is stable.
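
A minimal sketch of the stability assumption, with hypothetical
function names that are not taken from glibc or any compiler.  It only
shows where the assumption enters; it makes no claim about what a
particular compiler does with this code.

```c
#include <stddef.h>

/* The source reads *p twice.  Since a concurrent unsynchronized write
   would be a data race, the compiler may load *p once and reuse the
   value.  */
int twice(const int *p)
{
    return *p + *p;
}

/* The source reads *idx for the bounds check and again for the
   access.  The compiler may merge the two loads, or it may reload
   *idx after the check; both are valid only if the memory is assumed
   not to change in between.  */
int lookup(const int *table, size_t len, const size_t *idx)
{
    if (*idx < len)
        return table[*idx];
    return -1;
}
```

If a racing write merely made the loaded values unpredictable, the
freedom to merge or repeat loads like these would need to be
re-examined for each optimization.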

I doubt there's a good alternative right now to the present situation.

>> Could you expand on why you think this is a "natural" guarantee and
>> from what that derives from?
>
> I am envisioning (pointer, length) 2-tuples as object capabilities.
> By calling memcmp(a, b, n), the caller grants memcmp access to the
> address ranges [a, a+n) and [b, b+n), but not a single byte more.
> Some concrete machines (e.g. the valgrind and ASAN VMs, and to a
> lesser extent CHERI) can actually enforce the limits of that grant.
> Even if the hardware cannot enforce the limits, the callee should
> honor them.
>
> The contents, and the stability of the contents, of those address
> ranges *cannot* affect the limits of the grant, because they appear
> nowhere in the expressions that define the limits.

But that's not how memcmp is implemented in practice.  Our
implementation assumes that it can access memory outside these bounds,
but that is not observable because we only do so when we are sure it
does not cause a page fault.  But there are other ways memory accesses
can be observable.
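
The "only when it cannot fault" condition can be sketched as follows.
This is not glibc's code: the real implementations are written in
assembly and differ by architecture.  The 4096-byte page size, the
macro name, and the function name are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the sketch; real code would not hard-code
   this.  */
#define SKETCH_PAGE_SIZE 4096u

/* Returns true if a load of `width` bytes starting at `addr` stays
   inside the page that contains `addr`.  If the first byte is
   readable, such a load cannot touch an unmapped page, even when it
   extends past the end of the caller's buffer.  */
bool load_stays_in_page(uintptr_t addr, size_t width)
{
    assert(width > 0 && width <= SKETCH_PAGE_SIZE);
    uintptr_t offset = addr & (SKETCH_PAGE_SIZE - 1);
    return offset + width <= SKETCH_PAGE_SIZE;
}
```

For a 16-byte load, the check passes at page offset 4080 and fails at
4081.  A load that passes the check can still read bytes outside
[a, a+n), which is the kind of access that bounds-checking tools such
as valgrind or ASAN are designed to notice.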

It has also proven difficult to reach consensus on whether string
functions may continue reading beyond the bytes that determine the
function's result.

> Because we believe that the language guarantees are too weak and need
> to be strengthened, and the standard committees will want to see that
> strengthening play out as “existing practice” before they make any
> changes.

Or maybe they standardize something that's incompatible with what we
do, and they feel free to do so because it was previously undefined,
so it's technically not a language change.

If we go in that direction, we should accept future, persistent
incompatibilities, and get out of the business of presenting a
standards-compliant view to the programmer even when the underlying
infrastructure isn't.  See __ASSUME_SYSVIPC_BROKEN_MODE_T for an
example of what we've been doing so far.
