From: Florian Weimer
To: Zack Weinberg via Libc-alpha
Cc: Carlos O'Donell, Zack Weinberg
Subject: Re: “Undefined behavior” considered harmful (was Re: Bug 29863 -
 Segmentation fault in memcmp-sse2.S…)
References: <0a1f01d90f1f$96c7ce60$c4576b20$@yottadb.com>
 <0b2901d90f26$f82b4720$e881d560$@yottadb.com>
 <38450ca5-599d-4e5d-b2db-be01856680cb@app.fastmail.com>
 <736bb5b6-f9d5-b541-f983-1e5026aaacfa@redhat.com>
Date: Fri, 30 Dec 2022 16:09:49 +0100
In-Reply-To: (Zack Weinberg via Libc-alpha's message of "Thu, 29 Dec 2022 14:32:50 -0500")
Message-ID: <874jtd3p36.fsf@mid.deneb.enyo.de>

* Zack Weinberg via Libc-alpha:

> The original issue with memcmp() seems to have been resolved, but I’d
> like to start a broader discussion about the C library’s
> responsibilities in the face of a program that’s, to some degree,
> incorrect.
>
> On Tue, 13 Dec 2022 23:16:19 -0500, Carlos O'Donell wrote:
>> We are talking about the C language, and when you write
>> "unspecified" in that context it means the language *does* have
>> something to say about the behaviour but does not pick one or other
>> of the available behaviours. This is not the case, the language very
>> clearly says this is undefined behaviour, so it says nothing about
>> what should happen.
>>
>> My understanding was that you were trying to ascribe more
>> determinism to the operation of memcpy under the presence of data
>> races than could be granted by UB.
>>
>> I strongly disagree to ascribing more determinism than UB.
>
> Everything I’m about to say stems from the following three premises:
>
> 1. The C standard uses “undefined behavior” far more liberally than it
>    ought to. In many cases of existing UB the committee could define
>    the behavior (possibly as implementation-defined or unspecified)
>    without any actual negative consequences. It seems the committee
>    *is* moving in this direction as of C2x, for instance by dropping
>    the allowances for non-twos-complement signed arithmetic, but they
>    could and should go a lot farther down that road.

I'm not sure if this is a priority.  It's a lot of work, and most
aspects are probably not technical in the end.  For example, Peter
Sewell's group did a lot of work on the (non-concurrent) C memory
model, both investigative and descriptive, and it's been hard to
integrate this into the standard, as far as I know.

From a licensing perspective, I would very much prefer if anything
that approaches an executable specification were outside the ISO
framework.  It would also enable exploring different directions based
on interests (some might prefer semantics more along the lines of
-fwrapv -fno-strict-aliasing, for example).

> 2. The remaining cases of UB are those where we still want to say that
>    the program is *incorrect* if it does these things, but we don’t
>    want to require the compiler to diagnose the incorrectness (usually
>    because detection would be intractable). *Even in these cases*,
>    the current concept of “undefined behavior,” licensing the
>    implementation to do *anything*, is troublesome. The standard
>    should replace the concept entirely, with something analogous to,
>    say, the ARM ARM’s concept of “constrained unpredictable” behavior:

Hmm.  Ada RM?  Bounded errors?  Maybe these concepts are quite common?

> 3. Implementations can and should work in advance of the C committee
>    on restricting the consequences of UB as outlined in (1) and (2).
>    That is, the present state of the C standard should not stop _us_
>    from specifying the behavior of GNU libc in cases where clause 7
>    leaves behavior “undefined”.

That seems reasonable.  Yet in many cases, we still struggle to find
consensus among our smaller group.

> The earlier thread provides a concrete example of a type 2 change that
> I think is desirable: instead of 5.1.2.4p25 simply saying that if a
> data race occurs, the behavior is undefined, it should say that if a
> data race occurs, all calculations data- or control-dependent on the
> conflicting expressions produce unpredictable results, where
> “unpredictable result” is approximately the same thing as the current
> “unspecified value” but might wind up needing to have a slightly
> different definition. [Notice that this is already a significant
> constraint on what actually happens, since unpredictability can no
> longer propagate backwards in time.]

I think this would have serious implications for any optimizations
involving multiple memory accesses.  With undefined data races and no
relevant writes from the current thread, you can assume that the
memory is stable.
I doubt there's a good alternative right now to the present situation.

>> Could you expand on why you think this is a "natural" guarantee and
>> from what that derives from?
>
> I am envisioning (pointer, length) 2-tuples as object capabilities.
> By calling memcmp(a, b, n), the caller grants memcmp access to the
> address ranges [a, a+n) and [b, b+n), but not a single byte more.
> Some concrete machines (e.g. the valgrind and ASAN VMs, and to a
> lesser extent CHERI) can actually enforce the limits of that grant.
> Even if the hardware cannot enforce the limits, the callee should
> honor them.
>
> The contents, and the stability of the contents, of those address
> ranges *cannot* affect the limits of the grant, because they appear
> nowhere in the expressions that define the limits.

But that's not how memcmp is implemented in practice.  Our
implementation assumes that it can access memory outside these bounds,
but that is not observable because we only do so when we are sure it
does not cause a page fault.  But there are other ways memory accesses
can be observable.

It's also proven difficult to find consensus on whether string
functions may continue reading beyond the bytes that determine the
function's result.

> Because we believe that the language guarantees are too weak and need
> to be strengthened, and the standard committees will want to see that
> strengthening play out as “existing practice” before they make any
> changes.

Or maybe they standardize something that's incompatible with what we
do, and they feel free to do so because it was previously undefined,
so it's technically not a language change.  If we go in that
direction, we should accept future, persistent incompatibilities, and
get out of the business of presenting a standards-compliant view to
the programmer even when the underlying infrastructure isn't.  See
__ASSUME_SYSVIPC_BROKEN_MODE_T for an example of what we've been doing
so far.