From: Peter Sewell
Reply-To: Peter.Sewell@cl.cam.ac.uk
Date: Fri, 12 Apr 2019 15:31:00 -0000
Subject: Re: C provenance semantics proposal
To: Jeff Law
Cc: gcc@gcc.gnu.org, cl-c-memory-object-model

On Fri, 12 Apr 2019 at 15:51, Jeff Law wrote:
>
> On 4/2/19 2:11 AM, Peter Sewell wrote:
> > Dear all,
> >
> > continuing the discussion from the 2018 GNU Tools Cauldron, we
> > (the WG14 C memory object model study group) now
> > have a detailed proposal for pointer provenance semantics, refining
> > the "provenance not via integers (PNVI)" model presented there.
> > This will be discussed at the ISO WG14 C standards committee at the
> > end of April, and comments from the GCC community before then would
> > be very welcome.
> > The proposal reconciles the needs of existing code
> > and the behaviour of existing compilers as well as we can, but it doesn't
> > exactly match any of the latter, so we'd especially like to know whether
> > it would be feasible to implement - our hope is that it would only require
> > minor changes. It's presented in three documents:
> >
> > N2362 Moving to a provenance-aware memory model for C: proposal for C2x
> > by the memory object model study group. Jens Gustedt, Peter Sewell,
> > Kayvan Memarian, Victor B. F. Gomes, Martin Uecker.
> > This introduces the proposal and gives the proposed change to the standard
> > text, presented as change-highlighted pages of the standard
> > (though one might want to read the N2363 examples before going into that).
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
> >
> > N2363 C provenance semantics: examples.
> > Peter Sewell, Kayvan Memarian, Victor B. F. Gomes, Jens Gustedt, Martin Uecker.
> > This explains the proposal and its design choices with discussion of a
> > series of examples.
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
> >
> > N2364 C provenance semantics: detailed semantics.
> > Peter Sewell, Kayvan Memarian, Victor B. F. Gomes.
> > This gives a detailed mathematical semantics for the proposal.
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
> >
> > In addition, at http://cerberus.cl.cam.ac.uk/cerberus we provide an
> > executable version of the semantics, with a web interface that
> > allows one to explore and visualise the behaviour of small test
> > programs, stepping through and seeing the abstract-machine
> > memory state including provenance information. N2363 compares
> > the results of this for the example programs with gcc, clang, and icc
> > results, though the tests are really intended as tests of the semantics
> > rather than compiler tests, so one has to interpret this with care.
> Thanks. I just noticed this came up in EuroLLVM as well.
> Getting some standards clarity in this space would be good.
>
> Richi is in the best position to cover for GCC, but I suspect he's
> buried with gcc-9 issues as we approach the upcoming release. Hopefully
> he'll have time to review this once crunch time has passed. I think more
> than anything sanity checking the proposal's requirements vs what can be
> reasonably implemented is most important at this stage.

Indeed. We talked with him at the GNU Cauldron, without uncovering
any serious problems, but more detailed review from an implementability
point of view would be great.

For the UB mailing list we just made a brief plain-text summary of the
proposal (leaving out all the examples and standards diff, and glossing
over some details). I'll paste that in below in case it's helpful.

The next WG14 meeting is the week of April 29; comments before then
would be particularly useful if that's possible.

best,
Peter


C pointer values are typically represented at runtime as simple
concrete numeric values, but mainstream compilers routinely exploit
information about the "provenance" of pointers to reason that they
cannot alias, and hence to justify optimisations. This is
long-standing practice, but exactly what it means (what programmers
can rely on, and what provenance-based alias analysis is allowed to
do) has never been nailed down. That's what the proposal does.

The basic idea is to associate a *provenance* with every pointer
value, identifying the original storage instance (or allocation, in
other words) that the pointer is derived from. In more detail:

- We take abstract-machine pointer values to be pairs (pi,a), adding a
  provenance pi, either @i where i is a storage instance ID, or the
  *empty* provenance, to their concrete address a.
- On every storage instance creation (of objects with static, thread,
  automatic, and allocated storage duration), the abstract machine
  nondeterministically chooses a fresh storage instance ID i (unique
  across the entire execution), and the resulting pointer value
  carries that single storage instance ID as its provenance @i.
- Provenance is preserved by pointer arithmetic that adds or subtracts
  an integer to a pointer.
- At any access via a pointer value, its numeric address must be
  consistent with its provenance, with undefined behaviour
  otherwise. In particular:
  -- access via a pointer value whose provenance is a single storage
     instance ID @i must be within the memory footprint of the
     corresponding original storage instance, which must still be
     live.
  -- all other accesses, including those via a pointer value with
     empty provenance, are undefined behaviour.

Regarding such accesses as undefined behaviour is necessary to make
optimisation based on provenance alias analysis sound: if the standard
did define behaviour for programs that make provenance-violating
accesses, e.g. by adopting a concrete semantics, optimisation based
on provenance-aware alias analysis would not be sound. In other
words, the provenance lets one distinguish a one-past pointer from a
pointer to the start of an adjacently-allocated object, which
otherwise are indistinguishable.

All this is for the C abstract machine as defined in the standard:
compilers might rely on provenance in their alias analysis and
optimisation, but one would not expect normal implementations to
record or manipulate provenance at runtime (though dynamic or static
analysis tools might).
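
To make the rules above concrete, here is a toy checker for the
(pi,a) pointer values, offered only as a rough sketch. It is not taken
from N2362-N2364 or from Cerberus: every name in it (model_alloc,
model_add, model_end_lifetime, model_access_ok, PROV_EMPTY) is
hypothetical, addresses are supplied by hand rather than chosen by an
allocator, and storage instance IDs are simply counted upwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROV_EMPTY (-1)    /* hypothetical encoding of the empty provenance */
#define MAX_INSTANCES 16   /* arbitrary limit for this sketch */

/* One storage instance: footprint [base, base+size) plus liveness. */
struct instance { uintptr_t base; size_t size; bool live; };

/* Abstract-machine pointer value (pi,a). */
struct ptrval { int prov; uintptr_t addr; };

static struct instance instances[MAX_INSTANCES];
static int next_id = 0;

/* Create a storage instance with a fresh ID and return a pointer to
   its start carrying that ID as provenance. */
struct ptrval model_alloc(uintptr_t base, size_t size) {
    assert(next_id < MAX_INSTANCES);
    instances[next_id] = (struct instance){ base, size, true };
    return (struct ptrval){ next_id++, base };
}

/* Pointer arithmetic changes the address and preserves the provenance.
   (Arithmetic that goes beyond one-past is not modelled here.) */
struct ptrval model_add(struct ptrval p, ptrdiff_t n) {
    p.addr += (uintptr_t)n;
    return p;
}

/* End the lifetime of the storage instance p was derived from. */
void model_end_lifetime(struct ptrval p) {
    assert(p.prov != PROV_EMPTY);
    instances[p.prov].live = false;
}

/* Would an access of len bytes via p be defined under the rules above? */
bool model_access_ok(struct ptrval p, size_t len) {
    if (p.prov == PROV_EMPTY) return false;         /* empty provenance */
    const struct instance *s = &instances[p.prov];
    if (!s->live) return false;                     /* instance is dead */
    if (p.addr < s->base) return false;             /* below the footprint */
    uintptr_t off = p.addr - s->base;
    return off <= s->size && len <= s->size - off;  /* within the footprint */
}
```

With two 4-byte instances placed back to back, say x at 0x1000 and y
at 0x1004, model_add(x, 4) has the same address as y but a different
provenance, so model_access_ok rejects an access through it while
accepting the same access through y.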

Then, to support low-level systems programming, C provides many other
ways to construct and manipulate pointer values:

- casts of pointers to integer types and back, possibly with integer
  arithmetic, e.g. to force alignment, or to store information in
  unused bits of pointers;
- copying pointer values with memcpy;
- manipulation of the representation bytes of pointers, e.g. via user
  code that copies them via char* or unsigned char* accesses;
- type punning between pointer and integer values;
- I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
  the pointer representation bytes, or pointer/integer casts and
  integer I/O;
- copying pointer values with realloc; and
- constructing pointer values that embody knowledge established from
  linking, and from constants that represent the addresses of
  memory-mapped devices.

A satisfactory semantics has to address all these, together with the
implications on optimisation. We've explored several, but our main
proposal is "PNVI-ae-udi" (provenance not via integers,
address-exposed, user-disambiguation).

This semantics does not track provenance via integers. Instead, at
integer-to-pointer cast points, it checks whether the given address
points within a live object that has previously been *exposed* and, if
so, recreates the corresponding provenance.

A storage instance is deemed exposed by a cast of a pointer to it to
an integer type, by a read (at non-pointer type) of the representation
of the pointer, or by an output of the pointer using %p.

The user-disambiguation refinement adds some complexity but supports
roundtrip casts, from pointer to integer and back, of pointers that
are one-past a storage instance.
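
A few of the idioms listed above, written out as a hedged sketch
rather than as examples from N2363. The function names are made up
for illustration. The code assumes the optional uintptr_t type is
available and that, as on mainstream implementations, the low bits of
the converted integer correspond to the low bits of the address. As I
read the summary, the pointer-to-integer casts expose the storage
instance of x, so the casts back should recreate its provenance and
the stores through p should be defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pointer -> integer -> pointer round trip. */
int roundtrip_cast(void) {
    int x = 1;
    uintptr_t i = (uintptr_t)&x;  /* cast to integer: exposes x's instance */
    int *p = (int *)i;            /* address is inside a live, exposed instance */
    *p = 42;
    return x;
}

/* Store a one-bit tag in an unused low bit of an aligned pointer. */
int roundtrip_tagged(void) {
    int x = 1;
    assert(((uintptr_t)&x & 1u) == 0);     /* assumption: low bit is free */
    uintptr_t i = (uintptr_t)&x | 1u;      /* set the tag bit */
    int tag = (int)(i & 1u);               /* read the tag back */
    int *p = (int *)(i & ~(uintptr_t)1);   /* clear the tag before the cast */
    *p = 7;
    return x + tag;
}

/* Copy a pointer value through its representation bytes with memcpy. */
int copy_via_memcpy(void) {
    int x = 1;
    int *p = &x;
    int *q;
    memcpy(&q, &p, sizeof p);
    *q = 5;
    return x;
}
```

Each function writes through the reconstructed pointer and then reads
x directly, so the result shows whether the store reached the
original object.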