From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-518543-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 6180 invoked by alias); 30 Jan 2020 08:31:07 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 6168 invoked by uid 89); 30 Jan 2020 08:31:07 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-2.8 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.1 spammy=fastest, consequences, difficulties
X-HELO: mail-lj1-f194.google.com
Received: from mail-lj1-f194.google.com (HELO mail-lj1-f194.google.com) (209.85.208.194) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 30 Jan 2020 08:31:02 +0000
Received: by mail-lj1-f194.google.com with SMTP id r19so2375433ljg.3        for <gcc-patches@gcc.gnu.org>; Thu, 30 Jan 2020 00:30:59 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com; s=20161025;        h=mime-version:references:in-reply-to:from:date:message-id:subject:to         :cc;        bh=3nPvyEqOTqneGNH4Zb6VrJhrvUVVzQiZipiIpLxziqU=;        b=myaGjYrXNcl/RSvIgcd21kJfc/iHbZ5CWozqDX6jxP85zgFitXwA8bT5vmeVqk2p6U         LYck+lZ31/rUdvCUDd9/A4hR8vIoYMTVO6J1D2UX+XXiz5tDJxD1uSIgrRl+67oCCQ9I         aeEX/W8heExss1w6/cHIufUo9sxRpKai+CShrF5FVsvdqy4pNGxuAf59Bh09dVHWPnTI         RzI7hZJdZMImjLGYf9BmWMZo0Zd5OO+qgHCxsOF/40s9dPA9l4NwciYOUnXle6UXB+2n         WM77181VFW8nuvwYVq8F6XA6FYqNukqBCd82A/HrN+4/qoAM5An1IEGt99RR3JYXVOEt         ArMA==
MIME-Version: 1.0
References: <alpine.LNX.2.20.13.2001201801090.7204@monopod.intra.ispras.ru> <alpine.DEB.2.21.2001202315100.5014@digraph.polyomino.org.uk> <alpine.LNX.2.20.13.2001211527540.32566@monopod.intra.ispras.ru> <nycvar.YFH.7.76.2001211520150.5566@zhemvz.fhfr.qr> <alpine.LNX.2.20.13.2001211752200.32566@monopod.intra.ispras.ru> <nycvar.YFH.7.76.2001220830420.5566@zhemvz.fhfr.qr> <0ea4d9e9-8358-78e3-3153-0eda111d20b7@gmail.com> <CAFiYyc0vmEe9GFzx3QgoByOKghXLKEVoWVgO7sy4oJrVVwoMyA@mail.gmail.com> <1579823158.4442.3.camel@med.uni-goettingen.de> <CAFiYyc0Vqfg6Q11oauaqyxF1ftOE4WyfKGkoTS4iXPWxHSwAJg@mail.gmail.com> <1580172878.9104.1.camel@med.uni-goettingen.de> <alpine.LNX.2.20.13.2001281007170.32444@monopod.intra.ispras.ru> <CAFiYyc0BTPzGVNwD9hCgz2pP9-i+srDBXnPgi-1hg5xnb550fQ@mail.gmail.com> <1580214291.2690.8.camel@med.uni-goettingen.de> <CAFiYyc1yrLkQJMus77smbjmkbvqosmhZEq0EiW5oEBXnQmEVsg@mail.gmail.com> <1580306413.6601.1.camel@med.uni-goettingen.de>
In-Reply-To: <1580306413.6601.1.camel@med.uni-goettingen.de>
From: Richard Biener <richard.guenther@gmail.com>
Date: Thu, 30 Jan 2020 09:47:00 -0000
Message-ID: <CAFiYyc1Bp+hj0TDgsYp-L+ihgE9K5DohGkm9n5CjYQ+L1rnPew@mail.gmail.com>
Subject: Re: [PATCH] doc: clarify the situation with pointer arithmetic
To: "Uecker, Martin" <Martin.Uecker@med.uni-goettingen.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, "amonakov@ispras.ru" <amonakov@ispras.ru>
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2020-01/txt/msg01988.txt.bz2

On Wed, Jan 29, 2020 at 3:00 PM Uecker, Martin
<Martin.Uecker@med.uni-goettingen.de> wrote:
>
> Am Mittwoch, den 29.01.2020, 09:45 +0100 schrieb Richard Biener:
> > On Tue, Jan 28, 2020 at 1:24 PM Uecker, Martin
> > <Martin.Uecker@med.uni-goettingen.de> wrote:
>
> > > >
> > > > Note for the current PTA implementation there's almost no cases we can
> > > > handle conservatively enough.  Consider the simple
> > > >
> > > >  int a[4];
> > > >  int *p = &a[1];
> > > >  uintptr_t pi = (uintptr_t)p;
> > > >  pi += 4;
> > > >  int *q = (int *)pi;
> > > >
> > > > our PTA knows that p points to a (but not the exact offset), same for pi
> > > > (the cast doesn't change the value).  Now you add 4 - this could lead
> > > > you outside of 'a' so the points-to set becomes 'a and anything'.
> > >
> > > PVNI would say that 'a' gets exposed in the first case and
> > > then 'q' can point to all exposed objects because of the
> > > second cast.
> > >
> > > A correct and conservative implementation would do this:
> > > In the PTA you would need to mark the address of a escaped
> > > and then later assign a conservative points-to set (all
> > > escaped objects) to q.
> >
> > I see (I'm asking all these questions to see what implementing -fpta-pnvi
> > would mean).
>
> Thank you for looking into this.
>
> > That might be a feasible implementation route.  How
> > does "escaped" as used in your answer apply when doing inter-procedural
> > analysis?  I guess we would need to assume points-to sets include
> > automatic variables that escaped in the caller.
>
> Yes. Is this not something the compiler assumes in general?
>
> If you have code like this:
>
> int pi = foo(p);
> int *q = bar(pi);
>
> where 'foo' and 'bar' are unknown to the compiler, it
> should have the same  semantics with regard to PTA
> as needed for the casts in PVNI.
>
> (except that one knows that q and p have the same
> value which could be exploited for optimization)
>
> > > Yes, this limits optimizations, but I do not think this is
> > > terrible. (optimizations could be re-enabled with a compiler
> > > option)
> >
> > We'll see.
> >
> > > > I'm also not sure what PVNI does to
> > > >
> > > >  int a[4];
> > > >  int *p = &a[1];
> > > >  p += 10;
> > > >  uintptr_t pi = (uintptr_t)p;
> > > >  p = (int *)pi;
> > > >
> > > > we assume that p points to a even after p += 10 (but it of course points
> > > > outside of the object - obvious here, but not necessarily in more
> > > > obfuscated cases).
> > >
> > > This is UB because a pointer to outside of the object is formed.
> > >
> > > > Now, can we assume pi points to a?  The cast isn't value-changing.  Do we have
> > > > to assume (int *)pi points to anything?  So, is
> > > >
> > > >  p = (int *)(uintptr_t)p;
> > > >
> > > > something like "laundering" a pointer?  We don't preserve such simple round-trip
> > > > casts since they are value-preserving.  Are they provenance preserving?
> > >
> > > Yes, such a pair would be "laundering" as it allows 'p' to then
> > > point to any exposed object provenance-wise.
> > >
> > > For such casts the FE would maybe add a marker. Maybe a calls
> > > to builtin functions 'builtin_expose(a)' and 'builting_bless'.
> > > (having those builtins would be interesting on its own, btw).
> >
> > Uh, ok.
>
> For testing, I implemented builtin_expose by adding
> in 'find_func_aliases_for_builtin_call'
>
>    case BUILT_IN_ESCAPE:
>        {
>           make_escape_constraint (gimple_call_arg (t, 0));
>           return true;
>        }
>
> which seems to work, but I wasn't sure about
> the other function.
>
> There are concurrent algorithms which revive dead pointers.
> They might also need explicit control to work around provenance
> issues.
>
> > > Having said this, some optimizations could still be allowed using
> > > the "as-if" rule and other lines of reasoning. Specifally, PVNI
> > > states that 'p' gets assigned the provenance of the object the
> > > integer values is the address of. So if the compiler can proof
> > > that the address belongs to certain objects it can reassign the
> > > points-to set to the new 'p'. Only if there is ambiguity between
> > > which objects the address belongs to, the reasoning needs to
> > > be more conservative.
> > >
> > > For example:
> > > int a[3];
> > > int b[3];
> > >
> > > &b; // b also exposed
>
> Correction: this should be:
>
> (intptr_t)&b;
>
> as it is the cast to integers and not just the address-taken
> operation that marks the object exposed.
>
> > > int *p = (int*)(uintptr_t)&a[3];
> > >
> > > Here, p could point to the one-after address of 'a' or the
> > > first element of 'b'. (but only because 'b' was also exposed).
> > >
> > > If the compiler can prove that something like this can not
> > > happen (e.g. by considering offsets), it can still do some
> > > tracking of points-to sets.
> >
> > That's probably the very case that we'll get wrong since
> > we definitely won't be able to reliably preserve these
> > kind of laundering points...
>
> Maybe the FE could add the builtins ?
>
> > I guess they could be obfuscated like
> >
> > union { void *p; long l; } u;
> >
> > u.p = p;
> > u.l = u.l;
> > p = u.p;
> >
> > where GCC (and IIRC also now the standard) allows the
> > type-punning when reading u.p via u.l and vice versa.
>
> It is allowed in C but not in C++.
>
> > That is, the "conversions" might be hidden in the
> > memory access types.  That means (our PTA tracks
> > points-to sets of memory) that all stores of pointers
> > (even to automatic vars that are themselves not exposed)
> > make them escaped and CSE would need to desperately
> > avoid CSEing the load from u.p to p or insert laundering
> > operations.
>
> I am not sure I fully understand the proposal.
>
> > I guess I'd me much more happy if PVNI said that when
> > an integer is converted to a pointer and the integer
> > is value-equivalent to pointers { p1, p2, ... } then
> > the provenance of the resulting pointer is
> > that of p1 (or p2, ... which is semantically equivalent)
>
> (if the provenance is the same)
>
> > and when two pointers p1 and p2 are
> > value-equivalent and their provenance is not the
> > same then the behavior is undefined.
>
> I see. Then here..
>
> int a[3];
> int b[3];
>
> (uintptr_t)&b[0]; // b also exposed
> int *p = (int*)(uintptr_t)&a[3];
>
> ..the behavior is undefined because the
> two pointers have identical addresses
> but different provenance.
>
> I agree, from a compiler writer's point-of-view
> this would be a good solution. But to a programmer,
> this would be quite difficult to explain.
> The preference of the working group was that the casts
> should just work in all cases and do what the
> programmer intended, even if this prevents some
> optimization. But I will see that this is
> added to the list of options under consideration.
>
>
> PVNI-ae-ud assigns the provenance of an
> exposed object at the address. If there
> are two possible objects (as in the example
> above), the pointer could point to both but
> then has to be used consistently only with
> only one object. Essentially, we want the
> pointer to have exactly one provenance but
> we might delay the decision. The idea is
> that a compiler might figure out the correct
> provenance later, e.g. by observing accesses.

I thought about alternatives to PVNI and implementation
consequences.  But all different kind of must-behave-like-this
guarantees face serious implementation difficulties I think
so the only alternative to PVNI (which I think is implementable
but at a optimization opportunity cost) is one that makes
two pointers with the same value always have the same
provenance (and otherwise make the behavior undefined).

> It is possible to formulate
> some conditions about when a pointer converted
> from an integer could get assigned the
> points-to-set of a value-equivalent pointer:
>
> 1) using knowledge about object location in
> memory: If there is no adjacent object which
> was exposed, one can conclude that the
> provenance is the object at this address.

Usually at the point compilers want to know objects
are not laid out.  So what compilers do is simply
say the user cannot possibly know so it can
choose at will (even if later object layout disagrees).

> 2) based on offsets: If the pointer points
> in the middle of an object, there is also
> no ambiguity.

The difficulty here lies in the requirement of
exact offset tracking which makes (some?)
points-to implementations prohibitly expensive.
But yes, sure.

> 3) a mix of both, to differentiate objects
> before and after in memory.
>
>
> >  That is,
> >
> > int a, b;
> > int  *p = &a + 1;
> > int *q = &b;
> > if (p == q)
> >   ... undefined ...
>
> We considered making the comparison undefined in the
> specific situation where one of the pointer is one-after
> pointer and the other a pointer to the beginning of a
> different object. This would solve the problems with
> conditional equivalences.

Note my proposal doesn't make the comparison undefined
but the case where both are equivalent cannot be reached
at runtime without invoking undefined behavior.  That means
we can optimize the comparison based on provenance
where p points to a and q points to b.

> Others proposed to make the result of the comparison
> unspecified, but I think this does not help.

Indeed.  It's not unspecified, it's known to evaluate to false.
I think there's existing wording in the standard that
allows it to evaluate to true for pointers one-after-the-object,
that would need to be changed of course.

> At the moment, the consensus is that pointer
> comparison should be always allowed and the
> result should only depend on the address. Again,
> the idea is to make is simpler and more consistent
> for the programmer. But yes, this makes it more
> difficult for the compiler writer.

it's a conflict of interest on the user side as well - users
expect DWIM semantics but at the same time want
fastest possible code...

Richard.

> Best,
> Martin