From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Subject: Re: [PATCH] Use type alignment in get_builtin_sync_mem
To: richard.guenther@gmail.com (Richard Biener)
Date: Mon, 09 Sep 2019 14:17:00 -0000
From: "Ulrich Weigand"
Cc: gcc-patches@gcc.gnu.org (GCC Patches)
In-Reply-To: from "Richard Biener" at Sep 09, 2019 10:37:24 AM
Message-Id: <20190909141720.7BC33D802F0@oc3748833570.ibm.com>

Richard Biener wrote:
> On Fri, Sep 6, 2019 at 3:00 PM Ulrich Weigand wrote:
> > But as far as I can see, for *atomic* operations at least, we do make
> > that assumption.  The x86 back-end for example just assumes that any
> > "int" or "long" object that is the target of an atomic operation is
> > naturally aligned, or else the generated code would just crash.
> > So if you used your example with a packed struct and passed that pointer
> > to an atomic, the back-end would still assume the alignment and the
> > code would crash.  But I'd still consider this a perfectly reasonable
> > and expected behavior in this case ...
>
> Would it crash?  I think it would be not atomic if it crossed a cache-line
> boundary.

Sorry, I misremembered the x86 operations: it does indeed work for
unaligned 4- or 8-byte accesses.  However, for 16-byte accesses,
CMPXCHG16B does require aligned memory; the manual says:

  Note that CMPXCHG16B requires that the destination (memory) operand
  be 16-byte aligned.  [...]
  64-Bit Mode Exceptions [...]
  #GP(0)  If the memory address is in a non-canonical form.
          If memory operand for CMPXCHG16B is not aligned on a
          16-byte boundary.  [...]

So this is basically the same situation as on s390, except that on x86
the default TImode alignment is already 16 bytes.

> > Of course if some part of the middle end gets the alignment wrong, we
> > have a problem.  But I'm not sure how this could happen here.  I agree
> > that it might be the case that a user-specified *under*-alignment might
> > get lost somewhere (e.g. you have a packed int and somewhere in the
> > optimizers this gets accidentally cast to a normal int with the default
> > alignment).  But in *this* case, what would have to happen is that the
> > middle-end accidentally casts something to a pointer with *higher*
> > than the default alignment for the type, even though no such pointer
> > cast is present in the source.  Can this really happen?
>
> If the cast to the lower-aligned type is lost and there is an earlier cast
> to the aligned type.

My point is that this "cast to the aligned type" must have come from the
user in this case (since the aligned type is *more* aligned than any
standard version of the type), and if the user casts the value to the
aligned type, it is undefined behavior anyway if the value was in fact
not aligned.
> > This would actually
> > be wrong on s390.  The problem is that all atomic operations on any
> > one single object need to be consistent: they either all use the
> > 16-byte atomic instruction, or else they all protect the access with
> > a lock.  If you have parts of the code use the lock and other parts
> > use the instruction, they're not actually protected against each other.
>
> But then the user has to be consistent in accessing the object
> atomically.  If he accesses it once as (aligned_int128_t *)
> and once as (int128_t *) it's his fault, no?

I'm not sure why this should be a requirement.  E.g. if we have a set of
subroutines that operate (correctly) on any int128_t *, aligned or not,
and have one user of those routines that actually locally has an
aligned_int128_t, then that user could no longer safely pass a pointer
to its aligned variable to that subsystem if it also does atomic
operations locally ...

> If we'd document that the user invokes undefined behavior
> when performing __builtin_atomic () on objects that are not
> sufficiently aligned according to target specific needs then
> we are of course fine and should simply assume the memory
> is aligned accordingly (similar to your patch but probably with
> some target hook specifying the atomic alignment requirement
> when it differs from mode alignment).  But I don't read the
> documentation of our atomic builtins that way.
>
> Does _Atomic __int128_t work properly on s390?

Yes, it currently does work properly in all cases (just not in all cases
as efficiently as it could be).  The rule for performing atomic
operations on __int128_t on s390 is:

- If the object is *actually* 16-byte aligned at runtime, then every
  atomic access must be performed using one of the atomic instructions
  (CDSG, LPQ, STPQ).

- If the object is actually *not* 16-byte aligned, then every atomic
  access must be performed under protection of an out-of-line lock.
This rule is correctly implemented by:

- The __builtin_*_16 family of libatomic library routines; these all
  perform a run-time alignment check and use either the instruction or
  the lock, as appropriate; and

- Compiler-generated inline code; this will always use the instruction,
  but the compiler will generate it only if it can prove at compile time
  that the object *must* be aligned at run time.

[ However, this rule is *not* implemented by the _n family of libatomic
library routines.  That is not a problem at the moment since those will
*never* get called on any object of size 16; but they would be if we
implemented your proposal; that's why I pointed out that this would then
*introduce* unsafe behavior. ]

The original point of why I started this discussion has nothing to do
with safety -- code is currently safe and would remain safe after this
change -- but simply performance: we would get compiler-generated inline
code more frequently than we do today.  (This is not *just* performance
but also convenience, since we can avoid having to explicitly link in
libatomic in many cases as well.  This would help with portability of
some Makefiles etc. to s390, where they don't link in libatomic since it
is not needed on x86 ...)

> > This is why the _16 version of the library routine does the runtime
> > alignment check, so that all accesses to actually 16-byte aligned
> > objects use the instruction, both in the library and in
> > compiler-generated code.  The _n version doesn't do that.
> >
> > So I guess we'd have to convert the _n version back to the _16
> > library call if the size is constant 16, or something?
>
> If it is aligned 16?  But not sure - see above.

No, it would have to be done no matter whether it is aligned or not
(i.e. whether it can be proven aligned at compile time); the point is
that we must call the _16 version of the library routine so that it can
do the run-time alignment check and perform the appropriate action.
> > > So yes, I'd say try tackling the issue in the frontend which is the
> > > last to know the alignment in the most optimistic way (according
> > > to the C standard).  At RTL expansion time we have to be (very)
> > > conservative.
> > >
> > > Btw, the alternative is to add an extra alignment parameter
> > > to all of the atomic builtins where the FE could stick the alignment
> > > for us to use during RTL expansion.  But given we have the
> > > _{1,2,4,8,16,N} variants it sounds easier to use those...
> > >
> > > Note we only document __atomic_load and __atomic_load_n so
> > > the _{1,2,4,8,16} variants seem to be implementation detail.
> >
> > Adding the alignment parameter would work, I think.
>
> But then the user still needs to access the int128_t with
> consistent alignment, no?  Otherwise it will again not protect
> against each other?

See above, it will still work *correctly* in either case.  The alignment
info simply allows the compiler to safely choose the fast (inlined)
variant.

> Why not always use the slow & correct variant for __int128?

As above, both slow & fast variants are correct; it's just that the fast
variant is, well, faster and more convenient.

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  Ulrich.Weigand@de.ibm.com