From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-323500-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 14514 invoked by alias); 31 Jul 2012 09:09:56 -0000
Received: (qmail 14428 invoked by uid 22791); 31 Jul 2012 09:09:53 -0000
X-SWARE-Spam-Status: No, hits=-6.0 required=5.0	tests=AWL,BAYES_00,KHOP_RCVD_UNTRUST,KHOP_THREADED,RCVD_IN_DNSWL_HI,TW_LH,TW_SR,T_RP_MATCHES_RCVD
X-Spam-Check-By: sourceware.org
Received: from cantor2.suse.de (HELO mx2.suse.de) (195.135.220.15)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Tue, 31 Jul 2012 09:09:39 +0000
Received: from relay2.suse.de (unknown [195.135.220.254])	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx2.suse.de (Postfix) with ESMTP id 694B1A329C;	Tue, 31 Jul 2012 11:09:38 +0200 (CEST)
Date: Tue, 31 Jul 2012 09:11:00 -0000
From: Richard Guenther <rguenther@suse.de>
To: Richard Henderson <rth@redhat.com>
Cc: uweigand@de.ibm.com, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 0/2] Convert s390 to atomic optabs, v2
In-Reply-To: <1343687574-3244-1-git-send-email-rth@redhat.com>
Message-ID: <alpine.LNX.2.00.1207311053340.17088@jbgna.fhfr.qr>
References: <5016C81E.5020709@redhat.com> <1343687574-3244-1-git-send-email-rth@redhat.com>
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
X-SW-Source: 2012-07/txt/msg01537.txt.bz2

On Mon, 30 Jul 2012, Richard Henderson wrote:

> The atomic_load/storedi_1 patterns are fixed to use LM, STM.
> 
> I've had a go at generating better code in the HQImode CAS
> loop for aligned memory, but I don't know that I'd call it
> the most efficient thing ever.  Some of this is due to 
> deficiencies in other parts of the compiler (including the
> s390 backend):
> 
>   (1) MEM_ALIGN can't pass down full align+ofs data that we had
>       during cfgexpand.  This means the opportunities for using
>       the "aligned" path are less than they ought to be.
> 
>   (2) In get_pointer_alignment (used by get_builtin_sync_mem),
>       we don't consider an ADDR_EXPR to return the full alignment
>       that the type is due.  I'm sure this is to work around some
>       other sort of usage via the <string.h> builtins, but it's
>       less-than-handy in this case.
> 
>       I wonder if in get_builtin_sync_mem we ought to be using
>       get_object_alignment (build_fold_indirect_ref (addr)) instead?
> 
>       Consider
> 
> 	struct S { int x; unsigned short y; } g_s;
> 	unsigned short o, n;
> 	void good() {
> 	  __builtin_compare_exchange (&g_s.y, &o, n, 0, 0, 0);
> 	}
> 	void bad(S *p_s) {
> 	  __builtin_compare_exchange (&p_s->y, &o, n, 0, 0, 0);
> 	}
> 
>       where GOOD produces the aligned MEM that we need, and BAD doesn't.

You cannot generally use get_object_alignment here.  Once we have
an address in the middle-end we treat it as generic pointer, happily
not caring about the actual pointer types.  But it seems we do

  /* The alignment needs to be at least according to that of the mode.  */
  set_mem_align (mem, MAX (GET_MODE_ALIGNMENT (mode),
                           get_pointer_alignment (loc)));

anyway?  What do we expect __builtin_compare_exchange to do for
unaligned inputs?  Like

typedef int uint __attribute__((aligned((8))));
unsigned short o, n;
void very_bad (uint *p) {
  __builtin_compare_exchange (p, &o, n, 0, 0, 0);
}

?  Doesn't the above set a wrong alignment?

Back to using get_object_alignment - you cannot blindly use
build_fold_indirect_ref at least, you could use get_object_alignment
on the operand of an ADDR_EXPR address but I am sure we can construct
a testcase where that would give a wrong answer, too (maybe not
easily without violating the C standards rule that pointers have
to be aligned according to their type ... but nobody in practice
follows this and the middle-end does not require this either).

Thus, the bad news is that it's hard for the middle-end to
recover alignment of a memory access that is represented as
a builtin function call that takes addresses as parameters
(which also makes them address-taken and thus possibly aliased).
Didn't Andrew have some patches to introduce a GIMPLE_ATOMIC
eventually side-stepping this issue (maybe that used addresses, too)?

Richard.

>   (3) Support for IC, and ICM via the insv pattern is lacking.
>       I've added a tiny bit of support here, in the form of using
>       the existing strict_low_part patterns, but most definitely we
>       could do better.
> 
>   (4) The *sethighpartsi and *sethighpartdi_64 patterns ought to be
>       more different.  As is, we can't insert into bits 48-56 of a
>       DImode quantity, because we don't generate ICM for DImode,
>       only ICMH.
> 
>   (5) Missing support for RISBGZ in the form of an extv/z expander.
>       The existing *extv/z splitters probably ought to be conditionalized
>       on !Z10.
> 
>   (6) The strict_low_part patterns should allow registers for at
>       least Z10.  The SImode strict_low_part can use LR everywhere.
> 
>   (7) RISBGZ could be used for a 3-address constant lshrsi3 before
>       srlk is available.
> 
> For the GOOD function above, and this patch set, for -O3 -march=z10:
> 
>         larl    %r3,s+4
>         lhrl    %r0,o
>         lhi     %r2,1
>         l       %r1,0(%r3)
>         nilh    %r1,0
> .L2:
>         lr      %r5,%r1
>         larl    %r12,n
>         lr      %r4,%r1
>         risbg   %r4,%r0,32,47,16
>         icm     %r5,3,0(%r12)
>         cs      %r4,%r5,0(%r3)
>         je      .L3
>         lr      %r5,%r4
>         nilh    %r5,0
>         cr      %r5,%r1
>         lr      %r1,%r5
>         jne     .L2
>         lhi     %r2,0
> .L3:
>         srl     %r4,16
>         sthrl   %r4,o
> 
> Odd things:
> 
>    * O is forced into a register before reaching the expander, so we
>      get the RISBG for that.  N is left in a memory and so we commit
>      to using ICM for that.  Further, because of how strict_low_part
>      is implemented we're committed to leaving that in memory.
> 
>    * We don't optimize the loop and hoist the LARL of N outside the loop.
> 
>    * Given that we're having to zap the mask in %r1 for the second
>      compare anyway, I wonder if RISBG is really beneficial over OR.
>      Is RISBG (or ICM for that matter) any faster (or even smaller)?
> 
> 
> r~
> 
> 
> Richard Henderson (2):
>   s390: Reorg s390_expand_insv
>   s390: Convert from sync to atomic optabs
> 
>  gcc/config/s390/s390-protos.h |    3 +-
>  gcc/config/s390/s390.c        |  270 ++++++++++++++++++----------
>  gcc/config/s390/s390.md       |  401 +++++++++++++++++++++++++++++------------
>  3 files changed, 465 insertions(+), 209 deletions(-)
> 
> 

-- 
Richard Guenther <rguenther@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend