From: Richard Sandiford
To: Jakub Jelinek
Cc: Richard Biener, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]
Date: Mon, 27 Feb 2023 15:34:11 +0000
In-Reply-To: (Jakub Jelinek's message of "Fri, 17 Feb 2023 11:14:57 +0100")

Jakub Jelinek writes:
> Hi!
>
> The following testcase is miscompiled on aarch64.  The problem is that
> aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
> targetm.shift_truncation_mask (DImode) == 0 and has HAVE_conditional_move
> true.  If a doubleword shift (in this case TImode) doesn't have its own
> expander (as is the case of e.g. x86) and is handled in generic code,
> there are 3 possible expansions.  One is when the shift count is
> constant; the code in that case computes superword_op1 as
> op1 - BITS_PER_WORD and chooses at compile time which of
> expand_superword_shift or expand_subword_shift to call, which ensures
> that whatever is used actually has its shift count (superword_op1 or
> op1) in the [0, BITS_PER_WORD - 1] range.  If !HAVE_conditional_move
> or that expansion fails, the function handles non-constant op1
> similarly (there are some special cases for
> shift_mask >= BITS_PER_WORD - 1, but let's talk here just about
> shift_mask < BITS_PER_WORD - 1), except that the selection is done at
> runtime, with branches around the two sequences.  While superword_op1
> can be in [-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 in
> [0, 2 * BITS_PER_WORD - 1], the runtime selection ensures that the
> instructions executed at runtime have their corresponding shift count
> again in the right range of [0, BITS_PER_WORD - 1].
> The problematic case is HAVE_conditional_move, which emits both
> sequences into the actually executed code, so necessarily one of them
> has an out-of-range shift count, and then 2 conditional moves pick up
> the right result.
> Now, in the testcase, because -Og doesn't perform VRP/EVRP, the shift
> count isn't optimized to a constant during the GIMPLE passes, but is
> determined to be constant during/after expansion into RTL.  The shift
> count is actually const0_rtx later, so superword_op1 is (const_int -64)
> and we end up with a miscompilation during combine because of that.
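For reference, the condmove strategy computes roughly the following
straight-line code (a C sketch with illustrative names, not the actual
GCC code; the "& 63" masking of the counts is what the patch adds):

/* Hypothetical model of the HAVE_conditional_move expansion of a
   TImode left shift on a 64-bit target.  Both arms are evaluated
   unconditionally; only the final selects depend on the count.  */
#include <stdint.h>

void
shl128_condmove (uint64_t lo, uint64_t hi, unsigned op1,
                 uint64_t *out_lo, uint64_t *out_hi)
{
  unsigned superword_op1 = op1 - 64;	/* wraps when op1 < 64 */

  /* Superword arm, correct only for op1 in [64, 127]; the masking
     keeps its shift count in range even when the arm is discarded.  */
  uint64_t super_hi = lo << (superword_op1 & 63);
  uint64_t super_lo = 0;

  /* Subword arm, correct only for op1 in [0, 63];
     (lo >> 1) >> (~op1 & 63) computes lo >> (64 - op1) without an
     out-of-range shift, even for op1 == 0.  */
  uint64_t carries = (lo >> 1) >> (~op1 & 63);
  uint64_t sub_hi = (hi << (op1 & 63)) | carries;
  uint64_t sub_lo = lo << (op1 & 63);

  /* Conditional moves select the arm whose count was in range.  */
  *out_hi = op1 >= 64 ? super_hi : sub_hi;
  *out_lo = op1 >= 64 ? super_lo : sub_lo;
}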
I haven't worked through the testcase yet, but how does that happen?
Having shifts with out-of-range shift counts shouldn't be a problem
in itself, provided that the conditional move selects the one with
the in-range shift.

Thanks,
Richard

> I'm afraid on targetm.shift_truncation_mask (DImode) == 0 targets we
> have to mask the shift counts when the doubleword shift count is in
> range but one of subword_op1/superword_op1 is not, which is what the
> following patch does and what fixes the wrong code.  Now, targets like
> x86 or aarch64, while they are !SHIFT_COUNT_TRUNCATED, actually have
> patterns to catch shifts with a masked count, so the ANDs can be
> optimized out.  On the other hand, when we know the result will be
> masked this way, we can use the simpler ~op1 instead of
> (BITS_PER_WORD - 1) - op1 in expand_subword_shift.
> So, say on
> __attribute__((noipa)) __int128
> foo (__int128 a, unsigned k)
> {
>   return a << k;
> }
> on aarch64 at -Og the difference is:
> foo:
> 	subs	w5, w2, #64
> -	lsl	x6, x0, x5
> +	lsl	x4, x0, x2
> 	lsr	x3, x0, 1
> -	mov	w4, 63
> -	sub	w4, w4, w2
> -	lsr	x3, x3, x4
> +	mvn	w6, w2
> +	and	w2, w2, 63
> +	lsr	x3, x3, x6
> 	lsl	x1, x1, x2
> 	orr	x1, x3, x1
> 	lsl	x0, x0, x2
> 	csel	x0, xzr, x0, pl
> -	csel	x1, x6, x1, pl
> +	csel	x1, x4, x1, pl
> 	ret
> We could do even better and optimize the and w2, w2, 63 instruction
> out, but that is a matter of costs and so IMHO should be handled
> incrementally.
> For that case consider say:
> void corge (int, int, int);
>
> void
> qux (int x, int y, int z, int n)
> {
>   n &= 31;
>   corge (x << n, y << n, z >> n);
> }
> with -O2 -fno-tree-vectorize, on x86_64 one gets
> 	sarl	%cl, %edx
> 	sall	%cl, %esi
> 	sall	%cl, %edi
> 	jmp	corge
> but on aarch64
> 	and	w3, w3, 31
> 	lsl	w0, w0, w3
> 	lsl	w1, w1, w3
> 	asr	w2, w2, w3
> 	b	corge
> The reason it works on x86 is that its rtx_costs hook recognizes
> that the AND in shift+mask patterns is free.
> Trying 9 -> 11:
>     9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
>       REG_UNUSED flags:CC
>    11: {r91:SI=r87:SI<
>       REG_DEAD r87:SI
>       REG_UNUSED flags:CC
> Failed to match this instruction:
> ...
> Failed to match this instruction:
> ...
> Successfully matched this instruction:
> (set (reg/v:SI 85 [ n ])
>     (and:SI (reg:SI 96)
>         (const_int 31 [0x1f])))
> Successfully matched this instruction:
> (set (reg:SI 91)
>     (ashift:SI (reg/v:SI 87 [ y ])
>         (subreg:QI (and:SI (reg:SI 96)
>             (const_int 31 [0x1f])) 0)))
> allowing combination of insns 9 and 11
> original costs 4 + 4 = 8
> replacement costs 4 + 4 = 8
> Compare that to the aarch64 case:
> Trying 9 -> 11:
>     9: r95:SI=r106:SI&0x1f
>       REG_DEAD r106:SI
>    11: r101:SI=r104:SI<
>       REG_DEAD r104:SI
> Failed to match this instruction:
> ...
> Failed to match this instruction:
> ...
> Successfully matched this instruction:
> (set (reg/v:SI 95 [ n ])
>     (and:SI (reg:SI 106)
>         (const_int 31 [0x1f])))
> Successfully matched this instruction:
> (set (reg:SI 101)
>     (ashift:SI (reg:SI 104)
>         (subreg:QI (and:SI (reg:SI 106)
>             (const_int 31 [0x1f])) 0)))
> rejecting combination of insns 9 and 11
> original costs 4 + 4 = 8
> replacement costs 4 + 8 = 12
> Now, if the rtx costs on *aarch64_ashl_reg_si3_mask1 and similar were
> the same as for a normal shift without the masking, this would succeed
> and we could get rid of the and.
>
> Bootstrapped/regtested on aarch64-linux, ok for trunk?
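As a sanity check on the ~op1 simplification: ~op1 == -op1 - 1, which
is congruent to (BITS_PER_WORD - 1) - op1 modulo BITS_PER_WORD, so the
two forms agree whenever the count is masked anyway.  A standalone
check (illustrative only, not part of the patch):

/* For BITS_PER_WORD == 64: ~op1 and 63 - op1 agree in the low 6 bits,
   so either can feed a shift whose count is masked with & 63.  */
#include <assert.h>

int
main (void)
{
  for (unsigned op1 = 0; op1 < 128; op1++)
    assert ((~op1 & 63) == ((63 - op1) & 63));
  return 0;
}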
>
> 2023-02-17  Jakub Jelinek
>
> 	PR middle-end/108803
> 	* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
> 	if true, use ~op1 & (BITS_PER_WORD - 1) instead of
> 	(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
> 	instead of op1 on the other two shifts.
> 	(expand_doubleword_shift_condmove): Use
> 	superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1.  Pass
> 	shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to expand_subword_shift.
> 	(expand_doubleword_shift): Pass false as MASK_COUNT to
> 	expand_subword_shift.
>
> 	* gcc.dg/pr108803.c: New test.
>
> --- gcc/optabs.cc.jj	2023-01-02 09:32:53.309838465 +0100
> +++ gcc/optabs.cc	2023-02-16 20:44:21.862031535 +0100
> @@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
>  		      rtx outof_input, rtx into_input, rtx op1,
>  		      rtx outof_target, rtx into_target,
>  		      int unsignedp, enum optab_methods methods,
> -		      unsigned HOST_WIDE_INT shift_mask)
> +		      unsigned HOST_WIDE_INT shift_mask, bool mask_count)
>  {
>    optab reverse_unsigned_shift, unsigned_shift;
>    rtx tmp, carries;
> @@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
>  	 are truncated to the mode size.  */
>        carries = expand_binop (word_mode, reverse_unsigned_shift,
>  			      outof_input, const1_rtx, 0, unsignedp, methods);
> -      if (shift_mask == BITS_PER_WORD - 1)
> +      if (shift_mask == BITS_PER_WORD - 1 || mask_count)
>  	{
>  	  tmp = immed_wide_int_const
>  	    (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
> @@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
>  	  tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
>  				       0, true, methods);
>  	}
> +      if (mask_count)
> +	{
> +	  rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
> +						     op1_mode), op1_mode);
> +	  tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
> +				       true, methods);
> +	  op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
> +				       true, methods);
> +	}
>      }
>    if (tmp == 0 || carries == 0)
>      return false;
> @@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
>  {
>    rtx outof_superword, into_superword;
>  
> +  if (shift_mask < BITS_PER_WORD - 1)
> +    {
> +      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
> +				      op1_mode);
> +      superword_op1
> +	= simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
> +				 0, true, methods);
> +    }
> +
>    /* Put the superword version of the output into OUTOF_SUPERWORD and
>       INTO_SUPERWORD.  */
>    outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
> @@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
>    if (!expand_subword_shift (op1_mode, binoptab,
>  			     outof_input, into_input, subword_op1,
>  			     outof_target, into_target,
> -			     unsignedp, methods, shift_mask))
> +			     unsignedp, methods, shift_mask,
> +			     shift_mask < BITS_PER_WORD - 1))
>      return false;
>  
>    /* Select between them.  Do the INTO half first because INTO_SUPERWORD
> @@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
>        return expand_subword_shift (op1_mode, binoptab,
>  				   outof_input, into_input, op1,
>  				   outof_target, into_target,
> -				   unsignedp, methods, shift_mask);
> +				   unsignedp, methods, shift_mask, false);
>      }
>  
>    /* Try using conditional moves to generate straight-line code.  */
> @@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
>    if (!expand_subword_shift (op1_mode, binoptab,
>  			     outof_input, into_input, op1,
>  			     outof_target, into_target,
> -			     unsignedp, methods, shift_mask))
> +			     unsignedp, methods, shift_mask, false))
>      return false;
>  
>    emit_label (done_label);
> --- gcc/testsuite/gcc.dg/pr108803.c.jj	2023-02-16 20:48:53.517032422 +0100
> +++ gcc/testsuite/gcc.dg/pr108803.c	2023-02-16 20:48:45.971143498 +0100
> @@ -0,0 +1,24 @@
> +/* PR middle-end/108803 */
> +/* { dg-do run { target int128 } } */
> +/* { dg-options "-Og" } */
> +
> +unsigned i, j;
> +
> +__int128
> +foo (__int128 a)
> +{
> +  a &= 9;
> +  unsigned k = __builtin_add_overflow_p (i, j, a);
> +  __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
> +  __int128 c = a + b;
> +  return c;
> +}
> +
> +int
> +main (void)
> +{
> +  __int128 x = foo (0);
> +  if (x != 63)
> +    __builtin_abort ();
> +  return 0;
> +}
>
> 	Jakub