[PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]
From: Jakub Jelinek @ 2023-02-17 10:14 UTC
  To: Richard Sandiford, Richard Biener; +Cc: gcc-patches

Hi!

The following testcase is miscompiled on aarch64.  The problem is that
aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
targetm.shift_truncation_mask (DImode) == 0 and HAVE_conditional_move
true.  If a doubleword shift (in this case TImode) doesn't have its own
expander (as is the case of e.g. x86) and is handled in the generic
code, there are three possible expansions.  The first is used when the
shift count is constant: the code computes superword_op1 as
op1 - BITS_PER_WORD and chooses at compile time which of
expand_superword_shift or expand_subword_shift to call, which ensures
that whichever is used actually has its shift count (superword_op1 or
op1) in the [0, BITS_PER_WORD - 1] range.  If !HAVE_conditional_move or
that expansion fails, the function handles non-constant op1 similarly
(there are some special cases for shift_mask >= BITS_PER_WORD - 1, but
let's talk here just about shift_mask < BITS_PER_WORD - 1), except that
the selection is done at runtime, with branches around the two
sequences.  While superword_op1 can be in
[-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 in
[0, 2 * BITS_PER_WORD - 1], the runtime selection ensures that the
instructions actually executed have their corresponding shift count
again in the right range of [0, BITS_PER_WORD - 1].
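As a rough illustration (my sketch in plain C, not the RTL GCC emits;
assume BITS_PER_WORD == 64 and an unsigned left shift; all names are
mine), the branchy expansion behaves like:

typedef struct { unsigned long long lo, hi; } u128;

static u128
shl128_branchy (u128 a, unsigned op1)
{
  u128 r;
  if (op1 >= 64)
    {
      /* Superword case: the subshift count op1 - 64 is in [0, 63].  */
      r.hi = a.lo << (op1 - 64);
      r.lo = 0;
    }
  else
    {
      /* Subword case: op1 is in [0, 63]; (a.lo >> 1) >> (63 - op1)
         stands in for a.lo >> (64 - op1) and avoids a shift by 64
         when op1 == 0.  */
      r.hi = (a.hi << op1) | ((a.lo >> 1) >> (63 - op1));
      r.lo = a.lo << op1;
    }
  return r;
}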
The problematic case is the HAVE_conditional_move one, which emits both
sequences into the actually executed code, so necessarily one of them
runs with an out-of-range shift count, and two conditional moves then
pick up the correct half of the result.
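The same thing with conditional moves, again as a hedged C sketch
(reusing the u128 type from the sketch above), makes the bug visible:

static u128
shl128_condmove (u128 a, unsigned op1)
{
  unsigned superword_op1 = op1 - 64;  /* wraps when op1 < 64  */
  /* Superword arm: only valid for op1 >= 64; for op1 == 0 (the
     PR108803 situation) it executes a shift by (unsigned) -64.  */
  unsigned long long hi_super = a.lo << superword_op1;
  /* Subword arm: only valid for op1 <= 63; out of range otherwise.  */
  unsigned long long hi_sub = (a.hi << op1) | ((a.lo >> 1) >> (63 - op1));
  unsigned long long lo_sub = a.lo << op1;
  /* Two conditional moves (csel on aarch64) pick the valid arm, but
     both arms have already executed, one with a bad count.  */
  u128 r;
  r.hi = op1 >= 64 ? hi_super : hi_sub;
  r.lo = op1 >= 64 ? 0 : lo_sub;
  return r;
}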
Now, in the testcase, because -Og doesn't perform VRP/EVRP, the shift
count isn't optimized to a constant during the GIMPLE passes, but is
determined to be constant during/after expansion into RTL.  The shift
count is actually const0_rtx later, so superword_op1 is (const_int -64)
and we end up with a miscompilation during combine because of that.
I'm afraid that on targetm.shift_truncation_mask (DImode) == 0 targets
we have to mask the shift counts when the doubleword shift count is in
range but one of subword_op1/superword_op1 is not, which is what the
following patch does and what fixes the wrong code.  Now, targets like
x86 or aarch64, while they are !SHIFT_COUNT_TRUNCATED, actually have
patterns that match a shift with a masked count, so the ANDs can be
optimized out.  On the other hand, when we know the result will be
masked this way, we can use the simpler ~op1 instead of
(BITS_PER_WORD - 1) - op1 in expand_subword_shift.
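The ~op1 trick is valid because once both sides are reduced mod
BITS_PER_WORD the two expressions agree; a quick self-contained check
of that identity (my sketch, not part of the patch):

#include <assert.h>

int
main (void)
{
  /* ~op1 & 63 == (63 - op1) & 63 for any unsigned op1, because
     ~op1 == -op1 - 1 in two's complement and -1 == 63 (mod 64).  */
  for (unsigned op1 = 0; op1 < 256; op1++)
    assert ((~op1 & 63) == ((63 - op1) & 63));
  return 0;
}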
So, say on
__attribute__((noipa)) __int128
foo (__int128 a, unsigned k)
{
  return a << k;
}
on aarch64 at -Og the difference is:
 foo:
        subs    w5, w2, #64
-       lsl     x6, x0, x5
+       lsl     x4, x0, x2
        lsr     x3, x0, 1
-       mov     w4, 63
-       sub     w4, w4, w2
-       lsr     x3, x3, x4
+       mvn     w6, w2
+       and     w2, w2, 63
+       lsr     x3, x3, x6
        lsl     x1, x1, x2
        orr     x1, x3, x1
        lsl     x0, x0, x2
        csel    x0, xzr, x0, pl
-       csel    x1, x6, x1, pl
+       csel    x1, x4, x1, pl
        ret
We could do even better and optimize out the and w2, w2, 63 instruction,
but that is a matter of costs and so IMHO should be handled incrementally.
For that case consider say:
void corge (int, int, int);

void
qux (int x, int y, int z, int n)
{
  n &= 31;
  corge (x << n, y << n, z >> n);
}
with -O2 -fno-tree-vectorize, on x86_64 one gets
        sarl    %cl, %edx
        sall    %cl, %esi
        sall    %cl, %edi
        jmp     corge
but on aarch64
        and     w3, w3, 31
        lsl     w0, w0, w3
        lsl     w1, w1, w3
        asr     w2, w2, w3
        b       corge
The reason it works on x86 is that its rtx_costs hook recognizes
that the AND in the shift+mask patterns is free.
Trying 9 -> 11:
    9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
      REG_UNUSED flags:CC
   11: {r91:SI=r87:SI<<r85:SI#0;clobber flags:CC;}
      REG_DEAD r87:SI
      REG_UNUSED flags:CC
Failed to match this instruction:
...
Failed to match this instruction:
...
Successfully matched this instruction:
(set (reg/v:SI 85 [ n ])
    (and:SI (reg:SI 96)
        (const_int 31 [0x1f])))
Successfully matched this instruction:
(set (reg:SI 91)
    (ashift:SI (reg/v:SI 87 [ y ])
        (subreg:QI (and:SI (reg:SI 96)
                (const_int 31 [0x1f])) 0)))
allowing combination of insns 9 and 11
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
Compare that to the aarch64 case:
Trying 9 -> 11:
    9: r95:SI=r106:SI&0x1f
      REG_DEAD r106:SI
   11: r101:SI=r104:SI<<r95:SI#0
      REG_DEAD r104:SI
Failed to match this instruction:
...
Failed to match this instruction:
...
Successfully matched this instruction:
(set (reg/v:SI 95 [ n ])
    (and:SI (reg:SI 106)
        (const_int 31 [0x1f])))
Successfully matched this instruction:
(set (reg:SI 101)
    (ashift:SI (reg:SI 104)
        (subreg:QI (and:SI (reg:SI 106)
                (const_int 31 [0x1f])) 0)))
rejecting combination of insns 9 and 11
original costs 4 + 4 = 8
replacement costs 4 + 8 = 12
Now, if the rtx costs of *aarch64_ashl_reg_si3_mask1 and similar
patterns were the same as for a normal shift without the masking, this
combination would succeed and we could get rid of the and.
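For reference, the kind of rtx_costs handling that would achieve this
might look as follows; this is a hypothetical fragment of a
TARGET_RTX_COSTS implementation (example_shift_mask_costs is my name,
not an existing function), not the actual x86 or aarch64 hook:

/* Hypothetical rtx_costs fragment: cost a shift whose count is a
   masking AND as if the AND weren't there, because the hardware
   truncates the count anyway.  A real hook would also check that the
   AND mask equals the mode's bit width minus one.  */
static bool
example_shift_mask_costs (rtx x, machine_mode mode, int *total, bool speed)
{
  if (GET_CODE (x) == ASHIFT
      || GET_CODE (x) == ASHIFTRT
      || GET_CODE (x) == LSHIFTRT)
    {
      rtx count = XEXP (x, 1);
      if (SUBREG_P (count))
	count = SUBREG_REG (count);
      if (GET_CODE (count) == AND && CONST_INT_P (XEXP (count, 1)))
	{
	  /* One shift insn plus the cost of the shifted operand; the
	     AND on the count is treated as free.  */
	  *total = (COSTS_N_INSNS (1)
		    + rtx_cost (XEXP (x, 0), mode, GET_CODE (x), 0, speed));
	  return true;
	}
    }
  return false;
}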

Bootstrapped/regtested on aarch64-linux, ok for trunk?

2023-02-17  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/108803
	* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
	if true, use ~op1 & (BITS_PER_WORD - 1) instead of
	(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
	instead of op1 on the other two shifts.
	(expand_doubleword_shift_condmove): Use
	superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1.  Pass
	shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to expand_subword_shift.
	(expand_doubleword_shift): Pass false as MASK_COUNT to
	expand_subword_shift.

	* gcc.dg/pr108803.c: New test.

--- gcc/optabs.cc.jj	2023-01-02 09:32:53.309838465 +0100
+++ gcc/optabs.cc	2023-02-16 20:44:21.862031535 +0100
@@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
 		      rtx outof_input, rtx into_input, rtx op1,
 		      rtx outof_target, rtx into_target,
 		      int unsignedp, enum optab_methods methods,
-		      unsigned HOST_WIDE_INT shift_mask)
+		      unsigned HOST_WIDE_INT shift_mask, bool mask_count)
 {
   optab reverse_unsigned_shift, unsigned_shift;
   rtx tmp, carries;
@@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
 	 are truncated to the mode size.  */
       carries = expand_binop (word_mode, reverse_unsigned_shift,
 			      outof_input, const1_rtx, 0, unsignedp, methods);
-      if (shift_mask == BITS_PER_WORD - 1)
+      if (shift_mask == BITS_PER_WORD - 1 || mask_count)
 	{
 	  tmp = immed_wide_int_const
 	    (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
@@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
 	  tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
 				       0, true, methods);
 	}
+      if (mask_count)
+	{
+	  rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
+					   op1_mode), op1_mode);
+	  tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
+				       true, methods);
+	  op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
+				       true, methods);
+	}
     }
   if (tmp == 0 || carries == 0)
     return false;
@@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
 {
   rtx outof_superword, into_superword;
 
+  if (shift_mask < BITS_PER_WORD - 1)
+    {
+      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
+				      op1_mode);
+      superword_op1
+	= simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
+				 0, true, methods);
+    }
+
   /* Put the superword version of the output into OUTOF_SUPERWORD and
      INTO_SUPERWORD.  */
   outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
@@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
   if (!expand_subword_shift (op1_mode, binoptab,
 			     outof_input, into_input, subword_op1,
 			     outof_target, into_target,
-			     unsignedp, methods, shift_mask))
+			     unsignedp, methods, shift_mask,
+			     shift_mask < BITS_PER_WORD - 1))
     return false;
 
   /* Select between them.  Do the INTO half first because INTO_SUPERWORD
@@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
 	return expand_subword_shift (op1_mode, binoptab,
 				     outof_input, into_input, op1,
 				     outof_target, into_target,
-				     unsignedp, methods, shift_mask);
+				     unsignedp, methods, shift_mask, false);
     }
 
   /* Try using conditional moves to generate straight-line code.  */
@@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
   if (!expand_subword_shift (op1_mode, binoptab,
 			     outof_input, into_input, op1,
 			     outof_target, into_target,
-			     unsignedp, methods, shift_mask))
+			     unsignedp, methods, shift_mask, false))
     return false;
 
   emit_label (done_label);
--- gcc/testsuite/gcc.dg/pr108803.c.jj	2023-02-16 20:48:53.517032422 +0100
+++ gcc/testsuite/gcc.dg/pr108803.c	2023-02-16 20:48:45.971143498 +0100
@@ -0,0 +1,24 @@
+/* PR middle-end/108803 */
+/* { dg-do run { target int128 } } */
+/* { dg-options "-Og" } */
+
+unsigned i, j;
+
+__int128
+foo (__int128 a)
+{
+  a &= 9;
+  unsigned k = __builtin_add_overflow_p (i, j, a);
+  __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
+  __int128 c = a + b;
+  return c;
+}
+
+int
+main (void)
+{
+  __int128 x = foo (0);
+  if (x != 63)
+    __builtin_abort ();
+  return 0;
+}

	Jakub

