Date: Fri, 17 Feb 2023 11:14:57 +0100
From: Jakub Jelinek
To: Richard Sandiford, Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]

Hi!

The following testcase is miscompiled on aarch64.  The problem is that
aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
targetm.shift_truncation_mask (DImode) == 0 and with HAVE_conditional_move
true.  If a doubleword shift (TImode in this case) doesn't have its own
expander (unlike e.g. x86, which has one) and is handled in generic code,
there are three possible expansions.
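For orientation, here is a minimal C sketch of the identity all three
expansions implement.  It is my own illustration (64-bit words, left shift
only), not code from optabs.cc:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } dw;

/* (hi:lo) << n split into word-sized shifts; every count that actually
   reaches a hardware shift must end up in [0, 63].  */
static dw
shl128 (dw x, unsigned n)  /* n in [0, 127] */
{
  dw r;
  if (n >= 64)
    {
      /* Superword case: the count is superword_op1 = n - 64.  */
      r.hi = x.lo << (n - 64);
      r.lo = 0;
    }
  else
    {
      /* Subword case: lo >> (64 - n) is done as (lo >> 1) >> (63 - n)
         so that the count is valid even for n == 0.  */
      r.hi = (x.hi << n) | ((x.lo >> 1) >> (63 - n));
      r.lo = x.lo << n;
    }
  return r;
}

The constant-count expansion picks one of the two branches at compile time,
the branchy expansion performs the selection at runtime, and the
conditional-move expansion emits both branches straight-line and selects the
results afterwards; the three variants are described next.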
One is when the shift count is constant; in that case the code computes
superword_op1 as op1 - BITS_PER_WORD and chooses at compile time which of
expand_superword_shift or expand_subword_shift to call, which ensures that
whichever of the two is used actually has its shift count (superword_op1 or
op1) in the [0, BITS_PER_WORD - 1] range.
If !HAVE_conditional_move, or if that expansion fails, the function handles
non-constant op1 similarly (there are some special cases for
shift_mask >= BITS_PER_WORD - 1, but let's talk here just about
shift_mask < BITS_PER_WORD - 1), except that the selection is done at
runtime, with branches around the two sequences.  While superword_op1 can be
in [-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 in
[0, 2 * BITS_PER_WORD - 1], the runtime selection ensures that the
instructions actually executed have their shift counts again in the right
[0, BITS_PER_WORD - 1] range.
The problematic expansion is the HAVE_conditional_move one, which emits both
sequences into the straight-line executed code, so necessarily one of them
is executed with an out-of-range shift count, and two conditional moves then
pick up the correct result.

Now, in the testcase, because -Og doesn't perform VRP/EVRP, the shift count
isn't optimized to a constant during the GIMPLE passes, but is only
determined to be constant during/after expansion into RTL.  The shift count
is actually const0_rtx later on, so superword_op1 is (const_int -64) and we
end up with a miscompilation during combine because of that.

I'm afraid that on targetm.shift_truncation_mask (DImode) == 0 targets we
have to mask the shift counts when the doubleword shift count is in range
but one of subword_op1/superword_op1 is not; that is what the following
patch does, and it is what fixes the wrong-code (a plain C model of the
fixed expansion follows the combine dumps below).  Now, targets like x86 or
aarch64, while they are !SHIFT_COUNT_TRUNCATED, do actually have patterns
that match a shift with a masked count, so the ANDs can be optimized out.
On the other hand, when we know the count will be masked this way, we can
use the simpler ~op1 instead of (BITS_PER_WORD - 1) - op1 in
expand_subword_shift.

So, say on
__attribute__((noipa)) __int128
foo (__int128 a, unsigned k)
{
  return a << k;
}
on aarch64 at -Og the difference is:
foo:
        subs    w5, w2, #64
-       lsl     x6, x0, x5
+       lsl     x4, x0, x2
        lsr     x3, x0, 1
-       mov     w4, 63
-       sub     w4, w4, w2
-       lsr     x3, x3, x4
+       mvn     w6, w2
+       and     w2, w2, 63
+       lsr     x3, x3, x6
        lsl     x1, x1, x2
        orr     x1, x3, x1
        lsl     x0, x0, x2
        csel    x0, xzr, x0, pl
-       csel    x1, x6, x1, pl
+       csel    x1, x4, x1, pl
        ret

We could do even better and optimize the and w2, w2, 63 instruction out,
but that is a matter of costs and so IMHO should be handled incrementally.
For that case consider say:

void corge (int, int, int);

void
qux (int x, int y, int z, int n)
{
  n &= 31;
  corge (x << n, y << n, z >> n);
}

With -O2 -fno-tree-vectorize, on x86_64 one gets:
        sarl    %cl, %edx
        sall    %cl, %esi
        sall    %cl, %edi
        jmp     corge
but on aarch64:
        and     w3, w3, 31
        lsl     w0, w0, w3
        lsl     w1, w1, w3
        asr     w2, w2, w3
        b       corge

The reason it works on x86 is that its rtx_costs hook recognizes that the
AND in a shift+mask pattern is for free; combine there merges the AND into
the shift:
Trying 9 -> 11:
    9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
      REG_UNUSED flags:CC
   11: {r91:SI=r87:SI<<r85:SI;clobber flags:CC;}
while the same attempt on aarch64 does not succeed:
Trying 9 -> 11:
    9: r95:SI=r106:SI&0x1f
      REG_DEAD r106:SI
   11: r101:SI=r104:SI<<r95:SI
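As mentioned above, here is a plain C model of the fixed conditional-move
expansion.  This is a hedged sketch with made-up names (dw,
shl128_condmove), not GCC code: it shows why masking every word-sized shift
count with BITS_PER_WORD - 1 == 63 makes both straight-line arms safe to
execute even for a count of 0 (where the unmasked superword count would be
-64), and it uses the ~op1 form for the carries count.  The harness checks
all 128 counts against __int128:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } dw;

static dw
shl128_condmove (dw x, unsigned n)  /* n in [0, 127] */
{
  /* Superword arm: without the masking the count would be n - 64,
     i.e. -64 for n == 0; the & 63 keeps it in [0, 63].  */
  uint64_t sup_hi = x.lo << ((n - 64) & 63);

  /* Subword arm: both counts are masked as well, and the carries
     count (63 - n) & 63 simplifies to ~n & 63.  */
  uint64_t sub_hi = (x.hi << (n & 63)) | ((x.lo >> 1) >> (~n & 63));
  uint64_t sub_lo = x.lo << (n & 63);

  /* Both arms were computed unconditionally; two conditional moves
     select the valid results.  */
  dw r;
  r.hi = n >= 64 ? sup_hi : sub_hi;
  r.lo = n >= 64 ? 0 : sub_lo;
  return r;
}

int
main (void)
{
  dw x = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
  for (unsigned n = 0; n < 128; n++)
    {
      unsigned __int128 ref = (((unsigned __int128) x.hi << 64) | x.lo) << n;
      dw got = shl128_condmove (x, n);
      if (got.lo != (uint64_t) ref || got.hi != (uint64_t) (ref >> 64))
        __builtin_abort ();
    }
  return 0;
}

Dropping the & 63 from the superword count reproduces the bug pattern: for
n == 0 that arm would shift by -64, and it is exactly such an out-of-range
count, folded to (const_int -64) at expansion time, that combine
miscompiled in the PR.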
	PR middle-end/108803
	* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
	if true, use ~op1 & (BITS_PER_WORD - 1) instead of
	(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
	instead of op1 on the other two shifts.
	(expand_doubleword_shift_condmove): Use
	superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1.
	Pass shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to
	expand_subword_shift.
	(expand_doubleword_shift): Pass false as MASK_COUNT to
	expand_subword_shift.

	* gcc.dg/pr108803.c: New test.

--- gcc/optabs.cc.jj	2023-01-02 09:32:53.309838465 +0100
+++ gcc/optabs.cc	2023-02-16 20:44:21.862031535 +0100
@@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
 		      rtx outof_input, rtx into_input, rtx op1,
 		      rtx outof_target, rtx into_target,
 		      int unsignedp, enum optab_methods methods,
-		      unsigned HOST_WIDE_INT shift_mask)
+		      unsigned HOST_WIDE_INT shift_mask, bool mask_count)
 {
   optab reverse_unsigned_shift, unsigned_shift;
   rtx tmp, carries;
@@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
 	 are truncated to the mode size.  */
       carries = expand_binop (word_mode, reverse_unsigned_shift,
 			      outof_input, const1_rtx, 0, unsignedp, methods);
-      if (shift_mask == BITS_PER_WORD - 1)
+      if (shift_mask == BITS_PER_WORD - 1 || mask_count)
 	{
 	  tmp = immed_wide_int_const
 	    (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
@@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
 	  tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
 				       0, true, methods);
 	}
+      if (mask_count)
+	{
+	  rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
+						     op1_mode), op1_mode);
+	  tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
+				       true, methods);
+	  op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
+				       true, methods);
+	}
     }
   if (tmp == 0 || carries == 0)
     return false;
@@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
 {
   rtx outof_superword, into_superword;
 
+  if (shift_mask < BITS_PER_WORD - 1)
+    {
+      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
+				      op1_mode);
+      superword_op1
+	= simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
+				 0, true, methods);
+    }
+
   /* Put the superword version of the output into OUTOF_SUPERWORD
      and INTO_SUPERWORD.  */
   outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
@@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
   if (!expand_subword_shift (op1_mode, binoptab,
 			     outof_input, into_input, subword_op1,
 			     outof_target, into_target,
-			     unsignedp, methods, shift_mask))
+			     unsignedp, methods, shift_mask,
+			     shift_mask < BITS_PER_WORD - 1))
     return false;
 
   /* Select between them.  Do the INTO half first because INTO_SUPERWORD
@@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
       return expand_subword_shift (op1_mode, binoptab,
 				   outof_input, into_input, op1,
 				   outof_target, into_target,
-				   unsignedp, methods, shift_mask);
+				   unsignedp, methods, shift_mask, false);
     }
 
   /* Try using conditional moves to generate straight-line code.  */
@@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
   if (!expand_subword_shift (op1_mode, binoptab,
 			     outof_input, into_input, op1,
 			     outof_target, into_target,
-			     unsignedp, methods, shift_mask))
+			     unsignedp, methods, shift_mask, false))
     return false;
 
   emit_label (done_label);
--- gcc/testsuite/gcc.dg/pr108803.c.jj	2023-02-16 20:48:53.517032422 +0100
+++ gcc/testsuite/gcc.dg/pr108803.c	2023-02-16 20:48:45.971143498 +0100
@@ -0,0 +1,24 @@
+/* PR middle-end/108803 */
+/* { dg-do run { target int128 } } */
+/* { dg-options "-Og" } */
+
+unsigned i, j;
+
+__int128
+foo (__int128 a)
+{
+  a &= 9;
+  unsigned k = __builtin_add_overflow_p (i, j, a);
+  __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
+  __int128 c = a + b;
+  return c;
+}
+
+int
+main (void)
+{
+  __int128 x = foo (0);
+  if (x != 63)
+    __builtin_abort ();
+  return 0;
+}

	Jakub