From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <roger@nextmovesoftware.com>
Received: from server.nextmovesoftware.com (server.nextmovesoftware.com
 [162.254.253.69])
 by sourceware.org (Postfix) with ESMTPS id 5BE323858C50
 for <gcc-patches@gcc.gnu.org>; Mon, 15 Aug 2022 08:29:51 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 5BE323858C50
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=nextmovesoftware.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=nextmovesoftware.com
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=nextmovesoftware.com; s=default; h=Content-Type:MIME-Version:Message-ID:
 Date:Subject:Cc:To:From:Sender:Reply-To:Content-Transfer-Encoding:Content-ID:
 Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
 :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe:
 List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=Lesb2MliPFkeX08N9is4cWYNv0cXlGum4cKZYqPIlEs=; b=fWlnM0+XALQhvH/qHXso/DuR3h
 UXYXzkbHUtfE3fF8WFVDfgMzxV5AB+QDMW3J7KHJdBINPFgFX01YEp8rNjDrck0OT8IPTjeZbf7p8
 xAUducCQ7JuPNRIPNATgsOb1+yUb78Z2PHcCHkWmsET/N6FZQ3uLarNSFJa8S38ygPOtHwj5KAKxS
 lqUXuwJGok+JbU5I3qYCWzGG4wFt1Gk1j7yj0EOVjek8eE+vzrHVbO0DWxZdjvO8IHp+izRDiTPA+
 ClKhUXTT3aPX4jFd4NuLZ+DTd+hfSPmMA8+OUIWulXuotWOHDXUMlytdwcM8EFj+jEumROKZT2y6g
 lNDanCsA==;
Received: from host86-169-41-119.range86-169.btcentralplus.com
 ([86.169.41.119]:52837 helo=Dell)
 by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.95)
 (envelope-from <roger@nextmovesoftware.com>) id 1oNVTq-0000Kd-E9;
 Mon, 15 Aug 2022 04:29:50 -0400
From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'GCC Patches'" <gcc-patches@gcc.gnu.org>
Subject: [x86_64 PATCH] Support shifts and rotates by integer constants in
 TImode STV.
Date: Mon, 15 Aug 2022 09:29:47 +0100
Message-ID: <00f801d8b081$2f23e160$8d6ba420$@nextmovesoftware.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----=_NextPart_000_00F9_01D8B089.90EABA60"
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AdiwgNTn40kTuVFvS3WL8VBBG5afSA==
Content-Language: en-gb
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com
X-AntiAbuse: Original Domain - gcc.gnu.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - nextmovesoftware.com
X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id:
 roger@nextmovesoftware.com
X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com
X-Source: 
X-Source-Args: 
X-Source-Dir: 
X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 RCVD_IN_BARRACUDACENTRAL, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 15 Aug 2022 08:29:53 -0000

This is a multipart message in MIME format.

------=_NextPart_000_00F9_01D8B089.90EABA60
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit


Many thanks to Uros for reviewing/approving all of the previous pieces.
This patch adds support for converting 128-bit TImode shifts and rotates
to SSE equivalents using V1TImode during the TImode STV pass.
Previously, only logical shifts by multiples of 8 were handled
(from my patch earlier this month).
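
As a rough illustration of that widening (a sketch only, not code from
the patch): the two masks below are the ones used in
timode_scalar_to_vector_candidate_p before and after this change, but
the function names are hypothetical and a plain long long stands in
for HOST_WIDE_INT.

```c
#include <assert.h>

/* Hypothetical helper: shift counts accepted before this patch.
   Mask ~0x78 only leaves bits 3..6 free, i.e. 0, 8, 16, ..., 120.  */
static int accepted_before_patch(long long count)
{
    return (count & ~0x78LL) == 0;
}

/* Hypothetical helper: shift/rotate counts accepted after this patch.
   Mask ~0x7f leaves bits 0..6 free, i.e. every count in 0..127.  */
static int accepted_after_patch(long long count)
{
    return (count & ~0x7fLL) == 0;
}

/* Small self-check of the two ranges described above.  */
static void check_candidate_ranges(void)
{
    assert(accepted_before_patch(32));
    assert(!accepted_before_patch(1));
    assert(accepted_after_patch(1));
    assert(accepted_after_patch(127));
    assert(!accepted_after_patch(128));
}
```

Negative counts are rejected by both masks, since their high bits
are set.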

As an example of the benefits, the following rotate by 32 bits:

unsigned __int128 a, b;
void rot32() { a = (b >> 32) | (b << 96); }

when compiled on x86_64 with -O2 previously generated:

        movq    b(%rip), %rax
        movq    b+8(%rip), %rdx
        movq    %rax, %rcx
        shrdq   $32, %rdx, %rax
        shrdq   $32, %rcx, %rdx
        movq    %rax, a(%rip)
        movq    %rdx, a+8(%rip)
        ret

with this patch, now generates:

        movdqa  b(%rip), %xmm0
        pshufd  $57, %xmm0, %xmm0
        movaps  %xmm0, a(%rip)
        ret

[For those who don't read SSE, this uses a V4SI permutation.]
This should help 128-bit cryptographic code that interleaves XORs
with rotations (but doesn't use additions or subtractions).
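
To make the permutation concrete, here is a minimal portable sketch
(not part of the patch) of why selector 57 implements the rotate.
It models the 128-bit value as four 32-bit lanes, lane 0 lowest;
the type and function names are hypothetical, and model_pshufd only
imitates the lane selection of pshufd rather than using intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a V4SI value: four 32-bit lanes, lane[0] lowest.  */
typedef struct { uint32_t lane[4]; } v4si_model;

/* Model of pshufd: result lane i takes source lane (imm >> 2*i) & 3.  */
static v4si_model model_pshufd(v4si_model src, unsigned imm)
{
    v4si_model dst;
    for (int i = 0; i < 4; i++)
        dst.lane[i] = src.lane[(imm >> (2 * i)) & 3];
    return dst;
}

/* Scalar reference: rotate the 128-bit value hi:lo right by 32 bits,
   working on 64-bit halves much as the shrdq pair does.  */
static void rotr32_scalar(uint64_t lo, uint64_t hi,
                          uint64_t *rlo, uint64_t *rhi)
{
    *rlo = (lo >> 32) | (hi << 32);
    *rhi = (hi >> 32) | (lo << 32);
}

/* Returns 1 if the lane permutation with selector 57 gives the same
   result as the scalar rotate for the input hi:lo.  */
static int rot32_matches_pshufd57(uint64_t lo, uint64_t hi)
{
    v4si_model v = {{ (uint32_t) lo, (uint32_t) (lo >> 32),
                      (uint32_t) hi, (uint32_t) (hi >> 32) }};
    v4si_model p = model_pshufd(v, 57);
    uint64_t rlo, rhi;
    rotr32_scalar(lo, hi, &rlo, &rhi);
    return p.lane[0] == (uint32_t) rlo
           && p.lane[1] == (uint32_t) (rlo >> 32)
           && p.lane[2] == (uint32_t) rhi
           && p.lane[3] == (uint32_t) (rhi >> 32);
}

/* Small self-check on one arbitrary input.  */
static void check_rot32(void)
{
    assert(rot32_matches_pshufd57(0x0123456789abcdefULL,
                                  0xfedcba9876543210ULL));
}
```

Selector 57 is 0b00111001, so result lanes 0..3 read source lanes
1, 2, 3, 0: each 32-bit lane moves down one position and the lowest
lane wraps around to the top.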

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-08-15  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-features.cc
        (timode_scalar_chain::compute_convert_gain): Provide costs for
        shifts and rotates.  Provide gains for comparisons against 0/-1.
        (timode_scalar_chain::convert_insn): Handle ASHIFTRT, ROTATERT
        and ROTATE just like existing ASHIFT and LSHIFTRT cases.
        (timode_scalar_to_vector_candidate_p): Handle all shifts and
        rotates by integer constants between 0 and 127.

gcc/testsuite/ChangeLog
        * gcc.target/i386/sse4_1-stv-9.c: New test case.


Thanks in advance,
Roger
--


------=_NextPart_000_00F9_01D8B089.90EABA60
Content-Type: text/plain;
	name="patchvs.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="patchvs.txt"

diff --git a/gcc/config/i386/i386-features.cc =
b/gcc/config/i386/i386-features.cc=0A=
index effc2f2..8ab65c8 100644=0A=
--- a/gcc/config/i386/i386-features.cc=0A=
+++ b/gcc/config/i386/i386-features.cc=0A=
@@ -1209,6 +1209,8 @@ timode_scalar_chain::compute_convert_gain ()=0A=
       rtx def_set =3D single_set (insn);=0A=
       rtx src =3D SET_SRC (def_set);=0A=
       rtx dst =3D SET_DEST (def_set);=0A=
+      HOST_WIDE_INT op1val;=0A=
+      int scost, vcost;=0A=
       int igain =3D 0;=0A=
 =0A=
       switch (GET_CODE (src))=0A=
@@ -1245,9 +1247,157 @@ timode_scalar_chain::compute_convert_gain ()=0A=
 =0A=
 	case ASHIFT:=0A=
 	case LSHIFTRT:=0A=
-	  /* For logical shifts by constant multiples of 8. */=0A=
-	  igain =3D optimize_insn_for_size_p () ? COSTS_N_BYTES (4)=0A=
-					      : COSTS_N_INSNS (1);=0A=
+	  /* See ix86_expand_v1ti_shift.  */=0A=
+	  op1val =3D XINT (src, 1);=0A=
+	  if (optimize_insn_for_size_p ())=0A=
+	    {=0A=
+	      if (op1val =3D=3D 64 || op1val =3D=3D 65)=0A=
+		scost =3D COSTS_N_BYTES (5);=0A=
+	      else if (op1val >=3D 66)=0A=
+		scost =3D COSTS_N_BYTES (6);=0A=
+	      else if (op1val =3D=3D 1)=0A=
+		scost =3D COSTS_N_BYTES (8);=0A=
+	      else=0A=
+		scost =3D COSTS_N_BYTES (9);=0A=
+=0A=
+	      if ((op1val & 7) =3D=3D 0)=0A=
+		vcost =3D COSTS_N_BYTES (5);=0A=
+	      else if (op1val > 64)=0A=
+		vcost =3D COSTS_N_BYTES (10);=0A=
+	      else=0A=
+		vcost =3D TARGET_AVX ? COSTS_N_BYTES (19) : COSTS_N_BYTES (23);=0A=
+	    }=0A=
+	  else=0A=
+	    {=0A=
+	      scost =3D COSTS_N_INSNS (2);=0A=
+	      if ((op1val & 7) =3D=3D 0)=0A=
+		vcost =3D COSTS_N_INSNS (1);=0A=
+	      else if (op1val > 64)=0A=
+		vcost =3D COSTS_N_INSNS (2);=0A=
+	      else=0A=
+		vcost =3D TARGET_AVX ? COSTS_N_INSNS (4) : COSTS_N_INSNS (5);=0A=
+	    }=0A=
+	  igain =3D scost - vcost;=0A=
+	  break;=0A=
+=0A=
+	case ASHIFTRT:=0A=
+	  /* See ix86_expand_v1ti_ashiftrt.  */=0A=
+	  op1val =3D XINT (src, 1);=0A=
+	  if (optimize_insn_for_size_p ())=0A=
+	    {=0A=
+	      if (op1val =3D=3D 64 || op1val =3D=3D 127)=0A=
+		scost =3D COSTS_N_BYTES (7);=0A=
+	      else if (op1val =3D=3D 1)=0A=
+		scost =3D COSTS_N_BYTES (8);=0A=
+	      else if (op1val =3D=3D 65)=0A=
+		scost =3D COSTS_N_BYTES (10);=0A=
+	      else if (op1val >=3D 66)=0A=
+		scost =3D COSTS_N_BYTES (11);=0A=
+	      else=0A=
+		scost =3D COSTS_N_BYTES (9);=0A=
+=0A=
+	      if (op1val =3D=3D 127)=0A=
+		vcost =3D COSTS_N_BYTES (10);=0A=
+	      else if (op1val =3D=3D 64)=0A=
+		vcost =3D COSTS_N_BYTES (14);=0A=
+	      else if (op1val =3D=3D 96)=0A=
+		vcost =3D COSTS_N_BYTES (18);=0A=
+	      else if (op1val >=3D 111)=0A=
+		vcost =3D COSTS_N_BYTES (15);=0A=
+              else if (TARGET_AVX2 && op1val =3D=3D 32)=0A=
+		vcost =3D COSTS_N_BYTES (16);=0A=
+	      else if (TARGET_SSE4_1 && op1val =3D=3D 32)=0A=
+		vcost =3D COSTS_N_BYTES (20);=0A=
+	      else if (op1val >=3D 96)=0A=
+		vcost =3D COSTS_N_BYTES (23);=0A=
+	      else if ((op1val & 7) =3D=3D 0)=0A=
+		vcost =3D COSTS_N_BYTES (28);=0A=
+              else if (TARGET_AVX2 && op1val < 32)=0A=
+		vcost =3D COSTS_N_BYTES (30);=0A=
+	      else if (op1val =3D=3D 1 || op1val >=3D 64)=0A=
+		vcost =3D COSTS_N_BYTES (42);=0A=
+	      else=0A=
+		vcost =3D COSTS_N_BYTES (47);=0A=
+	    }=0A=
+	  else=0A=
+	    {=0A=
+	      if (op1val >=3D 65 && op1val <=3D 126)=0A=
+		scost =3D COSTS_N_INSNS (3);=0A=
+	      else=0A=
+		scost =3D COSTS_N_INSNS (2);=0A=
+=0A=
+	      if (op1val =3D=3D 127)=0A=
+		vcost =3D COSTS_N_INSNS (2);=0A=
+	      else if (op1val =3D=3D 64)=0A=
+		vcost =3D COSTS_N_INSNS (3);=0A=
+	      else if (op1val =3D=3D 96)=0A=
+		vcost =3D COSTS_N_INSNS (4);=0A=
+	      else if (op1val >=3D 111)=0A=
+		vcost =3D COSTS_N_INSNS (3);=0A=
+              else if (TARGET_AVX2 && op1val =3D=3D 32)=0A=
+		vcost =3D COSTS_N_INSNS (3);=0A=
+	      else if (TARGET_SSE4_1 && op1val =3D=3D 32)=0A=
+		vcost =3D COSTS_N_INSNS (4);=0A=
+	      else if (op1val >=3D 96)=0A=
+		vcost =3D COSTS_N_INSNS (5);=0A=
+	      else if ((op1val & 7) =3D=3D 0)=0A=
+		vcost =3D COSTS_N_INSNS (6);=0A=
+              else if (TARGET_AVX2 && op1val < 32)=0A=
+		vcost =3D COSTS_N_INSNS (6);=0A=
+	      else if (op1val =3D=3D 1 || op1val >=3D 64)=0A=
+		vcost =3D COSTS_N_INSNS (9);=0A=
+	      else=0A=
+		vcost =3D COSTS_N_INSNS (10);=0A=
+	    }=0A=
+	  igain =3D scost - vcost;=0A=
+	  break;=0A=
+=0A=
+	case ROTATE:=0A=
+	case ROTATERT:=0A=
+	  /* See ix86_expand_v1ti_rotate.  */=0A=
+	  op1val =3D XINT (src, 1);=0A=
+	  if (optimize_insn_for_size_p ())=0A=
+	    {=0A=
+	      scost =3D COSTS_N_BYTES (13);=0A=
+	      if ((op1val & 31) =3D=3D 0)=0A=
+		vcost =3D COSTS_N_BYTES (5);=0A=
+	      else if ((op1val & 7) =3D=3D 0)=0A=
+		vcost =3D TARGET_AVX ? COSTS_N_BYTES (13) : COSTS_N_BYTES (18);=0A=
+              else if (op1val > 32 && op1val < 96)=0A=
+		vcost =3D COSTS_N_BYTES (24);=0A=
+	      else=0A=
+	        vcost =3D COSTS_N_BYTES (19);=0A=
+	    }=0A=
+	  else=0A=
+	    {=0A=
+	      scost =3D COSTS_N_INSNS (3);=0A=
+	      if ((op1val & 31) =3D=3D 0)=0A=
+		vcost =3D COSTS_N_INSNS (1);=0A=
+	      else if ((op1val & 7) =3D=3D 0)=0A=
+		vcost =3D TARGET_AVX ? COSTS_N_INSNS (3) : COSTS_N_INSNS (4);=0A=
+              else if (op1val > 32 && op1val < 96)=0A=
+		vcost =3D COSTS_N_INSNS (5);=0A=
+	      else=0A=
+	        vcost =3D COSTS_N_INSNS (1);=0A=
+	    }=0A=
+	  igain =3D scost - vcost;=0A=
+	  break;=0A=
+=0A=
+	case COMPARE:=0A=
+	  if (XEXP (src, 1) =3D=3D const0_rtx)=0A=
+	    {=0A=
+	      if (GET_CODE (XEXP (src, 0)) =3D=3D AND)=0A=
+	        /* and;and;or (9 bytes) vs. ptest (5 bytes).  */=0A=
+		igain =3D optimize_insn_for_size_p() ? COSTS_N_BYTES (4)=0A=
+						   : COSTS_N_INSNS (2);=0A=
+	      /* or (3 bytes) vs. ptest (5 bytes).  */=0A=
+	      else if (optimize_insn_for_size_p ())=0A=
+		igain =3D -COSTS_N_BYTES (2);=0A=
+	    }=0A=
+	  else if (XEXP (src, 1) =3D=3D const1_rtx)=0A=
+	    /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes).  */=0A=
+	    igain =3D optimize_insn_for_size_p() ? -COSTS_N_BYTES (6)=0A=
+					       : -COSTS_N_INSNS (1);=0A=
 	  break;=0A=
 =0A=
 	default:=0A=
@@ -1503,6 +1653,9 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)=0A=
 =0A=
     case ASHIFT:=0A=
     case LSHIFTRT:=0A=
+    case ASHIFTRT:=0A=
+    case ROTATERT:=0A=
+    case ROTATE:=0A=
       convert_op (&XEXP (src, 0), insn);=0A=
       PUT_MODE (src, V1TImode);=0A=
       break;=0A=
@@ -1861,11 +2014,13 @@ timode_scalar_to_vector_candidate_p (rtx_insn =
*insn)=0A=
 =0A=
     case ASHIFT:=0A=
     case LSHIFTRT:=0A=
-      /* Handle logical shifts by integer constants between 0 and 120=0A=
-	 that are multiples of 8.  */=0A=
+    case ASHIFTRT:=0A=
+    case ROTATERT:=0A=
+    case ROTATE:=0A=
+      /* Handle shifts/rotates by integer constants between 0 and 127.  =
*/=0A=
       return REG_P (XEXP (src, 0))=0A=
 	     && CONST_INT_P (XEXP (src, 1))=0A=
-	     && (INTVAL (XEXP (src, 1)) & ~0x78) =3D=3D 0;=0A=
+	     && (INTVAL (XEXP (src, 1)) & ~0x7f) =3D=3D 0;=0A=
 =0A=
     default:=0A=
       return false;=0A=
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-9.c =
b/gcc/testsuite/gcc.target/i386/sse4_1-stv-9.c=0A=
new file mode 100644=0A=
index 0000000..ee5af3c=0A=
--- /dev/null=0A=
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-9.c=0A=
@@ -0,0 +1,12 @@=0A=
+/* { dg-do compile { target int128 } } */=0A=
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */=0A=
+=0A=
+unsigned __int128 a, b;=0A=
+void rot1()  { a =3D (b >> 1) | (b << 127); }=0A=
+void rot4()  { a =3D (b >> 4) | (b << 124); }=0A=
+void rot8()  { a =3D (b >> 8) | (b << 120); }=0A=
+void rot32() { a =3D (b >> 32) | (b << 96); }=0A=
+void rot64() { a =3D (b >> 64) | (b << 64); }=0A=
+=0A=
+/* { dg-final { scan-assembler-not "shrdq" } } */=0A=
+/* { dg-final { scan-assembler "pshufd" } } */=0A=

------=_NextPart_000_00F9_01D8B089.90EABA60--