From: "Roger Sayle"
To: "'Hongtao Liu'"
Cc: "'GCC Patches'"
Subject: RE: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
Date: Mon, 4 Jul 2022 18:48:09 +0100
Message-ID: <008701d88fce$39c9c8b0$ad5d5a10$@nextmovesoftware.com>

Hi Hongtao,

Many thanks for your review.  This revised patch implements your
suggestions of removing the combine splitters, and instead reusing the
functionality of the ssse3_palignrdi define_insn_and_split.

This revised patch has been tested on x86_64-pc-linux-gnu with make
bootstrap and make -k check, both with and without
--target_board=unix{-m32}, with no new failures.
Is this revised version Ok for mainline?

2022-07-04  Roger Sayle
	    Hongtao Liu

gcc/ChangeLog
	* config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
	CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
	* config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
	and gen_ssse3_palignrv1ti instead of TImode.
	* config/i386/sse.md (SSESCALARMODE): Delete.
	(define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
	(<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
	iterator instead of SSESCALARMODE.
	(ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
	using a single move instruction (if required).

gcc/testsuite/ChangeLog
	* gcc.target/i386/ssse3-palignr-2.c: New test case.

Thanks in advance,
Roger
--

> -----Original Message-----
> From: Hongtao Liu
> Sent: 01 July 2022 03:40
> To: Roger Sayle
> Cc: GCC Patches
> Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
>
> On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu wrote:
> >
> > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle wrote:
> > >
> > >
> > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > That fix is perfectly correct, but the thing that caught my eye was
> > > why is the compiler generating a shift by zero at all.  Digging
> > > deeper it turns out that we can easily optimize
> > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > which may be simplified to moves from the highpart or lowpart.
> > >
> > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > started to add the corresponding optimizations for vpalignr (i.e.
> > > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > VIMAX_AVX2_AVX512BW".
> > > Hence this patch includes the migration of
> > > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > > using V1TImode instead of TImode for 128-bit palignr.
> > >
> > > But it was only after I'd implemented this clean-up that I stumbled
> > > across the strange semantics of 128-bit [v]palignr.  According to
> > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > different based upon how the instruction is encoded.  PALIGNR leaves
> > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > the exact same RTL/templates for both, treating one as an
> > > alternative for the other.
> > I think as long as patterns or intrinsics only care about the low
> > part, they should be ok.
> > But if we want to use the default behavior for the upper bits, we need
> > to restrict them under a specific isa (i.e. vmovq in vec_set_0).
> > Generally, 128-bit SSE legacy instructions have different behaviors
> > for the upper bits from the AVX ones, and that's why vzeroupper is
> > introduced for sse <-> avx instruction transitions.
> > >
> > > Hence I thought I'd post what I have so far (part optimization and
> > > part clean-up), to then ask the x86 experts for their opinions.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-m32}, with no new failures.  Ok for mainline?
> > >
> > >
> > > 2022-06-30  Roger Sayle
> > >
> > > gcc/ChangeLog
> > > 	* config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > > 	CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > > 	* config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
> > > 	and gen_ssse3_palignrv1ti instead of TImode.
> > > 	* config/i386/sse.md (SSESCALARMODE): Delete.
> > > 	(define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > > 	(<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
> > > 	iterator instead of SSESCALARMODE.
> > >
> > > 	(ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > > 	using a single move instruction (if required).
> > > 	(define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > > 	(define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> > >
> > > gcc/testsuite/ChangeLog
> > > 	* gcc.target/i386/ssse3-palignr-2.c: New test case.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
> >
> > +(define_split
> > +  [(set (match_operand:DI 0 "register_operand")
> > +	(unspec:DI
> > +	  [(match_operand:DI 1 "register_operand")
> > +	   (match_operand:DI 2 "register_mmxmem_operand")
> > +	   (const_int 0)]
> > +	  UNSPEC_PALIGNR))]
> > +  ""
> > +  [(set (match_dup 0) (match_dup 2))])
> > +
> > +(define_split
> > +  [(set (match_operand:DI 0 "register_operand")
> > +	(unspec:DI
> > +	  [(match_operand:DI 1 "register_operand")
> > +	   (match_operand:DI 2 "register_mmxmem_operand")
> > +	   (const_int 64)]
> > +	  UNSPEC_PALIGNR))]
> > +  ""
> > +  [(set (match_dup 0) (match_dup 1))])
> > +
> > define_split is assumed to be split into 2 (or more) insns, hence
> > pass_combine will only try a define_split if the number of merged
> > insns is greater than 2.
> > For palignr, I think most of the time there would be only 2 merged
> > insns (constant propagation), so better to change them to a pre_reload
> > splitter (i.e. define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
> I think you can just merge the 2 define_splits into define_insn_and_split
> "ssse3_palignrdi" by relaxing the split condition as
>
> -  "TARGET_SSSE3 && reload_completed
> -   && SSE_REGNO_P (REGNO (operands[0]))"
> +  "(TARGET_SSSE3 && reload_completed
> +    && SSE_REGNO_P (REGNO (operands[0])))
> +   || INTVAL (operands[3]) == 0
> +   || INTVAL (operands[3]) == 64"
>
> and you have already handled them by
>
> +  if (operands[3] == const0_rtx)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[2]))
> +	emit_move_insn (operands[0], operands[2]);
> +      else
> +	emit_note (NOTE_INSN_DELETED);
> +      DONE;
> +    }
> +  else if (INTVAL (operands[3]) == 64)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[1]))
> +	emit_move_insn (operands[0], operands[1]);
> +      else
> +	emit_note (NOTE_INSN_DELETED);
> +      DONE;
> +    }
> +
>
> >
> > --
> > BR,
> > Hongtao
>
> --
> BR,
> Hongtao

[Attachment: patchvs5.txt]

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index e6daad4..fd16093 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -900,7 +900,7 @@ BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_psignv4si3, "__builtin_ia32_psig
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_psignv2si3, "__builtin_ia32_psignd", IX86_BUILTIN_PSIGND, UNKNOWN, (int) V2SI_FTYPE_V2SI_V2SI)
 
 /* SSSE3.  */
-BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
+BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrv1ti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_palignrdi, "__builtin_ia32_palignr", IX86_BUILTIN_PALIGNR, UNKNOWN, (int) V1DI_FTYPE_V1DI_V1DI_INT_CONVERT)
 
 /* SSE4.1 */
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 8bc5430..6a3fcde 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -19548,9 +19548,11 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p)
       shift = GEN_INT (min * GET_MODE_UNIT_BITSIZE (d->vmode));
       if (GET_MODE_SIZE (d->vmode) == 16)
 	{
-	  target = gen_reg_rtx (TImode);
-	  emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, dcopy.op1),
-					  gen_lowpart (TImode, dcopy.op0), shift));
+	  target = gen_reg_rtx (V1TImode);
+	  emit_insn (gen_ssse3_palignrv1ti (target,
+					    gen_lowpart (V1TImode, dcopy.op1),
+					    gen_lowpart (V1TImode, dcopy.op0),
+					    shift));
 	}
       else
 	{
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f2f72e8..adf05bf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -575,10 +575,6 @@
 (define_mode_iterator VIMAX_AVX2
   [(V2TI "TARGET_AVX2") V1TI])
 
-;; ??? This should probably be dropped in favor of VIMAX_AVX2_AVX512BW.
-(define_mode_iterator SSESCALARMODE
-  [(V4TI "TARGET_AVX512BW") (V2TI "TARGET_AVX2") TI])
-
 (define_mode_iterator VI12_AVX2
   [(V32QI "TARGET_AVX2") V16QI
    (V16HI "TARGET_AVX2") V8HI])
@@ -712,7 +708,7 @@
    (V4HI "ssse3") (V8HI "ssse3") (V16HI "avx2") (V32HI "avx512bw")
    (V4SI "ssse3") (V8SI "avx2")
    (V2DI "ssse3") (V4DI "avx2")
-   (TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
+   (V1TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
 
 (define_mode_attr sse4_1_avx2
   [(V16QI "sse4_1") (V32QI "avx2") (V64QI "avx512bw")
@@ -21092,10 +21088,10 @@
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "<ssse3_avx2>_palignr<mode>"
-  [(set (match_operand:SSESCALARMODE 0 "register_operand" "=x,<v_Yw>")
-	(unspec:SSESCALARMODE
-	  [(match_operand:SSESCALARMODE 1 "register_operand" "0,<v_Yw>")
-	   (match_operand:SSESCALARMODE 2 "vector_operand" "xBm,<v_Yw>m")
+  [(set (match_operand:VIMAX_AVX2_AVX512BW 0 "register_operand" "=x,<v_Yw>")
+	(unspec:VIMAX_AVX2_AVX512BW
+	  [(match_operand:VIMAX_AVX2_AVX512BW 1 "register_operand" "0,<v_Yw>")
+	   (match_operand:VIMAX_AVX2_AVX512BW 2 "vector_operand" "xBm,<v_Yw>m")
 	   (match_operand:SI 3 "const_0_to_255_mul_8_operand")]
 	  UNSPEC_PALIGNR))]
   "TARGET_SSSE3"
@@ -21141,11 +21137,30 @@
       gcc_unreachable ();
     }
 }
-  "TARGET_SSSE3 && reload_completed
-   && SSE_REGNO_P (REGNO (operands[0]))"
+  "(TARGET_SSSE3 && reload_completed
+    && SSE_REGNO_P (REGNO (operands[0])))
+   || operands[3] == const0_rtx
+   || INTVAL (operands[3]) == 64"
   [(set (match_dup 0)
	(lshiftrt:V1TI (match_dup 0) (match_dup 3)))]
 {
+  if (operands[3] == const0_rtx)
+    {
+      if (!rtx_equal_p (operands[0], operands[2]))
+	emit_move_insn (operands[0], operands[2]);
+      else
+	emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+  else if (INTVAL (operands[3]) == 64)
+    {
+      if (!rtx_equal_p (operands[0], operands[1]))
+	emit_move_insn (operands[0], operands[1]);
+      else
+	emit_note (NOTE_INSN_DELETED);
+      DONE;
+    }
+
   /* Emulate MMX palignrdi with SSE psrldq.  */
   rtx op0 = lowpart_subreg (V2DImode, operands[0],
			    GET_MODE (operands[0]));
diff --git a/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
new file mode 100644
index 0000000..791222d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mssse3" } */
+
+typedef long long __attribute__ ((__vector_size__ (8))) T;
+
+T x;
+T y;
+T z;
+
+void foo()
+{
+  z = __builtin_ia32_palignr (x, y, 0);
+}
+
+void bar()
+{
+  z = __builtin_ia32_palignr (x, y, 64);
+}
+/* { dg-final { scan-assembler-not "punpcklqdq" } } */
+/* { dg-final { scan-assembler-not "pshufd" } } */
+/* { dg-final { scan-assembler-not "psrldq" } } */