Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
From: Hongtao Liu
To: Roger Sayle
Cc: GCC Patches
Date: Tue, 5 Jul 2022 08:30:58 +0800

On Tue, Jul 5, 2022 at 1:48 AM Roger Sayle wrote:
>
> Hi Hongtao,
> Many thanks for your review.  This revised patch implements your
> suggestions of removing the combine splitters, and instead reusing
> the functionality of the ssse3_palignrdi define_insn_and_split.
>
> This revised patch has been tested on x86_64-pc-linux-gnu with make
> bootstrap and make -k check, both with and without
> --target_board=unix{-m32}, with no new failures.  Is this revised
> version Ok for mainline?

Ok.

>
> 2022-07-04  Roger Sayle
>             Hongtao Liu
>
> gcc/ChangeLog
>       * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
>       CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
>       * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
>       and gen_ssse3_palignv1ti instead of TImode.
>       * config/i386/sse.md (SSESCALARMODE): Delete.
>       (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
>       (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
>       iterator instead of SSESCALARMODE.
>       (ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
>       using a single move instruction (if required).
>
> gcc/testsuite/ChangeLog
>       * gcc.target/i386/ssse3-palignr-2.c: New test case.
>
> Thanks in advance,
> Roger
> --
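[A note for readers of the archive: the 0/64 simplification named in the
ChangeLog above is easiest to see from a scalar model of the 64-bit
palignr.  The instruction shifts the 128-bit concatenation
operands[1]:operands[2] right by the immediate (measured in bits here)
and keeps the low 64 bits, so a shift of 0 selects operands[2] outright
and a shift of 64 selects operands[1].  The sketch below is illustrative
only, not code from the patch; the function name is made up.]

    /* Model of 64-bit palignr: the low 64 bits of (hi:lo) >> shift_bits.  */
    unsigned long long
    palignr_di_model (unsigned long long hi, unsigned long long lo,
                      unsigned int shift_bits)
    {
      if (shift_bits == 0)
        return lo;                  /* A plain move from the lowpart.  */
      if (shift_bits == 64)
        return hi;                  /* A plain move from the highpart.  */
      if (shift_bits < 64)
        return (lo >> shift_bits) | (hi << (64 - shift_bits));
      return shift_bits < 128 ? hi >> (shift_bits - 64) : 0;
    }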
>
> > -----Original Message-----
> > From: Hongtao Liu
> > Sent: 01 July 2022 03:40
> > To: Roger Sayle
> > Cc: GCC Patches
> > Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
> >
> > On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu wrote:
> > >
> > > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle wrote:
> > > >
> > > >
> > > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > > That fix is perfectly correct, but the thing that caught my eye was
> > > > why the compiler was generating a shift by zero at all.  Digging
> > > > deeper, it turns out that we can easily optimize
> > > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > > which may be simplified to moves from the highpart or lowpart.
> > > >
> > > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > > started to add the corresponding optimizations for vpalignr (i.e.
> > > > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > > > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > > > using V1TImode instead of TImode for 128-bit palignr.
> > > >
> > > > But it was only after I'd implemented this clean-up that I stumbled
> > > > across the strange semantics of 128-bit [v]palignr.  According to
> > > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > > different based upon how the instruction is encoded.  PALIGNR leaves
> > > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > > the exact same RTL/templates for both, treating one as an
> > > > alternative for the other.
> > > I think as long as patterns or intrinsics only care about the low
> > > part, they should be ok.
> > > But if we want to rely on the default behavior for the upper bits, we
> > > need to restrict them under a specific ISA (i.e. vmovq in
> > > vec_set<mode>_0).
> > > Generally, 128-bit SSE legacy instructions have different behaviors
> > > for the upper bits from the AVX ones, and that's why vzeroupper was
> > > introduced for the SSE <-> AVX instruction transition.
> > > >
> > > > Hence I thought I'd post what I have so far (part optimization and
> > > > part clean-up), to then ask the x86 experts for their opinions.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32}, with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2022-06-30  Roger Sayle
> > > >
> > > > gcc/ChangeLog
> > > >       * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > > >       CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > > >       * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use
> > > >       V1TImode and gen_ssse3_palignv1ti instead of TImode.
> > > >       * config/i386/sse.md (SSESCALARMODE): Delete.
> > > >       (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > > >       (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
> > > >       iterator instead of SSESCALARMODE.
> > > >       (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > > >       using a single move instruction (if required).
> > > >       (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > > >       (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> > > >
> > > > gcc/testsuite/ChangeLog
> > > >       * gcc.target/i386/ssse3-palignr-2.c: New test case.
> > > >
> > > >
> > > > Thanks in advance,
> > > > Roger
> > > > --
> > > >
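[Aside: for the 128-bit form discussed above, the same boundary cases
collapse at the intrinsic level, where the shift count is given in
bytes rather than bits.  A minimal sketch, assuming <tmmintrin.h> and
-mssse3; the wrapper names are made up for illustration.]

    #include <tmmintrin.h>

    /* palignr concatenates a:b and shifts right by N bytes, keeping
       the low 128 bits of the result.  */
    __m128i
    take_lowpart (__m128i a, __m128i b)
    {
      return _mm_alignr_epi8 (a, b, 0);   /* The window is exactly b.  */
    }

    __m128i
    take_highpart (__m128i a, __m128i b)
    {
      return _mm_alignr_epi8 (a, b, 16);  /* The window is exactly a.  */
    }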
> > >
> > > +(define_split
> > > +  [(set (match_operand:DI 0 "register_operand")
> > > +        (unspec:DI [(match_operand:DI 1 "register_operand")
> > > +                    (match_operand:DI 2 "register_mmxmem_operand")
> > > +                    (const_int 0)]
> > > +                   UNSPEC_PALIGNR))]
> > > +  ""
> > > +  [(set (match_dup 0) (match_dup 2))])
> > > +
> > > +(define_split
> > > +  [(set (match_operand:DI 0 "register_operand")
> > > +        (unspec:DI [(match_operand:DI 1 "register_operand")
> > > +                    (match_operand:DI 2 "register_mmxmem_operand")
> > > +                    (const_int 64)]
> > > +                   UNSPEC_PALIGNR))]
> > > +  ""
> > > +  [(set (match_dup 0) (match_dup 1))])
> > > +
> > > A define_split is assumed to be split into 2 (or more) insns, hence
> > > pass_combine will only try a define_split if the number of merged
> > > insns is greater than 2.
> > > For palignr, I think most of the time there would be only 2 merged
> > > insns (constant propagation), so it is better to change them to a
> > > pre_reload splitter (i.e. define_insn_and_split
> > > "*avx512bw_permvar_truncv16siv16hi_1").
> > I think you can just merge the 2 define_splits into the
> > define_insn_and_split "ssse3_palignrdi" by relaxing the split
> > condition as
> >
> > -  "TARGET_SSSE3 && reload_completed
> > -   && SSE_REGNO_P (REGNO (operands[0]))"
> > +  "(TARGET_SSSE3 && reload_completed
> > +    && SSE_REGNO_P (REGNO (operands[0])))
> > +   || INTVAL (operands[3]) == 0
> > +   || INTVAL (operands[3]) == 64"
> >
> > and you have already handled them by
> >
> > +  if (operands[3] == const0_rtx)
> > +    {
> > +      if (!rtx_equal_p (operands[0], operands[2]))
> > +        emit_move_insn (operands[0], operands[2]);
> > +      else
> > +        emit_note (NOTE_INSN_DELETED);
> > +      DONE;
> > +    }
> > +  else if (INTVAL (operands[3]) == 64)
> > +    {
> > +      if (!rtx_equal_p (operands[0], operands[1]))
> > +        emit_move_insn (operands[0], operands[1]);
> > +      else
> > +        emit_note (NOTE_INSN_DELETED);
> > +      DONE;
> > +    }
> > +
> >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> > --
> > BR,
> > Hongtao

--
BR,
Hongtao
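[Aside: the new test, gcc.target/i386/ssse3-palignr-2.c, is not
reproduced in this thread.  Below is a hypothetical reduced example of
the kind of code the optimization improves; the typedef and function
names are made up, the builtin takes its shift count in bits, and with
the patch each function should compile (at -O2 with -mssse3) to at most
a single move rather than a palignr.]

    /* Hypothetical reduced example; compile with -mssse3.  */
    typedef long long v1di __attribute__ ((vector_size (8)));

    v1di
    take_low (v1di a, v1di b)
    {
      return __builtin_ia32_palignr (a, b, 0);   /* Selects b.  */
    }

    v1di
    take_high (v1di a, v1di b)
    {
      return __builtin_ia32_palignr (a, b, 64);  /* Selects a.  */
    }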