From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oo1-xc2c.google.com (mail-oo1-xc2c.google.com [IPv6:2607:f8b0:4864:20::c2c]) by sourceware.org (Postfix) with ESMTPS id 2AA4839960CE for ; Wed, 2 Jun 2021 01:55:12 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2AA4839960CE Received: by mail-oo1-xc2c.google.com with SMTP id w20-20020a4a35540000b02902458551c0d6so224133oog.7 for ; Tue, 01 Jun 2021 18:55:12 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=krmNmJfYl6ikJLRE66fQZqhhV+ApnfkKzE1A9E9CFBQ=; b=CY2cL89gdXTvGPRV7Jgk9okg+51mFIbaKUpSzLQCRKvSxAwYzna4T9r+/p/I7ecUqN kHLzoAYkOax2wyk+pX6sM4cggqRHH8dIobq8YXFoNmubHjIByVGnGgDJoQEnncwLB4iV i8nZ3e/BoyRUnaP6z3OOgTFR/fhtIPdFkSLrIZVtoTSMRm8A5z19xNmWbSNpiztrIWsd CI8Rx6OMo2HnxFexzU0BBPIfxd8IMg766RnAguG/NpJO9E6Uwn+EVj0XHGQbYDlZlnmp a+0u6uh1JcXM/Ws5NEAwa2rj5tf/HZTLnoRLPna/hsHSiomwXV3j7HUvHisI6Jb+UbhR RFLQ== X-Gm-Message-State: AOAM5339ip9V0SO28mLsY2L5ticc0L7h9uWyb1koiPMQPCwYATkk7Tmc vK/e+DOskFx2mFVeWmr9i4nsFhONYtEfdnAverY= X-Google-Smtp-Source: ABdhPJyJymB/4WDD1Ay2kubVWOlveVDi2JA32EHQGp7l5GXrcMQjwocX743JtuW+XZiUui6o+1vjiXM8noW3u2BObe8= X-Received: by 2002:a4a:8241:: with SMTP id t1mr22353291oog.17.1622598911297; Tue, 01 Jun 2021 18:55:11 -0700 (PDT) MIME-Version: 1.0 References: <459318f2-a9b8-f542-a29e-0ecbbc82b69a@tachyum.com> In-Reply-To: From: "H.J. Lu" Date: Tue, 1 Jun 2021 18:54:35 -0700 Message-ID: Subject: Re: [PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX To: Hongtao Liu Cc: Jeff Law , Jeff Law , Bernd Edlinger , "H.J. Lu via Gcc-patches" , Richard Sandiford Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-3028.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Jun 2021 01:55:14 -0000 On Tue, Jun 1, 2021 at 6:17 PM Hongtao Liu wrote: > > On Wed, Jun 2, 2021 at 7:07 AM H.J. Lu via Gcc-patches > wrote: > > > > On Tue, Jun 1, 2021 at 7:21 AM Jeff Law wrote: > > > > > > > > > > > > On 6/1/2021 7:29 AM, H.J. Lu via Gcc-patches wrote: > > > > On Tue, Jun 1, 2021 at 6:25 AM Richard Biener > > > > wrote: > > > >> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu wrote= : > > > >>> On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote: > > > >>>> > > > >>>> On 5/31/2021 11:50 PM, Richard Sandiford wrote: > > > >>>>> "H.J. Lu via Gcc-patches" writes: > > > >>>>>> On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote: > > > >>>>>>> On Mon, May 31, 2021 at 6:26 AM Richard Biener > > > >>>>>>> wrote: > > > >>>>>>>> On Mon, May 31, 2021 at 3:12 PM H.J. Lu wrote: > > > >>>>>>>>> On Mon, May 31, 2021 at 5:46 AM Richard Biener > > > >>>>>>>>> wrote: > > > >>>>>>>>>> On Mon, May 31, 2021 at 2:09 PM H.J. Lu wrote: > > > >>>>>>>>>>> On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener = wrote: > > > >>>>>>>>>>>>>>> -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx D= ATA, scalar_int_mode > > > >>>>>>>>>>>>>>> MODE) > > > >>>>>>>>>>>>>>> This function returns the RTL of a register co= ntaining > > > >>>>>>>>>>>>>>> 'GET_MODE_SIZE (MODE)' consecutive copies of t= he unsigned char > > > >>>>>>>>>>>>>>> value given in the RTL register DATA. For exa= mple, if MODE is 4 > > > >>>>>>>>>>>>>>> bytes wide, return the RTL for 0x01010101*DATA= . > > > >>>>>>>>>>>>>> For this one I wonder if it should be an optab instead= . Couldn't you > > > >>>>>>>>>>>>>> use the existing vec_duplicate for this by using (para= doxical) subregs > > > >>>>>>>>>>>>>> like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:= QI ...)))? > > > >>>>>>>>>>>>> I tried. It doesn't even work on x86. See: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/5706= 61.html > > > >>>>>>>>>>>> Not sure what I should read from there... > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> There are special cases to subreg HI, SI and DI modes o= f TI mode in > > > >>>>>>>>>>>>> ix86_gen_memset_value_from_prev. simplify_gen_subreg = doesn't > > > >>>>>>>>>>>>> work here. Each backend may need its own special hand= ling. > > > >>>>>>>>>>>> OK, I guess I'm not (RTL) qualified enough to further re= view these parts, > > > >>>>>>>>>>>> sorry. Since we're doing code generation the canonical = way to communicate > > > >>>>>>>>>>>> with backends should be optabs, not some set of disconne= cted target hooks. > > > >>>>>>>>>>>> But as said, I probably don't know enough of RTL to see = why it's the only way. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Richard. > > > >>>>>>>>>>> Here is the patch to add optabs instead. Does it look OK= ? > > > >>>>>>>>>>> > > > >>>>>>>>>>> Thanks. > > > >>>>>>>>>>> > > > >>>>>>>>>>> H.J. > > > >>>>>>>>>>> --- > > > >>>>>>>>>>> Add 2 optabs: > > > >>>>>>>>>>> > > > >>>>>>>>>>> 1. integer_extract: Extract lower bit value from the inte= ger value in > > > >>>>>>>>>>> TImode, OImode or XImode. > > > >>>>>>>>>> That sounds very specific, esp. the restriction to {TI,OI,= XI}mode. > > > >>>>>>>>>> It also sounds like it matches (subreg:{TI,OI,XI} (...) 0)= . There are > > > >>>>>>>>>> existing target hooks verifying subreg validity - why's th= at not a good > > > >>>>>>>>>> fit here? ISTR you say gen_lowpart () doesn't work (or wa= s it > > > >>>>>>>>>> simplify_gen_subreg?), why's that so? > > > >>>>>>>>> {TI,OI,XI}mode are storage only integer types. subreg doe= sn't work > > > >>>>>>>>> well on them. I got > > > >>>>>>>>> > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ cat s2.i > > > >>>>>>>>> extern void *ops; > > > >>>>>>>>> > > > >>>>>>>>> void > > > >>>>>>>>> foo (int c) > > > >>>>>>>>> { > > > >>>>>>>>> __builtin_memset (ops, c, 34); > > > >>>>>>>>> } > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ make s2.s > > > >>>>>>>>> /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64= -linux/gcc/xgcc > > > >>>>>>>>> -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_= 64-linux/gcc/ > > > >>>>>>>>> -O2 -march=3Dhaswell -S s2.i > > > >>>>>>>>> during RTL pass: reload > > > >>>>>>>>> s2.i: In function =E2=80=98foo=E2=80=99: > > > >>>>>>>>> s2.i:7:1: internal compiler error: maximum number of genera= ted reload > > > >>>>>>>>> insns per insn achieved (90) > > > >>>>>>>>> 7 | } > > > >>>>>>>>> | ^ > > > >>>>>>>>> 0x1050734 lra_constraints(bool) > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c= :5091 > > > >>>>>>>>> 0x1039536 lra(_IO_FILE*) > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336 > > > >>>>>>>>> 0xfe1140 do_reload > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822 > > > >>>>>>>>> 0xfe162e execute > > > >>>>>>>>> /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008 > > > >>>>>>>>> Please submit a full bug report, > > > >>>>>>>>> with preprocessed source if appropriate. > > > >>>>>>>>> Please include the complete backtrace with any bug report. > > > >>>>>>>>> See for instructions. > > > >>>>>>>>> make: *** [Makefile:32: s2.s] Error 1 > > > >>>>>>>>> [hjl@gnu-cfl-2 pieces]$ > > > >>>>>>>>> > > > >>>>>>>>> due to > > > >>>>>>>>> > > > >>>>>>>>> (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84) > > > >>>>>>>>> (const_int 32 [0x20])) [0 MEM [(void > > > >>>>>>>>> *)ops.0_1]+32 S2 A8]) > > > >>>>>>>>> (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1 > > > >>>>>>>>> (nil)) > > > >>>>>>>>> > > > >>>>>>>>> The new optab gives us > > > >>>>>>>>> > > > >>>>>>>>> (insn 12 11 13 2 (set (reg:TI 88) > > > >>>>>>>>> (reg:TI 51 xmm15)) "s2.i":6:3 -1 > > > >>>>>>>>> (nil)) > > > >>>>>>>>> (insn 13 12 14 2 (set (reg:SI 89) > > > >>>>>>>>> (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1 > > > >>>>>>>>> (nil)) > > > >>>>>>>>> (insn 14 13 15 2 (set (reg:HI 87) > > > >>>>>>>>> (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1 > > > >>>>>>>>> (nil)) > > > >>>>>>>> that looks odd to me - what's the final result after LRA? I= think > > > >>>>>>> I got: > > > >>>>>>> > > > >>>>>>> vmovd %edi, %xmm15 > > > >>>>>>> movq ops(%rip), %rdx > > > >>>>>>> vpbroadcastb %xmm15, %ymm15 > > > >>>>>>> vmovq %xmm15, %rax <<<< move to GPR > > > >>>>>>> vmovdqu %ymm15, (%rdx) > > > >>>>>>> movw %ax, 32(%rdx) <<<< subreg of GPR > > > >>>>>>> vzeroupper > > > >>>>>>> ret > > > >>>>>>> > > > >>>>>>>> we should see to make lowpart_subreg work on {XI,OI,TI}mode. > > > >>>>>>>> Only two steps should be necessary at most: > > > >>>>>>>> xmm -> gpr, grp -> subreg, or gpr -> subreg. So the expande= r > > > >>>>>>>> code in memset should try to generate the subreg directly > > > >>>>>>> subreg didn't fail on x86 when I tried. > > > >>>>>>> > > > >>>>>>>> and if that fails, try a word_mode subreg followed by the su= breg. > > > >>>>>>> I will try word_mode subreg. > > > >>>>>>> > > > >>>>>> Here is the v2 patch to use word_mode subreg. For > > > >>>>>> > > > >>>>>> --- > > > >>>>>> extern void *ops; > > > >>>>>> > > > >>>>>> void > > > >>>>>> foo (int c) > > > >>>>>> { > > > >>>>>> __builtin_memset (ops, 4, 32); > > > >>>>>> } > > > >>>>>> --- > > > >>>>>> > > > >>>>>> without vec_const_duplicate, I got > > > >>>>>> > > > >>>>>> movl $4, %eax > > > >>>>>> movq ops(%rip), %rdx > > > >>>>>> movd %eax, %xmm0 > > > >>>>>> punpcklbw %xmm0, %xmm0 > > > >>>>>> punpcklwd %xmm0, %xmm0 > > > >>>>>> pshufd $0, %xmm0, %xmm0 > > > >>>>>> movups %xmm0, (%rdx) > > > >>>>>> movups %xmm0, 16(%rdx) > > > >>>>>> ret > > > >>>>>> > > > >>>>>> With vec_const_duplicate, I got > > > >>>>>> > > > >>>>>> movq ops(%rip), %rax > > > >>>>>> movdqa .LC0(%rip), %xmm0 > > > >>>>>> movups %xmm0, (%rax) > > > >>>>>> movups %xmm0, 16(%rax) > > > >>>>>> ret > > > >>>>>> > > > >>>>>> If vec_duplicate is allowed to fail, I don't need vec_const_du= plicate. > > > >>>>> I don't understand why we need an optab for this though. If th= e operand > > > >>>>> is constant then we should just be doing an ordinary move in wh= ich the > > > >>>>> source is a CONST_VECTOR. It's then up to the move patterns to= handle > > > >>>>> duplicated constants as efficiently as possible. (Sorry if thi= s was > > > >>>>> discussed upthread and I missed it.) > > > >>>> That's exactly the point I'm trying to get across as well. > > > >>>> > > > >>> This is what we do today. But I'd like to generate > > > >>> > > > >>> movl $4, %eax > > > >>> vpbroadcastb %eax, %ymm15 > > > >>> movq ops(%rip), %rax > > > >>> vmovdqu %ymm15, (%rax) > > > >>> vzeroupper > > > >>> ret > > > >>> > > > >>> instead of > > > >>> > > > >>> vmovdqa .LC0(%rip), %ymm15 > > > >>> movq ops(%rip), %rax > > > >>> vmovdqu %ymm15, (%rax) > > > >>> vzeroupper > > > >>> ret > > > >>> > > > >>> Do I need a vec_dup pattern for it? > > > >> I think we have special code sequences to materialize some > > > >> constant vectors already, we should be able to add to that, no? > > > > We can do that for all 0s and all 1s at the final codegen. For > > > > other values, since we need a GPR, we can't do that. > > > You can catch them in your movxx expanders, you can create peep2 > > > patterns that use available GPRs, etc. I don't see a fundamental nee= d > > > to to introduce new target macros or hooks to handle this stuff. In > > > fact I've done both to handle a closely related issue on our port. > > > > > > > One problem of expanding TI/OI/XI moves to broadcast is that other > > RTL passes may change it. For example, expander generates: > It could be handled in pass_data_constant_pool_broadcast which is > designed for avx512 embedding broadcast, but can also do such > transforming. > > see > https://godbolt.org/z/8YGzqf938 It sounds promising, but doesn't work on TImode: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D100865 > > > > (insn 7 5 6 (set (reg:QI 85) > > (const_int 12 [0xc])) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c":9:3 > > -1 > > (nil)) > > > > (insn 6 7 8 (set (reg:V16QI 84) > > (vec_duplicate:V16QI (reg:QI 85))) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c > > ":9:3 5103 {*avx512vl_vec_dup_gprv16qi} > > (nil)) > > > > (insn 8 6 9 (set (subreg:V16QI (reg:TI 86) 0) > > (reg:V16QI 84)) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c":9:3 > > -1 > > (nil)) > > > > (insn 9 8 10 (set (mem:TI (reg/f:DI 83) [0 MEM [(void > > *)dst.0_1]+0 S16 A8]) > > (reg:TI 86)) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c":9:3 > > -1 > > (nil)) > > > > combine turns it into: > > > > insn 9 6 10 2 (set (mem:TI (reg/f:DI 83 [ dst ]) [0 MEM > > [(void *)dst.0_1]+0 S16 A8]) > > (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c)) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/ > > i386/pr90773-17.c":9:3 73 {*movti_internal} > > (nil)) > > > > LRA tries: > > > > (insn 14 15 16 2 (set (reg:V16QI 20 xmm0 [89]) > > (vec_duplicate:V16QI (reg:QI 4 si [90]))) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 907 > > 73-17.c":9:3 5103 {*avx512vl_vec_dup_gprv16qi} > > (nil)) > > (insn 16 14 9 2 (set (reg:V16QI 1 dx) > > (reg:V16QI 20 xmm0 [89])) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c":9:3 > > 152 > > 7 {movv16qi_internal} > > (nil)) > > (insn 9 16 10 2 (set (mem:TI (reg/f:DI 0 ax [orig:83 dst ] [83]) [0 > > MEM [(void *)dst.0_1]+0 S16 A8]) > > (reg:TI 1 dx [88])) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c":9:3 > > 73 {*movt > > i_internal} > > > > and fails: > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr9= 0773-17.c: > > In function =E2=80=98foo=E2=80=99: > > /export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr9= 0773-17.c:10:1: > > error: insn does not satisfy its constraints: > > (insn 16 14 9 2 (set (reg:V16QI 1 dx) > > (reg:V16QI 20 xmm0 [89])) > > "/export/gnu/import/git/gitlab/x86-gcc/gcc/testsuite/gcc.target/i386/pr= 90773-17.c":9:3 > > 1527 {movv16qi_internal} > > (nil)) > > > > I want to hide > > > > (const_wide_int 0xc0c0c0c0c0c0c0c0c0c0c0c0c0c0c0c) > > > > from RTL passes. > > > > > > -- > > H.J. > > > > -- > BR, > Hongtao --=20 H.J.