From: Richard Sandiford
To: Tamar Christina
Cc: Tamar Christina via Gcc-patches, Richard Earnshaw, nd, Marcus Shawcroft
Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants and operations
Date: Mon, 25 Oct 2021 15:32:24 +0100

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Monday, October 25, 2021 10:54 AM
>> To: Tamar Christina
>> Cc: Tamar Christina via Gcc-patches; Richard Earnshaw; nd; Marcus Shawcroft
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>>
>> Tamar Christina writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford
>> >> Sent: Saturday, October 23, 2021 11:40 AM
>> >> To: Tamar Christina via Gcc-patches
>> >> Cc: Tamar Christina; Richard Earnshaw; nd; Marcus Shawcroft
>> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector
>> >> constants and operations
>> >>
>> >> Tamar Christina via Gcc-patches writes:
>> >> >> I'm still a bit sceptical about treating the high-part cost as lower.
>> >> >> ISTM that the subreg cases are the ones that are truly “free” and
>> >> >> any others should have a normal cost.  So if CSE handled the
>> >> >> subreg case itself (to model how the rtx would actually be
>> >> >> generated) then aarch64 code would have to do less work.  I imagine
>> >> >> that will be true for other targets as well.
>> >> >
>> >> > I guess the main problem is that CSE lacks context because it's not
>> >> > until after combine that the high part becomes truly "free" when
>> >> > pushed into a high operation.
>> >>
>> >> Yeah.  And the aarch64 code is just being asked to cost the operation
>> >> it's given, which could for example come from an existing
>> >> aarch64_simd_mov_from_high.  I think we should try to ensure that an
>> >> aarch64_simd_mov_from_high followed by some arithmetic on the result
>> >> is more expensive than the fused operation (when fusing is possible).
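
(To make that pairing concrete, here is an intrinsics-level sketch of
what "mov_from_high followed by some arithmetic" versus the fused
high-part operation looks like.  This is only an illustration; the
function names are invented for the example rather than taken from the
patch or the testsuite:

#include <arm_neon.h>

/* Unfused form: each vget_high_u8 becomes a mov-from-high (DUP lane)
   feeding an ordinary UMLAL on 64-bit inputs.  */
uint16x8_t
mla_high_unfused (uint16x8_t acc, uint8x16_t b, uint8x16_t c)
{
  return vmlal_u8 (acc, vget_high_u8 (b), vget_high_u8 (c));
}

/* Fused form: a single UMLAL2 reads the high halves directly.  */
uint16x8_t
mla_high_fused (uint16x8_t acc, uint8x16_t b, uint8x16_t c)
{
  return vmlal_high_u8 (acc, b, c);
}

The costing question is what the mov-from-high should cost when the
combination into the fused form does not happen.)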
>> >>
>> >> An analogy might be: if the cost code is given:
>> >>
>> >>   (add (reg X) (reg Y))
>> >>
>> >> then, at some later point, the (reg X) might be replaced with a
>> >> multiplication, in which case we'd have a MADD operation and the
>> >> addition is effectively free.  Something similar would happen if
>> >> (reg X) became a shift by a small amount on newer cores, although I
>> >> guess then you could argue either that the cost of the add disappears
>> >> or that the cost of the shift disappears.
>> >>
>> >> But we shouldn't count ADD as free on the basis that it could be
>> >> combined with a multiplication or shift in future.  We have to cost
>> >> what we're given.  I think the same thing applies to the high part.
>> >>
>> >> Here we're trying to prevent cse1 from replacing a DUP (lane) with a
>> >> MOVI by saying that the DUP is strictly cheaper than the MOVI.
>> >> I don't think that's really true though, and the cost tables in the
>> >> patch say that DUP is more expensive (rather than less expensive)
>> >> than MOVI.
>> >
>> > No we're not. The front end has already pushed the constant into each
>> > operation that needs it which is the entire problem.
>>
>> I think we're talking about different things here.  I'll come to the
>> gimple stuff below, but I was talking purely about the effect on the RTL
>> optimisers.  What I meant above is that, in the cse1 dumps, the patch
>> leads to changes like:
>>
>>  (insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
>> -        (const_vector:V8QI [
>> +        (vec_select:V8QI (reg:V16QI 116)
>> +            (parallel:V16QI [
>> +                    (const_int 8 [0x8])
>> +                    (const_int 9 [0x9])
>> +                    (const_int 10 [0xa])
>> +                    (const_int 11 [0xb])
>> +                    (const_int 12 [0xc])
>> +                    (const_int 13 [0xd])
>> +                    (const_int 14 [0xe])
>> +                    (const_int 15 [0xf])
>> +                ]))) "include/arm_neon.h":6477:22 1394 {aarch64_simd_mov_from_v16qihigh}
>> +     (expr_list:REG_EQUAL (const_vector:V8QI [
>>                  (const_int 3 [0x3]) repeated x8
>> -            ])) "include/arm_neon.h":6477:22 1160 {*aarch64_simd_movv8qi}
>> -     (expr_list:REG_DEAD (reg:V16QI 117)
>> -        (nil)))
>> +            ])
>> +        (expr_list:REG_DEAD (reg:V16QI 117)
>> +            (nil))))
>>
>> The pre-cse1 code is:
>>
>> (insn 19 18 20 2 (set (reg:V16QI 117)
>>         (const_vector:V16QI [
>>                 (const_int 3 [0x3]) repeated x16
>>             ])) "include/arm_neon.h":6477:22 1166 {*aarch64_simd_movv16qi}
>>      (nil))
>> (insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
>>         (vec_select:V8QI (reg:V16QI 117)
>>             (parallel:V16QI [
>>                 (const_int 8 [0x8])
>>                 (const_int 9 [0x9])
>>                 (const_int 10 [0xa])
>>                 (const_int 11 [0xb])
>>                 (const_int 12 [0xc])
>>                 (const_int 13 [0xd])
>>                 (const_int 14 [0xe])
>>                 (const_int 15 [0xf])
>>             ]))) "include/arm_neon.h":6477:22 1394 {aarch64_simd_mov_from_v16qihigh}
>>      (nil))
>>
>> That is, before the patch, we folded insn 19 into insn 20 to get:
>>
>> (insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
>>         (const_vector:V8QI [
>>                 (const_int 3 [0x3]) repeated x8
>>             ])) "include/arm_neon.h":6477:22 1160 {*aarch64_simd_movv8qi}
>>      (expr_list:REG_DEAD (reg:V16QI 117)
>>         (nil)))
>>
>> After the patch we reject that because:
>>
>>   (set (reg:V8QI X) (const_vector:V8QI [3]))
>>
>> is costed as a MOVI (cost 4) and the original
>> aarch64_simd_mov_from_v16qihigh is costed as zero.  In other words, the
>> patch makes the DUP (lane) in the “mov high” strictly cheaper than a
>> constant move (MOVI).
>
> Yes, this was done intentionally because as we talked about a month ago
> there's no real way to cost this correctly.
> The use of `X` there determines whether it's cheaper to use the movi
> over the dup.  The MOVI not only prevents re-use of the value, it also
> prevents combining into high operations.  All of which is impossible to
> tell currently in how CSE and costing are done.
>
> This is an unmodified compiler created from last night's trunk
> https://godbolt.org/z/1saTP4xWs
>
> While yes, it did fold movi into the set, reg 19 wasn't dead, so you now
> materialized the constant 3 times
>
> test0:
>         ldr     q0, [x0]
>         movi    v3.8b, 0x3      <<<< first
>         ldr     q2, [x1]
>         movi    v5.16b, 0x3     <<< second
>         uxtl    v1.8h, v0.8b
>         dup     d4, v2.d[1]     <<< third
>         uxtl2   v0.8h, v0.16b
>         umlal   v1.8h, v2.8b, v5.8b
>         umlal   v0.8h, v4.8b, v3.8b
>         addhn   v0.8b, v1.8h, v0.8h
>         str     d0, [x2]
>         ret
>
> whilst my patch generates
>
> test0:
>         movi    v2.16b, 0x3     <<< once
>         ldr     q0, [x0]
>         uxtl    v1.8h, v0.8b
>         uxtl2   v0.8h, v0.16b
>         ldr     q3, [x1]
>         umlal   v1.8h, v3.8b, v2.8b
>         umlal2  v0.8h, v3.16b, v2.16b
>         addhn   v0.8b, v1.8h, v0.8h
>         str     d0, [x2]
>         ret
>
> Yes it's not perfect, yes you can end up with a dup instead of two movi's,
> but my argument is it's still a step forward as the perfect solution
> doesn't seem to be possible at all with the way things are currently set up.

I agree there's no out-of-the-box way of getting what we want for the
original testcases.  It would require changes outside the target or
(if the worst comes to the worst) a target-specific pass.

>> Preventing this fold seems like a key part of being able to match the
>> *l2 forms in the testcase, since otherwise the “mov high” disappears
>> and isn't available for combining later.
>
> Yes, and by preventing the folding, combine should in principle be able
> to fold it back if it wasn't pushed into another instruction, but combine
> does not attempt to touch constants and selects on their own.  If it did,
> this "regression" would be fixed.

The problem is that combine is limited to individual EBBs and only
combines def-use chains when there is a single use.  It's not a general
folding engine.

> I'm not really quite sure what we're arguing about…  I did think about
> all three possible cases when making this:
>
> https://godbolt.org/z/hjWhWq1v1
>
> Of the three cases the compiler currently only generates something good
> for test2.  Both test1 and test0 are deficient.  The patch doesn't change
> test2, significantly improves test0, and whether test1 is a regression is
> likely uArch specific.
>
> On Arm Cortex CPUs it is not a regression, as a DUP on a SIMD scalar has
> the same throughput and latencies as a MOVI according to the Arm
> Performance Software Optimization guides.

Costing them as equal would be OK when they are equal.  It's the
“DUP (lane) / mov high is strictly cheaper” bit I'm concerned about.

> So to me this looks like an improvement overall.  And this is where we
> likely disagree?

Well, the disagreement isn't about whether the new compiler output for
these testcases is better than the old compiler output.  It's more a
question of how we're getting there.

>> > MOVI as I mentioned before is the one case where this is a toss-up.
>> > But there are far more constants that cannot be created with a movi.
>> > A simple example is
>> >
>> > #include <arm_neon.h>
>> >
>> > int8x16_t square(int8x16_t full, int8x8_t small) {
>> >     int8x16_t cst = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,15};
>> >     int8x8_t low = vget_high_s8 (cst);
>> >     int8x8_t res1 = vmul_s8 (small, low);
>> >     return vaddq_s8 (vmulq_s8 (full, cst), vcombine_s8 (res1, res1));
>> > }
>> >
>> > Where in Gimple we get
>> >
>> >   <bb 2> [local count: 1073741824]:
>> >   _2 = __builtin_aarch64_get_highv16qi ({ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 });
>> >   _4 = _2 * small_3(D);
>> >   _6 = full_5(D) * { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 };
>> >   _7 = __builtin_aarch64_combinev8qi (_4, _4);
>> >   _8 = _6 + _7;
>> >   return _8;
>> >
>> > Regardless of what happens to __builtin_aarch64_get_highv16qi, nothing
>> > will recreate the relationship with cst.  Whether
>> > __builtin_aarch64_get_highv16qi is lowered or not, constant prop will
>> > still push in constants.
>>
>> Yeah, constants are (by design) free in gimple.  But that's OK in itself,
>> because RTL optimisers have the job of removing any duplicates that end
>> up requiring separate moves.  I think we both agree on that.
>>
>> E.g. for:
>>
>> #include <arm_neon.h>
>>
>> void foo(int8x16_t *x) {
>>   x[0] = vaddq_s8 (x[0], (int8x16_t) {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15});
>>   x[1] = vaddq_s8 (x[1], (int8x16_t) {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15});
>> }
>>
>> the final gimple is:
>>
>>   <bb 2> [local count: 1073741824]:
>>   _1 = *x_4(D);
>>   _5 = _1 + { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
>>   *x_4(D) = _5;
>>   _2 = MEM[(int8x16_t *)x_4(D) + 16B];
>>   _7 = _2 + { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
>>   MEM[(int8x16_t *)x_4(D) + 16B] = _7;
>>   return;
>>
>> but cse1 removes the duplicated constant even before the patch.
>
> It doesn't for me, again an unmodified compiler:
>
> https://godbolt.org/z/qnvf7496h

FWIW, the link for my example is: https://godbolt.org/z/G6vaE3nab
but it sounds like the disagreement wasn't where I thought it was.

> and CSE1 has as the final codegen:
>
> (insn 7 4 8 2 (set (reg:V16QI 99)
>         (const_vector:V16QI [
>                 (const_int 0 [0])
>                 (const_int 1 [0x1])
>                 (const_int 2 [0x2])
>                 (const_int 3 [0x3])
>                 (const_int 4 [0x4])
>                 (const_int 5 [0x5])
>                 (const_int 6 [0x6])
>                 (const_int 7 [0x7])
>                 (const_int 8 [0x8])
>                 (const_int 9 [0x9])
>                 (const_int 10 [0xa])
>                 (const_int 11 [0xb])
>                 (const_int 12 [0xc])
>                 (const_int 13 [0xd])
>                 (const_int 15 [0xf])
>                 (const_int 0 [0])
>             ]))
>
> (insn 8 7 9 2 (set (reg:V8QI 92 [ _2 ])
>         (const_vector:V8QI [
>                 (const_int 8 [0x8])
>                 (const_int 9 [0x9])
>                 (const_int 10 [0xa])
>                 (const_int 11 [0xb])
>                 (const_int 12 [0xc])
>                 (const_int 13 [0xd])
>                 (const_int 15 [0xf])
>                 (const_int 0 [0])
>             ]))
>
> (insn 11 10 12 2 (set (reg:V16QI 95 [ _7 ])
>         (vec_concat:V16QI (vec_select:V8QI (reg:V16QI 95 [ _7 ])
>                 (parallel:V16QI [
>                         (const_int 0 [0])
>                         (const_int 1 [0x1])
>                         (const_int 2 [0x2])
>                         (const_int 3 [0x3])
>                         (const_int 4 [0x4])
>                         (const_int 5 [0x5])
>                         (const_int 6 [0x6])
>                         (const_int 7 [0x7])
>                     ]))
>             (reg:V8QI 93 [ _4 ])))

Here, insn 8 is the folded version of the vget_high_s8 and insn 11 is
part of the vcombine_s8.  With that caveat…

> So again same constant represented twice, which is reflected in the codegen.

…right, the above is also what I was saying that we generate before the
patch for your square example.  But as you say later, this testcase is
demonstrating the point that constants loaded from memory should be more
expensive than DUP (lane).
I agree with that.  The bit I don't agree with is costing the DUP (lane)
as zero, so that it's also strictly cheaper than MOVI.

So I think the disagreement is more about things like the first example
in the testcase: https://godbolt.org/z/xrMnezrse

Specifically: is it legitimate to fold:

(insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
        (vec_select:V8QI (reg:V16QI 117)
            (parallel:V16QI [
                    (const_int 8 [0x8])
                    (const_int 9 [0x9])
                    (const_int 10 [0xa])
                    (const_int 11 [0xb])
                    (const_int 12 [0xc])
                    (const_int 13 [0xd])
                    (const_int 14 [0xe])
                    (const_int 15 [0xf])
                ]))) "/opt/compiler-explorer/arm64/gcc-trunk-20211025/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/12.0.0/include/arm_neon.h":6477:22 1394 {aarch64_simd_mov_from_v16qihigh}
     (nil))

to:

(insn 20 19 21 2 (set (reg:V8QI 96 [ _8 ])
        (const_vector:V8QI [
                (const_int 3 [0x3]) repeated x8
            ])) "/opt/compiler-explorer/arm64/gcc-trunk-20211025/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/12.0.0/include/arm_neon.h":6477:22 1160 {*aarch64_simd_movv8qi}
     (expr_list:REG_DEAD (reg:V16QI 117)
        (nil)))

without first trying to get rid of the instruction some other way
(through combine)?  I think it is legitimate, since the new MOVI
instruction is at least as cheap as the original DUP.  Even if CSE didn't
do the fold itself, and just CSEd the two uses of the V16QI constant,
I think it would be legitimate for a later patch to fold the instruction
to a constant independently of CSE.

IMO:

  vget_high_s8(vdupq_n_u8(3))

is just a roundabout way of writing:

  vdup_n_u8(3)

We've described what vget_high_s8 does in target-independent rtl
(i.e. without unspecs), so it's natural that operations with constant
operands will themselves get folded to a constant.

I think we should accept that and try to generate the output we want in
an environment where such folds do happen, rather than trying to prevent
the folds from happening until during or after combine.  That approach
could also work for autovec output, and cases where the user wrote the
8-byte constants directly.  E.g. I think we should aim to optimise:

void test0_mod (uint8_t *inptr0, uint8_t *inptr1, uint8_t *outptr0)
{
  uint8x8_t three_u8 = vdup_n_u8(3);
  uint8x16_t x = vld1q_u8(inptr0);
  uint8x16_t y = vld1q_u8(inptr1);
  uint16x8_t x_l = vmovl_u8(vget_low_u8(x));
  uint16x8_t x_h = vmovl_u8(vget_high_u8(x));
  uint16x8_t z_l = vmlal_u8(x_l, vget_low_u8(y), three_u8);
  uint16x8_t z_h = vmlal_u8(x_h, vget_high_u8(y), three_u8);
  vst1_u8(outptr0, vaddhn_u16(z_l, z_h));
}

in the same way as the original test0.  Similarly we should aim to
optimise:

int8x16_t square_mode(int8x16_t full, int8x8_t small) {
    int8x16_t cst = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,15};
    int8x8_t low = {8,9,10,11,12,13,15};
    int8x8_t res1 = vmul_s8 (small, low);
    return vaddq_s8 (vmulq_s8 (full, cst), vcombine_s8 (res1, res1));
}

in the same way as square.

>> so that there are no longer any duplicate constants (as far as the RTL
>> code is concerned).  Instead we have one 16-byte constant and one
>> 8-byte constant.
>>
>> The patch prevents the fold on insn 8 by making the “mov high” strictly
>> cheaper than the constant move, so we keep the “mov high” and its
>> 16-byte input.  Keeping the “mov high” means that we do have a
>> duplicate constant for CSE to remove.
>>
>> What I meant…
>>
>> >> Also, if I've understood correctly, it looks like we'd be relying on
>> >> the vget_high of a constant remaining unfolded until RTL cse1.
>> >> I think it's likely in future that we'd try to fold vget_high at the
>> >> gimple level instead, since that could expose more optimisations of a
>> >> different kind.  The gimple optimisers would then fold
>> >> vget_high(constant) in a similar way to cse1 does now.
>> >>
>> >> So perhaps we should continue to allow the vget_high(constant) to be
>> >> folded in cse1 and come up with some way of coping with the folded
>> >> form.
>>
>> …here was that, in future, the gimple optimisers might be able to fold
>> the vget_high themselves.  For your example, we'd then have:
>>
>>   _4 = { 8, 9, 10, 11, 12, 13, 15, 0 } * small_3(D);
>>   _6 = full_5(D) * { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 };
>>   _7 = __builtin_aarch64_combinev8qi (_4, _4);
>>   _8 = _6 + _7;
>>   return _8;
>>
>> In this situation, we'd need to recreate the relationship between
>> { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 } and
>> { 8, 9, 10, 11, 12, 13, 15, 0 }.  We can't ensure that the relationship
>> is never lost.
>>
>> The same thing would be true for vget_low.  So a constant like:
>>
>>   cst = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 }
>>   … vget_low* (cst) …;
>>   … vget_high* (cst) …;
>>
>> could be folded to two smaller constants:
>>
>>   … { 0, 1, 2, 3, 4, 5, 6, 7 } …;
>>   … { 8, 9, 10, 11, 12, 13, 15, 0 } …;
>>
>> We might then need to recreate the combined form, rather than relying
>> on the combined form already existing.
>
> Yes, but this is what confuses me.  My patch changes it so that CSE1,
> which is run relatively early, is able to find the relationship between
> the two constants.

Yeah, it does that for the case where the vector constant is a duplicate
of a single element.  My example above doesn't fall into that category
though.

What I was saying was: let's suppose that a vget_low/vget_high pair for
a general V16QI vector constant is folded at the gimple level (by later
patches).  Then the RTL optimisers just see two V8QI constants rather
than a single V16QI constant.  The optimisers would need to generate the
V16QI “from scratch” if they wanted to, as for test0_mod above.

> CSE1 shouldn't do any folding, it doesn't have enough information to do
> so.  By CSE doing the folding, it makes it so combine is less efficient.

I don't agree with that as a general statement.  I agree that stopping
pre-combine passes from folding helps examples like test0, but I don't
think that means that pre-combine passes are doing the wrong thing by
folding.

IMO the problem is more that we are very opportunistic in looking for
high-part operations (and by-lane operations).  Legitimate optimisations
can easily defeat this opportunistic matching.

>> > CSE1 doesn't fold it, because for CSE the cost is too high to do so.
>> > Which is what this costing was attempting to fix.
>> > CSE simply does not touch it.
>> > It leaves it as
>> >
>> > (insn 11 10 12 2 (set (reg:V16QI 95 [ _7 ])
>> >         (vec_concat:V16QI (vec_select:V8QI (reg:V16QI 95 [ _7 ])
>> >                 (parallel:V16QI [
>> >                         (const_int 0 [0])
>> >                         (const_int 1 [0x1])
>> >                         (const_int 2 [0x2])
>> >                         (const_int 3 [0x3])
>> >                         (const_int 4 [0x4])
>> >                         (const_int 5 [0x5])
>> >                         (const_int 6 [0x6])
>> >                         (const_int 7 [0x7])
>> >                     ]))
>> >             (reg:V8QI 93 [ _4 ]))) "":6506:10 1908 {aarch64_simd_move_hi_quad_v16qi}
>> >      (nil))
>> > (insn 12 11 13 2 (set (reg:V16QI 102)
>> >         (const_vector:V16QI [
>> >                 (const_int 0 [0])
>> >                 (const_int 1 [0x1])
>> >                 (const_int 2 [0x2])
>> >                 (const_int 3 [0x3])
>> >                 (const_int 4 [0x4])
>> >                 (const_int 5 [0x5])
>> >                 (const_int 6 [0x6])
>> >                 (const_int 7 [0x7])
>> >                 (const_int 8 [0x8])
>> >                 (const_int 9 [0x9])
>> >                 (const_int 10 [0xa])
>> >                 (const_int 11 [0xb])
>> >                 (const_int 12 [0xc])
>> >                 (const_int 13 [0xd])
>> >                 (const_int 15 [0xf])
>> >                 (const_int 0 [0])
>> >             ])) "":1466:14 1166 {*aarch64_simd_movv16qi}
>> >      (nil))
>>
>> I don't think that's true for the unpatched compiler.  Are you sure this
>> isn't the “pre-CSE” part of the dump?  CSE is confusing (to me) in that
>> it prints each function twice, once in unoptimised form and later in
>> optimised form.
>
> Yes I'm sure, see all the compiler explorer links above.

Ah, yeah, I misunderstood which insn you were quoting.  But insn 11 in:

  https://godbolt.org/z/rrbP14var

is part of the vcombine_s8.  The preceding instructions are:

(insn 9 8 10 2 (set (reg:V8QI 93 [ _4 ])
        (mult:V8QI (reg:V8QI 92 [ _2 ])
            (reg/v:V8QI 98 [ small ]))) "/opt/compiler-explorer/arm64/gcc-trunk-20211025/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/12.0.0/include/arm_neon.h":1402:14 1428 {mulv8qi3}
     (expr_list:REG_DEAD (reg/v:V8QI 98 [ small ])
        (expr_list:REG_DEAD (reg:V8QI 92 [ _2 ])
            (nil))))
(insn 10 9 11 2 (set (reg:V16QI 95 [ _7 ])
        (vec_concat:V16QI (reg:V8QI 93 [ _4 ])
            (const_vector:V8QI [
                (const_int 0 [0]) repeated x8
            ]))) "/opt/compiler-explorer/arm64/gcc-trunk-20211025/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/12.0.0/include/arm_neon.h":6506:10 1892 {move_lo_quad_internal_v16qi}
     (nil))

and since the multiplication result is variable, we can't fold this.
The vget_high is insn 8, which does get folded (but it sounds like we
agree on that).

> > > And I don't see any way to fix this without having Gimple not push
> > > constants in, which would lead to worse regressions.  I can change
> > > the patch to cost the high as a dup, which fixes this codegen at least
> > > and has you rematerialize movi.  If that's not acceptable I can drop
> > > costing for High entirely then, it's not the main thing I am fixing.
> >
> > Costing the high as a dup leaves us in the same situation as before the
> > patch: the folded V8QI constant is cheaper than the unfolded mov high.
>
> Yes, and the dup will reflect that.  The argument that it's not the right
> cost no longer holds any water in that case.

Yeah, my concerns disappear in that case.

> In particular as I still maintain that it is too early to do any constant
> folding in CSE1 for AArch64.
>
> Whether it's folded or not doesn't make any difference to combine, which
> will fold when combinations are possible with the folded version.
>
> So I have yet to see any actual regression.

Well, this isn't going to win any awards for realism :-), but:

#include <arm_neon.h>

int8x16_t foo()
{
  int8x16_t a = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
  int8x8_t b = vget_high_s8 (a);
  int8x8_t c = { 4, 5, 6, 7, 8, 9, 10, 11 };
  int8x8_t d = vadd_s8 (b, c);
  int8x16_t e = vcombine_s8 (d, b);
  return vaddq_s8 (e, a);
}

is folded to a constant before the patch and isn't after the patch.

Your examples are more realistic than that one, but I think this does
show why preventing folding can be counter-productive in some cases.

My hope is that one day gimple would fold that example to a constant.
But if it does, it will also fold the vget_highs and vget_lows in the
original testcase to constants, meaning that we can't rely on the
original V16QI constant existing as well.

Thanks,
Richard