From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id CEB653858D28 for ; Wed, 24 May 2023 19:58:08 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org CEB653858D28 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 78B6B1FB; Wed, 24 May 2023 12:58:53 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E93D23F6C4; Wed, 24 May 2023 12:58:07 -0700 (PDT) From: Richard Sandiford To: Prathamesh Kulkarni Mail-Followup-To: Prathamesh Kulkarni ,gcc Patches , richard.sandiford@arm.com Cc: gcc Patches Subject: Re: [aarch64] Code-gen for vector initialization involving constants References: Date: Wed, 24 May 2023 20:58:06 +0100 In-Reply-To: (Prathamesh Kulkarni's message of "Thu, 25 May 2023 00:43:27 +0530") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-27.8 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_STOCKGEN,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Prathamesh Kulkarni writes: > On Wed, 24 May 2023 at 15:40, Richard Sandiford > wrote: >> >> Prathamesh Kulkarni writes: >> > On Mon, 22 May 2023 at 14:18, Richard Sandiford >> > wrote: >> >> >> >> Prathamesh Kulkarni writes: >> >> > Hi Richard, >> >> > Thanks for the suggestions. Does the attached patch look OK ? >> >> > Boostrap+test in progress on aarch64-linux-gnu. >> >> >> >> Like I say, please wait for the tests to complete before sending an R= FA. >> >> It saves a review cycle if the tests don't in fact pass. >> > Right, sorry, will post patches after completion of testing henceforth. >> >> >> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aar= ch64.cc >> >> > index 29dbacfa917..e611a7cca25 100644 >> >> > --- a/gcc/config/aarch64/aarch64.cc >> >> > +++ b/gcc/config/aarch64/aarch64.cc >> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mo= de, rtx vals, bool even_p) >> >> > return gen_rtx_PARALLEL (new_mode, vec); >> >> > } >> >> > >> >> > +/* Return true if INSN is a scalar move. */ >> >> > + >> >> > +static bool >> >> > +scalar_move_insn_p (const rtx_insn *insn) >> >> > +{ >> >> > + rtx set =3D single_set (insn); >> >> > + if (!set) >> >> > + return false; >> >> > + rtx src =3D SET_SRC (set); >> >> > + rtx dest =3D SET_DEST (set); >> >> > + return is_a(GET_MODE (dest)) >> >> > + && aarch64_mov_operand_p (src, GET_MODE (src)); >> >> >> >> Formatting: >> >> >> >> return (is_a(GET_MODE (dest)) >> >> && aarch64_mov_operand_p (src, GET_MODE (src))); >> >> >> >> OK with that change if the tests pass, thanks. >> > Unfortunately, the patch regressed vec-init-21.c: >> > >> > int8x16_t f_s8(int8_t x, int8_t y) >> > { >> > return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6, >> > 7, 8, 9, 10, 11, 12, 13, 14 }; >> > } >> > >> > -O3 code-gen trunk: >> > f_s8: >> > adrp x2, .LC0 >> > ldr q0, [x2, #:lo12:.LC0] >> > ins v0.b[0], w0 >> > ins v0.b[1], w1 >> > ret >> > >> > -O3 code-gen patch: >> > f_s8: >> > adrp x2, .LC0 >> > ldr d31, [x2, #:lo12:.LC0] >> > adrp x2, .LC1 >> > ldr d0, [x2, #:lo12:.LC1] >> > ins v31.b[0], w0 >> > ins v0.b[0], w1 >> > zip1 v0.16b, v31.16b, v0.16b >> > ret >> > >> > With trunk, it chooses the fallback sequence because both fallback >> > and zip1 sequence had cost =3D 20, however with patch applied, >> > we end up with zip1 sequence cost =3D 24 and fallback sequence >> > cost =3D 28. >> > >> > This happens because of using insn_cost instead of >> > set_rtx_cost for the following expression: >> > (set (reg:QI 100) >> > (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)) >> > set_rtx_cost returns 0 for above expression but insn_cost returns 4. >> >> Yeah, was wondering why you'd dropped the set_rtx_cost thing, >> but decided not to question it since using insn_cost seemed >> reasonable if it worked. > The attached patch uses set_rtx_cost for single_set and insn_cost > otherwise for non debug insns similar to seq_cost. FWIW, I think with the aarch64_mov_operand fix, the old way of using insn_cost for everything would have worked too. But either way is fine. >> > This expression template appears twice in fallback sequence, which rai= ses >> > the cost to 28 from 20, while it appears once in each half of zip1 seq= uence, >> > which raises the cost to 24 from 20, and so it now prefers zip1 sequen= ce >> > instead. >> > >> > I assumed this expression would be ignored because it looks like a sca= lar move, >> > but that doesn't seem to be the case ? >> > aarch64_classify_symbolic_expression returns >> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0) >> > and thus aarch64_mov_operand_p returns false. >> >> Ah, I guess it should be aarch64_mov_operand instead. Confusing that >> they're so different... > Thanks, using aarch64_mov_operand worked. >> >> > Another issue with the zip1 sequence above is using same register x2 >> > for loading another half of constant in: >> > adrp x2, .LC1 >> > >> > I guess this will create an output dependency from adrp x2, .LC0 -> >> > adrp x2, .LC1 >> > and anti-dependency from ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1 >> > essentially forcing almost the entire sequence (except ins >> > instructions) to execute sequentially ? >> >> I'd expect modern cores to handle that via renaming. > Ah right, thanks for the clarification. > > For some reason, it seems git diff is not formatting the patch correctly = :/ > Or perhaps I am doing something wrongly. No, I think it's fine. It's just tabs vs. spaces. A leading "+" followed by a tab is still only indented 8 columns, whereas "+" followed by 6 spaces is indented 7 columns. So indentation can look a bit weird in the diff. I was accounting for that though. :) > For eg, it shows: > + return is_a(GET_MODE (dest)) > + && aarch64_mov_operand (src, GET_MODE (src)); > but after applying the patch, it's formatted correctly with "&& > aarch64..." right below is_a, both on column 10. Yeah, the indentation itself was OK. But there's an =E2=80=9Cemacs rule=E2= =80=9D that says that parens should be used when splitting an expression over multiple lines like this. So: ------- Formatting: return (is_a(GET_MODE (dest)) && aarch64_mov_operand_p (src, GET_MODE (src))); ------- was about adding the parens. > + for (; seq; seq =3D NEXT_INSN (seq)) > + if (NONDEBUG_INSN_P (seq) > + && !scalar_move_insn_p (seq)) > + { > + if (rtx set =3D single_set (seq)) > + cost +=3D set_rtx_cost (set, speed); > + else > + { > + int this_cost =3D insn_cost (CONST_CAST_RTX_INSN (seq), speed); > + if (this_cost > 0) > + cost +=3D this_cost; > + else > + cost++; > + } > + } I think it'd be better to do the single_set first, and pass the set to scalar_move_insn_p. I.e. for (; seq; seq =3D NEXT_INSN (seq)) if (NONDEBUG_INSN_P (seq)) { if (rtx set =3D single_set (seq)) { if (!scalar_move_insn_p (set)) cost +=3D set_rtx_cost (set, speed); } else { int this_cost =3D insn_cost (CONST_CAST_RTX_INSN (seq), speed); if (this_cost > 0) cost +=3D this_cost; else cost++; } } Then scalar_move_insn_p can just be the last three statements, adjusted as follows: rtx src =3D SET_SRC (set); rtx dest =3D SET_DEST (set); return (is_a (GET_MODE (dest)) && aarch64_mov_operand (src, GET_MODE (dest))); Note the space after ">", and that the mode passed to aarch64_mov_operand is the destination mode (since the source mode might be VOIDmode). OK with that change, thanks. Thanks, Richard