From mboxrd@z Thu Jan  1 00:00:00 1970
From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Code-gen for vector initialization involving constants
Date: Thu, 18 May 2023 17:34:52 +0100
In-Reply-To: (Prathamesh Kulkarni's message of "Thu, 18 May 2023 20:11:01 +0530")
MIME-Version: 1.0
Content-Type: text/plain

Prathamesh Kulkarni writes:
> On Thu, 18 May 2023 at 13:37, Richard Sandiford wrote:
>>
>> Prathamesh Kulkarni writes:
>> > On Tue, 16 May 2023 at 00:29, Richard Sandiford wrote:
>> >>
>> >> Prathamesh Kulkarni writes:
>> >> > Hi Richard,
>> >> > After committing the interleave+zip1 patch for vector initialization,
>> >> > it seems to regress the s32 case for this patch:
>> >> >
>> >> > int32x4_t
>> >> > f_s32(int32_t x)
>> >> > {
>> >> >   return (int32x4_t) { x, x, x, 1 };
>> >> > }
>> >> >
>> >> > code-gen:
>> >> > f_s32:
>> >> >         movi    v30.2s, 0x1
>> >> >         fmov    s31, w0
>> >> >         dup     v0.2s, v31.s[0]
>> >> >         ins     v30.s[0], v31.s[0]
>> >> >         zip1    v0.4s, v0.4s, v30.4s
>> >> >         ret
>> >> >
>> >> > instead of expected code-gen:
>> >> > f_s32:
>> >> >         movi    v31.2s, 0x1
>> >> >         dup     v0.4s, w0
>> >> >         ins     v0.s[3], v31.s[0]
>> >> >         ret
>> >> >
>> >> > Cost for fallback sequence: 16
>> >> > Cost for interleave and zip sequence: 12
>> >> >
>> >> > For the above case, the cost for the interleave+zip1 sequence is computed as:
>> >> > halves[0]:
>> >> > (set (reg:V2SI 96)
>> >> >      (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
>> >> > cost = 8
>> >> >
>> >> > halves[1]:
>> >> > (set (reg:V2SI 97)
>> >> >      (const_vector:V2SI [
>> >> >          (const_int 1 [0x1]) repeated x2
>> >> >      ]))
>> >> > (set (reg:V2SI 97)
>> >> >      (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
>> >> >          (reg:V2SI 97)
>> >> >          (const_int 1 [0x1])))
>> >> > cost = 8
>> >> >
>> >> > followed by:
>> >> > (set (reg:V4SI 95)
>> >> >      (unspec:V4SI [
>> >> >          (subreg:V4SI (reg:V2SI 96) 0)
>> >> >          (subreg:V4SI (reg:V2SI 97) 0)
>> >> >      ] UNSPEC_ZIP1))
>> >> > cost = 4
>> >> >
>> >> > So the total cost becomes
>> >> > max(costs[0], costs[1]) + zip1_insn_cost
>> >> > = max(8, 8) + 4
>> >> > = 12
>> >> >
>> >> > While the fallback rtl sequence is:
>> >> > (set (reg:V4SI 95)
>> >> >      (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> >> > cost = 8
>> >> > (set (reg:SI 98)
>> >> >      (const_int 1 [0x1]))
>> >> > cost = 4
>> >> > (set (reg:V4SI 95)
>> >> >      (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
>> >> >          (reg:V4SI 95)
>> >> >          (const_int 8 [0x8])))
>> >> > cost = 4
>> >> >
>> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
>> >> >
>> >> > I think the issue is probably that for the interleave+zip1 sequence we take
>> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
>> >> > but for the fallback seq we use seq_cost, which assumes serial execution
>> >> > of insns in the sequence.
>> >> > For the above fallback sequence,
>> >> > (set (reg:V4SI 95)
>> >> >      (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> >> > and
>> >> > (set (reg:SI 98)
>> >> >      (const_int 1 [0x1]))
>> >> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
>>
>> Agreed.
>>
>> A good-enough substitute for this might be to ignore scalar moves
>> (for both alternatives) when costing for speed.
> Thanks for the suggestions.  Just wondering: for aarch64, is there an easy
> way to check whether an insn is a scalar move, similar to riscv's
> scalar_move_insn_p, which checks whether get_attr_type (insn) is
> TYPE_VIMOVXV or TYPE_VFMOVFV?

It should be enough to check that the pattern is a SET:

(a) whose SET_DEST has a scalar mode and
(b) whose SET_SRC is an aarch64_mov_operand

> Hi Richard,
> Thanks for the suggestions; the attached patch calls seq_cost to compute
> the cost of the sequence and then subtracts the cost of each scalar move
> insn from it.
> Does that look OK?
> The patch is under bootstrap+test on aarch64-linux-gnu.

Yeah, the patch looks reasonable (some comments below).  The testing
for this kind of patch is more than a formality though, so it would
be good to wait to see if the tests pass.

> [...]
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 29dbacfa917..7efd896d364 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22332,6 +22332,32 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
>    return gen_rtx_PARALLEL (new_mode, vec);
>  }
>
> +/* Return true if INSN is a scalar move.
> +   */
> +
> +static bool
> +scalar_move_insn_p (rtx_insn *insn)
> +{
> +  rtx set = single_set (insn);
> +  if (!set)
> +    return false;
> +  rtx src = SET_SRC (set);
> +  rtx dest = SET_DEST (set);
> +  return is_a<scalar_mode> (GET_MODE (dest)) && aarch64_mov_operand_p (src, GET_MODE (src));

Long line.

> +}
> +
> +/* Ignore cost for scalar moves from cost of sequence.  This function is called
> +   for calculating sequence costs in aarch64_expand_vector_init.  */

Maybe more readable as "ignoring" rather than "ignore".

> +
> +static unsigned
> +seq_cost_ignore_scalar_moves (rtx_insn *seq, bool speed)
> +{
> +  unsigned cost = seq_cost (seq, speed);
> +  for (; seq; seq = NEXT_INSN (seq))
> +    if (scalar_move_insn_p (seq))
> +      cost -= insn_cost (seq, speed);
> +  return cost;
> +}
> +

seq_cost uses set_rtx_cost rather than insn_cost for single sets.
To avoid that kind of inconsistency, I think it'd be better to duplicate
and adjust seq_cost.  Then scalar_move_insn_p can be passed the single set.

>  /* Expand a vector initialization sequence, such that TARGET is
>     initialized to contain VALS.  */
>
> @@ -22367,7 +22393,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>        halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
>        rtx_insn *rec_seq = get_insns ();
>        end_sequence ();
> -      costs[i] = seq_cost (rec_seq, !optimize_size);
> +      costs[i] = seq_cost_ignore_scalar_moves (rec_seq, !optimize_size);
>        emit_insn (rec_seq);
>      }
>
> @@ -22384,7 +22410,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>    start_sequence ();
>    aarch64_expand_vector_init_fallback (target, vals);
>    rtx_insn *fallback_seq = get_insns ();
> -  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
> +  unsigned fallback_seq_cost = seq_cost_ignore_scalar_moves (fallback_seq, !optimize_size);

Long line.

Thanks,
Richard

>    end_sequence ();
>
>    emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);