On Thu, 18 May 2023 at 22:04, Richard Sandiford wrote:
>
> Prathamesh Kulkarni writes:
> > On Thu, 18 May 2023 at 13:37, Richard Sandiford
> > wrote:
> >>
> >> Prathamesh Kulkarni writes:
> >> > On Tue, 16 May 2023 at 00:29, Richard Sandiford
> >> > wrote:
> >> >>
> >> >> Prathamesh Kulkarni writes:
> >> >> > Hi Richard,
> >> >> > After committing the interleave+zip1 patch for vector initialization,
> >> >> > it seems to regress the s32 case for this patch:
> >> >> >
> >> >> > int32x4_t f_s32(int32_t x)
> >> >> > {
> >> >> >   return (int32x4_t) { x, x, x, 1 };
> >> >> > }
> >> >> >
> >> >> > code-gen:
> >> >> > f_s32:
> >> >> >         movi    v30.2s, 0x1
> >> >> >         fmov    s31, w0
> >> >> >         dup     v0.2s, v31.s[0]
> >> >> >         ins     v30.s[0], v31.s[0]
> >> >> >         zip1    v0.4s, v0.4s, v30.4s
> >> >> >         ret
> >> >> >
> >> >> > instead of expected code-gen:
> >> >> > f_s32:
> >> >> >         movi    v31.2s, 0x1
> >> >> >         dup     v0.4s, w0
> >> >> >         ins     v0.s[3], v31.s[0]
> >> >> >         ret
> >> >> >
> >> >> > Cost for fallback sequence: 16
> >> >> > Cost for interleave and zip sequence: 12
> >> >> >
> >> >> > For the above case, the cost for the interleave+zip1 sequence is computed as:
> >> >> > halves[0]:
> >> >> > (set (reg:V2SI 96)
> >> >> >      (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> >> >> > cost = 8
> >> >> >
> >> >> > halves[1]:
> >> >> > (set (reg:V2SI 97)
> >> >> >      (const_vector:V2SI [
> >> >> >          (const_int 1 [0x1]) repeated x2
> >> >> >      ]))
> >> >> > (set (reg:V2SI 97)
> >> >> >      (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
> >> >> >          (reg:V2SI 97)
> >> >> >          (const_int 1 [0x1])))
> >> >> > cost = 8
> >> >> >
> >> >> > followed by:
> >> >> > (set (reg:V4SI 95)
> >> >> >      (unspec:V4SI [
> >> >> >          (subreg:V4SI (reg:V2SI 96) 0)
> >> >> >          (subreg:V4SI (reg:V2SI 97) 0)
> >> >> >      ] UNSPEC_ZIP1))
> >> >> > cost = 4
> >> >> >
> >> >> > So the total cost becomes
> >> >> > max(costs[0], costs[1]) + zip1_insn_cost
> >> >> > = max(8, 8) + 4
> >> >> > = 12
> >> >> >
> >> >> > While the fallback rtl sequence is:
> >> >> > (set (reg:V4SI 95)
> >> >> >      (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> >> > cost = 8
> >> >> > (set (reg:SI 98)
> >> >> >      (const_int 1 [0x1]))
> >> >> > cost = 4
> >> >> > (set (reg:V4SI 95)
> >> >> >      (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
> >> >> >          (reg:V4SI 95)
> >> >> >          (const_int 8 [0x8])))
> >> >> > cost = 4
> >> >> >
> >> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
> >> >> >
> >> >> > I think the issue is probably that for the interleave+zip1 sequence we take
> >> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
> >> >> > but for the fallback seq we use seq_cost, which assumes serial execution
> >> >> > of insns in the sequence.
> >> >> > For the above fallback sequence,
> >> >> > (set (reg:V4SI 95)
> >> >> >      (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> >> > and
> >> >> > (set (reg:SI 98)
> >> >> >      (const_int 1 [0x1]))
> >> >> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
> >> >>
> >> >> Agreed.
> >> >>
> >> >> A good-enough substitute for this might be to ignore scalar moves
> >> >> (for both alternatives) when costing for speed.
> >> > Thanks for the suggestions. Just wondering, for aarch64, is there an easy
> >> > way to check whether an insn is a scalar move, similar to riscv's
> >> > scalar_move_insn_p, which checks if get_attr_type(insn) is TYPE_VIMOVXV
> >> > or TYPE_VFMOVFV?
> >>
> >> It should be enough to check that the pattern is a SET:
> >>
> >> (a) whose SET_DEST has a scalar mode and
> >> (b) whose SET_SRC is an aarch64_mov_operand
> > Hi Richard,
> > Thanks for the suggestions. The attached patch calls seq_cost to compute
> > the cost of the sequence and then subtracts the cost of each scalar move
> > insn from it.
> > Does that look OK?
> > The patch is under bootstrap+test on aarch64-linux-gnu.
>
> Yeah, the patch looks reasonable (some comments below).  The testing
> for this kind of patch is more than a formality though, so it would
> be good to wait to see if the tests pass.
>
> > [...]
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 29dbacfa917..7efd896d364 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22332,6 +22332,32 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >    return gen_rtx_PARALLEL (new_mode, vec);
> >  }
> >
> > +/* Return true if INSN is a scalar move.  */
> > +
> > +static bool
> > +scalar_move_insn_p (rtx_insn *insn)
> > +{
> > +  rtx set = single_set (insn);
> > +  if (!set)
> > +    return false;
> > +  rtx src = SET_SRC (set);
> > +  rtx dest = SET_DEST (set);
> > +  return is_a<scalar_mode> (GET_MODE (dest)) && aarch64_mov_operand_p (src, GET_MODE (src));
>
> Long line.
>
> > +}
> > +
> > +/* Ignore cost for scalar moves from cost of sequence.  This function is called
> > +   for calculating sequence costs in aarch64_expand_vector_init.  */
> > +
> > +static unsigned
> > +seq_cost_ignore_scalar_moves (rtx_insn *seq, bool speed)
>
> Maybe more readable as "ignoring" rather than "ignore".
>
> > +{
> > +  unsigned cost = seq_cost (seq, speed);
> > +  for (; seq; seq = NEXT_INSN (seq))
> > +    if (scalar_move_insn_p (seq))
> > +      cost -= insn_cost (seq, speed);
> > +  return cost;
> > +}
> > +
>
> seq_cost uses set_rtx_cost rather than insn_cost for single sets.
>
> To avoid that kind of inconsistency, I think it'd be better to duplicate and
> adjust seq_cost.  Then scalar_move_insn_p can be passed the single set.
>
> >  /* Expand a vector initialization sequence, such that TARGET is
> >     initialized to contain VALS.  */
> >
> > @@ -22367,7 +22393,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >        halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
> >        rtx_insn *rec_seq = get_insns ();
> >        end_sequence ();
> > -      costs[i] = seq_cost (rec_seq, !optimize_size);
> > +      costs[i] = seq_cost_ignore_scalar_moves (rec_seq, !optimize_size);
> >        emit_insn (rec_seq);
> >      }
> >
> > @@ -22384,7 +22410,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >    start_sequence ();
> >    aarch64_expand_vector_init_fallback (target, vals);
> >    rtx_insn *fallback_seq = get_insns ();
> > -  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
> > +  unsigned fallback_seq_cost = seq_cost_ignore_scalar_moves (fallback_seq, !optimize_size);
>
> Long line.

Hi Richard,
Thanks for the suggestions. Does the attached patch look OK?
Bootstrap+test in progress on aarch64-linux-gnu.

Thanks,
Prathamesh

> Thanks,
> Richard
>
> >    end_sequence ();
> >
> >    emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
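[Editorial note: for reference, a rough sketch of the adjustment Richard
suggests above -- a duplicated seq_cost that passes the single set to
scalar_move_insn_p and simply skips scalar moves -- might look like the
following. This is modelled on GCC's seq_cost in rtlanal.cc and is only an
illustration, not the attached patch; the helper name
seq_cost_ignoring_scalar_moves is assumed from the review comments.]

/* Illustrative sketch (not the attached patch): return true if SET,
   the single set of an insn, is a scalar move.  */

static bool
scalar_move_insn_p (rtx set)
{
  rtx src = SET_SRC (set);
  rtx dest = SET_DEST (set);
  return (is_a<scalar_mode> (GET_MODE (dest))
	  && aarch64_mov_operand_p (src, GET_MODE (src)));
}

/* Illustrative sketch: like seq_cost, but ignore the cost of scalar moves,
   so that scalar set-up instructions are costed as if they execute in
   parallel with the rest of the sequence.  */

static unsigned
seq_cost_ignoring_scalar_moves (rtx_insn *seq, bool speed)
{
  unsigned cost = 0;
  for (; seq; seq = NEXT_INSN (seq))
    if (rtx set = single_set (seq))
      {
	/* Single sets are costed via set_rtx_cost, as in seq_cost,
	   but scalar moves contribute nothing.  */
	if (!scalar_move_insn_p (set))
	  cost += set_rtx_cost (set, speed);
      }
    else if (NONDEBUG_INSN_P (seq))
      {
	int this_cost = insn_cost (seq, speed);
	cost += this_cost > 0 ? this_cost : 1;
      }
  return cost;
}

With something along these lines, the fallback sequence for the f_s32
example should drop to 8 + 4 = 12 once the scalar constant move is ignored,
so it should no longer lose to the interleave+zip1 sequence (also 12).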