From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=9PAR=BH=linaro.org=prathamesh.kulkarni@sourceware.org>
Received: from mail-wm1-x332.google.com (mail-wm1-x332.google.com [IPv6:2a00:1450:4864:20::332])
	by sourceware.org (Postfix) with ESMTPS id 697BD3853D08
	for <gcc-patches@gcc.gnu.org>; Thu, 18 May 2023 14:41:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 697BD3853D08
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org
Received: by mail-wm1-x332.google.com with SMTP id 5b1f17b1804b1-3f4ad71b00eso13737565e9.2
        for <gcc-patches@gcc.gnu.org>; Thu, 18 May 2023 07:41:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linaro.org; s=google; t=1684420899; x=1687012899;
        h=to:subject:message-id:date:from:in-reply-to:references:mime-version
         :from:to:cc:subject:date:message-id:reply-to;
        bh=Rn75N88+FqJlx+hVuoEXzQ49QYSN1SFXnerHzMX5Ruk=;
        b=zkGHSLJ4dKAqMAbOpsJjSq4KB5aWGmeqqDUTJfjyr0aQbbbs6say+K0akmHwTYnQDy
         RiBwTKbqedHLZDn3af07wztOvq2IuothjnA4qNcRP4A+uF02gUsLP4XCZS9cBH/9Kwxq
         Fc7/t2tw8nVqmtqLFwtII6LirOVO3UhW4HnHz2H+EzXuXGTWGGxniYE3A/9tDxi9D/ip
         ODIcVC8s5KNCPrqX37n78D3Y1TL4AAjD+SSJlDiEAbpYhHLbTRX/dhfm5+LeYjPg9rec
         id9k2L1ry3UP+cr5NUZbcvfvLYw+Ug74gI74cT/3ySdKtSpGAWjaHyzrcwnzedm3Azjn
         yggw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1684420899; x=1687012899;
        h=to:subject:message-id:date:from:in-reply-to:references:mime-version
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=Rn75N88+FqJlx+hVuoEXzQ49QYSN1SFXnerHzMX5Ruk=;
        b=eRH0gZyfXa0ExQaLsdvh45pAIH0F6Wbb7zyTP37EUzymsteI4g4Fv1KcerDjrFtcQR
         G30tI0OdhYBDa4x6/jqRlA0rOk4uhs5ZFlSn7xz0kzyUqmzQww7FqJUPBd4qJKK12eav
         gxJ7dPs3K+3oTS3QYW9dpth6U/hNyc+emhcjIkW6jvHZozE7F0ezEol3Q39Obff+yf/3
         RKFR3QVnycJvGApbphuEdG+s5SFTwzYmdqu1WTFS9Yt1xoXJptAU4Entkzc7y6+pCFZn
         Lzr0/4BNJIjusYJLIHYF12DbqxAKZN0SttRurb+CGFGGvqsWf/5kuXzLo2RgJlHfRrH1
         VoEw==
X-Gm-Message-State: AC+VfDxFBT5oryyfD729HIvZaM1A8IL8SlA2M461z/gPe4Et4DSRVvqr
	dbo7hnkTlCmwci0cnF/u2pbZAkMCWc5hHuCBhji2Bw==
X-Google-Smtp-Source: ACHHUZ6iiqDHQ6oA6pNFOTOGuyy6265+Oy80OubXF28Eg15jyd+pEYcmtU3v/2DHE9GgpGqRNz/jaNFcLUXZ/ZCCgq8=
X-Received: by 2002:a05:600c:2049:b0:3f1:6fea:790a with SMTP id
 p9-20020a05600c204900b003f16fea790amr1737077wmg.30.1684420899035; Thu, 18 May
 2023 07:41:39 -0700 (PDT)
MIME-Version: 1.0
References: <CAAgBjMnwGk4fOc3PTM_agTXXFvt=767a3-AWOfSr23Xja6K81w@mail.gmail.com>
 <mpt354o448m.fsf@arm.com> <CAAgBjMm2o_64mbmOsnzjZMrLNRozav95sCFR73cz+iA2SLURwg@mail.gmail.com>
 <mpt354fw0dc.fsf@arm.com> <CAAgBjM=ryaC9h-9N_WWmPjcqE_fEA7xXPK8wye8TvFXsnMTyHg@mail.gmail.com>
 <mpto7n3ueju.fsf@arm.com> <CAAgBjMnP5Z3bsge08n5GWszBBqPgwV66KaZjr6vYOs3GUvv2qw@mail.gmail.com>
 <mpt7ctqvqsz.fsf@arm.com> <CAAgBjMn6tOT3HpOdjB3f47D_qAVC2S1jsoGVGvnJ7uk6vDcKdA@mail.gmail.com>
 <mpt3542el2f.fsf@arm.com> <CAAgBjMm9__ZPxuNvtaiwSFaS35kWqj2StsQY6atYnUPMzUYYsw@mail.gmail.com>
 <mpth6sd5slj.fsf@arm.com> <CAAgBjMnAT6keAqMqtj2+0Z6ZKyKDQ+e+jWHANgEESVFHG_WM2w@mail.gmail.com>
 <mpt353uysem.fsf@arm.com>
In-Reply-To: <mpt353uysem.fsf@arm.com>
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Date: Thu, 18 May 2023 20:11:01 +0530
Message-ID: <CAAgBjMnXfy0R8AjJhaXZ7Af7CF7n=xgPEycc4TKJ77V-Tcy5mQ@mail.gmail.com>
Subject: Re: [aarch64] Code-gen for vector initialization involving constants
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>, gcc Patches <gcc-patches@gcc.gnu.org>, 
	richard.sandiford@arm.com
Content-Type: multipart/mixed; boundary="000000000000570e6e05fbf8cd20"
X-Spam-Status: No, score=-9.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

--000000000000570e6e05fbf8cd20
Content-Type: text/plain; charset="UTF-8"

On Thu, 18 May 2023 at 13:37, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 16 May 2023 at 00:29, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi Richard,
> >> > After committing the interleave+zip1 patch for vector initialization,
> >> > it seems to regress the s32 case for this patch:
> >> >
> >> > int32x4_t f_s32(int32_t x)
> >> > {
> >> >   return (int32x4_t) { x, x, x, 1 };
> >> > }
> >> >
> >> > code-gen:
> >> > f_s32:
> >> >         movi    v30.2s, 0x1
> >> >         fmov    s31, w0
> >> >         dup     v0.2s, v31.s[0]
> >> >         ins     v30.s[0], v31.s[0]
> >> >         zip1    v0.4s, v0.4s, v30.4s
> >> >         ret
> >> >
> >> > instead of expected code-gen:
> >> > f_s32:
> >> >         movi    v31.2s, 0x1
> >> >         dup     v0.4s, w0
> >> >         ins     v0.s[3], v31.s[0]
> >> >         ret
> >> >
> >> > Cost for fallback sequence: 16
> >> > Cost for interleave and zip sequence: 12
> >> >
> >> > For the above case, the cost for interleave+zip1 sequence is computed as:
> >> > halves[0]:
> >> > (set (reg:V2SI 96)
> >> >     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> >> > cost = 8
> >> >
> >> > halves[1]:
> >> > (set (reg:V2SI 97)
> >> >     (const_vector:V2SI [
> >> >             (const_int 1 [0x1]) repeated x2
> >> >         ]))
> >> > (set (reg:V2SI 97)
> >> >     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
> >> >         (reg:V2SI 97)
> >> >         (const_int 1 [0x1])))
> >> > cost = 8
> >> >
> >> > followed by:
> >> > (set (reg:V4SI 95)
> >> >     (unspec:V4SI [
> >> >             (subreg:V4SI (reg:V2SI 96) 0)
> >> >             (subreg:V4SI (reg:V2SI 97) 0)
> >> >         ] UNSPEC_ZIP1))
> >> > cost = 4
> >> >
> >> > So the total cost becomes
> >> > max(costs[0], costs[1]) + zip1_insn_cost
> >> > = max(8, 8) + 4
> >> > = 12
> >> >
> >> > While the fallback rtl sequence is:
> >> > (set (reg:V4SI 95)
> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> > cost = 8
> >> > (set (reg:SI 98)
> >> >     (const_int 1 [0x1]))
> >> > cost = 4
> >> > (set (reg:V4SI 95)
> >> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
> >> >         (reg:V4SI 95)
> >> >         (const_int 8 [0x8])))
> >> > cost = 4
> >> >
> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
> >> >
> >> > I think the issue is probably that for the interleave+zip1 sequence we take
> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
> >> > but for the fallback seq we use seq_cost, which assumes serial execution
> >> > of insns in the sequence.
> >> > For above fallback sequence,
> >> > set (reg:V4SI 95)
> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> > and
> >> > (set (reg:SI 98)
> >> >     (const_int 1 [0x1]))
> >> > could be executed in parallel, which would make it's cost max(8, 4) + 4 = 12.
> >>
> >> Agreed.
> >>
> >> A good-enough substitute for this might be to ignore scalar moves
> >> (for both alternatives) when costing for speed.
> > Thanks for the suggestions. Just wondering for aarch64, if there's an easy
> > way we can check if insn is a scalar move, similar to riscv's scalar_move_insn_p
> > that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?
>
> It should be enough to check that the pattern is a SET:
>
> (a) whose SET_DEST has a scalar mode and
> (b) whose SET_SRC an aarch64_mov_operand
Hi Richard,
Thanks for the suggestions, the attached patch calls seq_cost to compute
cost for sequence and then subtracts cost of each scalar move insn from it.
Does that look OK ?
The patch is under bootstrap+test on aarch64-linux-gnu.

After applying the single-constant case patch on top, the cost of fallback
sequence is now reduced to 12 instead of 16:
Cost before ignoring scalar moves: 16
Ignoring cost = 4 for: (set (reg:SI 98)
    (const_int 1 [0x1]))
Cost after ignoring scalar moves: 12
fallback_seq_cost = 12, zip1_seq_cost = 12

fallback_seq:
(set (reg:V4SI 95)
    (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
(set (reg:SI 98)
    (const_int 1 [0x1]))
(set (reg:V4SI 95)
    (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
        (reg:V4SI 95)
        (const_int 8 [0x8])))

zip1_seq:
(set (reg:V2SI 96)
    (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
(set (reg:V2SI 97)
    (const_vector:V2SI [
            (const_int 1 [0x1]) repeated x2
        ]))
(set (reg:V2SI 97)
    (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
        (reg:V2SI 97)
        (const_int 1 [0x1])))
(set (reg:V4SI 95)
    (unspec:V4SI [
            (subreg:V4SI (reg:V2SI 96) 0)
            (subreg:V4SI (reg:V2SI 97) 0)
        ] UNSPEC_ZIP1))

So now the costs for both sequences are tied at 12, and so it now
chooses the fallback sequence,
which "fixes" this case. However, more generally, if the costs for
both sequences are tied,
how do we evaluate which sequence'd be better ? Currently we choose
the fallback sequence if
the costs for both sequences are same.
>
> >> > I was wondering if we should we make cost for interleave+zip1 sequence
> >> > more conservative
> >> > by not taking max, but summing up costs[0] + costs[1] even for speed ?
> >> > For this case,
> >> > that would be 8 + 8 + 4 = 20.
> >> >
> >> > It generates the fallback sequence for other cases (s8, s16, s64) from
> >> > the test-case.
> >>
> >> What does it do for the tests in the interleave+zip1 patch?  If it doesn't
> >> make a difference there then it sounds like we don't have enough tests. :)
> > Oh right, the tests in interleave+zip1 patch only check for s16 case,
> > sorry about that :/
> > Looking briefly at the code generated for s8, s32 and s64 case,
> > (a) s8, and s16 seem to use same sequence for all cases.
> > (b) s64 seems to use fallback sequence.
> > (c) For vec-init-21.c, s8 and s16 cases prefer fallback sequence
> > because costs are tied,
> > while s32 case prefers interleave+zip1:
> >
> > int32x4_t f_s32(int32_t x, int32_t y)
> > {
> >   return (int32x4_t) { x, y, 1, 2 };
> > }
> >
> > Code-gen with interleave+zip1 sequence:
> > f_s32:
> >         movi    v31.2s, 0x1
> >         movi    v0.2s, 0x2
> >         ins     v31.s[0], w0
> >         ins     v0.s[0], w1
> >         zip1    v0.4s, v31.4s, v0.4s
> >         ret
> >
> > Code-gen with fallback sequence:
> > f_s32:
> >         adrp    x2, .LC0
> >         ldr     q0, [x2, #:lo12:.LC0]
> >         ins     v0.s[0], w0
> >         ins     v0.s[1], w1
> >         ret
> >
> > Fallback sequence cost = 20
> > interleave+zip1 sequence cost = 12
> > I assume interleave+zip1 sequence is better in this case (chosen currently) ?
> >
> > I will send a patch to add cases for s8, s16 and s64 in a follow up patch soon.
> >>
> >> Summing is only conservative if the fallback sequence is somehow "safer".
> >> But I don't think it is.   Building an N-element vector from N scalars
> >> can be done using N instructions in the fallback case and N+1 instructions
> >> in the interleave+zip1 case.  But the interleave+zip1 case is still
> >> better (speedwise) for N==16.
> > Ack, thanks.
> > Should we also prefer interleave+zip1 when the costs are tied ?
>
> No, because the ZIP1 approach requires more temporary registers (in
> general).  And we're making an optimistic (but reasonable) assumption
> that enough vector pipes are free to do the work in parallel.
Oh right, thanks, the zip1 sequence'd also increase register pressure.
>
> > For eg, for the following case:
> > int32x4_t f_s32(int32_t x)
> > {
> >   return (int32x4_t) { x, 1, x, 1 };
> > }
> >
> > costs for both fallback and interleave+zip1 sequence = 12, and we
> > currently choose fallback sequence.
> > Code-gen:
> > f_s32:
> >         movi    v0.4s, 0x1
> >         fmov    s31, w0
> >         ins     v0.s[0], v31.s[0]
> >         ins     v0.s[2], v31.s[0]
> >         ret
> >
> > while, if we choose interleave+zip1, code-gen is:
> > f_s32:
> >         dup     v31.2s, w0
> >         movi    v0.2s, 0x1
> >         zip1    v0.4s, v31.4s, v0.4s
> >         ret
> >
> > I suppose the interleave+zip1 sequence is better in this case ?
> > And more generally, if the costs are tied, would it be OK to prefer
> > interleave+zip1 sequence since it will
> > have parallel execution of two halves, which may not always be the
> > case with fallback sequence ?
>
> But when looking at these sequences, it's important to ask not just
> which sequence wins, but why it wins.  If the zip1 version is better
> (I agree it probably is), then the question is why that isn't showing
> up in the costs.
>
> > Also, would it be OK to commit the above patch that addresses the
> > issue with single constant case and xfail the s32 case for now ?
>
> I think it'd be better to wait until we can avoid the XFAIL.  That way
> it's easier to see that we're making forward progress.
Sure.

Thanks,
Prathamesh
>
> Thanks,
> Richard

--000000000000570e6e05fbf8cd20
Content-Type: text/plain; charset="US-ASCII"; name="ignore-scalar-move-1.txt"
Content-Disposition: attachment; filename="ignore-scalar-move-1.txt"
Content-Transfer-Encoding: base64
Content-ID: <f_lht8kx6g0>
X-Attachment-Id: f_lht8kx6g0

ZGlmZiAtLWdpdCBhL2djYy9jb25maWcvYWFyY2g2NC9hYXJjaDY0LmNjIGIvZ2NjL2NvbmZpZy9h
YXJjaDY0L2FhcmNoNjQuY2MKaW5kZXggMjlkYmFjZmE5MTcuLjdlZmQ4OTZkMzY0IDEwMDY0NAot
LS0gYS9nY2MvY29uZmlnL2FhcmNoNjQvYWFyY2g2NC5jYworKysgYi9nY2MvY29uZmlnL2FhcmNo
NjQvYWFyY2g2NC5jYwpAQCAtMjIzMzIsNiArMjIzMzIsMzIgQEAgYWFyY2g2NF91bnppcF92ZWN0
b3JfaW5pdCAobWFjaGluZV9tb2RlIG1vZGUsIHJ0eCB2YWxzLCBib29sIGV2ZW5fcCkKICAgcmV0
dXJuIGdlbl9ydHhfUEFSQUxMRUwgKG5ld19tb2RlLCB2ZWMpOwogfQogCisvKiBSZXR1cm4gdHJ1
ZSBpZiBJTlNOIGlzIGEgc2NhbGFyIG1vdmUuICAqLworCitzdGF0aWMgYm9vbAorc2NhbGFyX21v
dmVfaW5zbl9wIChydHhfaW5zbiAqaW5zbikKK3sKKyAgcnR4IHNldCA9IHNpbmdsZV9zZXQgKGlu
c24pOworICBpZiAoIXNldCkKKyAgICByZXR1cm4gZmFsc2U7CisgIHJ0eCBzcmMgPSBTRVRfU1JD
IChzZXQpOworICBydHggZGVzdCA9IFNFVF9ERVNUIChzZXQpOworICByZXR1cm4gaXNfYTxzY2Fs
YXJfbW9kZT4oR0VUX01PREUgKGRlc3QpKSAmJiBhYXJjaDY0X21vdl9vcGVyYW5kX3AgKHNyYywg
R0VUX01PREUgKHNyYykpOworfQorCisvKiBJZ25vcmUgY29zdCBmb3Igc2NhbGFyIG1vdmVzIGZy
b20gY29zdCBvZiBzZXF1ZW5jZS4gVGhpcyBmdW5jdGlvbiBpcyBjYWxsZWQKKyAgIGZvciBjYWxj
dWxhdGluZyBzZXF1ZW5jZSBjb3N0cyBpbiBhYXJjaDY0X2V4cGFuZF92ZWN0b3JfaW5pdC4gICov
CisKK3N0YXRpYyB1bnNpZ25lZAorc2VxX2Nvc3RfaWdub3JlX3NjYWxhcl9tb3ZlcyAocnR4X2lu
c24gKnNlcSwgYm9vbCBzcGVlZCkKK3sKKyAgdW5zaWduZWQgY29zdCA9IHNlcV9jb3N0IChzZXEs
IHNwZWVkKTsKKyAgZm9yICg7IHNlcTsgc2VxID0gTkVYVF9JTlNOIChzZXEpKQorICAgIGlmIChz
Y2FsYXJfbW92ZV9pbnNuX3AgKHNlcSkpCisgICAgICBjb3N0IC09IGluc25fY29zdCAoc2VxLCBz
cGVlZCk7CisgIHJldHVybiBjb3N0OworfQorCiAvKiBFeHBhbmQgYSB2ZWN0b3IgaW5pdGlhbGl6
YXRpb24gc2VxdWVuY2UsIHN1Y2ggdGhhdCBUQVJHRVQgaXMKICAgIGluaXRpYWxpemVkIHRvIGNv
bnRhaW4gVkFMUy4gICovCiAKQEAgLTIyMzY3LDcgKzIyMzkzLDcgQEAgYWFyY2g2NF9leHBhbmRf
dmVjdG9yX2luaXQgKHJ0eCB0YXJnZXQsIHJ0eCB2YWxzKQogICAgICAgaGFsdmVzW2ldID0gZ2Vu
X3J0eF9TVUJSRUcgKG1vZGUsIHRtcF9yZWcsIDApOwogICAgICAgcnR4X2luc24gKnJlY19zZXEg
PSBnZXRfaW5zbnMgKCk7CiAgICAgICBlbmRfc2VxdWVuY2UgKCk7Ci0gICAgICBjb3N0c1tpXSA9
IHNlcV9jb3N0IChyZWNfc2VxLCAhb3B0aW1pemVfc2l6ZSk7CisgICAgICBjb3N0c1tpXSA9IHNl
cV9jb3N0X2lnbm9yZV9zY2FsYXJfbW92ZXMgKHJlY19zZXEsICFvcHRpbWl6ZV9zaXplKTsKICAg
ICAgIGVtaXRfaW5zbiAocmVjX3NlcSk7CiAgICAgfQogCkBAIC0yMjM4NCw3ICsyMjQxMCw3IEBA
IGFhcmNoNjRfZXhwYW5kX3ZlY3Rvcl9pbml0IChydHggdGFyZ2V0LCBydHggdmFscykKICAgc3Rh
cnRfc2VxdWVuY2UgKCk7CiAgIGFhcmNoNjRfZXhwYW5kX3ZlY3Rvcl9pbml0X2ZhbGxiYWNrICh0
YXJnZXQsIHZhbHMpOwogICBydHhfaW5zbiAqZmFsbGJhY2tfc2VxID0gZ2V0X2luc25zICgpOwot
ICB1bnNpZ25lZCBmYWxsYmFja19zZXFfY29zdCA9IHNlcV9jb3N0IChmYWxsYmFja19zZXEsICFv
cHRpbWl6ZV9zaXplKTsKKyAgdW5zaWduZWQgZmFsbGJhY2tfc2VxX2Nvc3QgPSBzZXFfY29zdF9p
Z25vcmVfc2NhbGFyX21vdmVzIChmYWxsYmFja19zZXEsICFvcHRpbWl6ZV9zaXplKTsKICAgZW5k
X3NlcXVlbmNlICgpOwogCiAgIGVtaXRfaW5zbiAoc2VxX3RvdGFsX2Nvc3QgPCBmYWxsYmFja19z
ZXFfY29zdCA/IHNlcSA6IGZhbGxiYWNrX3NlcSk7Cg==
--000000000000570e6e05fbf8cd20--