From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov
Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
Date: Wed, 30 Jun 2021 18:55:12 +0100
In-Reply-To: (Tamar Christina's message of "Wed, 30 Jun 2021 16:08:44 +0000")

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Monday, June 14, 2021 4:55 PM
>> To: Tamar Christina
>> Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw; Marcus Shawcroft;
>> Kyrylo Tkachov
>> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on
>> inverted operands
>>
>> Tamar Christina writes:
>> > Hi Richard,
>> >> -----Original Message-----
>> >> From: Richard Sandiford
>> >> Sent: Monday, June 14, 2021 3:50 PM
>> >> To: Tamar Christina
>> >> Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw; Marcus Shawcroft;
>> >> Kyrylo Tkachov
>> >> Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks
>> >> on inverted operands
>> >>
>> >> Tamar Christina writes:
>> >> > Hi All,
>> >> >
>> >> > This RFC is trying to address the following inefficiency when
>> >> > vectorizing conditional statements with SVE.
>> >> >
>> >> > Consider the case
>> >> >
>> >> > void f10(double * restrict z, double * restrict w, double * restrict x,
>> >> >          double * restrict y, int n)
>> >> > {
>> >> >     for (int i = 0; i < n; i++) {
>> >> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
>> >> >     }
>> >> > }
>> >> >
>> >> >
>> >> > For which we currently generate at -O3:
>> >> >
>> >> > f10:
>> >> >         cmp     w4, 0
>> >> >         ble     .L1
>> >> >         mov     x5, 0
>> >> >         whilelo p1.d, wzr, w4
>> >> >         ptrue   p3.b, all
>> >> > .L3:
>> >> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
>> >> >         fcmgt   p2.d, p1/z, z1.d, #0.0
>> >> >         fcmgt   p0.d, p3/z, z1.d, #0.0
>> >> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>> >> >         bic     p0.b, p3/z, p1.b, p0.b
>> >> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
>> >> >         fsub    z0.d, p0/m, z0.d, z1.d
>> >> >         movprfx z0.d, p2/m, z1.d
>> >> >         fadd    z0.d, p2/m, z0.d, z2.d
>> >> >         st1d    z0.d, p1, [x0, x5, lsl 3]
>> >> >         incd    x5
>> >> >         whilelo p1.d, w5, w4
>> >> >         b.any   .L3
>> >> > .L1:
>> >> >         ret
>> >> >
>> >> > Notice that the condition for the else branch duplicates the same
>> >> > predicate as the then branch and then uses BIC to negate the results.
>> >> >
>> >> > The reason for this is that during instruction generation in the
>> >> > vectorizer we emit
>> >> >
>> >> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
>> >> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
>> >> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
>> >> >   mask__43.16_73 = ~mask__41.11_66;
>> >> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
>> >> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
>> >> >
>> >> > which ultimately gets optimized to
>> >> >
>> >> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
>> >> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
>> >> >   mask__43.16_73 = ~mask__41.11_66;
>> >> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
>> >> >
>> >> > Notice how the negate is on the operation and not the predicate
>> >> > resulting from the operation.  When this is expanded this turns
>> >> > into RTL where the negate is on the compare directly.  This means
>> >> > the RTL is different from the one without the negate and so CSE is
>> >> > unable to
>> >> recognize that they are essentially same operation.
>> >> >
>> >> > To fix this my patch changes it so you negate the mask rather than
>> >> > the operation
>> >> >
>> >> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
>> >> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
>> >> >   vec_mask_op_67 = ~vec_mask_and_58;
>> >> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
>> >>
>> >> But to me this looks like a pessimisation in gimple terms.  We've
>> >> increased the length of the critical path: vec_mask_and_65 now needs
>> >> a chain of 4 operations instead of 3.
>> >
>> > True, but it should reduce the number of RTL patterns.  I would have
>> > thought RTL is more expensive to handle than gimple.
>>
>> I think this is only a fair gimple optimisation if gimple does the isel
>> itself (to a predicated compare and a predicated NOT).
>>
>> >> We also need to be careful not to pessimise the case in which the
>> >> comparison is an integer one.  At the moment we'll generate opposed
>> >> conditions, which is the intended behaviour:
>> >
>> > This still happens with this patch at `-Ofast` because that flips the
>> > conditions, So the different representation doesn't harm it.
>>
>> OK, that's good.
>>
>>
>> >> .L3:
>> >>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>> >>         cmpgt   p2.d, p0/z, z1.d, #0
>> >>         movprfx z2, z1
>> >>         scvtf   z2.d, p3/m, z1.d
>> >>         cmple   p1.d, p0/z, z1.d, #0
>> >>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
>> >>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
>> >>         fadd    z0.d, p2/m, z0.d, z2.d
>> >>         movprfx z0.d, p1/m, z1.d
>> >>         fsub    z0.d, p1/m, z0.d, z2.d
>> >>         st1d    z0.d, p0, [x0, x5, lsl 3]
>> >>         add     x5, x5, x6
>> >>         whilelo p0.d, w5, w4
>> >>         b.any   .L3
>> >>
>> >> Could we handle the fcmp case using a 3->2 define_split instead:
>> >> convert
>> >>
>> >>   (set res (and (not (fcmp X Y)) Z)) ->
>> >>   (set res (fcmp X Y))
>> >>   (set res (and (not res) Z))
>> >>
>> >
>> > This was the other approach I mentioned.  It works, and gives you the neg,
>> but only in the case where the compare is single use.
>>
>> But in the original example we duplicate the comparison through a
>> 2->2 combine, which leaves the original comparison as a single use.
>> Isn't that enough?
>>
>> > e.g. in
>> >
>> > void f11(double * restrict z, double * restrict w, double * restrict
>> > x, double * restrict y, int n) {
>> >     for (int i = 0; i < n; i++) {
>> >         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
>> >     }
>> > }
>> >
>> > You have some of the same problem.  It generates
>> >
>> >         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
>> >         fcmgt   p2.d, p3/z, z0.d, #0.0
>> >         bic     p1.b, p3/z, p0.b, p2.b
>> >         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
>> >         fsub    z1.d, p1/m, z1.d, z0.d
>> >         sel     z0.d, p2, z0.d, z1.d
>> >         st1d    z0.d, p0, [x0, x2, lsl 3]
>> >         incd    x2
>> >         whilelo p0.d, w2, w4
>> >
>> > which has two problems.  fcmgt doesn't need to be predicated on p3
>> > which is ptrue all, it can/should be p0.
>> >
>> > With that fixed the splitter won't match because p2 is needed in the
>> > sel, so it's not single use and so combine won't try to build the RTL so it can
>> be split.
>>
>> I think having the vectoriser avoid the dual use between the
>> IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is
>> the only pass that has the context to prove that including the loop mask in
>> the VEC_COND_EXPR condition is correct.  We already try to do that to some
>> extent:
>>
>
> Sorry I have been looking at this these past couple of days and I just don't know how
> this is supposed to work.
>
> In the above example the problem is not just the use of p2 in the VEC_COND_EXPR.  If
> the VEC_COND_EXPR is changed to use p1 then p1 now has 3 uses which makes
> combine still not try the combination.
>
> But the general case
>
> void f10(double * restrict z, double * restrict w, double * restrict x, double * restrict y, int n)
> {
>     for (int i = 0; i < n; i++) {
>         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
>     }
> }
>
> Produces
>
> f10:
>         cmp     w4, 0
>         ble     .L1
>         mov     x5, 0
>         whilelo p1.d, wzr, w4
>         ptrue   p3.b, all
>         .p2align 5,,15
> .L3:
>         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
>         fcmgt   p2.d, p1/z, z1.d, #0.0
>         fcmgt   p0.d, p3/z, z1.d, #0.0
>         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>         bic     p0.b, p3/z, p1.b, p0.b
>         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
>         fsub    z0.d, p0/m, z0.d, z1.d
>         movprfx z0.d, p2/m, z1.d
>         fadd    z0.d, p2/m, z0.d, z2.d
>         st1d    z0.d, p1, [x0, x5, lsl 3]
>         incd    x5
>         whilelo p1.d, w5, w4
>         b.any   .L3
>
> where the VEC_COND_EXPR has been elided.
>
> The problem is that the comparison for the inverse case is unpredicated.

Yeah, for the original f10 example I was suggesting that we use combine
instead of changing the vectoriser.  It turns out that the 3->2 split
thing I mentioned above won't work though, because we match the BIC first.
But I think something like:

(define_insn_and_split "*inverted_fcmgt"
  [(set (match_operand:<VPRED> 0 "register_operand" "=Upa")
        (and:<VPRED>
          (and:<VPRED>
            (not:<VPRED>
              (unspec:<VPRED>
                [(match_operand:<VPRED> 1 "register_operand" "Upl")
                 (const_int SVE_KNOWN_PTRUE)
                 (match_operand:SVE_FULL_F 2 "register_operand" "w")
                 (match_operand:SVE_FULL_F 3 "aarch64_simd_reg_or_zero" "wDz")]
                SVE_COND_FP_CMP_I0))
            (match_operand:<VPRED> 4 "register_operand" "w"))
          (match_dup 1)))]
  "TARGET_SVE && !reg_overlap_mentioned_p (operands[4], operands[0])"
  "#"
  "&& 1"
  [(set (match_dup 0)
        (unspec:<VPRED>
          [(match_dup 4)
           (const_int SVE_MAYBE_NOT_PTRUE)
           (match_dup 2)
           (match_dup 3)]
          SVE_COND_FP_CMP_I0))
   (set (match_dup 0)
        (and:<VPRED> (not:<VPRED> (match_dup 0)) (match_dup 4)))]
)

is a legitimate optimisation in its own right because it gets rid of
the PTRUE operand (at split time).  This would be a good thing to do
wherever the FCMxx and BIC come from.

The snag is that we don't then CSE the comparisons between split1 and RA.
The post-RA optimisers might then leave a move.  E.g. for plain -O3 the
above pattern gives the expected:

        cmp     w4, 0
        ble     .L1
        mov     x5, 0
        cntd    x6
        whilelo p0.d, wzr, w4
        .p2align 3,,7
.L3:
        ld1d    z1.d, p0/z, [x1, x5, lsl 3]
        fcmgt   p2.d, p0/z, z1.d, #0.0
        not     p1.b, p0/z, p2.b
        ld1d    z2.d, p2/z, [x2, x5, lsl 3]
        ld1d    z0.d, p1/z, [x3, x5, lsl 3]
        fsub    z0.d, p1/m, z0.d, z1.d
        movprfx z0.d, p2/m, z1.d
        fadd    z0.d, p2/m, z0.d, z2.d
        st1d    z0.d, p0, [x0, x5, lsl 3]
        add     x5, x5, x6
        whilelo p0.d, w5, w4
        b.any   .L3
.L1:
        ret

but -O3 -fno-trapping-math has a predicate move:

        cmp     w4, 0
        ble     .L1
        mov     x5, 0
        cntd    x6
        whilelo p0.d, wzr, w4
        .p2align 3,,7
.L3:
        ld1d    z1.d, p0/z, [x1, x5, lsl 3]
        fcmgt   p1.d, p0/z, z1.d, #0.0
        mov     p2.b, p1.b
        not     p1.b, p0/z, p1.b
        ld1d    z0.d, p2/z, [x2, x5, lsl 3]
        ld1d    z2.d, p1/z, [x3, x5, lsl 3]
        fadd    z0.d, z1.d, z0.d
        movprfx z0.d, p1/m, z2.d
        fsub    z0.d, p1/m, z0.d, z1.d
        st1d    z0.d, p0, [x0, x5, lsl 3]
        add     x5, x5, x6
        whilelo p0.d, w5, w4
        b.any   .L3
.L1:
        ret

It would be good to fix this in RTL “somehow”.

But for f11 we still get:

f11:
.LFB1:
        .cfi_startproc
        cmp     w4, 0
        ble     .L6
        mov     x2, 0
        cntd    x5
        whilelo p1.d, wzr, w4
        ptrue   p3.b, all
        .p2align 3,,7
.L8:
        ld1d    z0.d, p1/z, [x1, x2, lsl 3]
        fcmgt   p0.d, p1/z, z0.d, #0.0
        fcmgt   p2.d, p3/z, z0.d, #0.0
        not     p0.b, p1/z, p0.b
        ld1d    z1.d, p0/z, [x3, x2, lsl 3]
        fsub    z1.d, p0/m, z1.d, z0.d
        sel     z0.d, p2, z0.d, z1.d
        st1d    z0.d, p1, [x0, x2, lsl 3]
        add     x2, x2, x5
        whilelo p1.d, w2, w4
        b.any   .L8
.L6:
        ret

This is where the VEC_COND_EXPR code I mentioned should come in.
At the moment, before generating:

  VEC_COND_EXPR<C, E1, E2>

we check whether there are any other operations predicated on
C & LOOP_MASK.  If so, we use:

  VEC_COND_EXPR<C & LOOP_MASK, E1, E2>

instead.  This avoids both C and C & LOOP_MASK being live at the same time.

The change I was suggesting was that we should also check whether there
are any operations predicated on ~C & LOOP_MASK.  If so, we should generate:

  VEC_COND_EXPR<~C & LOOP_MASK, E2, E1>

Again, the justification is that we don't have C and ~C & LOOP_MASK live
at the same time.  If C itself has the form ~C' then swapping the operands
also saves a negation.

> This code is fine as far as I can tell.  But there's nothing you can do here.  The mask it needs is ~original
> So it does not find an inverse mask to use because it has to honor floating point exceptions.
>
> And indeed `-fno-trapping-math` or `-Ofast` generates the most optimal sequence, but when honoring
> traps it can't re-use invert existing mask, which leaves the operation unpredicated.
>
> So is what you're requesting that it looks inside unary operators and tries to CSE the thing they're pointed to?
[Hopefully answered this above]

> In which case isn't it about the same as what I had before just that the vectorizer did the CSE itself?

IMO the VEC_COND_EXPR optimisation is more selective.  It only changes
the gimple if we know that it will have an effect (in terms of reducing
the number of SSA_NAMEs that have multiple uses).  Also, unlike the
original patch, there's no double-masking involved: ~C & LOOP_MASK will
only apply LOOP_MASK once.

> If that's the case maybe it's better to do lookups into loop_vinfo->scalar_cond_masked_set in prepare_load_store_mask?
> So that it just applies to everything?

scalar_cond_masked_set only exists to convert a mask that ignores
LOOP_MASK into one that is ANDed with LOOP_MASK.  For loads and stores
we always AND with LOOP_MASK, so a lookup isn't necessary.

Thanks,
Richard
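
As a side note on the predicate algebra the thread relies on (a standalone
sketch, not taken from the patch or from GCC): the BIC sequence computes
L & ~C from the unpredicated comparison, while the predicated-NOT sequence
computes L & ~(L & C) from the loop-masked comparison, and the two are the
same value for any loop mask L and comparison result C.  A minimal C check
of that identity, modelling predicates as 8-bit masks:

#include <assert.h>
#include <stdio.h>

int
main (void)
{
  /* l models the loop mask (governing predicate), c the raw comparison
     result.  bic_form mirrors "fcmgt under a ptrue + bic with the loop
     mask"; not_form mirrors "fcmgt under the loop mask + not under the
     loop mask".  */
  for (unsigned int l = 0; l < 256; ++l)
    for (unsigned int c = 0; c < 256; ++c)
      {
        unsigned int bic_form = l & ~c & 0xff;        /* bic p.b, ptrue/z, l.b, c.b */
        unsigned int not_form = l & ~(l & c) & 0xff;  /* not p.b, l/z, (l & c).b */
        assert (bic_form == not_form);
      }
  puts ("loop-masked negation forms agree");
  return 0;
}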