From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, Richard Earnshaw, Marcus Shawcroft,
 Kyrylo Tkachov
Subject: Re: [PATCH][RFC]AArch64 SVE: Fix multiple comparison masks on inverted operands
Date: Mon, 14 Jun 2021 16:54:35 +0100
In-Reply-To: Tamar Christina's message of Mon, 14 Jun 2021 15:05:38 +0000

Tamar Christina writes:
> Hi Richard,
>
>> Tamar Christina writes:
>> > Hi All,
>> >
>> > This RFC is trying to address the following inefficiency when
>> > vectorizing conditional statements with SVE.
>> >
>> > Consider the case
>> >
>> > void f10(double * restrict z, double * restrict w, double * restrict x,
>> >          double * restrict y, int n)
>> > {
>> >     for (int i = 0; i < n; i++) {
>> >         z[i] = (w[i] > 0) ? x[i] + w[i] : y[i] - w[i];
>> >     }
>> > }
>> >
>> > for which we currently generate at -O3:
>> >
>> > f10:
>> >         cmp     w4, 0
>> >         ble     .L1
>> >         mov     x5, 0
>> >         whilelo p1.d, wzr, w4
>> >         ptrue   p3.b, all
>> > .L3:
>> >         ld1d    z1.d, p1/z, [x1, x5, lsl 3]
>> >         fcmgt   p2.d, p1/z, z1.d, #0.0
>> >         fcmgt   p0.d, p3/z, z1.d, #0.0
>> >         ld1d    z2.d, p2/z, [x2, x5, lsl 3]
>> >         bic     p0.b, p3/z, p1.b, p0.b
>> >         ld1d    z0.d, p0/z, [x3, x5, lsl 3]
>> >         fsub    z0.d, p0/m, z0.d, z1.d
>> >         movprfx z0.d, p2/m, z1.d
>> >         fadd    z0.d, p2/m, z0.d, z2.d
>> >         st1d    z0.d, p1, [x0, x5, lsl 3]
>> >         incd    x5
>> >         whilelo p1.d, w5, w4
>> >         b.any   .L3
>> > .L1:
>> >         ret
>> >
>> > Notice that the condition for the else branch duplicates the same
>> > predicate as the then branch and then uses BIC to negate the result.
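
(Why the BIC is redundant: under the loop mask, negating the already-masked
predicate activates exactly the same lanes as masking the negated comparison,
since per lane loop_mask & ~cmp == loop_mask & ~(loop_mask & cmp).  A minimal
standalone C check of that identity, separate from the patch under
discussion and included only to make the reasoning concrete:)

  /* Standalone check (not part of the patch) of the per-lane identity
     the rewrite discussed below relies on:

         loop_mask & ~cmp  ==  loop_mask & ~(loop_mask & cmp)

     i.e. negating the already-masked predicate selects the same active
     lanes as masking the negated comparison.  */
  #include <assert.h>

  int main (void)
  {
    for (int cmp = 0; cmp <= 1; cmp++)
      for (int mask = 0; mask <= 1; mask++)
        {
          int negate_cmp = mask & ~cmp;             /* mask the negated compare */
          int negate_masked = mask & ~(mask & cmp); /* negate the masked compare */
          assert (negate_cmp == negate_masked);
        }
    return 0;
  }
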
>> > The reason for this is that during instruction generation in the
>> > vectorizer we emit
>> >
>> >   mask__41.11_66 = vect__4.10_64 > vect_cst__65;
>> >   vec_mask_and_69 = mask__41.11_66 & loop_mask_63;
>> >   vec_mask_and_71 = mask__41.11_66 & loop_mask_63;
>> >   mask__43.16_73 = ~mask__41.11_66;
>> >   vec_mask_and_76 = mask__43.16_73 & loop_mask_63;
>> >   vec_mask_and_78 = mask__43.16_73 & loop_mask_63;
>> >
>> > which ultimately gets optimized to
>> >
>> >   mask__41.11_66 = vect__4.10_64 > { 0.0, ... };
>> >   vec_mask_and_69 = loop_mask_63 & mask__41.11_66;
>> >   mask__43.16_73 = ~mask__41.11_66;
>> >   vec_mask_and_76 = loop_mask_63 & mask__43.16_73;
>> >
>> > Notice how the negation is on the operation and not on the predicate
>> > resulting from the operation.  When this is expanded it turns into
>> > RTL where the negation is on the compare directly.  This means the
>> > RTL is different from the one without the negation, and so CSE is
>> > unable to recognize that they are essentially the same operation.
>> >
>> > To fix this, my patch changes it so that you negate the mask rather
>> > than the operation:
>> >
>> >   mask__41.13_55 = vect__4.12_53 > { 0.0, ... };
>> >   vec_mask_and_58 = loop_mask_52 & mask__41.13_55;
>> >   vec_mask_op_67 = ~vec_mask_and_58;
>> >   vec_mask_and_65 = loop_mask_52 & vec_mask_op_67;
>>
>> But to me this looks like a pessimisation in gimple terms.  We've
>> increased the length of the critical path: vec_mask_and_65 now needs a
>> chain of 4 operations instead of 3.
>
> True, but it should reduce the number of RTL patterns.  I would have
> thought RTL is more expensive to handle than gimple.

I think this is only a fair gimple optimisation if gimple does the isel
itself (to a predicated compare and a predicated NOT).

>> We also need to be careful not to pessimise the case in which the
>> comparison is an integer one.  At the moment we'll generate opposed
>> conditions, which is the intended behaviour:
>
> This still happens with this patch at `-Ofast` because that flips the
> conditions, so the different representation doesn't harm it.

OK, that's good.

>> .L3:
>>         ld1d    z1.d, p0/z, [x1, x5, lsl 3]
>>         cmpgt   p2.d, p0/z, z1.d, #0
>>         movprfx z2, z1
>>         scvtf   z2.d, p3/m, z1.d
>>         cmple   p1.d, p0/z, z1.d, #0
>>         ld1d    z0.d, p2/z, [x2, x5, lsl 3]
>>         ld1d    z1.d, p1/z, [x3, x5, lsl 3]
>>         fadd    z0.d, p2/m, z0.d, z2.d
>>         movprfx z0.d, p1/m, z1.d
>>         fsub    z0.d, p1/m, z0.d, z2.d
>>         st1d    z0.d, p0, [x0, x5, lsl 3]
>>         add     x5, x5, x6
>>         whilelo p0.d, w5, w4
>>         b.any   .L3
>>
>> Could we handle the fcmp case using a 3->2 define_split instead:
>> convert
>>
>>   (set res (and (not (fcmp X Y)) Z)) ->
>>   (set res (fcmp X Y))
>>   (set res (and (not res) Z))
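
(To make the suggested split concrete, a schematic of the shape it could
take in machine-description form, using the same shorthand as the message
above.  This is a sketch only: the real aarch64-sve.md comparison patterns
wrap the compare in a governing-predicate unspec, and the operand
predicates and modes here are placeholders, so the actual pattern would
look different:)

  ;; Schematic sketch only: `fcmp' stands in for the real predicated SVE
  ;; comparison rtl, and the operand predicates/modes are placeholders.
  (define_split
    [(set (match_operand 0 "register_operand")
          (and (not (fcmp (match_operand 1 "register_operand")
                          (match_operand 2 "register_operand")))
               (match_operand 3 "register_operand")))]
    ""
    [;; First recompute the comparison into the destination...
     (set (match_dup 0) (fcmp (match_dup 1) (match_dup 2)))
     ;; ...then invert it under the mask, so the negation becomes a cheap
     ;; predicate operation on a register (a BIC) rather than a second
     ;; compare.
     (set (match_dup 0) (and (not (match_dup 0)) (match_dup 3)))])
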
> This was the other approach I mentioned.  It works, and gives you the
> negation, but only in the case where the compare is single use.

But in the original example we duplicate the comparison through a 2->2
combine, which leaves the original comparison as a single use.  Isn't
that enough?

> e.g. in
>
> void f11(double * restrict z, double * restrict w, double * restrict x,
>          double * restrict y, int n)
> {
>     for (int i = 0; i < n; i++) {
>         z[i] = (w[i] > 0) ? w[i] : y[i] - w[i];
>     }
> }
>
> you have some of the same problem.  It generates
>
>         ld1d    z0.d, p0/z, [x1, x2, lsl 3]
>         fcmgt   p2.d, p3/z, z0.d, #0.0
>         bic     p1.b, p3/z, p0.b, p2.b
>         ld1d    z1.d, p1/z, [x3, x2, lsl 3]
>         fsub    z1.d, p1/m, z1.d, z0.d
>         sel     z0.d, p2, z0.d, z1.d
>         st1d    z0.d, p0, [x0, x2, lsl 3]
>         incd    x2
>         whilelo p0.d, w2, w4
>
> which has two problems.  First, the fcmgt doesn't need to be predicated
> on p3 (which is ptrue all); it can and should be predicated on p0.
>
> Second, with that fixed, the splitter won't match, because p2 is still
> needed by the sel: the compare is not single use, so combine won't try
> to build the RTL that the split would act on.

I think having the vectoriser avoid the dual use between the
IFN_MASK_LOAD/STORE and the VEC_COND_EXPR is fair game, since that is
the only pass that has the context to prove that including the loop
mask in the VEC_COND_EXPR condition is correct.

We already try to do that to some extent:

  /* See whether another part of the vectorized code applies a loop
     mask to the condition, or to its inverse.  */

but it would need extending to handle this case.

Thanks,
Richard
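
(To make that suggested extension concrete, a rough standalone C sketch of
the bookkeeping it would need.  Everything here, the types, names and
comparison codes, is invented for illustration and is not GCC's actual
implementation:)

  /* Invented illustration, not GCC code.  The idea described above: the
     vectoriser records each scalar condition for which it has already
     built a loop-masked predicate.  When a later statement (e.g. a
     VEC_COND_EXPR) needs the same condition or its inverse, it reuses
     the recorded masked predicate (negating it under the loop mask in
     the inverse case) instead of emitting a fresh unmasked comparison,
     which is what leaves the duplicate fcmgt/BIC above for CSE.  */

  #include <stddef.h>

  enum cmp_code { CMP_GT, CMP_LE };     /* just enough codes for f11 */

  struct masked_cond
  {
    enum cmp_code code;                 /* scalar comparison code */
    const void *op0, *op1;              /* scalar operands, compared by identity */
    void *masked_pred;                  /* the `loop_mask & cmp' predicate */
  };

  static enum cmp_code
  invert_code (enum cmp_code code)
  {
    /* Placeholder inversion; the real thing must honour NaNs.  */
    return code == CMP_GT ? CMP_LE : CMP_GT;
  }

  /* Return an existing masked predicate for (CODE, OP0, OP1), setting
     *INVERTED when the match was against the inverse condition, in
     which case the caller negates the result under the loop mask.
     Returns NULL if neither form has been vectorized yet.  */
  static void *
  find_masked_cond (const struct masked_cond *set, size_t n,
                    enum cmp_code code, const void *op0, const void *op1,
                    int *inverted)
  {
    for (size_t i = 0; i < n; i++)
      if (set[i].op0 == op0 && set[i].op1 == op1)
        {
          if (set[i].code == code)
            { *inverted = 0; return set[i].masked_pred; }
          if (set[i].code == invert_code (code))
            { *inverted = 1; return set[i].masked_pred; }
        }
    return NULL;
  }
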