From: Richard Biener
Date: Wed, 14 Nov 2018 14:59:00 -0000
Subject: Re: [RFC][PATCH]Merge VEC_COND_EXPR into MASK_STORE after loop vectorization
To: renlin.li@foss.arm.com
Cc: GCC Patches, Richard Sandiford, Ramana Radhakrishnan, James Greenhalgh

On Fri, Nov 9, 2018 at 4:49 PM Renlin Li wrote:
>
> Hi Richard,
>
> On 11/09/2018 11:48 AM, Richard Biener wrote:
> > On Thu, Nov 8, 2018 at 5:55 PM Renlin Li wrote:
> >>
> >> Hi Richard,
> >>
> >> *However*, after I rebased my patch on the latest trunk, I got the
> >> following dump from ifcvt:
> >>
> >>   [local count: 1006632961]:
> >>   # i_20 = PHI
> >>   # ivtmp_18 = PHI
> >>   a_10 = array[i_20];
> >>   _1 = a_10 & 1;
> >>   _2 = a_10 + 1;
> >>   _ifc__34 = _1 != 0 ? _2 : a_10;
> >>   array[i_20] = _ifc__34;
> >>   _4 = a_10 + 2;
> >>   _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
> >>   array[i_20] = _ifc__37;
> >>   i_13 = i_20 + 1;
> >>   ivtmp_5 = ivtmp_18 - 1;
> >>   if (ivtmp_5 != 0)
> >>     goto ; [93.33%]
> >>   else
> >>     goto ; [6.67%]
> >>
> >> The redundant load is not generated, but you can still see the
> >> unconditional store.
> >
> > Yes, I fixed the redundant loads recently and indeed dead stores
> > remain (for the particular testcase they would be easy to remove).
>
> Right.
>
> >> After loop vectorization, the following is generated (without my change):
> >
> > Huh.  But that's not because of if-conversion but because SVE needs to
> > mask _all_ loop operations that are not safe to execute with the loop_mask!
> >
> >> vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
> >> a_10 = array[i_20];
> >> vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
> >> _1 = a_10 & 1;
> >> vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
> >> _2 = a_10 + 1;
> >> vect__ifc__34.9_43 = VEC_COND_EXPR ;
> >> _ifc__34 = _1 != 0 ? _2 : a_10;
> >> .MASK_STORE (vectp_array.10_45, 4B, loop_mask_7, vect__ifc__34.9_43);
> >> vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
> >> _4 = a_10 + 2;
> >> vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
> >> _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
> >> .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);
> >>
> >> With the old ifcvt code, my change here could improve it a little bit
> >> and eliminate some redundant loads.  With the new code, it could not
> >> improve it further.  I'll adjust the patch based on the latest trunk.
> >
> > So what does the patch change the above to?  The code has little to no
> > comments apart from a small picture with code _before_ the transform...
>
> It is like this:
>
> vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
> a_10 = array[i_20];
> vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
> _1 = a_10 & 1;
> vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
> _2 = a_10 + 1;
> _60 = vect__1.7_39 != vect_cst__42;
> vect__ifc__34.9_43 = VEC_COND_EXPR <_60, vect__2.8_41, vect_a_10.6_6>;
> _ifc__34 = _1 != 0 ? _2 : a_10;
> vec_mask_and_61 = _60 & loop_mask_7;
> .MASK_STORE (vectp_array.10_45, 4B, vec_mask_and_61, vect__2.8_41);
> vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
> _4 = a_10 + 2;
> vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
> _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
> .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);

Ah, OK, now I see what you do.

> As the loaded value is used later, it could not be removed.
>
> With the change, ideally, less data is stored.
> However, it might generate more instructions:
>
> 1, the load is not eliminable.  Apparently, your change eliminates most
>    of the redundant loads; the rest are necessary or not easy to remove.
> 2, an additional AND instruction.
>
> With a simpler test case like this:
>
> static int array[100];
> int test (int a, int i)
> {
>   for (unsigned i = 0; i < 16; i++)
>     {
>       if (a & 1)
>         array[i] = a + 1;
>     }
>   return array[i];
> }
>
> The new code-gen will be:
>
> vect__2.4_29 = vect_cst__27 + vect_cst__28;
> _44 = vect_cst__34 != vect_cst__35;
> vec_mask_and_45 = _44 & loop_mask_32;
> .MASK_STORE (vectp_array.9_37, 4B, vec_mask_and_45, vect__2.4_29);
>
> While the old one is:
>
> vect__2.4_29 = vect_cst__27 + vect_cst__28;
> vect__ifc__24.7_33 = .MASK_LOAD (vectp_array.5_30, 4B, loop_mask_32);
> vect__ifc__26.8_36 = VEC_COND_EXPR ;
> .MASK_STORE (vectp_array.9_37, 4B, loop_mask_32, vect__ifc__26.8_36);

I don't see the masked load here on x86_64, btw. (I don't see if-conversion
generating a load.)  I guess that's again the case where store data races
are allowed: if-conversion uses a RMW cycle and vectorization generates the
masked variants for the loop mask.  Which means for SVE if-conversion should
prefer the masked-store variant even when store data races are allowed?

> >
> > I was wondering whether we can implement
> >
> >   l = [masked]load;
> >   tem = cond ? x : l;
> >   masked-store = tem;
> >
> > pattern matching in a regular pass - forwprop for example.  Note the
> > load doesn't need to be masked, correct?  In fact if it is masked you
> > need to make sure the conditional never accesses parts that are masked
> > in the load, no?  Or require the mask to be the same as that used by
> > the store.  But then you still cannot simply replace the store mask
> > with a new mask generated from the conditional?
>
> Yes, this would require the masks for the load and the store to be the same.
> This matches the pattern before loop vectorization.
> The mask here is the loop mask, to ensure we are bounded by the number of
> iterations.
>
> The new mask is (original mask & condition mask) (example shown above).
> In this case, fewer lanes will be stored.
>
> It is possible to do that in forwprop.
> I could try to integrate the change into it if that is the correct place
> to go.
>
> As the pattern is initially generated by the loop vectorizer, I did the
> change right after it, before the pattern gets converted into other forms.
> For example, forwprop will transform the original code into:
>
> vect__2.4_29 = vect_cst__27 + { 1, ... };
> _16 = (void *) ivtmp.13_25;
> _2 = &MEM[base: _16, offset: 0B];
> vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
> _28 = vect_cst__34 != { 0, ... };
> _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
> vect__ifc__26.8_36 = _35;
> .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
> ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
> next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
> ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];
>
> This makes the pattern matching not straightforward.

Ah, that's because of the .COND_ADDs (I wonder about the copy that's left -
forwprop should eliminate copies).  Note the pattern matching could go in the

  /* Apply forward propagation to all stmts in the basic-block.
     Note we update GSI within the loop as necessary.  */

loop which comes before the match.pd pattern matching, so you'd still see
the form without the .COND_ADD, I think.

There _is_ precedent for some masked-store post-processing in the
vectorizer (optimize_mask_stores), not that I like that very much either.
Eventually those can at least be combined...

That said, I prefer the forwprop place for any pattern matching, and the
current patch needs more comments to understand what it is doing (the DCE
it does is IMHO premature).  You should also modify the masked store
in-place rather than building a new one.

I don't like how you need to use find_data_references_in_stmt - can't you
simply compare the address and size arguments?

It should also work for a non-masked load, I guess, and thus apply to
non-SVE if we manage to feed the masked store with another conditional.

Richard.

> Thanks,
> Renlin
>
> >
> > Richard.
> >
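The equivalence the thread relies on can be modelled in plain C.  The sketch
below is not GCC code; all names in it (rmw_form, combined_mask_form, LANES)
are invented for illustration.  It treats vector lanes as array elements and
checks that the read-modify-write form (masked load, VEC_COND_EXPR, masked
store under the loop mask) leaves memory in the same state as a single
masked store under (condition mask & loop mask), assuming the load and the
store share the same mask and address and no other thread touches the
masked-out lanes:

/* Minimal model of replacing

     l   = .MASK_LOAD (addr, align, loop_mask);
     tem = VEC_COND_EXPR <cond, x, l>;
     .MASK_STORE (addr, align, loop_mask, tem);

   with

     vec_mask_and = cond & loop_mask;
     .MASK_STORE (addr, align, vec_mask_and, x);

   Lanes are modelled as array elements; masks are 0/1 per lane.  */

#include <assert.h>

#define LANES 8

/* Read-modify-write form: masked load, per-lane select, masked store.  */
static void
rmw_form (int *dst, const int *x, const int *cond, const int *loop_mask)
{
  for (int i = 0; i < LANES; i++)
    if (loop_mask[i])
      dst[i] = cond[i] ? x[i] : dst[i];
}

/* Combined-mask form: store only the lanes where cond & loop_mask holds.  */
static void
combined_mask_form (int *dst, const int *x, const int *cond,
		    const int *loop_mask)
{
  for (int i = 0; i < LANES; i++)
    if (cond[i] && loop_mask[i])
      dst[i] = x[i];
}

int
main (void)
{
  int a[LANES] = { 0, 1, 2, 3, 4, 5, 6, 7 };
  int b[LANES] = { 0, 1, 2, 3, 4, 5, 6, 7 };
  int x[LANES] = { 10, 11, 12, 13, 14, 15, 16, 17 };
  int cond[LANES]      = { 1, 0, 1, 0, 1, 0, 1, 0 };
  int loop_mask[LANES] = { 1, 1, 1, 1, 1, 0, 0, 0 };

  rmw_form (a, x, cond, loop_mask);
  combined_mask_form (b, x, cond, loop_mask);

  /* Both forms must leave the stored array identical.  */
  for (int i = 0; i < LANES; i++)
    assert (a[i] == b[i]);

  return 0;
}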