From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id 09C033858C60
 for <gcc-patches@gcc.gnu.org>; Tue,  2 Nov 2021 17:06:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 09C033858C60
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 8AEBF11B3;
 Tue,  2 Nov 2021 10:06:20 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.88])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CE09D3F7B4;
 Tue,  2 Nov 2021 10:06:19 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
Mail-Followup-To: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>,
 Tamar Christina <Tamar.Christina@arm.com>, Richard Biener <rguenther@suse.de>,
 nd <nd@arm.com>, richard.sandiford@arm.com
Cc: Tamar Christina <Tamar.Christina@arm.com>,
 Richard Biener <rguenther@suse.de>, nd <nd@arm.com>
Subject: Re: [PATCH]middle-end Add an RPO pass after successful vectorization
References: <patch-15007-tamar@arm.com>
 <9nnp8so9-p3nq-r26-3098-s96334191030@fhfr.qr>
 <VI1PR08MB5325EFFBA85F9148234D5C41FF8B9@VI1PR08MB5325.eurprd08.prod.outlook.com>
 <q85113-np7p-r70-8775-163s12532846@fhfr.qr>
Date: Tue, 02 Nov 2021 17:06:18 +0000
In-Reply-To: <q85113-np7p-r70-8775-163s12532846@fhfr.qr> (Richard Biener via
 Gcc-patches's message of "Tue, 2 Nov 2021 16:29:00 +0100 (CET)")
Message-ID: <mptr1bybi7p.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-12.4 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 02 Nov 2021 17:06:23 -0000

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Tue, 2 Nov 2021, Tamar Christina wrote:
>
>> > -----Original Message-----
>> > From: Richard Biener <rguenther@suse.de>
>> > Sent: Tuesday, November 2, 2021 2:24 PM
>> > To: Tamar Christina <Tamar.Christina@arm.com>
>> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
>> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
>> > vectorization
>> > 
>> > On Tue, 2 Nov 2021, Tamar Christina wrote:
>> > 
>> > > Hi All,
>> > >
>> > > Following my current SVE predicate optimization series a problem has
>> > > presented itself in that the way vector masks are generated for masked
>> > > operations relies on CSE to share masks efficiently.
>> > >
>> > > The issue however is that masking is done using the & operand and & is
>> > > associative and so reassoc decides to reassociate the masked operations.
>> > 
>> > But it does this for the purpose of canonicalization and thus CSE.
>> 
>> Yes, but it turns something like
>> 
>> (a & b) & mask into a & (b & mask).
>> 
>> When (a & b) is used somewhere else you now lose the CSE.  So it's actually hurting
>> In this case.
>
> OK, so that's a known "issue" with reassoc, it doesn't consider global
> CSE opportunities and I guess it pushes 'mask' to leaf if it is loop
> carried.
>
>> > 
>> > > This makes CSE then unable to CSE an unmasked and a masked operation
>> > > leading to duplicate operations being performed.
>> > >
>> > > To counter this we want to add an RPO pass over the vectorized loop
>> > > body when vectorization succeeds.  This makes it then no longer
>> > > reliant on the RTL level CSE.
>> > >
>> > > I have not added a testcase for this as it requires the changes in my
>> > > patch series, however the entire series relies on this patch to work
>> > > so all the tests there cover it.
>> > >
>> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
>> > > no issues.
>> > >
>> > > Ok for master?
>> > 
>> > You are running VN over _all_ loop bodies rather only those vectorized.
>> > We loop over vectorized loops earlier for optimizing masked store sequences.
>> > I suppose you could hook in there.  I'll also notice that we have
>> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
>> > So I don't understand why it doesn't work to CSE later.
>> > 
>> 
>> Atm, say you have the conditions a > b, and a > b & a > c
>> 
>> We generate
>> 
>> mask1 = (a > b) & loop_mask
>> mask2 = (a > b & a > c) & loop_mask
>> 
>> with the intention that mask1 can be re-used in mask2.
>> 
>> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
>> 
>> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
>> the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.
>> 
>> >   for (i = 1; i < number_of_loops (cfun); i++)
>> >     {
>> >       loop_vec_info loop_vinfo;
>> >       bool has_mask_store;
>> > 
>> >       loop = get_loop (cfun, i);
>> >       if (!loop || !loop->aux)
>> >         continue;
>> >       loop_vinfo = (loop_vec_info) loop->aux;
>> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
>> >       delete loop_vinfo;
>> >       if (has_mask_store
>> >           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>> >         optimize_mask_stores (loop);
>> >       loop->aux = NULL;
>> >     }
>> > 
>> 
>> Ah thanks, I'll make the changes.
>
> Note I think that full-blown CSE is a bit overkill just to counter
> a deficient reassoc (or VN).  At least it is supposed to be "cheap"
> and can be conditionalized on loop masks being used as well.

Not sure we should make this conditional on loop masks being used.
It seems either that:

(a) the vectoriser is supposed to avoid creating code that has folding
    or VN opportunities, in which case we need to generate the vectorised
    code in a smarter way or

(b) the vectoriser is allowed to create code that has folding or VN
    opportunities, in which case it would be good to have a defined
    place to get rid of them.

I'm just worried that if we make it conditional on loop masks,
we could see cases that in which non-loop-mask stuff is optimised
differently based on whether the loop has masks or not.  E.g.
we might get worse code with an unpredicated main loop and
a predicated epilogue compared to a predicated main loop.

Thanks,
Richard
>
>> Thanks,
>> Tamar
>> 
>> > 
>> > > Thanks,
>> > > Tamar
>> > >
>> > > gcc/ChangeLog:
>> > >
>> > > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN
>> > upon
>> > > 	successful vectorization.
>> > >
>> > > --- inline copy of patch --
>> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
>> > >
>> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c
>> > 660
>> > > af4b6d32dc51 100644
>> > > --- a/gcc/tree-vectorizer.c
>> > > +++ b/gcc/tree-vectorizer.c
>> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>> > > #include "gimple-pretty-print.h"
>> > >  #include "opt-problem.h"
>> > >  #include "internal-fn.h"
>> > > -
>> > > +#include "tree-ssa-sccvn.h"
>> > >
>> > >  /* Loop or bb location, with hotness information.  */
>> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@
>> > > vectorize_loops (void)
>> > >  	 ???  Also while we try hard to update loop-closed SSA form we fail
>> > >  	 to properly do this in some corner-cases (see PR56286).  */
>> > >        rewrite_into_loop_closed_ssa (NULL,
>> > > TODO_update_ssa_only_virtuals);
>> > > +
>> > > +      for (i = 1; i < number_of_loops (cfun); i++)
>> > > +	{
>> > > +	  loop = get_loop (cfun, i);
>> > > +	  if (!loop || !single_exit (loop))
>> > > +	    continue;
>> > > +
>> > > +	  bitmap exit_bbs;
>> > > +	  /* Perform local CSE, this esp. helps because we emit code for
>> > > +	     predicates that need to be shared for optimal predicate usage.
>> > > +	     However reassoc will re-order them and prevent CSE from working
>> > > +	     as it should.  CSE only the loop body, not the entry.  */
>> > > +	  exit_bbs = BITMAP_ALLOC (NULL);
>> > > +	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
>> > > +	  bitmap_set_bit (exit_bbs, loop->latch->index);
>> > > +
>> > > +	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
>> > > +
>> > > +	  BITMAP_FREE (exit_bbs);
>> > > +	}
>> > > +
>> > >        return TODO_cleanup_cfg;
>> > >      }
>> > >
>> > >
>> > >
>> > >
>> > 
>> > --
>> > Richard Biener <rguenther@suse.de>
>> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
>>