From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 09C033858C60 for ; Tue, 2 Nov 2021 17:06:21 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 09C033858C60 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 8AEBF11B3; Tue, 2 Nov 2021 10:06:20 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.88]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CE09D3F7B4; Tue, 2 Nov 2021 10:06:19 -0700 (PDT) From: Richard Sandiford To: Richard Biener via Gcc-patches Mail-Followup-To: Richard Biener via Gcc-patches , Tamar Christina , Richard Biener , nd , richard.sandiford@arm.com Cc: Tamar Christina , Richard Biener , nd Subject: Re: [PATCH]middle-end Add an RPO pass after successful vectorization References: <9nnp8so9-p3nq-r26-3098-s96334191030@fhfr.qr> Date: Tue, 02 Nov 2021 17:06:18 +0000 In-Reply-To: (Richard Biener via Gcc-patches's message of "Tue, 2 Nov 2021 16:29:00 +0100 (CET)") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-12.4 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 02 Nov 2021 17:06:23 -0000 Richard Biener via Gcc-patches writes: > On Tue, 2 Nov 2021, Tamar Christina wrote: > >> > -----Original Message----- >> > From: Richard Biener >> > Sent: Tuesday, November 2, 2021 2:24 PM >> > To: Tamar Christina >> > Cc: gcc-patches@gcc.gnu.org; nd >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful >> > vectorization >> > >> > On Tue, 2 Nov 2021, Tamar Christina wrote: >> > >> > > Hi All, >> > > >> > > Following my current SVE predicate optimization series a problem has >> > > presented itself in that the way vector masks are generated for masked >> > > operations relies on CSE to share masks efficiently. >> > > >> > > The issue however is that masking is done using the & operand and & is >> > > associative and so reassoc decides to reassociate the masked operations. >> > >> > But it does this for the purpose of canonicalization and thus CSE. >> >> Yes, but it turns something like >> >> (a & b) & mask into a & (b & mask). >> >> When (a & b) is used somewhere else you now lose the CSE. So it's actually hurting >> In this case. > > OK, so that's a known "issue" with reassoc, it doesn't consider global > CSE opportunities and I guess it pushes 'mask' to leaf if it is loop > carried. > >> > >> > > This makes CSE then unable to CSE an unmasked and a masked operation >> > > leading to duplicate operations being performed. >> > > >> > > To counter this we want to add an RPO pass over the vectorized loop >> > > body when vectorization succeeds. This makes it then no longer >> > > reliant on the RTL level CSE. >> > > >> > > I have not added a testcase for this as it requires the changes in my >> > > patch series, however the entire series relies on this patch to work >> > > so all the tests there cover it. >> > > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and >> > > no issues. >> > > >> > > Ok for master? >> > >> > You are running VN over _all_ loop bodies rather only those vectorized. >> > We loop over vectorized loops earlier for optimizing masked store sequences. >> > I suppose you could hook in there. I'll also notice that we have >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE. >> > So I don't understand why it doesn't work to CSE later. >> > >> >> Atm, say you have the conditions a > b, and a > b & a > c >> >> We generate >> >> mask1 = (a > b) & loop_mask >> mask2 = (a > b & a > c) & loop_mask >> >> with the intention that mask1 can be re-used in mask2. >> >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask) >> >> Which has now unmasked (a > b) in mask2, which leaves us unable to combine >> the mask1 and mask2. It doesn't generate incorrect code, just inefficient. >> >> > for (i = 1; i < number_of_loops (cfun); i++) >> > { >> > loop_vec_info loop_vinfo; >> > bool has_mask_store; >> > >> > loop = get_loop (cfun, i); >> > if (!loop || !loop->aux) >> > continue; >> > loop_vinfo = (loop_vec_info) loop->aux; >> > has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo); >> > delete loop_vinfo; >> > if (has_mask_store >> > && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE)) >> > optimize_mask_stores (loop); >> > loop->aux = NULL; >> > } >> > >> >> Ah thanks, I'll make the changes. > > Note I think that full-blown CSE is a bit overkill just to counter > a deficient reassoc (or VN). At least it is supposed to be "cheap" > and can be conditionalized on loop masks being used as well. Not sure we should make this conditional on loop masks being used. It seems either that: (a) the vectoriser is supposed to avoid creating code that has folding or VN opportunities, in which case we need to generate the vectorised code in a smarter way or (b) the vectoriser is allowed to create code that has folding or VN opportunities, in which case it would be good to have a defined place to get rid of them. I'm just worried that if we make it conditional on loop masks, we could see cases that in which non-loop-mask stuff is optimised differently based on whether the loop has masks or not. E.g. we might get worse code with an unpredicated main loop and a predicated epilogue compared to a predicated main loop. Thanks, Richard > >> Thanks, >> Tamar >> >> > >> > > Thanks, >> > > Tamar >> > > >> > > gcc/ChangeLog: >> > > >> > > * tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN >> > upon >> > > successful vectorization. >> > > >> > > --- inline copy of patch -- >> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index >> > > >> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c >> > 660 >> > > af4b6d32dc51 100644 >> > > --- a/gcc/tree-vectorizer.c >> > > +++ b/gcc/tree-vectorizer.c >> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3. If not see >> > > #include "gimple-pretty-print.h" >> > > #include "opt-problem.h" >> > > #include "internal-fn.h" >> > > - >> > > +#include "tree-ssa-sccvn.h" >> > > >> > > /* Loop or bb location, with hotness information. */ >> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@ >> > > vectorize_loops (void) >> > > ??? Also while we try hard to update loop-closed SSA form we fail >> > > to properly do this in some corner-cases (see PR56286). */ >> > > rewrite_into_loop_closed_ssa (NULL, >> > > TODO_update_ssa_only_virtuals); >> > > + >> > > + for (i = 1; i < number_of_loops (cfun); i++) >> > > + { >> > > + loop = get_loop (cfun, i); >> > > + if (!loop || !single_exit (loop)) >> > > + continue; >> > > + >> > > + bitmap exit_bbs; >> > > + /* Perform local CSE, this esp. helps because we emit code for >> > > + predicates that need to be shared for optimal predicate usage. >> > > + However reassoc will re-order them and prevent CSE from working >> > > + as it should. CSE only the loop body, not the entry. */ >> > > + exit_bbs = BITMAP_ALLOC (NULL); >> > > + bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index); >> > > + bitmap_set_bit (exit_bbs, loop->latch->index); >> > > + >> > > + do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs); >> > > + >> > > + BITMAP_FREE (exit_bbs); >> > > + } >> > > + >> > > return TODO_cleanup_cfg; >> > > } >> > > >> > > >> > > >> > > >> > >> > -- >> > Richard Biener >> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 >> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg) >>