From: "Andre Vieira (lists)"
To: Richard Biener
Cc: Richard Sandiford, "gcc-patches@gcc.gnu.org"
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
Date: Wed, 16 Jun 2021 11:24:24 +0100
Message-ID: <077c4a2f-d745-dd0a-4d6d-d73d6a558529@arm.com>
References: <3a5de6dc-d5ec-7dda-8eb9-85ea6f77984f@arm.com> <6e4541ef-e0a1-1d2d-53f5-4bfed9a65598@arm.com> <4925fee5-dcea-0c14-388c-85c881ffd918@arm.com>

On 14/06/2021 11:57, Richard Biener wrote:
> On Mon, 14 Jun 2021, Richard Biener wrote:
>
>> Indeed.
>> For example a simple
>>
>> int a[1024], b[1024], c[1024];
>>
>> void foo(int n)
>> {
>>   for (int i = 0; i < n; ++i)
>>     a[i+1] += c[i+i] ? b[i+1] : 0;
>> }
>>
>> should usually see peeling for alignment (though on x86 you need
>> exotic -march= since cost models generally have equal aligned and
>> unaligned access costs).  For example with -mavx2 -mtune=atom
>> we'll see an alignment peeling prologue, an AVX2 vector loop,
>> an SSE2 vectorized epilogue and a scalar epilogue.  It also
>> shows the original scalar loop being used in the scalar prologue
>> and epilogue.
>>
>> We're not even trying to make the counting IV easily used
>> across loops (we're not counting scalar iterations in the
>> vector loops).
> Specifically we see
>
> [local count: 94607391]:
> niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> _67 = niters_vector_mult_vf.10_62 + 7;
> _64 = (int) niters_vector_mult_vf.10_62;
> tmp.11_63 = i_43 + _64;
> if (niters.8_45 == niters_vector_mult_vf.10_62)
>   goto ; [12.50%]
> else
>   goto ; [87.50%]
>
> after the main vect loop, recomputing the original IV (i) rather
> than using the inserted canonical IV.  And then the vectorized
> epilogue header check doing
>
> [local count: 93293400]:
> # i_59 = PHI
> # _66 = PHI <_67(33), 0(18)>
> _96 = (unsigned int) n_10(D);
> niters.26_95 = _96 - _66;
> _108 = (unsigned int) n_10(D);
> _109 = _108 - _66;
> _110 = _109 + 4294967295;
> if (_110 <= 3)
>   goto ; [10.00%]
> else
>   goto ; [90.00%]
>
> re-computing everything from scratch again (also notice how
> the main vect loop guard jumps around the alignment prologue
> as well and lands here - and the vectorized epilogue using
> unaligned accesses - good!).
>
> That is, I'd expect _much_ easier jobs if we'd manage to
> track the number of performed scalar iterations (or the
> number of scalar iterations remaining) using the canonical
> IV we add to all loops across all of the involved loops.
>
> Richard.

So I am now looking at using an IV that counts scalar iterations rather
than vector iterations, and at reusing it across all loops (prologue,
main loop, vectorized epilogue and scalar epilogue). The first part is
easy, since that's what we already do for partial vectors or
non-constant VFs. The second requires some plumbing and removing a lot
of the code that creates new IVs going from
[0, niters - previous iterations].

I don't yet have a clear-cut view of how to do this. I first thought of
keeping track of the 'control' IV in the loop_vinfo, but the prologue
and scalar epilogues won't have one. 'loop' keeps a control_ivs struct,
but that is used for overflow detection and only keeps track of what
looks like a constant 'base' and 'step'. I'm not quite sure how all that
works, but intuitively it doesn't seem like the right thing to reuse.

I'll go hack around and keep you posted on progress.

Regards,
Andre
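
For illustration only, and not taken from any patch: a rough C model of
the idea discussed above, where a single counter of performed scalar
iterations is carried through the alignment prologue, the main vector
loop, the vectorized epilogue and the scalar epilogue, so that no stage
recomputes its own [0, niters - previous iterations] range. All names
here (stage_counts, add_arrays, MAIN_VF, EPILOGUE_VF, the peel
parameter) are hypothetical and do not correspond to GCC internals; the
inner lane loops merely stand in for vector instructions.

```c
#include <stddef.h>

/* Hypothetical bookkeeping: how many iterations each stage executed.  */
struct stage_counts {
  size_t prologue, main_vec, epilogue_vec, epilogue_scalar;
};

/* Assumed vectorization factors, e.g. 8 ints for AVX2 and 4 for SSE2.  */
enum { MAIN_VF = 8, EPILOGUE_VF = 4 };

/* Adds b[] into a[] in four stages that all share the counter `i`,
   which always holds the number of scalar iterations done so far.  */
struct stage_counts
add_arrays (int *a, const int *b, size_t n, size_t peel)
{
  struct stage_counts c = { 0, 0, 0, 0 };
  size_t i = 0;  /* the single shared scalar-iteration IV */

  if (peel > n)
    peel = n;

  /* Alignment-peeling prologue: scalar, advances i by 1.  */
  for (; i < peel; i++, c.prologue++)
    a[i] += b[i];

  /* Main vector loop: picks up i where the prologue left it and
     advances by MAIN_VF scalar iterations per vector iteration.  */
  for (; n - i >= MAIN_VF; i += MAIN_VF, c.main_vec++)
    for (size_t lane = 0; lane < MAIN_VF; lane++)
      a[i + lane] += b[i + lane];

  /* Vectorized epilogue: same IV, smaller step; its guard is simply
     "are at least EPILOGUE_VF scalar iterations left?".  */
  for (; n - i >= EPILOGUE_VF; i += EPILOGUE_VF, c.epilogue_vec++)
    for (size_t lane = 0; lane < EPILOGUE_VF; lane++)
      a[i + lane] += b[i + lane];

  /* Scalar epilogue: finishes whatever remains, again from i.  */
  for (; i < n; i++, c.epilogue_scalar++)
    a[i] += b[i];

  return c;
}
```

With n = 33 and peel = 3, this model runs 3 prologue iterations, 3 main
vector iterations, 1 vectorized-epilogue iteration and 2 scalar-epilogue
iterations. Each guard is a comparison against the remaining count
n - i rather than a freshly derived niters value.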