Date: Wed, 16 Jun 2021 13:13:55 +0200 (CEST)
From: Richard Biener
To: "Andre Vieira (lists)"
Cc: Richard Sandiford, "gcc-patches@gcc.gnu.org"
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:

>
> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed.  For example a simple
> >>
> >>   int a[1024], b[1024], c[1024];
> >>
> >>   void foo(int n)
> >>   {
> >>     for (int i = 0; i < n; ++i)
> >>       a[i+1] += c[i+i] ? b[i+1] : 0;
> >>   }
> >>
> >> should usually see peeling for alignment (though on x86 you need
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs).  For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, an AVX2 vector loop,
> >> an SSE2 vectorized epilogue and a scalar epilogue.  It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily used
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> > Specifically we see
> >
> >   [local count: 94607391]:
> >   niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> >   _67 = niters_vector_mult_vf.10_62 + 7;
> >   _64 = (int) niters_vector_mult_vf.10_62;
> >   tmp.11_63 = i_43 + _64;
> >   if (niters.8_45 == niters_vector_mult_vf.10_62)
> >     goto ; [12.50%]
> >   else
> >     goto ; [87.50%]
> >
> > after the main vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV.  And then the vectorized
> > epilogue header check doing
> >
> >   [local count: 93293400]:
> >   # i_59 = PHI
> >   # _66 = PHI <_67(33), 0(18)>
> >   _96 = (unsigned int) n_10(D);
> >   niters.26_95 = _96 - _66;
> >   _108 = (unsigned int) n_10(D);
> >   _109 = _108 - _66;
> >   _110 = _109 + 4294967295;
> >   if (_110 <= 3)
> >     goto ; [10.00%]
> >   else
> >     goto ; [90.00%]
> >
> > re-computing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue using
> > unaligned accesses - good!).
> >
> > That is, I'd expect _much_ easier jobs if we'd manage to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops across all of the involved loops.
> >
> > Richard.
>
> So I am now looking at using an IV that counts scalar iterations rather
> than vector iterations, and reusing that through all loops (prologue,
> main loop, vect_epilogue and scalar epilogue).  The first part is easy,
> since that's what we already do for partial vectors or non-constant VFs.
> The latter requires some plumbing, and removing a lot of the code in
> there that creates new IVs going from [0, niters - previous iterations].
> I don't yet have a clear-cut view of how to do this; I first thought of
> keeping track of the 'control' IV in the loop_vinfo, but the prologue
> and scalar epilogues won't have one.
> 'loop' keeps a control_ivs struct, but that is used for overflow
> detection and only keeps track of what looks like a constant 'base'
> and 'step'.  Not quite sure how all that works, but intuitively it
> doesn't seem like the right thing to reuse.

Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

etc.  This should make it possible to do some first-order cleanups,
avoiding some repeated computations.  It does involve placing additional
PHIs for this remain_scalar_iter var of course (I'd be hesitant to rely
on the SSA renamer for this due to its expense).  I think that for all
later jump-around tests, tracking remaining scalar iters is more
convenient than tracking performed scalar iters.

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...

Richard.
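To make the suggestion above concrete, here is a hedged C sketch - not GCC code, and the names (remain_scalar_iter, MAIN_VF, EPIL_VF, prologue_peel) are illustrative - of how one shared remaining-scalar-iterations counter could thread through the peeling prologue, the main vector loop, the vectorized epilogue and the scalar epilogue. Each stage simply decrements the counter (the main loop effectively performing remain_scalar_iter -= vector_iters * vf, as in the mail), so each later guard is a plain compare of the counter instead of a recomputation of niters from n:

```c
#include <assert.h>

/* Illustrative vectorization factors; in GCC these would come from the
   target's vector modes (e.g. 8 ints per AVX2 vector, 4 per SSE2).  */
#define MAIN_VF 8
#define EPIL_VF 4

/* Sketch of a[i] += b[i] lowered the way the vectorizer splits it,
   but with one shared remaining-iteration counter across all loops.  */
void foo(int *a, const int *b, int n, int prologue_peel)
{
    int remain_scalar_iter = n;   /* scalar iterations still to do */
    int i = 0;                    /* performed scalar iterations   */

    /* Alignment peeling prologue: scalar, bounded by the peel count.  */
    for (int p = 0; p < prologue_peel && remain_scalar_iter > 0; ++p) {
        a[i] += b[i];
        ++i;
        --remain_scalar_iter;
    }

    /* Main vector loop: each "vector" iteration covers MAIN_VF scalar
       iterations, so the counter steps down by MAIN_VF per iteration.  */
    while (remain_scalar_iter >= MAIN_VF) {
        for (int l = 0; l < MAIN_VF; ++l)   /* stands in for a vector op */
            a[i + l] += b[i + l];
        i += MAIN_VF;
        remain_scalar_iter -= MAIN_VF;
    }

    /* Vectorized epilogue at a smaller VF: its guard is just the
       compare of the shared counter above, nothing recomputed.  */
    while (remain_scalar_iter >= EPIL_VF) {
        for (int l = 0; l < EPIL_VF; ++l)
            a[i + l] += b[i + l];
        i += EPIL_VF;
        remain_scalar_iter -= EPIL_VF;
    }

    /* Scalar epilogue for the tail.  */
    while (remain_scalar_iter > 0) {
        a[i] += b[i];
        ++i;
        --remain_scalar_iter;
    }

    assert(i == n && remain_scalar_iter == 0);
}
```

Note that i (performed iterations) falls out as a by-product here; as said above, for the jump-around guards only the remaining count is actually needed.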