From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by sourceware.org (Postfix) with ESMTPS id 310763858C2D for ; Tue, 11 Oct 2022 07:01:08 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 310763858C2D Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out2.suse.de (Postfix) with ESMTP id 08FF21F958; Tue, 11 Oct 2022 07:01:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1665471667; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lvvcokxR7og0h/02oAGlCrf8nQHdcBpZL3lcpxeEdT4=; b=UQLjFIDCKpqyfAH1ZruFLd+0qmrt8oopOCCuHXSMf27FF6e+lP+PsGs9y5+fF+fX5BpuKE 83CFL3U/YXsIPOtdp8GHFenjANGAZaevrD3Jlam7mtmpMMhKHugbGuEsO3ijAmHap2dyyS DGVpReoPM8eAAbPZdtsX2cDZ7KNaCsg= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1665471667; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=lvvcokxR7og0h/02oAGlCrf8nQHdcBpZL3lcpxeEdT4=; b=ieA3dL+uPrmM+Lu5snQ/6QcnC6apxNWvCJWv4duP7VZeHoGgWn2zgadKgUrq/ZJlogeOkC 4riZOpmOr3pkCvCg== Received: from wotan.suse.de (wotan.suse.de [10.160.0.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id BD3062C142; Tue, 11 Oct 2022 07:01:06 +0000 (UTC) Date: Tue, 11 Oct 2022 07:01:06 +0000 (UTC) From: Richard Biener To: Andrew Stubbs cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, juzhe.zhong@rivai.ai Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences In-Reply-To: <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com> Message-ID: References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de> <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com> User-Agent: Alpine 2.22 (LSU 394 2020-01-19) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-5.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Mon, 10 Oct 2022, Andrew Stubbs wrote: > On 10/10/2022 12:03, Richard Biener wrote: > > The following picks up the prototype by Ju-Zhe Zhong for vectorizing > > first order recurrences. That solves two TSVC missed optimization PRs. > > > > There's a new scalar cycle def kind, vect_first_order_recurrence > > and it's handling of the backedge value vectorization is complicated > > by the fact that the vectorized value isn't the PHI but instead > > a (series of) permute(s) shifting in the recurring value from the > > previous iteration. I've implemented this by creating both the > > single vectorized PHI and the series of permutes when vectorizing > > the scalar PHI but leave the backedge values in both unassigned. > > The backedge values are (for the testcases) computed by a load > > which is also the place after which the permutes are inserted. > > That placement also restricts the cases we can handle (without > > resorting to code motion). > > > > I added both costing and SLP handling though SLP handling is > > restricted to the case where a single vectorized PHI is enough. > > > > Missing is epilogue handling - while prologue peeling would > > be handled transparently by adjusting iv_phi_p the epilogue > > case doesn't work with just inserting a scalar LC PHI since > > that a) keeps the scalar load live and b) that loads is the > > wrong one, it has to be the last, much like when we'd vectorize > > the LC PHI as live operation. Unfortunately LIVE > > compute/analysis happens too early before we decide on > > peeling. When using fully masked loop vectorization the > > vect-recurr-6.c works as expected though. > > > > I have tested this on x86_64 for now, but since epilogue > > handling is missing there's probably no practical cases. > > My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c > > just fine but I didn't feel like running SPEC within SDE nor > > is the WHILE_ULT patch complete enough. Builds of SPEC 2k7 > > with fully masked loops succeed (minus three cases of > > PR107096, caused by my WHILE_ULT prototype). > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > Testing with SVE, GCN or RVV appreciated, ideas how to cleanly > > handle epilogues welcome. > > The testcases all produce correct code on GCN and pass the execution tests. > > The code isn't terribly optimal because we don't have a two-input permutation > instruction, so we permute each half separately and vec_merge the results. In > this case the first vector is always a no-op permutation so that's wasted > cycles. We'd really want a vector rotate and write-lane (or the other way > around). I think the special-case permutations can be recognised and coded > into the backend, but I don't know if we can easily tell that the first vector > is just a bunch of duplicates, when it's not constant. It's not actually a bunch of duplicates in all but the first iteration. But what you can recognize is that we're only using lane N - 1 of the first vector, so you could model the permute as extract last + shift in scalar (the extracted lane). IIRC VLA vector targets usually have something like shift the vector and set the low lane from a scalar? The extract lane N - 1 might be more difficult but then a rotate plus extracting lane 0 might work as well. So yes, I think special-casing this constant permutation makes most sense. Richard.