From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by sourceware.org (Postfix) with ESMTPS id 14E233858405 for ; Wed, 27 Apr 2022 10:59:21 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 14E233858405 Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id DD970210E6; Wed, 27 Apr 2022 10:59:19 +0000 (UTC) Received: from murzim.suse.de (murzim.suse.de [10.160.4.192]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id C03542C141; Wed, 27 Apr 2022 10:59:19 +0000 (UTC) Date: Wed, 27 Apr 2022 12:59:19 +0200 (CEST) From: Richard Biener To: Richard Sandiford cc: "Andre Vieira (lists)" , "gcc-patches@gcc.gnu.org" Subject: Re: [PATCH] vect, tree-optimization/105219: Disable epilogue vectorization when peeling for alignment In-Reply-To: Message-ID: <3p2o8n5s-3753-s23o-s8p1-54qr90nr9365@fhfr.qr> References: <8462f41b-895f-9aca-499e-7713ec161673@arm.com> MIME-Version: 1.0 X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_LOTSOFHASH, KAM_SHORT, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Apr 2022 10:59:23 -0000 On Wed, 27 Apr 2022, Richard Biener wrote: > On Tue, 26 Apr 2022, Richard Sandiford wrote: > > > "Andre Vieira (lists)" writes: > > > Hi, > > > > > > This patch disables epilogue vectorization when we are peeling for > > > alignment in the prologue and we can't guarantee the main vectorized > > > loop is entered.  This is to prevent executing vectorized code with an > > > unaligned access if the target has indicated it wants to peel for > > > alignment. We take this conservative approach as we currently do not > > > distinguish between peeling for alignment for correctness or for > > > performance. > > > > > > A better codegen would be to make it skip to the scalar epilogue in case > > > the main loop isn't entered when alignment peeling is required. However, > > > that would require a more aggressive change to the codebase which we > > > chose to avoid at this point of development.  We can revisit this option > > > during stage 1 if we choose to. > > > > > > Bootstrapped on aarch64-none-linux and regression tested on > > > aarch64-none-elf. > > > > > > gcc/ChangeLog: > > > > > >     PR tree-optimization/105219 > > >     * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New > > > function. > > >     (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment > > > to determine > > >     whether to vectorize epilogue. > > >     * testsuite/gcc.target/aarch64/pr105219.c: New. > > >     * testsuite/gcc.target/aarch64/pr105219-2.c: New. > > >     * testsuite/gcc.target/aarch64/pr105219-3.c: New. > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c > > > new file mode 100644 > > > index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c > > > @@ -0,0 +1,29 @@ > > > +/* { dg-do run } */ > > > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */ > > > > I think this should be in gcc.dg/vect, with the options forced > > for { target aarch64 }. > > > > Are the skips necessary? It looks like the test should work correctly > > for all options/targets. > > > > > +/* PR 105219. */ > > > +int data[128]; > > > + > > > +void __attribute((noipa)) > > > +foo (int *data, int n) > > > +{ > > > + for (int i = 0; i < n; ++i) > > > + data[i] = i; > > > +} > > > + > > > +int main() > > > +{ > > > + for (int start = 0; start < 16; ++start) > > > + for (int n = 1; n < 3*16; ++n) > > > + { > > > + __builtin_memset (data, 0, sizeof (data)); > > > + foo (&data[start], n); > > > + for (int j = 0; j < n; ++j) > > > + if (data[start + j] != j) > > > + __builtin_abort (); > > > + } > > > + return 0; > > > +} > > > + > > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c > > > new file mode 100644 > > > index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c > > > @@ -0,0 +1,15 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */ > > > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */ > > > +/* PR 105219. */ > > > +int data[128]; > > > + > > > +void foo (void) > > > +{ > > > + for (int i = 0; i < 9; ++i) > > > + data[i + 1] = i; > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */ > > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c > > > new file mode 100644 > > > index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c > > > @@ -0,0 +1,28 @@ > > > +/* { dg-do run { target aarch64_sve128_hw } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */ > > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */ > > > +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */ > > > > Same here. > > > > > +/* PR 105219. */ > > > +int a; > > > +char b[60]; > > > +short c[18]; > > > +short d[4][19]; > > > +long long f; > > > +void e(int g, int h, short k[][19]) { > > > + for (signed i = 0; i < 3; i += 2) > > > + for (signed j = 1; j < h + 14; j++) { > > > + b[i * 14 + j] = 1; > > > + c[i + j] = k[2][j]; > > > + a = g ? k[i][j] : 0; > > > + } > > > +} > > > +int main() { > > > + e(9, 1, d); > > > + for (long l = 0; l < 6; ++l) > > > + for (long m = 0; m < 4; ++m) > > > + f ^= b[l + m * 4]; > > > + if (f) > > > + __builtin_abort (); > > > +} > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > > index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644 > > > --- a/gcc/tree-vect-loop.cc > > > +++ b/gcc/tree-vect-loop.cc > > > @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared, > > > return opt_loop_vec_info::success (loop_vinfo); > > > } > > > > > > +/* Function vect_epilogue_when_peeling_for_alignment > > > + > > > + PR 105219: If we are peeling for alignment in the prologue then we do not > > > + vectorize the epilogue unless we are certain we will enter the main > > > + vectorized loop. This is to prevent entering the vectorized epilogue in > > > + case there aren't enough iterations to enter the main loop. > > > +*/ > > > + > > > +static bool > > > +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo) > > > +{ > > > + if (vect_use_loop_mask_for_alignment_p (loop_vinfo)) > > > + return true; > > > + > > > + int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo); > > > + if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)) > > > + { > > > + poly_uint64 niters_for_main > > > + = upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo), > > > + LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo)); > > > + niters_for_main > > > + = upper_bound (niters_for_main, > > > + LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo)); > > > + niters_for_main += prologue_peeling; > > > + if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main)) > > > + return false; > > > + } > > > + else if (prologue_peeling < 0) > > > > I was surprised that this tests < 0 rather than != 0. If that's the > > right behaviour, could you add a comment saying why? I would have > > expected: > > > > prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > > > > to be handled more conservatively than: > > > > prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > > > > LGTM otherwise, but Richard B should have the final say. > > It works for me. Note it strictly prefers peeling for alignment over > epilogue vectorization (which is probably OK in most cases), but the > number of targets affected is quite small. Btw, I wonder if it is possible to detect the situation by checking if there's more than the fallthru edge from the vector loop to the scalar loop when vectorizing the epilogue. > For GCC 13 we should see to make skipping to the scalar epilogue > possible. Possibly by keeping the scalar loop of the main vectorized loop as scalar loop of the vectorized epilogue - that of course needs adjustments as to how we vectorize the epilogue with possibly more divergence from the main loop vectorization. But maybe not by too much. Richard. > > Thanks, > Richard. > > > Thanks, > > Richard > > > > > + return false; > > > + return true; > > > +} > > > + > > > /* Function vect_analyze_loop. > > > > > > Apply a set of analyses on LOOP, and create a loop_vec_info struct > > > @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > > > } > > > } > > > /* For now only allow one epilogue loop. */ > > > - if (first_loop_vinfo->epilogue_vinfos.is_empty ()) > > > + if (first_loop_vinfo->epilogue_vinfos.is_empty () > > > + && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo)) > > > { > > > first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo); > > > poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); > > > > -- Richard Biener SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)