From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rguenther@suse.de>
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28])
 by sourceware.org (Postfix) with ESMTPS id 14E233858405
 for <gcc-patches@gcc.gnu.org>; Wed, 27 Apr 2022 10:59:21 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 14E233858405
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
 by smtp-out1.suse.de (Postfix) with ESMTP id DD970210E6;
 Wed, 27 Apr 2022 10:59:19 +0000 (UTC)
Received: from murzim.suse.de (murzim.suse.de [10.160.4.192])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by relay2.suse.de (Postfix) with ESMTPS id C03542C141;
 Wed, 27 Apr 2022 10:59:19 +0000 (UTC)
Date: Wed, 27 Apr 2022 12:59:19 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: Richard Sandiford <richard.sandiford@arm.com>
cc: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>, 
 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] vect, tree-optimization/105219: Disable epilogue
 vectorization when peeling for alignment
In-Reply-To: <s0n34151-793p-257n-42np-8o2297o4s5qq@fhfr.qr>
Message-ID: <3p2o8n5s-3753-s23o-s8p1-54qr90nr9365@fhfr.qr>
References: <8462f41b-895f-9aca-499e-7713ec161673@arm.com>
 <mptlevrnbwe.fsf@arm.com> <s0n34151-793p-257n-42np-8o2297o4s5qq@fhfr.qr>
MIME-Version: 1.0
X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_LOTSOFHASH,
 KAM_SHORT, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8BIT
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Apr 2022 10:59:23 -0000

On Wed, 27 Apr 2022, Richard Biener wrote:

> On Tue, 26 Apr 2022, Richard Sandiford wrote:
> 
> > "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> > > Hi,
> > >
> > > This patch disables epilogue vectorization when we are peeling for 
> > > alignment in the prologue and we can't guarantee the main vectorized 
> > > loop is entered.  This is to prevent executing vectorized code with an 
> > > unaligned access if the target has indicated it wants to peel for 
> > > alignment. We take this conservative approach as we currently do not 
> > > distinguish between peeling for alignment for correctness or for 
> > > performance.
> > >
> > > A better codegen would be to make it skip to the scalar epilogue in case 
> > > the main loop isn't entered when alignment peeling is required. However, 
> > > that would require a more aggressive change to the codebase which we 
> > > chose to avoid at this point of development.  We can revisit this option 
> > > during stage 1 if we choose to.
> > >
> > > Bootstrapped on aarch64-none-linux and regression tested on 
> > > aarch64-none-elf.
> > >
> > > gcc/ChangeLog:
> > >
> > >      PR tree-optimization/105219
> > >      * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New 
> > > function.
> > >      (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment 
> > > to determine
> > >      whether to vectorize epilogue.
> > >      * testsuite/gcc.target/aarch64/pr105219.c: New.
> > >      * testsuite/gcc.target/aarch64/pr105219-2.c: New.
> > >      * testsuite/gcc.target/aarch64/pr105219-3.c: New.
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > > @@ -0,0 +1,29 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > 
> > I think this should be in gcc.dg/vect, with the options forced
> > for { target aarch64 }.
> > 
> > Are the skips necessary?  It looks like the test should work correctly
> > for all options/targets.
> > 
> > > +/* PR 105219.  */
> > > +int data[128];
> > > +
> > > +void __attribute((noipa))
> > > +foo (int *data, int n)
> > > +{
> > > +  for (int i = 0; i < n; ++i)
> > > +    data[i] = i;
> > > +}
> > > +
> > > +int main()
> > > +{
> > > +  for (int start = 0; start < 16; ++start)
> > > +    for (int n = 1; n < 3*16; ++n)
> > > +      {
> > > +        __builtin_memset (data, 0, sizeof (data));
> > > +        foo (&data[start], n);
> > > +        for (int j = 0; j < n; ++j)
> > > +          if (data[start + j] != j)
> > > +            __builtin_abort ();
> > > +      }
> > > +  return 0;
> > > +}
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
> > > +/* PR 105219.  */
> > > +int data[128];
> > > +
> > > +void foo (void)
> > > +{
> > > +  for (int i = 0; i < 9; ++i)
> > > +    data[i + 1] = i;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> > > @@ -0,0 +1,28 @@
> > > +/* { dg-do run { target aarch64_sve128_hw } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
> > > +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */
> > 
> > Same here.
> > 
> > > +/* PR 105219.  */
> > > +int a;
> > > +char b[60];
> > > +short c[18];
> > > +short d[4][19];
> > > +long long f;
> > > +void e(int g, int h, short k[][19]) {
> > > +  for (signed i = 0; i < 3; i += 2)
> > > +    for (signed j = 1; j < h + 14; j++) {
> > > +      b[i * 14 + j] = 1;
> > > +      c[i + j] = k[2][j];
> > > +      a = g ? k[i][j] : 0;
> > > +    }
> > > +}
> > > +int main() {
> > > +  e(9, 1, d);
> > > +  for (long l = 0; l < 6; ++l)
> > > +    for (long m = 0; m < 4; ++m)
> > > +      f ^= b[l + m * 4];
> > > +  if (f)
> > > +    __builtin_abort ();
> > > +}
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
> > >    return opt_loop_vec_info::success (loop_vinfo);
> > >  }
> > >  
> > > +/* Function vect_epilogue_when_peeling_for_alignment
> > > +
> > > +   PR 105219: If we are peeling for alignment in the prologue then we do not
> > > +   vectorize the epilogue unless we are certain we will enter the main
> > > +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> > > +   case there aren't enough iterations to enter the main loop.
> > > +*/
> > > +
> > > +static bool
> > > +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> > > +{
> > > +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> > > +    return true;
> > > +
> > > +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> > > +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> > > +    {
> > > +      poly_uint64 niters_for_main
> > > +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> > > +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> > > +      niters_for_main
> > > +	= upper_bound (niters_for_main,
> > > +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> > > +      niters_for_main += prologue_peeling;
> > > +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> > > +	return false;
> > > +    }
> > > +  else if (prologue_peeling < 0)
> > 
> > I was surprised that this tests < 0 rather than != 0.  If that's the
> > right behaviour, could you add a comment saying why?  I would have
> > expected:
> > 
> >   prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > 
> > to be handled more conservatively than:
> > 
> >   prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > 
> > LGTM otherwise, but Richard B should have the final say.
> 
> It works for me.  Note it strictly prefers peeling for alignment over
> epilogue vectorization (which is probably OK in most cases), but the
> number of targets affected is quite small.

Btw, I wonder if it is possible to detect the situation by
checking if there's more than the fallthru edge from the vector
loop to the scalar loop when vectorizing the epilogue.

> For GCC 13 we should see to make skipping to the scalar epilogue
> possible.

Possibly by keeping the scalar loop of the main vectorized loop as
scalar loop of the vectorized epilogue - that of course needs
adjustments as to how we vectorize the epilogue with possibly
more divergence from the main loop vectorization.  But maybe not
by too much.

Richard.

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Richard
> > 
> > > +    return false;
> > > +  return true;
> > > +}
> > > +
> > >  /* Function vect_analyze_loop.
> > >  
> > >     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> > > @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > >  		}
> > >  	    }
> > >  	  /* For now only allow one epilogue loop.  */
> > > -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> > > +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> > > +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> > >  	    {
> > >  	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> > >  	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)