public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Richard Biener <rguenther@suse.de>
To: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
	 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] vect, tree-optimization/105219: Disable epilogue vectorization when peeling for alignment
Date: Wed, 27 Apr 2022 16:09:44 +0200 (CEST)	[thread overview]
Message-ID: <30p6r237-sn90-3169-36ro-2poo369rq2r7@fhfr.qr> (raw)
In-Reply-To: <5d06cd70-c6c0-1d68-555e-8355de6c9632@arm.com>

On Wed, 27 Apr 2022, Andre Vieira (lists) wrote:

> 
> On 27/04/2022 07:35, Richard Biener wrote:
> > On Tue, 26 Apr 2022, Richard Sandiford wrote:
> >
> >> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> >>> Hi,
> >>>
> >>> This patch disables epilogue vectorization when we are peeling for
> >>> alignment in the prologue and we can't guarantee the main vectorized
> >>> loop is entered.  This is to prevent executing vectorized code with an
> >>> unaligned access if the target has indicated it wants to peel for
> >>> alignment. We take this conservative approach as we currently do not
> >>> distinguish between peeling for alignment for correctness or for
> >>> performance.
> >>>
> >>> A better codegen would be to make it skip to the scalar epilogue in case
> >>> the main loop isn't entered when alignment peeling is required. However,
> >>> that would require a more aggressive change to the codebase which we
> >>> chose to avoid at this point of development.  We can revisit this option
> >>> during stage 1 if we choose to.
> >>>
> >>> Bootstrapped on aarch64-none-linux and regression tested on
> >>> aarch64-none-elf.
> >>>
> >>> gcc/ChangeLog:
> >>>
> >>>       PR tree-optimization/105219
> >>>       * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New
> >>> function.
> >>>       (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment
> >>> to determine
> >>>       whether to vectorize epilogue.
> >>>       * testsuite/gcc.target/aarch64/pr105219.c: New.
> >>>       * testsuite/gcc.target/aarch64/pr105219-2.c: New.
> >>>       * testsuite/gcc.target/aarch64/pr105219-3.c: New.
> >>>
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> >>> b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> >>> new file mode 100644
> >>> index
> >>> 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> >>> @@ -0,0 +1,29 @@
> >>> +/* { dg-do run } */
> >>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx
> >>> -fno-vect-cost-model" } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } {
> >>> "-march=armv8.2-a" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } {
> >>> "-mtune=thunderx" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> >> I think this should be in gcc.dg/vect, with the options forced
> >> for { target aarch64 }.
> >>
> >> Are the skips necessary?  It looks like the test should work correctly
> >> for all options/targets.
> >>
> >>> +/* PR 105219.  */
> >>> +int data[128];
> >>> +
> >>> +void __attribute((noipa))
> >>> +foo (int *data, int n)
> >>> +{
> >>> +  for (int i = 0; i < n; ++i)
> >>> +    data[i] = i;
> >>> +}
> >>> +
> >>> +int main()
> >>> +{
> >>> +  for (int start = 0; start < 16; ++start)
> >>> +    for (int n = 1; n < 3*16; ++n)
> >>> +      {
> >>> +        __builtin_memset (data, 0, sizeof (data));
> >>> +        foo (&data[start], n);
> >>> +        for (int j = 0; j < n; ++j)
> >>> +          if (data[start + j] != j)
> >>> +            __builtin_abort ();
> >>> +      }
> >>> +  return 0;
> >>> +}
> >>> +
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> >>> b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> >>> new file mode 100644
> >>> index
> >>> 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> >>> @@ -0,0 +1,15 @@
> >>> +/* { dg-do compile } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } {
> >>> "-march=armv8.2-a" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } {
> >>> "-mtune=thunderx" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> >>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx
> >>> -fno-vect-cost-model -fdump-tree-vect-all" } */
> >>> +/* PR 105219.  */
> >>> +int data[128];
> >>> +
> >>> +void foo (void)
> >>> +{
> >>> +  for (int i = 0; i < 9; ++i)
> >>> +    data[i + 1] = i;
> >>> +}
> >>> +
> >>> +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> >>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c
> >>> b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> >>> new file mode 100644
> >>> index
> >>> 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
> >>> --- /dev/null
> >>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> >>> @@ -0,0 +1,28 @@
> >>> +/* { dg-do run { target aarch64_sve128_hw } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } {
> >>> "-march=armv8.2-a+sve" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } {
> >>> "-mtune=thunderx" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> >>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*"
> >>> } { "-msve-vector-bits=128" } } */
> >>> +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128
> >>> -mtune=thunderx" } */
> >> Same here.
> >>
> >>> +/* PR 105219.  */
> >>> +int a;
> >>> +char b[60];
> >>> +short c[18];
> >>> +short d[4][19];
> >>> +long long f;
> >>> +void e(int g, int h, short k[][19]) {
> >>> +  for (signed i = 0; i < 3; i += 2)
> >>> +    for (signed j = 1; j < h + 14; j++) {
> >>> +      b[i * 14 + j] = 1;
> >>> +      c[i + j] = k[2][j];
> >>> +      a = g ? k[i][j] : 0;
> >>> +    }
> >>> +}
> >>> +int main() {
> >>> +  e(9, 1, d);
> >>> +  for (long l = 0; l < 6; ++l)
> >>> +    for (long m = 0; m < 4; ++m)
> >>> +      f ^= b[l + m * 4];
> >>> +  if (f)
> >>> +    __builtin_abort ();
> >>> +}
> >>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>> index
> >>> d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863
> >>> 100644
> >>> --- a/gcc/tree-vect-loop.cc
> >>> +++ b/gcc/tree-vect-loop.cc
> >>> @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop,
> >>> @@ vec_info_shared *shared,
> >>>     return opt_loop_vec_info::success (loop_vinfo);
> >>>   }
> >>>   
> >>> +/* Function vect_epilogue_when_peeling_for_alignment
> >>> +
> >>> +   PR 105219: If we are peeling for alignment in the prologue then we do
> >>> not
> >>> +   vectorize the epilogue unless we are certain we will enter the main
> >>> +   vectorized loop.  This is to prevent entering the vectorized epilogue
> >>> in
> >>> +   case there aren't enough iterations to enter the main loop.
> >>> +*/
> >>> +
> >>> +static bool
> >>> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> >>> +{
> >>> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> >>> +    return true;
> >>> +
> >>> +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> >>> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> >>> +    {
> >>> +      poly_uint64 niters_for_main
> >>> +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> >>> +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> >>> +      niters_for_main
> >>> +	= upper_bound (niters_for_main,
> >>> +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> >>> +      niters_for_main += prologue_peeling;
> >>> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> >>> +	return false;
> >>> +    }
> >>> +  else if (prologue_peeling < 0)
> >> I was surprised that this tests < 0 rather than != 0.  If that's the
> >> right behaviour, could you add a comment saying why?  I would have
> >> expected:
> >>
> >>    prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >>
> >> to be handled more conservatively than:
> >>
> >>    prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >>
> >> LGTM otherwise, but Richard B should have the final say.
> > It works for me.  Note it strictly prefers peeling for alignment over
> > epilogue vectorization (which is probably OK in most cases), but the
> > number of targets affected is quite small.
> >
> > For GCC 13 we should see to make skipping to the scalar epilogue
> > possible.
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> Richard
> >>
> >>> +    return false;
> >>> +  return true;
> >>> +}
> >>> +
> >>>   /* Function vect_analyze_loop.
> >>>   
> >>>      Apply a set of analyses on LOOP, and create a loop_vec_info struct
> >>> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared
> >>> @@ *shared)
> >>>        	}
> >>>        }
> >>>   	  /* For now only allow one epilogue loop.  */
> >>> -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> >>> +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> >>> +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> >>>        {
> >>>          first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> >>>          poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> IIUC after IRC and PR discussions we are dropping this approach in favour of
> changing the loop_iteration tightening as we now believe it's fine to enter
> the vectorized epilogue without peeling?

Yes, I think we are conservative there, at least I don't see too big
alignment used and I saw the DRs being offsetted by a correct amount
in update_epilogue_loop_vinfo.  So unless we have contrary indication
we should be safe here.  You can look at a -gimple suffixed dump
where for example with -mavx2 I see

  __MEM <vector(8) int> ((int *)vectp_data.13_72) = vect_vec_iv_.12_67;

in the main loop (alignment according to the type) and

  __MEM <vector(4) int, 32> ((int *)vectp_data.21_112) = 
vect_vec_iv_.20_106;

in the epilogue (32 bit alignent).

Richard.

      reply	other threads:[~2022-04-27 14:09 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-26 14:34 Andre Vieira (lists)
2022-04-26 14:43 ` Richard Sandiford
2022-04-26 15:12   ` Jakub Jelinek
2022-04-26 15:45     ` Andre Vieira (lists)
2022-04-26 16:02       ` Jakub Jelinek
2022-04-26 15:41   ` Andre Vieira (lists)
2022-04-27  6:35   ` Richard Biener
2022-04-27 10:59     ` Richard Biener
2022-04-27 11:10       ` Richard Biener
2022-04-27 13:46     ` Andre Vieira (lists)
2022-04-27 14:09       ` Richard Biener [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=30p6r237-sn90-3169-36ro-2poo369rq2r7@fhfr.qr \
    --to=rguenther@suse.de \
    --cc=andre.simoesdiasvieira@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).