From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id CE48B3858C83
 for <gcc-patches@gcc.gnu.org>; Tue, 26 Apr 2022 14:43:15 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org CE48B3858C83
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 8006B23A;
 Tue, 26 Apr 2022 07:43:15 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.88])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E25FF3F73B;
 Tue, 26 Apr 2022 07:43:14 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: "Andre Vieira \(lists\)" <andre.simoesdiasvieira@arm.com>
Mail-Followup-To: "Andre Vieira \(lists\)" <andre.simoesdiasvieira@arm.com>,
 "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
 Richard Biener <rguenther@suse.de>, richard.sandiford@arm.com
Cc: "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
 Richard Biener <rguenther@suse.de>
Subject: Re: [PATCH] vect,
 tree-optimization/105219: Disable epilogue vectorization when peeling
 for alignment
References: <8462f41b-895f-9aca-499e-7713ec161673@arm.com>
Date: Tue, 26 Apr 2022 15:43:13 +0100
In-Reply-To: <8462f41b-895f-9aca-499e-7713ec161673@arm.com> (Andre Vieira's
 message of "Tue, 26 Apr 2022 15:34:12 +0100")
Message-ID: <mptlevrnbwe.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, KAM_LOTSOFHASH, KAM_SHORT, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 26 Apr 2022 14:43:17 -0000

"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> This patch disables epilogue vectorization when we are peeling for=20
> alignment in the prologue and we can't guarantee the main vectorized=20
> loop is entered.=C2=A0 This is to prevent executing vectorized code with =
an=20
> unaligned access if the target has indicated it wants to peel for=20
> alignment. We take this conservative approach as we currently do not=20
> distinguish between peeling for alignment for correctness or for=20
> performance.
>
> A better codegen would be to make it skip to the scalar epilogue in case=
=20
> the main loop isn't entered when alignment peeling is required. However,=
=20
> that would require a more aggressive change to the codebase which we=20
> chose to avoid at this point of development.=C2=A0 We can revisit this op=
tion=20
> during stage 1 if we choose to.
>
> Bootstrapped on aarch64-none-linux and regression tested on=20
> aarch64-none-elf.
>
> gcc/ChangeLog:
>
>  =C2=A0=C2=A0=C2=A0 PR tree-optimization/105219
>  =C2=A0=C2=A0=C2=A0 * tree-vect-loop.cc (vect_epilogue_when_peeling_for_a=
lignment): New=20
> function.
>  =C2=A0=C2=A0=C2=A0 (vect_analyze_loop): Use vect_epilogue_when_peeling_f=
or_alignment=20
> to determine
>  =C2=A0=C2=A0=C2=A0 whether to vectorize epilogue.
>  =C2=A0=C2=A0=C2=A0 * testsuite/gcc.target/aarch64/pr105219.c: New.
>  =C2=A0=C2=A0=C2=A0 * testsuite/gcc.target/aarch64/pr105219-2.c: New.
>  =C2=A0=C2=A0=C2=A0 * testsuite/gcc.target/aarch64/pr105219-3.c: New.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuit=
e/gcc.target/aarch64/pr105219-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08=
407e1e352f604fe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -march=3Darmv8.2-a -mtune=3Dthunderx -fno-vect-cost=
-model" } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=3D*" } { "-ma=
rch=3Darmv8.2-a" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=3D*" } { "-mt=
une=3Dthunderx" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=3D*" } } */

I think this should be in gcc.dg/vect, with the options forced
for { target aarch64 }.

Are the skips necessary?  It looks like the test should work correctly
for all options/targets.

> +/* PR 105219.  */
> +int data[128];
> +
> +void __attribute((noipa))
> +foo (int *data, int n)
> +{
> +  for (int i =3D 0; i < n; ++i)
> +    data[i] =3D i;
> +}
> +
> +int main()
> +{
> +  for (int start =3D 0; start < 16; ++start)
> +    for (int n =3D 1; n < 3*16; ++n)
> +      {
> +        __builtin_memset (data, 0, sizeof (data));
> +        foo (&data[start], n);
> +        for (int j =3D 0; j < n; ++j)
> +          if (data[start + j] !=3D j)
> +            __builtin_abort ();
> +      }
> +  return 0;
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuit=
e/gcc.target/aarch64/pr105219-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6=
236d1ff0fc2d392
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=3D*" } { "-ma=
rch=3Darmv8.2-a" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=3D*" } { "-mt=
une=3Dthunderx" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=3D*" } } */
> +/* { dg-options "-O3 -march=3Darmv8.2-a -mtune=3Dthunderx -fno-vect-cost=
-model -fdump-tree-vect-all" } */
> +/* PR 105219.  */
> +int data[128];
> +
> +void foo (void)
> +{
> +  for (int i =3D 0; i < 9; ++i)
> +    data[i + 1] =3D i;
> +}
> +
> +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/=
gcc.target/aarch64/pr105219.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d2=
0ce1ef9152a526c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> @@ -0,0 +1,28 @@
> +/* { dg-do run { target aarch64_sve128_hw } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=3D*" } { "-ma=
rch=3Darmv8.2-a+sve" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=3D*" } { "-mt=
une=3Dthunderx" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=3D*" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=3D=
*" } { "-msve-vector-bits=3D128" } } */
> +/* { dg-options "-O3 -march=3Darmv8.2-a+sve -msve-vector-bits=3D128 -mtu=
ne=3Dthunderx" } */

Same here.

> +/* PR 105219.  */
> +int a;
> +char b[60];
> +short c[18];
> +short d[4][19];
> +long long f;
> +void e(int g, int h, short k[][19]) {
> +  for (signed i =3D 0; i < 3; i +=3D 2)
> +    for (signed j =3D 1; j < h + 14; j++) {
> +      b[i * 14 + j] =3D 1;
> +      c[i + j] =3D k[2][j];
> +      a =3D g ? k[i][j] : 0;
> +    }
> +}
> +int main() {
> +  e(9, 1, d);
> +  for (long l =3D 0; l < 6; ++l)
> +    for (long m =3D 0; m < 4; ++m)
> +      f ^=3D b[l + m * 4];
> +  if (f)
> +    __builtin_abort ();
> +}
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea94=
74095bf65483863 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_sh=
ared *shared,
>    return opt_loop_vec_info::success (loop_vinfo);
>  }
>=20=20
> +/* Function vect_epilogue_when_peeling_for_alignment
> +
> +   PR 105219: If we are peeling for alignment in the prologue then we do=
 not
> +   vectorize the epilogue unless we are certain we will enter the main
> +   vectorized loop.  This is to prevent entering the vectorized epilogue=
 in
> +   case there aren't enough iterations to enter the main loop.
> +*/
> +
> +static bool
> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> +{
> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> +    return true;
> +
> +  int prologue_peeling =3D LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    {
> +      poly_uint64 niters_for_main
> +	=3D upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> +      niters_for_main
> +	=3D upper_bound (niters_for_main,
> +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> +      niters_for_main +=3D prologue_peeling;
> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> +	return false;
> +    }
> +  else if (prologue_peeling < 0)

I was surprised that this tests < 0 rather than !=3D 0.  If that's the
right behaviour, could you add a comment saying why?  I would have
expected:

  prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

to be handled more conservatively than:

  prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

LGTM otherwise, but Richard B should have the final say.

Thanks,
Richard

> +    return false;
> +  return true;
> +}
> +
>  /* Function vect_analyze_loop.
>=20=20
>     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_share=
d *shared)
>  		}
>  	    }
>  	  /* For now only allow one epilogue loop.  */
> -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
>  	    {
>  	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
>  	      poly_uint64 th =3D LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);