[PATCH V2] Enable small loop unrolling for O2

public inbox for gcc-patches@gcc.gnu.org

* [PATCH V2] Enable small loop unrolling for O2
@ 2022-11-02  3:37 Hongyu Wang
  2022-11-07 14:24 ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Hongyu Wang @ 2022-11-02  3:37 UTC
  To: gcc-patches; +Cc: richard.guenther, ubizjak, hongtao.liu

Hi, this is an updated version of the patch at
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html.
It uses targetm.loop_unroll_adjust as the gate for enabling small loop unrolling.
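
For readers who want the gate condition in isolation, here is a minimal standalone sketch of it. It is not GCC code: GateState and rtl_unroll_gate are hypothetical names, and the struct fields merely stand in for the flags, cfun->has_unroll, the presence of the target hook, and optimize_function_for_speed_p that the real gate in loop-init.cc consults.

```cpp
#include <cassert>

// Hypothetical model of the state the gate looks at (not GCC's real types).
struct GateState {
  bool flag_unroll_loops;       // -funroll-loops in effect
  bool flag_unroll_all_loops;   // -funroll-all-loops in effect
  bool has_unroll;              // function contains a loop with an unroll request
  bool has_unroll_adjust_hook;  // target defines the loop_unroll_adjust hook
  int optimize;                 // -O level
  bool optimize_for_speed;      // function is optimized for speed
};

// The pass runs if unrolling was requested explicitly, or if the target
// provides an adjust hook and the function is compiled for speed at -O2+.
bool rtl_unroll_gate(const GateState &s)
{
  return s.flag_unroll_loops || s.flag_unroll_all_loops || s.has_unroll
         || (s.has_unroll_adjust_hook && s.optimize >= 2
             && s.optimize_for_speed);
}
```

With no unroll flags set, this model returns true only when the hook exists, the level is at least 2, and the function is optimized for speed, so targets without the hook keep their previous behavior.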

This patch does not change rs6000/s390 since I don't have a machine to
test them on, but I expect their default behavior to stay the same since
they already enable flag_unroll_loops at O2.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?

---------- Patch content --------

Modern processors have multi-way instruction decoders.
For x86, icelake/zen3 are 5 uops wide, so for a small loop with <= 4
instructions (usually 3 uops, since the cmp/jmp pair can be
macro-fused), the decoder would have a 2-uop bubble in each iteration
and the pipeline could not be fully utilized.

Therefore, this patch enables loop unrolling for small loops at O2
to fill the decoder as much as possible. It turns on rtl loop
unrolling when targetm.loop_unroll_adjust exists and the function is
optimized for speed at O2 or above. In the x86 backend the default
behavior is to unroll small loops with at most 4 insns once.

This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
0.9% code size increase. For other benchmarks the variations are minor
and overall code size increases by 0.2%.

The kernel image size increases by 0.06%, and there is no impact on eembc.

gcc/ChangeLog:

	* common/config/i386/i386-common.cc (ix86_option_optimization_table):
	Enable small loop unroll at O2 by default.
	* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
	factor if -munroll-only-small-loops enabled and -funroll-loops/
	-funroll-all-loops are disabled.
	* config/i386/i386.opt: Add -munroll-only-small-loops,
	-param=x86-small-unroll-ninsns= for loop insn limit,
	-param=x86-small-unroll-factor= for unroll factor.
	* doc/invoke.texi: Document -munroll-only-small-loops,
	x86-small-unroll-ninsns and x86-small-unroll-factor.
	* loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
	loop unrolling for -O2-speed and above if target hook
	loop_unroll_adjust exists.

gcc/testsuite/ChangeLog:

	* gcc.dg/guality/loop-1.c: Add additional option
	  -mno-unroll-only-small-loops.
	* gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
	* gcc.target/i386/pr93002.c: Likewise.
---
 gcc/common/config/i386/i386-common.cc   |  1 +
 gcc/config/i386/i386.cc                 | 18 ++++++++++++++++++
 gcc/config/i386/i386.opt                | 13 +++++++++++++
 gcc/doc/invoke.texi                     | 16 ++++++++++++++++
 gcc/loop-init.cc                        | 10 +++++++---
 gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
 gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
 8 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
index f66bdd5a2af..c6891486078 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
     /* The STC algorithm produces the smallest code at -Os, for x86.  */
     { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
       REORDER_BLOCKS_ALGORITHM_STC },
+    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index c0f37149ed0..0f94a3b609e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   unsigned i;
   unsigned mem_count = 0;
 
+  /* Unroll small size loop when unroll factor is not explicitly
+     specified.  */
+  if (!(flag_unroll_loops
+	|| flag_unroll_all_loops
+	|| loop->unroll))
+    {
+      nunroll = 1;
+
+      /* Any explicit -f{no-}unroll-{all-}loops turns off
+	 -munroll-only-small-loops.  */
+      if (ix86_unroll_only_small_loops
+	  && !OPTION_SET_P (flag_unroll_loops))
+	if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
+	  nunroll = (unsigned) ix86_small_unroll_factor;
+
+      return nunroll;
+    }
+
   if (!TARGET_ADJUST_UNROLL)
      return nunroll;
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 53d534f6392..6da9c8d670d 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1224,3 +1224,16 @@ mavxvnniint8
 Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
 Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
 AVXVNNIINT8 built-in functions and code generation.
+
+munroll-only-small-loops
+Target Var(ix86_unroll_only_small_loops) Init(0) Save
+Enable conservative small loop unrolling.
+
+-param=x86-small-unroll-ninsns=
+Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
+Instruction count limit for a loop to be unrolled under
+-munroll-only-small-loops.
+
+-param=x86-small-unroll-factor=
+Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
+Unroll factor for -munroll-only-small-loops.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 550aec87809..487218bd0ce 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15821,6 +15821,14 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
 @item x86-stlf-window-ninsns
 Instructions number above which STFL stall penalty can be compensated.
 
+@item x86-small-unroll-ninsns
+If -munroll-only-small-loops is enabled, only unroll loops with instruction
+count no greater than this parameter. The default value is 4.
+
+@item x86-small-unroll-factor
+If -munroll-only-small-loops is enabled, reset the unroll factor with this
+value. The default value is 2 which means the loop will be unrolled once.
+
 @end table
 
 @end table
@@ -25232,6 +25240,14 @@ environments where no dynamic link is performed, like firmwares, OS
 kernels, executables linked with @option{-static} or @option{-static-pie}.
 @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
 @option{-fpic}.
+
+@item -munroll-only-small-loops
+@itemx -mno-unroll-only-small-loops
+@opindex munroll-only-small-loops
+Controls conservative small loop unrolling. It is enabled by default at
+O2, and unrolls loops with at most 4 insns once. Explicit
+-f[no-]unroll-[all-]loops disables this flag to avoid any
+unrolling behavior that the user does not want.
 @end table
 
 @node M32C Options
diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
index b9e07973dd6..9789efa1e11 100644
--- a/gcc/loop-init.cc
+++ b/gcc/loop-init.cc
@@ -565,9 +565,12 @@ public:
   {}
 
   /* opt_pass methods: */
-  bool gate (function *) final override
+  bool gate (function *fun) final override
     {
-      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
+      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
+	      || (targetm.loop_unroll_adjust
+		  && optimize >= 2
+		  && optimize_function_for_speed_p (fun)));
     }
 
   unsigned int execute (function *) final override;
@@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
       if (dump_file)
 	df_dump (dump_file);
 
-      if (flag_unroll_loops)
+      if (flag_unroll_loops
+	  || targetm.loop_unroll_adjust)
 	flags |= UAP_UNROLL;
       if (flag_unroll_all_loops)
 	flags |= UAP_UNROLL_ALL;
diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
index 1b1f6d32271..a32ea445a3f 100644
--- a/gcc/testsuite/gcc.dg/guality/loop-1.c
+++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
@@ -1,5 +1,7 @@
 /* { dg-do run } */
 /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
+/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
+
 
 #include "../nop.h"
 
diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
index 81841ef5bd7..cbc9fbb0450 100644
--- a/gcc/testsuite/gcc.target/i386/pr86270.c
+++ b/gcc/testsuite/gcc.target/i386/pr86270.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 
 int *a;
 long len;
diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
index 0248fcc00a5..f75a847f75d 100644
--- a/gcc/testsuite/gcc.target/i386/pr93002.c
+++ b/gcc/testsuite/gcc.target/i386/pr93002.c
@@ -1,6 +1,6 @@
 /* PR target/93002 */
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
 
 volatile int sink;
-- 
2.18.1

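As a reading aid, the decision made by the new early-return block in ix86_loop_unroll_adjust can be modeled outside of GCC roughly as follows. This is a sketch under assumptions, not the hook itself: UnrollOptions, LoopInfo and small_loop_unroll_adjust are hypothetical names, the fields stand in for the option variables and OPTION_SET_P (flag_unroll_loops), and the part of the real hook guarded by TARGET_ADJUST_UNROLL is not modeled.

```cpp
#include <cassert>

// Hypothetical stand-ins for the option state the hook reads.
struct UnrollOptions {
  bool unroll_loops = false;           // -funroll-loops in effect
  bool unroll_all_loops = false;       // -funroll-all-loops in effect
  bool unroll_loops_explicit = false;  // user wrote -f[no-]unroll-loops
  bool only_small_loops = true;        // -munroll-only-small-loops
  unsigned small_ninsns = 4;           // --param=x86-small-unroll-ninsns=
  unsigned small_factor = 2;           // --param=x86-small-unroll-factor=
};

// Hypothetical stand-in for the fields of class loop used here.
struct LoopInfo {
  unsigned ninsns;         // number of insns in the loop body
  unsigned pragma_unroll;  // nonzero if an unroll factor was requested
};

// Returns the factor picked by the small-loop path. When general unrolling
// is active, nunroll is passed through unchanged (the real hook would go on
// to its existing adjustment logic instead).
unsigned small_loop_unroll_adjust(unsigned nunroll, const LoopInfo &loop,
                                  const UnrollOptions &opts)
{
  if (opts.unroll_loops || opts.unroll_all_loops || loop.pragma_unroll)
    return nunroll;

  // Single combined condition: small-loop mode is on, the user did not
  // touch -f[no-]unroll-loops, and the loop is small enough.
  if (opts.only_small_loops && !opts.unroll_loops_explicit
      && loop.ninsns <= opts.small_ninsns)
    return opts.small_factor;

  return 1;  // a factor of 1 means no unrolling
}
```

With the defaults above, a 3-insn or 4-insn loop gets factor 2, a 5-insn loop gets factor 1, and an explicit -fno-unroll-loops (modeled by unroll_loops_explicit) turns the small-loop path off.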


* Re: [PATCH V2] Enable small loop unrolling for O2
  2022-11-02  3:37 [PATCH V2] Enable small loop unrolling for O2 Hongyu Wang
@ 2022-11-07 14:24 ` Richard Biener
  2022-11-08  3:07   ` Hongtao Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2022-11-07 14:24 UTC
  To: Hongyu Wang; +Cc: gcc-patches, ubizjak, hongtao.liu

On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index c0f37149ed0..0f94a3b609e 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
>    unsigned i;
>    unsigned mem_count = 0;
>
> +  /* Unroll small size loop when unroll factor is not explicitly
> +     specified.  */
> +  if (!(flag_unroll_loops
> +       || flag_unroll_all_loops
> +       || loop->unroll))
> +    {
> +      nunroll = 1;
> +
> +      /* Any explicit -f{no-}unroll-{all-}loops turns off
> +        -munroll-only-small-loops.  */
> +      if (ix86_unroll_only_small_loops
> +         && !OPTION_SET_P (flag_unroll_loops))
> +       if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)

Either add braces or combine the two ifs.

Otherwise the middle-end changes look OK.  The target maintainers need to decide
whether the two --params should be core tunings instead - I would assume that,
given your rationale, the decode and issue widths of the core play an important
role here.  That might also suggest using a single parameter and unrolling
(factor * issue_width) / loop->ninsns times instead of using a static unroll_factor?

Thanks,
Richard.

> +         nunroll = (unsigned) ix86_small_unroll_factor;
> +
> +      return nunroll;
> +    }
> +
>    if (!TARGET_ADJUST_UNROLL)
>       return nunroll;


* Re: [PATCH V2] Enable small loop unrolling for O2
  2022-11-07 14:24 ` Richard Biener
@ 2022-11-08  3:07   ` Hongtao Liu
  2022-11-09  1:24     ` Hongyu Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Hongtao Liu @ 2022-11-08  3:07 UTC
  To: Richard Biener; +Cc: Hongyu Wang, gcc-patches, ubizjak, hongtao.liu

On Mon, Nov 7, 2022 at 10:25 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index c0f37149ed0..0f94a3b609e 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> >    unsigned i;
> >    unsigned mem_count = 0;
> >
> > +  /* Unroll small size loop when unroll factor is not explicitly
> > +     specified.  */
> > +  if (!(flag_unroll_loops
> > +       || flag_unroll_all_loops
> > +       || loop->unroll))
> > +    {
> > +      nunroll = 1;
> > +
> > +      /* Any explicit -f{no-}unroll-{all-}loops turns off
> > +        -munroll-only-small-loops.  */
> > +      if (ix86_unroll_only_small_loops
> > +         && !OPTION_SET_P (flag_unroll_loops))
> > +       if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
>
> Either add braces or combine the two ifs.
>
> Otherwise the middle-end changes look OK.  The target maintainers need to decide
> whether the two --params should be core tunings instead - I would assume that,
> given your rationale, the decode and issue widths of the core play an important
> role here.  That might also suggest using a single parameter and unrolling
> (factor * issue_width) / loop->ninsns times instead of using a static unroll_factor?
Although ix86_small_unroll_ninsns is derived from issue_rate, it is tuned
for code size.
Making it exactly equal to issue_rate and unrolling factor * issue_width /
loop->ninsns times may increase code size too much.
So I prefer adding those 2 parameters to the cost table as core
tunings, rather than a single one.
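
To make the code size concern concrete, here is a small standalone sketch that compares the two policies. The function names are hypothetical, the issue width of 5 used in the note below is taken from the commit message rather than from any cost table, and clamping the scaled result to at least 1 is my own assumption about how the formula would be applied.

```cpp
#include <cassert>

// Static policy from the patch: loops with at most `limit` insns get a
// fixed factor, all other loops are left alone (factor 1).
unsigned static_small_factor(unsigned ninsns, unsigned limit, unsigned factor)
{
  return ninsns <= limit ? factor : 1;
}

// Width-scaled policy as suggested in review:
// (factor * issue_width) / ninsns, clamped to at least 1 (assumption).
unsigned width_scaled_factor(unsigned ninsns, unsigned issue_width,
                             unsigned factor)
{
  if (ninsns == 0)
    return 1;  // guard against division by zero in this sketch
  unsigned n = (factor * issue_width) / ninsns;
  return n < 1 ? 1 : n;
}
```

With factor 2 and issue width 5, a 2-insn loop gets a factor of 5 and a 1-insn loop a factor of 10 under the scaled policy, against a constant 2 under the static one, which is where the extra code size would come from. For a 4-insn loop both policies give 2.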
>
> Thanks,
> Richard.
>
> > +         nunroll = (unsigned) ix86_small_unroll_factor;
> > +
> > +      return nunroll;
> > +    }
> > +
> >    if (!TARGET_ADJUST_UNROLL)
> >       return nunroll;
> >
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index 53d534f6392..6da9c8d670d 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -1224,3 +1224,16 @@ mavxvnniint8
> >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> >  AVXVNNIINT8 built-in functions and code generation.
> > +
> > +munroll-only-small-loops
> > +Target Var(ix86_unroll_only_small_loops) Init(0) Save
> > +Enable conservative small loop unrolling.
> > +
> > +-param=x86-small-unroll-ninsns=
> > +Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
> > +Insturctions number limit for loop to be unrolled under
> > +-munroll-only-small-loops.
> > +
> > +-param=x86-small-unroll-factor=
> > +Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
> > +Unroll factor for -munroll-only-small-loops.
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 550aec87809..487218bd0ce 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -15821,6 +15821,14 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> >  @item x86-stlf-window-ninsns
> >  Instructions number above which STFL stall penalty can be compensated.
> >
> > +@item x86-small-unroll-ninsns
> > +If -munroll-only-small-loops is enabled, only unroll loops with instruction
> > +count less than this parameter. The default value is 4.
> > +
> > +@item x86-small-unroll-factor
> > +If -munroll-only-small-loops is enabled, reset the unroll factor with this
> > +value. The default value is 2 which means the loop will be unrolled once.
> > +
> >  @end table
> >
> >  @end table
> > @@ -25232,6 +25240,14 @@ environments where no dynamic link is performed, like firmwares, OS
> >  kernels, executables linked with @option{-static} or @option{-static-pie}.
> >  @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
> >  @option{-fpic}.
> > +
> > +@item -munroll-only-small-loops
> > +@itemx -mno-unroll-only-small-loops
> > +@opindex munroll-only-small-loops
> > +Controls conservative small loop unrolling. It is default enbaled by
> > +O2, and unrolls loop with less than 4 insns by 1 time. Explicit
> > +-f[no-]unroll-[all-]loops would disable this flag to avoid any
> > +unintended unrolling behavior that user does not want.
> >  @end table
> >
> >  @node M32C Options
> > diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
> > index b9e07973dd6..9789efa1e11 100644
> > --- a/gcc/loop-init.cc
> > +++ b/gcc/loop-init.cc
> > @@ -565,9 +565,12 @@ public:
> >    {}
> >
> >    /* opt_pass methods: */
> > -  bool gate (function *) final override
> > +  bool gate (function *fun) final override
> >      {
> > -      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
> > +      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
> > +             || (targetm.loop_unroll_adjust
> > +                 && optimize >= 2
> > +                 && optimize_function_for_speed_p (fun)));
> >      }
> >
> >    unsigned int execute (function *) final override;
> > @@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
> >        if (dump_file)
> >         df_dump (dump_file);
> >
> > -      if (flag_unroll_loops)
> > +      if (flag_unroll_loops
> > +         || targetm.loop_unroll_adjust)
> >         flags |= UAP_UNROLL;
> >        if (flag_unroll_all_loops)
> >         flags |= UAP_UNROLL_ALL;
> > diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > index 1b1f6d32271..a32ea445a3f 100644
> > --- a/gcc/testsuite/gcc.dg/guality/loop-1.c
> > +++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > @@ -1,5 +1,7 @@
> >  /* { dg-do run } */
> >  /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
> > +/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
> > +
> >
> >  #include "../nop.h"
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
> > index 81841ef5bd7..cbc9fbb0450 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr86270.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr86270.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2" } */
> > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> >
> >  int *a;
> >  long len;
> > diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
> > index 0248fcc00a5..f75a847f75d 100644
> > --- a/gcc/testsuite/gcc.target/i386/pr93002.c
> > +++ b/gcc/testsuite/gcc.target/i386/pr93002.c
> > @@ -1,6 +1,6 @@
> >  /* PR target/93002 */
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2" } */
> > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> >  /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
> >
> >  volatile int sink;
> > --
> > 2.18.1
> >



-- 
BR,
Hongtao


* Re: [PATCH V2] Enable small loop unrolling for O2
  2022-11-08  3:07   ` Hongtao Liu
@ 2022-11-09  1:24     ` Hongyu Wang
  2022-11-14  1:35       ` Hongtao Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Hongyu Wang @ 2022-11-09  1:24 UTC
  To: Hongtao Liu
  Cc: Richard Biener, Hongyu Wang, gcc-patches, ubizjak, hongtao.liu

[-- Attachment #1: Type: text/plain, Size: 12051 bytes --]

> Although ix86_small_unroll_ninsns is derived from issue_rate, it is tuned
> for code size.
> Making it exactly equal to issue_rate and unrolling factor * issue_width /
> loop->ninsns times may increase code size too much.
> So I prefer adding those 2 parameters to the cost table as core
> tunings, rather than a single one.

Yes, here is the updated patch that changes the cost table.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?

Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Tue, Nov 8, 2022 at 11:05:
>
> On Mon, Nov 7, 2022 at 10:25 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
> > >
> > > Hi, this is the updated patch of
> > > https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html,
> > > which uses targetm.loop_unroll_adjust as gate to enable small loop unroll.
> > >
> > > This patch does not change rs6000/s390 since I don't have machine to
> > > test them, but I suppose the default behavior is the same since they
> > > enable flag_unroll_loops at O2.
> > >
> > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > >
> > > Ok for trunk?
> > >
> > > ---------- Patch content --------
> > >
> > > Modern processors has multiple way instruction decoders
> > > For x86, icelake/zen3 has 5 uops, so for small loop with <= 4
> > > instructions (usually has 3 uops with a cmp/jmp pair that can be
> > > macro-fused), the decoder would have 2 uops bubble for each iteration
> > > and the pipeline could not be fully utilized.
> > >
> > > Therefore, this patch enables loop unrolling for small size loop at O2
> > > to fullfill the decoder as much as possible. It turns on rtl loop
> > > unrolling when targetm.loop_unroll_adjust exists and O2 plus speed only.
> > > In x86 backend the default behavior is to unroll small loops with less
> > > than 4 insns by 1 time.
> > >
> > > This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with
> > > 0.9% codesize increment. For other benchmarks the variants are minor
> > > and overall codesize increased by 0.2%.
> > >
> > > The kernel image size increased by 0.06%, and no impact on eembc.
> > >
> > > gcc/ChangeLog:
> > >
> > >         * common/config/i386/i386-common.cc (ix86_optimization_table):
> > >         Enable small loop unroll at O2 by default.
> > >         * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
> > >         factor if -munroll-only-small-loops enabled and -funroll-loops/
> > >         -funroll-all-loops are disabled.
> > >         * config/i386/i386.opt: Add -munroll-only-small-loops,
> > >         -param=x86-small-unroll-ninsns= for loop insn limit,
> > >         -param=x86-small-unroll-factor= for unroll factor.
> > >         * doc/invoke.texi: Document -munroll-only-small-loops,
> > >         x86-small-unroll-ninsns and x86-small-unroll-factor.
> > >         * loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
> > >         loop unrolling for -O2-speed and above if target hook
> > >         loop_unroll_adjust exists.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/guality/loop-1.c: Add additional option
> > >           -mno-unroll-only-small-loops.
> > >         * gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
> > >         * gcc.target/i386/pr93002.c: Likewise.
> > > ---
> > >  gcc/common/config/i386/i386-common.cc   |  1 +
> > >  gcc/config/i386/i386.cc                 | 18 ++++++++++++++++++
> > >  gcc/config/i386/i386.opt                | 13 +++++++++++++
> > >  gcc/doc/invoke.texi                     | 16 ++++++++++++++++
> > >  gcc/loop-init.cc                        | 10 +++++++---
> > >  gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
> > >  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
> > >  8 files changed, 59 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> > > index f66bdd5a2af..c6891486078 100644
> > > --- a/gcc/common/config/i386/i386-common.cc
> > > +++ b/gcc/common/config/i386/i386-common.cc
> > > @@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
> > >      /* The STC algorithm produces the smallest code at -Os, for x86.  */
> > >      { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
> > >        REORDER_BLOCKS_ALGORITHM_STC },
> > > +    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
> > >      /* Turn off -fschedule-insns by default.  It tends to make the
> > >         problem with not enough registers even worse.  */
> > >      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index c0f37149ed0..0f94a3b609e 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> > >    unsigned i;
> > >    unsigned mem_count = 0;
> > >
> > > +  /* Unroll small size loop when unroll factor is not explicitly
> > > +     specified.  */
> > > +  if (!(flag_unroll_loops
> > > +       || flag_unroll_all_loops
> > > +       || loop->unroll))
> > > +    {
> > > +      nunroll = 1;
> > > +
> > > +      /* Any explicit -f{no-}unroll-{all-}loops turns off
> > > +        -munroll-only-small-loops.  */
> > > +      if (ix86_unroll_only_small_loops
> > > +         && !OPTION_SET_P (flag_unroll_loops))
> > > +       if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
> >
> > either add braces or combine the two if's
> >
> > Otherwise the middle-end changes look OK.  The target maintainers need to decide
> > whether the two --params should be core tunings instead - I would assume that
> > given your rationale the decode and issue widths of the core plays an important
> > role here.  That might also suggest a single parameter instead and unrolling
> > (factor * issue_width) / loop->ninsns times instead of a static unroll_factor?
> Although ix86_small_unroll_insns is coming from issue_rate, it's tuned
> for codesize.
> Make it exact as issue_rate and using factor * issue_width /
> loop->ninsns may increase code size too much.
> So I prefer to add those 2 parameters to the cost table for core
> tunings instead of 1.
> >
> > Thanks,
> > Richard.
> >
> > > +         nunroll = (unsigned) ix86_small_unroll_factor;
> > > +
> > > +      return nunroll;
> > > +    }
> > > +
> > >    if (!TARGET_ADJUST_UNROLL)
> > >       return nunroll;
> > >
> > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > index 53d534f6392..6da9c8d670d 100644
> > > --- a/gcc/config/i386/i386.opt
> > > +++ b/gcc/config/i386/i386.opt
> > > @@ -1224,3 +1224,16 @@ mavxvnniint8
> > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > >  AVXVNNIINT8 built-in functions and code generation.
> > > +
> > > +munroll-only-small-loops
> > > +Target Var(ix86_unroll_only_small_loops) Init(0) Save
> > > +Enable conservative small loop unrolling.
> > > +
> > > +-param=x86-small-unroll-ninsns=
> > > +Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
> > > +Insturctions number limit for loop to be unrolled under
> > > +-munroll-only-small-loops.
> > > +
> > > +-param=x86-small-unroll-factor=
> > > +Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
> > > +Unroll factor for -munroll-only-small-loops.
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 550aec87809..487218bd0ce 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -15821,6 +15821,14 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> > >  @item x86-stlf-window-ninsns
> > >  Instructions number above which STFL stall penalty can be compensated.
> > >
> > > +@item x86-small-unroll-ninsns
> > > +If -munroll-only-small-loops is enabled, only unroll loops with instruction
> > > +count less than this parameter. The default value is 4.
> > > +
> > > +@item x86-small-unroll-factor
> > > +If -munroll-only-small-loops is enabled, reset the unroll factor with this
> > > +value. The default value is 2 which means the loop will be unrolled once.
> > > +
> > >  @end table
> > >
> > >  @end table
> > > @@ -25232,6 +25240,14 @@ environments where no dynamic link is performed, like firmwares, OS
> > >  kernels, executables linked with @option{-static} or @option{-static-pie}.
> > >  @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
> > >  @option{-fpic}.
> > > +
> > > +@item -munroll-only-small-loops
> > > +@itemx -mno-unroll-only-small-loops
> > > +@opindex munroll-only-small-loops
> > > +Controls conservative small loop unrolling. It is default enbaled by
> > > +O2, and unrolls loop with less than 4 insns by 1 time. Explicit
> > > +-f[no-]unroll-[all-]loops would disable this flag to avoid any
> > > +unintended unrolling behavior that user does not want.
> > >  @end table
> > >
> > >  @node M32C Options
> > > diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
> > > index b9e07973dd6..9789efa1e11 100644
> > > --- a/gcc/loop-init.cc
> > > +++ b/gcc/loop-init.cc
> > > @@ -565,9 +565,12 @@ public:
> > >    {}
> > >
> > >    /* opt_pass methods: */
> > > -  bool gate (function *) final override
> > > +  bool gate (function *fun) final override
> > >      {
> > > -      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
> > > +      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
> > > +             || (targetm.loop_unroll_adjust
> > > +                 && optimize >= 2
> > > +                 && optimize_function_for_speed_p (fun)));
> > >      }
> > >
> > >    unsigned int execute (function *) final override;
> > > @@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
> > >        if (dump_file)
> > >         df_dump (dump_file);
> > >
> > > -      if (flag_unroll_loops)
> > > +      if (flag_unroll_loops
> > > +         || targetm.loop_unroll_adjust)
> > >         flags |= UAP_UNROLL;
> > >        if (flag_unroll_all_loops)
> > >         flags |= UAP_UNROLL_ALL;
> > > diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > index 1b1f6d32271..a32ea445a3f 100644
> > > --- a/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > +++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > @@ -1,5 +1,7 @@
> > >  /* { dg-do run } */
> > >  /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
> > > +/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
> > > +
> > >
> > >  #include "../nop.h"
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > index 81841ef5bd7..cbc9fbb0450 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr86270.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O2" } */
> > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > >
> > >  int *a;
> > >  long len;
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > index 0248fcc00a5..f75a847f75d 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr93002.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > @@ -1,6 +1,6 @@
> > >  /* PR target/93002 */
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O2" } */
> > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > >  /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
> > >
> > >  volatile int sink;
> > > --
> > > 2.18.1
> > >
>
>
>
> --
> BR,
> Hongtao

[-- Attachment #2: 0001-Enable-small-loop-unrolling-for-O2.patch --]
[-- Type: text/x-patch, Size: 17926 bytes --]

From e7cd20bcc97481c05c015a4197e574ef9c746087 Mon Sep 17 00:00:00 2001
From: Hongyu Wang <hongyu.wang@intel.com>
Date: Thu, 8 Sep 2022 16:52:02 +0800
Subject: [PATCH] Enable small loop unrolling for O2

Modern processors have multi-way instruction decoders.
For x86, the decoder on icelake/zen3 is 5 uops wide, so for a small
loop with <= 4 instructions (usually 3 uops, since the cmp/jmp pair can
be macro-fused), the decoder has a 2-uop bubble in each iteration and
the pipeline cannot be fully utilized.

Therefore, this patch enables loop unrolling for small loops at O2 to
fill the decoder as much as possible. It turns on rtl loop unrolling
when targetm.loop_unroll_adjust exists, at O2 and above, and only when
optimizing for speed. In the x86 backend the default behavior is to
unroll small loops with at most 4 insns once.

This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
0.9% code size increase. For other benchmarks the variations are minor,
and overall code size increases by 0.2%.

The kernel image size increases by 0.06%, and there is no impact on
eembc.

gcc/ChangeLog:

	* common/config/i386/i386-common.cc
	(ix86_option_optimization_table): Enable small loop unroll at
	O2 by default.
	* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
	factor if -munroll-only-small-loops enabled and -funroll-loops/
	-funroll-all-loops are disabled.
	* config/i386/i386.h (struct processor_costs): Add two fields,
	small_unroll_ninsns and small_unroll_factor.
	* config/i386/i386.opt: Add -munroll-only-small-loops.
	* doc/invoke.texi: Document -munroll-only-small-loops.
	* loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
	loop unrolling for -O2-speed and above if target hook
	loop_unroll_adjust exists.
	(pass_rtl_unroll_loops::execute): Set UAP_UNROLL flag
	when target hook loop_unroll_adjust exists.
	* config/i386/x86-tune-costs.h: Update all processor costs
	with small_unroll_ninsns = 4 and small_unroll_factor = 2.

gcc/testsuite/ChangeLog:

	* gcc.dg/guality/loop-1.c: Add additional option
	  -mno-unroll-only-small-loops.
	* gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
	* gcc.target/i386/pr93002.c: Likewise.
---
 gcc/common/config/i386/i386-common.cc   |  1 +
 gcc/config/i386/i386.cc                 | 18 ++++++++
 gcc/config/i386/i386.h                  |  5 +++
 gcc/config/i386/i386.opt                |  4 ++
 gcc/config/i386/x86-tune-costs.h        | 60 +++++++++++++++++++++++++
 gcc/doc/invoke.texi                     |  8 ++++
 gcc/loop-init.cc                        | 10 +++--
 gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 +
 gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
 10 files changed, 107 insertions(+), 5 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
index 431fd0d3ad1..2f491b2f84b 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -1803,6 +1803,7 @@ static const struct default_options ix86_option_optimization_table[] =
     /* The STC algorithm produces the smallest code at -Os, for x86.  */
     { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
       REORDER_BLOCKS_ALGORITHM_STC },
+    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index f8586499cd1..292b32c5e99 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   unsigned i;
   unsigned mem_count = 0;
 
+  /* Unroll small size loop when unroll factor is not explicitly
+     specified.  */
+  if (!(flag_unroll_loops
+	|| flag_unroll_all_loops
+	|| loop->unroll))
+    {
+      nunroll = 1;
+
+      /* Any explicit -f{no-}unroll-{all-}loops turns off
+	 -munroll-only-small-loops.  */
+      if (ix86_unroll_only_small_loops
+	  && !OPTION_SET_P (flag_unroll_loops)
+	  && loop->ninsns <= ix86_cost->small_unroll_ninsns)
+	nunroll = ix86_cost->small_unroll_factor;
+
+      return nunroll;
+    }
+
   if (!TARGET_ADJUST_UNROLL)
      return nunroll;
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index b32db8da109..b7371d0f1a5 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -219,6 +219,11 @@ struct processor_costs {
   const char *const align_jump;		/* Jump alignment.  */
   const char *const align_label;	/* Label alignment.  */
   const char *const align_func;		/* Function alignment.  */
+
+  const unsigned small_unroll_ninsns;	/* Insn count limit for small loop
+					   to be unrolled.  */
+  const unsigned small_unroll_factor;   /* Unroll factor for small loop to
+					   be unrolled.  */
 };
 
 extern const struct processor_costs *ix86_cost;
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 415c52e1bb4..d6b80efa04d 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1246,3 +1246,7 @@ Support PREFETCHI built-in functions and code generation.
 mraoint
 Target Mask(ISA2_RAOINT) Var(ix86_isa_flags2) Save
 Support RAOINT built-in functions and code generation.
+
+munroll-only-small-loops
+Target Var(ix86_unroll_only_small_loops) Init(0) Save
+Enable conservative small loop unrolling.
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index aeaa7eb008e..f01b8ee9eef 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -135,6 +135,8 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* Processor costs (relative to an add) */
@@ -244,6 +246,8 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   "4",					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   "4",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs i486_memcpy[2] = {
@@ -354,6 +358,8 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs pentium_memcpy[2] = {
@@ -462,6 +468,8 @@ struct processor_costs pentium_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static const
@@ -563,6 +571,8 @@ struct processor_costs lakemont_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* PentiumPro has optimized rep instructions for blocks aligned by 8 bytes
@@ -679,6 +689,8 @@ struct processor_costs pentiumpro_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs geode_memcpy[2] = {
@@ -786,6 +798,8 @@ struct processor_costs geode_cost = {
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs k6_memcpy[2] = {
@@ -896,6 +910,8 @@ struct processor_costs k6_cost = {
   "32:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "32",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* For some reason, Athlon deals better with REP prefix (relative to loops)
@@ -1007,6 +1023,8 @@ struct processor_costs athlon_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* K8 has optimized REP instruction for medium sized blocks, but for very
@@ -1127,6 +1145,8 @@ struct processor_costs k8_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
@@ -1255,6 +1275,8 @@ struct processor_costs amdfam10_cost = {
   "32:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "32",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /*  BDVER has optimized REP instruction for medium sized blocks, but for
@@ -1376,6 +1398,8 @@ const struct processor_costs bdver_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "11",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 
@@ -1529,6 +1553,8 @@ struct processor_costs znver1_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /*  ZNVER2 has optimized REP instruction for medium sized blocks, but for
@@ -1686,6 +1712,8 @@ struct processor_costs znver2_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 struct processor_costs znver3_cost = {
@@ -1818,6 +1846,8 @@ struct processor_costs znver3_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* This table currently replicates znver3_cost table. */
@@ -1951,6 +1981,8 @@ struct processor_costs znver4_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
@@ -2075,6 +2107,8 @@ struct processor_costs skylake_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* icelake_cost should produce code tuned for Icelake family of CPUs.
@@ -2201,6 +2235,8 @@ struct processor_costs icelake_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* alderlake_cost should produce code tuned for alderlake family of CPUs.  */
@@ -2321,6 +2357,8 @@ struct processor_costs alderlake_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
@@ -2434,6 +2472,8 @@ const struct processor_costs btver1_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "11",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs btver2_memcpy[2] = {
@@ -2544,6 +2584,8 @@ const struct processor_costs btver2_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "11",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs pentium4_memcpy[2] = {
@@ -2653,6 +2695,8 @@ struct processor_costs pentium4_cost = {
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs nocona_memcpy[2] = {
@@ -2765,6 +2809,8 @@ struct processor_costs nocona_cost = {
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs atom_memcpy[2] = {
@@ -2875,6 +2921,8 @@ struct processor_costs atom_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs slm_memcpy[2] = {
@@ -2985,6 +3033,8 @@ struct processor_costs slm_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs tremont_memcpy[2] = {
@@ -3109,6 +3159,8 @@ struct processor_costs tremont_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs intel_memcpy[2] = {
@@ -3219,6 +3271,8 @@ struct processor_costs intel_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* lujiazui_cost should produce code tuned for ZHAOXIN lujiazui CPU.  */
@@ -3334,6 +3388,8 @@ struct processor_costs lujiazui_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* Generic should produce code tuned for Core-i7 (and newer chips)
@@ -3453,6 +3509,8 @@ struct processor_costs generic_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* core_cost should produce code tuned for Core familly of CPUs.  */
@@ -3579,5 +3637,7 @@ struct processor_costs core_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 975ee64103f..d94028e4a25 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -25262,6 +25262,14 @@ environments where no dynamic link is performed, like firmwares, OS
 kernels, executables linked with @option{-static} or @option{-static-pie}.
 @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
 @option{-fpic}.
+
+@item -munroll-only-small-loops
+@itemx -mno-unroll-only-small-loops
+@opindex munroll-only-small-loops
+Controls conservative small loop unrolling.  It is enabled by default
+at O2, and unrolls loops with at most 4 insns once.  An explicit
+-f[no-]unroll-[all-]loops disables this option, to avoid any
+unrolling behavior that the user did not ask for.
 @end table
 
 @node M32C Options
diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
index b9e07973dd6..9789efa1e11 100644
--- a/gcc/loop-init.cc
+++ b/gcc/loop-init.cc
@@ -565,9 +565,12 @@ public:
   {}
 
   /* opt_pass methods: */
-  bool gate (function *) final override
+  bool gate (function *fun) final override
     {
-      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
+      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
+	      || (targetm.loop_unroll_adjust
+		  && optimize >= 2
+		  && optimize_function_for_speed_p (fun)));
     }
 
   unsigned int execute (function *) final override;
@@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
       if (dump_file)
 	df_dump (dump_file);
 
-      if (flag_unroll_loops)
+      if (flag_unroll_loops
+	  || targetm.loop_unroll_adjust)
 	flags |= UAP_UNROLL;
       if (flag_unroll_all_loops)
 	flags |= UAP_UNROLL_ALL;
diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
index 1b1f6d32271..a32ea445a3f 100644
--- a/gcc/testsuite/gcc.dg/guality/loop-1.c
+++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
@@ -1,5 +1,7 @@
 /* { dg-do run } */
 /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
+/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
+
 
 #include "../nop.h"
 
diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
index 81841ef5bd7..cbc9fbb0450 100644
--- a/gcc/testsuite/gcc.target/i386/pr86270.c
+++ b/gcc/testsuite/gcc.target/i386/pr86270.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 
 int *a;
 long len;
diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
index 0248fcc00a5..f75a847f75d 100644
--- a/gcc/testsuite/gcc.target/i386/pr93002.c
+++ b/gcc/testsuite/gcc.target/i386/pr93002.c
@@ -1,6 +1,6 @@
 /* PR target/93002 */
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
 
 volatile int sink;
-- 
2.18.1


* Re: [PATCH V2] Enable small loop unrolling for O2
  2022-11-09  1:24     ` Hongyu Wang
@ 2022-11-14  1:35       ` Hongtao Liu
  2022-11-14  5:32         ` Hongyu Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Hongtao Liu @ 2022-11-14  1:35 UTC (permalink / raw)
  To: Hongyu Wang
  Cc: Richard Biener, Hongyu Wang, gcc-patches, ubizjak, hongtao.liu

On Wed, Nov 9, 2022 at 9:29 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> > Although ix86_small_unroll_insns is coming from issue_rate, it's tuned
> > for codesize.
> > Make it exact as issue_rate and using factor * issue_width /
> > loop->ninsns may increase code size too much.
> > So I prefer to add those 2 parameters to the cost table for core
> > tunings instead of 1.
>
> Yes, here is the updated patch that changes the cost table.
>
> Bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
OK. Note that the GCC documentation has been ported to Sphinx, so you need to
move the invoke.texi changes to the new Sphinx files.
>
> On Tue, Nov 8, 2022 at 11:05, Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Mon, Nov 7, 2022 at 10:25 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
> > > >
> > > > Hi, this is the updated patch of
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html,
> > > > which uses targetm.loop_unroll_adjust as gate to enable small loop unroll.
> > > >
> > > > This patch does not change rs6000/s390 since I don't have machine to
> > > > test them, but I suppose the default behavior is the same since they
> > > > enable flag_unroll_loops at O2.
> > > >
> > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > >
> > > > Ok for trunk?
> > > >
> > > > ---------- Patch content --------
> > > >
> > > > Modern processors has multiple way instruction decoders
> > > > For x86, icelake/zen3 has 5 uops, so for small loop with <= 4
> > > > instructions (usually has 3 uops with a cmp/jmp pair that can be
> > > > macro-fused), the decoder would have 2 uops bubble for each iteration
> > > > and the pipeline could not be fully utilized.
> > > >
> > > > Therefore, this patch enables loop unrolling for small size loop at O2
> > > > to fullfill the decoder as much as possible. It turns on rtl loop
> > > > unrolling when targetm.loop_unroll_adjust exists and O2 plus speed only.
> > > > In x86 backend the default behavior is to unroll small loops with less
> > > > than 4 insns by 1 time.
> > > >
> > > > This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with
> > > > 0.9% codesize increment. For other benchmarks the variants are minor
> > > > and overall codesize increased by 0.2%.
> > > >
> > > > The kernel image size increased by 0.06%, and no impact on eembc.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * common/config/i386/i386-common.cc (ix86_optimization_table):
> > > >         Enable small loop unroll at O2 by default.
> > > >         * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
> > > >         factor if -munroll-only-small-loops enabled and -funroll-loops/
> > > >         -funroll-all-loops are disabled.
> > > >         * config/i386/i386.opt: Add -munroll-only-small-loops,
> > > >         -param=x86-small-unroll-ninsns= for loop insn limit,
> > > >         -param=x86-small-unroll-factor= for unroll factor.
> > > >         * doc/invoke.texi: Document -munroll-only-small-loops,
> > > >         x86-small-unroll-ninsns and x86-small-unroll-factor.
> > > >         * loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
> > > >         loop unrolling for -O2-speed and above if target hook
> > > >         loop_unroll_adjust exists.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.dg/guality/loop-1.c: Add additional option
> > > >           -mno-unroll-only-small-loops.
> > > >         * gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
> > > >         * gcc.target/i386/pr93002.c: Likewise.
> > > > ---
> > > >  gcc/common/config/i386/i386-common.cc   |  1 +
> > > >  gcc/config/i386/i386.cc                 | 18 ++++++++++++++++++
> > > >  gcc/config/i386/i386.opt                | 13 +++++++++++++
> > > >  gcc/doc/invoke.texi                     | 16 ++++++++++++++++
> > > >  gcc/loop-init.cc                        | 10 +++++++---
> > > >  gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
> > > >  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
> > > >  gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
> > > >  8 files changed, 59 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> > > > index f66bdd5a2af..c6891486078 100644
> > > > --- a/gcc/common/config/i386/i386-common.cc
> > > > +++ b/gcc/common/config/i386/i386-common.cc
> > > > @@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
> > > >      /* The STC algorithm produces the smallest code at -Os, for x86.  */
> > > >      { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
> > > >        REORDER_BLOCKS_ALGORITHM_STC },
> > > > +    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
> > > >      /* Turn off -fschedule-insns by default.  It tends to make the
> > > >         problem with not enough registers even worse.  */
> > > >      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index c0f37149ed0..0f94a3b609e 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> > > >    unsigned i;
> > > >    unsigned mem_count = 0;
> > > >
> > > > +  /* Unroll small size loop when unroll factor is not explicitly
> > > > +     specified.  */
> > > > +  if (!(flag_unroll_loops
> > > > +       || flag_unroll_all_loops
> > > > +       || loop->unroll))
> > > > +    {
> > > > +      nunroll = 1;
> > > > +
> > > > +      /* Any explicit -f{no-}unroll-{all-}loops turns off
> > > > +        -munroll-only-small-loops.  */
> > > > +      if (ix86_unroll_only_small_loops
> > > > +         && !OPTION_SET_P (flag_unroll_loops))
> > > > +       if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
> > >
> > > either add braces or combine the two if's
> > >
> > > Otherwise the middle-end changes look OK.  The target maintainers need to decide
> > > whether the two --params should be core tunings instead - I would assume that
> > > given your rationale the decode and issue widths of the core plays an important
> > > role here.  That might also suggest a single parameter instead and unrolling
> > > (factor * issue_width) / loop->ninsns times instead of a static unroll_factor?
> > Although ix86_small_unroll_insns is coming from issue_rate, it's tuned
> > for codesize.
> > Make it exact as issue_rate and using factor * issue_width /
> > loop->ninsns may increase code size too much.
> > So I prefer to add those 2 parameters to the cost table for core
> > tunings instead of 1.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > +         nunroll = (unsigned) ix86_small_unroll_factor;
> > > > +
> > > > +      return nunroll;
> > > > +    }
> > > > +
> > > >    if (!TARGET_ADJUST_UNROLL)
> > > >       return nunroll;
> > > >
> > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > index 53d534f6392..6da9c8d670d 100644
> > > > --- a/gcc/config/i386/i386.opt
> > > > +++ b/gcc/config/i386/i386.opt
> > > > @@ -1224,3 +1224,16 @@ mavxvnniint8
> > > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > > >  AVXVNNIINT8 built-in functions and code generation.
> > > > +
> > > > +munroll-only-small-loops
> > > > +Target Var(ix86_unroll_only_small_loops) Init(0) Save
> > > > +Enable conservative small loop unrolling.
> > > > +
> > > > +-param=x86-small-unroll-ninsns=
> > > > +Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
> > > > +Insturctions number limit for loop to be unrolled under
> > > > +-munroll-only-small-loops.
> > > > +
> > > > +-param=x86-small-unroll-factor=
> > > > +Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
> > > > +Unroll factor for -munroll-only-small-loops.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 550aec87809..487218bd0ce 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -15821,6 +15821,14 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> > > >  @item x86-stlf-window-ninsns
> > > >  Instructions number above which STFL stall penalty can be compensated.
> > > >
> > > > +@item x86-small-unroll-ninsns
> > > > +If -munroll-only-small-loops is enabled, only unroll loops with instruction
> > > > +count less than this parameter. The default value is 4.
> > > > +
> > > > +@item x86-small-unroll-factor
> > > > +If -munroll-only-small-loops is enabled, reset the unroll factor with this
> > > > +value. The default value is 2 which means the loop will be unrolled once.
> > > > +
> > > >  @end table
> > > >
> > > >  @end table
> > > > @@ -25232,6 +25240,14 @@ environments where no dynamic link is performed, like firmwares, OS
> > > >  kernels, executables linked with @option{-static} or @option{-static-pie}.
> > > >  @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
> > > >  @option{-fpic}.
> > > > +
> > > > +@item -munroll-only-small-loops
> > > > +@itemx -mno-unroll-only-small-loops
> > > > +@opindex munroll-only-small-loops
> > > > +Controls conservative small loop unrolling. It is default enbaled by
> > > > +O2, and unrolls loop with less than 4 insns by 1 time. Explicit
> > > > +-f[no-]unroll-[all-]loops would disable this flag to avoid any
> > > > +unintended unrolling behavior that user does not want.
> > > >  @end table
> > > >
> > > >  @node M32C Options
> > > > diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
> > > > index b9e07973dd6..9789efa1e11 100644
> > > > --- a/gcc/loop-init.cc
> > > > +++ b/gcc/loop-init.cc
> > > > @@ -565,9 +565,12 @@ public:
> > > >    {}
> > > >
> > > >    /* opt_pass methods: */
> > > > -  bool gate (function *) final override
> > > > +  bool gate (function *fun) final override
> > > >      {
> > > > -      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
> > > > +      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
> > > > +             || (targetm.loop_unroll_adjust
> > > > +                 && optimize >= 2
> > > > +                 && optimize_function_for_speed_p (fun)));
> > > >      }
> > > >
> > > >    unsigned int execute (function *) final override;
> > > > @@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
> > > >        if (dump_file)
> > > >         df_dump (dump_file);
> > > >
> > > > -      if (flag_unroll_loops)
> > > > +      if (flag_unroll_loops
> > > > +         || targetm.loop_unroll_adjust)
> > > >         flags |= UAP_UNROLL;
> > > >        if (flag_unroll_all_loops)
> > > >         flags |= UAP_UNROLL_ALL;
> > > > diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > index 1b1f6d32271..a32ea445a3f 100644
> > > > --- a/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > +++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > @@ -1,5 +1,7 @@
> > > >  /* { dg-do run } */
> > > >  /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
> > > > +/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
> > > > +
> > > >
> > > >  #include "../nop.h"
> > > >
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > index 81841ef5bd7..cbc9fbb0450 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2" } */
> > > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > > >
> > > >  int *a;
> > > >  long len;
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > index 0248fcc00a5..f75a847f75d 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > @@ -1,6 +1,6 @@
> > > >  /* PR target/93002 */
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2" } */
> > > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > > >  /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
> > > >
> > > >  volatile int sink;
> > > > --
> > > > 2.18.1
> > > >
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

* Re: [PATCH V2] Enable small loop unrolling for O2
  2022-11-14  1:35       ` Hongtao Liu
@ 2022-11-14  5:32         ` Hongyu Wang
  0 siblings, 0 replies; 8+ messages in thread
From: Hongyu Wang @ 2022-11-14  5:32 UTC (permalink / raw)
  To: Hongtao Liu
  Cc: Richard Biener, Hongyu Wang, gcc-patches, ubizjak, hongtao.liu

[-- Attachment #1: Type: text/plain, Size: 13635 bytes --]

> Ok, Note GCC documents have been ported to sphinx, so you need to
> adjust changes in invoke.texi to new sphinx files.

Yes, the attached patch is the version I'm going to check in; it updates
the sphinx documentation files instead of invoke.texi. Thanks.
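
For anyone skimming the thread, the decision that the small-loop path in
ix86_loop_unroll_adjust makes can be sketched outside of GCC roughly as
below. This is only an illustration under stated assumptions: struct
unroll_env, its field names and small_loop_unroll_adjust are hypothetical
stand-ins for the GCC flags and the two cost-table fields, a single bool
models OPTION_SET_P (flag_unroll_loops), and the explicit-request case
simply returns nunroll unchanged, whereas the real hook goes on to its
TARGET_ADJUST_UNROLL handling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the GCC state the hook consults.  */
struct unroll_env
{
  bool flag_unroll_loops;        /* -funroll-loops is in effect.  */
  bool flag_unroll_all_loops;    /* -funroll-all-loops is in effect.  */
  bool unroll_loops_set_by_user; /* Models OPTION_SET_P (flag_unroll_loops).  */
  bool unroll_only_small_loops;  /* -munroll-only-small-loops is in effect.  */
  unsigned small_unroll_ninsns;  /* Insn count limit, 4 in the cost tables.  */
  unsigned small_unroll_factor;  /* Unroll factor, 2 in the cost tables.  */
};

/* Return the unroll factor chosen for a loop with NINSNS insns.
   PRAGMA_UNROLL is nonzero when the loop carries an explicit unroll
   request (loop->unroll in GCC).  */
unsigned
small_loop_unroll_adjust (const struct unroll_env *env, unsigned nunroll,
                          unsigned pragma_unroll, unsigned ninsns)
{
  assert (env);

  /* An explicit unrolling request: this sketch leaves NUNROLL alone.  */
  if (env->flag_unroll_loops || env->flag_unroll_all_loops || pragma_unroll)
    return nunroll;

  /* Small loop, and the user did not pass -f[no-]unroll-loops.  */
  if (env->unroll_only_small_loops
      && !env->unroll_loops_set_by_user
      && ninsns <= env->small_unroll_ninsns)
    return env->small_unroll_factor;

  /* Everything else is left rolled (factor 1).  */
  return 1;
}
```

With the values from the cost tables (limit 4, factor 2), a 4-insn loop
at plain -O2 gets factor 2, a 5-insn loop stays at 1, and an explicit
-fno-unroll-loops also yields 1 even for a small loop.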

On Mon, Nov 14, 2022 at 9:35 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Nov 9, 2022 at 9:29 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> >
> > > Although ix86_small_unroll_insns is coming from issue_rate, it's tuned
> > > for codesize.
> > > Make it exact as issue_rate and using factor * issue_width /
> > > loop->ninsns may increase code size too much.
> > > So I prefer to add those 2 parameters to the cost table for core
> > > tunings instead of 1.
> >
> > Yes, here is the updated patch that changes the cost table.
> >
> > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > Ok for trunk?
> Ok, Note GCC documents have been ported to sphinx, so you need to
> adjust changes in invoke.texi to new sphinx files.
> >
> > On Tue, Nov 8, 2022 at 11:05 AM Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Mon, Nov 7, 2022 at 10:25 PM Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang <hongyu.wang@intel.com> wrote:
> > > > >
> > > > > Hi, this is the updated patch of
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html,
> > > > > which uses targetm.loop_unroll_adjust as gate to enable small loop unroll.
> > > > >
> > > > > This patch does not change rs6000/s390 since I don't have machine to
> > > > > test them, but I suppose the default behavior is the same since they
> > > > > enable flag_unroll_loops at O2.
> > > > >
> > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > >
> > > > > Ok for trunk?
> > > > >
> > > > > ---------- Patch content --------
> > > > >
> > > > > Modern processors has multiple way instruction decoders
> > > > > For x86, icelake/zen3 has 5 uops, so for small loop with <= 4
> > > > > instructions (usually has 3 uops with a cmp/jmp pair that can be
> > > > > macro-fused), the decoder would have 2 uops bubble for each iteration
> > > > > and the pipeline could not be fully utilized.
> > > > >
> > > > > Therefore, this patch enables loop unrolling for small size loop at O2
> > > > > to fullfill the decoder as much as possible. It turns on rtl loop
> > > > > unrolling when targetm.loop_unroll_adjust exists and O2 plus speed only.
> > > > > In x86 backend the default behavior is to unroll small loops with less
> > > > > than 4 insns by 1 time.
> > > > >
> > > > > This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with
> > > > > 0.9% codesize increment. For other benchmarks the variants are minor
> > > > > and overall codesize increased by 0.2%.
> > > > >
> > > > > The kernel image size increased by 0.06%, and no impact on eembc.
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * common/config/i386/i386-common.cc (ix86_optimization_table):
> > > > >         Enable small loop unroll at O2 by default.
> > > > >         * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
> > > > >         factor if -munroll-only-small-loops enabled and -funroll-loops/
> > > > >         -funroll-all-loops are disabled.
> > > > >         * config/i386/i386.opt: Add -munroll-only-small-loops,
> > > > >         -param=x86-small-unroll-ninsns= for loop insn limit,
> > > > >         -param=x86-small-unroll-factor= for unroll factor.
> > > > >         * doc/invoke.texi: Document -munroll-only-small-loops,
> > > > >         x86-small-unroll-ninsns and x86-small-unroll-factor.
> > > > >         * loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
> > > > >         loop unrolling for -O2-speed and above if target hook
> > > > >         loop_unroll_adjust exists.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/guality/loop-1.c: Add additional option
> > > > >           -mno-unroll-only-small-loops.
> > > > >         * gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
> > > > >         * gcc.target/i386/pr93002.c: Likewise.
> > > > > ---
> > > > >  gcc/common/config/i386/i386-common.cc   |  1 +
> > > > >  gcc/config/i386/i386.cc                 | 18 ++++++++++++++++++
> > > > >  gcc/config/i386/i386.opt                | 13 +++++++++++++
> > > > >  gcc/doc/invoke.texi                     | 16 ++++++++++++++++
> > > > >  gcc/loop-init.cc                        | 10 +++++++---
> > > > >  gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
> > > > >  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
> > > > >  gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
> > > > >  8 files changed, 59 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> > > > > index f66bdd5a2af..c6891486078 100644
> > > > > --- a/gcc/common/config/i386/i386-common.cc
> > > > > +++ b/gcc/common/config/i386/i386-common.cc
> > > > > @@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
> > > > >      /* The STC algorithm produces the smallest code at -Os, for x86.  */
> > > > >      { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
> > > > >        REORDER_BLOCKS_ALGORITHM_STC },
> > > > > +    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
> > > > >      /* Turn off -fschedule-insns by default.  It tends to make the
> > > > >         problem with not enough registers even worse.  */
> > > > >      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index c0f37149ed0..0f94a3b609e 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> > > > >    unsigned i;
> > > > >    unsigned mem_count = 0;
> > > > >
> > > > > +  /* Unroll small size loop when unroll factor is not explicitly
> > > > > +     specified.  */
> > > > > +  if (!(flag_unroll_loops
> > > > > +       || flag_unroll_all_loops
> > > > > +       || loop->unroll))
> > > > > +    {
> > > > > +      nunroll = 1;
> > > > > +
> > > > > +      /* Any explicit -f{no-}unroll-{all-}loops turns off
> > > > > +        -munroll-only-small-loops.  */
> > > > > +      if (ix86_unroll_only_small_loops
> > > > > +         && !OPTION_SET_P (flag_unroll_loops))
> > > > > +       if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
> > > >
> > > > either add braces or combine the two if's
> > > >
> > > > Otherwise the middle-end changes look OK.  The target maintainers need to decide
> > > > whether the two --params should be core tunings instead - I would assume that
> > > > given your rationale the decode and issue widths of the core plays an important
> > > > role here.  That might also suggest a single parameter instead and unrolling
> > > > (factor * issue_width) / loop->ninsns times instead of a static unroll_factor?
> > > Although ix86_small_unroll_insns is coming from issue_rate, it's tuned
> > > for codesize.
> > > Make it exact as issue_rate and using factor * issue_width /
> > > loop->ninsns may increase code size too much.
> > > So I prefer to add those 2 parameters to the cost table for core
> > > tunings instead of 1.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > +         nunroll = (unsigned) ix86_small_unroll_factor;
> > > > > +
> > > > > +      return nunroll;
> > > > > +    }
> > > > > +
> > > > >    if (!TARGET_ADJUST_UNROLL)
> > > > >       return nunroll;
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > > index 53d534f6392..6da9c8d670d 100644
> > > > > --- a/gcc/config/i386/i386.opt
> > > > > +++ b/gcc/config/i386/i386.opt
> > > > > @@ -1224,3 +1224,16 @@ mavxvnniint8
> > > > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > > > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > > > >  AVXVNNIINT8 built-in functions and code generation.
> > > > > +
> > > > > +munroll-only-small-loops
> > > > > +Target Var(ix86_unroll_only_small_loops) Init(0) Save
> > > > > +Enable conservative small loop unrolling.
> > > > > +
> > > > > +-param=x86-small-unroll-ninsns=
> > > > > +Target Joined UInteger Var(ix86_small_unroll_ninsns) Init(4) Param
> > > > > +Insturctions number limit for loop to be unrolled under
> > > > > +-munroll-only-small-loops.
> > > > > +
> > > > > +-param=x86-small-unroll-factor=
> > > > > +Target Joined UInteger Var(ix86_small_unroll_factor) Init(2) Param
> > > > > +Unroll factor for -munroll-only-small-loops.
> > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > index 550aec87809..487218bd0ce 100644
> > > > > --- a/gcc/doc/invoke.texi
> > > > > +++ b/gcc/doc/invoke.texi
> > > > > @@ -15821,6 +15821,14 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> > > > >  @item x86-stlf-window-ninsns
> > > > >  Instructions number above which STFL stall penalty can be compensated.
> > > > >
> > > > > +@item x86-small-unroll-ninsns
> > > > > +If -munroll-only-small-loops is enabled, only unroll loops with instruction
> > > > > +count less than this parameter. The default value is 4.
> > > > > +
> > > > > +@item x86-small-unroll-factor
> > > > > +If -munroll-only-small-loops is enabled, reset the unroll factor with this
> > > > > +value. The default value is 2 which means the loop will be unrolled once.
> > > > > +
> > > > >  @end table
> > > > >
> > > > >  @end table
> > > > > @@ -25232,6 +25240,14 @@ environments where no dynamic link is performed, like firmwares, OS
> > > > >  kernels, executables linked with @option{-static} or @option{-static-pie}.
> > > > >  @option{-mdirect-extern-access} is not compatible with @option{-fPIC} or
> > > > >  @option{-fpic}.
> > > > > +
> > > > > +@item -munroll-only-small-loops
> > > > > +@itemx -mno-unroll-only-small-loops
> > > > > +@opindex munroll-only-small-loops
> > > > > +Controls conservative small loop unrolling. It is default enbaled by
> > > > > +O2, and unrolls loop with less than 4 insns by 1 time. Explicit
> > > > > +-f[no-]unroll-[all-]loops would disable this flag to avoid any
> > > > > +unintended unrolling behavior that user does not want.
> > > > >  @end table
> > > > >
> > > > >  @node M32C Options
> > > > > diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
> > > > > index b9e07973dd6..9789efa1e11 100644
> > > > > --- a/gcc/loop-init.cc
> > > > > +++ b/gcc/loop-init.cc
> > > > > @@ -565,9 +565,12 @@ public:
> > > > >    {}
> > > > >
> > > > >    /* opt_pass methods: */
> > > > > -  bool gate (function *) final override
> > > > > +  bool gate (function *fun) final override
> > > > >      {
> > > > > -      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
> > > > > +      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
> > > > > +             || (targetm.loop_unroll_adjust
> > > > > +                 && optimize >= 2
> > > > > +                 && optimize_function_for_speed_p (fun)));
> > > > >      }
> > > > >
> > > > >    unsigned int execute (function *) final override;
> > > > > @@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
> > > > >        if (dump_file)
> > > > >         df_dump (dump_file);
> > > > >
> > > > > -      if (flag_unroll_loops)
> > > > > +      if (flag_unroll_loops
> > > > > +         || targetm.loop_unroll_adjust)
> > > > >         flags |= UAP_UNROLL;
> > > > >        if (flag_unroll_all_loops)
> > > > >         flags |= UAP_UNROLL_ALL;
> > > > > diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > > index 1b1f6d32271..a32ea445a3f 100644
> > > > > --- a/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > > +++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
> > > > > @@ -1,5 +1,7 @@
> > > > >  /* { dg-do run } */
> > > > >  /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
> > > > > +/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
> > > > > +
> > > > >
> > > > >  #include "../nop.h"
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > > index 81841ef5bd7..cbc9fbb0450 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr86270.c
> > > > > @@ -1,5 +1,5 @@
> > > > >  /* { dg-do compile } */
> > > > > -/* { dg-options "-O2" } */
> > > > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > > > >
> > > > >  int *a;
> > > > >  long len;
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > > index 0248fcc00a5..f75a847f75d 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr93002.c
> > > > > @@ -1,6 +1,6 @@
> > > > >  /* PR target/93002 */
> > > > >  /* { dg-do compile } */
> > > > > -/* { dg-options "-O2" } */
> > > > > +/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
> > > > >  /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
> > > > >
> > > > >  volatile int sink;
> > > > > --
> > > > > 2.18.1
> > > > >
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
>
>
>
> --
> BR,
> Hongtao
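
As a source-level picture of what "unroll factor 2" (unrolled once) means
for a small loop, here is a minimal sketch. It is hand-written C for
illustration only, not the output of the RTL unroller, which works on
insns and may keep the exit test in each copy; sum_rolled and
sum_unrolled2 are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* The original small loop: one add plus the loop control.  */
long
sum_rolled (const int *a, size_t n)
{
  assert (a || n == 0);
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* The same loop with the body duplicated once (factor 2), so one
   backward branch covers two elements.  A leftover element is
   handled after the loop when N is odd.  */
long
sum_unrolled2 (const int *a, size_t n)
{
  assert (a || n == 0);
  long s = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2)
    {
      s += a[i];
      s += a[i + 1];
    }
  if (i < n)
    s += a[i];
  return s;
}
```

Both functions return the same sum for every n; the unrolled one trades a
slightly larger body for fewer taken branches per element, which is the
code size versus throughput trade-off discussed in this thread.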

[-- Attachment #2: 0001-Enable-small-loop-unrolling-for-O2.patch --]
[-- Type: text/x-patch, Size: 19023 bytes --]

From 7790175ccf9a37056f4d1234d914fc924cd1812e Mon Sep 17 00:00:00 2001
From: Hongyu Wang <hongyu.wang@intel.com>
Date: Thu, 8 Sep 2022 16:52:02 +0800
Subject: [PATCH] Enable small loop unrolling for O2

Modern processors have multi-way instruction decoders.
For x86, icelake/zen3 has a 5-uop decode width, so for a small loop
with <= 4 instructions (usually 3 uops, since the cmp/jmp pair can be
macro-fused), the decoder has a 2-uop bubble in each iteration and the
pipeline cannot be fully utilized.

Therefore, this patch enables loop unrolling for small loops at O2 to
fill the decoder as much as possible. It turns on rtl loop unrolling
when targetm.loop_unroll_adjust exists, at O2 and above, and only when
optimizing for speed. In the x86 backend the default behavior is to
unroll small loops with at most 4 insns once (unroll factor 2).

This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
0.9% code size increase. For other benchmarks the variations are minor
and the overall code size increases by 0.2%.

The kernel image size increases by 0.06%, and there is no impact on
eembc.

gcc/ChangeLog:

	* common/config/i386/i386-common.cc (ix86_option_optimization_table):
	Enable small loop unroll at O2 by default.
	* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
	factor if -munroll-only-small-loops enabled and -funroll-loops/
	-funroll-all-loops are disabled.
	* config/i386/i386.h (struct processor_costs): Add two fields,
	small_unroll_ninsns and small_unroll_factor.
	* config/i386/i386.opt: Add -munroll-only-small-loops.
	* doc/gcc/gcc-command-options/machine-dependent-options/x86-options.rst:
	Document -munroll-only-small-loops.
	* doc/gcc/gcc-command-options/option-summary.rst: Likewise.
	* loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
	loop unrolling for -O2-speed and above if target hook
	loop_unroll_adjust exists.
	(pass_rtl_unroll_loops::execute): Set UAP_UNROLL flag
	when target hook loop_unroll_adjust exists.
	* config/i386/x86-tune-costs.h: Update all processor costs
	with small_unroll_ninsns = 4 and small_unroll_factor = 2.

gcc/testsuite/ChangeLog:

	* gcc.dg/guality/loop-1.c: Add additional option
	-mno-unroll-only-small-loops.
	* gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
	* gcc.target/i386/pr93002.c: Likewise.
---
 gcc/common/config/i386/i386-common.cc         |  1 +
 gcc/config/i386/i386.cc                       | 18 ++++++
 gcc/config/i386/i386.h                        |  5 ++
 gcc/config/i386/i386.opt                      |  4 ++
 gcc/config/i386/x86-tune-costs.h              | 60 +++++++++++++++++++
 .../machine-dependent-options/x86-options.rst |  6 ++
 .../gcc-command-options/option-summary.rst    |  3 +-
 gcc/loop-init.cc                              | 10 +++-
 gcc/testsuite/gcc.dg/guality/loop-1.c         |  2 +
 gcc/testsuite/gcc.target/i386/pr86270.c       |  2 +-
 gcc/testsuite/gcc.target/i386/pr93002.c       |  2 +-
 11 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
index 431fd0d3ad1..2f491b2f84b 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -1803,6 +1803,7 @@ static const struct default_options ix86_option_optimization_table[] =
     /* The STC algorithm produces the smallest code at -Os, for x86.  */
     { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
       REORDER_BLOCKS_ALGORITHM_STC },
+    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index f8586499cd1..292b32c5e99 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   unsigned i;
   unsigned mem_count = 0;
 
+  /* Unroll small size loop when unroll factor is not explicitly
+     specified.  */
+  if (!(flag_unroll_loops
+	|| flag_unroll_all_loops
+	|| loop->unroll))
+    {
+      nunroll = 1;
+
+      /* Any explicit -f{no-}unroll-{all-}loops turns off
+	 -munroll-only-small-loops.  */
+      if (ix86_unroll_only_small_loops
+	  && !OPTION_SET_P (flag_unroll_loops)
+	  && loop->ninsns <= ix86_cost->small_unroll_ninsns)
+	nunroll = ix86_cost->small_unroll_factor;
+
+      return nunroll;
+    }
+
   if (!TARGET_ADJUST_UNROLL)
      return nunroll;
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index a5ad9f387f7..3869db8f2d3 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -219,6 +219,11 @@ struct processor_costs {
   const char *const align_jump;		/* Jump alignment.  */
   const char *const align_label;	/* Label alignment.  */
   const char *const align_func;		/* Function alignment.  */
+
+  const unsigned small_unroll_ninsns;	/* Insn count limit for small loop
+					   to be unrolled.  */
+  const unsigned small_unroll_factor;   /* Unroll factor for small loop to
+					   be unrolled.  */
 };
 
 extern const struct processor_costs *ix86_cost;
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 415c52e1bb4..d6b80efa04d 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1246,3 +1246,7 @@ Support PREFETCHI built-in functions and code generation.
 mraoint
 Target Mask(ISA2_RAOINT) Var(ix86_isa_flags2) Save
 Support RAOINT built-in functions and code generation.
+
+munroll-only-small-loops
+Target Var(ix86_unroll_only_small_loops) Init(0) Save
+Enable conservative small loop unrolling.
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index aeaa7eb008e..f01b8ee9eef 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -135,6 +135,8 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* Processor costs (relative to an add) */
@@ -244,6 +246,8 @@ struct processor_costs i386_cost = {	/* 386 specific costs */
   "4",					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   "4",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs i486_memcpy[2] = {
@@ -354,6 +358,8 @@ struct processor_costs i486_cost = {	/* 486 specific costs */
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs pentium_memcpy[2] = {
@@ -462,6 +468,8 @@ struct processor_costs pentium_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static const
@@ -563,6 +571,8 @@ struct processor_costs lakemont_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* PentiumPro has optimized rep instructions for blocks aligned by 8 bytes
@@ -679,6 +689,8 @@ struct processor_costs pentiumpro_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs geode_memcpy[2] = {
@@ -786,6 +798,8 @@ struct processor_costs geode_cost = {
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs k6_memcpy[2] = {
@@ -896,6 +910,8 @@ struct processor_costs k6_cost = {
   "32:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "32",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* For some reason, Athlon deals better with REP prefix (relative to loops)
@@ -1007,6 +1023,8 @@ struct processor_costs athlon_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* K8 has optimized REP instruction for medium sized blocks, but for very
@@ -1127,6 +1145,8 @@ struct processor_costs k8_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
@@ -1255,6 +1275,8 @@ struct processor_costs amdfam10_cost = {
   "32:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "32",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /*  BDVER has optimized REP instruction for medium sized blocks, but for
@@ -1376,6 +1398,8 @@ const struct processor_costs bdver_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "11",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 
@@ -1529,6 +1553,8 @@ struct processor_costs znver1_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /*  ZNVER2 has optimized REP instruction for medium sized blocks, but for
@@ -1686,6 +1712,8 @@ struct processor_costs znver2_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 struct processor_costs znver3_cost = {
@@ -1818,6 +1846,8 @@ struct processor_costs znver3_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* This table currently replicates znver3_cost table. */
@@ -1951,6 +1981,8 @@ struct processor_costs znver4_cost = {
   "16",					/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
@@ -2075,6 +2107,8 @@ struct processor_costs skylake_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* icelake_cost should produce code tuned for Icelake family of CPUs.
@@ -2201,6 +2235,8 @@ struct processor_costs icelake_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* alderlake_cost should produce code tuned for alderlake family of CPUs.  */
@@ -2321,6 +2357,8 @@ struct processor_costs alderlake_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
   /* BTVER1 has optimized REP instruction for medium sized blocks, but for
@@ -2434,6 +2472,8 @@ const struct processor_costs btver1_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "11",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs btver2_memcpy[2] = {
@@ -2544,6 +2584,8 @@ const struct processor_costs btver2_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "11",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs pentium4_memcpy[2] = {
@@ -2653,6 +2695,8 @@ struct processor_costs pentium4_cost = {
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs nocona_memcpy[2] = {
@@ -2765,6 +2809,8 @@ struct processor_costs nocona_cost = {
   NULL,					/* Jump alignment.  */
   NULL,					/* Label alignment.  */
   NULL,					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs atom_memcpy[2] = {
@@ -2875,6 +2921,8 @@ struct processor_costs atom_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs slm_memcpy[2] = {
@@ -2985,6 +3033,8 @@ struct processor_costs slm_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs tremont_memcpy[2] = {
@@ -3109,6 +3159,8 @@ struct processor_costs tremont_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 static stringop_algs intel_memcpy[2] = {
@@ -3219,6 +3271,8 @@ struct processor_costs intel_cost = {
   "16:8:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* lujiazui_cost should produce code tuned for ZHAOXIN lujiazui CPU.  */
@@ -3334,6 +3388,8 @@ struct processor_costs lujiazui_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* Generic should produce code tuned for Core-i7 (and newer chips)
@@ -3453,6 +3509,8 @@ struct processor_costs generic_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
 /* core_cost should produce code tuned for Core familly of CPUs.  */
@@ -3579,5 +3637,7 @@ struct processor_costs core_cost = {
   "16:11:8",				/* Jump alignment.  */
   "0:0:8",				/* Label alignment.  */
   "16",					/* Func alignment.  */
+  4,					/* Small unroll limit.  */
+  2,					/* Small unroll factor.  */
 };
 
diff --git a/gcc/doc/gcc/gcc-command-options/machine-dependent-options/x86-options.rst b/gcc/doc/gcc/gcc-command-options/machine-dependent-options/x86-options.rst
index 6f015e9e96a..5e18fd77f87 100644
--- a/gcc/doc/gcc/gcc-command-options/machine-dependent-options/x86-options.rst
+++ b/gcc/doc/gcc/gcc-command-options/machine-dependent-options/x86-options.rst
@@ -1614,3 +1614,9 @@ on x86-64 processors in 64-bit environments.
 .. option:: -mdirect-extern-access
 
   Default setting; overrides :option:`-mno-direct-extern-access`.
+
+.. option:: -munroll-only-small-loops
+  Controls conservative small loop unrolling. It is enabled by default
+  at -O2, and unrolls loops with fewer than 4 insns by 1 time. An explicit
+  -f[no-]unroll-[all-]loops disables this flag to avoid any unrolling
+  behavior that the user does not want.
diff --git a/gcc/doc/gcc/gcc-command-options/option-summary.rst b/gcc/doc/gcc/gcc-command-options/option-summary.rst
index b90b6600d70..02898fb65cd 100644
--- a/gcc/doc/gcc/gcc-command-options/option-summary.rst
+++ b/gcc/doc/gcc/gcc-command-options/option-summary.rst
@@ -1490,7 +1490,8 @@ in the following sections.
   :option:`-mgeneral-regs-only`  :option:`-mcall-ms2sysv-xlogues` :option:`-mrelax-cmpxchg-loop` |gol|
   :option:`-mindirect-branch=choice`  :option:`-mfunction-return=choice` |gol|
   :option:`-mindirect-branch-register` :option:`-mharden-sls=choice` |gol|
-  :option:`-mindirect-branch-cs-prefix` :option:`-mneeded` :option:`-mno-direct-extern-access`
+  :option:`-mindirect-branch-cs-prefix` :option:`-mneeded` :option:`-mno-direct-extern-access` |gol|
+  :option:`-munroll-only-small-loops`
 
   *x86 Windows Options*
 
diff --git a/gcc/loop-init.cc b/gcc/loop-init.cc
index b9e07973dd6..9789efa1e11 100644
--- a/gcc/loop-init.cc
+++ b/gcc/loop-init.cc
@@ -565,9 +565,12 @@ public:
   {}
 
   /* opt_pass methods: */
-  bool gate (function *) final override
+  bool gate (function *fun) final override
     {
-      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll);
+      return (flag_unroll_loops || flag_unroll_all_loops || cfun->has_unroll
+	      || (targetm.loop_unroll_adjust
+		  && optimize >= 2
+		  && optimize_function_for_speed_p (fun)));
     }
 
   unsigned int execute (function *) final override;
@@ -583,7 +586,8 @@ pass_rtl_unroll_loops::execute (function *fun)
       if (dump_file)
 	df_dump (dump_file);
 
-      if (flag_unroll_loops)
+      if (flag_unroll_loops
+	  || targetm.loop_unroll_adjust)
 	flags |= UAP_UNROLL;
       if (flag_unroll_all_loops)
 	flags |= UAP_UNROLL_ALL;
diff --git a/gcc/testsuite/gcc.dg/guality/loop-1.c b/gcc/testsuite/gcc.dg/guality/loop-1.c
index 1b1f6d32271..a32ea445a3f 100644
--- a/gcc/testsuite/gcc.dg/guality/loop-1.c
+++ b/gcc/testsuite/gcc.dg/guality/loop-1.c
@@ -1,5 +1,7 @@
 /* { dg-do run } */
 /* { dg-options "-fno-tree-scev-cprop -fno-tree-vectorize -g" } */
+/* { dg-additional-options "-mno-unroll-only-small-loops" { target ia32 } } */
+
 
 #include "../nop.h"
 
diff --git a/gcc/testsuite/gcc.target/i386/pr86270.c b/gcc/testsuite/gcc.target/i386/pr86270.c
index 81841ef5bd7..cbc9fbb0450 100644
--- a/gcc/testsuite/gcc.target/i386/pr86270.c
+++ b/gcc/testsuite/gcc.target/i386/pr86270.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 
 int *a;
 long len;
diff --git a/gcc/testsuite/gcc.target/i386/pr93002.c b/gcc/testsuite/gcc.target/i386/pr93002.c
index 0248fcc00a5..f75a847f75d 100644
--- a/gcc/testsuite/gcc.target/i386/pr93002.c
+++ b/gcc/testsuite/gcc.target/i386/pr93002.c
@@ -1,6 +1,6 @@
 /* PR target/93002 */
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-unroll-only-small-loops" } */
 /* { dg-final { scan-assembler-not "cmp\[^\n\r]*-1" } } */
 
 volatile int sink;
-- 
2.18.1

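For readers following the loop-init.cc hunks above, the new gate and flag
selection can be modelled outside of GCC. This is only a sketch: the
struct `unroll_ctx` and the functions `unroll_gate` and `unroll_flags` are
hypothetical stand-ins for GCC's globals and pass methods, and the UAP_*
values are illustrative rather than copied from the GCC headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the RTL unroll pass consults.  */
struct unroll_ctx
{
  bool flag_unroll_loops;      /* -funroll-loops given */
  bool flag_unroll_all_loops;  /* -funroll-all-loops given */
  bool has_unroll;             /* models cfun->has_unroll */
  bool has_loop_unroll_adjust; /* target defines loop_unroll_adjust */
  int optimize;                /* -O level */
  bool optimize_for_speed;     /* models optimize_function_for_speed_p */
};

/* Illustrative flag values, not taken from GCC.  */
enum { UAP_UNROLL = 1, UAP_UNROLL_ALL = 2 };

/* Mirrors the patched gate: the pass also runs when the target provides
   the hook and the function is compiled at -O2 or above for speed.  */
static bool
unroll_gate (const struct unroll_ctx *c)
{
  assert (c != NULL);
  return (c->flag_unroll_loops || c->flag_unroll_all_loops || c->has_unroll
          || (c->has_loop_unroll_adjust
              && c->optimize >= 2
              && c->optimize_for_speed));
}

/* Mirrors the patched execute hunk: UAP_UNROLL is set whenever the
   target hook exists, independent of the -O level.  */
static int
unroll_flags (const struct unroll_ctx *c)
{
  int flags = 0;

  assert (c != NULL);
  if (c->flag_unroll_loops || c->has_loop_unroll_adjust)
    flags |= UAP_UNROLL;
  if (c->flag_unroll_all_loops)
    flags |= UAP_UNROLL_ALL;
  return flags;
}
```

In this model, a target with the hook at -O2 and optimizing for speed
passes the gate without any -funroll-* option, while the same target at
-O1 does not.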

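The two fields appended to every cost table above ("Small unroll limit"
4 and "Small unroll factor" 2) can be read as a simple policy. The sketch
below is an assumption about how a target hook might consume them, not
the code of ix86_loop_unroll_adjust; the names `small_unroll_costs` and
`small_loop_unroll_factor` are hypothetical, and the `<=` comparison
follows the "<= 4 instructions" wording of the cover letter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of a cost table: only the two appended fields.  */
struct small_unroll_costs
{
  unsigned limit;   /* "Small unroll limit": max insns in a small loop */
  unsigned factor;  /* "Small unroll factor": factor for such loops */
};

/* The values this patch adds to each x86 cost table.  */
static const struct small_unroll_costs patch_defaults = { 4, 2 };

/* Assumed policy: loops above the limit are left alone (factor 1);
   small loops get the table factor, capped by NUNROLL, the factor the
   generic unroller proposed.  */
static unsigned
small_loop_unroll_factor (const struct small_unroll_costs *costs,
                          unsigned ninsns, unsigned nunroll)
{
  assert (costs != NULL);
  if (ninsns > costs->limit)
    return 1;
  return nunroll < costs->factor ? nunroll : costs->factor;
}
```

Under this reading, a factor of 2 means the loop body appears twice,
which matches the "by 1 time" wording in the option documentation.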

* RE: [PATCH V2] Enable small loop unrolling for O2
  2022-11-09 17:22 David Edelsohn
@ 2022-11-11  2:04 ` Wang, Hongyu
  0 siblings, 0 replies; 8+ messages in thread
From: Wang, Hongyu @ 2022-11-11  2:04 UTC (permalink / raw)
  To: David Edelsohn; +Cc: GCC Patches


Thanks for the notification! I wasn’t aware of the Compile Farm before. I will check the impact of my patch there.

Regards,
Hongyu, Wang

From: David Edelsohn <dje.gcc@gmail.com>
Sent: Thursday, November 10, 2022 1:22 AM
To: Wang, Hongyu <hongyu.wang@intel.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH V2] Enable small loop unrolling for O2

> This patch does not change rs6000/s390 since I don't have machines to
> test them, but I suppose the default behavior is the same since they
> enable flag_unroll_loops at O2.

There are Power (rs6000) systems in the Compile Farm.

Trial Linux on Z (s390x) VMs are available through the Linux Community Cloud.
https://linuxone.cloud.marist.edu/#/register?flag=VM

Thanks, David




* Re: [PATCH V2] Enable small loop unrolling for O2
@ 2022-11-09 17:22 David Edelsohn
  2022-11-11  2:04 ` Wang, Hongyu
  0 siblings, 1 reply; 8+ messages in thread
From: David Edelsohn @ 2022-11-09 17:22 UTC (permalink / raw)
  To: Hongyu Wang; +Cc: GCC Patches


> This patch does not change rs6000/s390 since I don't have machines to
> test them, but I suppose the default behavior is the same since they
> enable flag_unroll_loops at O2.

There are Power (rs6000) systems in the Compile Farm.

Trial Linux on Z (s390x) VMs are available through the Linux Community
Cloud.
https://linuxone.cloud.marist.edu/#/register?flag=VM

Thanks, David


end of thread, other threads:[~2022-11-14  5:38 UTC | newest]

Thread overview: 8+ messages
2022-11-02  3:37 [PATCH V2] Enable small loop unrolling for O2 Hongyu Wang
2022-11-07 14:24 ` Richard Biener
2022-11-08  3:07   ` Hongtao Liu
2022-11-09  1:24     ` Hongyu Wang
2022-11-14  1:35       ` Hongtao Liu
2022-11-14  5:32         ` Hongyu Wang
2022-11-09 17:22 David Edelsohn
2022-11-11  2:04 ` Wang, Hongyu
