From: Hongtao Liu <crazylht@gmail.com>
To: "Richard Biener" <richard.guenther@gmail.com>,
"Martin Liška" <mliska@suse.cz>
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com,
"Cui,Lili" <lili.cui@intel.com>
Subject: Re: [PATCH] ix86: Suggest unroll factor for loop vectorization
Date: Tue, 25 Oct 2022 13:49:27 +0800 [thread overview]
Message-ID: <CAMZc-bxuN27fbPRFU9X253ZUJ_Km4POb6q-583XgSixcD7LNZA@mail.gmail.com> (raw)
In-Reply-To: <20221024024604.18324-1-lili.cui@intel.com>
Any comments?
On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi Hongtao,
>
> This patch introduces function finish_cost and
> determine_suggested_unroll_factor for x86 backend, to make it be
> able to suggest the unroll factor for a given loop being vectorized.
> Referring to aarch64, RS6000 backends and basing on the analysis on
> SPEC2017 performance evaluation results.
>
> Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> OK for trunk?
>
>
>
> With this patch, SPEC2017 performance evaluation results on
> ICX/CLX/ADL/Znver3 are listed below:
>
> For single copy:
> - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
>
> For multi-copy:
> - ADL: 549.fotonik3d_r +2.7%, the others are neutral
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (class ix86_vector_costs): Add new members
> m_nstmts, m_nloads m_nstores and determine_suggested_unroll_factor.
> (ix86_vector_costs::add_stmt_cost): Update for m_nstores, m_nloads
> and m_nstores.
> (ix86_vector_costs::determine_suggested_unroll_factor): New function.
> (ix86_vector_costs::finish_cost): Diito.
> * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
> (x86-vect-unroll-min-ldst-threshold): Likewise.
> (x86-vect-unroll-max-loop-size): Likewise.
> * doc/invoke.texi: Document new parameter.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> * gcc.target/i386/vect-reduc-1.c: Ditto.
> ---
> gcc/config/i386/i386.cc | 106 ++++++++++++++++++
> gcc/config/i386/i386.opt | 15 +++
> gcc/doc/invoke.texi | 17 +++
> .../gcc.target/i386/cond_op_maxmin_b-1.c | 2 +-
> .../gcc.target/i386/cond_op_maxmin_ub-1.c | 2 +-
> .../i386/vect-alignment-peeling-1.c | 2 +-
> .../i386/vect-alignment-peeling-2.c | 2 +-
> gcc/testsuite/gcc.target/i386/vect-reduc-1.c | 2 +-
> 8 files changed, 143 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index aeea26ef4be..a939354e55e 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
> stmt_vec_info stmt_info, slp_tree node,
> tree vectype, int misalign,
> vect_cost_model_location where) override;
> +
> + unsigned int determine_suggested_unroll_factor (loop_vec_info);
> +
> + void finish_cost (const vector_costs *) override;
> +
> + /* Total number of vectorized stmts (loop only). */
> + unsigned m_nstmts = 0;
> + /* Total number of loads (loop only). */
> + unsigned m_nloads = 0;
> + /* Total number of stores (loop only). */
> + unsigned m_nstores = 0;
> };
>
> /* Implement targetm.vectorize.create_costs. */
> @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> retval = (retval * 17) / 10;
> }
>
> + if (!m_costing_for_scalar
> + && is_a<loop_vec_info> (m_vinfo)
> + && where == vect_body)
> + {
> + m_nstmts += count;
> + if (kind == scalar_load || kind == vector_load
> + || kind == unaligned_load || kind == vector_gather_load)
> + m_nloads += count;
> + else if (kind == scalar_store || kind == vector_store
> + || kind == unaligned_store || kind == vector_scatter_store)
> + m_nstores += count;
> + }
> +
> m_costs[where] += retval;
>
> return retval;
> @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> return nunroll;
> }
>
> +unsigned int
> +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> +{
> + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> + /* Don't unroll if it's specified explicitly not to be unrolled. */
> + if (loop->unroll == 1
> + || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> + || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> + return 1;
> +
> + /* Don't unroll if there is no vectorized stmt. */
> + if (m_nstmts == 0)
> + return 1;
> +
> + /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
> + sizes. */
> + if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> + return 1;
> +
> + /* Calc the total number of loads and stores in the loop body. */
> + unsigned int nstmts_ldst = m_nloads + m_nstores;
> +
> + /* Don't unroll if loop body size big than threshold, the threshold
> + is a heuristic value inspired by param_max_unrolled_insns. */
> + unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
> + ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
> + : 1;
> + uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
> + uf = 1 << ceil_log2 (uf);
> +
> + /* Early return if don't need to unroll. */
> + if (uf == 1)
> + return 1;
> +
> + /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
> + if the number of loads and stores exceeds the threshold, unroll + software
> + schedule will reduce cache miss rate. */
> + if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
> + return uf;
> +
> + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> + unsigned int vf = vect_vf_for_cost (loop_vinfo);
> + unsigned int unrolled_vf = vf * uf;
> + if (est_niter == -1 || est_niter < unrolled_vf)
> + /* When the estimated iteration of this loop is unknown, it's possible
> + that we are able to vectorize this loop with the original VF but fail
> + to vectorize it with the unrolled VF any more if the actual iteration
> + count is in between. */
> + return 1;
> + else
> + {
> + unsigned int epil_niter_unr = est_niter % unrolled_vf;
> + unsigned int epil_niter = est_niter % vf;
> + /* Even if we have partial vector support, it can be still inefficent
> + to calculate the length when the iteration count is unknown, so
> + only expect it's good to unroll when the epilogue iteration count
> + is not bigger than VF (only one time length calculation). */
> + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> + && epil_niter_unr <= vf)
> + return uf;
> + /* Without partial vector support, conservatively unroll this when
> + the epilogue iteration count is less than the original one
> + (epilogue execution time wouldn't be longer than before). */
> + else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> + && epil_niter_unr <= epil_niter)
> + return uf;
> + }
> +
> + return 1;
> +}
> +
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> +
> +{
> + if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> + {
> + m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
> + }
> + vector_costs::finish_cost (scalar_costs);
> +}
>
> /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P. */
>
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index 53d534f6392..8e49b406aa5 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -1224,3 +1224,18 @@ mavxvnniint8
> Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> AVXVNNIINT8 built-in functions and code generation.
> +
> +-param=x86-vect-unroll-limit=
> +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
> +Used to limit unroll factor which indicates how much the autovectorizer may
> +unroll a loop. The default value is 4.
> +
> +-param=x86-vect-unroll-min-ldst-threshold=
> +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
> +Used to limit the mininum of loads and stores in the main loop. The default
> +value is 25.
> +
> +-param=x86-vect-unroll-max-loop-size=
> +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
> +This threshold is used to limit the maxnum size of loop body after unrolling.
> +The default value is 200.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 09548c4528c..c86d686f2cd 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> @item x86-stlf-window-ninsns
> Instructions number above which STFL stall penalty can be compensated.
>
> +@item x86-vect-unroll-limit
> +The vectorizer will check with target information to determine whether it
> +would be beneficial to unroll the main vectorized loop and by how much. This
> +parameter sets the upper bound of how much the vectorizer will unroll the main
> +loop. The default value is four.
> +
> +@item x86-vect-unroll-min-ldst-threshold
> +The vectorizer will check with target information to determine whether unroll
> +it. This parameter is used to limit the mininum of loads and stores in the main
> +loop.
> +
> +@item x86-vect-unroll-max-loop-size
> +The vectorizer will check with target information to determine whether unroll
> +it. This threshold is used to limit the max size of loop body after unrolling.
> +The default value is 200.
> +
> +
> @end table
>
> @end table
> diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> index 78c6600f83b..3bf1fb1b12d 100644
> --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
> /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> /* { dg-final { scan-assembler-times "vpmaxsb" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> index 117179f2109..ba41fd64386 100644
> --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
> /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> /* { dg-final { scan-assembler-times "vpmaxub" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> index 4aa536ba86c..fd2f054af4a 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> @@ -2,7 +2,7 @@
> /* This is a test exercising peeling for alignment for a negative step
> vector loop. We're forcing atom tuning here because that has a higher
> unaligned vs aligned cost unlike most other archs. */
> -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
>
> float a[1024], b[1024];
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> index 834bf0f770d..62c0db2bb9a 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> @@ -2,7 +2,7 @@
> /* This is a test exercising peeling for alignment for a positive step
> vector loop. We're forcing atom tuning here because that has a higher
> unaligned vs aligned cost unlike most other archs. */
> -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
>
> float a[1024], b[1024];
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> index 9ee9ba4e736..1ba4be01bea 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
> +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
>
> #define N 32
> int foo (int *a, int n)
> --
> 2.17.1
>
> Thanks,
> Lili.
--
BR,
Hongtao
next prev parent reply other threads:[~2022-10-25 5:46 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-24 2:46 Cui,Lili
2022-10-25 5:49 ` Hongtao Liu [this message]
2022-10-25 11:53 ` Richard Biener
2022-10-26 11:38 ` Cui, Lili
2022-10-28 7:12 ` Richard Biener
2022-11-02 9:37 ` Cui, Lili
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAMZc-bxuN27fbPRFU9X253ZUJ_Km4POb6q-583XgSixcD7LNZA@mail.gmail.com \
--to=crazylht@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=hongtao.liu@intel.com \
--cc=lili.cui@intel.com \
--cc=mliska@suse.cz \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).