From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yb1-xb30.google.com (mail-yb1-xb30.google.com [IPv6:2607:f8b0:4864:20::b30]) by sourceware.org (Postfix) with ESMTPS id D8CB83858C56 for ; Tue, 25 Oct 2022 05:46:49 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org D8CB83858C56 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-yb1-xb30.google.com with SMTP id 187so4280653ybe.1 for ; Mon, 24 Oct 2022 22:46:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=8h4QWe1vUaIAlOdYXRKl73GjTLYeSRknb1do69esYx8=; b=dESkmg62wsXKs2ee4eMe+9VmiQG/HNaQXnbwbhdS5of6YZHPZdAc2XvZPfalcP73jV 3LNgGMXenCc06H7XuqKgFFYrzQpE4hlZl/r048kosHmhvd4NYJdItnxmOyRhNeTSbQKK xCu9/cVBUxnihDu4CNubsYZuRTj6D7j/m3c9JS4vqZsTTHbMPd4NHFoSBBb3KyRDoHh6 AINqPvJLEsB/esrADi9LAgkrMT0/cRfTZAl76CKWtJA0VPNROlKL8/TCZBajBtMUOJiO RaC+dB2jrtDxwiKOsRhWEr4Ei9VPDR4u1iYQIGW1lJGXbvM2wbNAf06cKpdvUMKrmMjM KHtQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=8h4QWe1vUaIAlOdYXRKl73GjTLYeSRknb1do69esYx8=; b=53CHKfFUUzDVnEqeXa/IG2ASlH0BF06imJ+WWJMIqrzPo+5J+mRvVQwnxDIrKQxi5m hmcflDw8STRA4eKUXcpUPhMv8g6RxivJZhPbjBbZEd+L0SLptd3ZfggFwU+v4d7IrvdA eZfjC13whZA8h+n5LZPZpbCAc80q/ZVqXXt4L8cWi9Q/IxSQThaAL7Tfw3zPlkacb1MU FYzWCQQDNeF6UnxsLbdA0+bFqA6zl18J4IqmutrZQ0rJvpxAiXkPS+FbBWa7DHD8QR16 etWz/cLL6DidqlVvFWTRkreJfS2hdvKdlIY/ZDK8RfuWTya9oHb8pibBVIHpozHjzf2n u6zA== X-Gm-Message-State: ACrzQf0IEzWvOniylO3vIt9l6amHyCV0uaapY7vdACHcSiprSy+qSLMl aVXNMaUp6n9qVjUSyUM5g6V+/ppN0AlDkk2AKv4= X-Google-Smtp-Source: AMsMyM5WFXSNeHoGSjXOiznidyF0ICG8IgquLQX5ihTvWq3+oyAlnqBQJ+Dmxj8UM42M7I8/vfeMC/l0cwaOHWV3u54= X-Received: by 2002:a05:6902:150e:b0:6ca:84f9:e837 with SMTP id q14-20020a056902150e00b006ca84f9e837mr16697436ybu.251.1666676809119; Mon, 24 Oct 2022 22:46:49 -0700 (PDT) MIME-Version: 1.0 References: <20221024024604.18324-1-lili.cui@intel.com> In-Reply-To: <20221024024604.18324-1-lili.cui@intel.com> From: Hongtao Liu Date: Tue, 25 Oct 2022 13:49:27 +0800 Message-ID: Subject: Re: [PATCH] ix86: Suggest unroll factor for loop vectorization To: Richard Biener , =?UTF-8?Q?Martin_Li=C5=A1ka?= Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, "Cui,Lili" Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-7.8 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Any comments? On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches wrote: > > Hi Hongtao, > > This patch introduces function finish_cost and > determine_suggested_unroll_factor for x86 backend, to make it be > able to suggest the unroll factor for a given loop being vectorized. > Referring to aarch64, RS6000 backends and basing on the analysis on > SPEC2017 performance evaluation results. > > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu. > > OK for trunk? > > > > With this patch, SPEC2017 performance evaluation results on > ICX/CLX/ADL/Znver3 are listed below: > > For single copy: > - ICX: 549.fotonik3d_r +6.2%, the others are neutral > - CLX: 549.fotonik3d_r +1.9%, the others are neutral > - ADL: 549.fotonik3d_r +4.5%, the others are neutral > - Znver3: 549.fotonik3d_r +4.8%, the others are neutral > > For multi-copy: > - ADL: 549.fotonik3d_r +2.7%, the others are neutral > > gcc/ChangeLog: > > * config/i386/i386.cc (class ix86_vector_costs): Add new members > m_nstmts, m_nloads m_nstores and determine_suggested_unroll_factor. > (ix86_vector_costs::add_stmt_cost): Update for m_nstores, m_nloads > and m_nstores. > (ix86_vector_costs::determine_suggested_unroll_factor): New function. > (ix86_vector_costs::finish_cost): Diito. > * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter. > (x86-vect-unroll-min-ldst-threshold): Likewise. > (x86-vect-unroll-max-loop-size): Likewise. > * doc/invoke.texi: Document new parameter. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops. > * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto. > * gcc.target/i386/vect-alignment-peeling-1.c: Ditto. > * gcc.target/i386/vect-alignment-peeling-2.c: Ditto. > * gcc.target/i386/vect-reduc-1.c: Ditto. > --- > gcc/config/i386/i386.cc | 106 ++++++++++++++++++ > gcc/config/i386/i386.opt | 15 +++ > gcc/doc/invoke.texi | 17 +++ > .../gcc.target/i386/cond_op_maxmin_b-1.c | 2 +- > .../gcc.target/i386/cond_op_maxmin_ub-1.c | 2 +- > .../i386/vect-alignment-peeling-1.c | 2 +- > .../i386/vect-alignment-peeling-2.c | 2 +- > gcc/testsuite/gcc.target/i386/vect-reduc-1.c | 2 +- > 8 files changed, 143 insertions(+), 5 deletions(-) > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index aeea26ef4be..a939354e55e 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs > stmt_vec_info stmt_info, slp_tree node, > tree vectype, int misalign, > vect_cost_model_location where) override; > + > + unsigned int determine_suggested_unroll_factor (loop_vec_info); > + > + void finish_cost (const vector_costs *) override; > + > + /* Total number of vectorized stmts (loop only). */ > + unsigned m_nstmts = 0; > + /* Total number of loads (loop only). */ > + unsigned m_nloads = 0; > + /* Total number of stores (loop only). */ > + unsigned m_nstores = 0; > }; > > /* Implement targetm.vectorize.create_costs. */ > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > retval = (retval * 17) / 10; > } > > + if (!m_costing_for_scalar > + && is_a (m_vinfo) > + && where == vect_body) > + { > + m_nstmts += count; > + if (kind == scalar_load || kind == vector_load > + || kind == unaligned_load || kind == vector_gather_load) > + m_nloads += count; > + else if (kind == scalar_store || kind == vector_store > + || kind == unaligned_store || kind == vector_scatter_store) > + m_nstores += count; > + } > + > m_costs[where] += retval; > > return retval; > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop) > return nunroll; > } > > +unsigned int > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo) > +{ > + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + > + /* Don't unroll if it's specified explicitly not to be unrolled. */ > + if (loop->unroll == 1 > + || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops) > + || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops)) > + return 1; > + > + /* Don't unroll if there is no vectorized stmt. */ > + if (m_nstmts == 0) > + return 1; > + > + /* Don't unroll if vector size is zmm, since zmm throughput is lower than other > + sizes. */ > + if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64) > + return 1; > + > + /* Calc the total number of loads and stores in the loop body. */ > + unsigned int nstmts_ldst = m_nloads + m_nstores; > + > + /* Don't unroll if loop body size big than threshold, the threshold > + is a heuristic value inspired by param_max_unrolled_insns. */ > + unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size > + ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts) > + : 1; > + uf = MIN ((unsigned int)x86_vect_unroll_limit, uf); > + uf = 1 << ceil_log2 (uf); > + > + /* Early return if don't need to unroll. */ > + if (uf == 1) > + return 1; > + > + /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop > + if the number of loads and stores exceeds the threshold, unroll + software > + schedule will reduce cache miss rate. */ > + if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold) > + return uf; > + > + HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop); > + unsigned int vf = vect_vf_for_cost (loop_vinfo); > + unsigned int unrolled_vf = vf * uf; > + if (est_niter == -1 || est_niter < unrolled_vf) > + /* When the estimated iteration of this loop is unknown, it's possible > + that we are able to vectorize this loop with the original VF but fail > + to vectorize it with the unrolled VF any more if the actual iteration > + count is in between. */ > + return 1; > + else > + { > + unsigned int epil_niter_unr = est_niter % unrolled_vf; > + unsigned int epil_niter = est_niter % vf; > + /* Even if we have partial vector support, it can be still inefficent > + to calculate the length when the iteration count is unknown, so > + only expect it's good to unroll when the epilogue iteration count > + is not bigger than VF (only one time length calculation). */ > + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) > + && epil_niter_unr <= vf) > + return uf; > + /* Without partial vector support, conservatively unroll this when > + the epilogue iteration count is less than the original one > + (epilogue execution time wouldn't be longer than before). */ > + else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) > + && epil_niter_unr <= epil_niter) > + return uf; > + } > + > + return 1; > +} > + > +void > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > + > +{ > + if (loop_vec_info loop_vinfo = dyn_cast (m_vinfo)) > + { > + m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo); > + } > + vector_costs::finish_cost (scalar_costs); > +} > > /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P. */ > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt > index 53d534f6392..8e49b406aa5 100644 > --- a/gcc/config/i386/i386.opt > +++ b/gcc/config/i386/i386.opt > @@ -1224,3 +1224,18 @@ mavxvnniint8 > Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save > Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and > AVXVNNIINT8 built-in functions and code generation. > + > +-param=x86-vect-unroll-limit= > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param > +Used to limit unroll factor which indicates how much the autovectorizer may > +unroll a loop. The default value is 4. > + > +-param=x86-vect-unroll-min-ldst-threshold= > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param > +Used to limit the mininum of loads and stores in the main loop. The default > +value is 25. > + > +-param=x86-vect-unroll-max-loop-size= > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param > +This threshold is used to limit the maxnum size of loop body after unrolling. > +The default value is 200. > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index 09548c4528c..c86d686f2cd 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets: > @item x86-stlf-window-ninsns > Instructions number above which STFL stall penalty can be compensated. > > +@item x86-vect-unroll-limit > +The vectorizer will check with target information to determine whether it > +would be beneficial to unroll the main vectorized loop and by how much. This > +parameter sets the upper bound of how much the vectorizer will unroll the main > +loop. The default value is four. > + > +@item x86-vect-unroll-min-ldst-threshold > +The vectorizer will check with target information to determine whether unroll > +it. This parameter is used to limit the mininum of loads and stores in the main > +loop. > + > +@item x86-vect-unroll-max-loop-size > +The vectorizer will check with target information to determine whether unroll > +it. This threshold is used to limit the max size of loop body after unrolling. > +The default value is 200. > + > + > @end table > > @end table > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > index 78c6600f83b..3bf1fb1b12d 100644 > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */ > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */ > /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */ > /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */ > /* { dg-final { scan-assembler-times "vpmaxsb" 1 } } */ > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > index 117179f2109..ba41fd64386 100644 > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */ > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */ > /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */ > /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */ > /* { dg-final { scan-assembler-times "vpmaxub" 1 } } */ > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > index 4aa536ba86c..fd2f054af4a 100644 > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c > @@ -2,7 +2,7 @@ > /* This is a test exercising peeling for alignment for a negative step > vector loop. We're forcing atom tuning here because that has a higher > unaligned vs aligned cost unlike most other archs. */ > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */ > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */ > > float a[1024], b[1024]; > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > index 834bf0f770d..62c0db2bb9a 100644 > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c > @@ -2,7 +2,7 @@ > /* This is a test exercising peeling for alignment for a positive step > vector loop. We're forcing atom tuning here because that has a higher > unaligned vs aligned cost unlike most other archs. */ > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */ > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */ > > float a[1024], b[1024]; > > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > index 9ee9ba4e736..1ba4be01bea 100644 > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */ > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */ > > #define N 32 > int foo (int *a, int n) > -- > 2.17.1 > > Thanks, > Lili. -- BR, Hongtao