From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 76044385829E
 for <gcc-patches@gcc.gnu.org>; Mon, 15 Aug 2022 08:05:49 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 76044385829E
Received: from pps.filterd (m0187473.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27F7lg6c015065;
 Mon, 15 Aug 2022 08:05:45 GMT
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3hyj2q8d6v-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 15 Aug 2022 08:05:45 +0000
Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1])
 by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27F7m4vu016175;
 Mon, 15 Aug 2022 08:05:45 GMT
Received: from ppma01fra.de.ibm.com (46.49.7a9f.ip4.static.sl-reverse.com
 [159.122.73.70])
 by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3hyj2q8d58-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 15 Aug 2022 08:05:44 +0000
Received: from pps.filterd (ppma01fra.de.ibm.com [127.0.0.1])
 by ppma01fra.de.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27F7acvW002765;
 Mon, 15 Aug 2022 08:05:42 GMT
Received: from b06cxnps4074.portsmouth.uk.ibm.com
 (d06relay11.portsmouth.uk.ibm.com [9.149.109.196])
 by ppma01fra.de.ibm.com with ESMTP id 3hx3k91ax0-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Mon, 15 Aug 2022 08:05:42 +0000
Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com
 [9.149.105.61])
 by b06cxnps4074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 27F85edh30605696
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Mon, 15 Aug 2022 08:05:40 GMT
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id E6A9F11C054;
 Mon, 15 Aug 2022 08:05:39 +0000 (GMT)
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 7C28C11C04A;
 Mon, 15 Aug 2022 08:05:37 +0000 (GMT)
Received: from [9.197.235.82] (unknown [9.197.235.82])
 by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Mon, 15 Aug 2022 08:05:37 +0000 (GMT)
Message-ID: <88d36e92-cf50-7ae3-f975-c273741c022c@linux.ibm.com>
Date: Mon, 15 Aug 2022 16:05:36 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
 Gecko/20100101 Thunderbird/91.6.1
Subject: PING^1 [PATCH] rs6000: Suggest unroll factor for loop vectorization
Content-Language: en-US
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>,
 Peter Bergner <bergner@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 Richard Biener <richard.guenther@gmail.com>
References: <dd251673-29f8-3310-988f-a957c98b7dab@linux.ibm.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
In-Reply-To: <dd251673-29f8-3310-988f-a957c98b7dab@linux.ibm.com>
Content-Type: text/plain; charset=UTF-8
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: hTBsDhkw3q2pi0nw9Ir8zYPBzWDcX4nZ
X-Proofpoint-ORIG-GUID: bKQNCjBNrzbBHBR-G64GZyC7CaW16q7f
Content-Transfer-Encoding: 7bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1
 definitions=2022-08-15_04,2022-08-11_01,2022-06-22_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 impostorscore=0 bulkscore=0
 malwarescore=0 adultscore=0 mlxlogscore=999 spamscore=0 suspectscore=0
 lowpriorityscore=0 priorityscore=1501 mlxscore=0 clxscore=1015
 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2207270000 definitions=main-2208150028
X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, RCVD_IN_MSPIKE_H2,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 15 Aug 2022 08:05:55 -0000

Hi,

Gentle ping: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598601.html

BR,
Kewen

on 2022/7/20 17:30, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Commit r12-6679-g7ca1582ca60dc8 made the vectorizer accept an
> unroll factor to be applied to the vectorization factor when
> vectorizing the main loop; the target suggests it during
> costing.
> 
> This patch introduces the function determine_suggested_unroll_factor
> for the rs6000 port, so that it can suggest the unroll factor
> for a given loop being vectorized.  Modeled on the aarch64 port
> and based on analysis of SPEC2017 performance evaluation
> results, it mainly considers these aspects:
>   1) unroll option and pragma, which can disable unrolling for the
>      given loop;
>   2) a simple hardware resource model: the number of non-memory-access
>      vector insns issued per cycle;
>   3) aggressive heuristics when the iteration count is unknown:
>      - reduction case, to break cross-iteration dependency;
>      - emulated gather load;
>   4) estimated iteration count when the iteration count is unknown.
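
To make the four aspects above concrete, here is a minimal standalone
sketch of the decision flow.  It is not the GCC code: LoopSummary,
round_up_pow2 and suggest_unroll are hypothetical names, the struct
fields stand in for what the real function reads from loop_vec_info
and the cost data, and the defaults (issue 4, limit 4, threshold 1)
are the parameter defaults proposed in the patch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified view of one loop; field names are illustrative.
struct LoopSummary {
  unsigned nstmts;        // all vectorized statements in the loop body
  unsigned nloads;        // of which loads
  unsigned nstores;       // of which stores
  unsigned reduc_factor;  // 0 when the loop has no reduction
  bool gather_load;       // emulated gather load is used
  bool niters_known;      // iteration count known at compile time
  bool unroll_disabled;   // aspect 1: option or pragma forbids unrolling
  bool partial_vectors;   // partial vector support is available
  long est_niter;         // estimated iteration count, -1 if unknown
  unsigned vf;            // vectorization factor
};

// Smallest power of two that is >= x (x >= 1).
static unsigned round_up_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x) p <<= 1;
  return p;
}

unsigned suggest_unroll(const LoopSummary &l, unsigned issue = 4,
                        unsigned limit = 4, unsigned reduc_threshold = 1) {
  if (l.unroll_disabled) return 1;                      // aspect 1
  unsigned nonldst = l.nstmts - l.nloads - l.nstores;
  if (nonldst == 0) return 1;                           // only memory accesses
  unsigned rf = std::max(l.reduc_factor, 1u);
  unsigned uf = (rf * issue + nonldst - 1) / nonldst;   // aspect 2, rounded up
  uf = round_up_pow2(std::min(limit, uf));
  if (l.niters_known) return uf;
  if (rf > reduc_threshold || l.gather_load) return uf; // aspect 3
  unsigned unrolled_vf = l.vf * uf;                     // aspect 4 from here on
  if (l.est_niter == -1 || l.est_niter < (long) unrolled_vf) return 1;
  unsigned epil_unr = l.est_niter % unrolled_vf;
  unsigned epil = l.est_niter % l.vf;
  if (l.partial_vectors) return epil_unr <= l.vf ? uf : 1;
  return epil_unr <= epil ? uf : 1;
}
```

For example, a loop with one non-memory vector statement, an estimated
100 iterations and VF 4 gets a candidate factor of 4.  Without partial
vectors the unrolled epilogue (100 % 16 = 4 iterations) is longer than
the original one (100 % 4 = 0), so the sketch falls back to 1.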
> 
> With this patch, SPEC2017 performance evaluation results on
> Power8/9/10 are listed below (speedup pct.):
> 
>   * Power10
>     - O2: all are neutral (excluding some noise);
>     - Ofast: 510.parest_r +6.67%, the others are neutral
>              (abbreviated as ... below);
>     - Ofast + unroll: 510.parest_r +5.91%, ...
>     - Ofast + LTO + PGO: 510.parest_r +3.00%, ...
>     - Ofast + cheap vect cost: 510.parest_r +6.23%, ...
>     - Ofast + very-cheap vect cost: all are neutral;
> 
>   * Power9
>     - Ofast: 510.parest_r +8.73%, 538.imagick_r +11.18%
>              (likely noise), 500.perlbench_r +1.84%, ...
> 
>   * Power8
>     - Ofast: 510.parest_r +5.43%, ...;
> 
> This patch also introduces one documented parameter,
> rs6000-vect-unroll-limit=, similar to what aarch64 proposes.
> Evaluation on P8/P9/P10 shows the default value 4 is slightly
> better than other choices such as 2 and 8.
> 
> It also parameterizes two other values as undocumented
> parameters for future tweaking.  One parameter is
> rs6000-vect-unroll-issue, which simply models the hardware
> resource for non-memory-access vector instructions to avoid
> excessive unrolling.  Initially I tried to use the value from
> the hook rs6000_issue_rate, but the evaluation showed it
> performs badly, so I evaluated the values 2/4/6/8 on P8/P9/P10
> at Ofast; the results showed the default value 4 is good enough
> on these different architectures.  For the record, choice 8
> could make 510.parest_r's gain smaller or disappear on
> P8/P9/P10; choice 6 could make 503.bwaves_r degrade by more
> than 1% on P8/P10; and choice 2 could make 538.imagick_r
> degrade by 3.8%.  The other parameter is
> rs6000-vect-unroll-reduc-threshold.  It's mainly inspired by
> 510.parest_r and tuned for it.  Evaluating the values 0/1/2/3
> for the threshold showed that value 1 is the best choice.  For
> the record, choice 0 could make 525.x264_r degrade by 2% and
> 527.cam4_r degrade by 2.95% on P10, and 548.exchange2_r degrade
> by 1.41% and 527.cam4_r degrade by 2.54% on P8; choice 2 and
> bigger values could make 510.parest_r's gain smaller.
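
As a rough illustration of how the reduction threshold interacts with
the reduction factor, the sketch below mirrors the accounting in the
patch (an int reduction statement costs 1, an fp one costs 2, and the
maximum over statements is kept).  ReducStmt, reduc_factor and
unroll_aggressively are hypothetical names for this example only; the
clamp to 1 follows what determine_suggested_unroll_factor does before
comparing against the threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one reduction statement seen during costing.
struct ReducStmt {
  bool is_float;   // result type is floating point
  unsigned count;  // number of copies costed for this statement
};

// Same simplification as the patch: an int operation takes one cycle,
// an fp operation takes one more; keep the largest product seen.
unsigned reduc_factor(const std::vector<ReducStmt> &stmts) {
  unsigned factor = 0;
  for (const ReducStmt &s : stmts)
    factor = std::max(factor, (s.is_float ? 2u : 1u) * s.count);
  return factor;
}

// The factor is clamped to at least 1 before the comparison, so a
// threshold of 0 sends every loop down the aggressive path.
bool unroll_aggressively(unsigned factor, unsigned threshold = 1) {
  return std::max(factor, 1u) > threshold;
}
```

Under these assumptions, with the default threshold 1 a single fp
reduction (factor 2) qualifies while a single int reduction (factor 1)
does not.  With threshold 0 even a loop without any reduction
qualifies, which may be part of why choice 0 regressed several
benchmarks.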
> 
> Bootstrapped and regtested on powerpc64-linux-gnu P7 and P8,
> and powerpc64le-linux-gnu P9.  Bootstrapped on
> powerpc64le-linux-gnu P10, but one failure was exposed during
> regression testing there.  It has been identified as a missed
> optimization that can be reproduced without this support;
> PR106365 was opened for further tracking.
> 
> Is it OK for trunk?
> 
> BR,
> Kewen
> ------
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.cc (class rs6000_cost_data): Add new members
> 	m_nstores, m_reduc_factor, m_gather_load and member function
> 	determine_suggested_unroll_factor.
> 	(rs6000_cost_data::update_target_cost_per_stmt): Update for m_nstores,
> 	m_reduc_factor and m_gather_load.
> 	(rs6000_cost_data::determine_suggested_unroll_factor): New function.
> 	(rs6000_cost_data::finish_cost): Use determine_suggested_unroll_factor.
> 	* config/rs6000/rs6000.opt (rs6000-vect-unroll-limit): New parameter.
> 	(rs6000-vect-unroll-issue): Likewise.
> 	(rs6000-vect-unroll-reduc-threshold): Likewise.
> 	* doc/invoke.texi (rs6000-vect-unroll-limit): Document new parameter.
> 
> ---
>  gcc/config/rs6000/rs6000.cc  | 125 ++++++++++++++++++++++++++++++++++-
>  gcc/config/rs6000/rs6000.opt |  18 +++++
>  gcc/doc/invoke.texi          |   7 ++
>  3 files changed, 147 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 3ff16b8ae04..d0f107d70a8 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -5208,16 +5208,23 @@ protected:
>  				    vect_cost_model_location, unsigned int);
>    void density_test (loop_vec_info);
>    void adjust_vect_cost_per_loop (loop_vec_info);
> +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> 
>    /* Total number of vectorized stmts (loop only).  */
>    unsigned m_nstmts = 0;
>    /* Total number of loads (loop only).  */
>    unsigned m_nloads = 0;
> +  /* Total number of stores (loop only).  */
> +  unsigned m_nstores = 0;
> +  /* Reduction factor for suggesting unroll factor (loop only).  */
> +  unsigned m_reduc_factor = 0;
>    /* Possible extra penalized cost on vector construction (loop only).  */
>    unsigned m_extra_ctor_cost = 0;
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool m_vect_nonmem = false;
> +  /* If this loop gets vectorized with emulated gather load.  */
> +  bool m_gather_load = false;
>  };
> 
>  /* Test for likely overcommitment of vector hardware resources.  If a
> @@ -5368,9 +5375,34 @@ rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
>      {
>        m_nstmts += orig_count;
> 
> -      if (kind == scalar_load || kind == vector_load
> -	  || kind == unaligned_load || kind == vector_gather_load)
> -	m_nloads += orig_count;
> +      if (kind == scalar_load
> +	  || kind == vector_load
> +	  || kind == unaligned_load
> +	  || kind == vector_gather_load)
> +	{
> +	  m_nloads += orig_count;
> +	  if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +	    m_gather_load = true;
> +	}
> +      else if (kind == scalar_store
> +	       || kind == vector_store
> +	       || kind == unaligned_store
> +	       || kind == vector_scatter_store)
> +	m_nstores += orig_count;
> +      else if ((kind == scalar_stmt
> +		|| kind == vector_stmt
> +		|| kind == vec_to_scalar)
> +	       && stmt_info
> +	       && vect_is_reduction (stmt_info))
> +	{
> +	  /* Loop body contains normal int or fp operations and epilogue
> +	     contains vector reduction.  For simplicity, we assume int
> +	     operation takes one cycle and fp operation takes one more.  */
> +	  tree lhs = gimple_get_lhs (stmt_info->stmt);
> +	  bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
> +	  unsigned int basic_cost = is_float ? 2 : 1;
> +	  m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
> +	}
> 
>        /* Power processors do not currently have instructions for strided
>  	 and elementwise loads, and instead we must generate multiple
> @@ -5462,6 +5494,90 @@ rs6000_cost_data::adjust_vect_cost_per_loop (loop_vec_info loop_vinfo)
>      }
>  }
> 
> +/* Determine the suggested unroll factor by considering the factors below:
> +
> +    - unroll option/pragma which can disable unrolling for this loop;
> +    - simple hardware resource model for non memory vector insns;
> +    - aggressive heuristics when iteration count is unknown:
> +      - reduction case to break cross iteration dependency;
> +      - emulated gather load;
> +    - estimated iteration count when iteration count is unknown;
> +*/
> +
> +
> +unsigned int
> +rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> +{
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> +  if (loop->unroll == 1
> +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> +    return 1;
> +
> +  unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
> +  /* Don't unroll if no vector instructions except for memory access.  */
> +  if (nstmts_nonldst == 0)
> +    return 1;
> +
> +  /* Consider breaking cross iteration dependency for reduction.  */
> +  unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
> +
> +  /* Use this simple hardware resource model: how many non ld/st
> +     vector instructions can be issued per cycle.  */
> +  unsigned int issue_width = rs6000_vect_unroll_issue;
> +  unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
> +  uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
> +  /* Make sure it is power of 2.  */
> +  uf = 1 << ceil_log2 (uf);
> +
> +  /* If the iteration count is known, the costing would be exact enough,
> +     so there is no need to worry that unrolling makes it worse.  */
> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    return uf;
> +
> +  /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
> +     loop if either condition is satisfied:
> +       - reduction factor exceeds the threshold;
> +       - emulated gather load adopted.  */
> +  if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
> +      || m_gather_load)
> +    return uf;
> +
> +  /* Check if we can conclude it's good to unroll from the estimated
> +     iteration count.  */
> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> +  unsigned int unrolled_vf = vf * uf;
> +  if (est_niter == -1 || est_niter < unrolled_vf)
> +    /* When the estimated iteration of this loop is unknown, it's possible
> +       that we are able to vectorize this loop with the original VF but fail
> +       to vectorize it with the unrolled VF any more if the actual iteration
> +       count is in between.  */
> +    return 1;
> +  else
> +    {
> +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> +      unsigned int epil_niter = est_niter % vf;
> +      /* Even if we have partial vector support, it can still be inefficient
> +	 to calculate the length when the iteration count is unknown, so
> +	 only expect it's good to unroll when the epilogue iteration count
> +	 is not bigger than VF (only one time length calculation).  */
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +	  && epil_niter_unr <= vf)
> +	return uf;
> +      /* Without partial vector support, conservatively unroll this when
> +	 the epilogue iteration count is less than the original one
> +	 (epilogue execution time wouldn't be longer than before).  */
> +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +	       && epil_niter_unr <= epil_niter)
> +	return uf;
> +    }
> +
> +  return 1;
> +}
> +
>  void
>  rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>  {
> @@ -5478,6 +5594,9 @@ rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>  	  && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
>  	  && LOOP_REQUIRES_VERSIONING (loop_vinfo))
>  	m_costs[vect_body] += 10000;
> +
> +      m_suggested_unroll_factor
> +	= determine_suggested_unroll_factor (loop_vinfo);
>      }
> 
>    vector_costs::finish_cost (scalar_costs);
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index 4931d781c4e..80c2c61a9de 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -624,6 +624,14 @@ mieee128-constant
>  Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>  Generate (do not generate) code that uses the LXVKQ instruction.
> 
> +; Documented parameters
> +
> +-param=rs6000-vect-unroll-limit=
> +Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4) IntegerRange(1, 64) Param
> +Used to limit unroll factor which indicates how much the autovectorizer may
> +unroll a loop.  The default value is 4.
> +
> +; Undocumented parameters
>  -param=rs6000-density-pct-threshold=
>  Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 100) Param
>  When costing for loop vectorization, we probably need to penalize the loop body
> @@ -661,3 +669,13 @@ Like parameter rs6000-density-load-pct-threshold, we also check if the total
>  number of load statements exceeds the threshold specified by this parameter,
>  and penalize only if it's satisfied.  The default value is 20.
> 
> +-param=rs6000-vect-unroll-issue=
> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4) IntegerRange(1, 128) Param
> +Indicate how many non memory access vector instructions can be issued per
> +cycle; it's used in unroll factor determination for autovectorizer.  The
> +default value is 4.
> +
> +-param=rs6000-vect-unroll-reduc-threshold=
> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold) Init(1) Param
> +When reduction factor computed for a loop exceeds the threshold specified by
> +this parameter, prefer to unroll this loop.  The default value is 1.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 84d6f0f9860..097ab1d5563 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -29658,6 +29658,13 @@ Generate (do not generate) code that will run in privileged state.
>  @opindex no-block-ops-unaligned-vsx
>  Generate (do not generate) unaligned vsx loads and stores for
>  inline expansion of @code{memcpy} and @code{memmove}.
> +
> +@item --param rs6000-vect-unroll-limit=
> +The vectorizer will check with target information to determine whether it
> +would be beneficial to unroll the main vectorized loop and by how much.  This
> +parameter sets the upper bound of how much the vectorizer will unroll the main
> +loop.  The default value is four.
> +
>  @end table
> 
>  @node RX Options
> --
> 2.27.0