From: "Kewen.Lin" <linkw@linux.ibm.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>,
	Michael Meissner <meissner@linux.ibm.com>,
	Peter Bergner <bergner@linux.ibm.com>,
	Richard Sandiford <richard.sandiford@arm.com>,
	Richard Biener <richard.guenther@gmail.com>
Subject: PING^1 [PATCH] rs6000: New pass to mitigate SP float load perf issue on Power10
Date: Tue, 12 Dec 2023 14:16:02 +0800
Message-ID: <091f04fa-8264-19ac-9b39-5444eb9d1ab0@linux.ibm.com>
In-Reply-To: <b7b0d8fb-64a0-2ed2-f333-06b79133e68f@linux.ibm.com>

Hi,

Gentle ping:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636599.html

BR,
Kewen

on 2023/11/15 17:16, Kewen.Lin wrote:
> Hi,
> 
> As the Power ISA defines, when a scalar single precision (SP)
> floating point value is loaded from memory, it is converted into
> double precision (DP) format in the target register; this is unlike
> some other architectures which support SP and DP in registers with
> their own separate formats.  The scalar SP instructions operate on
> the DP format value in the register and round the result to fit in
> SP (but still keep the value in DP format).
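> 
> As a minimal illustration (not part of the patch; a hypothetical
> example), a trivial SP add compiled at -O2 typically shows this
> pattern:
> 
>   float
>   sp_add (const float *a, const float *b)
>   {
>     /* Two lfs loads (each cracked on Power10 into a load plus an
>        SP->DP convert), then one fadds which operates on the
>        DP-format values and rounds the result to fit in SP.  */
>     return *a + *b;
>   }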
> 
> On Power10, a scalar SP floating point load insn is cracked into
> two internal operations, one to load the value and the other to
> convert SP to DP format.  Compared to uncracked loads like the
> vector SP load, it has an extra 3 cycle load-to-use penalty.  When
> evaluating some critical workloads, we found that for some cases we
> don't really need the conversion, as all the involved operations
> work only on SP format.  In this case, we can replace the scalar SP
> loads with a vector SP load and splat (no conversion) and replace
> all involved computation with the corresponding vector operations
> (with the Power10 slice-based design, we expect the latency of a
> scalar operation and its equivalent vector operation to be the
> same), that is, promote the scalar SP loads and their affected
> computation to vector operations.
> 
> For example, for the below case:
> 
> void saxpy (int n, float a, float * restrict x, float * restrict y)
> {
>   for (int i = 0; i < n; ++i)
>       y[i] = a*x[i] + y[i];
> }
> 
> At -O2, the loop body would end up with:
> 
> .L3:
>         lfsx 12,6,9    // conv
>         lfsx 0,5,9     // conv
>         fmadds 0,0,1,12
>         stfsx 0,6,9
>         addi 9,9,4
>         bdnz .L3
> 
> but it can be implemented with:
> 
> .L3:
>         lxvwsx 0,5,9   // load and splat
>         lxvwsx 12,6,9
>         xvmaddmsp 0,1,12
>         stxsiwx 0,6,9  // just store word 1 (BE ordering)
>         addi 9,9,4
>         bdnz .L3
> 
> Evaluated on Power10, the latter runs 23% faster than the former.
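> 
> A rough hand-written analogue at the source level, using Altivec
> intrinsics (a sketch only, not from the patch: the pass works on
> GIMPLE instead, and the code it generates may differ), would look
> like:
> 
>   #include <altivec.h>
> 
>   void
>   saxpy_vec (int n, float a, float *restrict x, float *restrict y)
>   {
>     vector float va = vec_splats (a);  /* splat the scalar once */
>     for (int i = 0; i < n; ++i)
>       {
>         /* Load and splat instead of scalar load + conversion.  */
>         vector float vx = vec_splats (x[i]);
>         vector float vy = vec_splats (y[i]);
>         vector float vr = vec_madd (vx, va, vy);
>         y[i] = vr[0];  /* store just one lane back */
>       }
>   }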
> 
> So this patch introduces a pass to recognize such cases and
> replace the scalar SP operations with the appropriate vector SP
> operations when it's proper.
> 
> The processing of this pass starts from scalar SP loads: first it
> checks whether a load is valid, then checks all the stmts using its
> loaded result, and propagates from them.  This propagation mainly
> goes through function visit_stmt, which first checks the validity
> of the given stmt, then checks the feeders of its use operands with
> visit_stmt recursively, and finally checks all the stmts using the
> def with visit_stmt recursively.  The purpose is to ensure that all
> propagated stmts are valid to be transformed into their equivalent
> vector operations.  Some special operands, like constants or SSA
> names with a GIMPLE_NOP def, are recorded as splatting candidates.
> There are some validity checks, such as: whether the addressing
> mode can satisfy the index form with some adjustments, whether the
> corresponding vector operation is supported, and so on.  Once all
> stmts propagated from one load are valid, they are transformed by
> function transform_stmt, respecting the information in stmt_info
> like sf_type, new_ops etc.
> 
> For example, for the below test case:
> 
>   _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1];  // stmt1
>   _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1];  // stmt2
>   _8 = .FMA (_4, a_14(D), _7);                   // stmt3
>   MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8;  // stmt4
> 
> The processing starts from stmt1, which is taken as valid and added
> into the chain, then processes its use stmt stmt3, which is also
> valid, iterating over its operands: _4 whose def is stmt1 (already
> visited), a_14 which needs splatting, and _7 whose def stmt2 is
> still to be processed.  Then stmt2 is taken as a valid load and
> added into the chain.  With all operands _4, a_14 and _7 of stmt3
> processed, stmt3 is added into the chain as well.  Then the use
> stmts of _8 (the result of stmt3) are processed, which checks
> stmt4, a valid store.  Since all these involved stmts are valid to
> be transformed, we finally get:
> 
>   sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
>   sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
>   sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
>   sf_20 = .FMA (sf_5, sf_22, sf_25);
>   __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
> 
> Since this needs some validity checks and adjustments where
> allowed, such as checking whether a scalar operation has the
> corresponding vector support, and since a scalar SP load allows
> reg + {reg, disp} addressing modes while vector SP load and splat
> only allows reg + reg, and also considering the efficiency of
> getting the UD/DF chains for the affected operations, we make this
> a gimple pass.
> 
> Since the gimple_isel pass does some gimple massaging, this pass is
> placed just before it.  Since this pass can generate some extra
> vector constructions (like for constants or values converted from
> int), which are extra compared to the original scalar code, and it
> makes use of more vector resources than before, it's conservatively
> not turned on by default for now.
> 
> With extra code to turn this on by default on Power10, the patch
> was bootstrapped and regress-tested on Power10 (three test cases
> need some trivial adjustments to their expected output).
> Evaluating all SPEC2017 specrate benchmarks at O2, O3 and Ofast, we
> observed that it speeds up 521.wrf_r by 2.14%, 526.blender_r by
> 1.85% and the fprate geomean by 0.31% at O2; it is neutral at O3
> and Ofast.
> 
> Evaluating one critical workload related to xgboost shows that it
> helps to speed it up by 8% ~ 16% (avg. 14%, worst 8%, best 16%).
> 
> Note that the current implementation is mainly driven by some
> typical test cases from the motivating workloads; we want to
> continue extending it as needed.
> 
> Any thoughts?
> 
> BR,
> Kewen
> -----
> 
> gcc/ChangeLog:
> 
> 	* config.gcc: Add rs6000-p10sfopt.o to extra_objs for powerpc*-*-*
> 	and rs6000*-*-* targets.
> 	* config/rs6000/rs6000-builtin.cc (ldv_expand_builtin): Correct tmode
> 	for CODE_FOR_vsx_splat_v4sf.
> 	(stv_expand_builtin): Correct tmode for CODE_FOR_vsx_stxsiwx_v4sf.
> 	* config/rs6000/rs6000-builtins.def (__builtin_vsx_lxvwsx,
> 	__builtin_vsx_stxsiwx): New builtin definitions.
> 	* config/rs6000/rs6000-passes.def: Add pass_rs6000_p10sfopt.
> 	* config/rs6000/rs6000-protos.h (class gimple_opt_pass): New
> 	declaration.
> 	(make_pass_rs6000_p10sfopt): Likewise.
> 	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Check
> 	some prerequisite conditions for TARGET_P10_SF_OPT.
> 	(rs6000_rtx_costs): Cost one unit COSTS_N_INSNS more for vec_duplicate
> 	with {l,st}xv[wd]sx which only support x-form.
> 	* config/rs6000/rs6000.opt (-mp10-sf-opt): New option.
> 	* config/rs6000/t-rs6000: Add rule to build rs6000-p10sfopt.o.
> 	* config/rs6000/vsx.md (vsx_stxsiwx_v4sf): New define_insn.
> 	* config/rs6000/rs6000-p10sfopt.cc: New file.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/powerpc/p10-sf-opt-1.c: New test.
> 	* gcc.target/powerpc/p10-sf-opt-2.c: New test.
> 	* gcc.target/powerpc/p10-sf-opt-3.c: New test.
> ---
>  gcc/config.gcc                                |   4 +-
>  gcc/config/rs6000/rs6000-builtin.cc           |   9 +
>  gcc/config/rs6000/rs6000-builtins.def         |   5 +
>  gcc/config/rs6000/rs6000-p10sfopt.cc          | 950 ++++++++++++++++++
>  gcc/config/rs6000/rs6000-passes.def           |   5 +
>  gcc/config/rs6000/rs6000-protos.h             |   2 +
>  gcc/config/rs6000/rs6000.cc                   |  28 +
>  gcc/config/rs6000/rs6000.opt                  |   5 +
>  gcc/config/rs6000/t-rs6000                    |   4 +
>  gcc/config/rs6000/vsx.md                      |  11 +
>  .../gcc.target/powerpc/p10-sf-opt-1.c         |  22 +
>  .../gcc.target/powerpc/p10-sf-opt-2.c         |  34 +
>  .../gcc.target/powerpc/p10-sf-opt-3.c         |  43 +
>  13 files changed, 1120 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/config/rs6000/rs6000-p10sfopt.cc
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 0782cbc6e91..983fad9fb9a 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -517,7 +517,7 @@ or1k*-*-*)
>  	;;
>  powerpc*-*-*)
>  	cpu_type=rs6000
> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p10sfopt.o rs6000-logue.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
> @@ -554,7 +554,7 @@ riscv*)
>  	;;
>  rs6000*-*-*)
>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
> -	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> +	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p10sfopt.o rs6000-logue.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> index 82cc3a19447..38bb786e5eb 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -2755,6 +2755,10 @@ ldv_expand_builtin (rtx target, insn_code icode, rtx *op, machine_mode tmode)
>        || !insn_data[icode].operand[0].predicate (target, tmode))
>      target = gen_reg_rtx (tmode);
> 
> +  /* Correct tmode with the proper memory access mode.  */
> +  if (icode == CODE_FOR_vsx_splat_v4sf)
> +    tmode = SFmode;
> +
>    op[1] = copy_to_mode_reg (Pmode, op[1]);
> 
>    /* These CELL built-ins use BLKmode instead of tmode for historical
> @@ -2898,6 +2902,10 @@ static rtx
>  stv_expand_builtin (insn_code icode, rtx *op,
>  		    machine_mode tmode, machine_mode smode)
>  {
> +  /* Correct tmode with the proper memory access mode.  */
> +  if (icode == CODE_FOR_vsx_stxsiwx_v4sf)
> +    tmode = SFmode;
> +
>    op[2] = copy_to_mode_reg (Pmode, op[2]);
> 
>    /* For STVX, express the RTL accurately by ANDing the address with -16.
> @@ -3713,3 +3721,4 @@ rs6000_expand_builtin (tree exp, rtx target, rtx /* subtarget */,
>    emit_insn (pat);
>    return target;
>  }
> +
> diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
> index ce40600e803..c0441f5e27f 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -2810,6 +2810,11 @@
>        __builtin_vsx_scalar_cmp_exp_qp_unordered (_Float128, _Float128);
>      VSCEQPUO xscmpexpqp_unordered_kf {}
> 
> +  vf __builtin_vsx_lxvwsx (signed long, const float *);
> +    LXVWSX_V4SF vsx_splat_v4sf {ldvec}
> +
> +  void __builtin_vsx_stxsiwx (vf, signed long, const float *);
> +    STXSIWX_V4SF vsx_stxsiwx_v4sf {stvec}
> 
>  ; Miscellaneous P9 functions
>  [power9]
> diff --git a/gcc/config/rs6000/rs6000-p10sfopt.cc b/gcc/config/rs6000/rs6000-p10sfopt.cc
> new file mode 100644
> index 00000000000..6e1d90fd93e
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-p10sfopt.cc
> @@ -0,0 +1,950 @@
> +/* Subroutines used to mitigate the single precision floating point
> +   load and conversion performance issue by replacing scalar
> +   single precision floating point operations with appropriate
> +   vector operations where it is proper.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it
> +under the terms of the GNU General Public License as published by the
> +Free Software Foundation; either version 3, or (at your option) any
> +later version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT
> +ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +/* The processing of this pass starts from scalar SP loads: first it
> +checks whether a load is valid, then checks all the stmts using its
> +loaded result, and propagates from them.  This propagation mainly
> +goes through function visit_stmt, which first checks the validity
> +of the given stmt, then checks the feeders of its use operands with
> +visit_stmt recursively, and finally checks all the stmts using the
> +def with visit_stmt recursively.  The purpose is to ensure that all
> +propagated stmts are valid to be transformed into their equivalent
> +vector operations.  Some special operands, like constants or SSA
> +names with a GIMPLE_NOP def, are recorded as splatting candidates.
> +There are some validity checks, such as: whether the addressing
> +mode can satisfy the index form with some adjustments, whether the
> +corresponding vector operation is supported, and so on.  Once all
> +stmts propagated from one load are valid, they are transformed by
> +function transform_stmt, respecting the information in stmt_info
> +like sf_type, new_ops etc.
> +
> +For example, for the below test case:
> +
> +  _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1];  // stmt1
> +  _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1];  // stmt2
> +  _8 = .FMA (_4, a_14(D), _7);                   // stmt3
> +  MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8;  // stmt4
> +
> +The processing starts from stmt1, which is taken as valid and added
> +into the chain, then processes its use stmt stmt3, which is also
> +valid, iterating over its operands: _4 whose def is stmt1 (already
> +visited), a_14 which needs splatting, and _7 whose def stmt2 is
> +still to be processed.  Then stmt2 is taken as a valid load and
> +added into the chain.  With all operands _4, a_14 and _7 of stmt3
> +processed, stmt3 is added into the chain as well.  Then the use
> +stmts of _8 (the result of stmt3) are processed, which checks
> +stmt4, a valid store.  Since all these involved stmts are valid to
> +be transformed, we finally get:
> +
> +  sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
> +  sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
> +  sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
> +  sf_20 = .FMA (sf_5, sf_22, sf_25);
> +  __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
> +*/
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "target.h"
> +#include "rtl.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tm_p.h"
> +#include "tree-pass.h"
> +#include "ssa.h"
> +#include "optabs-tree.h"
> +#include "fold-const.h"
> +#include "tree-eh.h"
> +#include "gimple-iterator.h"
> +#include "gimple-fold.h"
> +#include "stor-layout.h"
> +#include "tree-ssa.h"
> +#include "tree-ssa-address.h"
> +#include "tree-cfg.h"
> +#include "cfgloop.h"
> +#include "tree-vectorizer.h"
> +#include "builtins.h"
> +#include "internal-fn.h"
> +#include "gimple-pretty-print.h"
> +#include "predict.h"
> +#include "rs6000-internal.h" /* for rs6000_builtin_decls  */
> +
> +namespace {
> +
> +/* Single precision floating point operation types.
> +
> +   So far, we only take care of load, store, ifn call, phi,
> +   normal arithmetic, comparison and special operations.
> +   Normally for an involved statement, we process all
> +   statements which use its result and all statements which
> +   define its operands, and propagate further; but for some
> +   special assignment statements, we don't want to process
> +   them this way and just splat them instead.  We adopt
> +   SF_SPECIAL for this kind of statement; for now it's only
> +   used for to-float conversion assignments.  */
> +enum sf_type
> +{
> +  SF_LOAD,
> +  SF_STORE,
> +  SF_CALL,
> +  SF_PHI,
> +  SF_NORMAL,
> +  SF_COMPARE,
> +  SF_SPECIAL
> +};
> +
> +/* Hold some information for a gimple statement which is valid
> +   to be promoted from scalar operation to vector operation.  */
> +
> +class stmt_info
> +{
> +public:
> +  stmt_info (gimple *s, sf_type t, bitmap bm)
> +  {
> +    stmt = s;
> +    type = t;
> +    splat_ops = BITMAP_ALLOC (NULL);
> +    if (bm)
> +      bitmap_copy (splat_ops, bm);
> +
> +    unsigned nops = gimple_num_args (stmt);
> +    new_ops.create (nops);
> +    new_ops.safe_grow_cleared (nops);
> +    replace_stmt = NULL;
> +    gphi_res = NULL_TREE;
> +  }
> +
> +  ~stmt_info ()
> +  {
> +    BITMAP_FREE (splat_ops);
> +    new_ops.release ();
> +  }
> +
> +  /* The stmt this info is for.  */
> +  gimple *stmt;
> +  /* Indicate sf_type of the current stmt.  */
> +  enum sf_type type;
> +  /* Bitmap used to indicate which op needs to be splatted.  */
> +  bitmap splat_ops;
> +  /* New operands used to build new stmt.  */
> +  vec<tree> new_ops;
> +  /* New stmt used to replace the current stmt.  */
> +  gimple *replace_stmt;
> +  /* Hold new gphi result which is created early.  */
> +  tree gphi_res;
> +};
> +
> +typedef stmt_info *stmt_info_p;
> +typedef hash_map<gimple *, stmt_info_p> info_map_t;
> +static info_map_t *stmt_info_map;
> +
> +/* As the comments for SF_SPECIAL above describe, for some
> +   special assignment statements (to-float conversion
> +   assignments here), we don't want to do the heavy
> +   processing but just want to generate a splat instead.
> +   Return true if the given STMT is special (a to-float
> +   conversion for now), otherwise return false.  */
> +
> +static bool
> +special_assign_p (gimple *stmt)
> +{
> +  gcc_assert (gimple_code (stmt) == GIMPLE_ASSIGN);
> +  enum tree_code code = gimple_assign_rhs_code (stmt);
> +  if (code == FLOAT_EXPR)
> +    return true;
> +  return false;
> +}
> +
> +/* Make base and index fields from the memory reference REF,
> +   return true and set *BASEP and *INDEXP respectively if it
> +   is successful, otherwise return false.  Since the
> +   transformed vector load (lxvwsx) and vector store (stxsiwx)
> +   only support the reg + reg addressing mode, we need to ensure
> +   the address satisfies it first.  */
> +
> +static bool
> +make_base_and_index (tree ref, tree *basep, tree *indexp)
> +{
> +  if (DECL_P (ref))
> +    {
> +      *basep
> +	= fold_build1 (ADDR_EXPR, build_pointer_type (float32_type_node), ref);
> +      *indexp = size_zero_node;
> +      return true;
> +    }
> +
> +  enum tree_code code = TREE_CODE (ref);
> +  if (code == TARGET_MEM_REF)
> +    {
> +      struct mem_address addr;
> +      get_address_description (ref, &addr);
> +      gcc_assert (!addr.step);
> +      *basep = addr.symbol ? addr.symbol : addr.base;
> +      if (addr.index)
> +	{
> +	  /* Give up if we have both offset and index; theoretically
> +	     we could generate one insn to update the base with the
> +	     index, but that costs more, so conservatively leave it.  */
> +	  if (!integer_zerop (addr.offset))
> +	    return false;
> +	  *indexp = addr.index;
> +	}
> +      else
> +	*indexp = addr.offset;
> +      return true;
> +    }
> +
> +  if (code == MEM_REF)
> +    {
> +      *basep = TREE_OPERAND (ref, 0);
> +      tree op1 = TREE_OPERAND (ref, 1);
> +      *indexp = op1 ? op1 : size_zero_node;
> +      return true;
> +    }
> +
> +  if (handled_component_p (ref))
> +    {
> +      machine_mode mode1;
> +      poly_int64 bitsize, bitpos;
> +      tree offset;
> +      int reversep = 0, volatilep = 0, unsignedp = 0;
> +      tree tem = get_inner_reference (ref, &bitsize, &bitpos, &offset, &mode1,
> +				      &unsignedp, &reversep, &volatilep);
> +      if (reversep)
> +	return false;
> +
> +      poly_int64 bytepos = exact_div (bitpos, BITS_PER_UNIT);
> +      if (offset)
> +	{
> +	  gcc_assert (!integer_zerop (offset));
> +	  /* Give up if we have both offset and bytepos.  */
> +	  if (maybe_ne (bytepos, 0))
> +	    return false;
> +	  if (!is_gimple_variable (offset))
> +	    return false;
> +	}
> +
> +      tree base1, index1;
> +      /* Further check the inner ref.  */
> +      if (!make_base_and_index (tem, &base1, &index1))
> +	return false;
> +
> +      if (integer_zerop (index1))
> +	{
> +	  /* Only need to consider base1 and offset/bytepos.  */
> +	  *basep = base1;
> +	  *indexp = offset ? offset : wide_int_to_tree (sizetype, bytepos);
> +	  return true;
> +	}
> +      /* Give up if we have both offset and index1.  */
> +      if (offset)
> +	return false;
> +      /* Give up if bytepos and index1 cannot be folded.  */
> +      if (!poly_int_tree_p (index1))
> +	return false;
> +      poly_offset_int new_off
> +	= wi::sext (wi::to_poly_offset (index1), TYPE_PRECISION (sizetype));
> +      new_off += bytepos;
> +
> +      poly_int64 new_index;
> +      if (!new_off.to_shwi (&new_index))
> +	return false;
> +
> +      *basep = base1;
> +      *indexp = wide_int_to_tree (sizetype, new_index);
> +      return true;
> +    }
> +
> +  if (TREE_CODE (ref) == SSA_NAME)
> +    {
> +      /* Inner ref can come from a load.  */
> +      gimple *def = SSA_NAME_DEF_STMT (ref);
> +      if (!gimple_assign_single_p (def))
> +	return false;
> +      tree ref1 = gimple_assign_rhs1 (def);
> +      if (!DECL_P (ref1) && !REFERENCE_CLASS_P (ref1))
> +	return false;
> +
> +      tree base1, offset1;
> +      if (!make_base_and_index (ref1, &base1, &offset1))
> +	return false;
> +      *basep = base1;
> +      *indexp = offset1;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +/* Check whether STMT is an expected SP float load or store, return
> +   true if it is and update IS_LOAD, otherwise return false.  */
> +
> +static bool
> +valid_load_store_p (gimple *stmt, bool &is_load)
> +{
> +  if (!gimple_assign_single_p (stmt))
> +    return false;
> +
> +  tree lhs = gimple_assign_lhs (stmt);
> +  if (TYPE_MODE (TREE_TYPE (lhs)) != SFmode)
> +    return false;
> +
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +  tree base, index;
> +  if (TREE_CODE (lhs) == SSA_NAME
> +      && (DECL_P (rhs) || REFERENCE_CLASS_P (rhs))
> +      && make_base_and_index (rhs, &base, &index))
> +    {
> +      is_load = true;
> +      return true;
> +    }
> +
> +  if ((DECL_P (lhs) || REFERENCE_CLASS_P (lhs))
> +      && make_base_and_index (lhs, &base, &index))
> +    {
> +      is_load = false;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +/* Check if it's valid to update the given STMT with the
> +   equivalent vector form, return true if yes and also set
> +   SF_TYPE to the proper sf_type, otherwise return false.  */
> +
> +static bool
> +is_valid (gimple *stmt, enum sf_type &sf_type)
> +{
> +  /* Give up if it has volatile ops.  */
> +  if (gimple_has_volatile_ops (stmt))
> +    return false;
> +
> +  /* Give up if it can throw an exception.  */
> +  if (stmt_can_throw_internal (cfun, stmt))
> +    return false;
> +
> +  /* Process phi.  */
> +  gphi *gp = dyn_cast<gphi *> (stmt);
> +  if (gp)
> +    {
> +      sf_type = SF_PHI;
> +      return true;
> +    }
> +
> +  /* Process assignment.  */
> +  gassign *gass = dyn_cast<gassign *> (stmt);
> +  if (gass)
> +    {
> +      bool is_load = false;
> +      if (valid_load_store_p (stmt, is_load))
> +	{
> +	  sf_type = is_load ? SF_LOAD : SF_STORE;
> +	  return true;
> +	}
> +
> +      tree lhs = gimple_assign_lhs (stmt);
> +      if (!lhs || TREE_CODE (lhs) != SSA_NAME)
> +	return false;
> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> +      if (TREE_CODE_CLASS (code) == tcc_comparison)
> +	{
> +	  tree rhs1 = gimple_assign_rhs1 (stmt);
> +	  tree rhs2 = gimple_assign_rhs2 (stmt);
> +	  tree type = TREE_TYPE (lhs);
> +	  if (!VECT_SCALAR_BOOLEAN_TYPE_P (type))
> +	    return false;
> +	  if (TYPE_MODE (type) != QImode)
> +	    return false;
> +	  type = TREE_TYPE (rhs1);
> +	  if (TYPE_MODE (type) != SFmode)
> +	    return false;
> +	  gcc_assert (TYPE_MODE (TREE_TYPE (rhs2)) == SFmode);
> +	  sf_type = SF_COMPARE;
> +	  return true;
> +	}
> +
> +      tree type = TREE_TYPE (lhs);
> +      if (TYPE_MODE (type) != SFmode)
> +	return false;
> +
> +      if (special_assign_p (stmt))
> +	{
> +	  sf_type = SF_SPECIAL;
> +	  return true;
> +	}
> +
> +      /* Check if vector operation is supported.  */
> +      sf_type = SF_NORMAL;
> +      tree vectype = build_vector_type_for_mode (type, V4SFmode);
> +      optab optab = optab_for_tree_code (code, vectype, optab_default);
> +      if (!optab)
> +	return false;
> +      return optab_handler (optab, V4SFmode) != CODE_FOR_nothing;
> +    }
> +
> +  /* Process call.  */
> +  gcall *gc = dyn_cast<gcall *> (stmt);
> +  /* TODO: Extend this to cover some other bifs.  */
> +  if (gc && gimple_call_internal_p (gc))
> +    {
> +      tree lhs = gimple_call_lhs (stmt);
> +      if (!lhs)
> +	return false;
> +      if (TREE_CODE (lhs) != SSA_NAME)
> +	return false;
> +      tree type = TREE_TYPE (lhs);
> +      if (TYPE_MODE (type) != SFmode)
> +	return false;
> +      enum internal_fn ifn = gimple_call_internal_fn (stmt);
> +      tree vectype = build_vector_type_for_mode (type, V4SFmode);
> +      if (direct_internal_fn_p (ifn))
> +	{
> +	  const direct_internal_fn_info &info = direct_internal_fn (ifn);
> +	  if (info.vectorizable
> +	      && (direct_internal_fn_supported_p (ifn,
> +						  tree_pair (vectype, vectype),
> +						  OPTIMIZE_FOR_SPEED)))
> +	    {
> +	      sf_type = SF_CALL;
> +	      return true;
> +	    }
> +	}
> +    }
> +
> +  return false;
> +}
> +
> +/* Process the given STMT: if it was visited before, just return true.
> +   If it's visited for the first time, mark it in VISITED and check
> +   whether the below ones are valid to be optimized with vector
> +   operations:
> +     - itself
> +     - all statements which define the operands involved here
> +     - all statements which use the result of STMT
> +   If all are valid, add STMT into CHAIN, create its own stmt_info
> +   and return true.  Otherwise, return false.  */
> +
> +static bool
> +visit_stmt (gimple *stmt, vec<gimple *> &chain, hash_set<gimple *> &visited)
> +{
> +  if (visited.add (stmt))
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "Stmt visited: %G", stmt);
> +      return true;
> +    }
> +  else if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "Visiting stmt: %G", stmt);
> +
> +  /* Check whether this statement is valid for this optimization.  */
> +  enum sf_type st_type;
> +  if (!is_valid (stmt, st_type))
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "Invalid stmt: %G", stmt);
> +      return false;
> +    }
> +
> +  /* A store is the end of its chain, so we don't need to
> +     process anything further.  For a special assignment, we
> +     don't want to process all statements using its result
> +     and all statements defining its operands.  */
> +  if (st_type == SF_STORE || st_type == SF_SPECIAL)
> +    {
> +      chain.safe_push (stmt);
> +      stmt_info_p si = new stmt_info (stmt, st_type, NULL);
> +      stmt_info_map->put (stmt, si);
> +      return true;
> +    }
> +
> +  /* Check all feeders of operands involved here.  */
> +
> +  /* Indicate which operand needs to be splatted, such as: constant.  */
> +  auto_bitmap splat_bm;
> +  if (st_type != SF_LOAD)
> +    {
> +      unsigned nops = gimple_num_args (stmt);
> +      for (unsigned i = 0; i < nops; i++)
> +	{
> +	  tree op = gimple_arg (stmt, i);
> +	  if (TREE_CODE (op) != SSA_NAME
> +	      && TREE_CODE (op) != REAL_CST)
> +	    {
> +	      if (dump_enabled_p ())
> +		dump_printf (MSG_NOTE, "With problematic %T in stmt: %G", op,
> +			     stmt);
> +	      return false;
> +	    }
> +
> +	  bool need_splat = false;
> +	  if (TREE_CODE (op) == SSA_NAME)
> +	    {
> +	      gimple *op_stmt = SSA_NAME_DEF_STMT (op);
> +	      if (gimple_code (op_stmt) == GIMPLE_NOP)
> +		need_splat = true;
> +	      else if (!visit_stmt (op_stmt, chain, visited))
> +		return false;
> +	    }
> +	  else
> +	    {
> +	      gcc_assert (TREE_CODE (op) == REAL_CST);
> +	      need_splat = true;
> +	    }
> +
> +	  if (need_splat)
> +	    bitmap_set_bit (splat_bm, i);
> +	}
> +    }
> +
> +  /* Push this stmt before all its use stmts, so it's transformed
> +     first during the transform phase; that prepares the new_ops
> +     needed when transforming its use stmts.  */
> +  chain.safe_push (stmt);
> +
> +  /* A comparison may have some constant operand, so we need the
> +     above handling for splatting, but we don't need any further
> +     processing on the uses of its result.  */
> +  if (st_type == SF_COMPARE)
> +    {
> +      stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
> +      stmt_info_map->put (stmt, si);
> +      return true;
> +    }
> +
> +  /* Process each use of definition.  */
> +  gimple *use_stmt;
> +  imm_use_iterator iter;
> +  tree lhs = gimple_get_lhs (stmt);
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    if (!visit_stmt (use_stmt, chain, visited))
> +      return false;
> +
> +  /* Create the corresponding stmt_info.  */
> +  stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
> +  stmt_info_map->put (stmt, si);
> +  return true;
> +}
> +
> +/* Tree NEW_LHS with vector type has been used to replace the
> +   original tree LHS.  For each use of LHS, find the use stmt
> +   and its corresponding stmt_info, and update its new_ops array
> +   accordingly to prepare the later replacement.  */
> +
> +static void
> +update_all_uses (tree lhs, tree new_lhs, sf_type type)
> +{
> +  gimple *use_stmt;
> +  imm_use_iterator iter;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    {
> +      stmt_info_p *slot = stmt_info_map->get (use_stmt);
> +      /* Each use stmt should have been processed, except
> +	 for SF_SPECIAL stmts, for which we stop
> +	 processing early.  */
> +      gcc_assert (slot || type == SF_SPECIAL);
> +      if (!slot)
> +	continue;
> +      stmt_info_p info = *slot;
> +      unsigned n = gimple_num_args (use_stmt);
> +      for (unsigned i = 0; i < n; i++)
> +	if (gimple_arg (use_stmt, i) == lhs)
> +	  info->new_ops[i] = new_lhs;
> +    }
> +}
> +
> +/* Insert NEW_STMT before old STMT, then remove STMT.  */
> +
> +static void
> +replace_stmt (gimple_stmt_iterator *gsi_ptr, gimple *stmt, gimple *new_stmt)
> +{
> +  gimple_set_location (new_stmt, gimple_location (stmt));
> +  gimple_move_vops (new_stmt, stmt);
> +  gsi_insert_before (gsi_ptr, new_stmt, GSI_SAME_STMT);
> +  gsi_remove (gsi_ptr, true);
> +}
> +
> +/* Transform the given STMT with vector type.  Only transform
> +   a phi stmt if HANDLE_PHI_P is true: since there can be def-use
> +   cycles with phis, we transform them in a second iteration.  */
> +
> +static void
> +transform_stmt (gimple *stmt, bool handle_phi_p = false)
> +{
> +  stmt_info_p info = *stmt_info_map->get (stmt);
> +
> +  /* This statement has been replaced.  */
> +  if (info->replace_stmt)
> +    return;
> +
> +  gcc_assert (!handle_phi_p || gimple_code (stmt) == GIMPLE_PHI);
> +
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, " Transforming stmt: %G", stmt);
> +
> +  tree lhs = gimple_get_lhs (stmt);
> +  tree type = float_type_node;
> +  tree vectype = build_vector_type_for_mode (type, V4SFmode);
> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, " info->type: %d\n", info->type);
> +
> +  /* Replace load with bif __builtin_vsx_lxvwsx.  */
> +  if (info->type == SF_LOAD)
> +    {
> +      tree fndecl = rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
> +      tree rhs = gimple_op (stmt, 1);
> +      tree base, index;
> +      bool mem_p = make_base_and_index (rhs, &base, &index);
> +      gcc_assert (mem_p);
> +      gimple *load = gimple_build_call (fndecl, 2, index, base);
> +      tree res = make_temp_ssa_name (vectype, NULL, "sf");
> +      gimple_call_set_lhs (load, res);
> +      info->replace_stmt = load;
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen load: %G", load);
> +      update_all_uses (lhs, res, info->type);
> +      replace_stmt (&gsi, stmt, load);
> +      return;
> +    }
> +
> +  /* Replace store with bif __builtin_vsx_stxsiwx.  */
> +  if (info->type == SF_STORE)
> +    {
> +      tree fndecl = rs6000_builtin_decls[RS6000_BIF_STXSIWX_V4SF];
> +      tree base, index;
> +      bool mem_p = make_base_and_index (lhs, &base, &index);
> +      gcc_assert (mem_p);
> +      gcc_assert (info->new_ops[0]);
> +      gimple *store
> +	= gimple_build_call (fndecl, 3, info->new_ops[0], index, base);
> +      info->replace_stmt = store;
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen store: %G", store);
> +      replace_stmt (&gsi, stmt, store);
> +      return;
> +    }
> +
> +  /* Generate vector construction for special stmt.  */
> +  if (info->type == SF_SPECIAL)
> +    {
> +      tree op = gimple_get_lhs (stmt);
> +      tree val = build_vector_from_val (vectype, op);
> +      tree res = make_temp_ssa_name (vectype, NULL, "sf");
> +      gimple *splat = gimple_build_assign (res, val);
> +      gimple_set_location (splat, gimple_location (stmt));
> +      gsi_insert_after (&gsi, splat, GSI_SAME_STMT);
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen special %G", splat);
> +      update_all_uses (lhs, res, info->type);
> +      info->replace_stmt = splat;
> +      return;
> +    }
> +
> +  /* Handle the operands which don't have a corresponding vector
> +     operand yet, like the ones which need splatting etc.  */
> +  unsigned nargs = gimple_num_args (stmt);
> +  gphi *phi = dyn_cast<gphi *> (stmt);
> +  for (unsigned i = 0; i < nargs; i++)
> +    {
> +      /* This operand already has the replacing one.  */
> +      if (info->new_ops[i])
> +	continue;
> +      /* When only handling phi, all operands should have the
> +	 prepared new_op.  */
> +      gcc_assert (!handle_phi_p);
> +      tree op = gimple_arg (stmt, i);
> +      /* This operand needs splatting.  */
> +      if (bitmap_bit_p (info->splat_ops, i))
> +	{
> +	  tree val = build_vector_from_val (vectype, op);
> +	  tree res = make_temp_ssa_name (vectype, NULL, "sf");
> +	  gimple *splat = gimple_build_assign (res, val);
> +	  /* If it's a PHI, push it to its incoming block.  */
> +	  if (phi)
> +	    {
> +	      basic_block src = gimple_phi_arg_edge (phi, i)->src;
> +	      gimple_stmt_iterator src_gsi = gsi_last_bb (src);
> +	      if (!gsi_end_p (src_gsi) && stmt_ends_bb_p (gsi_stmt (src_gsi)))
> +		gsi_insert_before (&src_gsi, splat, GSI_SAME_STMT);
> +	      else
> +		gsi_insert_after (&src_gsi, splat, GSI_NEW_STMT);
> +	    }
> +	  else
> +	    gsi_insert_before (&gsi, splat, GSI_SAME_STMT);
> +	  info->new_ops[i] = res;
> +	  bitmap_clear_bit (info->splat_ops, i);
> +	}
> +      else
> +	{
> +	  gcc_assert (TREE_CODE (op) == SSA_NAME);
> +	  /* Ensure all operands have the replacing new_op excepting
> +	     for phi stmt.  */
> +	  if (!phi)
> +	    {
> +	      gimple *def = SSA_NAME_DEF_STMT (op);
> +	      transform_stmt (def);
> +	      gcc_assert (info->new_ops[i]);
> +	    }
> +	}
> +    }
> +
> +  gimple *new_stmt;
> +  tree res;
> +  if (info->type == SF_PHI)
> +    {
> +      /* On the first visit, ensure the phi result is prepared so
> +	 that all its use stmts can be transformed.  */
> +      if (!handle_phi_p)
> +	{
> +	  res = info->gphi_res;
> +	  if (!res)
> +	    {
> +	      res = make_temp_ssa_name (vectype, NULL, "sf");
> +	      info->gphi_res = res;
> +	    }
> +	  update_all_uses (lhs, res, info->type);
> +	  return;
> +	}
> +      /* Actually transform it on the second visit.  */
> +      basic_block bb = gimple_bb (stmt);
> +      gphi *new_phi = create_phi_node (info->gphi_res, bb);
> +      for (unsigned i = 0; i < nargs; i++)
> +	{
> +	  location_t loc = gimple_phi_arg_location (phi, i);
> +	  edge e = gimple_phi_arg_edge (phi, i);
> +	  add_phi_arg (new_phi, info->new_ops[i], e, loc);
> +	}
> +      gimple_set_location (new_phi, gimple_location (stmt));
> +      remove_phi_node (&gsi, true);
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen phi %G", (gimple *) new_phi);
> +      return;
> +    }
> +
> +  if (info->type == SF_COMPARE)
> +    {
> +      /* Build a vector comparison.  */
> +      tree vectype1 = truth_type_for (vectype);
> +      tree res1 = make_temp_ssa_name (vectype1, NULL, "sf_vb4");
> +      enum tree_code subcode = gimple_assign_rhs_code (stmt);
> +      gimple *new_stmt1 = gimple_build_assign (res1, subcode, info->new_ops[0],
> +					       info->new_ops[1]);
> +      gsi_insert_before (&gsi, new_stmt1, GSI_SAME_STMT);
> +
> +      /* Build a VEC_COND_EXPR with -1 (true) or 0 (false).  */
> +      tree vectype2 = build_vector_type_for_mode (intSI_type_node, V4SImode);
> +      tree res2 = make_temp_ssa_name (vectype2, NULL, "sf_vi4");
> +      tree minus_one_vec = build_minus_one_cst (vectype2);
> +      tree zero_vec = build_zero_cst (vectype2);
> +      gimple *new_stmt2 = gimple_build_assign (res2, VEC_COND_EXPR, res1,
> +					       minus_one_vec, zero_vec);
> +      gsi_insert_before (&gsi, new_stmt2, GSI_SAME_STMT);
> +
> +      /* Build a BIT_FIELD_REF to extract lane 1 (BE ordering).  */
> +      tree bfr = build3 (BIT_FIELD_REF, intSI_type_node, res2, bitsize_int (32),
> +			 bitsize_int (BYTES_BIG_ENDIAN ? 32 : 64));
> +      tree res3 = make_temp_ssa_name (intSI_type_node, NULL, "sf_i4");
> +      gimple *new_stmt3 = gimple_build_assign (res3, BIT_FIELD_REF, bfr);
> +      gsi_insert_before (&gsi, new_stmt3, GSI_SAME_STMT);
> +
> +      /* Convert it accordingly.  */
> +      gimple *new_stmt = gimple_build_assign (lhs, NOP_EXPR, res3);
> +
> +      if (dump_enabled_p ())
> +	{
> +	  dump_printf (MSG_NOTE, "  => Gen comparison: %G",
> +		       (gimple *) new_stmt1);
> +	  dump_printf (MSG_NOTE, "                     %G",
> +		       (gimple *) new_stmt2);
> +	  dump_printf (MSG_NOTE, "                     %G",
> +		       (gimple *) new_stmt3);
> +	  dump_printf (MSG_NOTE, "                     %G",
> +		       (gimple *) new_stmt);
> +	}
> +      gsi_replace (&gsi, new_stmt, false);
> +      info->replace_stmt = new_stmt;
> +      return;
> +    }
> +
> +  if (info->type == SF_CALL)
> +    {
> +      res = make_temp_ssa_name (vectype, NULL, "sf");
> +      enum internal_fn ifn = gimple_call_internal_fn (stmt);
> +      new_stmt = gimple_build_call_internal_vec (ifn, info->new_ops);
> +      gimple_call_set_lhs (new_stmt, res);
> +    }
> +  else
> +    {
> +      gcc_assert (info->type == SF_NORMAL);
> +      enum tree_code subcode = gimple_assign_rhs_code (stmt);
> +      res = make_temp_ssa_name (vectype, NULL, "sf");
> +
> +      if (nargs == 1)
> +	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0]);
> +      else if (nargs == 2)
> +	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
> +					info->new_ops[1]);
> +      else
> +	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
> +					info->new_ops[1], info->new_ops[2]);
> +    }
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "  => Gen call/normal %G", new_stmt);
> +  update_all_uses (lhs, res, info->type);
> +  info->replace_stmt = new_stmt;
> +  replace_stmt (&gsi, stmt, new_stmt);
> +}
> +
> +/* Start from load STMT, find and check all related statements are
> +   valid to be optimized as vector operations, transform all of
> +   them if succeed.  */
> +
> +static void
> +process_chain_from_load (gimple *stmt)
> +{
> +  auto_vec<gimple *> chain;
> +  hash_set<gimple *> visited;
> +
> +  /* Load is the first of its chain.  */
> +  chain.safe_push (stmt);
> +  visited.add (stmt);
> +
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "\nDetecting the chain from %G", stmt);
> +
> +  gimple *use_stmt;
> +  imm_use_iterator iter;
> +  tree lhs = gimple_assign_lhs (stmt);
> +  /* Propagate from uses of load result.  */
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    /* Fail if encountering anything unexpected.  */
> +    if (!visit_stmt (use_stmt, chain, visited))
> +      return;
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf (MSG_NOTE, "Found a chain from load %G", stmt);
> +      for (gimple *s : chain)
> +	dump_printf (MSG_NOTE, "  -> %G", s);
> +      dump_printf (MSG_NOTE, "\n");
> +    }
> +
> +  /* Create stmt info for this load.  */
> +  stmt_info_p si = new stmt_info (stmt, SF_LOAD, NULL);
> +  stmt_info_map->put (stmt, si);
> +
> +  /* Transform the chain.  */
> +  for (gimple *stmt : chain)
> +    transform_stmt (stmt, false);
> +  /* Handle the remaining phis.  */
> +  for (gimple *stmt : chain)
> +    if (gimple_code (stmt) == GIMPLE_PHI)
> +      transform_stmt (stmt, true);
> +}
> +
> +const pass_data pass_data_rs6000_p10sfopt = {
> +  GIMPLE_PASS,	     /* type */
> +  "rs6000_p10sfopt", /* name */
> +  OPTGROUP_NONE,     /* optinfo_flags */
> +  TV_NONE,	     /* tv_id */
> +  PROP_ssa,	     /* properties_required */
> +  0,		     /* properties_provided */
> +  0,		     /* properties_destroyed */
> +  0,		     /* todo_flags_start */
> +  TODO_update_ssa,   /* todo_flags_finish */
> +};
> +
> +class pass_rs6000_p10sfopt : public gimple_opt_pass
> +{
> +public:
> +  pass_rs6000_p10sfopt (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_rs6000_p10sfopt, ctxt)
> +  {
> +  }
> +
> +  bool
> +  gate (function *fun) final override
> +    {
> +      /* Not every FE initializes target built-ins, so we need to
> +	 ensure the lxvwsx_v4sf decl is supported; we can't do
> +	 this check in rs6000_option_override_internal since the
> +	 bif decls are uninitialized at that time.  */
> +      return TARGET_P10_SF_OPT
> +	     && optimize
> +	     && optimize_function_for_speed_p (fun)
> +	     && rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
> +    }
> +
> +  unsigned int execute (function *) final override;
> +
> +}; /* end of class pass_rs6000_p10sfopt  */
> +
> +unsigned int
> +pass_rs6000_p10sfopt::execute (function *fun)
> +{
> +  stmt_info_map = new hash_map<gimple *, stmt_info_p>;
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      for (gimple_stmt_iterator gsi = gsi_start_nondebug_after_labels_bb (bb);
> +	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> +	{
> +	  gimple *stmt = gsi_stmt (gsi);
> +
> +	  switch (gimple_code (stmt))
> +	    {
> +	    case GIMPLE_ASSIGN:
> +	      if (gimple_assign_single_p (stmt))
> +		{
> +		  bool is_load = false;
> +		  if (!stmt_info_map->get (stmt)
> +		      && valid_load_store_p (stmt, is_load)
> +		      && is_load)
> +		    process_chain_from_load (stmt);
> +		}
> +	      break;
> +	    default:
> +	      break;
> +	    }
> +	}
> +    }
> +
> +  for (info_map_t::iterator it = stmt_info_map->begin ();
> +       it != stmt_info_map->end (); ++it)
> +    {
> +      stmt_info_p info = (*it).second;
> +      delete info;
> +    }
> +  delete stmt_info_map;
> +
> +  return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_rs6000_p10sfopt (gcc::context *ctxt)
> +{
> +  return new pass_rs6000_p10sfopt (ctxt);
> +}
> +
> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
> index ca899d5f7af..bc59a7d5f99 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -24,6 +24,11 @@ along with GCC; see the file COPYING3.  If not see
>     REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
>   */
> 
> +  /* Pass to mitigate the performance issue on scalar single precision
> +     floating point loads, by replacing some scalar single precision
> +     floating point operations with appropriate vector operations.  */
> +  INSERT_PASS_BEFORE (pass_gimple_isel, 1, pass_rs6000_p10sfopt);
> +
>    /* Pass to add the appropriate vector swaps on power8 little endian systems.
>       The power8 does not have instructions that automaticaly do the byte swaps
>       for loads and stores.  */
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index f70118ea40f..aa0f782f186 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -341,9 +341,11 @@ extern unsigned rs6000_linux_libm_function_max_error (unsigned, machine_mode,
>  /* Pass management.  */
>  namespace gcc { class context; }
>  class rtl_opt_pass;
> +class gimple_opt_pass;
> 
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
> +extern gimple_opt_pass *make_pass_rs6000_p10sfopt (gcc::context *);
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>  extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index cc24dd5301e..0e36860a73e 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4254,6 +4254,22 @@ rs6000_option_override_internal (bool global_init_p)
>        rs6000_isa_flags &= ~OPTION_MASK_PCREL;
>      }
> 
> +  if (TARGET_P10_SF_OPT)
> +    {
> +      if (!TARGET_HARD_FLOAT)
> +	{
> +	  if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
> +	    error ("%qs requires %qs", "-mp10-sf-opt", "-mhard-float");
> +	  rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
> +	}
> +      if (!TARGET_P9_VECTOR)
> +	{
> +	  if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
> +	    error ("%qs requires %qs", "-mp10-sf-opt", "-mpower9-vector");
> +	  rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
> +	}
> +    }
> +
>    /* Print the options after updating the defaults.  */
>    if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
>      rs6000_print_isa_options (stderr, 0, "after defaults", rs6000_isa_flags);
> @@ -22301,6 +22317,17 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>        *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
>        if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
>  	*total += COSTS_N_INSNS (100);
> +      /* Treat vec_duplicate specially here: since the vector splat
> +	 insns {l,st}xv[wd]sx only support the x-form, we should ensure
> +	 reg + reg is preferred over reg + const, otherwise cprop will
> +	 propagate the const and result in sub-optimal code.  */
> +      if (outer_code == VEC_DUPLICATE
> +	  && (GET_MODE_SIZE (mode) == 4
> +	    || GET_MODE_SIZE (mode) == 8)
> +	  && GET_CODE (XEXP (x, 0)) == PLUS
> +	  && CONST_INT_P (XEXP (XEXP (x, 0), 1))
> +	  && REG_P (XEXP (XEXP (x, 0), 0)))
> +	*total += COSTS_N_INSNS (1);
>        return true;
> 
>      case LABEL_REF:
> @@ -24443,6 +24470,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
>    { "modulo",			OPTION_MASK_MODULO,		false, true  },
>    { "mulhw",			OPTION_MASK_MULHW,		false, true  },
>    { "multiple",			OPTION_MASK_MULTIPLE,		false, true  },
> +  { "p10-sf-opt",		OPTION_MASK_P10_SF_OPT,		false, true  },
>    { "pcrel",			OPTION_MASK_PCREL,		false, true  },
>    { "pcrel-opt",		OPTION_MASK_PCREL_OPT,		false, true  },
>    { "popcntb",			OPTION_MASK_POPCNTB,		false, true  },
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index bde6d3ff664..fb50e00d0d9 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -597,6 +597,11 @@ mmma
>  Target Mask(MMA) Var(rs6000_isa_flags)
>  Generate (do not generate) MMA instructions.
> 
> +mp10-sf-opt
> +Target Mask(P10_SF_OPT) Var(rs6000_isa_flags)
> +Generate code to mitigate the single-precision floating point load performance
> +issue.
> +
>  mrelative-jumptables
>  Target Undocumented Var(rs6000_relative_jumptables) Init(1) Save
> 
> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
> index f183b42ce1d..e7cd6d2f694 100644
> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -35,6 +35,10 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
> 
> +rs6000-p10sfopt.o: $(srcdir)/config/rs6000/rs6000-p10sfopt.cc
> +	$(COMPILE) $<
> +	$(POSTCOMPILE)
> +
>  rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index f3b40229094..690318e82b2 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -6690,3 +6690,14 @@ (define_insn "vmsumcud"
>    "vmsumcud %0,%1,%2,%3"
>    [(set_attr "type" "veccomplex")]
>  )
> +
> +;; For expanding internal use bif __builtin_vsx_stxsiwx
> +(define_insn "vsx_stxsiwx_v4sf"
> + [(set (match_operand:SF 0 "memory_operand" "=Z")
> +       (unspec:SF
> +	  [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
> +	  UNSPEC_STFIWX))]
> + "TARGET_P9_VECTOR"
> + "stxsiwx %x1,%y0"
> + [(set_attr "type" "fpstore")])
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> new file mode 100644
> index 00000000000..6e8c6a84de6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> @@ -0,0 +1,22 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify the Power10 SP floating point load perf mitigation works
> +   as expected with one case having normal arithmetic.  */
> +
> +void
> +saxpy (int n, float a, float *restrict x, float *restrict y)
> +{
> +#pragma GCC unroll 1
> +  for (int i = 0; i < n; ++i)
> +    y[i] = a * x[i] + y[i];
> +}
> +
> +/* Checking lfsx -> lxvwsx, stfsx -> stxsiwx, fmadds -> xvmaddmsp etc.  */
> +/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mstxsiwx\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxvmaddmsp\M} 1 } } */
> +/* { dg-final { scan-assembler-not {\mlfsx?\M} } } */
> +/* { dg-final { scan-assembler-not {\mstfsx?\M} } } */
> +/* { dg-final { scan-assembler-not {\mfmadds\M} } } */
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> new file mode 100644
> index 00000000000..7593da8ecf4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> @@ -0,0 +1,34 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify the Power10 SP floating point load perf mitigation works
> +   as expected with one case having reduction.  */
> +
> +/* Partially reduced from pytorch batch_norm_kernel.cpp.  */
> +
> +typedef long long int64_t;
> +typedef float accscalar_t;
> +typedef float scalar_t;
> +
> +void
> +foo (int64_t n1, int64_t n2, accscalar_t sum, int64_t bound, int64_t N,
> +     scalar_t *input_data, scalar_t *var_sum_data, int64_t index)
> +{
> +  scalar_t mean = sum / N;
> +  accscalar_t _var_sum = 0;
> +  for (int64_t c = 0; c < n1; c++)
> +    {
> +      for (int64_t i = 0; i < n2; i++)
> +	{
> +	  int64_t offset = index + i;
> +	  scalar_t x = input_data[offset];
> +	  _var_sum += (x - mean) * (x - mean);
> +	}
> +      var_sum_data[c] = _var_sum;
> +    }
> +}
> +
> +/* { dg-final { scan-assembler {\mlxvwsx\M} } } */
> +/* { dg-final { scan-assembler {\mstxsiwx\M} } } */
> +/* { dg-final { scan-assembler {\mxvmaddasp\M} } } */
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> new file mode 100644
> index 00000000000..38aedd00faa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> @@ -0,0 +1,43 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify the Power10 SP floating point load perf mitigation works
> +   as expected with one case having comparison.  */
> +
> +/* Partially reduced from xgboost cpu_predictor.cc.  */
> +
> +typedef struct {
> +  unsigned int sindex;
> +  signed int cleft;
> +  unsigned int a1;
> +  unsigned int a2;
> +  float val;
> +} Node;
> +
> +extern void bar(Node *n);
> +
> +void
> +foo (Node *n0, float *pa, Node *var_843, int c)
> +{
> +  Node *var_821;
> +  Node *n = n0;
> +  int cleft_idx = c;
> +  do
> +    {
> +      unsigned idx = n->sindex;
> +      idx = (idx & ((1U << 31) - 1U));
> +      float f1 = pa[idx];
> +      float f2 = n->val;
> +      int t = f2 > f1;
> +      int var_825 = cleft_idx + t;
> +      unsigned long long var_823 = var_825;
> +      var_821 = &var_843[var_823];
> +      cleft_idx = var_821->cleft;
> +      n = var_821;
> +  } while (cleft_idx != -1);
> +
> +  bar (n);
> +}
> +
> +/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mxvcmpgtsp\M} 1 } } */
> --
> 2.39.3
