public inbox for gcc-patches@gcc.gnu.org
* [PATCH] rs6000: New pass to mitigate SP float load perf issue on Power10
@ 2023-11-15  9:16 Kewen.Lin
From: Kewen.Lin @ 2023-11-15  9:16 UTC (permalink / raw)
  To: GCC Patches
  Cc: Segher Boessenkool, David Edelsohn, Michael Meissner,
	Peter Bergner, Richard Sandiford, Richard Biener

Hi,

As the Power ISA defines, when a scalar single precision (SP)
floating point value is loaded from memory, it is converted to
double precision (DP) format in the target register; this is unlike
some other architectures which support SP and DP in registers as
separate formats.  The scalar SP instructions operate on the DP
format value in the register and round the result to fit in SP (but
still keep the value in DP format).

On Power10, a scalar SP floating point load insn is cracked into
two internal operations: one loads the value, the other converts it
from SP to DP format.  Compared with an uncracked load such as a
vector SP load, it has an extra 3-cycle load-to-use penalty.  When
evaluating some critical workloads, we found that in some cases we
don't really need the conversion, since all the involved operations
use only the SP format.  In such cases, we can replace the scalar SP
loads with vector SP load-and-splat (no conversion) and replace all
involved computation with the corresponding vector operations (with
the Power10 slice-based design, we expect a scalar operation and its
equivalent vector operation to have the same latency); that is, we
promote the scalar SP loads and the computation they affect to
vector operations.

For example, take the case below:

void saxpy (int n, float a, float * restrict x, float * restrict y)
{
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

At -O2, the loop body would end up with:

.L3:
        lfsx 12,6,9    // conv
        lfsx 0,5,9     // conv
        fmadds 0,0,1,12
        stfsx 0,6,9
        addi 9,9,4
        bdnz .L3

but it can be implemented with:

.L3:
        lxvwsx 0,5,9   // load and splat
        lxvwsx 12,6,9
        xvmaddmsp 0,1,12
        stxsiwx 0,6,9  // just store word 1 (BE ordering)
        addi 9,9,4
        bdnz .L3

Evaluated on Power10, the latter is 23% faster than the former.

So this patch introduces a pass to recognize such cases and replace
the scalar SP operations with the appropriate vector SP operations
where appropriate.

The pass starts processing from scalar SP loads: it first checks
whether a load is valid, then checks all the stmts using its loaded
result, and propagates from them.  This propagation is mainly done
by function visit_stmt, which first checks the validity of the given
stmt, then recursively checks the feeders of its use operands with
visit_stmt, and finally recursively checks all the stmts using its
def with visit_stmt.  The purpose is to ensure that all propagated
stmts can be transformed into their equivalent vector operations.
Some special operands, such as constants or SSA names with GIMPLE_NOP
defs, are recorded as splatting candidates.  The validity checks
include: whether the addressing mode can satisfy the index form with
some adjustments, whether the corresponding vector operation is
supported, and so on.  Once all stmts propagated from one load are
valid, they are transformed by function transform_stmt, respecting
the information in stmt_info such as sf_type, new_ops etc.

For example, for the below test case:

  _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1];  // stmt1
  _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1];  // stmt2
  _8 = .FMA (_4, a_14(D), _7);                   // stmt3
  MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8;  // stmt4

The processing starts from stmt1, which is taken as valid and added
into the chain; it then processes its use stmt stmt3, which is also
valid, iterating over its operands: _4 whose def is stmt1 (already
visited), a_14 which needs splatting, and _7 whose def stmt2 still
needs processing.  Then stmt2 is taken as a valid load and added
into the chain.  With all operands _4, a_14 and _7 of stmt3
processed, stmt3 is added into the chain as well.  Next it processes
the use stmts of _8 (the result of stmt3), so it checks stmt4, which
is a valid store.  Since all these involved stmts are valid to be
transformed, we finally get:

  sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
  sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
  sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
  sf_20 = .FMA (sf_5, sf_22, sf_25);
  __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));

This needs some validity checks and adjustments, such as checking
whether a scalar operation has the corresponding vector support; a
scalar SP load allows reg + {reg, disp} addressing modes while
vector SP load-and-splat only allows reg + reg; and we also want to
obtain the UD/DU chains of the affected operations efficiently.  For
these reasons, we implement this as a gimple pass.
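
To illustrate the addressing-mode point, here is a small hand-written
sketch (for illustration only, it is not output of this pass): the
scalar SP load can use a D-form reg + disp address directly, while
lxvwsx is X-form only, so a displacement has to be materialized into
a register first:

        lfs 0,16(3)      // scalar SP load, reg + disp is fine

versus

        li 9,16
        lxvwsx 0,3,9     // lxvwsx only accepts reg + reg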

Since the gimple_isel pass does some gimple massaging, this pass is
placed just before it.  Because this pass can generate some extra
vector constructions (such as for constants or values converted from
int), which the original scalar code does not need, and it makes use
of more vector resources than before, it is conservatively not
turned on by default for now.
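
For anyone wanting to give it a try, a minimal way to exercise the
pass (going by the option spelling added in this patch, and assuming
-mcpu=power10 provides the required -mpower9-vector support, as the
testcases do via -mdejagnu-cpu=power10) would be something like:

  gcc -O2 -mcpu=power10 -mp10-sf-opt -S saxpy.c

and then checking that the loop uses lxvwsx/stxsiwx rather than
lfsx/stfsx, as the new p10-sf-opt-1.c test does.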

With extra code to make this the default on Power10, it bootstrapped
and almost fully passed regression testing on Power10 (three test
cases need some trivial adjustments to their expected output).
Evaluating all SPEC2017 specrate benchmarks at O2, O3 and Ofast, we
observed that it speeds up 521.wrf_r by 2.14%, 526.blender_r by 1.85%
and the fprate geomean by 0.31% at O2, and it is neutral at O3 and
Ofast.

Evaluating one critical workload related to xgboost shows that it
helps to speed it up by 8% ~ 16% (avg. 14%, worst 8%, best 16%).

Note that the current implementation is mainly driven by some typical
test cases from motivating workloads; we want to continue to extend
it as needed.

Any thoughts?

BR,
Kewen
-----

gcc/ChangeLog:

	* config.gcc: Add rs6000-p10sfopt.o to extra_objs for powerpc*-*-*
	and rs6000*-*-* targets.
	* config/rs6000/rs6000-builtin.cc (ldv_expand_builtin): Correct tmode
	for CODE_FOR_vsx_splat_v4sf.
	(stv_expand_builtin): Correct tmode for CODE_FOR_vsx_stxsiwx_v4sf.
	* config/rs6000/rs6000-builtins.def (__builtin_vsx_lxvwsx,
	__builtin_vsx_stxsiwx): New builtin definitions.
	* config/rs6000/rs6000-passes.def: Add pass_rs6000_p10sfopt.
	* config/rs6000/rs6000-protos.h (class gimple_opt_pass): New
	declaration.
	(make_pass_rs6000_p10sfopt): Likewise.
	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Check
	some prerequisite conditions for TARGET_P10_SF_OPT.
	(rs6000_rtx_costs): Cost one unit COSTS_N_INSNS more for vec_duplicate
	with {l,st}xv[wd]sx which only support x-form.
	* config/rs6000/rs6000.opt (-mp10-sf-opt): New option.
	* config/rs6000/t-rs6000: Add rule to build rs6000-p10sfopt.o.
	* config/rs6000/vsx.md (vsx_stxsiwx_v4sf): New define_insn.
	* config/rs6000/rs6000-p10sfopt.cc: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/p10-sf-opt-1.c: New test.
	* gcc.target/powerpc/p10-sf-opt-2.c: New test.
	* gcc.target/powerpc/p10-sf-opt-3.c: New test.
---
 gcc/config.gcc                                |   4 +-
 gcc/config/rs6000/rs6000-builtin.cc           |   9 +
 gcc/config/rs6000/rs6000-builtins.def         |   5 +
 gcc/config/rs6000/rs6000-p10sfopt.cc          | 950 ++++++++++++++++++
 gcc/config/rs6000/rs6000-passes.def           |   5 +
 gcc/config/rs6000/rs6000-protos.h             |   2 +
 gcc/config/rs6000/rs6000.cc                   |  28 +
 gcc/config/rs6000/rs6000.opt                  |   5 +
 gcc/config/rs6000/t-rs6000                    |   4 +
 gcc/config/rs6000/vsx.md                      |  11 +
 .../gcc.target/powerpc/p10-sf-opt-1.c         |  22 +
 .../gcc.target/powerpc/p10-sf-opt-2.c         |  34 +
 .../gcc.target/powerpc/p10-sf-opt-3.c         |  43 +
 13 files changed, 1120 insertions(+), 2 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-p10sfopt.cc
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 0782cbc6e91..983fad9fb9a 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -517,7 +517,7 @@ or1k*-*-*)
 	;;
 powerpc*-*-*)
 	cpu_type=rs6000
-	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p10sfopt.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
 	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
@@ -554,7 +554,7 @@ riscv*)
 	;;
 rs6000*-*-*)
 	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
-	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
+	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p10sfopt.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
index 82cc3a19447..38bb786e5eb 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -2755,6 +2755,10 @@ ldv_expand_builtin (rtx target, insn_code icode, rtx *op, machine_mode tmode)
       || !insn_data[icode].operand[0].predicate (target, tmode))
     target = gen_reg_rtx (tmode);

+  /* Correct tmode with the proper addr mode.  */
+  if (icode == CODE_FOR_vsx_splat_v4sf)
+    tmode = SFmode;
+
   op[1] = copy_to_mode_reg (Pmode, op[1]);

   /* These CELL built-ins use BLKmode instead of tmode for historical
@@ -2898,6 +2902,10 @@ static rtx
 stv_expand_builtin (insn_code icode, rtx *op,
 		    machine_mode tmode, machine_mode smode)
 {
+  /* Correct tmode with the proper addr mode.  */
+  if (icode == CODE_FOR_vsx_stxsiwx_v4sf)
+    tmode = SFmode;
+
   op[2] = copy_to_mode_reg (Pmode, op[2]);

   /* For STVX, express the RTL accurately by ANDing the address with -16.
@@ -3713,3 +3721,4 @@ rs6000_expand_builtin (tree exp, rtx target, rtx /* subtarget */,
   emit_insn (pat);
   return target;
 }
+
diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
index ce40600e803..c0441f5e27f 100644
--- a/gcc/config/rs6000/rs6000-builtins.def
+++ b/gcc/config/rs6000/rs6000-builtins.def
@@ -2810,6 +2810,11 @@
       __builtin_vsx_scalar_cmp_exp_qp_unordered (_Float128, _Float128);
     VSCEQPUO xscmpexpqp_unordered_kf {}

+  vf __builtin_vsx_lxvwsx (signed long, const float *);
+    LXVWSX_V4SF vsx_splat_v4sf {ldvec}
+
+  void __builtin_vsx_stxsiwx (vf, signed long, const float *);
+    STXSIWX_V4SF vsx_stxsiwx_v4sf {stvec}

 ; Miscellaneous P9 functions
 [power9]
diff --git a/gcc/config/rs6000/rs6000-p10sfopt.cc b/gcc/config/rs6000/rs6000-p10sfopt.cc
new file mode 100644
index 00000000000..6e1d90fd93e
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-p10sfopt.cc
@@ -0,0 +1,950 @@
+/* Subroutines used to mitigate the single precision floating point
+   load and conversion performance issue by replacing scalar
+   single precision floating point operations with the corresponding
+   vector operations where appropriate.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* The pass starts processing from scalar SP loads: it first checks
+whether a load is valid, then checks all the stmts using its loaded
+result, and propagates from them.  This propagation is mainly done
+by function visit_stmt, which first checks the validity of the given
+stmt, then recursively checks the feeders of its use operands with
+visit_stmt, and finally recursively checks all the stmts using its
+def with visit_stmt.  The purpose is to ensure that all propagated
+stmts can be transformed into their equivalent vector operations.
+Some special operands, such as constants or SSA names with GIMPLE_NOP
+defs, are recorded as splatting candidates.  The validity checks
+include: whether the addressing mode can satisfy the index form with
+some adjustments, whether the corresponding vector operation is
+supported, and so on.  Once all stmts propagated from one load are
+valid, they are transformed by function transform_stmt, respecting
+the information in stmt_info such as sf_type, new_ops etc.
+
+For example, for the below test case:
+
+  _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1];  // stmt1
+  _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1];  // stmt2
+  _8 = .FMA (_4, a_14(D), _7);                   // stmt3
+  MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8;  // stmt4
+
+The processing starts from stmt1, which is taken as valid and added
+into the chain; it then processes its use stmt stmt3, also valid,
+iterating over its operands: _4 whose def is stmt1 (already visited),
+a_14 which needs splatting, and _7 whose def stmt2 still needs
+processing.  Then stmt2 is taken as a valid load and added into the
+chain.  With all operands _4, a_14 and _7 of stmt3 processed, stmt3
+is added into the chain as well.  Next it processes the use stmts of
+_8 (the result of stmt3), so it checks stmt4, a valid store.  Since
+all these involved stmts are valid to be transformed, we finally get:
+
+  sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
+  sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
+  sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
+  sf_20 = .FMA (sf_5, sf_22, sf_25);
+  __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
+*/
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tm_p.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "optabs-tree.h"
+#include "fold-const.h"
+#include "tree-eh.h"
+#include "gimple-iterator.h"
+#include "gimple-fold.h"
+#include "stor-layout.h"
+#include "tree-ssa.h"
+#include "tree-ssa-address.h"
+#include "tree-cfg.h"
+#include "cfgloop.h"
+#include "tree-vectorizer.h"
+#include "builtins.h"
+#include "internal-fn.h"
+#include "gimple-pretty-print.h"
+#include "predict.h"
+#include "rs6000-internal.h" /* for rs6000_builtin_decls  */
+
+namespace {
+
+/* Single precision floating point operation types.
+
+   So far, we only take care of load, store, ifn call, phi,
+   normal arithmetic, comparison and special operations.
+   Normally for an involved statement, we process all
+   statements which use its result and all statements which
+   define its operands, and propagate further; but for some
+   special assignment statements, we don't want to process
+   them this way but just splat them instead, so we adopt
+   SF_SPECIAL for this kind of statement; for now it's only
+   for to-float conversion assignments.  */
+enum sf_type
+{
+  SF_LOAD,
+  SF_STORE,
+  SF_CALL,
+  SF_PHI,
+  SF_NORMAL,
+  SF_COMPARE,
+  SF_SPECIAL
+};
+
+/* Hold some information for a gimple statement which is valid
+   to be promoted from scalar operation to vector operation.  */
+
+class stmt_info
+{
+public:
+  stmt_info (gimple *s, sf_type t, bitmap bm)
+  {
+    stmt = s;
+    type = t;
+    splat_ops = BITMAP_ALLOC (NULL);
+    if (bm)
+      bitmap_copy (splat_ops, bm);
+
+    unsigned nops = gimple_num_args (stmt);
+    new_ops.create (nops);
+    new_ops.safe_grow_cleared (nops);
+    replace_stmt = NULL;
+    gphi_res = NULL_TREE;
+  }
+
+  ~stmt_info ()
+  {
+    BITMAP_FREE (splat_ops);
+    new_ops.release ();
+  }
+
+  /* The stmt this info is for.  */
+  gimple *stmt;
+  /* The sf_type of the current stmt.  */
+  enum sf_type type;
+  /* Bitmap used to indicate which op needs to be splatted.  */
+  bitmap splat_ops;
+  /* New operands used to build new stmt.  */
+  vec<tree> new_ops;
+  /* New stmt used to replace the current stmt.  */
+  gimple *replace_stmt;
+  /* Hold new gphi result which is created early.  */
+  tree gphi_res;
+};
+
+typedef stmt_info *stmt_info_p;
+typedef hash_map<gimple *, stmt_info_p> info_map_t;
+static info_map_t *stmt_info_map;
+
+/* Like the comments for SF_SPECIAL above, for some special
+   assignment statement (to-float conversion assignment
+   here), we don't want to do the heavy processing but just
+   want to generate a splatting for it instead.  Return
+   true if the given STMT is special (to-float conversion
+   for now), otherwise return false.  */
+
+static bool
+special_assign_p (gimple *stmt)
+{
+  gcc_assert (gimple_code (stmt) == GIMPLE_ASSIGN);
+  enum tree_code code = gimple_assign_rhs_code (stmt);
+  if (code == FLOAT_EXPR)
+    return true;
+  return false;
+}
+
+/* Make base and index fields from the memory reference REF,
+   return true and set *BASEP and *INDEXP respectively if it
+   is successful, otherwise return false.  Since the
+   transformed vector load (lxvwsx) and vector store (stxsiwx)
+   only support the reg + reg addressing mode, we need to ensure
+   the address satisfies it first.  */
+
+static bool
+make_base_and_index (tree ref, tree *basep, tree *indexp)
+{
+  if (DECL_P (ref))
+    {
+      *basep
+	= fold_build1 (ADDR_EXPR, build_pointer_type (float32_type_node), ref);
+      *indexp = size_zero_node;
+      return true;
+    }
+
+  enum tree_code code = TREE_CODE (ref);
+  if (code == TARGET_MEM_REF)
+    {
+      struct mem_address addr;
+      get_address_description (ref, &addr);
+      gcc_assert (!addr.step);
+      *basep = addr.symbol ? addr.symbol : addr.base;
+      if (addr.index)
+	{
+	  /* Give up if having both offset and index, theoretically
+	     we can generate one insn to update base with index, but
+	     it results in more cost, so leave it conservatively.  */
+	  if (!integer_zerop (addr.offset))
+	    return false;
+	  *indexp = addr.index;
+	}
+      else
+	*indexp = addr.offset;
+      return true;
+    }
+
+  if (code == MEM_REF)
+    {
+      *basep = TREE_OPERAND (ref, 0);
+      tree op1 = TREE_OPERAND (ref, 1);
+      *indexp = op1 ? op1 : size_zero_node;
+      return true;
+    }
+
+  if (handled_component_p (ref))
+    {
+      machine_mode mode1;
+      poly_int64 bitsize, bitpos;
+      tree offset;
+      int reversep = 0, volatilep = 0, unsignedp = 0;
+      tree tem = get_inner_reference (ref, &bitsize, &bitpos, &offset, &mode1,
+				      &unsignedp, &reversep, &volatilep);
+      if (reversep)
+	return false;
+
+      poly_int64 bytepos = exact_div (bitpos, BITS_PER_UNIT);
+      if (offset)
+	{
+	  gcc_assert (!integer_zerop (offset));
+	  /* Give up if having both offset and bytepos.  */
+	  if (maybe_ne (bytepos, 0))
+	    return false;
+	  if (!is_gimple_variable (offset))
+	    return false;
+	}
+
+      tree base1, index1;
+      /* Further check the inner ref.  */
+      if (!make_base_and_index (tem, &base1, &index1))
+	return false;
+
+      if (integer_zerop (index1))
+	{
+	  /* Only need to consider base1 and offset/bytepos.  */
+	  *basep = base1;
+	  *indexp = offset ? offset : wide_int_to_tree (sizetype, bytepos);
+	  return true;
+	}
+      /* Give up if having offset and index1.  */
+      if (offset)
+	return false;
+      /* Give up if bytepos and index1 can not be folded.  */
+      if (!poly_int_tree_p (index1))
+	return false;
+      poly_offset_int new_off
+	= wi::sext (wi::to_poly_offset (index1), TYPE_PRECISION (sizetype));
+      new_off += bytepos;
+
+      poly_int64 new_index;
+      if (!new_off.to_shwi (&new_index))
+	return false;
+
+      *basep = base1;
+      *indexp = wide_int_to_tree (sizetype, new_index);
+      return true;
+    }
+
+  if (TREE_CODE (ref) == SSA_NAME)
+    {
+      /* Inner ref can come from a load.  */
+      gimple *def = SSA_NAME_DEF_STMT (ref);
+      if (!gimple_assign_single_p (def))
+	return false;
+      tree ref1 = gimple_assign_rhs1 (def);
+      if (!DECL_P (ref1) && !REFERENCE_CLASS_P (ref1))
+	return false;
+
+      tree base1, offset1;
+      if (!make_base_and_index (ref1, &base1, &offset1))
+	return false;
+      *basep = base1;
+      *indexp = offset1;
+      return true;
+    }
+
+  return false;
+}
+
+/* Check STMT is an expected SP float load or store, return true
+   if it is and update IS_LOAD, otherwise return false.  */
+
+static bool
+valid_load_store_p (gimple *stmt, bool &is_load)
+{
+  if (!gimple_assign_single_p (stmt))
+    return false;
+
+  tree lhs = gimple_assign_lhs (stmt);
+  if (TYPE_MODE (TREE_TYPE (lhs)) != SFmode)
+    return false;
+
+  tree rhs = gimple_assign_rhs1 (stmt);
+  tree base, index;
+  if (TREE_CODE (lhs) == SSA_NAME
+      && (DECL_P (rhs) || REFERENCE_CLASS_P (rhs))
+      && make_base_and_index (rhs, &base, &index))
+    {
+      is_load = true;
+      return true;
+    }
+
+  if ((DECL_P (lhs) || REFERENCE_CLASS_P (lhs))
+      && make_base_and_index (lhs, &base, &index))
+    {
+      is_load = false;
+      return true;
+    }
+
+  return false;
+}
+
+/* Check if it's valid to update the given STMT with the
+   equivalent vector form, return true if yes and also set
+   SF_TYPE to the proper sf_type, otherwise return false.  */
+
+static bool
+is_valid (gimple *stmt, enum sf_type &sf_type)
+{
+  /* Give up if it has volatile type.  */
+  if (gimple_has_volatile_ops (stmt))
+    return false;
+
+  /* Give up if it can throw an exception.  */
+  if (stmt_can_throw_internal (cfun, stmt))
+    return false;
+
+  /* Process phi.  */
+  gphi *gp = dyn_cast<gphi *> (stmt);
+  if (gp)
+    {
+      sf_type = SF_PHI;
+      return true;
+    }
+
+  /* Process assignment.  */
+  gassign *gass = dyn_cast<gassign *> (stmt);
+  if (gass)
+    {
+      bool is_load = false;
+      if (valid_load_store_p (stmt, is_load))
+	{
+	  sf_type = is_load ? SF_LOAD : SF_STORE;
+	  return true;
+	}
+
+      tree lhs = gimple_assign_lhs (stmt);
+      if (!lhs || TREE_CODE (lhs) != SSA_NAME)
+	return false;
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      if (TREE_CODE_CLASS (code) == tcc_comparison)
+	{
+	  tree rhs1 = gimple_assign_rhs1 (stmt);
+	  tree rhs2 = gimple_assign_rhs2 (stmt);
+	  tree type = TREE_TYPE (lhs);
+	  if (!VECT_SCALAR_BOOLEAN_TYPE_P (type))
+	    return false;
+	  if (TYPE_MODE (type) != QImode)
+	    return false;
+	  type = TREE_TYPE (rhs1);
+	  if (TYPE_MODE (type) != SFmode)
+	    return false;
+	  gcc_assert (TYPE_MODE (TREE_TYPE (rhs2)) == SFmode);
+	  sf_type = SF_COMPARE;
+	  return true;
+	}
+
+      tree type = TREE_TYPE (lhs);
+      if (TYPE_MODE (type) != SFmode)
+	return false;
+
+      if (special_assign_p (stmt))
+	{
+	  sf_type = SF_SPECIAL;
+	  return true;
+	}
+
+      /* Check if vector operation is supported.  */
+      sf_type = SF_NORMAL;
+      tree vectype = build_vector_type_for_mode (type, V4SFmode);
+      optab optab = optab_for_tree_code (code, vectype, optab_default);
+      if (!optab)
+	return false;
+      return optab_handler (optab, V4SFmode) != CODE_FOR_nothing;
+    }
+
+  /* Process call.  */
+  gcall *gc = dyn_cast<gcall *> (stmt);
+  /* TODO: Extend this to cover some other bifs.  */
+  if (gc && gimple_call_internal_p (gc))
+    {
+      tree lhs = gimple_call_lhs (stmt);
+      if (!lhs)
+	return false;
+      if (TREE_CODE (lhs) != SSA_NAME)
+	return false;
+      tree type = TREE_TYPE (lhs);
+      if (TYPE_MODE (type) != SFmode)
+	return false;
+      enum internal_fn ifn = gimple_call_internal_fn (stmt);
+      tree vectype = build_vector_type_for_mode (type, V4SFmode);
+      if (direct_internal_fn_p (ifn))
+	{
+	  const direct_internal_fn_info &info = direct_internal_fn (ifn);
+	  if (info.vectorizable
+	      && (direct_internal_fn_supported_p (ifn,
+						  tree_pair (vectype, vectype),
+						  OPTIMIZE_FOR_SPEED)))
+	    {
+	      sf_type = SF_CALL;
+	      return true;
+	    }
+	}
+    }
+
+  return false;
+}
+
+/* Process the given STMT, if it's visited before, just return true.
+   If it's the first time to visit this, set VISITED and check if
+   the below ones are valid to be optimized with vector operation:
+     - itself
+     - all statements which define the operands involved here
+     - all statements which use the result of STMT
+   If all are valid, add STMT into CHAIN, create its own stmt_info
+   and return true.  Otherwise, return false.  */
+
+static bool
+visit_stmt (gimple *stmt, vec<gimple *> &chain, hash_set<gimple *> &visited)
+{
+  if (visited.add (stmt))
+    {
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "Stmt visited: %G", stmt);
+      return true;
+    }
+  else if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, "Visiting stmt: %G", stmt);
+
+  /* Checking this statement is valid for this optimization.  */
+  enum sf_type st_type;
+  if (!is_valid (stmt, st_type))
+    {
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "Invalid stmt: %G", stmt);
+      return false;
+    }
+
+  /* For store, it's the end of this chain, don't need to
+     process anything further.  For special assignment, we
+     don't want to process all statements using its result
+     and all statements defining its operands.  */
+  if (st_type == SF_STORE || st_type == SF_SPECIAL)
+    {
+      chain.safe_push (stmt);
+      stmt_info_p si = new stmt_info (stmt, st_type, NULL);
+      stmt_info_map->put (stmt, si);
+      return true;
+    }
+
+  /* Check all feeders of operands involved here.  */
+
+  /* Indicate which operand needs to be splatted, such as: constant.  */
+  auto_bitmap splat_bm;
+  if (st_type != SF_LOAD)
+    {
+      unsigned nops = gimple_num_args (stmt);
+      for (unsigned i = 0; i < nops; i++)
+	{
+	  tree op = gimple_arg (stmt, i);
+	  if (TREE_CODE (op) != SSA_NAME
+	      && TREE_CODE (op) != REAL_CST)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf (MSG_NOTE, "With problematic %T in stmt: %G", op,
+			     stmt);
+	      return false;
+	    }
+
+	  bool need_splat = false;
+	  if (TREE_CODE (op) == SSA_NAME)
+	    {
+	      gimple *op_stmt = SSA_NAME_DEF_STMT (op);
+	      if (gimple_code (op_stmt) == GIMPLE_NOP)
+		need_splat = true;
+	      else if (!visit_stmt (op_stmt, chain, visited))
+		return false;
+	    }
+	  else
+	    {
+	      gcc_assert (TREE_CODE (op) == REAL_CST);
+	      need_splat = true;
+	    }
+
+	  if (need_splat)
+	    bitmap_set_bit (splat_bm, i);
+	}
+    }
+
+  /* Push this stmt before all its use stmts, then it's transformed
+     first during the transform phase, new_ops are prepared when
+     transforming use stmts.  */
+  chain.safe_push (stmt);
+
+  /* Comparison may have some constant operand, we need the above
+     handlings on splatting, but don't need any further processing
+     on all uses of its result.  */
+  if (st_type == SF_COMPARE)
+    {
+      stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
+      stmt_info_map->put (stmt, si);
+      return true;
+    }
+
+  /* Process each use of definition.  */
+  gimple *use_stmt;
+  imm_use_iterator iter;
+  tree lhs = gimple_get_lhs (stmt);
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+    if (!visit_stmt (use_stmt, chain, visited))
+      return false;
+
+  /* Create the corresponding stmt_info.  */
+  stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
+  stmt_info_map->put (stmt, si);
+  return true;
+}
+
+/* Tree NEW_LHS with vector type has been used to replace the
+   original tree LHS; for each use of LHS, find each use stmt
+   and its corresponding stmt_info, and update its new_ops array
+   accordingly to prepare for the later replacement.  */
+
+static void
+update_all_uses (tree lhs, tree new_lhs, sf_type type)
+{
+  gimple *use_stmt;
+  imm_use_iterator iter;
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+    {
+      stmt_info_p *slot = stmt_info_map->get (use_stmt);
+      /* Each use stmt should have been processed, except
+	 for SF_SPECIAL stmts, for which we stop
+	 processing early.  */
+      gcc_assert (slot || type == SF_SPECIAL);
+      if (!slot)
+	continue;
+      stmt_info_p info = *slot;
+      unsigned n = gimple_num_args (use_stmt);
+      for (unsigned i = 0; i < n; i++)
+	if (gimple_arg (use_stmt, i) == lhs)
+	  info->new_ops[i] = new_lhs;
+    }
+}
+
+/* Remove old STMT and insert NEW_STMT before.  */
+
+static void
+replace_stmt (gimple_stmt_iterator *gsi_ptr, gimple *stmt, gimple *new_stmt)
+{
+  gimple_set_location (new_stmt, gimple_location (stmt));
+  gimple_move_vops (new_stmt, stmt);
+  gsi_insert_before (gsi_ptr, new_stmt, GSI_SAME_STMT);
+  gsi_remove (gsi_ptr, true);
+}
+
+/* Transform the given STMT to its vector form; only transform a
+   phi stmt if HANDLE_PHI_P is true, since there can be def-use
+   cycles for phis and we transform them in a second round.  */
+
+static void
+transform_stmt (gimple *stmt, bool handle_phi_p = false)
+{
+  stmt_info_p info = *stmt_info_map->get (stmt);
+
+  /* This statement has been replaced.  */
+  if (info->replace_stmt)
+    return;
+
+  gcc_assert (!handle_phi_p || gimple_code (stmt) == GIMPLE_PHI);
+
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, " Transforming stmt: %G", stmt);
+
+  tree lhs = gimple_get_lhs (stmt);
+  tree type = float_type_node;
+  tree vectype = build_vector_type_for_mode (type, V4SFmode);
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, " info->type: %d\n", info->type);
+
+  /* Replace load with bif __builtin_vsx_lxvwsx.  */
+  if (info->type == SF_LOAD)
+    {
+      tree fndecl = rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
+      tree rhs = gimple_op (stmt, 1);
+      tree base, index;
+      bool mem_p = make_base_and_index (rhs, &base, &index);
+      gcc_assert (mem_p);
+      gimple *load = gimple_build_call (fndecl, 2, index, base);
+      tree res = make_temp_ssa_name (vectype, NULL, "sf");
+      gimple_call_set_lhs (load, res);
+      info->replace_stmt = load;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "  => Gen load: %G", load);
+      update_all_uses (lhs, res, info->type);
+      replace_stmt (&gsi, stmt, load);
+      return;
+    }
+
+  /* Replace store with bif __builtin_vsx_stxsiwx.  */
+  if (info->type == SF_STORE)
+    {
+      tree fndecl = rs6000_builtin_decls[RS6000_BIF_STXSIWX_V4SF];
+      tree base, index;
+      bool mem_p = make_base_and_index (lhs, &base, &index);
+      gcc_assert (mem_p);
+      gcc_assert (info->new_ops[0]);
+      gimple *store
+	= gimple_build_call (fndecl, 3, info->new_ops[0], index, base);
+      info->replace_stmt = store;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "  => Gen store: %G", store);
+      replace_stmt (&gsi, stmt, store);
+      return;
+    }
+
+  /* Generate vector construction for special stmt.  */
+  if (info->type == SF_SPECIAL)
+    {
+      tree op = gimple_get_lhs (stmt);
+      tree val = build_vector_from_val (vectype, op);
+      tree res = make_temp_ssa_name (vectype, NULL, "sf");
+      gimple *splat = gimple_build_assign (res, val);
+      gimple_set_location (splat, gimple_location (stmt));
+      gsi_insert_after (&gsi, splat, GSI_SAME_STMT);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "  => Gen special %G", splat);
+      update_all_uses (lhs, res, info->type);
+      info->replace_stmt = splat;
+      return;
+    }
+
+  /* Handle the operands which don't have a corresponding vector
+     operand yet, like those which need splatting etc.  */
+  unsigned nargs = gimple_num_args (stmt);
+  gphi *phi = dyn_cast<gphi *> (stmt);
+  for (unsigned i = 0; i < nargs; i++)
+    {
+      /* This operand already has the replacing one.  */
+      if (info->new_ops[i])
+	continue;
+      /* When only handling phi, all operands should have the
+	 prepared new_op.  */
+      gcc_assert (!handle_phi_p);
+      tree op = gimple_arg (stmt, i);
+      /* This operand needs splatting.  */
+      if (bitmap_bit_p (info->splat_ops, i))
+	{
+	  tree val = build_vector_from_val (vectype, op);
+	  tree res = make_temp_ssa_name (vectype, NULL, "sf");
+	  gimple *splat = gimple_build_assign (res, val);
+	  /* If it's a PHI, push it to its incoming block.  */
+	  if (phi)
+	    {
+	      basic_block src = gimple_phi_arg_edge (phi, i)->src;
+	      gimple_stmt_iterator src_gsi = gsi_last_bb (src);
+	      if (!gsi_end_p (src_gsi) && stmt_ends_bb_p (gsi_stmt (src_gsi)))
+		gsi_insert_before (&src_gsi, splat, GSI_SAME_STMT);
+	      else
+		gsi_insert_after (&src_gsi, splat, GSI_NEW_STMT);
+	    }
+	  else
+	    gsi_insert_before (&gsi, splat, GSI_SAME_STMT);
+	  info->new_ops[i] = res;
+	  bitmap_clear_bit (info->splat_ops, i);
+	}
+      else
+	{
+	  gcc_assert (TREE_CODE (op) == SSA_NAME);
+	  /* Ensure all operands have the replacing new_op, except
+	     for phi stmts.  */
+	  if (!phi)
+	    {
+	      gimple *def = SSA_NAME_DEF_STMT (op);
+	      transform_stmt (def);
+	      gcc_assert (info->new_ops[i]);
+	    }
+	}
+    }
+
+  gimple *new_stmt;
+  tree res;
+  if (info->type == SF_PHI)
+    {
+      /* At the first time, ensure phi result is prepared and all its
+	 use stmt can be transformed well.  */
+      if (!handle_phi_p)
+	{
+	  res = info->gphi_res;
+	  if (!res)
+	    {
+	      res = make_temp_ssa_name (vectype, NULL, "sf");
+	      info->gphi_res = res;
+	    }
+	  update_all_uses (lhs, res, info->type);
+	  return;
+	}
+      /* Transform actually at the second time.  */
+      basic_block bb = gimple_bb (stmt);
+      gphi *new_phi = create_phi_node (info->gphi_res, bb);
+      for (unsigned i = 0; i < nargs; i++)
+	{
+	  location_t loc = gimple_phi_arg_location (phi, i);
+	  edge e = gimple_phi_arg_edge (phi, i);
+	  add_phi_arg (new_phi, info->new_ops[i], e, loc);
+	}
+      gimple_set_location (new_phi, gimple_location (stmt));
+      remove_phi_node (&gsi, true);
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "  => Gen phi %G", (gimple *) new_phi);
+      return;
+    }
+
+  if (info->type == SF_COMPARE)
+    {
+      /* Build a vector comparison.  */
+      tree vectype1 = truth_type_for (vectype);
+      tree res1 = make_temp_ssa_name (vectype1, NULL, "sf_vb4");
+      enum tree_code subcode = gimple_assign_rhs_code (stmt);
+      gimple *new_stmt1 = gimple_build_assign (res1, subcode, info->new_ops[0],
+					       info->new_ops[1]);
+      gsi_insert_before (&gsi, new_stmt1, GSI_SAME_STMT);
+
+      /* Build a VEC_COND_EXPR with -1 (true) or 0 (false).  */
+      tree vectype2 = build_vector_type_for_mode (intSI_type_node, V4SImode);
+      tree res2 = make_temp_ssa_name (vectype2, NULL, "sf_vi4");
+      tree minus_one_vec = build_minus_one_cst (vectype2);
+      tree zero_vec = build_zero_cst (vectype2);
+      gimple *new_stmt2 = gimple_build_assign (res2, VEC_COND_EXPR, res1,
+					       minus_one_vec, zero_vec);
+      gsi_insert_before (&gsi, new_stmt2, GSI_SAME_STMT);
+
+      /* Build a BIT_FIELD_REF to extract lane 1 (BE ordering).  */
+      tree bfr = build3 (BIT_FIELD_REF, intSI_type_node, res2, bitsize_int (32),
+			 bitsize_int (BYTES_BIG_ENDIAN ? 32 : 64));
+      tree res3 = make_temp_ssa_name (intSI_type_node, NULL, "sf_i4");
+      gimple *new_stmt3 = gimple_build_assign (res3, BIT_FIELD_REF, bfr);
+      gsi_insert_before (&gsi, new_stmt3, GSI_SAME_STMT);
+
+      /* Convert it accordingly.  */
+      gimple *new_stmt = gimple_build_assign (lhs, NOP_EXPR, res3);
+
+      if (dump_enabled_p ())
+	{
+	  dump_printf (MSG_NOTE, "  => Gen comparison: %G",
+		       (gimple *) new_stmt1);
+	  dump_printf (MSG_NOTE, "                     %G",
+		       (gimple *) new_stmt2);
+	  dump_printf (MSG_NOTE, "                     %G",
+		       (gimple *) new_stmt3);
+	  dump_printf (MSG_NOTE, "                     %G",
+		       (gimple *) new_stmt);
+	}
+      gsi_replace (&gsi, new_stmt, false);
+      info->replace_stmt = new_stmt;
+      return;
+    }
+
+  if (info->type == SF_CALL)
+    {
+      res = make_temp_ssa_name (vectype, NULL, "sf");
+      enum internal_fn ifn = gimple_call_internal_fn (stmt);
+      new_stmt = gimple_build_call_internal_vec (ifn, info->new_ops);
+      gimple_call_set_lhs (new_stmt, res);
+    }
+  else
+    {
+      gcc_assert (info->type == SF_NORMAL);
+      enum tree_code subcode = gimple_assign_rhs_code (stmt);
+      res = make_temp_ssa_name (vectype, NULL, "sf");
+
+      if (nargs == 1)
+	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0]);
+      else if (nargs == 2)
+	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
+					info->new_ops[1]);
+      else
+	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
+					info->new_ops[1], info->new_ops[2]);
+    }
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, "  => Gen call/normal %G", new_stmt);
+  update_all_uses (lhs, res, info->type);
+  info->replace_stmt = new_stmt;
+  replace_stmt (&gsi, stmt, new_stmt);
+}
+
+/* Start from load STMT, find and check that all related statements
+   are valid to be optimized into vector operations, and transform
+   all of them if that succeeds.  */
+
+static void
+process_chain_from_load (gimple *stmt)
+{
+  auto_vec<gimple *> chain;
+  hash_set<gimple *> visited;
+
+  /* Load is the first of its chain.  */
+  chain.safe_push (stmt);
+  visited.add (stmt);
+
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, "\nDetecting the chain from %G", stmt);
+
+  gimple *use_stmt;
+  imm_use_iterator iter;
+  tree lhs = gimple_assign_lhs (stmt);
+  /* Propagate from uses of load result.  */
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+    /* Fail if encountering anything unexpected.  */
+    if (!visit_stmt (use_stmt, chain, visited))
+      return;
+
+  if (dump_enabled_p ())
+    {
+      dump_printf (MSG_NOTE, "Found a chain from load %G", stmt);
+      for (gimple *s : chain)
+	dump_printf (MSG_NOTE, "  -> %G", s);
+      dump_printf (MSG_NOTE, "\n");
+    }
+
+  /* Create stmt info for this load.  */
+  stmt_info_p si = new stmt_info (stmt, SF_LOAD, NULL);
+  stmt_info_map->put (stmt, si);
+
+  /* Transform the chain.  */
+  for (gimple *stmt : chain)
+    transform_stmt (stmt, false);
+  /* Handle the remaining phis.  */
+  for (gimple *stmt : chain)
+    if (gimple_code (stmt) == GIMPLE_PHI)
+      transform_stmt (stmt, true);
+}
+
+const pass_data pass_data_rs6000_p10sfopt = {
+  GIMPLE_PASS,	     /* type */
+  "rs6000_p10sfopt", /* name */
+  OPTGROUP_NONE,     /* optinfo_flags */
+  TV_NONE,	     /* tv_id */
+  PROP_ssa,	     /* properties_required */
+  0,		     /* properties_provided */
+  0,		     /* properties_destroyed */
+  0,		     /* todo_flags_start */
+  TODO_update_ssa,   /* todo_flags_finish */
+};
+
+class pass_rs6000_p10sfopt : public gimple_opt_pass
+{
+public:
+  pass_rs6000_p10sfopt (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_rs6000_p10sfopt, ctxt)
+  {
+  }
+
+  bool
+  gate (function *fun) final override
+    {
+      /* Not every FE initializes target built-ins, so we need to
+	 ensure the lxvwsx_v4sf decl is supported, and we can't do
+	 this check in rs6000_option_override_internal since the
+	 bif decls are uninitialized at that time.  */
+      return TARGET_P10_SF_OPT
+	     && optimize
+	     && optimize_function_for_speed_p (fun)
+	     && rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
+    }
+
+  unsigned int execute (function *) final override;
+
+}; /* end of class pass_rs6000_p10sfopt  */
+
+unsigned int
+pass_rs6000_p10sfopt::execute (function *fun)
+{
+  stmt_info_map = new hash_map<gimple *, stmt_info_p>;
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      for (gimple_stmt_iterator gsi = gsi_start_nondebug_after_labels_bb (bb);
+	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  switch (gimple_code (stmt))
+	    {
+	    case GIMPLE_ASSIGN:
+	      if (gimple_assign_single_p (stmt))
+		{
+		  bool is_load = false;
+		  if (!stmt_info_map->get (stmt)
+		      && valid_load_store_p (stmt, is_load)
+		      && is_load)
+		    process_chain_from_load (stmt);
+		}
+	      break;
+	    default:
+	      break;
+	    }
+	}
+    }
+
+  for (info_map_t::iterator it = stmt_info_map->begin ();
+       it != stmt_info_map->end (); ++it)
+    {
+      stmt_info_p info = (*it).second;
+      delete info;
+    }
+  delete stmt_info_map;
+
+  return 0;
+}
+
+}
+
+gimple_opt_pass *
+make_pass_rs6000_p10sfopt (gcc::context *ctxt)
+{
+  return new pass_rs6000_p10sfopt (ctxt);
+}
+
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index ca899d5f7af..bc59a7d5f99 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -24,6 +24,11 @@ along with GCC; see the file COPYING3.  If not see
    REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
  */

+  /* Pass to mitigate the performance issue of scalar single precision
+     floating point loads, by replacing some scalar single precision
+     floating point operations with appropriate vector operations.  */
+  INSERT_PASS_BEFORE (pass_gimple_isel, 1, pass_rs6000_p10sfopt);
+
   /* Pass to add the appropriate vector swaps on power8 little endian systems.
      The power8 does not have instructions that automaticaly do the byte swaps
      for loads and stores.  */
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index f70118ea40f..aa0f782f186 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -341,9 +341,11 @@ extern unsigned rs6000_linux_libm_function_max_error (unsigned, machine_mode,
 /* Pass management.  */
 namespace gcc { class context; }
 class rtl_opt_pass;
+class gimple_opt_pass;

 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
+extern gimple_opt_pass *make_pass_rs6000_p10sfopt (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index cc24dd5301e..0e36860a73e 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -4254,6 +4254,22 @@ rs6000_option_override_internal (bool global_init_p)
       rs6000_isa_flags &= ~OPTION_MASK_PCREL;
     }

+  if (TARGET_P10_SF_OPT)
+    {
+      if (!TARGET_HARD_FLOAT)
+	{
+	  if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
+	    error ("%qs requires %qs", "-mp10-sf-opt", "-mhard-float");
+	  rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
+	}
+      if (!TARGET_P9_VECTOR)
+	{
+	  if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
+	    error ("%qs requires %qs", "-mp10-sf-opt", "-mpower9-vector");
+	  rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
+	}
+    }
+
   /* Print the options after updating the defaults.  */
   if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
     rs6000_print_isa_options (stderr, 0, "after defaults", rs6000_isa_flags);
@@ -22301,6 +22317,17 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
       *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
       if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
 	*total += COSTS_N_INSNS (100);
+      /* Treat vec_duplicate specially here: since the vector splat insns
+	 {l,st}xv[wd]sx only support x-form, we should ensure reg + reg
+	 is preferred over reg + const, otherwise cprop will propagate
+	 the const and result in sub-optimal code.  */
+      if (outer_code == VEC_DUPLICATE
+	  && (GET_MODE_SIZE (mode) == 4
+	    || GET_MODE_SIZE (mode) == 8)
+	  && GET_CODE (XEXP (x, 0)) == PLUS
+	  && CONST_INT_P (XEXP (XEXP (x, 0), 1))
+	  && REG_P (XEXP (XEXP (x, 0), 0)))
+	*total += COSTS_N_INSNS (1);
       return true;

     case LABEL_REF:
@@ -24443,6 +24470,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
   { "modulo",			OPTION_MASK_MODULO,		false, true  },
   { "mulhw",			OPTION_MASK_MULHW,		false, true  },
   { "multiple",			OPTION_MASK_MULTIPLE,		false, true  },
+  { "p10-sf-opt",		OPTION_MASK_P10_SF_OPT,		false, true  },
   { "pcrel",			OPTION_MASK_PCREL,		false, true  },
   { "pcrel-opt",		OPTION_MASK_PCREL_OPT,		false, true  },
   { "popcntb",			OPTION_MASK_POPCNTB,		false, true  },
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index bde6d3ff664..fb50e00d0d9 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -597,6 +597,11 @@ mmma
 Target Mask(MMA) Var(rs6000_isa_flags)
 Generate (do not generate) MMA instructions.

+mp10-sf-opt
+Target Mask(P10_SF_OPT) Var(rs6000_isa_flags)
+Generate code to mitigate the single-precision floating point load
+performance issue.
+
 mrelative-jumptables
 Target Undocumented Var(rs6000_relative_jumptables) Init(1) Save

diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
index f183b42ce1d..e7cd6d2f694 100644
--- a/gcc/config/rs6000/t-rs6000
+++ b/gcc/config/rs6000/t-rs6000
@@ -35,6 +35,10 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)

+rs6000-p10sfopt.o: $(srcdir)/config/rs6000/rs6000-p10sfopt.cc
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
 rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index f3b40229094..690318e82b2 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -6690,3 +6690,14 @@ (define_insn "vmsumcud"
   "vmsumcud %0,%1,%2,%3"
   [(set_attr "type" "veccomplex")]
 )
+
+;; For expanding internal use bif __builtin_vsx_stxsiwx
+(define_insn "vsx_stxsiwx_v4sf"
+ [(set (match_operand:SF 0 "memory_operand" "=Z")
+       (unspec:SF
+	  [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
+	  UNSPEC_STFIWX))]
+ "TARGET_P9_VECTOR"
+ "stxsiwx %x1,%y0"
+ [(set_attr "type" "fpstore")])
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
new file mode 100644
index 00000000000..6e8c6a84de6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
@@ -0,0 +1,22 @@
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
+
+/* Verify the Power10 SP floating point load perf mitigation works
+   as expected on a case with normal arithmetic.  */
+
+void
+saxpy (int n, float a, float *restrict x, float *restrict y)
+{
+#pragma GCC unroll 1
+  for (int i = 0; i < n; ++i)
+    y[i] = a * x[i] + y[i];
+}
+
+/* Checking lfsx -> lxvwsx, stfsx -> stxsiwx, fmadds -> xvmaddmsp etc.  */
+/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mstxsiwx\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mxvmaddmsp\M} 1 } } */
+/* { dg-final { scan-assembler-not {\mlfsx?\M} } } */
+/* { dg-final { scan-assembler-not {\mstfsx?\M} } } */
+/* { dg-final { scan-assembler-not {\mfmadds\M} } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
new file mode 100644
index 00000000000..7593da8ecf4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
@@ -0,0 +1,34 @@
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
+
+/* Verify the Power10 SP floating point load perf mitigation works
+   as expected on a case with reduction.  */
+
+/* Partially reduced from pytorch batch_norm_kernel.cpp.  */
+
+typedef long long int64_t;
+typedef float accscalar_t;
+typedef float scalar_t;
+
+void
+foo (int64_t n1, int64_t n2, accscalar_t sum, int64_t bound, int64_t N,
+     scalar_t *input_data, scalar_t *var_sum_data, int64_t index)
+{
+  scalar_t mean = sum / N;
+  accscalar_t _var_sum = 0;
+  for (int64_t c = 0; c < n1; c++)
+    {
+      for (int64_t i = 0; i < n2; i++)
+	{
+	  int64_t offset = index + i;
+	  scalar_t x = input_data[offset];
+	  _var_sum += (x - mean) * (x - mean);
+	}
+      var_sum_data[c] = _var_sum;
+    }
+}
+
+/* { dg-final { scan-assembler {\mlxvwsx\M} } } */
+/* { dg-final { scan-assembler {\mstxsiwx\M} } } */
+/* { dg-final { scan-assembler {\mxvmaddasp\M} } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
new file mode 100644
index 00000000000..38aedd00faa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
@@ -0,0 +1,43 @@
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
+
+/* Verify the Power10 SP floating point load perf mitigation works
+   as expected on a case with comparison.  */
+
+/* Partially reduced from xgboost cpu_predictor.cc.  */
+
+typedef struct {
+  unsigned int sindex;
+  signed int cleft;
+  unsigned int a1;
+  unsigned int a2;
+  float val;
+} Node;
+
+extern void bar(Node *n);
+
+void
+foo (Node *n0, float *pa, Node *var_843, int c)
+{
+  Node *var_821;
+  Node *n = n0;
+  int cleft_idx = c;
+  do
+    {
+      unsigned idx = n->sindex;
+      idx = (idx & ((1U << 31) - 1U));
+      float f1 = pa[idx];
+      float f2 = n->val;
+      int t = f2 > f1;
+      int var_825 = cleft_idx + t;
+      unsigned long long var_823 = var_825;
+      var_821 = &var_843[var_823];
+      cleft_idx = var_821->cleft;
+      n = var_821;
+  } while (cleft_idx != -1);
+
+  bar (n);
+}
+
+/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mxvcmpgtsp\M} 1 } } */
--
2.39.3


* PING^1 [PATCH] rs6000: New pass to mitigate SP float load perf issue on Power10
  2023-11-15  9:16 [PATCH] rs6000: New pass to mitigate SP float load perf issue on Power10 Kewen.Lin
@ 2023-12-12  6:16 ` Kewen.Lin
From: Kewen.Lin @ 2023-12-12  6:16 UTC (permalink / raw)
  To: GCC Patches
  Cc: Segher Boessenkool, David Edelsohn, Michael Meissner,
	Peter Bergner, Richard Sandiford, Richard Biener

Hi,

Gentle ping:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636599.html

BR,
Kewen

>        || !insn_data[icode].operand[0].predicate (target, tmode))
>      target = gen_reg_rtx (tmode);
> 
> +  /* Correct tmode with the proper addr mode.  */
> +  if (icode == CODE_FOR_vsx_splat_v4sf)
> +    tmode = SFmode;
> +
>    op[1] = copy_to_mode_reg (Pmode, op[1]);
> 
>    /* These CELL built-ins use BLKmode instead of tmode for historical
> @@ -2898,6 +2902,10 @@ static rtx
>  stv_expand_builtin (insn_code icode, rtx *op,
>  		    machine_mode tmode, machine_mode smode)
>  {
> +  /* Correct tmode with the proper addr mode.  */
> +  if (icode == CODE_FOR_vsx_stxsiwx_v4sf)
> +    tmode = SFmode;
> +
>    op[2] = copy_to_mode_reg (Pmode, op[2]);
> 
>    /* For STVX, express the RTL accurately by ANDing the address with -16.
> @@ -3713,3 +3721,4 @@ rs6000_expand_builtin (tree exp, rtx target, rtx /* subtarget */,
>    emit_insn (pat);
>    return target;
>  }
> +
> diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
> index ce40600e803..c0441f5e27f 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -2810,6 +2810,11 @@
>        __builtin_vsx_scalar_cmp_exp_qp_unordered (_Float128, _Float128);
>      VSCEQPUO xscmpexpqp_unordered_kf {}
> 
> +  vf __builtin_vsx_lxvwsx (signed long, const float *);
> +    LXVWSX_V4SF vsx_splat_v4sf {ldvec}
> +
> +  void __builtin_vsx_stxsiwx (vf, signed long, const float *);
> +    STXSIWX_V4SF vsx_stxsiwx_v4sf {stvec}
> 
>  ; Miscellaneous P9 functions
>  [power9]
> diff --git a/gcc/config/rs6000/rs6000-p10sfopt.cc b/gcc/config/rs6000/rs6000-p10sfopt.cc
> new file mode 100644
> index 00000000000..6e1d90fd93e
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-p10sfopt.cc
> @@ -0,0 +1,950 @@
> +/* Subroutines used to mitigate single precision floating point
> +   load and conversion performance issue by replacing scalar
> +   single precision floating point operations with appropriate
> +   vector operations if it is proper.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it
> +under the terms of the GNU General Public License as published by the
> +Free Software Foundation; either version 3, or (at your option) any
> +later version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT
> +ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +/* The processing of this pass starts from scalar SP loads: first it
> +checks whether the load is valid, then checks all the stmts using its
> +loaded result, and propagates from them.  This propagation mainly
> +goes through function visit_stmt, which first checks the validity
> +of the given stmt, then checks the feeders of the used operands
> +with visit_stmt recursively, and finally checks all the stmts using
> +its def with visit_stmt recursively.  The purpose is to ensure all
> +propagated stmts are valid to be transformed with their equivalent
> +vector operations.  Some special operands, like constants or SSA
> +names with a GIMPLE_NOP def, are recorded as splatting candidates.
> +The validity checks include: whether the addressing mode can satisfy
> +the index form with some adjustments, whether the corresponding
> +vector operation is supported, and so on.  Once all stmts propagated
> +from one load are valid, they are transformed by transform_stmt,
> +respecting the information in stmt_info such as sf_type, new_ops etc.
> +
> +For example, for the below test case:
> +
> +  _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1];  // stmt1
> +  _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1];  // stmt2
> +  _8 = .FMA (_4, a_14(D), _7);                   // stmt3
> +  MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8;  // stmt4
> +
> +The processing starts from stmt1, which is taken as valid and added
> +into the chain, then its use stmt stmt3 is processed.  stmt3 is also
> +valid, so its operands are iterated: _4 whose def is stmt1 (already
> +visited), a_14 which needs splatting, and _7 whose def stmt2 still
> +needs processing.  stmt2 is taken as a valid load and added into the
> +chain.  Once all operands _4, a_14 and _7 of stmt3 are processed,
> +stmt3 is added into the chain as well.  Then the use stmts of _8
> +(the result of stmt3) are processed, which checks stmt4, a valid
> +store.  Since all these stmts are valid to be transformed, we get:
> +
> +  sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
> +  sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
> +  sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
> +  sf_20 = .FMA (sf_5, sf_22, sf_25);
> +  __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
> +*/
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "target.h"
> +#include "rtl.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tm_p.h"
> +#include "tree-pass.h"
> +#include "ssa.h"
> +#include "optabs-tree.h"
> +#include "fold-const.h"
> +#include "tree-eh.h"
> +#include "gimple-iterator.h"
> +#include "gimple-fold.h"
> +#include "stor-layout.h"
> +#include "tree-ssa.h"
> +#include "tree-ssa-address.h"
> +#include "tree-cfg.h"
> +#include "cfgloop.h"
> +#include "tree-vectorizer.h"
> +#include "builtins.h"
> +#include "internal-fn.h"
> +#include "gimple-pretty-print.h"
> +#include "predict.h"
> +#include "rs6000-internal.h" /* for rs6000_builtin_decls  */
> +
> +namespace {
> +
> +/* Single precision floating point operation types.
> +
> +   So far, we only take care of load, store, ifn call, phi,
> +   normal arithmetic, comparison and special operations.
> +   Normally for an involved statement, we will process all
> +   statements which use its result and all statements which
> +   define its operands, and propagate further; but for some
> +   special assignment statements, we don't want to process
> +   them this way but just splat their results instead.  We
> +   use SF_SPECIAL for this kind of statement; for now it is
> +   only the to-float conversion assignment.  */
> +enum sf_type
> +{
> +  SF_LOAD,
> +  SF_STORE,
> +  SF_CALL,
> +  SF_PHI,
> +  SF_NORMAL,
> +  SF_COMPARE,
> +  SF_SPECIAL
> +};
> +
> +/* Hold some information for a gimple statement which is valid
> +   to be promoted from scalar operation to vector operation.  */
> +
> +class stmt_info
> +{
> +public:
> +  stmt_info (gimple *s, sf_type t, bitmap bm)
> +  {
> +    stmt = s;
> +    type = t;
> +    splat_ops = BITMAP_ALLOC (NULL);
> +    if (bm)
> +      bitmap_copy (splat_ops, bm);
> +
> +    unsigned nops = gimple_num_args (stmt);
> +    new_ops.create (nops);
> +    new_ops.safe_grow_cleared (nops);
> +    replace_stmt = NULL;
> +    gphi_res = NULL_TREE;
> +  }
> +
> +  ~stmt_info ()
> +  {
> +    BITMAP_FREE (splat_ops);
> +    new_ops.release ();
> +  }
> +
> +  /* Indicate the stmt which this info is for.  */
> +  gimple *stmt;
> +  /* Indicate sf_type of the current stmt.  */
> +  enum sf_type type;
> +  /* Bitmap used to indicate which op needs to be splatted.  */
> +  bitmap splat_ops;
> +  /* New operands used to build new stmt.  */
> +  vec<tree> new_ops;
> +  /* New stmt used to replace the current stmt.  */
> +  gimple *replace_stmt;
> +  /* Hold new gphi result which is created early.  */
> +  tree gphi_res;
> +};
> +
> +typedef stmt_info *stmt_info_p;
> +typedef hash_map<gimple *, stmt_info_p> info_map_t;
> +static info_map_t *stmt_info_map;
> +
> +/* Like the comments for SF_SPECIAL above, for some special
> +   assignment statement (to-float conversion assignment
> +   here), we don't want to do the heavy processing but just
> +   want to generate a splatting for it instead.  Return
> +   true if the given STMT is special (to-float conversion
> +   for now), otherwise return false.  */
> +
> +static bool
> +special_assign_p (gimple *stmt)
> +{
> +  gcc_assert (gimple_code (stmt) == GIMPLE_ASSIGN);
> +  enum tree_code code = gimple_assign_rhs_code (stmt);
> +  if (code == FLOAT_EXPR)
> +    return true;
> +  return false;
> +}
> +
> +/* Make base and index fields from the memory reference REF,
> +   return true and set *BASEP and *INDEXP respectively if it
> +   is successful, otherwise return false.  Since the
> +   transformed vector load (lxvwsx) and vector store (stxsiwx)
> +   only support the reg + reg addressing mode, we need to ensure
> +   the address satisfies it first.  */
> +
> +static bool
> +make_base_and_index (tree ref, tree *basep, tree *indexp)
> +{
> +  if (DECL_P (ref))
> +    {
> +      *basep
> +	= fold_build1 (ADDR_EXPR, build_pointer_type (float32_type_node), ref);
> +      *indexp = size_zero_node;
> +      return true;
> +    }
> +
> +  enum tree_code code = TREE_CODE (ref);
> +  if (code == TARGET_MEM_REF)
> +    {
> +      struct mem_address addr;
> +      get_address_description (ref, &addr);
> +      gcc_assert (!addr.step);
> +      *basep = addr.symbol ? addr.symbol : addr.base;
> +      if (addr.index)
> +	{
> +	  /* Give up if having both offset and index, theoretically
> +	     we can generate one insn to update base with index, but
> +	     it results in more cost, so leave it conservatively.  */
> +	  if (!integer_zerop (addr.offset))
> +	    return false;
> +	  *indexp = addr.index;
> +	}
> +      else
> +	*indexp = addr.offset;
> +      return true;
> +    }
> +
> +  if (code == MEM_REF)
> +    {
> +      *basep = TREE_OPERAND (ref, 0);
> +      tree op1 = TREE_OPERAND (ref, 1);
> +      *indexp = op1 ? op1 : size_zero_node;
> +      return true;
> +    }
> +
> +  if (handled_component_p (ref))
> +    {
> +      machine_mode mode1;
> +      poly_int64 bitsize, bitpos;
> +      tree offset;
> +      int reversep = 0, volatilep = 0, unsignedp = 0;
> +      tree tem = get_inner_reference (ref, &bitsize, &bitpos, &offset, &mode1,
> +				      &unsignedp, &reversep, &volatilep);
> +      if (reversep)
> +	return false;
> +
> +      poly_int64 bytepos = exact_div (bitpos, BITS_PER_UNIT);
> +      if (offset)
> +	{
> +	  gcc_assert (!integer_zerop (offset));
> +	  /* Give up if having both offset and bytepos.  */
> +	  if (maybe_ne (bytepos, 0))
> +	    return false;
> +	  if (!is_gimple_variable (offset))
> +	    return false;
> +	}
> +
> +      tree base1, index1;
> +      /* Further check the inner ref.  */
> +      if (!make_base_and_index (tem, &base1, &index1))
> +	return false;
> +
> +      if (integer_zerop (index1))
> +	{
> +	  /* Only need to consider base1 and offset/bytepos.  */
> +	  *basep = base1;
> +	  *indexp = offset ? offset : wide_int_to_tree (sizetype, bytepos);
> +	  return true;
> +	}
> +      /* Give up if having offset and index1.  */
> +      if (offset)
> +	return false;
> +      /* Give up if bytepos and index1 can not be folded.  */
> +      if (!poly_int_tree_p (index1))
> +	return false;
> +      poly_offset_int new_off
> +	= wi::sext (wi::to_poly_offset (index1), TYPE_PRECISION (sizetype));
> +      new_off += bytepos;
> +
> +      poly_int64 new_index;
> +      if (!new_off.to_shwi (&new_index))
> +	return false;
> +
> +      *basep = base1;
> +      *indexp = wide_int_to_tree (sizetype, new_index);
> +      return true;
> +    }
> +
> +  if (TREE_CODE (ref) == SSA_NAME)
> +    {
> +      /* Inner ref can come from a load.  */
> +      gimple *def = SSA_NAME_DEF_STMT (ref);
> +      if (!gimple_assign_single_p (def))
> +	return false;
> +      tree ref1 = gimple_assign_rhs1 (def);
> +      if (!DECL_P (ref1) && !REFERENCE_CLASS_P (ref1))
> +	return false;
> +
> +      tree base1, offset1;
> +      if (!make_base_and_index (ref1, &base1, &offset1))
> +	return false;
> +      *basep = base1;
> +      *indexp = offset1;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +/* Check whether STMT is an expected SP float load or store, return
> +   true and update IS_LOAD if it is, otherwise return false.  */
> +
> +static bool
> +valid_load_store_p (gimple *stmt, bool &is_load)
> +{
> +  if (!gimple_assign_single_p (stmt))
> +    return false;
> +
> +  tree lhs = gimple_assign_lhs (stmt);
> +  if (TYPE_MODE (TREE_TYPE (lhs)) != SFmode)
> +    return false;
> +
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +  tree base, index;
> +  if (TREE_CODE (lhs) == SSA_NAME
> +      && (DECL_P (rhs) || REFERENCE_CLASS_P (rhs))
> +      && make_base_and_index (rhs, &base, &index))
> +    {
> +      is_load = true;
> +      return true;
> +    }
> +
> +  if ((DECL_P (lhs) || REFERENCE_CLASS_P (lhs))
> +      && make_base_and_index (lhs, &base, &index))
> +    {
> +      is_load = false;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +/* Check if it's valid to update the given STMT with the
> +   equivalent vector form, return true if yes and also set
> +   SF_TYPE to the proper sf_type, otherwise return false.  */
> +
> +static bool
> +is_valid (gimple *stmt, enum sf_type &sf_type)
> +{
> +  /* Give up if it has volatile type.  */
> +  if (gimple_has_volatile_ops (stmt))
> +    return false;
> +
> +  /* Give up if it can throw an exception.  */
> +  if (stmt_can_throw_internal (cfun, stmt))
> +    return false;
> +
> +  /* Process phi.  */
> +  gphi *gp = dyn_cast<gphi *> (stmt);
> +  if (gp)
> +    {
> +      sf_type = SF_PHI;
> +      return true;
> +    }
> +
> +  /* Process assignment.  */
> +  gassign *gass = dyn_cast<gassign *> (stmt);
> +  if (gass)
> +    {
> +      bool is_load = false;
> +      if (valid_load_store_p (stmt, is_load))
> +	{
> +	  sf_type = is_load ? SF_LOAD : SF_STORE;
> +	  return true;
> +	}
> +
> +      tree lhs = gimple_assign_lhs (stmt);
> +      if (!lhs || TREE_CODE (lhs) != SSA_NAME)
> +	return false;
> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> +      if (TREE_CODE_CLASS (code) == tcc_comparison)
> +	{
> +	  tree rhs1 = gimple_assign_rhs1 (stmt);
> +	  tree rhs2 = gimple_assign_rhs2 (stmt);
> +	  tree type = TREE_TYPE (lhs);
> +	  if (!VECT_SCALAR_BOOLEAN_TYPE_P (type))
> +	    return false;
> +	  if (TYPE_MODE (type) != QImode)
> +	    return false;
> +	  type = TREE_TYPE (rhs1);
> +	  if (TYPE_MODE (type) != SFmode)
> +	    return false;
> +	  gcc_assert (TYPE_MODE (TREE_TYPE (rhs2)) == SFmode);
> +	  sf_type = SF_COMPARE;
> +	  return true;
> +	}
> +
> +      tree type = TREE_TYPE (lhs);
> +      if (TYPE_MODE (type) != SFmode)
> +	return false;
> +
> +      if (special_assign_p (stmt))
> +	{
> +	  sf_type = SF_SPECIAL;
> +	  return true;
> +	}
> +
> +      /* Check if vector operation is supported.  */
> +      sf_type = SF_NORMAL;
> +      tree vectype = build_vector_type_for_mode (type, V4SFmode);
> +      optab optab = optab_for_tree_code (code, vectype, optab_default);
> +      if (!optab)
> +	return false;
> +      return optab_handler (optab, V4SFmode) != CODE_FOR_nothing;
> +    }
> +
> +  /* Process call.  */
> +  gcall *gc = dyn_cast<gcall *> (stmt);
> +  /* TODO: Extend this to cover some other bifs.  */
> +  if (gc && gimple_call_internal_p (gc))
> +    {
> +      tree lhs = gimple_call_lhs (stmt);
> +      if (!lhs)
> +	return false;
> +      if (TREE_CODE (lhs) != SSA_NAME)
> +	return false;
> +      tree type = TREE_TYPE (lhs);
> +      if (TYPE_MODE (type) != SFmode)
> +	return false;
> +      enum internal_fn ifn = gimple_call_internal_fn (stmt);
> +      tree vectype = build_vector_type_for_mode (type, V4SFmode);
> +      if (direct_internal_fn_p (ifn))
> +	{
> +	  const direct_internal_fn_info &info = direct_internal_fn (ifn);
> +	  if (info.vectorizable
> +	      && (direct_internal_fn_supported_p (ifn,
> +						  tree_pair (vectype, vectype),
> +						  OPTIMIZE_FOR_SPEED)))
> +	    {
> +	      sf_type = SF_CALL;
> +	      return true;
> +	    }
> +	}
> +    }
> +
> +  return false;
> +}
> +
> +/* Process the given STMT; if it has been visited before, just return
> +   true.  On the first visit, mark it in VISITED and check whether
> +   the below ones are valid to be optimized with vector operations:
> +     - itself
> +     - all statements which define the operands involved here
> +     - all statements which use the result of STMT
> +   If all are valid, add STMT into CHAIN, create its own stmt_info
> +   and return true.  Otherwise, return false.  */
> +
> +static bool
> +visit_stmt (gimple *stmt, vec<gimple *> &chain, hash_set<gimple *> &visited)
> +{
> +  if (visited.add (stmt))
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "Stmt visited: %G", stmt);
> +      return true;
> +    }
> +  else if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "Visiting stmt: %G", stmt);
> +
> +  /* Check whether this statement is valid for this optimization.  */
> +  enum sf_type st_type;
> +  if (!is_valid (stmt, st_type))
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "Invalid stmt: %G", stmt);
> +      return false;
> +    }
> +
> +  /* For a store, it's the end of this chain, so we don't need
> +     to process anything further.  For a special assignment, we
> +     don't want to process the statements using its result or
> +     those defining its operands.  */
> +  if (st_type == SF_STORE || st_type == SF_SPECIAL)
> +    {
> +      chain.safe_push (stmt);
> +      stmt_info_p si = new stmt_info (stmt, st_type, NULL);
> +      stmt_info_map->put (stmt, si);
> +      return true;
> +    }
> +
> +  /* Check all feeders of operands involved here.  */
> +
> +  /* Indicate which operand needs to be splatted, such as: constant.  */
> +  auto_bitmap splat_bm;
> +  if (st_type != SF_LOAD)
> +    {
> +      unsigned nops = gimple_num_args (stmt);
> +      for (unsigned i = 0; i < nops; i++)
> +	{
> +	  tree op = gimple_arg (stmt, i);
> +	  if (TREE_CODE (op) != SSA_NAME
> +	      && TREE_CODE (op) != REAL_CST)
> +	    {
> +	      if (dump_enabled_p ())
> +		dump_printf (MSG_NOTE, "With problematic %T in stmt: %G", op,
> +			     stmt);
> +	      return false;
> +	    }
> +
> +	  bool need_splat = false;
> +	  if (TREE_CODE (op) == SSA_NAME)
> +	    {
> +	      gimple *op_stmt = SSA_NAME_DEF_STMT (op);
> +	      if (gimple_code (op_stmt) == GIMPLE_NOP)
> +		need_splat = true;
> +	      else if (!visit_stmt (op_stmt, chain, visited))
> +		return false;
> +	    }
> +	  else
> +	    {
> +	      gcc_assert (TREE_CODE (op) == REAL_CST);
> +	      need_splat = true;
> +	    }
> +
> +	  if (need_splat)
> +	    bitmap_set_bit (splat_bm, i);
> +	}
> +    }
> +
> +  /* Push this stmt before all its use stmts, so that it gets
> +     transformed first during the transform phase and the new_ops
> +     of its use stmts are prepared before they are transformed.  */
> +  chain.safe_push (stmt);
> +
> +  /* A comparison may have some constant operand, so we need the
> +     above handling for splatting, but we don't need any further
> +     processing of the uses of its result.  */
> +  if (st_type == SF_COMPARE)
> +    {
> +      stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
> +      stmt_info_map->put (stmt, si);
> +      return true;
> +    }
> +
> +  /* Process each use of definition.  */
> +  gimple *use_stmt;
> +  imm_use_iterator iter;
> +  tree lhs = gimple_get_lhs (stmt);
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    if (!visit_stmt (use_stmt, chain, visited))
> +      return false;
> +
> +  /* Create the corresponding stmt_info.  */
> +  stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
> +  stmt_info_map->put (stmt, si);
> +  return true;
> +}
> +
> +/* Tree NEW_LHS with vector type has been used to replace the
> +   original tree LHS; for each use of LHS, find the use stmt
> +   and its corresponding stmt_info, and update its new_ops
> +   array accordingly to prepare for the later replacement.  */
> +
> +static void
> +update_all_uses (tree lhs, tree new_lhs, sf_type type)
> +{
> +  gimple *use_stmt;
> +  imm_use_iterator iter;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    {
> +      stmt_info_p *slot = stmt_info_map->get (use_stmt);
> +      /* Each use stmt should have been processed, except for
> +	 SF_SPECIAL stmts, for which we stop processing
> +	 early.  */
> +      gcc_assert (slot || type == SF_SPECIAL);
> +      if (!slot)
> +	continue;
> +      stmt_info_p info = *slot;
> +      unsigned n = gimple_num_args (use_stmt);
> +      for (unsigned i = 0; i < n; i++)
> +	if (gimple_arg (use_stmt, i) == lhs)
> +	  info->new_ops[i] = new_lhs;
> +    }
> +}
> +
> +/* Insert NEW_STMT before the old STMT and then remove STMT.  */
> +
> +static void
> +replace_stmt (gimple_stmt_iterator *gsi_ptr, gimple *stmt, gimple *new_stmt)
> +{
> +  gimple_set_location (new_stmt, gimple_location (stmt));
> +  gimple_move_vops (new_stmt, stmt);
> +  gsi_insert_before (gsi_ptr, new_stmt, GSI_SAME_STMT);
> +  gsi_remove (gsi_ptr, true);
> +}
> +
> +/* Transform the given STMT into its vector form.  Only transform
> +   a phi stmt if HANDLE_PHI_P is true: since phis can have def-use
> +   cycles, they are transformed in a second round.  */
> +
> +static void
> +transform_stmt (gimple *stmt, bool handle_phi_p = false)
> +{
> +  stmt_info_p info = *stmt_info_map->get (stmt);
> +
> +  /* This statement has been replaced.  */
> +  if (info->replace_stmt)
> +    return;
> +
> +  gcc_assert (!handle_phi_p || gimple_code (stmt) == GIMPLE_PHI);
> +
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, " Transforming stmt: %G", stmt);
> +
> +  tree lhs = gimple_get_lhs (stmt);
> +  tree type = float_type_node;
> +  tree vectype = build_vector_type_for_mode (type, V4SFmode);
> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, " info->type: %d\n", info->type);
> +
> +  /* Replace load with bif __builtin_vsx_lxvwsx.  */
> +  if (info->type == SF_LOAD)
> +    {
> +      tree fndecl = rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
> +      tree rhs = gimple_op (stmt, 1);
> +      tree base, index;
> +      bool mem_p = make_base_and_index (rhs, &base, &index);
> +      gcc_assert (mem_p);
> +      gimple *load = gimple_build_call (fndecl, 2, index, base);
> +      tree res = make_temp_ssa_name (vectype, NULL, "sf");
> +      gimple_call_set_lhs (load, res);
> +      info->replace_stmt = load;
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen load: %G", load);
> +      update_all_uses (lhs, res, info->type);
> +      replace_stmt (&gsi, stmt, load);
> +      return;
> +    }
> +
> +  /* Replace store with bif __builtin_vsx_stxsiwx.  */
> +  if (info->type == SF_STORE)
> +    {
> +      tree fndecl = rs6000_builtin_decls[RS6000_BIF_STXSIWX_V4SF];
> +      tree base, index;
> +      bool mem_p = make_base_and_index (lhs, &base, &index);
> +      gcc_assert (mem_p);
> +      gcc_assert (info->new_ops[0]);
> +      gimple *store
> +	= gimple_build_call (fndecl, 3, info->new_ops[0], index, base);
> +      info->replace_stmt = store;
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen store: %G", store);
> +      replace_stmt (&gsi, stmt, store);
> +      return;
> +    }
> +
> +  /* Generate vector construction for special stmt.  */
> +  if (info->type == SF_SPECIAL)
> +    {
> +      tree op = gimple_get_lhs (stmt);
> +      tree val = build_vector_from_val (vectype, op);
> +      tree res = make_temp_ssa_name (vectype, NULL, "sf");
> +      gimple *splat = gimple_build_assign (res, val);
> +      gimple_set_location (splat, gimple_location (stmt));
> +      gsi_insert_after (&gsi, splat, GSI_SAME_STMT);
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen special %G", splat);
> +      update_all_uses (lhs, res, info->type);
> +      info->replace_stmt = splat;
> +      return;
> +    }
> +
> +  /* Handle the operands which don't have a corresponding vector
> +     operand yet, like those which need splatting etc.  */
> +  unsigned nargs = gimple_num_args (stmt);
> +  gphi *phi = dyn_cast<gphi *> (stmt);
> +  for (unsigned i = 0; i < nargs; i++)
> +    {
> +      /* This operand already has the replacing one.  */
> +      if (info->new_ops[i])
> +	continue;
> +      /* When only handling phi, all operands should have the
> +	 prepared new_op.  */
> +      gcc_assert (!handle_phi_p);
> +      tree op = gimple_arg (stmt, i);
> +      /* This operand needs splatting.  */
> +      if (bitmap_bit_p (info->splat_ops, i))
> +	{
> +	  tree val = build_vector_from_val (vectype, op);
> +	  tree res = make_temp_ssa_name (vectype, NULL, "sf");
> +	  gimple *splat = gimple_build_assign (res, val);
> +	  /* If it's a PHI, push it to its incoming block.  */
> +	  if (phi)
> +	    {
> +	      basic_block src = gimple_phi_arg_edge (phi, i)->src;
> +	      gimple_stmt_iterator src_gsi = gsi_last_bb (src);
> +	      if (!gsi_end_p (src_gsi) && stmt_ends_bb_p (gsi_stmt (src_gsi)))
> +		gsi_insert_before (&src_gsi, splat, GSI_SAME_STMT);
> +	      else
> +		gsi_insert_after (&src_gsi, splat, GSI_NEW_STMT);
> +	    }
> +	  else
> +	    gsi_insert_before (&gsi, splat, GSI_SAME_STMT);
> +	  info->new_ops[i] = res;
> +	  bitmap_clear_bit (info->splat_ops, i);
> +	}
> +      else
> +	{
> +	  gcc_assert (TREE_CODE (op) == SSA_NAME);
> +	  /* Ensure all operands have the replacing new_op, except
> +	     for a phi stmt.  */
> +	  if (!phi)
> +	    {
> +	      gimple *def = SSA_NAME_DEF_STMT (op);
> +	      transform_stmt (def);
> +	      gcc_assert (info->new_ops[i]);
> +	    }
> +	}
> +    }
> +
> +  gimple *new_stmt;
> +  tree res;
> +  if (info->type == SF_PHI)
> +    {
> +      /* In the first round, ensure the phi result is prepared so
> +	 all its use stmts can be transformed.  */
> +      if (!handle_phi_p)
> +	{
> +	  res = info->gphi_res;
> +	  if (!res)
> +	    {
> +	      res = make_temp_ssa_name (vectype, NULL, "sf");
> +	      info->gphi_res = res;
> +	    }
> +	  update_all_uses (lhs, res, info->type);
> +	  return;
> +	}
> +      /* Do the actual transformation in the second round.  */
> +      basic_block bb = gimple_bb (stmt);
> +      gphi *new_phi = create_phi_node (info->gphi_res, bb);
> +      for (unsigned i = 0; i < nargs; i++)
> +	{
> +	  location_t loc = gimple_phi_arg_location (phi, i);
> +	  edge e = gimple_phi_arg_edge (phi, i);
> +	  add_phi_arg (new_phi, info->new_ops[i], e, loc);
> +	}
> +      gimple_set_location (new_phi, gimple_location (stmt));
> +      remove_phi_node (&gsi, true);
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "  => Gen phi %G", (gimple *) new_phi);
> +      return;
> +    }
> +
> +  if (info->type == SF_COMPARE)
> +    {
> +      /* Build a vector comparison.  */
> +      tree vectype1 = truth_type_for (vectype);
> +      tree res1 = make_temp_ssa_name (vectype1, NULL, "sf_vb4");
> +      enum tree_code subcode = gimple_assign_rhs_code (stmt);
> +      gimple *new_stmt1 = gimple_build_assign (res1, subcode, info->new_ops[0],
> +					       info->new_ops[1]);
> +      gsi_insert_before (&gsi, new_stmt1, GSI_SAME_STMT);
> +
> +      /* Build a VEC_COND_EXPR with -1 (true) or 0 (false).  */
> +      tree vectype2 = build_vector_type_for_mode (intSI_type_node, V4SImode);
> +      tree res2 = make_temp_ssa_name (vectype2, NULL, "sf_vi4");
> +      tree minus_one_vec = build_minus_one_cst (vectype2);
> +      tree zero_vec = build_zero_cst (vectype2);
> +      gimple *new_stmt2 = gimple_build_assign (res2, VEC_COND_EXPR, res1,
> +					       minus_one_vec, zero_vec);
> +      gsi_insert_before (&gsi, new_stmt2, GSI_SAME_STMT);
> +
> +      /* Build a BIT_FIELD_REF to extract lane 1 (BE ordering).  */
> +      tree bfr = build3 (BIT_FIELD_REF, intSI_type_node, res2, bitsize_int (32),
> +			 bitsize_int (BYTES_BIG_ENDIAN ? 32 : 64));
> +      tree res3 = make_temp_ssa_name (intSI_type_node, NULL, "sf_i4");
> +      gimple *new_stmt3 = gimple_build_assign (res3, BIT_FIELD_REF, bfr);
> +      gsi_insert_before (&gsi, new_stmt3, GSI_SAME_STMT);
> +
> +      /* Convert it accordingly.  */
> +      gimple *new_stmt = gimple_build_assign (lhs, NOP_EXPR, res3);
> +
> +      if (dump_enabled_p ())
> +	{
> +	  dump_printf (MSG_NOTE, "  => Gen comparison: %G",
> +		       (gimple *) new_stmt1);
> +	  dump_printf (MSG_NOTE, "                     %G",
> +		       (gimple *) new_stmt2);
> +	  dump_printf (MSG_NOTE, "                     %G",
> +		       (gimple *) new_stmt3);
> +	  dump_printf (MSG_NOTE, "                     %G",
> +		       (gimple *) new_stmt);
> +	}
> +      gsi_replace (&gsi, new_stmt, false);
> +      info->replace_stmt = new_stmt;
> +      return;
> +    }
> +
> +  if (info->type == SF_CALL)
> +    {
> +      res = make_temp_ssa_name (vectype, NULL, "sf");
> +      enum internal_fn ifn = gimple_call_internal_fn (stmt);
> +      new_stmt = gimple_build_call_internal_vec (ifn, info->new_ops);
> +      gimple_call_set_lhs (new_stmt, res);
> +    }
> +  else
> +    {
> +      gcc_assert (info->type == SF_NORMAL);
> +      enum tree_code subcode = gimple_assign_rhs_code (stmt);
> +      res = make_temp_ssa_name (vectype, NULL, "sf");
> +
> +      if (nargs == 1)
> +	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0]);
> +      else if (nargs == 2)
> +	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
> +					info->new_ops[1]);
> +      else
> +	new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
> +					info->new_ops[1], info->new_ops[2]);
> +    }
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "  => Gen call/normal %G", new_stmt);
> +  update_all_uses (lhs, res, info->type);
> +  info->replace_stmt = new_stmt;
> +  replace_stmt (&gsi, stmt, new_stmt);
> +}
> +
> +/* Starting from load STMT, find all related statements and check
> +   that they are valid to be optimized with vector operations, then
> +   transform all of them on success.  */
> +
> +static void
> +process_chain_from_load (gimple *stmt)
> +{
> +  auto_vec<gimple *> chain;
> +  hash_set<gimple *> visited;
> +
> +  /* Load is the first of its chain.  */
> +  chain.safe_push (stmt);
> +  visited.add (stmt);
> +
> +  if (dump_enabled_p ())
> +    dump_printf (MSG_NOTE, "\nDetecting the chain from %G", stmt);
> +
> +  gimple *use_stmt;
> +  imm_use_iterator iter;
> +  tree lhs = gimple_assign_lhs (stmt);
> +  /* Propagate from uses of load result.  */
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    /* Fail if encountering anything unexpected.  */
> +    if (!visit_stmt (use_stmt, chain, visited))
> +      return;
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf (MSG_NOTE, "Found a chain from load %G", stmt);
> +      for (gimple *s : chain)
> +	dump_printf (MSG_NOTE, "  -> %G", s);
> +      dump_printf (MSG_NOTE, "\n");
> +    }
> +
> +  /* Create stmt info for this load.  */
> +  stmt_info_p si = new stmt_info (stmt, SF_LOAD, NULL);
> +  stmt_info_map->put (stmt, si);
> +
> +  /* Transform the chain.  */
> +  for (gimple *stmt : chain)
> +    transform_stmt (stmt, false);
> +  /* Handle the remaining phis.  */
> +  for (gimple *stmt : chain)
> +    if (gimple_code (stmt) == GIMPLE_PHI)
> +      transform_stmt (stmt, true);
> +}
> +
> +const pass_data pass_data_rs6000_p10sfopt = {
> +  GIMPLE_PASS,	     /* type */
> +  "rs6000_p10sfopt", /* name */
> +  OPTGROUP_NONE,     /* optinfo_flags */
> +  TV_NONE,	     /* tv_id */
> +  PROP_ssa,	     /* properties_required */
> +  0,		     /* properties_provided */
> +  0,		     /* properties_destroyed */
> +  0,		     /* todo_flags_start */
> +  TODO_update_ssa,   /* todo_flags_finish */
> +};
> +
> +class pass_rs6000_p10sfopt : public gimple_opt_pass
> +{
> +public:
> +  pass_rs6000_p10sfopt (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_rs6000_p10sfopt, ctxt)
> +  {
> +  }
> +
> +  bool
> +  gate (function *fun) final override
> +    {
> +      /* Not every FE initializes target built-ins, so we need to
> +	 ensure the lxvwsx_v4sf decl is available; we can't do this
> +	 check in rs6000_option_override_internal since the bif
> +	 decls are uninitialized at that time.  */
> +      return TARGET_P10_SF_OPT
> +	     && optimize
> +	     && optimize_function_for_speed_p (fun)
> +	     && rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
> +    }
> +
> +  unsigned int execute (function *) final override;
> +
> +}; /* end of class pass_rs6000_p10sfopt  */
> +
> +unsigned int
> +pass_rs6000_p10sfopt::execute (function *fun)
> +{
> +  stmt_info_map = new hash_map<gimple *, stmt_info_p>;
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      for (gimple_stmt_iterator gsi = gsi_start_nondebug_after_labels_bb (bb);
> +	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> +	{
> +	  gimple *stmt = gsi_stmt (gsi);
> +
> +	  switch (gimple_code (stmt))
> +	    {
> +	    case GIMPLE_ASSIGN:
> +	      if (gimple_assign_single_p (stmt))
> +		{
> +		  bool is_load = false;
> +		  if (!stmt_info_map->get (stmt)
> +		      && valid_load_store_p (stmt, is_load)
> +		      && is_load)
> +		    process_chain_from_load (stmt);
> +		}
> +	      break;
> +	    default:
> +	      break;
> +	    }
> +	}
> +    }
> +
> +  for (info_map_t::iterator it = stmt_info_map->begin ();
> +       it != stmt_info_map->end (); ++it)
> +    {
> +      stmt_info_p info = (*it).second;
> +      delete info;
> +    }
> +  delete stmt_info_map;
> +
> +  return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_rs6000_p10sfopt (gcc::context *ctxt)
> +{
> +  return new pass_rs6000_p10sfopt (ctxt);
> +}
> +
> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
> index ca899d5f7af..bc59a7d5f99 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -24,6 +24,11 @@ along with GCC; see the file COPYING3.  If not see
>     REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
>   */
> 
> +  /* Pass to mitigate the performance issue on scalar single precision
> +     floating point loads, by replacing some scalar single precision
> +     floating point operations with appropriate vector operations.  */
> +  INSERT_PASS_BEFORE (pass_gimple_isel, 1, pass_rs6000_p10sfopt);
> +
>    /* Pass to add the appropriate vector swaps on power8 little endian systems.
>       The power8 does not have instructions that automaticaly do the byte swaps
>       for loads and stores.  */
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index f70118ea40f..aa0f782f186 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -341,9 +341,11 @@ extern unsigned rs6000_linux_libm_function_max_error (unsigned, machine_mode,
>  /* Pass management.  */
>  namespace gcc { class context; }
>  class rtl_opt_pass;
> +class gimple_opt_pass;
> 
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
> +extern gimple_opt_pass *make_pass_rs6000_p10sfopt (gcc::context *);
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>  extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index cc24dd5301e..0e36860a73e 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4254,6 +4254,22 @@ rs6000_option_override_internal (bool global_init_p)
>        rs6000_isa_flags &= ~OPTION_MASK_PCREL;
>      }
> 
> +  if (TARGET_P10_SF_OPT)
> +    {
> +      if (!TARGET_HARD_FLOAT)
> +	{
> +	  if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
> +	    error ("%qs requires %qs", "-mp10-sf-opt", "-mhard-float");
> +	  rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
> +	}
> +      if (!TARGET_P9_VECTOR)
> +	{
> +	  if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
> +	    error ("%qs requires %qs", "-mp10-sf-opt", "-mpower9-vector");
> +	  rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
> +	}
> +    }
> +
>    /* Print the options after updating the defaults.  */
>    if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
>      rs6000_print_isa_options (stderr, 0, "after defaults", rs6000_isa_flags);
> @@ -22301,6 +22317,17 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>        *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
>        if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
>  	*total += COSTS_N_INSNS (100);
> +      /* Treat vec_duplicate specially here: since the vector splat
> +	 insns {l,st}xv[wd]sx only support the x-form, we should ensure
> +	 reg + reg is preferred over reg + const, otherwise cprop will
> +	 propagate the const and result in sub-optimal code.  */
> +      if (outer_code == VEC_DUPLICATE
> +	  && (GET_MODE_SIZE (mode) == 4
> +	    || GET_MODE_SIZE (mode) == 8)
> +	  && GET_CODE (XEXP (x, 0)) == PLUS
> +	  && CONST_INT_P (XEXP (XEXP (x, 0), 1))
> +	  && REG_P (XEXP (XEXP (x, 0), 0)))
> +	*total += COSTS_N_INSNS (1);
>        return true;
> 
>      case LABEL_REF:
> @@ -24443,6 +24470,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
>    { "modulo",			OPTION_MASK_MODULO,		false, true  },
>    { "mulhw",			OPTION_MASK_MULHW,		false, true  },
>    { "multiple",			OPTION_MASK_MULTIPLE,		false, true  },
> +  { "p10-sf-opt",		OPTION_MASK_P10_SF_OPT,		false, true  },
>    { "pcrel",			OPTION_MASK_PCREL,		false, true  },
>    { "pcrel-opt",		OPTION_MASK_PCREL_OPT,		false, true  },
>    { "popcntb",			OPTION_MASK_POPCNTB,		false, true  },
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index bde6d3ff664..fb50e00d0d9 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -597,6 +597,11 @@ mmma
>  Target Mask(MMA) Var(rs6000_isa_flags)
>  Generate (do not generate) MMA instructions.
> 
> +mp10-sf-opt
> +Target Mask(P10_SF_OPT) Var(rs6000_isa_flags)
> +Generate code to mitigate the single-precision floating point load performance
> +issue.
> +
>  mrelative-jumptables
>  Target Undocumented Var(rs6000_relative_jumptables) Init(1) Save
> 
> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
> index f183b42ce1d..e7cd6d2f694 100644
> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -35,6 +35,10 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
> 
> +rs6000-p10sfopt.o: $(srcdir)/config/rs6000/rs6000-p10sfopt.cc
> +	$(COMPILE) $<
> +	$(POSTCOMPILE)
> +
>  rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index f3b40229094..690318e82b2 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -6690,3 +6690,14 @@ (define_insn "vmsumcud"
>    "vmsumcud %0,%1,%2,%3"
>    [(set_attr "type" "veccomplex")]
>  )
> +
> +;; For expanding internal use bif __builtin_vsx_stxsiwx
> +(define_insn "vsx_stxsiwx_v4sf"
> + [(set (match_operand:SF 0 "memory_operand" "=Z")
> +       (unspec:SF
> +	  [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
> +	  UNSPEC_STFIWX))]
> + "TARGET_P9_VECTOR"
> + "stxsiwx %x1,%y0"
> + [(set_attr "type" "fpstore")])
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> new file mode 100644
> index 00000000000..6e8c6a84de6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> @@ -0,0 +1,22 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify the Power10 SP floating point load perf mitigation works
> +   as expected with a case having normal arithmetic.  */
> +
> +void
> +saxpy (int n, float a, float *restrict x, float *restrict y)
> +{
> +#pragma GCC unroll 1
> +  for (int i = 0; i < n; ++i)
> +    y[i] = a * x[i] + y[i];
> +}
> +
> +/* Checking lfsx -> lxvwsx, stfsx -> stxsiwx, fmadds -> xvmaddmsp etc.  */
> +/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mstxsiwx\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxvmaddmsp\M} 1 } } */
> +/* { dg-final { scan-assembler-not {\mlfsx?\M} } } */
> +/* { dg-final { scan-assembler-not {\mstfsx?\M} } } */
> +/* { dg-final { scan-assembler-not {\mfmadds\M} } } */
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> new file mode 100644
> index 00000000000..7593da8ecf4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> @@ -0,0 +1,34 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify the Power10 SP floating point load perf mitigation works
> +   as expected with a case having reduction.  */
> +
> +/* Partially reduced from pytorch batch_norm_kernel.cpp.  */
> +
> +typedef long long int64_t;
> +typedef float accscalar_t;
> +typedef float scalar_t;
> +
> +void
> +foo (int64_t n1, int64_t n2, accscalar_t sum, int64_t bound, int64_t N,
> +     scalar_t *input_data, scalar_t *var_sum_data, int64_t index)
> +{
> +  scalar_t mean = sum / N;
> +  accscalar_t _var_sum = 0;
> +  for (int64_t c = 0; c < n1; c++)
> +    {
> +      for (int64_t i = 0; i < n2; i++)
> +	{
> +	  int64_t offset = index + i;
> +	  scalar_t x = input_data[offset];
> +	  _var_sum += (x - mean) * (x - mean);
> +	}
> +      var_sum_data[c] = _var_sum;
> +    }
> +}
> +
> +/* { dg-final { scan-assembler {\mlxvwsx\M} } } */
> +/* { dg-final { scan-assembler {\mstxsiwx\M} } } */
> +/* { dg-final { scan-assembler {\mxvmaddasp\M} } } */
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> new file mode 100644
> index 00000000000..38aedd00faa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> @@ -0,0 +1,43 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify the Power10 SP floating point load perf mitigation works
> +   as expected with a case having comparison.  */
> +
> +/* Partially reduced from xgboost cpu_predictor.cc.  */
> +
> +typedef struct {
> +  unsigned int sindex;
> +  signed int cleft;
> +  unsigned int a1;
> +  unsigned int a2;
> +  float val;
> +} Node;
> +
> +extern void bar(Node *n);
> +
> +void
> +foo (Node *n0, float *pa, Node *var_843, int c)
> +{
> +  Node *var_821;
> +  Node *n = n0;
> +  int cleft_idx = c;
> +  do
> +    {
> +      unsigned idx = n->sindex;
> +      idx = (idx & ((1U << 31) - 1U));
> +      float f1 = pa[idx];
> +      float f2 = n->val;
> +      int t = f2 > f1;
> +      int var_825 = cleft_idx + t;
> +      unsigned long long var_823 = var_825;
> +      var_821 = &var_843[var_823];
> +      cleft_idx = var_821->cleft;
> +      n = var_821;
> +  } while (cleft_idx != -1);
> +
> +  bar (n);
> +}
> +
> +/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mxvcmpgtsp\M} 1 } } */
> --
> 2.39.3
