From: "Kewen.Lin" <linkw@linux.ibm.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
David Edelsohn <dje.gcc@gmail.com>,
Michael Meissner <meissner@linux.ibm.com>,
Peter Bergner <bergner@linux.ibm.com>,
Richard Sandiford <richard.sandiford@arm.com>,
Richard Biener <richard.guenther@gmail.com>
Subject: PING^1 [PATCH] rs6000: New pass to mitigate SP float load perf issue on Power10
Date: Tue, 12 Dec 2023 14:16:02 +0800 [thread overview]
Message-ID: <091f04fa-8264-19ac-9b39-5444eb9d1ab0@linux.ibm.com> (raw)
In-Reply-To: <b7b0d8fb-64a0-2ed2-f333-06b79133e68f@linux.ibm.com>
Hi,
Gentle ping:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636599.html
BR,
Kewen
on 2023/11/15 17:16, Kewen.Lin wrote:
> Hi,
>
> As the Power ISA defines, when loading a scalar single precision (SP)
> floating point value from memory, the target register holds the
> double precision (DP) format converted from SP; this is unlike some
> other architectures, which support SP and DP in registers as separate
> formats. The scalar SP instructions operate on the DP format value
> in the register and round the result to fit in SP (but still keep
> the value in DP format).
>
> On Power10, a scalar SP floating point load insn is cracked into
> two internal operations: one loads the value, the other converts
> from SP to DP format. Compared with an uncracked load such as a
> vector SP load, it has an extra 3-cycle load-to-use penalty. When
> evaluating some critical workloads, we found cases where the
> conversion is not actually needed, because all the involved
> operations work purely on SP format. In such cases, we can replace
> the scalar SP loads with a vector SP load and splat (no conversion)
> and replace all involved computation with the corresponding vector
> operations (with the Power10 slice-based design, we expect a scalar
> operation and its vector equivalent to have the same latency); that
> is, we promote the scalar SP loads and the computation they feed to
> vector operations.
>
> For example, consider the below case:
>
> void saxpy (int n, float a, float * restrict x, float * restrict y)
> {
> for (int i = 0; i < n; ++i)
> y[i] = a*x[i] + y[i];
> }
>
> At -O2, the loop body would end up with:
>
> .L3:
> lfsx 12,6,9 // conv
> lfsx 0,5,9 // conv
> fmadds 0,0,1,12
> stfsx 0,6,9
> addi 9,9,4
> bdnz .L3
>
> but it can be implemented with:
>
> .L3:
> lxvwsx 0,5,9 // load and splat
> lxvwsx 12,6,9
> xvmaddmsp 0,1,12
> stxsiwx 0,6,9 // just store word 1 (BE ordering)
> addi 9,9,4
> bdnz .L3
>
> Evaluated on Power10, the latter runs 23% faster than the former.
>
> So this patch introduces a pass to recognize such cases and replace
> the scalar SP operations with the appropriate vector SP operations
> where it is suitable.
>
> The pass starts from scalar SP loads: it first checks whether a
> load is valid, then checks all the stmts using the loaded result,
> and propagates from them. This propagation is mainly done by
> function visit_stmt, which first checks the validity of the given
> stmt, then recursively checks the feeders of its use operands with
> visit_stmt, and finally recursively checks all the stmts using its
> def with visit_stmt. The purpose is to ensure that all propagated
> stmts can be transformed into their equivalent vector operations.
> Some special operands, such as constants or SSA names with a
> GIMPLE_NOP def, are recorded as splatting candidates. The validity
> checks include: whether the addressing mode can satisfy the index
> form with some adjustments, whether the corresponding vector
> operation is supported, and so on. Once all stmts propagated from
> one load are valid, they are transformed by function transform_stmt,
> respecting the information in stmt_info such as sf_type, new_ops etc.
>
> For example, for the below test case:
>
> _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1]; // stmt1
> _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1]; // stmt2
> _8 = .FMA (_4, a_14(D), _7); // stmt3
> MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8; // stmt4
>
> Processing starts from stmt1, which is taken as valid and added to
> the chain; then its use stmt stmt3 is processed, which is also
> valid, iterating over its operands: _4 whose def is stmt1 (already
> visited), a_14 which needs splatting, and _7 whose def stmt2 is yet
> to be processed. Then stmt2 is taken as a valid load and added to
> the chain. With all operands _4, a_14 and _7 of stmt3 handled,
> stmt3 is added to the chain. Then the use stmts of _8 (result of
> stmt3) are processed, so stmt4 is checked and found a valid store.
> Since all these stmts are valid to be transformed, we finally get:
>
> sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
> sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
> sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
> sf_20 = .FMA (sf_5, sf_22, sf_25);
> __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
>
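> The traversal order described above can be seen in the small toy
> model below (matching the stmt1..stmt4 example). This is only a
> hedged sketch: the struct and helper names are made up for
> illustration, and the real visit_stmt below performs many more
> checks (volatile ops, exceptions, vector support, etc.):
>
>   #include <stdio.h>
>
>   /* Toy stand-in for a gimple stmt and its def-use links.  */
>   struct stmt
>   {
>     const char *name;
>     int valid;                /* Stands in for the is_valid checks.  */
>     int visited;
>     struct stmt *feeders[4];  /* Defs of the use operands.  */
>     struct stmt *users[4];    /* Stmts using the result.  */
>   };
>
>   static struct stmt *chain[16];
>   static int chain_len;
>
>   static int
>   visit (struct stmt *s)
>   {
>     if (s->visited)
>       return 1;               /* Already processed.  */
>     s->visited = 1;
>     if (!s->valid)
>       return 0;               /* One bad stmt poisons the chain.  */
>     for (int i = 0; s->feeders[i]; i++)
>       if (!visit (s->feeders[i]))
>         return 0;
>     chain[chain_len++] = s;   /* Push before visiting use stmts.  */
>     for (int i = 0; s->users[i]; i++)
>       if (!visit (s->users[i]))
>         return 0;
>     return 1;
>   }
>
>   int
>   main (void)
>   {
>     struct stmt load1 = { "stmt1", 1, 0, { 0 }, { 0 } };
>     struct stmt load2 = { "stmt2", 1, 0, { 0 }, { 0 } };
>     struct stmt fma = { "stmt3", 1, 0, { &load1, &load2 }, { 0 } };
>     struct stmt store = { "stmt4", 1, 0, { &fma }, { 0 } };
>     load1.users[0] = &fma;
>     load2.users[0] = &fma;
>     fma.users[0] = &store;
>     if (visit (&load1))
>       for (int i = 0; i < chain_len; i++)
>         printf ("%s\n", chain[i]->name); /* stmt1 stmt2 stmt3 stmt4 */
>     return 0;
>   }
>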
> Since the pass needs to do some validity checks and adjustments where
> allowed, such as checking whether a scalar operation has the
> corresponding vector support, considering that a scalar SP load
> allows reg + {reg, disp} addressing modes while vector SP load and
> splat only allows reg + reg, and also considering the efficiency of
> obtaining the UD/DU chains for the affected operations, we make this
> a gimple pass.
>
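> As a concrete (hypothetical) illustration of the addressing-mode
> concern, consider a source-level access with a constant displacement:
>
>   float
>   sum_two (float *x)
>   {
>     /* Scalar code can use D-form loads, e.g. lfs 0,8(3), while
>        lxvwsx is X-form (reg + reg) only, so the pass must
>        materialize the displacement in a register or give up.  */
>     return x[2] + x[3];
>   }
>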
> Since the gimple_isel pass does some gimple massaging, this pass is
> placed just before it. Considering that this pass can generate some
> extra vector construction (for constants, values converted from int,
> etc.) which has no counterpart in the original scalar code, and that
> it makes use of more vector resources than before, it is
> conservatively not turned on by default for now.
>
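> For reference, since the pass is off by default, the saxpy example
> above can be exercised roughly as the included testcases do (the
> command line below is illustrative; -mp10-sf-opt is the option this
> patch adds):
>
>   gcc -O2 -fno-tree-vectorize -mcpu=power10 -mp10-sf-opt -S saxpy.c
>
> The generated assembly can then be checked for lxvwsx, stxsiwx and
> xvmaddmsp in place of lfsx, stfsx and fmadds.
>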
> With extra code to turn this on by default for Power10, the patch
> bootstrapped and was almost fully regress-tested on Power10 (three
> test cases need trivial adjustments to their expected output).
> Evaluating all SPEC2017 specrate benchmarks at O2, O3 and Ofast, we
> observed that it speeds up 521.wrf_r by 2.14%, 526.blender_r by 1.85%
> and the fprate geomean by 0.31% at O2; it is neutral at O3 and Ofast.
>
> Evaluating one critical workload related to xgboost, it shows a
> speed-up of 8% ~ 16% (avg. 14%, worst 8%, best 16%).
>
> Note that the current implementation is mainly driven by typical test
> cases from the motivating workloads; we plan to continue extending it
> as needed.
>
> Any thoughts?
>
> BR,
> Kewen
> -----
>
> gcc/ChangeLog:
>
> * config.gcc: Add rs6000-p10sfopt.o to extra_objs for powerpc*-*-*
> and rs6000*-*-* targets.
> * config/rs6000/rs6000-builtin.cc (ldv_expand_builtin): Correct tmode
> for CODE_FOR_vsx_splat_v4sf.
> (stv_expand_builtin): Correct tmode for CODE_FOR_vsx_stxsiwx_v4sf.
> * config/rs6000/rs6000-builtins.def (__builtin_vsx_lxvwsx,
> __builtin_vsx_stxsiwx): New builtin definitions.
> * config/rs6000/rs6000-passes.def: Add pass_rs6000_p10sfopt.
> * config/rs6000/rs6000-protos.h (class gimple_opt_pass): New
> declaration.
> (make_pass_rs6000_p10sfopt): Likewise.
> * config/rs6000/rs6000.cc (rs6000_option_override_internal): Check
> some prerequisite conditions for TARGET_P10_SF_OPT.
> (rs6000_rtx_costs): Cost one unit COSTS_N_INSNS more for vec_duplicate
> with {l,st}xv[wd]sx which only support x-form.
> * config/rs6000/rs6000.opt (-mp10-sf-opt): New option.
> * config/rs6000/t-rs6000: Add rule to build rs6000-p10sfopt.o.
> * config/rs6000/vsx.md (vsx_stxsiwx_v4sf): New define_insn.
> * config/rs6000/rs6000-p10sfopt.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/powerpc/p10-sf-opt-1.c: New test.
> * gcc.target/powerpc/p10-sf-opt-2.c: New test.
> * gcc.target/powerpc/p10-sf-opt-3.c: New test.
> ---
> gcc/config.gcc | 4 +-
> gcc/config/rs6000/rs6000-builtin.cc | 9 +
> gcc/config/rs6000/rs6000-builtins.def | 5 +
> gcc/config/rs6000/rs6000-p10sfopt.cc | 950 ++++++++++++++++++
> gcc/config/rs6000/rs6000-passes.def | 5 +
> gcc/config/rs6000/rs6000-protos.h | 2 +
> gcc/config/rs6000/rs6000.cc | 28 +
> gcc/config/rs6000/rs6000.opt | 5 +
> gcc/config/rs6000/t-rs6000 | 4 +
> gcc/config/rs6000/vsx.md | 11 +
> .../gcc.target/powerpc/p10-sf-opt-1.c | 22 +
> .../gcc.target/powerpc/p10-sf-opt-2.c | 34 +
> .../gcc.target/powerpc/p10-sf-opt-3.c | 43 +
> 13 files changed, 1120 insertions(+), 2 deletions(-)
> create mode 100644 gcc/config/rs6000/rs6000-p10sfopt.cc
> create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 0782cbc6e91..983fad9fb9a 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -517,7 +517,7 @@ or1k*-*-*)
> ;;
> powerpc*-*-*)
> cpu_type=rs6000
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p10sfopt.o rs6000-logue.o"
> extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
> extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
> extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
> @@ -554,7 +554,7 @@ riscv*)
> ;;
> rs6000*-*-*)
> extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p10sfopt.o rs6000-logue.o"
> extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
> target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
> target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> index 82cc3a19447..38bb786e5eb 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -2755,6 +2755,10 @@ ldv_expand_builtin (rtx target, insn_code icode, rtx *op, machine_mode tmode)
> || !insn_data[icode].operand[0].predicate (target, tmode))
> target = gen_reg_rtx (tmode);
>
> + /* Correct tmode with the proper addr mode. */
> + if (icode == CODE_FOR_vsx_splat_v4sf)
> + tmode = SFmode;
> +
> op[1] = copy_to_mode_reg (Pmode, op[1]);
>
> /* These CELL built-ins use BLKmode instead of tmode for historical
> @@ -2898,6 +2902,10 @@ static rtx
> stv_expand_builtin (insn_code icode, rtx *op,
> machine_mode tmode, machine_mode smode)
> {
> + /* Correct tmode with the proper addr mode. */
> + if (icode == CODE_FOR_vsx_stxsiwx_v4sf)
> + tmode = SFmode;
> +
> op[2] = copy_to_mode_reg (Pmode, op[2]);
>
> /* For STVX, express the RTL accurately by ANDing the address with -16.
> @@ -3713,3 +3721,4 @@ rs6000_expand_builtin (tree exp, rtx target, rtx /* subtarget */,
> emit_insn (pat);
> return target;
> }
> +
> diff --git a/gcc/config/rs6000/rs6000-builtins.def b/gcc/config/rs6000/rs6000-builtins.def
> index ce40600e803..c0441f5e27f 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -2810,6 +2810,11 @@
> __builtin_vsx_scalar_cmp_exp_qp_unordered (_Float128, _Float128);
> VSCEQPUO xscmpexpqp_unordered_kf {}
>
> + vf __builtin_vsx_lxvwsx (signed long, const float *);
> + LXVWSX_V4SF vsx_splat_v4sf {ldvec}
> +
> + void __builtin_vsx_stxsiwx (vf, signed long, const float *);
> + STXSIWX_V4SF vsx_stxsiwx_v4sf {stvec}
>
> ; Miscellaneous P9 functions
> [power9]
> diff --git a/gcc/config/rs6000/rs6000-p10sfopt.cc b/gcc/config/rs6000/rs6000-p10sfopt.cc
> new file mode 100644
> index 00000000000..6e1d90fd93e
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-p10sfopt.cc
> @@ -0,0 +1,950 @@
> +/* Subroutines used to mitigate the single precision floating point
> +   load and conversion performance issue by replacing scalar
> +   single precision floating point operations with appropriate
> +   vector operations where it is suitable.
> + Copyright (C) 2023 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it
> +under the terms of the GNU General Public License as published by the
> +Free Software Foundation; either version 3, or (at your option) any
> +later version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT
> +ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3. If not see
> +<http://www.gnu.org/licenses/>. */
> +
> +/* The pass starts from scalar SP loads: it first checks whether a
> +load is valid, then checks all the stmts using the loaded result,
> +and propagates from them. This propagation is mainly done by
> +function visit_stmt, which first checks the validity of the given
> +stmt, then recursively checks the feeders of its use operands with
> +visit_stmt, and finally recursively checks all the stmts using its
> +def with visit_stmt. The purpose is to ensure that all propagated
> +stmts can be transformed into their equivalent vector operations.
> +Some special operands, such as constants or SSA names with a
> +GIMPLE_NOP def, are recorded as splatting candidates. The validity
> +checks include: whether the addressing mode can satisfy the index
> +form with some adjustments, whether the corresponding vector
> +operation is supported, and so on. Once all stmts propagated from
> +one load are valid, they are transformed by function transform_stmt,
> +respecting the information in stmt_info such as sf_type, new_ops etc.
> +
> +For example, for the below test case:
> +
> + _4 = MEM[(float *)x_13(D) + ivtmp.13_24 * 1]; // stmt1
> + _7 = MEM[(float *)y_15(D) + ivtmp.13_24 * 1]; // stmt2
> + _8 = .FMA (_4, a_14(D), _7); // stmt3
> + MEM[(float *)y_15(D) + ivtmp.13_24 * 1] = _8; // stmt4
> +
> +Processing starts from stmt1, which is taken as valid and added to
> +the chain; then its use stmt stmt3 is processed, which is also
> +valid, iterating over its operands: _4 whose def is stmt1 (already
> +visited), a_14 which needs splatting, and _7 whose def stmt2 is yet
> +to be processed. Then stmt2 is taken as a valid load and added to
> +the chain. With all operands _4, a_14 and _7 of stmt3 handled,
> +stmt3 is added to the chain. Then the use stmts of _8 (result of
> +stmt3) are processed, so stmt4 is checked and found a valid store.
> +Since all these stmts are valid to be transformed, we finally get:
> +
> + sf_5 = __builtin_vsx_lxvwsx (ivtmp.13_24, x_13(D));
> + sf_25 = __builtin_vsx_lxvwsx (ivtmp.13_24, y_15(D));
> + sf_22 = {a_14(D), a_14(D), a_14(D), a_14(D)};
> + sf_20 = .FMA (sf_5, sf_22, sf_25);
> + __builtin_vsx_stxsiwx (sf_20, ivtmp.13_24, y_15(D));
> +*/
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "target.h"
> +#include "rtl.h"
> +#include "tree.h"
> +#include "gimple.h"
> +#include "tm_p.h"
> +#include "tree-pass.h"
> +#include "ssa.h"
> +#include "optabs-tree.h"
> +#include "fold-const.h"
> +#include "tree-eh.h"
> +#include "gimple-iterator.h"
> +#include "gimple-fold.h"
> +#include "stor-layout.h"
> +#include "tree-ssa.h"
> +#include "tree-ssa-address.h"
> +#include "tree-cfg.h"
> +#include "cfgloop.h"
> +#include "tree-vectorizer.h"
> +#include "builtins.h"
> +#include "internal-fn.h"
> +#include "gimple-pretty-print.h"
> +#include "predict.h"
> +#include "rs6000-internal.h" /* for rs6000_builtin_decls */
> +
> +namespace {
> +
> +/* Single precision Floating point operation types.
> +
> + So far, we only take care of load, store, ifn call, phi,
> + normal arithmetic, comparison and special operations.
> + Normally for an involved statement, we will process all
> + statements which use its result, all statements which
> + define its operands and further propagate, but for some
> + special assignment statement, we don't want to process
> +   it this way but just splat it instead; we adopt
> +   SF_SPECIAL for this kind of statement, for now it's only
> +   for to-float conversion assignments. */
> +enum sf_type
> +{
> + SF_LOAD,
> + SF_STORE,
> + SF_CALL,
> + SF_PHI,
> + SF_NORMAL,
> + SF_COMPARE,
> + SF_SPECIAL
> +};
> +
> +/* Hold some information for a gimple statement which is valid
> + to be promoted from scalar operation to vector operation. */
> +
> +class stmt_info
> +{
> +public:
> + stmt_info (gimple *s, sf_type t, bitmap bm)
> + {
> + stmt = s;
> + type = t;
> + splat_ops = BITMAP_ALLOC (NULL);
> + if (bm)
> + bitmap_copy (splat_ops, bm);
> +
> + unsigned nops = gimple_num_args (stmt);
> + new_ops.create (nops);
> + new_ops.safe_grow_cleared (nops);
> + replace_stmt = NULL;
> + gphi_res = NULL_TREE;
> + }
> +
> + ~stmt_info ()
> + {
> + BITMAP_FREE (splat_ops);
> + new_ops.release ();
> + }
> +
> + /* Indicate the stmt what this info is for. */
> + gimple *stmt;
> + /* Indicate sf_type of the current stmt. */
> + enum sf_type type;
> + /* Bitmap used to indicate which op needs to be splatted. */
> + bitmap splat_ops;
> + /* New operands used to build new stmt. */
> + vec<tree> new_ops;
> + /* New stmt used to replace the current stmt. */
> + gimple *replace_stmt;
> + /* Hold new gphi result which is created early. */
> + tree gphi_res;
> +};
> +
> +typedef stmt_info *stmt_info_p;
> +typedef hash_map<gimple *, stmt_info_p> info_map_t;
> +static info_map_t *stmt_info_map;
> +
> +/* Like the comments for SF_SPECIAL above, for some special
> + assignment statement (to-float conversion assignment
> + here), we don't want to do the heavy processing but just
> + want to generate a splatting for it instead. Return
> + true if the given STMT is special (to-float conversion
> + for now), otherwise return false. */
> +
> +static bool
> +special_assign_p (gimple *stmt)
> +{
> + gcc_assert (gimple_code (stmt) == GIMPLE_ASSIGN);
> + enum tree_code code = gimple_assign_rhs_code (stmt);
> + if (code == FLOAT_EXPR)
> + return true;
> + return false;
> +}
> +
> +/* Make base and index fields from the memory reference REF,
> + return true and set *BASEP and *INDEXP respectively if it
> + is successful, otherwise return false. Since the
> + transformed vector load (lxvwsx) and vector store (stxsiwx)
> +   only support the reg + reg addressing mode, we need to ensure
> + the address satisfies it first. */
> +
> +static bool
> +make_base_and_index (tree ref, tree *basep, tree *indexp)
> +{
> + if (DECL_P (ref))
> + {
> + *basep
> + = fold_build1 (ADDR_EXPR, build_pointer_type (float32_type_node), ref);
> + *indexp = size_zero_node;
> + return true;
> + }
> +
> + enum tree_code code = TREE_CODE (ref);
> + if (code == TARGET_MEM_REF)
> + {
> + struct mem_address addr;
> + get_address_description (ref, &addr);
> + gcc_assert (!addr.step);
> + *basep = addr.symbol ? addr.symbol : addr.base;
> + if (addr.index)
> + {
> + /* Give up if having both offset and index, theoretically
> + we can generate one insn to update base with index, but
> + it results in more cost, so leave it conservatively. */
> + if (!integer_zerop (addr.offset))
> + return false;
> + *indexp = addr.index;
> + }
> + else
> + *indexp = addr.offset;
> + return true;
> + }
> +
> + if (code == MEM_REF)
> + {
> + *basep = TREE_OPERAND (ref, 0);
> + tree op1 = TREE_OPERAND (ref, 1);
> + *indexp = op1 ? op1 : size_zero_node;
> + return true;
> + }
> +
> + if (handled_component_p (ref))
> + {
> + machine_mode mode1;
> + poly_int64 bitsize, bitpos;
> + tree offset;
> + int reversep = 0, volatilep = 0, unsignedp = 0;
> + tree tem = get_inner_reference (ref, &bitsize, &bitpos, &offset, &mode1,
> + &unsignedp, &reversep, &volatilep);
> + if (reversep)
> + return false;
> +
> + poly_int64 bytepos = exact_div (bitpos, BITS_PER_UNIT);
> + if (offset)
> + {
> + gcc_assert (!integer_zerop (offset));
> + /* Give up if having both offset and bytepos. */
> + if (maybe_ne (bytepos, 0))
> + return false;
> + if (!is_gimple_variable (offset))
> + return false;
> + }
> +
> + tree base1, index1;
> + /* Further check the inner ref. */
> + if (!make_base_and_index (tem, &base1, &index1))
> + return false;
> +
> + if (integer_zerop (index1))
> + {
> + /* Only need to consider base1 and offset/bytepos. */
> + *basep = base1;
> + *indexp = offset ? offset : wide_int_to_tree (sizetype, bytepos);
> + return true;
> + }
> + /* Give up if having offset and index1. */
> + if (offset)
> + return false;
> + /* Give up if bytepos and index1 can not be folded. */
> + if (!poly_int_tree_p (index1))
> + return false;
> + poly_offset_int new_off
> + = wi::sext (wi::to_poly_offset (index1), TYPE_PRECISION (sizetype));
> + new_off += bytepos;
> +
> + poly_int64 new_index;
> + if (!new_off.to_shwi (&new_index))
> + return false;
> +
> + *basep = base1;
> + *indexp = wide_int_to_tree (sizetype, new_index);
> + return true;
> + }
> +
> + if (TREE_CODE (ref) == SSA_NAME)
> + {
> + /* Inner ref can come from a load. */
> + gimple *def = SSA_NAME_DEF_STMT (ref);
> + if (!gimple_assign_single_p (def))
> + return false;
> + tree ref1 = gimple_assign_rhs1 (def);
> + if (!DECL_P (ref1) && !REFERENCE_CLASS_P (ref1))
> + return false;
> +
> + tree base1, offset1;
> + if (!make_base_and_index (ref1, &base1, &offset1))
> + return false;
> + *basep = base1;
> + *indexp = offset1;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/* Check STMT is an expected SP float load or store, return true
> + if it is and update IS_LOAD, otherwise return false. */
> +
> +static bool
> +valid_load_store_p (gimple *stmt, bool &is_load)
> +{
> + if (!gimple_assign_single_p (stmt))
> + return false;
> +
> + tree lhs = gimple_assign_lhs (stmt);
> + if (TYPE_MODE (TREE_TYPE (lhs)) != SFmode)
> + return false;
> +
> + tree rhs = gimple_assign_rhs1 (stmt);
> + tree base, index;
> + if (TREE_CODE (lhs) == SSA_NAME
> + && (DECL_P (rhs) || REFERENCE_CLASS_P (rhs))
> + && make_base_and_index (rhs, &base, &index))
> + {
> + is_load = true;
> + return true;
> + }
> +
> + if ((DECL_P (lhs) || REFERENCE_CLASS_P (lhs))
> + && make_base_and_index (lhs, &base, &index))
> + {
> + is_load = false;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/* Check if it's valid to update the given STMT with the
> + equivalent vector form, return true if yes and also set
> + SF_TYPE to the proper sf_type, otherwise return false. */
> +
> +static bool
> +is_valid (gimple *stmt, enum sf_type &sf_type)
> +{
> + /* Give up if it has volatile type. */
> + if (gimple_has_volatile_ops (stmt))
> + return false;
> +
> + /* Give up if it can throw an exception. */
> + if (stmt_can_throw_internal (cfun, stmt))
> + return false;
> +
> + /* Process phi. */
> + gphi *gp = dyn_cast<gphi *> (stmt);
> + if (gp)
> + {
> + sf_type = SF_PHI;
> + return true;
> + }
> +
> + /* Process assignment. */
> + gassign *gass = dyn_cast<gassign *> (stmt);
> + if (gass)
> + {
> + bool is_load = false;
> + if (valid_load_store_p (stmt, is_load))
> + {
> + sf_type = is_load ? SF_LOAD : SF_STORE;
> + return true;
> + }
> +
> + tree lhs = gimple_assign_lhs (stmt);
> + if (!lhs || TREE_CODE (lhs) != SSA_NAME)
> + return false;
> + enum tree_code code = gimple_assign_rhs_code (stmt);
> + if (TREE_CODE_CLASS (code) == tcc_comparison)
> + {
> + tree rhs1 = gimple_assign_rhs1 (stmt);
> + tree rhs2 = gimple_assign_rhs2 (stmt);
> + tree type = TREE_TYPE (lhs);
> + if (!VECT_SCALAR_BOOLEAN_TYPE_P (type))
> + return false;
> + if (TYPE_MODE (type) != QImode)
> + return false;
> + type = TREE_TYPE (rhs1);
> + if (TYPE_MODE (type) != SFmode)
> + return false;
> + gcc_assert (TYPE_MODE (TREE_TYPE (rhs2)) == SFmode);
> + sf_type = SF_COMPARE;
> + return true;
> + }
> +
> + tree type = TREE_TYPE (lhs);
> + if (TYPE_MODE (type) != SFmode)
> + return false;
> +
> + if (special_assign_p (stmt))
> + {
> + sf_type = SF_SPECIAL;
> + return true;
> + }
> +
> + /* Check if vector operation is supported. */
> + sf_type = SF_NORMAL;
> + tree vectype = build_vector_type_for_mode (type, V4SFmode);
> + optab optab = optab_for_tree_code (code, vectype, optab_default);
> + if (!optab)
> + return false;
> + return optab_handler (optab, V4SFmode) != CODE_FOR_nothing;
> + }
> +
> + /* Process call. */
> + gcall *gc = dyn_cast<gcall *> (stmt);
> + /* TODO: Extend this to cover some other bifs. */
> + if (gc && gimple_call_internal_p (gc))
> + {
> + tree lhs = gimple_call_lhs (stmt);
> + if (!lhs)
> + return false;
> + if (TREE_CODE (lhs) != SSA_NAME)
> + return false;
> + tree type = TREE_TYPE (lhs);
> + if (TYPE_MODE (type) != SFmode)
> + return false;
> + enum internal_fn ifn = gimple_call_internal_fn (stmt);
> + tree vectype = build_vector_type_for_mode (type, V4SFmode);
> + if (direct_internal_fn_p (ifn))
> + {
> + const direct_internal_fn_info &info = direct_internal_fn (ifn);
> + if (info.vectorizable
> + && (direct_internal_fn_supported_p (ifn,
> + tree_pair (vectype, vectype),
> + OPTIMIZE_FOR_SPEED)))
> + {
> + sf_type = SF_CALL;
> + return true;
> + }
> + }
> + }
> +
> + return false;
> +}
> +
> +/* Process the given STMT; if it was visited before, just return true.
> +   On the first visit, add it to VISITED and check whether
> + the below ones are valid to be optimized with vector operation:
> + - itself
> + - all statements which define the operands involved here
> + - all statements which use the result of STMT
> + If all are valid, add STMT into CHAIN, create its own stmt_info
> + and return true. Otherwise, return false. */
> +
> +static bool
> +visit_stmt (gimple *stmt, vec<gimple *> &chain, hash_set<gimple *> &visited)
> +{
> + if (visited.add (stmt))
> + {
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, "Stmt visited: %G", stmt);
> + return true;
> + }
> + else if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, "Visiting stmt: %G", stmt);
> +
> + /* Checking this statement is valid for this optimization. */
> + enum sf_type st_type;
> + if (!is_valid (stmt, st_type))
> + {
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, "Invalid stmt: %G", stmt);
> + return false;
> + }
> +
> + /* For store, it's the end of this chain, don't need to
> + process anything further. For special assignment, we
> + don't want to process all statements using its result
> + and all statements defining its operands. */
> + if (st_type == SF_STORE || st_type == SF_SPECIAL)
> + {
> + chain.safe_push (stmt);
> + stmt_info_p si = new stmt_info (stmt, st_type, NULL);
> + stmt_info_map->put (stmt, si);
> + return true;
> + }
> +
> + /* Check all feeders of operands involved here. */
> +
> + /* Indicate which operand needs to be splatted, such as: constant. */
> + auto_bitmap splat_bm;
> + if (st_type != SF_LOAD)
> + {
> + unsigned nops = gimple_num_args (stmt);
> + for (unsigned i = 0; i < nops; i++)
> + {
> + tree op = gimple_arg (stmt, i);
> + if (TREE_CODE (op) != SSA_NAME
> + && TREE_CODE (op) != REAL_CST)
> + {
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, "With problematic %T in stmt: %G", op,
> + stmt);
> + return false;
> + }
> +
> + bool need_splat = false;
> + if (TREE_CODE (op) == SSA_NAME)
> + {
> + gimple *op_stmt = SSA_NAME_DEF_STMT (op);
> + if (gimple_code (op_stmt) == GIMPLE_NOP)
> + need_splat = true;
> + else if (!visit_stmt (op_stmt, chain, visited))
> + return false;
> + }
> + else
> + {
> + gcc_assert (TREE_CODE (op) == REAL_CST);
> + need_splat = true;
> + }
> +
> + if (need_splat)
> + bitmap_set_bit (splat_bm, i);
> + }
> + }
> +
> + /* Push this stmt before all its use stmts, then it's transformed
> + first during the transform phase, new_ops are prepared when
> + transforming use stmts. */
> + chain.safe_push (stmt);
> +
> + /* Comparison may have some constant operand, we need the above
> + handlings on splatting, but don't need any further processing
> + on all uses of its result. */
> + if (st_type == SF_COMPARE)
> + {
> + stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
> + stmt_info_map->put (stmt, si);
> + return true;
> + }
> +
> + /* Process each use of definition. */
> + gimple *use_stmt;
> + imm_use_iterator iter;
> + tree lhs = gimple_get_lhs (stmt);
> + FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> + if (!visit_stmt (use_stmt, chain, visited))
> + return false;
> +
> + /* Create the corresponding stmt_info. */
> + stmt_info_p si = new stmt_info (stmt, st_type, splat_bm);
> + stmt_info_map->put (stmt, si);
> + return true;
> +}
> +
> +/* Tree NEW_LHS with vector type has been used to replace the
> + original tree LHS, for each use of LHS, find each use stmt
> + and its corresponding stmt_info, update whose new_ops array
> + accordingly to prepare later replacement. */
> +
> +static void
> +update_all_uses (tree lhs, tree new_lhs, sf_type type)
> +{
> + gimple *use_stmt;
> + imm_use_iterator iter;
> + FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> + {
> + stmt_info_p *slot = stmt_info_map->get (use_stmt);
> +      /* Each use stmt should have been processed, except
> +	 for SF_SPECIAL stmts, for which we stop
> +	 processing early. */
> + gcc_assert (slot || type == SF_SPECIAL);
> + if (!slot)
> + continue;
> + stmt_info_p info = *slot;
> + unsigned n = gimple_num_args (use_stmt);
> + for (unsigned i = 0; i < n; i++)
> + if (gimple_arg (use_stmt, i) == lhs)
> + info->new_ops[i] = new_lhs;
> + }
> +}
> +
> +/* Remove old STMT and insert NEW_STMT before. */
> +
> +static void
> +replace_stmt (gimple_stmt_iterator *gsi_ptr, gimple *stmt, gimple *new_stmt)
> +{
> + gimple_set_location (new_stmt, gimple_location (stmt));
> + gimple_move_vops (new_stmt, stmt);
> + gsi_insert_before (gsi_ptr, new_stmt, GSI_SAME_STMT);
> + gsi_remove (gsi_ptr, true);
> +}
> +
> +/* Transform the given STMT with vector type; only transform
> +   phi stmts if HANDLE_PHI_P is true, since there are def-use
> +   cycles for phis, which we transform in a second round. */
> +
> +static void
> +transform_stmt (gimple *stmt, bool handle_phi_p = false)
> +{
> + stmt_info_p info = *stmt_info_map->get (stmt);
> +
> + /* This statement has been replaced. */
> + if (info->replace_stmt)
> + return;
> +
> + gcc_assert (!handle_phi_p || gimple_code (stmt) == GIMPLE_PHI);
> +
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " Transforming stmt: %G", stmt);
> +
> + tree lhs = gimple_get_lhs (stmt);
> + tree type = float_type_node;
> + tree vectype = build_vector_type_for_mode (type, V4SFmode);
> + gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " info->type: %d\n", info->type);
> +
> + /* Replace load with bif __builtin_vsx_lxvwsx. */
> + if (info->type == SF_LOAD)
> + {
> + tree fndecl = rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
> + tree rhs = gimple_op (stmt, 1);
> + tree base, index;
> + bool mem_p = make_base_and_index (rhs, &base, &index);
> + gcc_assert (mem_p);
> + gimple *load = gimple_build_call (fndecl, 2, index, base);
> + tree res = make_temp_ssa_name (vectype, NULL, "sf");
> + gimple_call_set_lhs (load, res);
> + info->replace_stmt = load;
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " => Gen load: %G", load);
> + update_all_uses (lhs, res, info->type);
> + replace_stmt (&gsi, stmt, load);
> + return;
> + }
> +
> + /* Replace store with bif __builtin_vsx_stxsiwx. */
> + if (info->type == SF_STORE)
> + {
> + tree fndecl = rs6000_builtin_decls[RS6000_BIF_STXSIWX_V4SF];
> + tree base, index;
> + bool mem_p = make_base_and_index (lhs, &base, &index);
> + gcc_assert (mem_p);
> + gcc_assert (info->new_ops[0]);
> + gimple *store
> + = gimple_build_call (fndecl, 3, info->new_ops[0], index, base);
> + info->replace_stmt = store;
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " => Gen store: %G", store);
> + replace_stmt (&gsi, stmt, store);
> + return;
> + }
> +
> + /* Generate vector construction for special stmt. */
> + if (info->type == SF_SPECIAL)
> + {
> + tree op = gimple_get_lhs (stmt);
> + tree val = build_vector_from_val (vectype, op);
> + tree res = make_temp_ssa_name (vectype, NULL, "sf");
> + gimple *splat = gimple_build_assign (res, val);
> + gimple_set_location (splat, gimple_location (stmt));
> + gsi_insert_after (&gsi, splat, GSI_SAME_STMT);
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " => Gen special %G", splat);
> + update_all_uses (lhs, res, info->type);
> + info->replace_stmt = splat;
> + return;
> + }
> +
> +  /* Handle the operands which don't have a corresponding vector
> +     operand yet, like those needing splatting etc. */
> + unsigned nargs = gimple_num_args (stmt);
> + gphi *phi = dyn_cast<gphi *> (stmt);
> + for (unsigned i = 0; i < nargs; i++)
> + {
> + /* This operand already has the replacing one. */
> + if (info->new_ops[i])
> + continue;
> + /* When only handling phi, all operands should have the
> + prepared new_op. */
> + gcc_assert (!handle_phi_p);
> + tree op = gimple_arg (stmt, i);
> + /* This operand needs splatting. */
> + if (bitmap_bit_p (info->splat_ops, i))
> + {
> + tree val = build_vector_from_val (vectype, op);
> + tree res = make_temp_ssa_name (vectype, NULL, "sf");
> + gimple *splat = gimple_build_assign (res, val);
> + /* If it's a PHI, push it to its incoming block. */
> + if (phi)
> + {
> + basic_block src = gimple_phi_arg_edge (phi, i)->src;
> + gimple_stmt_iterator src_gsi = gsi_last_bb (src);
> + if (!gsi_end_p (src_gsi) && stmt_ends_bb_p (gsi_stmt (src_gsi)))
> + gsi_insert_before (&src_gsi, splat, GSI_SAME_STMT);
> + else
> + gsi_insert_after (&src_gsi, splat, GSI_NEW_STMT);
> + }
> + else
> + gsi_insert_before (&gsi, splat, GSI_SAME_STMT);
> + info->new_ops[i] = res;
> + bitmap_clear_bit (info->splat_ops, i);
> + }
> + else
> + {
> + gcc_assert (TREE_CODE (op) == SSA_NAME);
> +	  /* Ensure all operands have the replacing new_op, except
> +	     for phi stmts. */
> + if (!phi)
> + {
> + gimple *def = SSA_NAME_DEF_STMT (op);
> + transform_stmt (def);
> + gcc_assert (info->new_ops[i]);
> + }
> + }
> + }
> +
> + gimple *new_stmt;
> + tree res;
> + if (info->type == SF_PHI)
> + {
> +      /* In the first round, ensure the phi result is prepared and
> +	 all its use stmts can be transformed well. */
> + if (!handle_phi_p)
> + {
> + res = info->gphi_res;
> + if (!res)
> + {
> + res = make_temp_ssa_name (vectype, NULL, "sf");
> + info->gphi_res = res;
> + }
> + update_all_uses (lhs, res, info->type);
> + return;
> + }
> + /* Transform actually at the second time. */
> + basic_block bb = gimple_bb (stmt);
> + gphi *new_phi = create_phi_node (info->gphi_res, bb);
> + for (unsigned i = 0; i < nargs; i++)
> + {
> + location_t loc = gimple_phi_arg_location (phi, i);
> + edge e = gimple_phi_arg_edge (phi, i);
> + add_phi_arg (new_phi, info->new_ops[i], e, loc);
> + }
> + gimple_set_location (new_phi, gimple_location (stmt));
> + remove_phi_node (&gsi, true);
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " => Gen phi %G", (gimple *) new_phi);
> + return;
> + }
> +
> + if (info->type == SF_COMPARE)
> + {
> + /* Build a vector comparison. */
> + tree vectype1 = truth_type_for (vectype);
> + tree res1 = make_temp_ssa_name (vectype1, NULL, "sf_vb4");
> + enum tree_code subcode = gimple_assign_rhs_code (stmt);
> + gimple *new_stmt1 = gimple_build_assign (res1, subcode, info->new_ops[0],
> + info->new_ops[1]);
> + gsi_insert_before (&gsi, new_stmt1, GSI_SAME_STMT);
> +
> + /* Build a VEC_COND_EXPR with -1 (true) or 0 (false). */
> + tree vectype2 = build_vector_type_for_mode (intSI_type_node, V4SImode);
> + tree res2 = make_temp_ssa_name (vectype2, NULL, "sf_vi4");
> + tree minus_one_vec = build_minus_one_cst (vectype2);
> + tree zero_vec = build_zero_cst (vectype2);
> + gimple *new_stmt2 = gimple_build_assign (res2, VEC_COND_EXPR, res1,
> + minus_one_vec, zero_vec);
> + gsi_insert_before (&gsi, new_stmt2, GSI_SAME_STMT);
> +
> + /* Build a BIT_FIELD_REF to extract lane 1 (BE ordering). */
> + tree bfr = build3 (BIT_FIELD_REF, intSI_type_node, res2, bitsize_int (32),
> + bitsize_int (BYTES_BIG_ENDIAN ? 32 : 64));
> + tree res3 = make_temp_ssa_name (intSI_type_node, NULL, "sf_i4");
> + gimple *new_stmt3 = gimple_build_assign (res3, BIT_FIELD_REF, bfr);
> + gsi_insert_before (&gsi, new_stmt3, GSI_SAME_STMT);
> +
> + /* Convert it accordingly. */
> + gimple *new_stmt = gimple_build_assign (lhs, NOP_EXPR, res3);
> +
> + if (dump_enabled_p ())
> + {
> + dump_printf (MSG_NOTE, " => Gen comparison: %G",
> + (gimple *) new_stmt1);
> + dump_printf (MSG_NOTE, " %G",
> + (gimple *) new_stmt2);
> + dump_printf (MSG_NOTE, " %G",
> + (gimple *) new_stmt3);
> + dump_printf (MSG_NOTE, " %G",
> + (gimple *) new_stmt);
> + }
> + gsi_replace (&gsi, new_stmt, false);
> + info->replace_stmt = new_stmt;
> + return;
> + }
> +
> + if (info->type == SF_CALL)
> + {
> + res = make_temp_ssa_name (vectype, NULL, "sf");
> + enum internal_fn ifn = gimple_call_internal_fn (stmt);
> + new_stmt = gimple_build_call_internal_vec (ifn, info->new_ops);
> + gimple_call_set_lhs (new_stmt, res);
> + }
> + else
> + {
> + gcc_assert (info->type == SF_NORMAL);
> + enum tree_code subcode = gimple_assign_rhs_code (stmt);
> + res = make_temp_ssa_name (vectype, NULL, "sf");
> +
> + if (nargs == 1)
> + new_stmt = gimple_build_assign (res, subcode, info->new_ops[0]);
> + else if (nargs == 2)
> + new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
> + info->new_ops[1]);
> + else
> + new_stmt = gimple_build_assign (res, subcode, info->new_ops[0],
> + info->new_ops[1], info->new_ops[2]);
> + }
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, " => Gen call/normal %G", new_stmt);
> + update_all_uses (lhs, res, info->type);
> + info->replace_stmt = new_stmt;
> + replace_stmt (&gsi, stmt, new_stmt);
> +}
> +
> +/* Start from load STMT, find and check all related statements are
> + valid to be optimized as vector operations, transform all of
> + them if succeed. */
> +
> +static void
> +process_chain_from_load (gimple *stmt)
> +{
> + auto_vec<gimple *> chain;
> + hash_set<gimple *> visited;
> +
> + /* Load is the first of its chain. */
> + chain.safe_push (stmt);
> + visited.add (stmt);
> +
> + if (dump_enabled_p ())
> + dump_printf (MSG_NOTE, "\nDetecting the chain from %G", stmt);
> +
> + gimple *use_stmt;
> + imm_use_iterator iter;
> + tree lhs = gimple_assign_lhs (stmt);
> + /* Propagate from uses of load result. */
> + FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    /* Fail if encountering anything unexpected. */
> + if (!visit_stmt (use_stmt, chain, visited))
> + return;
> +
> + if (dump_enabled_p ())
> + {
> + dump_printf (MSG_NOTE, "Found a chain from load %G", stmt);
> + for (gimple *s : chain)
> + dump_printf (MSG_NOTE, " -> %G", s);
> + dump_printf (MSG_NOTE, "\n");
> + }
> +
> + /* Create stmt info for this load. */
> + stmt_info_p si = new stmt_info (stmt, SF_LOAD, NULL);
> + stmt_info_map->put (stmt, si);
> +
> + /* Transform the chain. */
> + for (gimple *stmt : chain)
> + transform_stmt (stmt, false);
> + /* Handle the remaining phis. */
> + for (gimple *stmt : chain)
> + if (gimple_code (stmt) == GIMPLE_PHI)
> + transform_stmt (stmt, true);
> +}
> +
> +const pass_data pass_data_rs6000_p10sfopt = {
> + GIMPLE_PASS, /* type */
> + "rs6000_p10sfopt", /* name */
> + OPTGROUP_NONE, /* optinfo_flags */
> + TV_NONE, /* tv_id */
> + PROP_ssa, /* properties_required */
> + 0, /* properties_provided */
> + 0, /* properties_destroyed */
> + 0, /* todo_flags_start */
> + TODO_update_ssa, /* todo_flags_finish */
> +};
> +
> +class pass_rs6000_p10sfopt : public gimple_opt_pass
> +{
> +public:
> + pass_rs6000_p10sfopt (gcc::context *ctxt)
> + : gimple_opt_pass (pass_data_rs6000_p10sfopt, ctxt)
> + {
> + }
> +
> + bool
> + gate (function *fun) final override
> + {
> +    /* Not every FE initializes target built-ins, so we need to
> +       ensure the lxvwsx_v4sf decl is supported, and we can't do
> + this check in rs6000_option_override_internal since the
> + bif decls are uninitialized at that time. */
> + return TARGET_P10_SF_OPT
> + && optimize
> + && optimize_function_for_speed_p (fun)
> + && rs6000_builtin_decls[RS6000_BIF_LXVWSX_V4SF];
> + }
> +
> + unsigned int execute (function *) final override;
> +
> +}; /* end of class pass_rs6000_p10sfopt */
> +
> +unsigned int
> +pass_rs6000_p10sfopt::execute (function *fun)
> +{
> + stmt_info_map = new hash_map<gimple *, stmt_info_p>;
> + basic_block bb;
> + FOR_EACH_BB_FN (bb, fun)
> + {
> + for (gimple_stmt_iterator gsi = gsi_start_nondebug_after_labels_bb (bb);
> + !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> + {
> + gimple *stmt = gsi_stmt (gsi);
> +
> + switch (gimple_code (stmt))
> + {
> + case GIMPLE_ASSIGN:
> + if (gimple_assign_single_p (stmt))
> + {
> + bool is_load = false;
> + if (!stmt_info_map->get (stmt)
> + && valid_load_store_p (stmt, is_load)
> + && is_load)
> + process_chain_from_load (stmt);
> + }
> + break;
> + default:
> + break;
> + }
> + }
> + }
> +
> + for (info_map_t::iterator it = stmt_info_map->begin ();
> + it != stmt_info_map->end (); ++it)
> + {
> + stmt_info_p info = (*it).second;
> + delete info;
> + }
> + delete stmt_info_map;
> +
> + return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_rs6000_p10sfopt (gcc::context *ctxt)
> +{
> + return new pass_rs6000_p10sfopt (ctxt);
> +}
> +
> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
> index ca899d5f7af..bc59a7d5f99 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -24,6 +24,11 @@ along with GCC; see the file COPYING3. If not see
> REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
> */
>
> + /* Pass to mitigate the performance issue on scalar single precision
> +     floating point loads, by updating some scalar single precision
> +     floating point operations with appropriate vector operations. */
> + INSERT_PASS_BEFORE (pass_gimple_isel, 1, pass_rs6000_p10sfopt);
> +
> /* Pass to add the appropriate vector swaps on power8 little endian systems.
> The power8 does not have instructions that automaticaly do the byte swaps
> for loads and stores. */
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index f70118ea40f..aa0f782f186 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -341,9 +341,11 @@ extern unsigned rs6000_linux_libm_function_max_error (unsigned, machine_mode,
> /* Pass management. */
> namespace gcc { class context; }
> class rtl_opt_pass;
> +class gimple_opt_pass;
>
> extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
> extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
> +extern gimple_opt_pass *make_pass_rs6000_p10sfopt (gcc::context *);
> extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
> extern bool rs6000_quadword_masked_address_p (const_rtx exp);
> extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index cc24dd5301e..0e36860a73e 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4254,6 +4254,22 @@ rs6000_option_override_internal (bool global_init_p)
> rs6000_isa_flags &= ~OPTION_MASK_PCREL;
> }
>
> + if (TARGET_P10_SF_OPT)
> + {
> + if (!TARGET_HARD_FLOAT)
> + {
> + if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
> + error ("%qs requires %qs", "-mp10-sf-opt", "-mhard-float");
> + rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
> + }
> + if (!TARGET_P9_VECTOR)
> + {
> + if ((rs6000_isa_flags_explicit & OPTION_MASK_P10_SF_OPT) != 0)
> + error ("%qs requires %qs", "-mp10-sf-opt", "-mpower9-vector");
> + rs6000_isa_flags &= ~OPTION_MASK_P10_SF_OPT;
> + }
> + }
> +
> /* Print the options after updating the defaults. */
> if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
> rs6000_print_isa_options (stderr, 0, "after defaults", rs6000_isa_flags);
> @@ -22301,6 +22317,17 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
> *total = !speed ? COSTS_N_INSNS (1) + 1 : COSTS_N_INSNS (2);
> if (rs6000_slow_unaligned_access (mode, MEM_ALIGN (x)))
> *total += COSTS_N_INSNS (100);
> +      /* Treat vec_duplicate specially here: since the vector splat
> +	 insns {l,st}xv[wd]sx only support x-form, we should ensure
> +	 reg + reg is preferred over reg + const; otherwise cprop will
> +	 propagate the const and result in sub-optimal code. */
> + if (outer_code == VEC_DUPLICATE
> + && (GET_MODE_SIZE (mode) == 4
> + || GET_MODE_SIZE (mode) == 8)
> + && GET_CODE (XEXP (x, 0)) == PLUS
> + && CONST_INT_P (XEXP (XEXP (x, 0), 1))
> + && REG_P (XEXP (XEXP (x, 0), 0)))
> + *total += COSTS_N_INSNS (1);
> return true;
>
> case LABEL_REF:
> @@ -24443,6 +24470,7 @@ static struct rs6000_opt_mask const rs6000_opt_masks[] =
> { "modulo", OPTION_MASK_MODULO, false, true },
> { "mulhw", OPTION_MASK_MULHW, false, true },
> { "multiple", OPTION_MASK_MULTIPLE, false, true },
> + { "p10-sf-opt", OPTION_MASK_P10_SF_OPT, false, true },
> { "pcrel", OPTION_MASK_PCREL, false, true },
> { "pcrel-opt", OPTION_MASK_PCREL_OPT, false, true },
> { "popcntb", OPTION_MASK_POPCNTB, false, true },
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index bde6d3ff664..fb50e00d0d9 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -597,6 +597,11 @@ mmma
> Target Mask(MMA) Var(rs6000_isa_flags)
> Generate (do not generate) MMA instructions.
>
> +mp10-sf-opt
> +Target Mask(P10_SF_OPT) Var(rs6000_isa_flags)
> +Generate code to mitigate the single-precision floating point load performance
> +issue.
> +
> mrelative-jumptables
> Target Undocumented Var(rs6000_relative_jumptables) Init(1) Save
>
> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
> index f183b42ce1d..e7cd6d2f694 100644
> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -35,6 +35,10 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
> $(COMPILE) $<
> $(POSTCOMPILE)
>
> +rs6000-p10sfopt.o: $(srcdir)/config/rs6000/rs6000-p10sfopt.cc
> + $(COMPILE) $<
> + $(POSTCOMPILE)
> +
> rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
> $(COMPILE) $<
> $(POSTCOMPILE)
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index f3b40229094..690318e82b2 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -6690,3 +6690,14 @@ (define_insn "vmsumcud"
> "vmsumcud %0,%1,%2,%3"
> [(set_attr "type" "veccomplex")]
> )
> +
> +;; For expanding internal use bif __builtin_vsx_stxsiwx
> +(define_insn "vsx_stxsiwx_v4sf"
> + [(set (match_operand:SF 0 "memory_operand" "=Z")
> + (unspec:SF
> + [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
> + UNSPEC_STFIWX))]
> + "TARGET_P9_VECTOR"
> + "stxsiwx %x1,%y0"
> + [(set_attr "type" "fpstore")])
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> new file mode 100644
> index 00000000000..6e8c6a84de6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-1.c
> @@ -0,0 +1,22 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify Power10 SP floating point loading perf mitigation works
> +   as expected with one case having normal arithmetic. */
> +
> +void
> +saxpy (int n, float a, float *restrict x, float *restrict y)
> +{
> +#pragma GCC unroll 1
> + for (int i = 0; i < n; ++i)
> + y[i] = a * x[i] + y[i];
> +}
> +
> +/* Checking lfsx -> lxvwsx, stfsx -> stxsiwx, fmadds -> xvmaddmsp etc. */
> +/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mstxsiwx\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxvmaddmsp\M} 1 } } */
> +/* { dg-final { scan-assembler-not {\mlfsx?\M} } } */
> +/* { dg-final { scan-assembler-not {\mstfsx?\M} } } */
> +/* { dg-final { scan-assembler-not {\mfmadds\M} } } */
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> new file mode 100644
> index 00000000000..7593da8ecf4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-2.c
> @@ -0,0 +1,34 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -fno-tree-vectorize -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify Power10 SP floating point loading perf mitigation works
> +   as expected with one case having reduction. */
> +
> +/* Partially reduced from pytorch batch_norm_kernel.cpp. */
> +
> +typedef long long int64_t;
> +typedef float accscalar_t;
> +typedef float scalar_t;
> +
> +void
> +foo (int64_t n1, int64_t n2, accscalar_t sum, int64_t bound, int64_t N,
> + scalar_t *input_data, scalar_t *var_sum_data, int64_t index)
> +{
> + scalar_t mean = sum / N;
> + accscalar_t _var_sum = 0;
> + for (int64_t c = 0; c < n1; c++)
> + {
> + for (int64_t i = 0; i < n2; i++)
> + {
> + int64_t offset = index + i;
> + scalar_t x = input_data[offset];
> + _var_sum += (x - mean) * (x - mean);
> + }
> + var_sum_data[c] = _var_sum;
> + }
> +}
> +
> +/* { dg-final { scan-assembler {\mlxvwsx\M} } } */
> +/* { dg-final { scan-assembler {\mstxsiwx\M} } } */
> +/* { dg-final { scan-assembler {\mxvmaddasp\M} } } */
> +
> diff --git a/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> new file mode 100644
> index 00000000000..38aedd00faa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/p10-sf-opt-3.c
> @@ -0,0 +1,43 @@
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10 -mvsx -mp10-sf-opt" } */
> +
> +/* Verify Power10 SP floating point loading perf mitigation works
> +   as expected with one case having comparison. */
> +
> +/* Partially reduced from xgboost cpu_predictor.cc. */
> +
> +typedef struct {
> + unsigned int sindex;
> + signed int cleft;
> + unsigned int a1;
> + unsigned int a2;
> + float val;
> +} Node;
> +
> +extern void bar(Node *n);
> +
> +void
> +foo (Node *n0, float *pa, Node *var_843, int c)
> +{
> + Node *var_821;
> + Node *n = n0;
> + int cleft_idx = c;
> + do
> + {
> + unsigned idx = n->sindex;
> + idx = (idx & ((1U << 31) - 1U));
> + float f1 = pa[idx];
> + float f2 = n->val;
> + int t = f2 > f1;
> + int var_825 = cleft_idx + t;
> + unsigned long long var_823 = var_825;
> + var_821 = &var_843[var_823];
> + cleft_idx = var_821->cleft;
> + n = var_821;
> + } while (cleft_idx != -1);
> +
> + bar (n);
> +}
> +
> +/* { dg-final { scan-assembler-times {\mlxvwsx\M} 2 } } */
> +/* { dg-final { scan-assembler-times {\mxvcmpgtsp\M} 1 } } */
> --
> 2.39.3