From: Richard Biener <richard.guenther@gmail.com>
To: Feng Xue OS <fxue@os.amperecomputing.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
Date: Tue, 25 Jun 2024 12:26:03 +0200 [thread overview]
Message-ID: <CAFiYyc3uE-CwLn45kFmCwA7Kh6nASmGF6bg3yhQJLiNnMwKWZQ@mail.gmail.com> (raw)
In-Reply-To: <LV2PR01MB783913E7961168910BAC2DB7F7D52@LV2PR01MB7839.prod.exchangelabs.com>
On Tue, Jun 25, 2024 at 11:32 AM Feng Xue OS
<fxue@os.amperecomputing.com> wrote:
>
> >>
> >> >> - if (slp_node)
> >> >> + if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong. It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> >> > instead, which is bad.
> >> >
> >> >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >> else
> >> >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >> }
> >> >> }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> + the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> + Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> + (sum-of-absolute-differences).
> >> >> +
> >> >> + For a lane-reducing operation, the loop reduction path that it lies in
> >> >> + may contain a normal operation, or another lane-reducing operation with a
> >> >> + different input type size, for example:
> >> >> +
> >> >> + int sum = 0;
> >> >> + for (i)
> >> >> + {
> >> >> + ...
> >> >> + sum += d0[i] * d1[i]; // dot-prod <vector(16) char>
> >> >> + sum += w[i]; // widen-sum <vector(16) char>
> >> >> + sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> + sum += n[i]; // normal <vector(4) int>
> >> >> + ...
> >> >> + }
> >> >> +
> >> >> + Vectorization factor is essentially determined by the operation whose input
> >> >> + vectype has the most lanes ("vector(16) char" in the example), while we
> >> >> + need to choose input vectype with the least lanes ("vector(4) int" in the
> >> >> + example) for the reduction PHI statement. */
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> + slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> + gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> + if (!lane_reducing_stmt_p (stmt))
> >> >> + return false;
> >> >> +
> >> >> + tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> + return false;
> >> >> +
> >> >> + /* Do not try to vectorize bit-precision reductions. */
> >> >> + if (!type_has_mode_precision_p (type))
> >> >> + return false;
> >> >> +
> >> >> + if (!slp_node)
> >> >> + return false;
> >> >> +
> >> >> + for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> + {
> >> >> + stmt_vec_info def_stmt_info;
> >> >> + slp_tree slp_op;
> >> >> + tree op;
> >> >> + tree vectype;
> >> >> + enum vect_def_type dt;
> >> >> +
> >> >> + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> + &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> + {
> >> >> + if (dump_enabled_p ())
> >> >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> + "use not simple.\n");
> >> >> + return false;
> >> >> + }
> >> >> +
> >> >> + if (!vectype)
> >> >> + {
> >> >> + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> + slp_op);
> >> >> + if (!vectype)
> >> >> + return false;
> >> >> + }
> >> >> +
> >> >> + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> + {
> >> >> + if (dump_enabled_p ())
> >> >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> + "incompatible vector types for invariants\n");
> >> >> + return false;
> >> >> + }
> >> >> +
> >> >> + if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> + continue;
> >> >> +
> >> >> + /* There should be at most one cycle def in the stmt. */
> >> >> + if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> + return false;
> >> >> + }
> >> >> +
> >> >> + stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> + /* TODO: Support lane-reducing operation that does not directly participate
> >> >> + in loop reduction. */
> >> >> + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> + return false;
> >> >> +
> >> >> + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> + recognized. */
> >> >> + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> + int ncopies_for_cost;
> >> >> +
> >> >> + if (SLP_TREE_LANES (slp_node) > 1)
> >> >> + {
> >> >> + /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> + come from the same loop reduction path. */
> >> >> + gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> + ncopies_for_cost = 1;
> >> >> + }
> >> >> + else
> >> >> + {
> >> >> + ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result. That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain. In that sense instead of talking and recording
> >> > "input vector types" wouldn't it make more sense to record the effective
> >> > vectorization factor for the reduction instance? That VF would be at most
> >> > the loop's VF but could be as low as 1. Once we have a non-lane-reducing
> >> > operation in the reduction chain it would always be equal to the loop's VF.
> >> >
> >> > ncopies would then be always determined by that reduction instance VF and
> >> > the accumulator vector type (STMT_VINFO_VECTYPE). This reduction
> >> > instance VF would also trivially indicate the force-single-def-use-cycle
> >> > case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, but vectype_in is still needed in some
> >> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
> >> The former could be deduced from the latter, so recording both seems
> >> redundant. Another consideration is that for a normal op, ncopies is
> >> determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op
> >> it comes from the VF. So is there a better means to unify them?
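[Editorial aside: the deduction mentioned here can be made concrete with some illustrative arithmetic. The 128-bit vector size and VF = 16 are assumptions taken from the example earlier in the thread; this is only a sketch of the relationship, not the vectorizer's actual code. The point is that ncopies derived from the VF plus the input vectype agrees with ncopies derived per type, because one vector statement covers exactly `lanes` scalar iterations:]

```c
/* Lanes of a 128-bit vector holding elements of the given byte size
   (an illustrative stand-in for TYPE_VECTOR_SUBPARTS).  */
static int
lanes (int elem_bytes)
{
  return 16 / elem_bytes;
}

/* ncopies derived from the loop VF: each vector statement covers
   lanes (elem_bytes) scalar iterations of the reduction.  */
static int
ncopies_from_vf (int vf, int elem_bytes)
{
  return vf / lanes (elem_bytes);
}
```

With VF = 16 this gives 1 copy for a dot-product over chars, 2 for a sad over shorts, and 4 for a normal add over ints, matching the example loop.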
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loop's VF but as an optimization can choose a smaller one.
> > OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations which even for lane-reducing ops is adhered to - those just
> > may use a smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> > it's not fully redundant but needlessly replicated over all stmts participating
> > in the reduction instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> > for stmts in the reduction cycle.
> >
> > At least that was my idea ...
> >
>
> For lane-reducing ops and the single-defuse-cycle optimization, we could assume
> no lanes are reduced, and always generate vectorized statements
> according to the normal VF; where a placeholder is needed, just insert some trivial
> statement like a zero-initialization or a pass-through copy. We could then define an
> "effective VF or ncopies" to control the lane-reducing related aspects of analysis and
> codegen (such as the below vect_get_loop_mask). Since everything will eventually
> become SLP-based, I think a suitable place to add such a field might be the slp_node,
> as a supplement to "vect_stmts_size", and it is expected to be adjusted in
> vectorizable_reduction. So could we do this refinement as separate patches when the
> non-slp code path is removed?
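[Editorial aside: the pass-through-copy scheme discussed here can be modeled in plain C. This is only an illustrative sketch of the semantics, mirroring the sum_v0..sum_v3 example from the patch; the fixed 128-bit/VF = 16 numbers and the function name are assumptions for illustration, not GCC internals:]

```c
/* Model of VF = 16 with 128-bit vectors: a dot-product consumes one
   16 x char vector per iteration and feeds a single 4 x int
   accumulator, while the normal add needs four 4 x int accumulators.
   The def-use cycles the dot-product does not use receive trivial
   pass-through copies (modeled here by simply not touching them).  */
static int
model_reduction (const signed char *d0, const signed char *d1,
                 const int *n, int len)
{
  int sum_v[4] = { 0, 0, 0, 0 };

  for (int i = 0; i < len; i += 16)
    {
      /* dot-prod <vector(16) char>: all 16 products go into sum_v[0];
         sum_v[1..3] = pass-through copies.  */
      for (int k = 0; k < 16; k++)
        sum_v[0] += d0[i + k] * d1[i + k];

      /* normal <vector(4) int>: all four def-use cycles are used.  */
      for (int c = 0; c < 4; c++)
        for (int k = 0; k < 4; k++)
          sum_v[c] += n[i + c * 4 + k];
    }

  /* Final reduction of the four accumulators.  */
  return sum_v[0] + sum_v[1] + sum_v[2] + sum_v[3];
}
```

The result equals the plain scalar loop, which is why the extra accumulators cost nothing for the lane-reducing statement itself beyond the (removable) copy statements.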
I suppose so.
Thanks,
Richard.
> >> >> + gcc_assert (ncopies_for_cost >= 1);
> >> >> + }
> >> >> +
> >> >> + if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> + {
> >> >> + /* We need extra two invariants: one that contains the minimum signed
> >> >> + value and one that contains half of its negative. */
> >> >> + int prologue_stmts = 2;
> >> >> + unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> + scalar_to_vec, stmt_info, 0,
> >> >> + vect_prologue);
> >> >> + if (dump_enabled_p ())
> >> >> + dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> + "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> + /* Three dot-products and a subtraction. */
> >> >> + ncopies_for_cost *= 4;
> >> >> + }
> >> >> +
> >> >> + record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> + vect_body);
> >> >> +
> >> >> + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> + {
> >> >> + enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> + vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> + slp_node, code, type,
> >> >> + vectype_in);
> >> >> + }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> > /* Transform via vect_transform_reduction. */
> >> >
> >> >> + STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> + return true;
> >> >> +}
> >> >> +
> >> >> /* Function vectorizable_reduction.
> >> >>
> >> >> Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >> if (!type_has_mode_precision_p (op.type))
> >> >> return false;
> >> >>
> >> >> - /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> - which means the only use of that may be in the lane-reducing operation. */
> >> >> - if (lane_reducing
> >> >> - && reduc_chain_length != 1
> >> >> - && !only_slp_reduc_chain)
> >> >> - {
> >> >> - if (dump_enabled_p ())
> >> >> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> - "lane-reducing reduction with extra stmts.\n");
> >> >> - return false;
> >> >> - }
> >> >> -
> >> >> /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >> since we'll mix lanes belonging to different reductions. But it's
> >> >> OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >> && loop_vinfo->suggested_unroll_factor == 1)
> >> >> single_defuse_cycle = true;
> >> >>
> >> >> - if (single_defuse_cycle || lane_reducing)
> >> >> + if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op? So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> I don't quite understand this point. Shouldn't a non-lane-reducing op in the
> >> chain be handled in its own vectorizable_xxx function? The below check
> >> is only for the first statement (vect_reduction_def) in the reduction.
> >
> > Hmm. So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts. And the
> > following is then just for the case there's a single def that's not
> > lane-reducing
> > and we're forcing a single-def-use and thus go via vect_transform_reduction?
>
> Yes. A non-lane-reducing op with single-defuse-cycle is handled in the function.
> This logic is the same as the original.
>
> >> >
> >> >> {
> >> >> gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> - /* 4. Supportable by target? */
> >> >> - bool ok = true;
> >> >> -
> >> >> - /* 4.1. check support for the operation in the loop
> >> >> + /* 4. check support for the operation in the loop
> >> >>
> >> >> This isn't necessary for the lane reduction codes, since they
> >> >> can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >> mixed-sign dot-products can be implemented using signed
> >> >> dot-products. */
> >> >> machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> - if (!lane_reducing
> >> >> - && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> + if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> {
> >> >> if (dump_enabled_p ())
> >> >> dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >> if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >> || !vect_can_vectorize_without_simd_p (op.code))
> >> >> - ok = false;
> >> >> + single_defuse_cycle = false;
> >> >> else
> >> >> if (dump_enabled_p ())
> >> >> dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >> dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >> return false;
> >> >> }
> >> >> -
> >> >> - /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> - For the other cases try without the single cycle optimization. */
> >> >> - if (!ok)
> >> >> - {
> >> >> - if (lane_reducing)
> >> >> - return false;
> >> >> - else
> >> >> - single_defuse_cycle = false;
> >> >> - }
> >> >> }
> >> >> if (dump_enabled_p () && single_defuse_cycle)
> >> >> dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >> "multiple vectors to one in the loop body\n");
> >> >> STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> - /* If the reduction stmt is one of the patterns that have lane
> >> >> - reduction embedded we cannot handle the case of ! single_defuse_cycle. */
> >> >> - if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> - && lane_reducing)
> >> >> - {
> >> >> - if (dump_enabled_p ())
> >> >> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> - "multi def-use cycle not possible for lane-reducing "
> >> >> - "reduction operation\n");
> >> >> - return false;
> >> >> - }
> >> >> + /* For lane-reducing operation, the below processing related to single
> >> >> + defuse-cycle will be done in its own vectorizable function. One more
> >> >> + thing to note is that the operation must not be involved in fold-left
> >> >> + reduction. */
> >> >> + single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >> if (slp_node
> >> >> - && !(!single_defuse_cycle
> >> >> - && !lane_reducing
> >> >> - && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> + && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >> for (i = 0; i < (int) op.num_ops; i++)
> >> >> if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >> {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >> vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >> reduction_type, ncopies, cost_vec);
> >> >> /* Cost the reduction op inside the loop if transformed via
> >> >> - vect_transform_reduction. Otherwise this is costed by the
> >> >> - separate vectorizable_* routines. */
> >> >> - if (single_defuse_cycle || lane_reducing)
> >> >> - {
> >> >> - int factor = 1;
> >> >> - if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> - /* Three dot-products and a subtraction. */
> >> >> - factor = 4;
> >> >> - record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> - stmt_info, 0, vect_body);
> >> >> - }
> >> >> + vect_transform_reduction for non-lane-reducing operation. Otherwise
> >> >> + this is costed by the separate vectorizable_* routines. */
> >> >> + if (single_defuse_cycle)
> >> >> + record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >> if (dump_enabled_p ()
> >> >> && reduction_type == FOLD_LEFT_REDUCTION)
> >> >> dump_printf_loc (MSG_NOTE, vect_location,
> >> >> "using an in-order (fold-left) reduction.\n");
> >> >> STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> - /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> - reductions go through their own vectorizable_* routines. */
> >> >> - if (!single_defuse_cycle
> >> >> - && !lane_reducing
> >> >> - && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> + /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> + through their own vectorizable_* routines. */
> >> >> + if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> {
> >> >> stmt_vec_info tem
> >> >> = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >> bool lane_reducing = lane_reducing_op_p (code);
> >> >> gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> + if (lane_reducing)
> >> >> + {
> >> >> + /* The last operand of lane-reducing op is for reduction. */
> >> >> + gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> + /* Now all lane-reducing ops are covered by some slp node. */
> >> >> + gcc_assert (slp_node);
> >> >> + }
> >> >> +
> >> >> /* Create the destination vector */
> >> >> tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >> tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >> reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >> &vec_oprnds[2]);
> >> >> }
> >> >> + else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> + && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> + {
> >> >> + /* For lane-reducing op covered by single-lane slp node, the input
> >> >> + vectype of the reduction PHI determines copies of vectorized def-use
> >> >> + cycles, which might be more than effective copies of vectorized lane-
> >> >> + reducing reduction statements. This could be compensated for by
> >> >> + generating extra trivial pass-through copies. For example:
> >> >> +
> >> >> + int sum = 0;
> >> >> + for (i)
> >> >> + {
> >> >> + sum += d0[i] * d1[i]; // dot-prod <vector(16) char>
> >> >> + sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> + sum += n[i]; // normal <vector(4) int>
> >> >> + }
> >> >> +
> >> >> + The vector size is 128-bit, vectorization factor is 16. Reduction
> >> >> + statements would be transformed as:
> >> >> +
> >> >> + vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> + vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> + vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> + vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> + for (i / 16)
> >> >> + {
> >> >> + sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> + sum_v1 = sum_v1; // copy
> >> >> + sum_v2 = sum_v2; // copy
> >> >> + sum_v3 = sum_v3; // copy
> >> >> +
> >> >> + sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> + sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> + sum_v2 = sum_v2; // copy
> >> >> + sum_v3 = sum_v3; // copy
> >> >> +
> >> >> + sum_v0 += n_v0[i: 0 ~ 3 ];
> >> >> + sum_v1 += n_v1[i: 4 ~ 7 ];
> >> >> + sum_v2 += n_v2[i: 8 ~ 11];
> >> >> + sum_v3 += n_v3[i: 12 ~ 15];
> >> >> + }
> >> >> + */
> >> >> + unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> + unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies? Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below? Or simply iterate till op.num_ops
> >> > and skip i == reduc_index.
> >> >
> >> >> + for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> + {
> >> >> + gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> + vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> + }
> >> >> + }
> >> >>
> >> >> bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >> unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >> {
> >> >> gimple *new_stmt;
> >> >> tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> - if (masked_loop_p && !mask_by_cond_expr)
> >> >> +
> >> >> + if (!vop[0] || !vop[1])
> >> >> + {
> >> >> + tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> + /* Insert trivial copy if no need to generate vectorized
> >> >> + statement. */
> >> >> + gcc_assert (reduc_vop);
> >> >> +
> >> >> + new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> + new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> + gimple_set_lhs (new_stmt, new_temp);
> >> >> + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> > slp_node->push_vec_def (reduc_vop);
> >> > continue;
> >> >
> >> > without any code generation.
> >> >
> >>
> >> OK, that would be easy. Here comes another question: this patch assumes a
> >> lane-reducing op would always be contained in an slp node, since the
> >> single-lane slp node feature has been enabled. But I got some regressions
> >> when I enforced such a constraint in the lane-reducing op check. Those cases
> >> turn out to be unvectorizable with single-lane slp, so this should not be
> >> what we want, and needs to be fixed?
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there's known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> + }
> >> >> + else if (masked_loop_p && !mask_by_cond_expr)
> >> >> {
> >> >> /* No conditional ifns have been defined for lane-reducing op
> >> >> yet. */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >> if (masked_loop_p && mask_by_cond_expr)
> >> >> {
> >> >> + tree stmt_vectype_in = vectype_in;
> >> >> + unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> + if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> + {
> >> >> + /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> + that of lane-reducing operation. */
> >> >> + stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> + nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorize a dot_prod<16 x char> with 128-bit vector width,
> >> we should pass (nvectors=4, vectype=<4 x int>) instead of (nvectors=1,
> >> vectype=<16 x char>) to vect_get_loop_mask?
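[Editorial aside: the arithmetic behind that choice, as a sketch only. VF = 16 and 128-bit vectors are assumed from the example; this models the lane accounting, not the vect_get_loop_mask API itself:]

```c
/* Number of mask vectors needed so that nvectors * type_lanes covers
   all VF scalar iterations.  For the reduction PHI type 4 x int this
   gives 4; for the dot-product input type 16 x char it gives 1, which
   is why masking must be computed against the PHI's input vectype.  */
static int
nvectors_for_mask (int vf, int type_lanes)
{
  return vf / type_lanes;
}
```

Both choices cover the same 16 scalar iterations, but only the 4 x int form produces one mask per def-use cycle of the reduction PHI.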
> >
> > Probably - it depends on the vectorization factor. What I wanted to
> > point out is that
> > vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong. The
> > place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> > or we should forgo it (but that's possibly a post-only-SLP
> > cleanup to be done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed. For reductions
> > it's probably not quite right (and we might have latent issues like
> > those you are
> > "fixing" with code like above). The order we analyze stmts might also not be
> > optimal for reductions with SLP - in fact given that stmt analysis
> > relies on a fixed VF
> > it would probably make sense to determine the reduction VF in advance as well.
> > But again this sounds like post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest to always use reduct-VF and vectype to determine
> > the number of vector stmts rather than computing ncopies/vec_num separately.
> >
>
> Thanks,
> Feng
Thread overview: 9+ messages
2024-06-16 7:31 Feng Xue OS
2024-06-20 5:59 ` Feng Xue OS
2024-06-20 12:26 ` Richard Biener
2024-06-23 15:10 ` Feng Xue OS
2024-06-24 12:58 ` Richard Biener
2024-06-25 9:32 ` Feng Xue OS
2024-06-25 10:26 ` Richard Biener [this message]
2024-06-26 14:50 ` Feng Xue OS
2024-06-28 13:06 ` Richard Biener