On 28/03/2022 15:59, Richard Sandiford wrote:
> "Andre Vieira (lists)" writes:
>> Hi,
>>
>> Addressed all of your comments bar the pred ops one.
>>
>> Is this OK?
>>
>>
>> gcc/ChangeLog:
>>
>>         * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>         determine_suggested_unroll_factor and m_nosve_pattern.
>>         (determine_suggested_unroll_factor): New function.
>>         (aarch64_vector_costs::add_stmt_cost): Check for a qualifying pattern
>>         to set m_nosve_pattern.
>>         (aarch64_vector_costs::finish_costs): Use
>>         determine_suggested_unroll_factor.
>>         * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New.
>>
>> On 16/03/2022 18:01, Richard Sandiford wrote:
>>> "Andre Vieira (lists)" writes:
>>>> Hi,
>>>>
>>>> This patch implements the costing function
>>>> determine_suggested_unroll_factor for aarch64.
>>>> It determines the unrolling factor by dividing the number of X
>>>> operations we can do per cycle by the number of X operations in the loop
>>>> body, taking this information from the vec_ops analysis during vector
>>>> costing and the available issue_info information.
>>>> We multiply the dividend by a potential reduction_latency, to improve
>>>> our pipeline utilization if we are stalled waiting on a particular
>>>> reduction operation.
>>>>
>>>> Right now we also have a work around for vectorization choices where the
>>>> main loop uses a NEON mode and predication is available, such that if
>>>> the main loop makes use of a NEON pattern that is not directly supported
>>>> by SVE we do not unroll, as that might cause performance regressions in
>>>> cases where we would enter the original main loop's VF. As an example if
>>>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would
>>>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated
>>>> epilogue, using other instructions. Whereas with the unrolling you would
>>>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping
>>>> the original 8x NEON. In the future, we could handle this differently,
>>>> by either using a different costing model for epilogues, or potentially
>>>> vectorizing more than one single epilogue.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>>>         determine_suggested_unroll_factor.
>>>>         (determine_suggested_unroll_factor): New function.
>>>>         (aarch64_vector_costs::finish_costs): Use
>>>>         determine_suggested_unroll_factor.
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>> @@ -15680,6 +15680,7 @@ private:
>>>>   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>>>                                  unsigned int);
>>>>   bool prefer_unrolled_loop () const;
>>>> +  unsigned int determine_suggested_unroll_factor ();
>>>>
>>>>   /* True if we have performed one-time initialization based on the
>>>>      vec_info.  */
>>>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>>>   return sve_cycles_per_iter;
>>>> }
>>>>
>>>> +unsigned int
>>>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>>>> +{
>>>> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
>>>> +  if (!issue_info)
>>>> +    return 1;
>>>> +  bool sve = false;
>>>> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
>>> Other code uses m_vec_flags & VEC_ANY_SVE for this.
>>>
>>>> +    {
>>>> +      if (!issue_info->sve)
>>>> +        return 1;
>>>> +      sve = true;
>>>> +    }
>>>> +  else
>>>> +    {
>>>> +      if (!issue_info->advsimd)
>>>> +        return 1;
>>> The issue info should instead be taken from vec_ops.simd_issue_info ()
>>> in the loop below.  It can vary for each entry.
>>>
>>>> +      /* If we are trying to unroll a NEON main loop that contains patterns
>>> s/a NEON/an Advanced SIMD/
>>>
>>>> +         that we do not support with SVE and we might use a predicated
>>>> +         epilogue, we need to be conservative and block unrolling as this might
>>>> +         lead to a less optimal loop for the first and only epilogue using the
>>>> +         original loop's vectorization factor.
>>>> +         TODO: Remove this constraint when we add support for multiple epilogue
>>>> +         vectorization.  */
>>>> +      if (partial_vectors_supported_p ()
>>>> +          && param_vect_partial_vector_usage != 0
>>>> +          && !TARGET_SVE2)
>>>> +        {
>>>> +          unsigned int i;
>>>> +          stmt_vec_info stmt_vinfo;
>>>> +          FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>>>> +            {
>>>> +              if (is_pattern_stmt_p (stmt_vinfo))
>>>> +                {
>>>> +                  gimple *stmt = stmt_vinfo->stmt;
>>>> +                  if (is_gimple_call (stmt)
>>>> +                      && gimple_call_internal_p (stmt))
>>>> +                    {
>>>> +                      enum internal_fn ifn
>>>> +                        = gimple_call_internal_fn (stmt);
>>>> +                      switch (ifn)
>>>> +                        {
>>>> +                        case IFN_AVG_FLOOR:
>>>> +                        case IFN_AVG_CEIL:
>>>> +                          return 1;
>>>> +                        default:
>>>> +                          break;
>>>> +                        }
>>>> +                    }
>>>> +                }
>>>> +            }
>>>> +        }
>>> I think we should instead record this during add_stmt_cost, like we do
>>> for the other data that this function uses.
>>>
>>> We need to drop the partial_vectors_supported_p () condition though:
>>> enabling SVE must not make Advanced SIMD code worse.  So for now,
>>> Advanced SIMD-only code will have to suffer the missed unrolling
>>> opportunity too.
>>>
>>>> +    }
>>>> +
>>>> +  unsigned int max_unroll_factor = 1;
>>>> +  aarch64_simd_vec_issue_info const *vec_issue
>>>> +    = sve ? issue_info->sve : issue_info->advsimd;
>>>> +  for (auto vec_ops : m_ops)
>>>> +    {
>>>> +      /* Limit unroll factor to 4 for now.  */
>>>> +      unsigned int unroll_factor = 4;
>>> Would be good to make this an aarch64-specific --param in case people
>>> want to play with different values.
>>>
>>>> +      unsigned int factor
>>>> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>>>> +      unsigned int temp;
>>>> +
>>>> +      /* Sanity check, this should never happen.  */
>>>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>>>> +        return 1;
>>>> +
>>>> +      /* Check stores.  */
>>>> +      if (vec_ops.stores > 0)
>>>> +        {
>>>> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
>>>> +                       vec_ops.stores);
>>>> +          unroll_factor = MIN (unroll_factor, temp);
>>>> +        }
>>>> +
>>>> +      /* Check loads.  */
>>>> +      if (vec_ops.loads > 0)
>>>> +        {
>>>> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>>>> +                       vec_ops.loads);
>>> This should use loads + stores rather than just loads.
>>>
>>>> +          unroll_factor = MIN (unroll_factor, temp);
>>>> +        }
>>>> +
>>>> +      /* Check general ops.  */
>>>> +      if (vec_ops.general_ops > 0)
>>>> +        {
>>>> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>>>> +                       vec_ops.general_ops);
>>>> +          unroll_factor = MIN (unroll_factor, temp);
>>>> +        }
>>> In principle we should also take pred_ops into account when
>>> sve_issue_info () is nonnull, while ignoring the pred_ops associated
>>> with loop predication (added by analyze_loop_vinfo).  I guess we
>>> should maintain two pred_op counts.
>>>
>>> That's just a suggestion for a possible future improvement though.
>>> It doesn't need to be done as part of this patch.
>>>
>>> Thanks,
>>> Richard
>>>
>>>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>>>> +    }
>>>> +
>>>> +  /* Make sure unroll factor is power of 2.  */
>>>> +  return 1 << ceil_log2 (max_unroll_factor);
>>>> +}
>>>> +
>>>> /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>>>    and return the new cost.  */
>>>> unsigned int
>>>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>>   if (loop_vinfo
>>>>       && m_vec_flags
>>>>       && aarch64_use_new_vector_costs_p ())
>>>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>> -                                           m_costs[vect_body]);
>>>> +    {
>>>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>> +                                             m_costs[vect_body]);
>>>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>>> +    }
>>>>
>>>>   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>>>      the scalar code in the event of a tie, since there is more chance
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -15637,6 +15637,7 @@ private:
>>   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>                                  unsigned int);
>>   bool prefer_unrolled_loop () const;
>> +  unsigned int determine_suggested_unroll_factor ();
>>
>>   /* True if we have performed one-time initialization based on the
>>      vec_info.  */
>> @@ -15699,6 +15700,10 @@ private:
>>      or vector loop.  There is one entry for each tuning option of
>>      interest.  */
>>   auto_vec m_ops;
>> +
>> +  /* This loop uses a pattern that is not supported by SVE, but is supported by
>> +     Advanced SIMD and SVE2.  */
>> +  bool m_nosve_pattern = false;
> This is true for other things besides AVG_FLOOR/CEIL.  I think we should
> just accept that this is an x264 hack ;-) and name it “m_has_avg” instead.
>
> Could you put it with the other bool field?  That would lead to a nicer
> layout, although it hardly matters much here.
>
>> };
>>
>> aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
>> @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>          as one iteration of the SVE loop.  */
>>       if (where == vect_body && m_unrolled_advsimd_niters)
>>         m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
>> +
>> +      if (is_pattern_stmt_p (stmt_info))
>> +        {
> Not sure this is necessary.  We'd want the same thing if (for whatever
> reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future.
>
>> +          gimple *stmt = stmt_info->stmt;
>> +          if (is_gimple_call (stmt)
>> +              && gimple_call_internal_p (stmt))
>> +            {
>> +              enum internal_fn ifn
>> +                = gimple_call_internal_fn (stmt);
>> +              switch (ifn)
> Very minor nit, but might as well switch on gimple_call_internal_fn (stmt)
> directly, to avoid the awkward line break.
>
>> +                {
>> +                case IFN_AVG_FLOOR:
>> +                case IFN_AVG_CEIL:
>> +                  m_nosve_pattern = true;
>> +                default:
>> +                  break;
>> +                }
>> +            }
>> +        }
>>     }
>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>> }
>> @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>   return sve_cycles_per_iter;
>> }
>>
>> +unsigned int
>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>> +{
>> +  bool sve = m_vec_flags & VEC_ANY_SVE;
>> +  /* If we are trying to unroll an Advanced SIMD main loop that contains
>> +     patterns that we do not support with SVE and we might use a predicated
>> +     epilogue, we need to be conservative and block unrolling as this might
>> +     lead to a less optimal loop for the first and only epilogue using the
>> +     original loop's vectorization factor.
>> +     TODO: Remove this constraint when we add support for multiple epilogue
>> +     vectorization.  */
>> +  if (!sve && TARGET_SVE2 && m_nosve_pattern)
>> +    return 1;
> Shouldn't this be !TARGET_SVE2?
>
>> +
>> +  unsigned int max_unroll_factor = 1;
>> +  for (auto vec_ops : m_ops)
>> +    {
>> +      aarch64_simd_vec_issue_info const *vec_issue
>> +        = vec_ops.simd_issue_info ();
>> +      if (!vec_issue)
>> +        return 1;
>> +      /* Limit unroll factor to a value adjustable by the user, the default
>> +         value is 4.  */
>> +      unsigned int unroll_factor = aarch64_vect_unroll_limit;
>> +      unsigned int factor
>> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>> +      unsigned int temp;
>> +
>> +      /* Sanity check, this should never happen.  */
>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>> +        return 1;
>> +
>> +      /* Check stores.  */
>> +      if (vec_ops.stores > 0)
>> +        {
>> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
>> +                       vec_ops.stores);
>> +          unroll_factor = MIN (unroll_factor, temp);
>> +        }
>> +
>> +      /* Check loads + stores.  */
>> +      if (vec_ops.loads > 0)
>> +        {
>> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>> +                       vec_ops.loads + vec_ops.stores);
>> +          unroll_factor = MIN (unroll_factor, temp);
>> +        }
>> +
>> +      /* Check general ops.  */
>> +      if (vec_ops.general_ops > 0)
>> +        {
>> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>> +                       vec_ops.general_ops);
>> +          unroll_factor = MIN (unroll_factor, temp);
>> +        }
>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>> +    }
>> +
>> +  /* Make sure unroll factor is power of 2.  */
>> +  return 1 << ceil_log2 (max_unroll_factor);
>> +}
>> +
>> /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>    and return the new cost.  */
>> unsigned int
>> @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>   if (loop_vinfo
>>       && m_vec_flags
>>       && aarch64_use_new_vector_costs_p ())
>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> -                                           m_costs[vect_body]);
>> +    {
>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> +                                             m_costs[vect_body]);
>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>> +    }
>>
>>   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>      the scalar code in the event of a tie, since there is more chance
>> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>> index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
>> --- a/gcc/config/aarch64/aarch64.opt
>> +++ b/gcc/config/aarch64/aarch64.opt
>> @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
>>
>> -param=aarch64-mops-memset-size-threshold=
>> Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
>> Constant memset size in bytes from which to start using MOPS sequence.
>> +
>> +-param=aarch64-vect-unroll-limit=
>> +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
>> +Limit how much the autovectorizer may unroll a loop.
> This needs to be documented in invoke.texi, alongside the existing
> AArch64-specific params.
>
> OK with those changes, thanks.
>
> Richard

Committed attached patch.