On 28/03/2022 15:59, Richard Sandiford wrote:
> "Andre Vieira (lists)" writes:
>> Hi,
>>
>> Addressed all of your comments bar the pred ops one.
>>
>> Is this OK?
>>
>>
>> gcc/ChangeLog:
>>
>>         * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>         determine_suggested_unroll_factor and m_nosve_pattern.
>>         (determine_suggested_unroll_factor): New function.
>>         (aarch64_vector_costs::add_stmt_cost): Check for a qualifying pattern
>>         to set m_nosve_pattern.
>>         (aarch64_vector_costs::finish_costs): Use
>>         determine_suggested_unroll_factor.
>>         * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New.
>>
>> On 16/03/2022 18:01, Richard Sandiford wrote:
>>> "Andre Vieira (lists)" writes:
>>>> Hi,
>>>>
>>>> This patch implements the costing function
>>>> determine_suggested_unroll_factor for aarch64.
>>>> It determines the unrolling factor by dividing the number of X
>>>> operations we can do per cycle by the number of X operations in the loop
>>>> body, taking this information from the vec_ops analysis during vector
>>>> costing and the available issue_info information.
>>>> We multiply the dividend by a potential reduction_latency, to improve
>>>> our pipeline utilization if we are stalled waiting on a particular
>>>> reduction operation.
>>>>
>>>> Right now we also have a work around for vectorization choices where the
>>>> main loop uses a NEON mode and predication is available, such that if
>>>> the main loop makes use of a NEON pattern that is not directly supported
>>>> by SVE we do not unroll, as that might cause performance regressions in
>>>> cases where we would enter the original main loop's VF. As an example if
>>>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would
>>>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated
>>>> epilogue, using other instructions. Whereas with the unrolling you would
>>>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping
>>>> the original 8x NEON. In the future, we could handle this differently,
>>>> by either using a different costing model for epilogues, or potentially
>>>> vectorizing more than one single epilogue.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>>>         determine_suggested_unroll_factor.
>>>>         (determine_suggested_unroll_factor): New function.
>>>>         (aarch64_vector_costs::finish_costs): Use
>>>>         determine_suggested_unroll_factor.
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>> @@ -15680,6 +15680,7 @@ private:
>>>>   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>>>                                  unsigned int);
>>>>   bool prefer_unrolled_loop () const;
>>>> +  unsigned int determine_suggested_unroll_factor ();
>>>>
>>>>   /* True if we have performed one-time initialization based on the
>>>>      vec_info.  */
>>>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>>>   return sve_cycles_per_iter;
>>>> }
>>>>
>>>> +unsigned int
>>>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>>>> +{
>>>> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
>>>> +  if (!issue_info)
>>>> +    return 1;
>>>> +  bool sve = false;
>>>> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
>>> Other code uses m_vec_flags & VEC_ANY_SVE for this.
>>>
>>>> +    {
>>>> +      if (!issue_info->sve)
>>>> +        return 1;
>>>> +      sve = true;
>>>> +    }
>>>> +  else
>>>> +    {
>>>> +      if (!issue_info->advsimd)
>>>> +        return 1;
>>> The issue info should instead be taken from vec_ops.simd_issue_info ()
>>> in the loop below.  It can vary for each entry.
>>>
>>>> +      /* If we are trying to unroll a NEON main loop that contains patterns
>>> s/a NEON/an Advanced SIMD/
>>>
>>>> +         that we do not support with SVE and we might use a predicated
>>>> +         epilogue, we need to be conservative and block unrolling as this might
>>>> +         lead to a less optimal loop for the first and only epilogue using the
>>>> +         original loop's vectorization factor.
>>>> +         TODO: Remove this constraint when we add support for multiple epilogue
>>>> +         vectorization.  */
>>>> +      if (partial_vectors_supported_p ()
>>>> +          && param_vect_partial_vector_usage != 0
>>>> +          && !TARGET_SVE2)
>>>> +        {
>>>> +          unsigned int i;
>>>> +          stmt_vec_info stmt_vinfo;
>>>> +          FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>>>> +            {
>>>> +              if (is_pattern_stmt_p (stmt_vinfo))
>>>> +                {
>>>> +                  gimple *stmt = stmt_vinfo->stmt;
>>>> +                  if (is_gimple_call (stmt)
>>>> +                      && gimple_call_internal_p (stmt))
>>>> +                    {
>>>> +                      enum internal_fn ifn
>>>> +                        = gimple_call_internal_fn (stmt);
>>>> +                      switch (ifn)
>>>> +                        {
>>>> +                        case IFN_AVG_FLOOR:
>>>> +                        case IFN_AVG_CEIL:
>>>> +                          return 1;
>>>> +                        default:
>>>> +                          break;
>>>> +                        }
>>>> +                    }
>>>> +                }
>>>> +            }
>>>> +        }
>>> I think we should instead record this during add_stmt_cost, like we do
>>> for the other data that this function uses.
>>>
>>> We need to drop the partial_vectors_supported_p () condition though:
>>> enabling SVE must not make Advanced SIMD code worse.  So for now,
>>> Advanced SIMD-only code will have to suffer the missed unrolling
>>> opportunity too.
>>>
>>>> +    }
>>>> +
>>>> +  unsigned int max_unroll_factor = 1;
>>>> +  aarch64_simd_vec_issue_info const *vec_issue
>>>> +    = sve ? issue_info->sve : issue_info->advsimd;
>>>> +  for (auto vec_ops : m_ops)
>>>> +    {
>>>> +      /* Limit unroll factor to 4 for now.  */
>>>> +      unsigned int unroll_factor = 4;
>>> Would be good to make this an aarch64-specific --param in case people
>>> want to play with different values.
>>>
>>>> +      unsigned int factor
>>>> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>>>> +      unsigned int temp;
>>>> +
>>>> +      /* Sanity check, this should never happen.  */
>>>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>>>> +        return 1;
>>>> +
>>>> +      /* Check stores.  */
>>>> +      if (vec_ops.stores > 0)
>>>> +        {
>>>> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
>>>> +                       vec_ops.stores);
>>>> +          unroll_factor = MIN (unroll_factor, temp);
>>>> +        }
>>>> +
>>>> +      /* Check loads.  */
>>>> +      if (vec_ops.loads > 0)
>>>> +        {
>>>> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>>>> +                       vec_ops.loads);
>>> This should use loads + stores rather than just loads.
>>>
>>>> +          unroll_factor = MIN (unroll_factor, temp);
>>>> +        }
>>>> +
>>>> +      /* Check general ops.  */
>>>> +      if (vec_ops.general_ops > 0)
>>>> +        {
>>>> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>>>> +                       vec_ops.general_ops);
>>>> +          unroll_factor = MIN (unroll_factor, temp);
>>>> +        }
>>> In principle we should also take pred_ops into account when
>>> sve_issue_info () is nonnull, while ignoring the pred_ops associated
>>> with loop predication (added by analyze_loop_vinfo).  I guess we
>>> should maintain two pred_op counts.
>>>
>>> That's just a suggestion for a possible future improvement though.
>>> It doesn't need to be done as part of this patch.
>>>
>>> Thanks,
>>> Richard
>>>
>>>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>>>> +    }
>>>> +
>>>> +  /* Make sure unroll factor is power of 2.  */
>>>> +  return 1 << ceil_log2 (max_unroll_factor);
>>>> +}
>>>> +
>>>> /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>>>    and return the new cost.  */
>>>> unsigned int
>>>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>>   if (loop_vinfo
>>>>       && m_vec_flags
>>>>       && aarch64_use_new_vector_costs_p ())
>>>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>> -                                           m_costs[vect_body]);
>>>> +    {
>>>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>> +                                             m_costs[vect_body]);
>>>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>>> +    }
>>>>
>>>>   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>>>      the scalar code in the event of a tie, since there is more chance
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -15637,6 +15637,7 @@ private:
>>   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>                                  unsigned int);
>>   bool prefer_unrolled_loop () const;
>> +  unsigned int determine_suggested_unroll_factor ();
>>
>>   /* True if we have performed one-time initialization based on the
>>      vec_info.  */
>> @@ -15699,6 +15700,10 @@ private:
>>      or vector loop.  There is one entry for each tuning option of
>>      interest.  */
>>   auto_vec m_ops;
>> +
>> +  /* This loop uses a pattern that is not supported by SVE, but is supported by
>> +     Advanced SIMD and SVE2.  */
>> +  bool m_nosve_pattern = false;
> This is true for other things besides AVG_FLOOR/CEIL.  I think we should
> just accept that this is an x264 hack ;-) and name it “m_has_avg” instead.
>
> Could you put it with the other bool field?  That would lead to a nicer
> layout, although it hardly matters much here.
>
>> };
>>
>> aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
>> @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>          as one iteration of the SVE loop.  */
>>       if (where == vect_body && m_unrolled_advsimd_niters)
>>         m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
>> +
>> +      if (is_pattern_stmt_p (stmt_info))
>> +        {
> Not sure this is necessary.  We'd want the same thing if (for whatever
> reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future.
>
>> +          gimple *stmt = stmt_info->stmt;
>> +          if (is_gimple_call (stmt)
>> +              && gimple_call_internal_p (stmt))
>> +            {
>> +              enum internal_fn ifn
>> +                = gimple_call_internal_fn (stmt);
>> +              switch (ifn)
> Very minor nit, but might as well switch on gimple_call_internal_fn (stmt)
> directly, to avoid the awkward line break.
>
>> +                {
>> +                case IFN_AVG_FLOOR:
>> +                case IFN_AVG_CEIL:
>> +                  m_nosve_pattern = true;
>> +                default:
>> +                  break;
>> +                }
>> +            }
>> +        }
>>     }
>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>> }
>> @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>   return sve_cycles_per_iter;
>> }
>>
>> +unsigned int
>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>> +{
>> +  bool sve = m_vec_flags & VEC_ANY_SVE;
>> +  /* If we are trying to unroll an Advanced SIMD main loop that contains
>> +     patterns that we do not support with SVE and we might use a predicated
>> +     epilogue, we need to be conservative and block unrolling as this might
>> +     lead to a less optimal loop for the first and only epilogue using the
>> +     original loop's vectorization factor.
>> +     TODO: Remove this constraint when we add support for multiple epilogue
>> +     vectorization.  */
>> +  if (!sve && TARGET_SVE2 && m_nosve_pattern)
>> +    return 1;
> Shouldn't this be !TARGET_SVE2?
>
>> +
>> +  unsigned int max_unroll_factor = 1;
>> +  for (auto vec_ops : m_ops)
>> +    {
>> +      aarch64_simd_vec_issue_info const *vec_issue
>> +        = vec_ops.simd_issue_info ();
>> +      if (!vec_issue)
>> +        return 1;
>> +      /* Limit unroll factor to a value adjustable by the user, the default
>> +         value is 4.  */
>> +      unsigned int unroll_factor = aarch64_vect_unroll_limit;
>> +      unsigned int factor
>> +        = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>> +      unsigned int temp;
>> +
>> +      /* Sanity check, this should never happen.  */
>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>> +        return 1;
>> +
>> +      /* Check stores.  */
>> +      if (vec_ops.stores > 0)
>> +        {
>> +          temp = CEIL (factor * vec_issue->stores_per_cycle,
>> +                       vec_ops.stores);
>> +          unroll_factor = MIN (unroll_factor, temp);
>> +        }
>> +
>> +      /* Check loads + stores.  */
>> +      if (vec_ops.loads > 0)
>> +        {
>> +          temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>> +                       vec_ops.loads + vec_ops.stores);
>> +          unroll_factor = MIN (unroll_factor, temp);
>> +        }
>> +
>> +      /* Check general ops.  */
>> +      if (vec_ops.general_ops > 0)
>> +        {
>> +          temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>> +                       vec_ops.general_ops);
>> +          unroll_factor = MIN (unroll_factor, temp);
>> +        }
>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>> +    }
>> +
>> +  /* Make sure unroll factor is power of 2.  */
>> +  return 1 << ceil_log2 (max_unroll_factor);
>> +}
>> +
>> /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>    and return the new cost.  */
>> unsigned int
>> @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>   if (loop_vinfo
>>       && m_vec_flags
>>       && aarch64_use_new_vector_costs_p ())
>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> -                                           m_costs[vect_body]);
>> +    {
>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> +                                             m_costs[vect_body]);
>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>> +    }
>>
>>   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>      the scalar code in the event of a tie, since there is more chance
>> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>> index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
>> --- a/gcc/config/aarch64/aarch64.opt
>> +++ b/gcc/config/aarch64/aarch64.opt
>> @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
>>
>> -param=aarch64-mops-memset-size-threshold=
>> Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
>> Constant memset size in bytes from which to start using MOPS sequence.
>> +
>> +-param=aarch64-vect-unroll-limit=
>> +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
>> +Limit how much the autovectorizer may unroll a loop.
> This needs to be documented in invoke.texi, alongside the existing
> AArch64-specific params.
>
> OK with those changes, thanks.
>
> Richard

Committed attached patch.