Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
       [not found] <20230731140207.C8FC63858C39@sourceware.org>
@ 2023-08-01  8:47 ` Richard Sandiford
  2023-08-01  9:38   ` Richard Sandiford
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Sandiford @ 2023-08-01  8:47 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches; +Cc: Richard Biener

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> The following makes sure to limit the shift operand when vectorizing
> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> operand otherwise invokes undefined behavior.  When we determine
> whether we can demote the operand we know we at most shift in the
> sign bit so we can adjust the shift amount.
>
> Note this has the possibility of un-CSEing common shift operands
> as there's no good way to share pattern stmts between patterns.
> We'd have to separately pattern recognize the definition.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Not sure about LSHIFT_EXPR, it probably has the same issue but
> the fallback optimistic zero for out-of-range shifts is at least
> "corrrect".  Not sure we ever try to demote rotates (probably not).

I guess you mean "correct" for x86?  But that's just a quirk of x86.
IMO the behaviour is equally wrong for LSHIFT_EXPR.

Richard

> OK?
>
> Thanks,
> Richard.
>
> 	PR tree-optimization/110838
> 	* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
> 	Adjust the shift operand of RSHIFT_EXPRs.
>
> 	* gcc.dg/torture/pr110838.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/torture/pr110838.c | 43 +++++++++++++++++++++++++
>  gcc/tree-vect-patterns.cc               | 24 ++++++++++++++
>  2 files changed, 67 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr110838.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr110838.c b/gcc/testsuite/gcc.dg/torture/pr110838.c
> new file mode 100644
> index 00000000000..f039bd6c8ea
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr110838.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +
> +typedef __UINT32_TYPE__ uint32_t;
> +typedef __UINT8_TYPE__ uint8_t;
> +typedef __INT8_TYPE__ int8_t;
> +typedef uint8_t pixel;
> +
> +/* get the sign of input variable (TODO: this is a dup, make common) */
> +static inline int8_t signOf(int x)
> +{
> +  return (x >> 31) | ((int)((((uint32_t)-x)) >> 31));
> +}
> +
> +__attribute__((noipa))
> +static void calSign_bug(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
> +{
> +  for (int x = 0; x < endX; x++)
> +    dst[x] = signOf(src1[x] - src2[x]);
> +}
> +
> +__attribute__((noipa, optimize(0)))
> +static void calSign_ok(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
> +{
> +  for (int x = 0; x < endX; x++)
> +    dst[x] = signOf(src1[x] - src2[x]);
> +}
> +
> +__attribute__((noipa, optimize(0)))
> +int main()
> +{
> +  const pixel s1[9] = { 0xcd, 0x33, 0xd4, 0x3e, 0xb0, 0xfb, 0x95, 0x64, 0x70, };
> +  const pixel s2[9] = { 0xba, 0x9f, 0xab, 0xa1, 0x3b, 0x29, 0xb1, 0xbd, 0x64, };
> +  int endX = 9;
> +  int8_t dst[9];
> +  int8_t dst_ok[9];
> +
> +  calSign_bug(dst, s1, s2, endX);
> +  calSign_ok(dst_ok, s1, s2, endX);
> +
> +  if (__builtin_memcmp(dst, dst_ok, endX) != 0)
> +    __builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index ef806e2346e..e4ab8c2d65b 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3099,9 +3099,33 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
>    tree ops[3] = {};
>    for (unsigned int i = 1; i < first_op; ++i)
>      ops[i - 1] = gimple_op (last_stmt, i);
> +  /* For right shifts limit the shift operand.  */
>    vect_convert_inputs (vinfo, last_stmt_info, nops, &ops[first_op - 1],
>  		       op_type, &unprom[0], op_vectype);
>  
> +  /* Limit shift operands.  */
> +  if (code == RSHIFT_EXPR)
> +    {
> +      wide_int min_value, max_value;
> +      if (TREE_CODE (ops[1]) == INTEGER_CST)
> +	ops[1] = wide_int_to_tree (op_type,
> +				   wi::bit_and (wi::to_wide (ops[1]),
> +						new_precision - 1));
> +      else if (!vect_get_range_info (ops[1], &min_value, &max_value)
> +	       || wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
> +	{
> +	  /* ???  Note the following bad for SLP as that only supports
> +	     same argument widened shifts and it un-CSEs same arguments.  */
> +	  tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
> +	  gimple *pattern_stmt
> +	    = gimple_build_assign (new_var, BIT_AND_EXPR, ops[1],
> +				   build_int_cst (op_type, new_precision - 1));
> +	  ops[1] = new_var;
> +	  gimple_set_location (pattern_stmt, gimple_location (last_stmt));
> +	  append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
> +	}
> +    }
> +
>    /* Use the operation to produce a result of type OP_TYPE.  */
>    tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
>    gimple *pattern_stmt = gimple_build_assign (new_var, code,

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
  2023-08-01  8:47 ` [PATCH] tree-optimization/110838 - vectorization of widened shifts Richard Sandiford
@ 2023-08-01  9:38   ` Richard Sandiford
  2023-08-02 12:31     ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Sandiford @ 2023-08-01  9:38 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches; +Cc: Richard Biener

Richard Sandiford <richard.sandiford@arm.com> writes:
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> The following makes sure to limit the shift operand when vectorizing
>> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
>> operand otherwise invokes undefined behavior.  When we determine
>> whether we can demote the operand we know we at most shift in the
>> sign bit so we can adjust the shift amount.
>>
>> Note this has the possibility of un-CSEing common shift operands
>> as there's no good way to share pattern stmts between patterns.
>> We'd have to separately pattern recognize the definition.
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>
>> Not sure about LSHIFT_EXPR, it probably has the same issue but
>> the fallback optimistic zero for out-of-range shifts is at least
>> "corrrect".  Not sure we ever try to demote rotates (probably not).
>
> I guess you mean "correct" for x86?  But that's just a quirk of x86.
> IMO the behaviour is equally wrong for LSHIFT_EXPR.

Sorry for the multiple messages.  Wanted to get something out quickly
because I wasn't sure how long it would take me to write this...

On rotates, for:

void
foo (unsigned short *restrict ptr)
{
  for (int i = 0; i < 200; ++i)
    {
      unsigned int x = ptr[i] & 0xff0;
      ptr[i] = (x << 1) | (x >> 31);
    }
}

we do get:

can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;

although aarch64 doesn't provide rrotate patterns, so nothing actually
comes of it.

I think the handling of variable shifts is flawed for other reasons.  Given:

void
uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

void
us (unsigned short *restrict ptr1, short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

void
su (short *restrict ptr1, unsigned short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

void
ss (short *restrict ptr1, short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
    ptr1[i] = ptr1[i] >> ptr2[i];
}

we only narrow uu and ss, due to:

	    /* Ignore codes that don't take uniform arguments.  */
	    if (!types_compatible_p (TREE_TYPE (op), type))
	      return;

in vect_determine_precisions_from_range.  Maybe we should drop
the shift handling from there and instead rely on
vect_determine_precisions_from_users, extending:

	if (TREE_CODE (shift) != INTEGER_CST
	    || !wi::ltu_p (wi::to_widest (shift), precision))
	  return;

to handle ranges where the max is known to be < precision.

There again, if masking is enough for right shifts and right rotates,
maybe we should keep the current handling for then (with your fix)
and skip the types_compatible_p check for those cases.

So:

- restrict shift handling in vect_determine_precisions_from_range to
  RSHIFT_EXPR and RROTATE_EXPR

- remove types_compatible_p restriction for those cases

- extend vect_determine_precisions_from_users shift handling to check
  for ranges on the shift amount

Does that sound right?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
  2023-08-01  9:38   ` Richard Sandiford
@ 2023-08-02 12:31     ` Richard Biener
  2023-08-02 13:47       ` Richard Sandiford
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2023-08-02 12:31 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener via Gcc-patches

On Tue, 1 Aug 2023, Richard Sandiford wrote:

> Richard Sandiford <richard.sandiford@arm.com> writes:
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> The following makes sure to limit the shift operand when vectorizing
> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> >> operand otherwise invokes undefined behavior.  When we determine
> >> whether we can demote the operand we know we at most shift in the
> >> sign bit so we can adjust the shift amount.
> >>
> >> Note this has the possibility of un-CSEing common shift operands
> >> as there's no good way to share pattern stmts between patterns.
> >> We'd have to separately pattern recognize the definition.
> >>
> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >>
> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> the fallback optimistic zero for out-of-range shifts is at least
> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
> >
> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
> > IMO the behaviour is equally wrong for LSHIFT_EXPR.

I meant "correct" for the constant folding that evaluates out-of-bound
shifts as zero.

> Sorry for the multiple messages.  Wanted to get something out quickly
> because I wasn't sure how long it would take me to write this...
> 
> On rotates, for:
> 
> void
> foo (unsigned short *restrict ptr)
> {
>   for (int i = 0; i < 200; ++i)
>     {
>       unsigned int x = ptr[i] & 0xff0;
>       ptr[i] = (x << 1) | (x >> 31);
>     }
> }
> 
> we do get:
> 
> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
> 
> although aarch64 doesn't provide rrotate patterns, so nothing actually
> comes of it.

I think it's still correct that we only need unsigned:13 for the input,
we know other bits are zero.  But of course when actually applying
this as documented

/* Record that STMT_INFO could be changed from operating on TYPE to
   operating on a type with the precision and sign given by PRECISION
   and SIGN respectively.

the operation itself has to be altered (the above doesn't suggest
promoting/demoting the operands to TYPE is the only thing to do).

So it seems to be the burden is on the consumers of the information?

> I think the handling of variable shifts is flawed for other reasons.  Given:
> 
> void
> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> void
> us (unsigned short *restrict ptr1, short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> void
> su (short *restrict ptr1, unsigned short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> void
> ss (short *restrict ptr1, short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> we only narrow uu and ss, due to:
> 
> 	    /* Ignore codes that don't take uniform arguments.  */
> 	    if (!types_compatible_p (TREE_TYPE (op), type))
> 	      return;

I suppose that's because we care about the shift operand at all here.
We could possibly use [0 .. precision-1] as known range for it
and only if that doesn't fit 'type' give up (and otherwise simply
ignore the input range of the shift operands here).

> in vect_determine_precisions_from_range.  Maybe we should drop
> the shift handling from there and instead rely on
> vect_determine_precisions_from_users, extending:
> 
> 	if (TREE_CODE (shift) != INTEGER_CST
> 	    || !wi::ltu_p (wi::to_widest (shift), precision))
> 	  return;
> 
> to handle ranges where the max is known to be < precision.
> 
> There again, if masking is enough for right shifts and right rotates,
> maybe we should keep the current handling for then (with your fix)
> and skip the types_compatible_p check for those cases.

I think it should be enough for left-shifts as well?  If we lshift
out like 0x100 << 9 so the lhs range is [0,0] the input range from
op0 will still make us use HImode.  I think we only ever get overly
conservative answers for left-shifts from this function?

Whatever works for RROTATE should also work for LROTATE.

> So:
> 
> - restrict shift handling in vect_determine_precisions_from_range to
>   RSHIFT_EXPR and RROTATE_EXPR
> 
> - remove types_compatible_p restriction for those cases
> 
> - extend vect_determine_precisions_from_users shift handling to check
>   for ranges on the shift amount
> 
> Does that sound right?

I'm not sure.   This all felt somewhat fragile when looking closer
(I was hoping you would eventually tackle it from the older
referenced bug) ... so my main idea was to perform incremental changes
where I have test coverage (as with the new wrong-code bug).

Originally I completely disabled shift support but that regressed
the over-widen testcases a lot which at least have widened shifts
by constants a lot.

x86 has vector rotates only for AMD XOP (which is dead) plus
some for V1TImode AFAICS, but I think we pattern-match rotates
to shifts, so maybe the precision stuff is interesting for the
case where we match the pattern rotate sequence for widenings?

So for the types_compatible_p issue something along
the following?  We could also exempt the shift operand from
being covered by min_precision so the consumer would have
to make sure it can be represented (I think that's never going
to be an issue in practice until we get 256bit integers vectorized).
It will have to fixup the shift operands anyway.

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index e4ab8c2d65b..cdeeaf98a47 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
(stmt_vec_info stmt_info, gassign *stmt)
          }
        else if (TREE_CODE (op) == SSA_NAME)
          {
-           /* Ignore codes that don't take uniform arguments.  */
-           if (!types_compatible_p (TREE_TYPE (op), type))
+           /* Ignore codes that don't take uniform arguments.  For shifts
+              the shift amount is known to be in-range.  */
+           if (code == LSHIFT_EXPR
+               || code == RSHIFT_EXPR
+               || code == LROTATE_EXPR
+               || code == RROTATE_EXPR)
+             {
+               min_value = wi::min (min_value, 0, sign);
+               max_value = wi::max (max_value, TYPE_PRECISION (type), 
sign);
+             }
+           else if (types_compatible_p (TREE_TYPE (op), type))
+             {
+               wide_int op_min_value, op_max_value;
+               if (!vect_get_range_info (op, &op_min_value, 
&op_max_value))
+                 return;
+               min_value = wi::min (min_value, op_min_value, sign);
+               max_value = wi::max (max_value, op_max_value, sign);
+             }
+           else
              return;
-
-           wide_int op_min_value, op_max_value;
-           if (!vect_get_range_info (op, &op_min_value, &op_max_value))
-             return;
-
-           min_value = wi::min (min_value, op_min_value, sign);
-           max_value = wi::max (max_value, op_max_value, sign);
          }
        else
          return;


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
  2023-08-02 12:31     ` Richard Biener
@ 2023-08-02 13:47       ` Richard Sandiford
  2023-08-02 17:39         ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Sandiford @ 2023-08-02 13:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener via Gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Tue, 1 Aug 2023, Richard Sandiford wrote:
>
>> Richard Sandiford <richard.sandiford@arm.com> writes:
>> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> The following makes sure to limit the shift operand when vectorizing
>> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
>> >> operand otherwise invokes undefined behavior.  When we determine
>> >> whether we can demote the operand we know we at most shift in the
>> >> sign bit so we can adjust the shift amount.
>> >>
>> >> Note this has the possibility of un-CSEing common shift operands
>> >> as there's no good way to share pattern stmts between patterns.
>> >> We'd have to separately pattern recognize the definition.
>> >>
>> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> >>
>> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
>> >> the fallback optimistic zero for out-of-range shifts is at least
>> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
>> >
>> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
>> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
>
> I meant "correct" for the constant folding that evaluates out-of-bound
> shifts as zero.
>
>> Sorry for the multiple messages.  Wanted to get something out quickly
>> because I wasn't sure how long it would take me to write this...
>> 
>> On rotates, for:
>> 
>> void
>> foo (unsigned short *restrict ptr)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     {
>>       unsigned int x = ptr[i] & 0xff0;
>>       ptr[i] = (x << 1) | (x >> 31);
>>     }
>> }
>> 
>> we do get:
>> 
>> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
>> 
>> although aarch64 doesn't provide rrotate patterns, so nothing actually
>> comes of it.
>
> I think it's still correct that we only need unsigned:13 for the input,
> we know other bits are zero.  But of course when actually applying
> this as documented
>
> /* Record that STMT_INFO could be changed from operating on TYPE to
>    operating on a type with the precision and sign given by PRECISION
>    and SIGN respectively.
>
> the operation itself has to be altered (the above doesn't suggest
> promoting/demoting the operands to TYPE is the only thing to do).
>
> So it seems to be the burden is on the consumers of the information?

Yeah, textually that seems fair.  Not sure I was thinking of it in
those terms at the time though. :)

>> I think the handling of variable shifts is flawed for other reasons.  Given:
>> 
>> void
>> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> us (unsigned short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> su (short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> ss (short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> we only narrow uu and ss, due to:
>> 
>> 	    /* Ignore codes that don't take uniform arguments.  */
>> 	    if (!types_compatible_p (TREE_TYPE (op), type))
>> 	      return;
>
> I suppose that's because we care about the shift operand at all here.
> We could possibly use [0 .. precision-1] as known range for it
> and only if that doesn't fit 'type' give up (and otherwise simply
> ignore the input range of the shift operands here).
>
>> in vect_determine_precisions_from_range.  Maybe we should drop
>> the shift handling from there and instead rely on
>> vect_determine_precisions_from_users, extending:
>> 
>> 	if (TREE_CODE (shift) != INTEGER_CST
>> 	    || !wi::ltu_p (wi::to_widest (shift), precision))
>> 	  return;
>> 
>> to handle ranges where the max is known to be < precision.
>> 
>> There again, if masking is enough for right shifts and right rotates,
>> maybe we should keep the current handling for then (with your fix)
>> and skip the types_compatible_p check for those cases.
>
> I think it should be enough for left-shifts as well?  If we lshift
> out like 0x100 << 9 so the lhs range is [0,0] the input range from
> op0 will still make us use HImode.  I think we only ever get overly
> conservative answers for left-shifts from this function?

But if we have:

  short x, y;
  int z = (int) x << (int) y;

and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
whereas x << y would invoke UB and x << (y & 15) would be 1.

> Whatever works for RROTATE should also work for LROTATE.

I think the same problem affects LROTATE.

>> So:
>> 
>> - restrict shift handling in vect_determine_precisions_from_range to
>>   RSHIFT_EXPR and RROTATE_EXPR
>> 
>> - remove types_compatible_p restriction for those cases
>> 
>> - extend vect_determine_precisions_from_users shift handling to check
>>   for ranges on the shift amount
>> 
>> Does that sound right?
>
> I'm not sure.   This all felt somewhat fragile when looking closer
> (I was hoping you would eventually tackle it from the older
> referenced bug) ... so my main idea was to perform incremental changes
> where I have test coverage (as with the new wrong-code bug).

Fair :)

> Originally I completely disabled shift support but that regressed
> the over-widen testcases a lot which at least have widened shifts
> by constants a lot.
>
> x86 has vector rotates only for AMD XOP (which is dead) plus
> some for V1TImode AFAICS, but I think we pattern-match rotates
> to shifts, so maybe the precision stuff is interesting for the
> case where we match the pattern rotate sequence for widenings?
>
> So for the types_compatible_p issue something along
> the following?  We could also exempt the shift operand from
> being covered by min_precision so the consumer would have
> to make sure it can be represented (I think that's never going
> to be an issue in practice until we get 256bit integers vectorized).
> It will have to fixup the shift operands anyway.
>
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index e4ab8c2d65b..cdeeaf98a47 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
> (stmt_vec_info stmt_info, gassign *stmt)
>           }
>         else if (TREE_CODE (op) == SSA_NAME)
>           {
> -           /* Ignore codes that don't take uniform arguments.  */
> -           if (!types_compatible_p (TREE_TYPE (op), type))
> +           /* Ignore codes that don't take uniform arguments.  For shifts
> +              the shift amount is known to be in-range.  */

I guess it's more "we can assume that the amount is in range"?

> +           if (code == LSHIFT_EXPR
> +               || code == RSHIFT_EXPR
> +               || code == LROTATE_EXPR
> +               || code == RROTATE_EXPR)
> +             {
> +               min_value = wi::min (min_value, 0, sign);
> +               max_value = wi::max (max_value, TYPE_PRECISION (type), 
> sign);

LGTM for shifts right.  Because of the above lshift thing, I think we
need something like:

  if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
    {
      wide_int op_min_value, op_max_value;
      if (!vect_get_range_info (op, &op_min_value, op_max_value))
        return;

      /* We can ignore left shifts by negative amounts, which are UB.  */
      min_value = wi::min (min_value, 0, sign);

      /* Make sure the highest non-UB shift amount doesn't become UB.  */
      op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
      auto mask = wi::mask (TYPE_PRECISION (type), false,
			    op_max_value.to_uhwi ());
      max_value = wi::max (max_value, mask, sign);
    }

Does that look right?

Thanks, and sorry for the scope creep.

Richard

> +             }
> +           else if (types_compatible_p (TREE_TYPE (op), type))
> +             {
> +               wide_int op_min_value, op_max_value;
> +               if (!vect_get_range_info (op, &op_min_value, 
> &op_max_value))
> +                 return;
> +               min_value = wi::min (min_value, op_min_value, sign);
> +               max_value = wi::max (max_value, op_max_value, sign);
> +             }
> +           else
>               return;
> -
> -           wide_int op_min_value, op_max_value;
> -           if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> -             return;
> -
> -           min_value = wi::min (min_value, op_min_value, sign);
> -           max_value = wi::max (max_value, op_max_value, sign);
>           }
>         else
>           return;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
  2023-08-02 13:47       ` Richard Sandiford
@ 2023-08-02 17:39         ` Richard Biener
  2023-08-02 19:48           ` Richard Sandiford
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Biener @ 2023-08-02 17:39 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener via Gcc-patches

On Wed, 2 Aug 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Tue, 1 Aug 2023, Richard Sandiford wrote:
> >
> >> Richard Sandiford <richard.sandiford@arm.com> writes:
> >> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> The following makes sure to limit the shift operand when vectorizing
> >> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> >> >> operand otherwise invokes undefined behavior.  When we determine
> >> >> whether we can demote the operand we know we at most shift in the
> >> >> sign bit so we can adjust the shift amount.
> >> >>
> >> >> Note this has the possibility of un-CSEing common shift operands
> >> >> as there's no good way to share pattern stmts between patterns.
> >> >> We'd have to separately pattern recognize the definition.
> >> >>
> >> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >>
> >> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> >> the fallback optimistic zero for out-of-range shifts is at least
> >> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
> >> >
> >> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
> >> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
> >
> > I meant "correct" for the constant folding that evaluates out-of-bound
> > shifts as zero.
> >
> >> Sorry for the multiple messages.  Wanted to get something out quickly
> >> because I wasn't sure how long it would take me to write this...
> >> 
> >> On rotates, for:
> >> 
> >> void
> >> foo (unsigned short *restrict ptr)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >>     {
> >>       unsigned int x = ptr[i] & 0xff0;
> >>       ptr[i] = (x << 1) | (x >> 31);
> >>     }
> >> }
> >> 
> >> we do get:
> >> 
> >> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
> >> 
> >> although aarch64 doesn't provide rrotate patterns, so nothing actually
> >> comes of it.
> >
> > I think it's still correct that we only need unsigned:13 for the input,
> > we know other bits are zero.  But of course when actually applying
> > this as documented
> >
> > /* Record that STMT_INFO could be changed from operating on TYPE to
> >    operating on a type with the precision and sign given by PRECISION
> >    and SIGN respectively.
> >
> > the operation itself has to be altered (the above doesn't suggest
> > promoting/demoting the operands to TYPE is the only thing to do).
> >
> > So it seems to be the burden is on the consumers of the information?
> 
> Yeah, textually that seems fair.  Not sure I was thinking of it in
> those terms at the time though. :)
> 
> >> I think the handling of variable shifts is flawed for other reasons.  Given:
> >> 
> >> void
> >> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >>     ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> void
> >> us (unsigned short *restrict ptr1, short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >>     ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> void
> >> su (short *restrict ptr1, unsigned short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >>     ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> void
> >> ss (short *restrict ptr1, short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >>     ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> we only narrow uu and ss, due to:
> >> 
> >> 	    /* Ignore codes that don't take uniform arguments.  */
> >> 	    if (!types_compatible_p (TREE_TYPE (op), type))
> >> 	      return;
> >
> > I suppose that's because we care about the shift operand at all here.
> > We could possibly use [0 .. precision-1] as known range for it
> > and only if that doesn't fit 'type' give up (and otherwise simply
> > ignore the input range of the shift operands here).
> >
> >> in vect_determine_precisions_from_range.  Maybe we should drop
> >> the shift handling from there and instead rely on
> >> vect_determine_precisions_from_users, extending:
> >> 
> >> 	if (TREE_CODE (shift) != INTEGER_CST
> >> 	    || !wi::ltu_p (wi::to_widest (shift), precision))
> >> 	  return;
> >> 
> >> to handle ranges where the max is known to be < precision.
> >> 
> >> There again, if masking is enough for right shifts and right rotates,
> >> maybe we should keep the current handling for then (with your fix)
> >> and skip the types_compatible_p check for those cases.
> >
> > I think it should be enough for left-shifts as well?  If we lshift
> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
> > op0 will still make us use HImode.  I think we only ever get overly
> > conservative answers for left-shifts from this function?
> 
> But if we have:
> 
>   short x, y;
>   int z = (int) x << (int) y;
> 
> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
> whereas x << y would invoke UB and x << (y & 15) would be 1.

True, but we start with the range of the LHS which in this case
would be of type 'int' and thus 1 << 16 and not zero.  You
might call that a failure of vect_determine_precisions_from_range
of course, since it makes it not exactly a forward propagation ...

> > Whatever works for RROTATE should also work for LROTATE.
> 
> I think the same problem affects LROTATE.
>
> >> So:
> >> 
> >> - restrict shift handling in vect_determine_precisions_from_range to
> >>   RSHIFT_EXPR and RROTATE_EXPR
> >> 
> >> - remove types_compatible_p restriction for those cases
> >> 
> >> - extend vect_determine_precisions_from_users shift handling to check
> >>   for ranges on the shift amount
> >> 
> >> Does that sound right?
> >
> > I'm not sure.   This all felt somewhat fragile when looking closer
> > (I was hoping you would eventually tackle it from the older
> > referenced bug) ... so my main idea was to perform incremental changes
> > where I have test coverage (as with the new wrong-code bug).
> 
> Fair :)
> 
> > Originally I completely disabled shift support but that regressed
> > the over-widen testcases a lot which at least have widened shifts
> > by constants a lot.
> >
> > x86 has vector rotates only for AMD XOP (which is dead) plus
> > some for V1TImode AFAICS, but I think we pattern-match rotates
> > to shifts, so maybe the precision stuff is interesting for the
> > case where we match the pattern rotate sequence for widenings?
> >
> > So for the types_compatible_p issue something along
> > the following?  We could also exempt the shift operand from
> > being covered by min_precision so the consumer would have
> > to make sure it can be represented (I think that's never going
> > to be an issue in practice until we get 256bit integers vectorized).
> > It will have to fixup the shift operands anyway.
> >
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index e4ab8c2d65b..cdeeaf98a47 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
> > (stmt_vec_info stmt_info, gassign *stmt)
> >           }
> >         else if (TREE_CODE (op) == SSA_NAME)
> >           {
> > -           /* Ignore codes that don't take uniform arguments.  */
> > -           if (!types_compatible_p (TREE_TYPE (op), type))
> > +           /* Ignore codes that don't take uniform arguments.  For shifts
> > +              the shift amount is known to be in-range.  */
> 
> I guess it's more "we can assume that the amount is in range"?

Yes.

> > +           if (code == LSHIFT_EXPR
> > +               || code == RSHIFT_EXPR
> > +               || code == LROTATE_EXPR
> > +               || code == RROTATE_EXPR)
> > +             {
> > +               min_value = wi::min (min_value, 0, sign);
> > +               max_value = wi::max (max_value, TYPE_PRECISION (type), 
> > sign);
> 
> LGTM for shifts right.  Because of the above lshift thing, I think we
> need something like:
> 
>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
>     {
>       wide_int op_min_value, op_max_value;
>       if (!vect_get_range_info (op, &op_min_value, op_max_value))
>         return;
> 
>       /* We can ignore left shifts by negative amounts, which are UB.  */
>       min_value = wi::min (min_value, 0, sign);
> 
>       /* Make sure the highest non-UB shift amount doesn't become UB.  */
>       op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
>       auto mask = wi::mask (TYPE_PRECISION (type), false,
> 			    op_max_value.to_uhwi ());
>       max_value = wi::max (max_value, mask, sign);
>     }
> 
> Does that look right?

As said it looks overly conservative to me?  For example with my patch
for

void foo (signed char *v, int s)
{
  if (s < 1 || s > 7)
    return;
  for (int i = 0; i < 1024; ++i)
    v[i] = v[i] << s;
}

I get

t.c:5:21: note:   _7 has range [0xffffc000, 0x3f80]
t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = 
_6 << s_12(D);
t.c:5:21: note:   only the low 15 bits of _6 are significant
t.c:5:21: note:   _6 has range [0xffffff80, 0x7f]
...
t.c:5:21: note:   vect_recog_over_widening_pattern: detected: _7 = _6 << 
s_12(D);
t.c:5:21: note:   demoting int to signed short
t.c:5:21: note:   Splitting statement: _6 = (int) _5;
t.c:5:21: note:   into pattern statements: patt_24 = (signed short) _5;
t.c:5:21: note:   and: patt_23 = (int) patt_24;
t.c:5:21: note:   created pattern stmt: patt_21 = patt_24 << patt_22;
t.c:5:21: note:   over_widening pattern recognized: patt_18 = (int) 
patt_21;

and eventually

  patt_22 = (signed short) s_12(D);
...
  vect__5.7_28 = MEM <vector(16) signed char> [(signed char 
*)vectp_v.5_10];
  vect_patt_24.8_29 = [vec_unpack_lo_expr] vect__5.7_28;
  vect_patt_24.8_30 = [vec_unpack_hi_expr] vect__5.7_28;
  vect_patt_21.9_31 = vect_patt_24.8_29 << patt_22;
  vect_patt_21.9_32 = vect_patt_24.8_30 << patt_22;
  vect_patt_17.10_33 = VEC_PACK_TRUNC_EXPR <vect_patt_21.9_31, 
vect_patt_21.9_32>;
  MEM <vector(16) signed char> [(signed char *)vectp_v.11_34] = 
vect_patt_17.10_33;

which I think is correct.  If I'd up the range of the shift
operand then the range for _7 would be larger and we wouldn't
be able to shrink the range.  I suppose the _from_users
might back-propagate the truncation to signed char and enable
that though?

> Thanks, and sorry for the scope creep.

Well, if we can agree on that the patch as originally is
correct (but incomplete) we can maybe still install it.

As seen above I wonder if there's really an existing problem
for LSHIFT_EXPR.  I'm sure we mishandle rotates, but again
at the consumer side - in priciple we could analyze them
as lshift + rshift which means the existing behavior is
probably also correct.

Ricahrd.

> Richard
> 
> > +             }
> > +           else if (types_compatible_p (TREE_TYPE (op), type))
> > +             {
> > +               wide_int op_min_value, op_max_value;
> > +               if (!vect_get_range_info (op, &op_min_value, 
> > &op_max_value))
> > +                 return;
> > +               min_value = wi::min (min_value, op_min_value, sign);
> > +               max_value = wi::max (max_value, op_max_value, sign);
> > +             }
> > +           else
> >               return;
> > -
> > -           wide_int op_min_value, op_max_value;
> > -           if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> > -             return;
> > -
> > -           min_value = wi::min (min_value, op_min_value, sign);
> > -           max_value = wi::max (max_value, op_max_value, sign);
> >           }
> >         else
> >           return;


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
  2023-08-02 17:39         ` Richard Biener
@ 2023-08-02 19:48           ` Richard Sandiford
  2023-08-03 12:53             ` Richard Biener
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Sandiford @ 2023-08-02 19:48 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Biener via Gcc-patches

Richard Biener <rguenther@suse.de> writes:
> [...]
>> >> in vect_determine_precisions_from_range.  Maybe we should drop
>> >> the shift handling from there and instead rely on
>> >> vect_determine_precisions_from_users, extending:
>> >> 
>> >> 	if (TREE_CODE (shift) != INTEGER_CST
>> >> 	    || !wi::ltu_p (wi::to_widest (shift), precision))
>> >> 	  return;
>> >> 
>> >> to handle ranges where the max is known to be < precision.
>> >> 
>> >> There again, if masking is enough for right shifts and right rotates,
>> >> maybe we should keep the current handling for then (with your fix)
>> >> and skip the types_compatible_p check for those cases.
>> >
>> > I think it should be enough for left-shifts as well?  If we lshift
>> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
>> > op0 will still make us use HImode.  I think we only ever get overly
>> > conservative answers for left-shifts from this function?
>> 
>> But if we have:
>> 
>>   short x, y;
>>   int z = (int) x << (int) y;
>> 
>> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
>> whereas x << y would invoke UB and x << (y & 15) would be 1.
>
> True, but we start with the range of the LHS which in this case
> would be of type 'int' and thus 1 << 16 and not zero.  You
> might call that a failure of vect_determine_precisions_from_range
> of course, since it makes it not exactly a forward propagation ...

Ah, right, sorry.  I should have done more checking.

> [...]
>> > Originally I completely disabled shift support but that regressed
>> > the over-widen testcases a lot which at least have widened shifts
>> > by constants a lot.
>> >
>> > x86 has vector rotates only for AMD XOP (which is dead) plus
>> > some for V1TImode AFAICS, but I think we pattern-match rotates
>> > to shifts, so maybe the precision stuff is interesting for the
>> > case where we match the pattern rotate sequence for widenings?
>> >
>> > So for the types_compatible_p issue something along
>> > the following?  We could also exempt the shift operand from
>> > being covered by min_precision so the consumer would have
>> > to make sure it can be represented (I think that's never going
>> > to be an issue in practice until we get 256bit integers vectorized).
>> > It will have to fixup the shift operands anyway.
>> >
>> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index e4ab8c2d65b..cdeeaf98a47 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
>> > (stmt_vec_info stmt_info, gassign *stmt)
>> >           }
>> >         else if (TREE_CODE (op) == SSA_NAME)
>> >           {
>> > -           /* Ignore codes that don't take uniform arguments.  */
>> > -           if (!types_compatible_p (TREE_TYPE (op), type))
>> > +           /* Ignore codes that don't take uniform arguments.  For shifts
>> > +              the shift amount is known to be in-range.  */
>> 
>> I guess it's more "we can assume that the amount is in range"?
>
> Yes.
>
>> > +           if (code == LSHIFT_EXPR
>> > +               || code == RSHIFT_EXPR
>> > +               || code == LROTATE_EXPR
>> > +               || code == RROTATE_EXPR)
>> > +             {
>> > +               min_value = wi::min (min_value, 0, sign);
>> > +               max_value = wi::max (max_value, TYPE_PRECISION (type), 
>> > sign);
>> 
>> LGTM for shifts right.  Because of the above lshift thing, I think we
>> need something like:
>> 
>>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
>>     {
>>       wide_int op_min_value, op_max_value;
>>       if (!vect_get_range_info (op, &op_min_value, op_max_value))
>>         return;
>> 
>>       /* We can ignore left shifts by negative amounts, which are UB.  */
>>       min_value = wi::min (min_value, 0, sign);
>> 
>>       /* Make sure the highest non-UB shift amount doesn't become UB.  */
>>       op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
>>       auto mask = wi::mask (TYPE_PRECISION (type), false,
>> 			    op_max_value.to_uhwi ());
>>       max_value = wi::max (max_value, mask, sign);
>>     }
>> 
>> Does that look right?
>
> As said it looks overly conservative to me?  For example with my patch
> for
>
> void foo (signed char *v, int s)
> {
>   if (s < 1 || s > 7)
>     return;
>   for (int i = 0; i < 1024; ++i)
>     v[i] = v[i] << s;
> }
>
> I get
>
> t.c:5:21: note:   _7 has range [0xffffc000, 0x3f80]
> t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = 
> _6 << s_12(D);
> t.c:5:21: note:   only the low 15 bits of _6 are significant
> t.c:5:21: note:   _6 has range [0xffffff80, 0x7f]
> ...
> t.c:5:21: note:   vect_recog_over_widening_pattern: detected: _7 = _6 << 
> s_12(D);
> t.c:5:21: note:   demoting int to signed short
> t.c:5:21: note:   Splitting statement: _6 = (int) _5;
> t.c:5:21: note:   into pattern statements: patt_24 = (signed short) _5;
> t.c:5:21: note:   and: patt_23 = (int) patt_24;
> t.c:5:21: note:   created pattern stmt: patt_21 = patt_24 << patt_22;
> t.c:5:21: note:   over_widening pattern recognized: patt_18 = (int) 
> patt_21;
>
> and eventually
>
>   patt_22 = (signed short) s_12(D);
> ...
>   vect__5.7_28 = MEM <vector(16) signed char> [(signed char 
> *)vectp_v.5_10];
>   vect_patt_24.8_29 = [vec_unpack_lo_expr] vect__5.7_28;
>   vect_patt_24.8_30 = [vec_unpack_hi_expr] vect__5.7_28;
>   vect_patt_21.9_31 = vect_patt_24.8_29 << patt_22;
>   vect_patt_21.9_32 = vect_patt_24.8_30 << patt_22;
>   vect_patt_17.10_33 = VEC_PACK_TRUNC_EXPR <vect_patt_21.9_31, 
> vect_patt_21.9_32>;
>   MEM <vector(16) signed char> [(signed char *)vectp_v.11_34] = 
> vect_patt_17.10_33;
>
> which I think is correct.

Agreed.

> If I'd up the range of the shift
> operand then the range for _7 would be larger and we wouldn't
> be able to shrink the range.  I suppose the _from_users
> might back-propagate the truncation to signed char and enable
> that though?

Yeah, that's the mechanism I was getting confused with, but like
you say, it isn't a factor for vect_determine_precisions_from_range.

>> Thanks, and sorry for the scope creep.
>
> Well, if we can agree on that the patch as originally is
> correct (but incomplete) we can maybe still install it.

Yeah, works for me.

> As seen above I wonder if there's really an existing problem
> for LSHIFT_EXPR.

No, maybe not.  For:

void foo (int *v, int s)
{
  for (int i = 0; i < 1024; ++i)
    v[i] = 2 << ((v[i] & 3) * 8 + 7);
}

we get:

note:   _8 has range [0x0, 0x1010100]
note:   _7 has range [0x7, 0x1f]
note:   can narrow to unsigned:25 without loss of precision: _8 = 2 << _7;

which isn't true, but I don't think anything will come of it.

With a bit of work, we could in principle realise that:

void foo (short *v, int s)
{
  for (int i = 0; i < 1024; ++i)
    v[i] = 2 << (((v[i] & 3) - 1) & 31);
}

gives a shift range of [0,2] \union {31}, and so the shift result is
either 0, 2, 4 or 8 (range [0, 8]), but we don't compute that result
range at present.

So yeah, I still can't trigger an actual problem.

> I'm sure we mishandle rotates, but again at the consumer side - in
> priciple we could analyze them as lshift + rshift which means the
> existing behavior is probably also correct.

Yeah, please go ahead with the original patch.  And sorry for the
distraction.  It's been a useful exercise for me at least...

Richard

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
  2023-08-02 19:48           ` Richard Sandiford
@ 2023-08-03 12:53             ` Richard Biener
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Biener @ 2023-08-03 12:53 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener via Gcc-patches

On Wed, 2 Aug 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > [...]
> >> >> in vect_determine_precisions_from_range.  Maybe we should drop
> >> >> the shift handling from there and instead rely on
> >> >> vect_determine_precisions_from_users, extending:
> >> >> 
> >> >> 	if (TREE_CODE (shift) != INTEGER_CST
> >> >> 	    || !wi::ltu_p (wi::to_widest (shift), precision))
> >> >> 	  return;
> >> >> 
> >> >> to handle ranges where the max is known to be < precision.
> >> >> 
> >> >> There again, if masking is enough for right shifts and right rotates,
> >> >> maybe we should keep the current handling for then (with your fix)
> >> >> and skip the types_compatible_p check for those cases.
> >> >
> >> > I think it should be enough for left-shifts as well?  If we lshift
> >> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
> >> > op0 will still make us use HImode.  I think we only ever get overly
> >> > conservative answers for left-shifts from this function?
> >> 
> >> But if we have:
> >> 
> >>   short x, y;
> >>   int z = (int) x << (int) y;
> >> 
> >> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
> >> whereas x << y would invoke UB and x << (y & 15) would be 1.
> >
> > True, but we start with the range of the LHS which in this case
> > would be of type 'int' and thus 1 << 16 and not zero.  You
> > might call that a failure of vect_determine_precisions_from_range
> > of course, since it makes it not exactly a forward propagation ...
> 
> Ah, right, sorry.  I should have done more checking.
> 
> > [...]
> >> > Originally I completely disabled shift support but that regressed
> >> > the over-widen testcases a lot which at least have widened shifts
> >> > by constants a lot.
> >> >
> >> > x86 has vector rotates only for AMD XOP (which is dead) plus
> >> > some for V1TImode AFAICS, but I think we pattern-match rotates
> >> > to shifts, so maybe the precision stuff is interesting for the
> >> > case where we match the pattern rotate sequence for widenings?
> >> >
> >> > So for the types_compatible_p issue something along
> >> > the following?  We could also exempt the shift operand from
> >> > being covered by min_precision so the consumer would have
> >> > to make sure it can be represented (I think that's never going
> >> > to be an issue in practice until we get 256bit integers vectorized).
> >> > It will have to fixup the shift operands anyway.
> >> >
> >> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> >> > index e4ab8c2d65b..cdeeaf98a47 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
> >> > (stmt_vec_info stmt_info, gassign *stmt)
> >> >           }
> >> >         else if (TREE_CODE (op) == SSA_NAME)
> >> >           {
> >> > -           /* Ignore codes that don't take uniform arguments.  */
> >> > -           if (!types_compatible_p (TREE_TYPE (op), type))
> >> > +           /* Ignore codes that don't take uniform arguments.  For shifts
> >> > +              the shift amount is known to be in-range.  */
> >> 
> >> I guess it's more "we can assume that the amount is in range"?
> >
> > Yes.
> >
> >> > +           if (code == LSHIFT_EXPR
> >> > +               || code == RSHIFT_EXPR
> >> > +               || code == LROTATE_EXPR
> >> > +               || code == RROTATE_EXPR)
> >> > +             {
> >> > +               min_value = wi::min (min_value, 0, sign);
> >> > +               max_value = wi::max (max_value, TYPE_PRECISION (type), 
> >> > sign);
> >> 
> >> LGTM for shifts right.  Because of the above lshift thing, I think we
> >> need something like:
> >> 
> >>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
> >>     {
> >>       wide_int op_min_value, op_max_value;
> >>       if (!vect_get_range_info (op, &op_min_value, op_max_value))
> >>         return;
> >> 
> >>       /* We can ignore left shifts by negative amounts, which are UB.  */
> >>       min_value = wi::min (min_value, 0, sign);
> >> 
> >>       /* Make sure the highest non-UB shift amount doesn't become UB.  */
> >>       op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
> >>       auto mask = wi::mask (TYPE_PRECISION (type), false,
> >> 			    op_max_value.to_uhwi ());
> >>       max_value = wi::max (max_value, mask, sign);
> >>     }
> >> 
> >> Does that look right?
> >
> > As said it looks overly conservative to me?  For example with my patch
> > for
> >
> > void foo (signed char *v, int s)
> > {
> >   if (s < 1 || s > 7)
> >     return;
> >   for (int i = 0; i < 1024; ++i)
> >     v[i] = v[i] << s;
> > }
> >
> > I get
> >
> > t.c:5:21: note:   _7 has range [0xffffc000, 0x3f80]
> > t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = 
> > _6 << s_12(D);
> > t.c:5:21: note:   only the low 15 bits of _6 are significant
> > t.c:5:21: note:   _6 has range [0xffffff80, 0x7f]
> > ...
> > t.c:5:21: note:   vect_recog_over_widening_pattern: detected: _7 = _6 << 
> > s_12(D);
> > t.c:5:21: note:   demoting int to signed short
> > t.c:5:21: note:   Splitting statement: _6 = (int) _5;
> > t.c:5:21: note:   into pattern statements: patt_24 = (signed short) _5;
> > t.c:5:21: note:   and: patt_23 = (int) patt_24;
> > t.c:5:21: note:   created pattern stmt: patt_21 = patt_24 << patt_22;
> > t.c:5:21: note:   over_widening pattern recognized: patt_18 = (int) 
> > patt_21;
> >
> > and eventually
> >
> >   patt_22 = (signed short) s_12(D);
> > ...
> >   vect__5.7_28 = MEM <vector(16) signed char> [(signed char 
> > *)vectp_v.5_10];
> >   vect_patt_24.8_29 = [vec_unpack_lo_expr] vect__5.7_28;
> >   vect_patt_24.8_30 = [vec_unpack_hi_expr] vect__5.7_28;
> >   vect_patt_21.9_31 = vect_patt_24.8_29 << patt_22;
> >   vect_patt_21.9_32 = vect_patt_24.8_30 << patt_22;
> >   vect_patt_17.10_33 = VEC_PACK_TRUNC_EXPR <vect_patt_21.9_31, 
> > vect_patt_21.9_32>;
> >   MEM <vector(16) signed char> [(signed char *)vectp_v.11_34] = 
> > vect_patt_17.10_33;
> >
> > which I think is correct.
> 
> Agreed.
> 
> > If I'd up the range of the shift
> > operand then the range for _7 would be larger and we wouldn't
> > be able to shrink the range.  I suppose the _from_users
> > might back-propagate the truncation to signed char and enable
> > that though?
> 
> Yeah, that's the mechanism I was getting confused with, but like
> you say, it isn't a factor for vect_determine_precisions_from_range.
> 
> >> Thanks, and sorry for the scope creep.
> >
> > Well, if we can agree on that the patch as originally is
> > correct (but incomplete) we can maybe still install it.
> 
> Yeah, works for me.
> 
> > As seen above I wonder if there's really an existing problem
> > for LSHIFT_EXPR.
> 
> No, maybe not.  For:
> 
> void foo (int *v, int s)
> {
>   for (int i = 0; i < 1024; ++i)
>     v[i] = 2 << ((v[i] & 3) * 8 + 7);
> }
> 
> we get:
> 
> note:   _8 has range [0x0, 0x1010100]
> note:   _7 has range [0x7, 0x1f]
> note:   can narrow to unsigned:25 without loss of precision: _8 = 2 << _7;
> 
> which isn't true, but I don't think anything will come of it.
> 
> With a bit of work, we could in principle realise that:
> 
> void foo (short *v, int s)
> {
>   for (int i = 0; i < 1024; ++i)
>     v[i] = 2 << (((v[i] & 3) - 1) & 31);
> }
> 
> gives a shift range of [0,2] \union {31}, and so the shift result is
> either 0, 2, 4 or 8 (range [0, 8]), but we don't compute that result
> range at present.
> 
> So yeah, I still can't trigger an actual problem.
> 
> > I'm sure we mishandle rotates, but again at the consumer side - in
> > priciple we could analyze them as lshift + rshift which means the
> > existing behavior is probably also correct.
> 
> Yeah, please go ahead with the original patch.  And sorry for the
> distraction.  It's been a useful exercise for me at least...

Installed.  Indeed it was useful for me as well.

Richard.

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <20230731140203.F3D0B3858412@sourceware.org>]

* Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
       [not found] <20230731140203.F3D0B3858412@sourceware.org>
@ 2023-07-31 15:28 ` Jeff Law
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff Law @ 2023-07-31 15:28 UTC (permalink / raw)
  To: Richard Biener, gcc-patches; +Cc: richard.sandiford



On 7/31/23 08:01, Richard Biener via Gcc-patches wrote:
> The following makes sure to limit the shift operand when vectorizing
> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> operand otherwise invokes undefined behavior.  When we determine
> whether we can demote the operand we know we at most shift in the
> sign bit so we can adjust the shift amount.
> 
> Note this has the possibility of un-CSEing common shift operands
> as there's no good way to share pattern stmts between patterns.
> We'd have to separately pattern recognize the definition.
> 
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> 
> Not sure about LSHIFT_EXPR, it probably has the same issue but
> the fallback optimistic zero for out-of-range shifts is at least
> "corrrect".  Not sure we ever try to demote rotates (probably not).
> 
> OK?
> 
> Thanks,
> Richard.
> 
> 	PR tree-optimization/110838
> 	* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
> 	Adjust the shift operand of RSHIFT_EXPRs.
> 
> 	* gcc.dg/torture/pr110838.c: New testcase.
I'm not a fan of the asymmetric handling across RSHIFT/LSHIFT.  But if 
you think the asymmetry isn't a problem in practice, then I won't object.

Jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] tree-optimization/110838 - vectorization of widened shifts
@ 2023-07-31 14:01 Richard Biener
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Biener @ 2023-07-31 14:01 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

The following makes sure to limit the shift operand when vectorizing
(short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
operand otherwise invokes undefined behavior.  When we determine
whether we can demote the operand we know we at most shift in the
sign bit so we can adjust the shift amount.

Note this has the possibility of un-CSEing common shift operands
as there's no good way to share pattern stmts between patterns.
We'd have to separately pattern recognize the definition.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Not sure about LSHIFT_EXPR, it probably has the same issue but
the fallback optimistic zero for out-of-range shifts is at least
"corrrect".  Not sure we ever try to demote rotates (probably not).

OK?

Thanks,
Richard.

	PR tree-optimization/110838
	* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
	Adjust the shift operand of RSHIFT_EXPRs.

	* gcc.dg/torture/pr110838.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/pr110838.c | 43 +++++++++++++++++++++++++
 gcc/tree-vect-patterns.cc               | 24 ++++++++++++++
 2 files changed, 67 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr110838.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr110838.c b/gcc/testsuite/gcc.dg/torture/pr110838.c
new file mode 100644
index 00000000000..f039bd6c8ea
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr110838.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+
+typedef __UINT32_TYPE__ uint32_t;
+typedef __UINT8_TYPE__ uint8_t;
+typedef __INT8_TYPE__ int8_t;
+typedef uint8_t pixel;
+
+/* get the sign of input variable (TODO: this is a dup, make common) */
+static inline int8_t signOf(int x)
+{
+  return (x >> 31) | ((int)((((uint32_t)-x)) >> 31));
+}
+
+__attribute__((noipa))
+static void calSign_bug(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
+{
+  for (int x = 0; x < endX; x++)
+    dst[x] = signOf(src1[x] - src2[x]);
+}
+
+__attribute__((noipa, optimize(0)))
+static void calSign_ok(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
+{
+  for (int x = 0; x < endX; x++)
+    dst[x] = signOf(src1[x] - src2[x]);
+}
+
+__attribute__((noipa, optimize(0)))
+int main()
+{
+  const pixel s1[9] = { 0xcd, 0x33, 0xd4, 0x3e, 0xb0, 0xfb, 0x95, 0x64, 0x70, };
+  const pixel s2[9] = { 0xba, 0x9f, 0xab, 0xa1, 0x3b, 0x29, 0xb1, 0xbd, 0x64, };
+  int endX = 9;
+  int8_t dst[9];
+  int8_t dst_ok[9];
+
+  calSign_bug(dst, s1, s2, endX);
+  calSign_ok(dst_ok, s1, s2, endX);
+
+  if (__builtin_memcmp(dst, dst_ok, endX) != 0)
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index ef806e2346e..e4ab8c2d65b 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3099,9 +3099,33 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
   tree ops[3] = {};
   for (unsigned int i = 1; i < first_op; ++i)
     ops[i - 1] = gimple_op (last_stmt, i);
+  /* For right shifts limit the shift operand.  */
   vect_convert_inputs (vinfo, last_stmt_info, nops, &ops[first_op - 1],
 		       op_type, &unprom[0], op_vectype);
 
+  /* Limit shift operands.  */
+  if (code == RSHIFT_EXPR)
+    {
+      wide_int min_value, max_value;
+      if (TREE_CODE (ops[1]) == INTEGER_CST)
+	ops[1] = wide_int_to_tree (op_type,
+				   wi::bit_and (wi::to_wide (ops[1]),
+						new_precision - 1));
+      else if (!vect_get_range_info (ops[1], &min_value, &max_value)
+	       || wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
+	{
+	  /* ???  Note the following bad for SLP as that only supports
+	     same argument widened shifts and it un-CSEs same arguments.  */
+	  tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
+	  gimple *pattern_stmt
+	    = gimple_build_assign (new_var, BIT_AND_EXPR, ops[1],
+				   build_int_cst (op_type, new_precision - 1));
+	  ops[1] = new_var;
+	  gimple_set_location (pattern_stmt, gimple_location (last_stmt));
+	  append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
+	}
+    }
+
   /* Use the operation to produce a result of type OP_TYPE.  */
   tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
   gimple *pattern_stmt = gimple_build_assign (new_var, code,
-- 
2.35.3

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-08-03 12:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20230731140207.C8FC63858C39@sourceware.org>
2023-08-01  8:47 ` [PATCH] tree-optimization/110838 - vectorization of widened shifts Richard Sandiford
2023-08-01  9:38   ` Richard Sandiford
2023-08-02 12:31     ` Richard Biener
2023-08-02 13:47       ` Richard Sandiford
2023-08-02 17:39         ` Richard Biener
2023-08-02 19:48           ` Richard Sandiford
2023-08-03 12:53             ` Richard Biener
     [not found] <20230731140203.F3D0B3858412@sourceware.org>
2023-07-31 15:28 ` Jeff Law
2023-07-31 14:01 Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).