From: Richard Sandiford
To: Richard Biener
Cc: Richard Biener via Gcc-patches
Subject: Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts
References: <20230731140207.C8FC63858C39@sourceware.org>
Date: Wed, 02 Aug 2023 14:47:48 +0100
In-Reply-To: (Richard Biener's message of "Wed, 2 Aug 2023 12:31:46 +0000 (UTC)")

Richard Biener writes:
> On Tue, 1 Aug 2023, Richard Sandiford wrote:
>
>> Richard Sandiford writes:
>> > Richard Biener via Gcc-patches writes:
>> >> The following makes sure to limit the shift operand when vectorizing
>> >> (short)((int)x >> 31) via (short)x >> 31, as the out-of-bounds shift
>> >> operand otherwise invokes undefined behavior.  When we determine
>> >> whether we can demote the operand, we know we at most shift in the
>> >> sign bit, so we can adjust the shift amount.
>> >>
>> >> Note this has the possibility of un-CSEing common shift operands,
>> >> as there's no good way to share pattern stmts between patterns.
>> >> We'd have to separately pattern-recognize the definition.
>> >>
>> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> >>
>> >> Not sure about LSHIFT_EXPR; it probably has the same issue, but
>> >> the fallback optimistic zero for out-of-range shifts is at least
>> >> "correct".  Not sure we ever try to demote rotates (probably not).
>> >
>> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
>> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
>
> I meant "correct" for the constant folding that evaluates out-of-bound
> shifts as zero.
>
>> Sorry for the multiple messages.  Wanted to get something out quickly
>> because I wasn't sure how long it would take me to write this...
>>
>> On rotates, for:
>>
>> void
>> foo (unsigned short *restrict ptr)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     {
>>       unsigned int x = ptr[i] & 0xff0;
>>       ptr[i] = (x << 1) | (x >> 31);
>>     }
>> }
>>
>> we do get:
>>
>> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
>>
>> although aarch64 doesn't provide rrotate patterns, so nothing actually
>> comes of it.
>
> I think it's still correct that we only need unsigned:13 for the input,
> we know other bits are zero.
> But of course when actually applying
> this as documented
>
>   /* Record that STMT_INFO could be changed from operating on TYPE to
>      operating on a type with the precision and sign given by PRECISION
>      and SIGN respectively.
>
> the operation itself has to be altered (the above doesn't suggest
> promoting/demoting the operands to TYPE is the only thing to do).
>
> So it seems the burden is on the consumers of the information?

Yeah, textually that seems fair.  Not sure I was thinking of it in
those terms at the time though. :)

>> I think the handling of variable shifts is flawed for other reasons.
>> Given:
>>
>> void
>> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>>
>> void
>> us (unsigned short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>>
>> void
>> su (short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>>
>> void
>> ss (short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>>     ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>>
>> we only narrow uu and ss, due to:
>>
>>   /* Ignore codes that don't take uniform arguments.  */
>>   if (!types_compatible_p (TREE_TYPE (op), type))
>>     return;
>
> I suppose that's because we care about the shift operand at all here.
> We could possibly use [0 .. precision-1] as the known range for it
> and only if that doesn't fit 'type' give up (and otherwise simply
> ignore the input range of the shift operands here).
>
>> in vect_determine_precisions_from_range.  Maybe we should drop
>> the shift handling from there and instead rely on
>> vect_determine_precisions_from_users, extending:
>>
>>   if (TREE_CODE (shift) != INTEGER_CST
>>       || !wi::ltu_p (wi::to_widest (shift), precision))
>>     return;
>>
>> to handle ranges where the max is known to be < precision.
>>
>> There again, if masking is enough for right shifts and right rotates,
>> maybe we should keep the current handling for them (with your fix)
>> and skip the types_compatible_p check for those cases.
>
> I think it should be enough for left-shifts as well?  If we lshift
> out like 0x100 << 9 so the lhs range is [0,0], the input range from
> op0 will still make us use HImode.  I think we only ever get overly
> conservative answers for left-shifts from this function?

But if we have:

  short x, y;
  int z = (int) x << (int) y;

and at runtime x == 1, y == 16, then (short) z should be 0 (no UB),
whereas x << y would invoke UB and x << (y & 15) would be 1.
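To put that case in a runnable form, here is a standalone sketch
(illustrative only, not from the patch or the testsuite); the masking of
the count is just what a narrowed 16-bit shift would effectively rely on:

/* Sketch: for x == 1, y == 16 the reference value of
   (short)((int) x << (int) y) is 0, but a 16-bit shift with the count
   masked into range yields 1, so masking the count is not a correct
   fallback when narrowing left shifts.  */
#include <stdio.h>

int
main (void)
{
  int x = 1, y = 16;
  int reference = (x << y) & 0xffff;                    /* 32-bit shift, truncated: 0 */
  int narrowed = ((x & 0xffff) << (y & 15)) & 0xffff;   /* 16-bit shift, masked count: 1 */
  printf ("reference = %d, narrowed = %d\n", reference, narrowed);
  return 0;
}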
> Whatever works for RROTATE should also work for LROTATE.

I think the same problem affects LROTATE.

>> So:
>>
>> - restrict shift handling in vect_determine_precisions_from_range to
>>   RSHIFT_EXPR and RROTATE_EXPR
>>
>> - remove the types_compatible_p restriction for those cases
>>
>> - extend vect_determine_precisions_from_users shift handling to check
>>   for ranges on the shift amount
>>
>> Does that sound right?
>
> I'm not sure.  This all felt somewhat fragile when looking closer
> (I was hoping you would eventually tackle it from the older
> referenced bug) ... so my main idea was to perform incremental changes
> where I have test coverage (as with the new wrong-code bug).

Fair :)

> Originally I completely disabled shift support, but that regressed
> the over-widen testcases a lot, which at least have widened shifts
> by constants a lot.
>
> x86 has vector rotates only for AMD XOP (which is dead) plus
> some for V1TImode AFAICS, but I think we pattern-match rotates
> to shifts, so maybe the precision stuff is interesting for the
> case where we match the pattern rotate sequence for widenings?
>
> So for the types_compatible_p issue, something along
> the following?  We could also exempt the shift operand from
> being covered by min_precision, so the consumer would have
> to make sure it can be represented (I think that's never going
> to be an issue in practice until we get 256-bit integers vectorized).
> It will have to fix up the shift operands anyway.
>
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index e4ab8c2d65b..cdeeaf98a47 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range (stmt_vec_info stmt_info, gassign *stmt)
>        }
>      else if (TREE_CODE (op) == SSA_NAME)
>        {
> -        /* Ignore codes that don't take uniform arguments.  */
> -        if (!types_compatible_p (TREE_TYPE (op), type))
> +        /* Ignore codes that don't take uniform arguments.  For shifts
> +           the shift amount is known to be in-range.  */

I guess it's more "we can assume that the amount is in range"?

> +        if (code == LSHIFT_EXPR
> +            || code == RSHIFT_EXPR
> +            || code == LROTATE_EXPR
> +            || code == RROTATE_EXPR)
> +          {
> +            min_value = wi::min (min_value, 0, sign);
> +            max_value = wi::max (max_value, TYPE_PRECISION (type), sign);

LGTM for shifts right.  Because of the above lshift thing, I think
we need something like:

  if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
    {
      wide_int op_min_value, op_max_value;
      if (!vect_get_range_info (op, &op_min_value, &op_max_value))
        return;

      /* We can ignore left shifts by negative amounts, which are UB.  */
      min_value = wi::min (min_value, 0, sign);

      /* Make sure the highest non-UB shift amount doesn't become UB.  */
      op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
      auto mask = wi::mask (TYPE_PRECISION (type), false,
                            op_max_value.to_uhwi ());
      max_value = wi::max (max_value, mask, sign);
    }

Does that look right?

Thanks, and sorry for the scope creep.

Richard

> +          }
> +        else if (types_compatible_p (TREE_TYPE (op), type))
> +          {
> +            wide_int op_min_value, op_max_value;
> +            if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> +              return;
> +            min_value = wi::min (min_value, op_min_value, sign);
> +            max_value = wi::max (max_value, op_max_value, sign);
> +          }
> +        else
>            return;
> -
> -        wide_int op_min_value, op_max_value;
> -        if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> -          return;
> -
> -        min_value = wi::min (min_value, op_min_value, sign);
> -        max_value = wi::max (max_value, op_max_value, sign);
>        }
>      else
>        return;
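For reference, here is a standalone sanity check (illustrative only, not
part of the patch or the testsuite) of the property the original patch
relies on for the right-shift case: once the operand is demoted to 16
bits, clamping the shift count from 31 to 15 still shifts in only the
sign bit, so the truncated result is unchanged for every 16-bit input.

/* Exhaustive check: (short)((int) x >> 31) equals a 16-bit arithmetic
   shift of x by the clamped count 15 for all 16-bit x.  Relies on GCC
   defining >> on negative values as an arithmetic shift.  */
#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (int32_t i = -32768; i <= 32767; ++i)
    {
      int16_t x = (int16_t) i;
      int16_t wide = (int16_t) ((int32_t) x >> 31);  /* original computation */
      int16_t narrow = (int16_t) (x >> 15);          /* demoted, clamped count */
      assert (wide == narrow);
    }
  return 0;
}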