public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled
@ 2023-10-11 12:38 tnfchris at gcc dot gnu.org
  2023-10-11 12:58 ` [Bug tree-optimization/111770] " rguenth at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2023-10-11 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

            Bug ID: 111770
           Summary: predicated loads inactive lane values not modelled
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

For this example:

int foo(int n, char *a, char *b) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}

we generate with -O3 -march=armv8-a+sve

.L3:
        ld1b    z29.b, p7/z, [x1, x3]
        ld1b    z31.b, p7/z, [x2, x3]
        add     x3, x3, x4
        sel     z31.b, p7, z31.b, z28.b
        whilelo p7.b, w3, w0
        udot    z30.s, z29.b, z31.b
        b.any   .L3
        uaddv   d30, p6, z30.s
        fmov    w0, s30
        ret

Which is pretty good, but we completely ruin it with the SEL.

In gimple this is:

  vect__7.12_81 = .MASK_LOAD (_21, 8B, loop_mask_77);
  masked_op1_82 = .VCOND_MASK (loop_mask_77, vect__7.12_81, { 0, ... });
  vect_patt_33.13_83 = DOT_PROD_EXPR <vect__3.9_78, masked_op1_82,
vect_sum_19.6_74>;

The missed optimization here is that we don't model what happens with
predicated operations that zero inactive lanes.

i.e. in this case .MASK_LOAD will zero the unactive lanes, so the .VCOND_MASK
is  completely superfluous.

I'm not entirely sure how we should go about fixing this generally.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/111770] predicated loads inactive lane values not modelled
  2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
@ 2023-10-11 12:58 ` rguenth at gcc dot gnu.org
  2024-02-21 17:14 ` acoplan at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-10-11 12:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-10-11
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
"Generally" we need an "else" value for .MASK_LOAD and there model "don't
care".
Elsewhere we observed that most uarchs do zero masked elements (or at least
offer so) and thus .MASK_LOAD without else value might be interpreted as
doing that (until I came along in the other related PR sugggesting an omitted
'else' value means 'don't care' - but IIRC the RISC-V folks ended up
implementing that with default-defs).

Btw, we should possibly vectorize this with a COND_DOT_PROD, since adding
zeros isn't correct for HONOR_SIGNED_ZEROS/HONOR_SIGN_DEPENDENT_ROUNDING.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/111770] predicated loads inactive lane values not modelled
  2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
  2023-10-11 12:58 ` [Bug tree-optimization/111770] " rguenth at gcc dot gnu.org
@ 2024-02-21 17:14 ` acoplan at gcc dot gnu.org
  2024-02-22  8:10 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: acoplan at gcc dot gnu.org @ 2024-02-21 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #2 from Alex Coplan <acoplan at gcc dot gnu.org> ---
I think to progress this and related cases we need to have .MASK_LOAD defined
to zero in the case that the predicate is false (either unconditionally for all
targets if possible or otherwise conditionally for targets where that is safe).

Here is a related case:

int bar(int n, char *a, char *b, char *c) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    if (c[i] == 0)
      sum += a[i] * b[i];
  return sum;
}

in this case we get the missed optimization even before vectorization during
ifcvt (in some ways it is a simpler case to consider as only scalars are
involved).  Here with -O3 -march=armv9-a from ifcvt we get:

  <bb 3> [local count: 955630224]:
  # sum_23 = PHI <_ifc__41(8), 0(18)>
  # i_25 = PHI <i_20(8), 0(18)>
  _1 = (sizetype) i_25;
  _2 = c_16(D) + _1;
  _3 = *_2;
  _29 = _3 == 0;
  _43 = _42 + _1;
  _4 = (char *) _43;
  _5 = .MASK_LOAD (_4, 8B, _29);
  _6 = (int) _5;
  _45 = _44 + _1;
  _7 = (char *) _45;
  _8 = .MASK_LOAD (_7, 8B, _29);
  _9 = (int) _8;
  _46 = (unsigned int) _6;
  _47 = (unsigned int) _9;
  _48 = _46 * _47;
  _10 = (int) _48;
  _ifc__41 = .COND_ADD (_29, sum_23, _10, sum_23);

for this case it should be possible to use an unpredicated add instead of a
.COND_ADD.  We essentially need to show that this transformation is valid:

  _29 ? sum_23 + _10 : sum_23 --> sum_23 + _10

and this essentially boils down to showing that:

  _29 = false => _10 = 0

now I'm not sure if there's a way of match-and-simplifying some GIMPLE
expression under the assumption that a given SSA name takes a particular value;
but if there were, and we defined .MASK_LOAD to zero given a false predicate,
then we could evaluate _10 under the assumption that _29 = false, which if we
added some simple match.pd rule for .MASK_LOAD with a false predicate would
allow it to evaluate to zero, and thus we could establish _10 = 0 proving the
transformation is correct.  If such an approach is possible then I guess ifcvt
could use it to avoid conditionalizing statements unnecessarily.

Richi: any thoughts on the above or on how we should handle this sort of thing?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/111770] predicated loads inactive lane values not modelled
  2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
  2023-10-11 12:58 ` [Bug tree-optimization/111770] " rguenth at gcc dot gnu.org
  2024-02-21 17:14 ` acoplan at gcc dot gnu.org
@ 2024-02-22  8:10 ` rguenth at gcc dot gnu.org
  2024-02-22 11:10 ` acoplan at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-22  8:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
As said X + 0. -> X is an invalid transform with FP unless there are no signed
zeros (maybe also problematic with sign-dependent rounding).

I think we agree to define .MASK_LOAD to zero masked elements.  When we need
something else we need to add an explicit ELSE value.  That needs documenting
of course and also possibly testsuite coverage - I _think_ you should be able
to do a GIMPLE frontend testcase for this.

Note this behavior would extend to .MASK_GATHER_LOAD as well as
the load-lanes and -len variants.

Unfortunately we do not have _any_ internals manual documentation for
internal functions - but you can backtrack to the optabs documentation
where this would need documenting.

Now, if-conversion could indeed elide the .COND_ADD for integers.  It's
problematic there only because of signed overflow undefinedness, so
you shouldn't see it for 'unsigned' already, and adding zero is safe.
if-conversion would need to have an idea of all the ranges involved here
so it might be a bit sophisticated to get it right.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/111770] predicated loads inactive lane values not modelled
  2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-02-22  8:10 ` rguenth at gcc dot gnu.org
@ 2024-02-22 11:10 ` acoplan at gcc dot gnu.org
  2024-02-22 13:16 ` rguenth at gcc dot gnu.org
  2024-02-22 17:11 ` jsm28 at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: acoplan at gcc dot gnu.org @ 2024-02-22 11:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #4 from Alex Coplan <acoplan at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> As said X + 0. -> X is an invalid transform with FP unless there are no
> signed zeros (maybe also problematic with sign-dependent rounding).

Yeah, I was thinking about the integer case above.

> 
> I think we agree to define .MASK_LOAD to zero masked elements.  When we need
> something else we need to add an explicit ELSE value.  That needs documenting
> of course and also possibly testsuite coverage - I _think_ you should be able
> to do a GIMPLE frontend testcase for this.

Sounds good, thanks.

> 
> Note this behavior would extend to .MASK_GATHER_LOAD as well as
> the load-lanes and -len variants.
> 
> Unfortunately we do not have _any_ internals manual documentation for
> internal functions - but you can backtrack to the optabs documentation
> where this would need documenting.
> 
> Now, if-conversion could indeed elide the .COND_ADD for integers.  It's
> problematic there only because of signed overflow undefinedness, so
> you shouldn't see it for 'unsigned' already, and adding zero is safe.

Can you elaborate on this a bit? Do you mean to say that the .COND_ADD is only
there to avoid if-conversion introducing UB due to signed overflow? ISTM it's
needed for correctness even without that, as the addend needn't be guaranteed
to be zero in the general case.

> if-conversion would need to have an idea of all the ranges involved here
> so it might be a bit sophisticated to get it right.

Does what I suggested above make any sense, or do you have in mind a different
way of handling this in if-conversion? I'm wondering how ifcvt should determine
that the addend is zero in the case where the predicate is false.

Thanks

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/111770] predicated loads inactive lane values not modelled
  2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-02-22 11:10 ` acoplan at gcc dot gnu.org
@ 2024-02-22 13:16 ` rguenth at gcc dot gnu.org
  2024-02-22 17:11 ` jsm28 at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-22 13:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Alex Coplan from comment #4)
> (In reply to Richard Biener from comment #3)
> > Now, if-conversion could indeed elide the .COND_ADD for integers.  It's
> > problematic there only because of signed overflow undefinedness, so
> > you shouldn't see it for 'unsigned' already, and adding zero is safe.
> 
> Can you elaborate on this a bit? Do you mean to say that the .COND_ADD is
> only there to avoid if-conversion introducing UB due to signed overflow?

No, you are right.

> ISTM it's needed for correctness even without that, as the addend needn't be
> guaranteed to be zero in the general case.
> 
> > if-conversion would need to have an idea of all the ranges involved here
> > so it might be a bit sophisticated to get it right.
> 
> Does what I suggested above make any sense, or do you have in mind a
> different way of handling this in if-conversion? I'm wondering how ifcvt
> should determine that the addend is zero in the case where the predicate is
> false.

ifcvt would need to compute said fact, say, keep a lattice of the value
(or value-range) that's there when the block isn't executed (simulating
a disabled vector lane).  A load that's going to be replaced by a
.MASK_LOAD can be then known zero and this needs to be propagated through
regular stmts (like the multiply).  There's also .COND_* which if-conversion
could actually provide the else value for - like if we have a following
division to avoid dividing by zero.  But that would be propagating backwards
(I'd still have this usecase in mind).

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/111770] predicated loads inactive lane values not modelled
  2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-02-22 13:16 ` rguenth at gcc dot gnu.org
@ 2024-02-22 17:11 ` jsm28 at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: jsm28 at gcc dot gnu.org @ 2024-02-22 17:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111770

--- Comment #6 from Joseph S. Myers <jsm28 at gcc dot gnu.org> ---
X + 0. -> X is invalid for FP with signed zero or signaling NaN, and also gets
the wrong quantum exponent for decimal FP unless the zero has the maximum
possible quantum exponent (which is not what you get from all bits of the
representation zero, which is a zero with the minimum possible quantum
exponent, or from converting integer 0 to DFP, which has quantum exponent 0).

(If you add -0. (with maximum quantum exponent, in the DFP case) instead of
+0., that does produce X for X not a signaling NaN - except in FE_DOWNWARD
mode. Whereas adding +0. is only correct in FE_DOWNWARD mode, since 0. - 0. has
positive sign in all other modes.)

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-02-22 17:11 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-11 12:38 [Bug tree-optimization/111770] New: predicated loads inactive lane values not modelled tnfchris at gcc dot gnu.org
2023-10-11 12:58 ` [Bug tree-optimization/111770] " rguenth at gcc dot gnu.org
2024-02-21 17:14 ` acoplan at gcc dot gnu.org
2024-02-22  8:10 ` rguenth at gcc dot gnu.org
2024-02-22 11:10 ` acoplan at gcc dot gnu.org
2024-02-22 13:16 ` rguenth at gcc dot gnu.org
2024-02-22 17:11 ` jsm28 at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).