public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114440] New: Fail to recognize a chain of lane-reduced operations for loop reduction vect
@ 2024-03-23 10:45 fxue at os dot amperecomputing.com
  2024-03-25  9:04 ` [Bug tree-optimization/114440] " rguenth at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: fxue at os dot amperecomputing.com @ 2024-03-23 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440

            Bug ID: 114440
           Summary: Fail to recognize a chain of lane-reduced operations
                    for loop reduction vect
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

When a loop reduction path contains a lane-reduced operation
(DOT_PROD/SAD/WIDEN_SUM), the current vectorizer cannot handle the pattern if
the path also contains other operations, whether normal or lane-reduced
ones. A pseudo-code example:

   char *d0, *d1;
   char *s0, *s1;
   char *w;
   int *n;

   ...
   int sum = 0;

   for (i) {
     ...
     sum += d0[i] * d1[i];       /* DOT_PROD */
     ...
     sum += abs(s0[i] - s1[i]);  /* SAD */
     ...
     sum += w[i];                /* WIDEN_SUM */
     ...
     sum += n[i];                /* Normal */
     ...
   }

   ... = sum;
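
A compilable version of the pseudo example might look like the sketch below.
The function name `mixed_reduction`, the explicit `signed char` and
`unsigned char` element types, the `restrict` qualifiers and the length
parameter `len` are assumptions made for illustration; none of them come from
the report.

```c
#include <stdlib.h>

/* Hypothetical testcase: one reduction variable fed by three lane-reduced
   candidates (DOT_PROD, SAD, WIDEN_SUM) and one normal int addition.  */
int
mixed_reduction (const signed char *restrict d0,
                 const signed char *restrict d1,
                 const unsigned char *restrict s0,
                 const unsigned char *restrict s1,
                 const signed char *restrict w,
                 const int *restrict n, int len)
{
  int sum = 0;

  for (int i = 0; i < len; i++)
    {
      sum += d0[i] * d1[i];        /* DOT_PROD candidate */
      sum += abs (s0[i] - s1[i]);  /* SAD candidate */
      sum += w[i];                 /* WIDEN_SUM candidate */
      sum += n[i];                 /* normal addition */
    }

  return sum;
}
```

Building this with something like `-O3 -fopt-info-vec-missed` on a target that
provides the relevant optabs should show whether the loop gets vectorized; the
exact diagnostics are not reproduced here.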

In this case the reduction vectype varies from one operation to the next, which
causes a mismatch between the number of vectorized defs and uses. One possible
fix is to generate extra trivial pass-through copies. Take a concrete example:

   sum = 0; 
   for (i) {
     sum += d0[i] * d1[i];       /* 16*char -> 4*int */
     sum += n[i];                /*   4*int -> 4*int */
   }

The final vectorized statements could be:

   sum_v0 = { 0, 0, 0, 0 };
   sum_v1 = { 0, 0, 0, 0 };
   sum_v2 = { 0, 0, 0, 0 };
   sum_v3 = { 0, 0, 0, 0 };

   for (i / 16) {
     sum_v0 += DOT_PROD (v_d0[i: 0 .. 15], v_d1[i: 0 .. 15]);
     sum_v1 += 0;  // copy
     sum_v2 += 0;  // copy
     sum_v3 += 0;  // copy

     sum_v0 += v_n[i:  0 .. 3];
     sum_v1 += v_n[i:  4 .. 7];
     sum_v2 += v_n[i:  8 .. 11];
     sum_v3 += v_n[i: 12 .. 15]; 
   }

   sum = REDUC_PLUS(sum_v0 + sum_v1 + sum_v2 + sum_v3);
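
The sequence above can be checked with a scalar model. The sketch below is not
vectorizer output; it emulates each 4*int vector with a plain array. The names
`dot_prod_model`, `reduce_scalar` and `reduce_modelled` are hypothetical, and
the lane assignment inside `dot_prod_model` (lane j receives elements 4*j to
4*j+3) is an assumption, since the report does not say which lanes are
combined.

```c
#include <stddef.h>

enum { LANES = 4, GROUP = 16 };

/* Scalar stand-in for DOT_PROD: 16 char products accumulate into 4 ints.
   Assumed lane mapping: lane j receives elements 4*j .. 4*j+3.  */
static void
dot_prod_model (int acc[LANES], const signed char *a, const signed char *b)
{
  for (int k = 0; k < GROUP; k++)
    acc[k / 4] += a[k] * b[k];
}

/* Reference scalar loop from the concrete example.  */
int
reduce_scalar (const signed char *d0, const signed char *d1,
               const int *n, size_t len)
{
  int sum = 0;
  for (size_t i = 0; i < len; i++)
    {
      sum += d0[i] * d1[i];
      sum += n[i];
    }
  return sum;
}

/* Model of the proposed form.  len must be a multiple of GROUP.  */
int
reduce_modelled (const signed char *d0, const signed char *d1,
                 const int *n, size_t len)
{
  int sum_v[4][LANES] = { { 0 } };

  for (size_t i = 0; i < len; i += GROUP)
    {
      /* One DOT_PROD feeds sum_v[0]; sum_v[1..3] are left unchanged,
         which is what the pass-through copies express.  */
      dot_prod_model (sum_v[0], d0 + i, d1 + i);

      /* Four int vectors cover n[i .. i+15].  */
      for (int v = 0; v < 4; v++)
        for (int l = 0; l < LANES; l++)
          sum_v[v][l] += n[i + v * LANES + l];
    }

  /* Epilogue: add the four accumulators, then reduce across lanes.  */
  int sum = 0;
  for (int v = 0; v < 4; v++)
    for (int l = 0; l < LANES; l++)
      sum += sum_v[v][l];
  return sum;
}
```

For inputs that do not overflow `int`, both functions return the same value,
which is the property the pass-through copies are meant to preserve.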

In the above sequence, each summation statement forms exactly one pattern.
However, it is easy to construct a slightly more complicated variant that runs
into a similar situation: a chain of lane-reduced operations that comes from
the non-reduction addend of a single summation statement, like:

   sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

This probably requires extending the vector pattern formation stage so that it
can split such patterns.
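
At the source level, the intended effect of such a split can be sketched as
below. This only illustrates the equivalence between the compound statement
and its hand-split form; it does not describe how pattern formation would
implement it. The function names `sum_compound` and `sum_split`, the element
types and the `len` parameter are assumptions.

```c
#include <stdlib.h>

/* Compound form: the lane-reduced candidates sit inside one addend.  */
int
sum_compound (const signed char *d0, const signed char *d1,
              const unsigned char *s0, const unsigned char *s1,
              const int *n, int len)
{
  int sum = 0;
  for (int i = 0; i < len; i++)
    sum += d0[i] * d1[i] + abs (s0[i] - s1[i]) + n[i];
  return sum;
}

/* Hand-split form: one summation statement per candidate pattern.  */
int
sum_split (const signed char *d0, const signed char *d1,
           const unsigned char *s0, const unsigned char *s1,
           const int *n, int len)
{
  int sum = 0;
  for (int i = 0; i < len; i++)
    {
      sum += d0[i] * d1[i];        /* DOT_PROD candidate */
      sum += abs (s0[i] - s1[i]);  /* SAD candidate */
      sum += n[i];                 /* normal addition */
    }
  return sum;
}
```

The two functions agree as long as no intermediate sum overflows `int`; the
split form is a re-association of the compound one.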


* [Bug tree-optimization/114440] Fail to recognize a chain of lane-reduced operations for loop reduction vect
  2024-03-23 10:45 [Bug tree-optimization/114440] New: Fail to recognize a chain of lane-reduced operations for loop reduction vect fxue at os dot amperecomputing.com
@ 2024-03-25  9:04 ` rguenth at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-25  9:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
   Last reconfirmed|                            |2024-03-25
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
It looks like this should be possible, but of course this performs more "weird"
re-association which might be a problem with floating-point (though I don't
know of any lane-reducing ops implemented for FP).
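
As a small illustration of why re-association matters for floating-point, the
sketch below adds the same three `float` values in two different orders. The
function names are hypothetical, and the `volatile` temporaries are there only
to force each partial sum to be rounded to `float`.

```c
/* Left-to-right order: (a + b) + c.  */
float
sum_left (float a, float b, float c)
{
  volatile float t = a + b;
  return t + c;
}

/* Re-associated order: a + (b + c).  */
float
sum_right (float a, float b, float c)
{
  volatile float t = b + c;
  return a + t;
}
```

With `a = 1e20f`, `b = -1e20f` and `c = 1.0f`, `sum_left` yields `1.0f` while
`sum_right` yields `0.0f`, because `c` is absorbed when it is added to `b`
first.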

Representing the "scalar" side in the vectorizer IL is tricky though; this
is why the current handling is separated and not integrated with the rest
(so it doesn't compose, as you noticed).

Note we lack SLP recognition for dot_prod and sad also because we do not
specify which lanes are combined, so the optabs are a black-box and only
useful when the resulting lanes are reduced.
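
The black-box point can be illustrated with two scalar models of a dot product
on 16 chars that accumulate into 4 ints. Both lane mappings below are
hypothetical and are not claimed to match any particular target; the names
`dot_prod_blocked`, `dot_prod_strided` and `reduc_plus` are made up for this
sketch.

```c
enum { LANES = 4, GROUP = 16 };

/* Hypothetical mapping A: lane j combines elements 4*j .. 4*j+3.  */
void
dot_prod_blocked (int acc[LANES], const signed char *a, const signed char *b)
{
  for (int k = 0; k < GROUP; k++)
    acc[k / 4] += a[k] * b[k];
}

/* Hypothetical mapping B: lane j combines elements j, j+4, j+8, j+12.  */
void
dot_prod_strided (int acc[LANES], const signed char *a, const signed char *b)
{
  for (int k = 0; k < GROUP; k++)
    acc[k % 4] += a[k] * b[k];
}

/* Final reduction across the lanes of one accumulator.  */
int
reduc_plus (const int acc[LANES])
{
  int sum = 0;
  for (int l = 0; l < LANES; l++)
    sum += acc[l];
  return sum;
}
```

The per-lane contents differ between the two mappings, but `reduc_plus` gives
the same total for both. Code that only consumes the reduced value is
unaffected by the mapping, whereas SLP would need to know which lanes hold
what.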


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
