[Bug tree-optimization/112325] New: Missed vectorization after cunrolli

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/112325] New: Missed vectorization after cunrolli
@ 2023-11-01  2:41 wwwhhhyyy333 at gmail dot com
  2023-11-01  2:46 ` [Bug tree-optimization/112325] " pinskia at gcc dot gnu.org
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: wwwhhhyyy333 at gmail dot com @ 2023-11-01  2:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

            Bug ID: 112325
           Summary: Missed vectorization after cunrolli
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wwwhhhyyy333 at gmail dot com
  Target Milestone: ---

testcase:

#include <stdint.h>
#include <string.h>

typedef struct {
    float s;
    int8_t qs[32];
} block;

void foo (const int n, float * restrict s, const int8_t q[4], const block *
restrict y) {
    const int qk = 32;
    const int nb = n / qk;

    float sumf = 0.0;
    int sumi = 0;

    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, q, 4);

        for (int j = 0; j < qk/2; ++j) {
            sumi += (qh >> j) * y[i].qs[j];
        }
        sumf += (y[i].s * (float) sumi);
    }
    *s = sumf;
}

This can be vectorized under -O2 -mavx512vl but not -O3 -mavx512vl, see
https://godbolt.org/z/csPr4cPen

Under -O3 -mavx512vl -fdisable-tree-cunrolli the loop can also be vectorized.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization after cunrolli
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
@ 2023-11-01  2:46 ` pinskia at gcc dot gnu.org
  2023-11-02  9:51 ` [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling rguenth at gcc dot gnu.org
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-01  2:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-11-01

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Seems like there is a missing SLP ...

Confirmed.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
  2023-11-01  2:46 ` [Bug tree-optimization/112325] " pinskia at gcc dot gnu.org
@ 2023-11-02  9:51 ` rguenth at gcc dot gnu.org
  2023-11-16  8:03 ` liuhongt at gcc dot gnu.org
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-11-02  9:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|Missed vectorization after  |Missed vectorization of
                   |cunrolli                    |reduction after unrolling

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
We're detecting the integer reduction:

t.c:21:18: note:   Starting SLP discovery for
t.c:21:18: note:     _236 = _230 * _235;
t.c:21:18: note:     _223 = _217 * _222;
t.c:21:18: note:     _210 = _204 * _209;
t.c:21:18: note:     _197 = _191 * _196;
t.c:21:18: note:     _184 = _178 * _183;
t.c:21:18: note:     _171 = _165 * _170;
t.c:21:18: note:     _158 = _152 * _157;
t.c:21:18: note:     _145 = _139 * _144;
t.c:21:18: note:     _132 = _126 * _131;
t.c:21:18: note:     _119 = _113 * _118;
t.c:21:18: note:     _106 = _100 * _105;
t.c:21:18: note:     _93 = _87 * _92;
t.c:21:18: note:     _80 = _74 * _79;
t.c:21:18: note:     _67 = _61 * _66;
t.c:21:18: note:     sumi.1_42 = (unsigned int) sumi_96;
t.c:21:18: note:     _41 = _27 * _40;
t.c:21:18: note:     _54 = _48 * _53;

but we're not clever enough to split out

t.c:21:18: note:     sumi.1_42 = (unsigned int) sumi_96;

it's possibly easy enough to do.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
  2023-11-01  2:46 ` [Bug tree-optimization/112325] " pinskia at gcc dot gnu.org
  2023-11-02  9:51 ` [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling rguenth at gcc dot gnu.org
@ 2023-11-16  8:03 ` liuhongt at gcc dot gnu.org
  2023-11-16  8:15 ` liuhongt at gcc dot gnu.org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2023-11-16  8:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

liuhongt at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #3 from liuhongt at gcc dot gnu.org ---
BB vectorizer relies on the backend support of .REDUC_PLUS for reduction, but
loop vectorizer can manually do reduction. That's why it's not vectorized after
cunrolli.

After adding reduc_plus_scal_v4si, it's vectorized.
Looks like we need to support
reduc_plus_scal_{v4si,v8si,v16si,v8hi,v16hi,v32hi}
Similar for reduc_{and,ior,xor}_scal_m.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (2 preceding siblings ...)
  2023-11-16  8:03 ` liuhongt at gcc dot gnu.org
@ 2023-11-16  8:15 ` liuhongt at gcc dot gnu.org
  2023-11-16  9:15 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2023-11-16  8:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #4 from liuhongt at gcc dot gnu.org ---
(In reply to liuhongt from comment #3)
> BB vectorizer relies on the backend support of .REDUC_PLUS for reduction,
> but loop vectorizer can manually do reduction. That's why it's not
> vectorized after cunrolli.
> 
> After adding reduc_plus_scal_v4si, it's vectorized.
> Looks like we need to support
> reduc_plus_scal_{v4si,v8si,v16si,v8hi,v16hi,v32hi}
> Similar for reduc_{and,ior,xor}_scal_m.

This one can be vectorized after support reduc_plus_scal_v4si, but the original
loop is still not vectorized, should be other issues.

int
foo (int* a)
{
  int sum = 0;
  sum += a[0];
  sum += a[1];
  sum += a[2];
  sum += a[3];
  return sum;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (3 preceding siblings ...)
  2023-11-16  8:15 ` liuhongt at gcc dot gnu.org
@ 2023-11-16  9:15 ` rguenth at gcc dot gnu.org
  2023-11-17  6:19 ` pinskia at gcc dot gnu.org
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-11-16  9:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, as I said in comment#2.  Note I specifically ended up not open-coding the
reduction because of concerns of efficiency.  So a target should only provide
reduc_*_scal patterns when they are more efficient than open-coding with
the obvious sequence.

That possibly means fixing the issue I mentioned in comment#2 will not be
enough to get this BB vectorized on x86.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (4 preceding siblings ...)
  2023-11-16  9:15 ` rguenth at gcc dot gnu.org
@ 2023-11-17  6:19 ` pinskia at gcc dot gnu.org
  2023-11-17  6:21 ` pinskia at gcc dot gnu.org
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-17  6:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
Bug 112325 depends on bug 112579, which changed state.

Bug 112579 Summary: bb vectorizer failed to reduction sum += inv >> {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112579

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (5 preceding siblings ...)
  2023-11-17  6:19 ` pinskia at gcc dot gnu.org
@ 2023-11-17  6:21 ` pinskia at gcc dot gnu.org
  2023-11-20  2:52 ` cvs-commit at gcc dot gnu.org
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-17  6:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |106343

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The `>> 0` issue is the NOP case which is recorded in PR 106343 too.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106343
[Bug 106343] SLP does not support no-op case

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (6 preceding siblings ...)
  2023-11-17  6:21 ` pinskia at gcc dot gnu.org
@ 2023-11-20  2:52 ` cvs-commit at gcc dot gnu.org
  2023-11-21  0:34 ` cvs-commit at gcc dot gnu.org
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-20  2:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:2b59e2b4dff42118fe3a505f07b9a6aa4cf53bdf

commit r14-5603-g2b59e2b4dff42118fe3a505f07b9a6aa4cf53bdf
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Nov 16 18:38:39 2023 +0800

    Support reduc_{plus,xor,and,ior}_scal_m for vector integer mode.

    BB vectorizer relies on the backend support of
    .REDUC_{PLUS,IOR,XOR,AND} to vectorize reduction.

    gcc/ChangeLog:

            PR target/112325
            * config/i386/sse.md (reduc_<code>_scal_<mode>): New expander.
            (REDUC_ANY_LOGIC_MODE): New iterator.
            (REDUC_PLUS_MODE): Extend to VxHI/SI/DImode.
            (REDUC_SSE_PLUS_MODE): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr112325-1.c: New test.
            * gcc.target/i386/pr112325-2.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (7 preceding siblings ...)
  2023-11-20  2:52 ` cvs-commit at gcc dot gnu.org
@ 2023-11-21  0:34 ` cvs-commit at gcc dot gnu.org
  2024-02-27  6:02 ` liuhongt at gcc dot gnu.org
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-21  0:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:e5e305e6048c042139037378fe6abfad5735b54f

commit r14-5632-ge5e305e6048c042139037378fe6abfad5735b54f
Author: liuhongt <hongtao.liu@intel.com>
Date:   Fri Nov 17 08:45:47 2023 +0800

    Support reduc_{and,ior,xor}_scal_m for V4HI/V8QI/V4QImode

    gcc/ChangeLog:

            PR target/112325
            * config/i386/i386-expand.cc (emit_reduc_half): Hanlde
            V8QImode.
            * config/i386/mmx.md (reduc_<code>_scal_<mode>): New expander.
            (reduc_<code>_scal_v4qi): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr112325-mmx-1.c: New test.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (8 preceding siblings ...)
  2023-11-21  0:34 ` cvs-commit at gcc dot gnu.org
@ 2024-02-27  6:02 ` liuhongt at gcc dot gnu.org
  2024-02-27  6:13 ` liuhongt at gcc dot gnu.org
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-27  6:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #9 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
The original case is a little different from the one in PR.
It comes from ggml

#include <stdint.h>
#include <string.h>

typedef uint16_t ggml_fp16_t;
static float table_f32_f16[1 << 16];

inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t));
    return table_f32_f16[s];
}

typedef struct {
    ggml_fp16_t d;
    ggml_fp16_t m;
    uint8_t qh[4];
    uint8_t qs[32 / 2];
} block_q5_1;

typedef struct {
    float d;
    float s;
    int8_t qs[32];
} block_q8_1;

void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void *
restrict vx, const void * restrict vy) {
    const int qk = 32;
    const int nb = n / qk;

    const block_q5_1 * restrict x = vx;
    const block_q8_1 * restrict y = vy;

    float sumf = 0.0;

    for (int i = 0; i < nb; i++) {
        uint32_t qh;
        memcpy(&qh, x[i].qh, sizeof(qh));

        int sumi = 0;

        for (int j = 0; j < qk/2; ++j) {
            const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
            const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;

            const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
            const int32_t x1 = (x[i].qs[j] >> 4) | xh_1;

            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
        }

        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi +
ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
    }

    *s = sumf;
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (9 preceding siblings ...)
  2024-02-27  6:02 ` liuhongt at gcc dot gnu.org
@ 2024-02-27  6:13 ` liuhongt at gcc dot gnu.org
  2024-02-27  7:26 ` liuhongt at gcc dot gnu.org
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-27  6:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #10 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #9)
> The original case is a little different from the one in PR.
But the issue is similar, after cunrolli, GCC failed to vectorize the outer
loop.

The interesting thing is in estimated_unrolled_size, the original unr_insns is
288 which is bigger than param_max_completely_peeled_insns(200), but unr_insn
is decreased by 1/3 due to

   Loop body is likely going to simplify further, this is difficult
   to guess, we just decrease the result by 1/3.  */

In practice, this loop body is not simplied for 1/3 of the instructions.

Considering the unroll factor is 16, the unr_insn is large(192), I was
wondering if we could add some heuristic algorithm to avoid complete loop
unroll, because usually for such a big loop, both loop and BB vectorizer may
not perform well.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (10 preceding siblings ...)
  2024-02-27  6:13 ` liuhongt at gcc dot gnu.org
@ 2024-02-27  7:26 ` liuhongt at gcc dot gnu.org
  2024-02-27  7:53 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-27  7:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

>    Loop body is likely going to simplify further, this is difficult
>    to guess, we just decrease the result by 1/3.  */
> 

This is introduced by r0-68074-g91a01f21abfe19

/* Estimate number of insns of completely unrolled loop.  We assume
+   that the size of the unrolled loop is decreased in the
+   following way (the numbers of insns are based on what
+   estimate_num_insns returns for appropriate statements):
+
+   1) exit condition gets removed (2 insns)
+   2) increment of the control variable gets removed (2 insns)
+   3) All remaining statements are likely to get simplified
+      due to constant propagation.  Hard to estimate; just
+      as a heuristics we decrease the rest by 1/3.
+
+   NINSNS is the number of insns in the loop before unrolling.
+   NUNROLL is the number of times the loop is unrolled.  */
+
+static unsigned HOST_WIDE_INT
+estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
+                        unsigned HOST_WIDE_INT nunroll)
+{
+  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
+  if (unr_insns <= 0)
+    unr_insns = 1;
+  unr_insns *= (nunroll + 1);
+
+  return unr_insns;
+}

And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
keeped.

+/* Estimate number of insns of completely unrolled loop.
+   It is (NUNROLL + 1) * size of loop body with taking into account
+   the fact that in last copy everything after exit conditional
+   is dead and that some instructions will be eliminated after
+   peeling.

-   NINSNS is the number of insns in the loop before unrolling.
-   NUNROLL is the number of times the loop is unrolled.  */
+   Loop body is likely going to simplify futher, this is difficult
+   to guess, we just decrease the result by 1/3.  */

 static unsigned HOST_WIDE_INT
-estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
+estimated_unrolled_size (struct loop_size *size,
                         unsigned HOST_WIDE_INT nunroll)
 {
-  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
+  HOST_WIDE_INT unr_insns = ((nunroll)
+                            * (HOST_WIDE_INT) (size->overall
+                                               -
size->eliminated_by_peeling));
+  if (!nunroll)
+    unr_insns = 0;
+  unr_insns += size->last_iteration -
size->last_iteration_eliminated_by_peeling;
+
+  unr_insns = unr_insns * 2 / 3;
   if (unr_insns <= 0)
     unr_insns = 1;
-  unr_insns *= (nunroll + 1);

It looks to me 1 / 3 overestimates the instructions that can be optimised away,
especially if we've subtracted eliminated_by_peeling

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (11 preceding siblings ...)
  2024-02-27  7:26 ` liuhongt at gcc dot gnu.org
@ 2024-02-27  7:53 ` rguenther at suse dot de
  2024-02-27  7:58 ` rguenther at suse dot de
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenther at suse dot de @ 2024-02-27  7:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #10 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to Hongtao Liu from comment #9)
> > The original case is a little different from the one in PR.
> But the issue is similar, after cunrolli, GCC failed to vectorize the outer
> loop.
> 
> The interesting thing is in estimated_unrolled_size, the original unr_insns is
> 288 which is bigger than param_max_completely_peeled_insns(200), but unr_insn
> is decreased by 1/3 due to
> 
>    Loop body is likely going to simplify further, this is difficult
>    to guess, we just decrease the result by 1/3.  */
> 
> In practice, this loop body is not simplied for 1/3 of the instructions.
> 
> Considering the unroll factor is 16, the unr_insn is large(192), I was
> wondering if we could add some heuristic algorithm to avoid complete loop
> unroll, because usually for such a big loop, both loop and BB vectorizer may
> not perform well.

There were several attempts at making the unroller guess less (that 1/3
reduction) but work out what actually will be simplified to be able to
shrink those numbers.

My favorite (but never implemented) idea was to code-generate 
optimistically but while running value-numbering on-the-fly on the
code and cost the so simplified unrolled code, stopping when we
reach a limit (and scrap the sofar accumulated code).  While
reasonably "easy" for unrolled code that ends up without branches
it gets complicated for branches.

My most recent attempt at improving was only for tracking what
unrolling estimates as ending up constant.

I think what might be the least controversical thing to do is to split
the instruction limit between the early cunrolli and the late cunroll
passes and lower the ones for cunrolli a lot.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (12 preceding siblings ...)
  2024-02-27  7:53 ` rguenther at suse dot de
@ 2024-02-27  7:58 ` rguenther at suse dot de
  2024-02-28  7:26 ` liuhongt at gcc dot gnu.org
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenther at suse dot de @ 2024-02-27  7:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> 
> >    Loop body is likely going to simplify further, this is difficult
> >    to guess, we just decrease the result by 1/3.  */
> > 
> 
> This is introduced by r0-68074-g91a01f21abfe19
> 
> /* Estimate number of insns of completely unrolled loop.  We assume
> +   that the size of the unrolled loop is decreased in the
> +   following way (the numbers of insns are based on what
> +   estimate_num_insns returns for appropriate statements):
> +
> +   1) exit condition gets removed (2 insns)
> +   2) increment of the control variable gets removed (2 insns)
> +   3) All remaining statements are likely to get simplified
> +      due to constant propagation.  Hard to estimate; just
> +      as a heuristics we decrease the rest by 1/3.
> +
> +   NINSNS is the number of insns in the loop before unrolling.
> +   NUNROLL is the number of times the loop is unrolled.  */
> +
> +static unsigned HOST_WIDE_INT
> +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +                        unsigned HOST_WIDE_INT nunroll)
> +{
> +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  if (unr_insns <= 0)
> +    unr_insns = 1;
> +  unr_insns *= (nunroll + 1);
> +
> +  return unr_insns;
> +}
> 
> And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> keeped.
> 
> +/* Estimate number of insns of completely unrolled loop.
> +   It is (NUNROLL + 1) * size of loop body with taking into account
> +   the fact that in last copy everything after exit conditional
> +   is dead and that some instructions will be eliminated after
> +   peeling.
> 
> -   NINSNS is the number of insns in the loop before unrolling.
> -   NUNROLL is the number of times the loop is unrolled.  */
> +   Loop body is likely going to simplify futher, this is difficult
> +   to guess, we just decrease the result by 1/3.  */
> 
>  static unsigned HOST_WIDE_INT
> -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +estimated_unrolled_size (struct loop_size *size,
>                          unsigned HOST_WIDE_INT nunroll)
>  {
> -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  HOST_WIDE_INT unr_insns = ((nunroll)
> +                            * (HOST_WIDE_INT) (size->overall
> +                                               -
> size->eliminated_by_peeling));
> +  if (!nunroll)
> +    unr_insns = 0;
> +  unr_insns += size->last_iteration -
> size->last_iteration_eliminated_by_peeling;
> +
> +  unr_insns = unr_insns * 2 / 3;
>    if (unr_insns <= 0)
>      unr_insns = 1;
> -  unr_insns *= (nunroll + 1);
> 
> It looks to me 1 / 3 overestimates the instructions that can be optimised away,
> especially if we've subtracted eliminated_by_peeling

Yes, that 1/3 reduction is a bit odd - you could have the same effect
by increasing the instruction limit by 1/3, but that means it doesn't
really matter, does it?  It would be interesting to see if increasing
the limit by 1/3 and removing the above is neutral on SPEC?

Note this kind of "simplification guessing" is most important for
the 2nd stage unrolling an outer loop with an unrolled inner loop
as there are 2nd level recurrences to be optimized the "elmiminated by
peeling" heuristics do not get (but value-numbering would).  So another
thing to do would be not do the 1/3 reduction for innermost loops
but only for loops up from that.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (13 preceding siblings ...)
  2024-02-27  7:58 ` rguenther at suse dot de
@ 2024-02-28  7:26 ` liuhongt at gcc dot gnu.org
  2024-02-28  8:23 ` rguenther at suse dot de
  2024-02-28  8:26 ` liuhongt at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-28  7:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #13)
> On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > 
> > --- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > 
> > >    Loop body is likely going to simplify further, this is difficult
> > >    to guess, we just decrease the result by 1/3.  */
> > > 
> > 
> > This is introduced by r0-68074-g91a01f21abfe19
> > 
> > /* Estimate number of insns of completely unrolled loop.  We assume
> > +   that the size of the unrolled loop is decreased in the
> > +   following way (the numbers of insns are based on what
> > +   estimate_num_insns returns for appropriate statements):
> > +
> > +   1) exit condition gets removed (2 insns)
> > +   2) increment of the control variable gets removed (2 insns)
> > +   3) All remaining statements are likely to get simplified
> > +      due to constant propagation.  Hard to estimate; just
> > +      as a heuristics we decrease the rest by 1/3.
> > +
> > +   NINSNS is the number of insns in the loop before unrolling.
> > +   NUNROLL is the number of times the loop is unrolled.  */
> > +
> > +static unsigned HOST_WIDE_INT
> > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +                        unsigned HOST_WIDE_INT nunroll)
> > +{
> > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  if (unr_insns <= 0)
> > +    unr_insns = 1;
> > +  unr_insns *= (nunroll + 1);
> > +
> > +  return unr_insns;
> > +}
> > 
> > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > keeped.
> > 
> > +/* Estimate number of insns of completely unrolled loop.
> > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > +   the fact that in last copy everything after exit conditional
> > +   is dead and that some instructions will be eliminated after
> > +   peeling.
> > 
> > -   NINSNS is the number of insns in the loop before unrolling.
> > -   NUNROLL is the number of times the loop is unrolled.  */
> > +   Loop body is likely going to simplify futher, this is difficult
> > +   to guess, we just decrease the result by 1/3.  */
> > 
> >  static unsigned HOST_WIDE_INT
> > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > +estimated_unrolled_size (struct loop_size *size,
> >                          unsigned HOST_WIDE_INT nunroll)
> >  {
> > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > +                            * (HOST_WIDE_INT) (size->overall
> > +                                               -
> > size->eliminated_by_peeling));
> > +  if (!nunroll)
> > +    unr_insns = 0;
> > +  unr_insns += size->last_iteration -
> > size->last_iteration_eliminated_by_peeling;
> > +
> > +  unr_insns = unr_insns * 2 / 3;
> >    if (unr_insns <= 0)
> >      unr_insns = 1;
> > -  unr_insns *= (nunroll + 1);
> > 
> > It looks to me 1 / 3 overestimates the instructions that can be optimised away,
> > especially if we've subtracted eliminated_by_peeling
> 
> Yes, that 1/3 reduction is a bit odd - you could have the same effect
> by increasing the instruction limit by 1/3, but that means it doesn't
> really matter, does it?  It would be interesting to see if increasing
> the limit by 1/3 and removing the above is neutral on SPEC?

Remove 1/3 reduction get ~2% improvement for 525.x264_r on SPR with
-march=native -O3, no big impact on other integer benchmark.

The regression comes from below function, cunrolli unrolls the inner loop,
cunroll unrolls the outer loop, and causes lots of spills.

typedef unsigned long long uint64_t;
typedef unsigned char uint8_t;
typedef unsigned int uint32_t;
uint64_t x264_pixel_var_8x8(uint8_t *pix, int i_stride )
{
    uint32_t sum = 0, sqr = 0;
    for( int y = 0; y < 8; y++ )
    {
        for( int x = 0; x < 8; x++ ) 
        {
            sum += pix[x]; 
            sqr += pix[x] * pix[x]; 
        }                                                              
        pix += i_stride;                                                       
    }                                                                           
    return sum + ((uint64_t)sqr << 32);    
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (14 preceding siblings ...)
  2024-02-28  7:26 ` liuhongt at gcc dot gnu.org
@ 2024-02-28  8:23 ` rguenther at suse dot de
  2024-02-28  8:26 ` liuhongt at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenther at suse dot de @ 2024-02-28  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 28 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #14 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #13)
> > On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > > 
> > > --- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> > > 
> > > >    Loop body is likely going to simplify further, this is difficult
> > > >    to guess, we just decrease the result by 1/3.  */
> > > > 
> > > 
> > > This is introduced by r0-68074-g91a01f21abfe19
> > > 
> > > /* Estimate number of insns of completely unrolled loop.  We assume
> > > +   that the size of the unrolled loop is decreased in the
> > > +   following way (the numbers of insns are based on what
> > > +   estimate_num_insns returns for appropriate statements):
> > > +
> > > +   1) exit condition gets removed (2 insns)
> > > +   2) increment of the control variable gets removed (2 insns)
> > > +   3) All remaining statements are likely to get simplified
> > > +      due to constant propagation.  Hard to estimate; just
> > > +      as a heuristics we decrease the rest by 1/3.
> > > +
> > > +   NINSNS is the number of insns in the loop before unrolling.
> > > +   NUNROLL is the number of times the loop is unrolled.  */
> > > +
> > > +static unsigned HOST_WIDE_INT
> > > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +                        unsigned HOST_WIDE_INT nunroll)
> > > +{
> > > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  if (unr_insns <= 0)
> > > +    unr_insns = 1;
> > > +  unr_insns *= (nunroll + 1);
> > > +
> > > +  return unr_insns;
> > > +}
> > > 
> > > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > > keeped.
> > > 
> > > +/* Estimate number of insns of completely unrolled loop.
> > > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > > +   the fact that in last copy everything after exit conditional
> > > +   is dead and that some instructions will be eliminated after
> > > +   peeling.
> > > 
> > > -   NINSNS is the number of insns in the loop before unrolling.
> > > -   NUNROLL is the number of times the loop is unrolled.  */
> > > +   Loop body is likely going to simplify futher, this is difficult
> > > +   to guess, we just decrease the result by 1/3.  */
> > > 
> > >  static unsigned HOST_WIDE_INT
> > > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +estimated_unrolled_size (struct loop_size *size,
> > >                          unsigned HOST_WIDE_INT nunroll)
> > >  {
> > > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > > +                            * (HOST_WIDE_INT) (size->overall
> > > +                                               -
> > > size->eliminated_by_peeling));
> > > +  if (!nunroll)
> > > +    unr_insns = 0;
> > > +  unr_insns += size->last_iteration -
> > > size->last_iteration_eliminated_by_peeling;
> > > +
> > > +  unr_insns = unr_insns * 2 / 3;
> > >    if (unr_insns <= 0)
> > >      unr_insns = 1;
> > > -  unr_insns *= (nunroll + 1);
> > > 
> > > It looks to me 1 / 3 overestimates the instructions that can be optimised away,
> > > especially if we've subtracted eliminated_by_peeling
> > 
> > Yes, that 1/3 reduction is a bit odd - you could have the same effect
> > by increasing the instruction limit by 1/3, but that means it doesn't
> > really matter, does it?  It would be interesting to see if increasing
> > the limit by 1/3 and removing the above is neutral on SPEC?
> 
> Remove 1/3 reduction get ~2% improvement for 525.x264_r on SPR with
> -march=native -O3, no big impact on other integer benchmark.

454.calculix was always the benchmark to cross check as that benefits
from much unrolling.

I'm all for removing the 1/3 for innermost loop handling (in cunroll
the unrolled loop is then innermost).  I'm more concerned about
unrolling more than one level which is exactly what's required for
454.calculix.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling
  2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
                   ` (15 preceding siblings ...)
  2024-02-28  8:23 ` rguenther at suse dot de
@ 2024-02-28  8:26 ` liuhongt at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-28  8:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #16 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---

> I'm all for removing the 1/3 for innermost loop handling (in cunroll
> the unrolled loop is then innermost).  I'm more concerned about
> unrolling more than one level which is exactly what's required for
> 454.calculix.

Removing 1/3 for the innermost loop would be sufficient to solve both the issue
in the PR and x264_pixel_var_8x8 from 525.x264_r. I'll try to benchmark that.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-02-28  8:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-01  2:41 [Bug tree-optimization/112325] New: Missed vectorization after cunrolli wwwhhhyyy333 at gmail dot com
2023-11-01  2:46 ` [Bug tree-optimization/112325] " pinskia at gcc dot gnu.org
2023-11-02  9:51 ` [Bug tree-optimization/112325] Missed vectorization of reduction after unrolling rguenth at gcc dot gnu.org
2023-11-16  8:03 ` liuhongt at gcc dot gnu.org
2023-11-16  8:15 ` liuhongt at gcc dot gnu.org
2023-11-16  9:15 ` rguenth at gcc dot gnu.org
2023-11-17  6:19 ` pinskia at gcc dot gnu.org
2023-11-17  6:21 ` pinskia at gcc dot gnu.org
2023-11-20  2:52 ` cvs-commit at gcc dot gnu.org
2023-11-21  0:34 ` cvs-commit at gcc dot gnu.org
2024-02-27  6:02 ` liuhongt at gcc dot gnu.org
2024-02-27  6:13 ` liuhongt at gcc dot gnu.org
2024-02-27  7:26 ` liuhongt at gcc dot gnu.org
2024-02-27  7:53 ` rguenther at suse dot de
2024-02-27  7:58 ` rguenther at suse dot de
2024-02-28  7:26 ` liuhongt at gcc dot gnu.org
2024-02-28  8:23 ` rguenther at suse dot de
2024-02-28  8:26 ` liuhongt at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).