public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
@ 2024-02-25 23:40 nathanael.schaeffer at gmail dot com
  2024-02-25 23:46 ` [Bug target/114107] " pinskia at gcc dot gnu.org
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-25 23:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

            Bug ID: 114107
           Summary: poor vectorization at -O3 when dealing with arrays of
                    different multiplicity, good with -O2
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nathanael.schaeffer at gmail dot com
  Target Milestone: ---

A simple loop multiplying two arrays with different multiplicity fails to
vectorize efficiently with -O3.
Target is AVX x86_64.
The loop is the following, where 4 consecutive values in data are multiplied by
the same factor:

    for (int i=0; i<n; i++) {
     for (int k=0; k<4; k++) data[4*i+k] *= factor[i];
    }

See the very poor assembly generated with -O3 on godbolt, whereas gcc 12.1+
with -O2 generates the correct solution, a simple vbroadcastsd:

https://godbolt.org/z/fWj34bbhq


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
@ 2024-02-25 23:46 ` pinskia at gcc dot gnu.org
  2024-02-25 23:56 ` pinskia at gcc dot gnu.org
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-25 23:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Created attachment 57534
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57534&action=edit
Full testcase

`-O3 -march=skylake`


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
  2024-02-25 23:46 ` [Bug target/114107] " pinskia at gcc dot gnu.org
@ 2024-02-25 23:56 ` pinskia at gcc dot gnu.org
  2024-02-26  0:12 ` nathanael.schaeffer at gmail dot com
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-25 23:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I am not 100% sure that is always better.

What is happening is GCC is vectorizing even the outer loop.

It is easier to understand via aarch64 asm too:
.L4:
        ldr     q27, [x3], 16
        ld4     {v28.2d - v31.2d}, [x4]
        fmul    v24.2d, v27.2d, v28.2d
        fmul    v25.2d, v27.2d, v29.2d
        fmul    v26.2d, v27.2d, v30.2d
        fmul    v27.2d, v27.2d, v31.2d
        st4     {v24.2d - v27.2d}, [x4], 64
        cmp     x3, x5
        bne     .L4

Have you benchmarked both?

If anything this is a cost model issue.


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
  2024-02-25 23:46 ` [Bug target/114107] " pinskia at gcc dot gnu.org
  2024-02-25 23:56 ` pinskia at gcc dot gnu.org
@ 2024-02-26  0:12 ` nathanael.schaeffer at gmail dot com
  2024-02-26  0:13 ` nathanael.schaeffer at gmail dot com
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-26  0:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #3 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I have not benchmarked.
For the 4 vmulpd doing the actual work, there are more than 40 permute/mov
instructions, including 24 vpermd instructions, each with a 3-cycle latency.
That is 6 vpermd per vmulpd.
There is no way this can be faster than vbroadcastsd. I would bet it is 4 to 10
times slower than the vbroadcastsd loop.
If you want, I can benchmark it tomorrow.

If this is a cost model problem, it is a bad one. Even ignoring the decoding of
all these instructions, how can adding 6 vpermd to each vmulpd be faster?
I would rather think (hope?) the optimizer does not consider the vbroadcastsd
solution at all.
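
A benchmark along the lines offered above could be sketched as follows. This is only a minimal harness, not the reporter's actual benchmark: `time_rescale` is a hypothetical helper, the array contents are arbitrary, and `clock()` measures CPU time with coarse resolution, so `reps` has to be large enough for the total to be measurable. The same file would be compiled once with each set of flags under comparison.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Kernel from the report: 4 consecutive values share one factor. */
void rescale_x4(double *restrict data, const double *restrict factor, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < 4; k++) data[4*i+k] *= factor[i];
    }
}

/* Hypothetical helper: average CPU seconds per call over `reps` calls,
   or -1.0 if allocation fails. */
double time_rescale(int n, int reps)
{
    assert(n > 0 && reps > 0);
    double *data = malloc(sizeof(double) * 4 * (size_t)n);
    double *factor = malloc(sizeof(double) * (size_t)n);
    if (!data || !factor) { free(data); free(factor); return -1.0; }

    for (int i = 0; i < 4 * n; i++) data[i] = 1.0;
    for (int i = 0; i < n; i++) factor[i] = 1.0;  /* keeps values finite */

    clock_t c0 = clock();
    for (int r = 0; r < reps; r++) rescale_x4(data, factor, n);
    clock_t c1 = clock();

    /* Read a result through a volatile so the work is not discarded. */
    volatile double sink = data[0];
    (void)sink;

    free(data);
    free(factor);
    return (double)(c1 - c0) / CLOCKS_PER_SEC / reps;
}
```

Calling `time_rescale(1024, 1000000)` from a small `main` in binaries built with `-O3 -march=skylake` and with `-O2 -ftree-vectorize -march=skylake` would give two per-call times to compare; the sizes here are arbitrary choices.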


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (2 preceding siblings ...)
  2024-02-26  0:12 ` nathanael.schaeffer at gmail dot com
@ 2024-02-26  0:13 ` nathanael.schaeffer at gmail dot com
  2024-02-26  0:27 ` pinskia at gcc dot gnu.org
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-26  0:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #4 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
... and thank you for your quick reply!


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (3 preceding siblings ...)
  2024-02-26  0:13 ` nathanael.schaeffer at gmail dot com
@ 2024-02-26  0:27 ` pinskia at gcc dot gnu.org
  2024-02-26  0:34 ` nathanael.schaeffer at gmail dot com
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-02-26  0:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #3)
> If this is a cost model problem, it is a bad one.

It is almost definitely a cost model issue in the x86_64 backend, because when
I tried on aarch64 with -march=armv9-a+sve we get only the vectorization
of the inner loop for both -O2 and -O3:
```
.L3:
        ldp     q29, q30, [x0]
        ld1r    {v31.2d}, [x1], 8
        fmul    v30.2d, v30.2d, v31.2d
        fmul    v29.2d, v29.2d, v31.2d
        stp     q29, q30, [x0], 32
        cmp     x2, x1
        bne     .L3
```

With the default generic armv8-a cost model we get the ld4 there, and the
outer loop is vectorized.


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (4 preceding siblings ...)
  2024-02-26  0:27 ` pinskia at gcc dot gnu.org
@ 2024-02-26  0:34 ` nathanael.schaeffer at gmail dot com
  2024-02-26  2:51 ` liuhongt at gcc dot gnu.org
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-26  0:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #6 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
Indeed, the aarch64 assembly looks very good.


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (5 preceding siblings ...)
  2024-02-26  0:34 ` nathanael.schaeffer at gmail dot com
@ 2024-02-26  2:51 ` liuhongt at gcc dot gnu.org
  2024-02-26  3:28 ` liuhongt at gcc dot gnu.org
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-26  2:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
perm_cost is very low in the x86 backend, and that may be OK for 128-bit
vectors, since pshufb/shufps are available for most cases.
But for 256/512-bit vectors, when the permutation is cross-lane, the cost could
be higher. One solution is to increase perm_cost when the vector size is more
than 128 bits, since vperm is most likely used instead of
vblend/vpblend/vpshuf/vshuf.


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (6 preceding siblings ...)
  2024-02-26  2:51 ` liuhongt at gcc dot gnu.org
@ 2024-02-26  3:28 ` liuhongt at gcc dot gnu.org
  2024-02-26  7:42 ` nathanael.schaeffer at gmail dot com
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-26  3:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #7)
> perm_cost is very low in the x86 backend, and that may be OK for 128-bit
> vectors, since pshufb/shufps are available for most cases.
> But for 256/512-bit vectors, when the permutation is cross-lane, the cost
> could be higher. One solution is to increase perm_cost when the vector size
> is more than 128 bits, since vperm is most likely used instead of
> vblend/vpblend/vpshuf/vshuf.

Furthermore, if we can get indices in the backend when calculating vec_perm
cost, we can check if the permutation is cross-lane or not, and set cost more
accurately for 256/512-bit vector permutation.
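
To illustrate the kind of check described above, here is a minimal sketch in plain C. It is not GCC code and not a backend API: `perm_is_cross_lane` is a hypothetical helper, and it assumes a single-input permutation in which `idx[i]` names the source element chosen for result element `i`, with lanes being 128 bits wide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not a GCC API: returns true if a single-input
   permutation takes any result element from a different 128-bit lane
   than the one it is written to. */
bool perm_is_cross_lane(const int *idx, int nelts, int elt_bytes)
{
    assert(elt_bytes > 0 && 16 % elt_bytes == 0);
    int per_lane = 16 / elt_bytes;   /* elements per 128-bit lane */
    for (int i = 0; i < nelts; i++) {
        if (idx[i] / per_lane != i / per_lane)
            return true;             /* source and destination lanes differ */
    }
    return false;
}
```

Under these assumptions a swap within each lane of four doubles, { 1 0 3 2 }, stays in-lane, whereas { 0 0 0 0 } or { 2 3 0 1 } crosses lanes. For a 128-bit vector every permutation is in-lane by construction, which matches the observation that the low cost is less of a problem there.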


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (7 preceding siblings ...)
  2024-02-26  3:28 ` liuhongt at gcc dot gnu.org
@ 2024-02-26  7:42 ` nathanael.schaeffer at gmail dot com
  2024-02-26  7:49 ` nathanael.schaeffer at gmail dot com
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-26  7:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #9 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
In addition, optimizing for size with -Os leads to a non-vectorized double-loop
(51 bytes) while the vectorized loop with vbroadcastsd (produced by clang -Os)
leads to 40 bytes.
It is thus also a missed optimization for -Os.


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (8 preceding siblings ...)
  2024-02-26  7:42 ` nathanael.schaeffer at gmail dot com
@ 2024-02-26  7:49 ` nathanael.schaeffer at gmail dot com
  2024-02-26  7:54 ` liuhongt at gcc dot gnu.org
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-26  7:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #10 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal
code using vbroadcastsd with the following options:

    -O2 -march=skylake -ftree-vectorize


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (9 preceding siblings ...)
  2024-02-26  7:49 ` nathanael.schaeffer at gmail dot com
@ 2024-02-26  7:54 ` liuhongt at gcc dot gnu.org
  2024-02-26  8:13 ` nathanael.schaeffer at gmail dot com
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-02-26  7:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #11 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to N Schaeffer from comment #9)
> In addition, optimizing for size with -Os leads to a non-vectorized
> double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced
> by clang -Os) leads to 40 bytes.
> It is thus also a missed optimization for -Os.

Vectorization is enabled at -O2 but not at -Os.


* [Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (10 preceding siblings ...)
  2024-02-26  7:54 ` liuhongt at gcc dot gnu.org
@ 2024-02-26  8:13 ` nathanael.schaeffer at gmail dot com
  2024-02-26  9:10 ` [Bug tree-optimization/114107] " rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: nathanael.schaeffer at gmail dot com @ 2024-02-26  8:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #12 from N Schaeffer <nathanael.schaeffer at gmail dot com> ---
I found the "offending" option, and it seems to be indeed a cost-model problem
as Andrew Pinski said:

good code is generated by:

   gcc -O2 -ftree-vectorize -march=skylake   (since gcc 6.1)
   gcc -O1 -ftree-vectorize -march=skylake   (since gcc 8.1)
   gcc -O3 -fvect-cost-model=very-cheap -march=skylake   (with gcc 13.1+)

bad code is generated otherwise, and in particular:

   gcc -O2 -march=skylake  (does not vectorize)
   gcc -O3 -march=skylake  (bad vectorization with so many permutations)


* [Bug tree-optimization/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (11 preceding siblings ...)
  2024-02-26  8:13 ` nathanael.schaeffer at gmail dot com
@ 2024-02-26  9:10 ` rguenth at gcc dot gnu.org
  2024-02-26 14:40 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-26  9:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947
          Component|target                      |tree-optimization
   Last reconfirmed|                            |2024-02-26

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that we fail to SLP vectorize this (at -O3 we unroll the inner loop):

t.c:4:20: note:   ==> examining statement: _34 = *_33;
t.c:4:20: missed:   peeling for gaps insufficient for access
t.c:5:51: missed:   not vectorized: relevant stmt not supported: _34 = *_33;
t.c:4:20: note:   removing SLP instance operations starting from: *_29 = _35;
t.c:4:20: missed:  unsupported SLP instances

which is because 'factor[i]' is treated as a vector load

t.c:4:20: note:   node 0x687f730 (max_nunits=4, refcnt=2) const vector(4) double
t.c:4:20: note:   op template: _34 = *_33;
t.c:4:20: note:         stmt 0 _34 = *_33;
t.c:4:20: note:         stmt 1 _34 = *_33;
t.c:4:20: note:         stmt 2 _34 = *_33;
t.c:4:20: note:         stmt 3 _34 = *_33;
t.c:4:20: note:         load permutation { 0 0 0 0 }

and we don't anticipate we can do this with a load-and-splat (I'm not sure
we'd eventually do that even).
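
For readers unfamiliar with the term, the load-and-splat form of the loop can be written by hand as below. This is only an illustration of the shape of the computation, not what the vectorizer emits: it relies on the GNU C vector extension (`vector_size`), and `rescale_x4_splat` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* GNU C vector extension: 4 doubles in one 32-byte vector. */
typedef double v4df __attribute__((vector_size(32)));

/* Hypothetical name: hand-written load-and-splat version of the loop. */
void rescale_x4_splat(double *restrict data, const double *restrict factor,
                      int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double f = factor[i];
        v4df splat = { f, f, f, f };          /* one scalar load, replicated */
        v4df d;
        memcpy(&d, data + 4 * i, sizeof d);   /* unaligned 4-element load */
        d *= splat;                           /* element-wise multiply */
        memcpy(data + 4 * i, &d, sizeof d);   /* unaligned 4-element store */
    }
}
```

Each iteration reads exactly one element of factor, so no elements beyond factor[i] are touched, unlike a 4-element vector load starting at factor[i].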

I think we might have a duplicate bug report for this issue.

Note with GCC 13 we refuse to SLP because

t.c:4:20: missed:   Build SLP failed: not grouped load _35 = *_34;

You can help GCC by doing

void rescale_x4(double* __restrict data, const double * __restrict factor, int n)
{
    for (int i=0; i<n; i++) {
#pragma GCC unroll 0
     for (int k=0; k<4; k++) data[4*i+k] *= factor[i];
    }
}

which will get you

rescale_x4:
.LFB0:
        .cfi_startproc
        testl   %edx, %edx
        jle     .L5
        movslq  %edx, %rdx
        salq    $5, %rdx
        leaq    (%rdi,%rdx), %rax
        .p2align 4,,10
        .p2align 3
.L3:
        vbroadcastsd    (%rsi), %ymm0
        addq    $32, %rdi
        addq    $8, %rsi
        vmulpd  -32(%rdi), %ymm0, %ymm0
        vmovupd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3
        vzeroupper
.L5:
        ret


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug tree-optimization/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (12 preceding siblings ...)
  2024-02-26  9:10 ` [Bug tree-optimization/114107] " rguenth at gcc dot gnu.org
@ 2024-02-26 14:40 ` rguenth at gcc dot gnu.org
  2024-06-13  6:22 ` cvs-commit at gcc dot gnu.org
  2024-06-13  7:13 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-02-26 14:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Mine.


* [Bug tree-optimization/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (13 preceding siblings ...)
  2024-02-26 14:40 ` rguenth at gcc dot gnu.org
@ 2024-06-13  6:22 ` cvs-commit at gcc dot gnu.org
  2024-06-13  7:13 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-06-13  6:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

--- Comment #15 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:1fe55a1794863b5ad9eeca5062782834716016b2

commit r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Jun 7 11:29:05 2024 +0200

    tree-optimization/114107 - avoid peeling for gaps in more cases

    The following refactors the code to detect necessary peeling for
    gaps, in particular the PR103116 case when there is no gap but
    the group size is smaller than the vector size.  The testcase in
    PR114107 shows we fail to SLP

      for (int i=0; i<n; i++)
        for (int k=0; k<4; k++)
          data[4*i+k] *= factor[i];

    because peeling one scalar iteration isn't enough to cover a gap
    of 3 elements of factor[i].  But the code detecting this is placed
    after the logic that detects cases we handle properly already as
    we'd code generate { factor[i], 0., 0., 0. } for V4DFmode vectorization
    already.  In fact the check to detect when peeling a single iteration
    isn't enough seems improperly guarded as it should apply to all cases.

    I'm not sure we correctly handle VMAT_CONTIGUOUS_REVERSE but I
    checked that VMAT_STRIDED_SLP and VMAT_ELEMENTWISE correctly avoid
    touching excess elements.

    With this change we can use SLP for the above testcase and the
    PR103116 testcases no longer require an epilogue on x86-64.  It
    might be different on other targets so I made those testcases
    runtime FAIL only instead of relying on dump scanning there's
    currently no easy way to properly constrain.

            PR tree-optimization/114107
            PR tree-optimization/110445
            * tree-vect-stmts.cc (get_group_load_store_type): Refactor
            contiguous access case.  Make sure peeling for gap constraints
            are always tested and consistently relax when we know we can
            avoid touching excess elements during code generation.  But
            rewrite the check poly-int aware.

            * gcc.dg/vect/pr114107.c: New testcase.
            * gcc.dg/vect/pr103116-1.c: Adjust.
            * gcc.dg/vect/pr103116-2.c: Likewise.


* [Bug tree-optimization/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2
  2024-02-25 23:40 [Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2 nathanael.schaeffer at gmail dot com
                   ` (14 preceding siblings ...)
  2024-06-13  6:22 ` cvs-commit at gcc dot gnu.org
@ 2024-06-13  7:13 ` rguenth at gcc dot gnu.org
  15 siblings, 0 replies; 17+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-06-13  7:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED
   Target Milestone|---                         |15.0

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is fixed now, we produce

.L3:
        vbroadcastsd    (%rsi,%rax), %ymm0
        vmulpd  (%rdi,%rax,4), %ymm0, %ymm0
        vmovupd %ymm0, (%rdi,%rax,4)
        addq    $8, %rax
        cmpq    %rdx, %rax
        jne     .L3
