* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
@ 2024-03-13 19:45 ` pinskia at gcc dot gnu.org
2024-03-14 8:32 ` rguenth at gcc dot gnu.org
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-13 19:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What               |Removed                     |Added
----------------------------------------------------------------------------
Ever confirmed     |0                           |1
Target Milestone   |---                         |12.4
Last reconfirmed   |                            |2024-03-13
Status             |UNCONFIRMED                 |NEW
Summary            |AVX2 vectorisation          |[13/14 Regression] AVX2
                   |performance regression with |vectorisation performance
                   |gfortran 13/14              |regression with gfortran
                   |                            |13/14
Blocks             |                            |53947
Component          |target                      |tree-optimization
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
There are definitely some vectorization changes happening.
Confirmed.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: rguenth at gcc dot gnu.org @ 2024-03-14 8:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P3 |P2
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12 manages to fully SLP the loop resulting in a vectorization factor of two
while GCC 13 ends up with hybrid SLP and a vectorization factor of four. The
IL into the vectorizer is almost the same besides
REALPART_EXPR <(*a_28(D))[_13]> = _53;   | REALPART_EXPR <(*a_28(D))[_12]> = _53;
IMAGPART_EXPR <(*a_28(D))[_13]> = _54;   | IMAGPART_EXPR <(*a_28(D))[_12]> = _54;
_108 = d1$real_51 - _42;                 | _55 = ctmp$real_43 + d1$real_51;
_56 = ctmp$imag_44 + d1$imag_52;           _56 = ctmp$imag_44 + d1$imag_52;
REALPART_EXPR <(*a_28(D))[_5]> = _108;   | REALPART_EXPR <(*a_28(D))[_6]> = _55;
IMAGPART_EXPR <(*a_28(D))[_5]> = _56;    | IMAGPART_EXPR <(*a_28(D))[_6]> = _56;
_57 = _42 + d1$real_51;                  | _57 = d1$real_51 - ctmp$real_43;
_58 = d1$imag_52 - ctmp$imag_44;           _58 = d1$imag_52 - ctmp$imag_44;
REALPART_EXPR <(*a_28(D))[_9]> = _57;      REALPART_EXPR <(*a_28(D))[_9]> = _57;
IMAGPART_EXPR <(*a_28(D))[_9]> = _58;      IMAGPART_EXPR <(*a_28(D))[_9]> = _58;
where in GCC 13 we ended up destroying the nice complex pattern by
merging the negation into the defining stmts which causes SLP discovery
to fail there:
t.f90:11:7: note: Build SLP for _35 = IMAGPART_EXPR <(*a_28(D))[_9]>;
t.f90:11:7: note: precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note: nunits = 4
t.f90:11:7: note: Build SLP for _21 = REALPART_EXPR <(*a_28(D))[_6]>;
t.f90:11:7: note: precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note: nunits = 4
t.f90:11:7: missed: Build SLP failed: different interleaving chains in one node _21 = REALPART_EXPR <(*a_28(D))[_6]>;
since we got there from
t.f90:11:7: note: Build SLP for _7 = _35 - _20;
t.f90:11:7: note: precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note: nunits = 4
t.f90:11:7: note: Build SLP for _40 = _21 - _36;
t.f90:11:7: note: precomputed vectype: vector(4) real(kind=8)
there's nothing to "swap".
I'll note that complex lowering produces what GCC 13 has in the end, and
it seems PRE is what produces the "desired" IL:
Inserted _107 = -_42;
Replaced _6 * 8.660254037844385965883020617184229195117950439453125e-1 with _107 in all uses of ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
gimple_simplified to _108 = d1$real_51 - _42;
_55 = _108;
gimple_simplified to _57 = _42 + d1$real_51;
Removing dead stmt ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
before PRE we have
_41 = _20 - _4;
_42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
_6 = _4 - _20;
ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
GCC 13 seems to perform the same value numbering but in the end doesn't
insert. This is because _42 is dead (also with GCC 12), so we don't
want to make it live again by expressing _43 as -_42, as that wouldn't
be profitable. That was added by r13-6834-g41ade3399bd1ec on purpose.
From complex lowering we had
_41 = _20 - _35;
_42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
ctmp$real_43 = -_42;
and forwprop rightfully turned that into
_7 = _35 - _20;
ctmp$real_43 = _7 * 8.660254037844385965883020617184229195117950439453125e-1;
and PRE undid this in GCC 12, which the change now prohibits.
In this case this simplification is prohibitive to SLP vectorization and
we can't at the moment recover from it.
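The three IL shapes in this comment can be checked against each other with a small scalar model. This is only a sketch in Python, not GCC code: the function names and sample values are made up for illustration, and the constant is the one from the dump rounded to a double. It shows that complex lowering's negated product, the forwprop form with swapped subtraction operands, and the two ways of writing the stores all compute the same values, so only the shape of the IL differs.

```python
# Scalar model of the IL shapes quoted above; variable names mirror the dump.
C = 0.8660254037844386  # constant from the dump, rounded to a double


def complex_lowering(v20, v35):
    # _41 = _20 - _35;  _42 = _41 * C;  ctmp$real_43 = -_42;
    v41 = v20 - v35
    v42 = v41 * C
    return -v42


def after_forwprop(v20, v35):
    # _7 = _35 - _20;  ctmp$real_43 = _7 * C;
    v7 = v35 - v20
    return v7 * C


def stores_via_42(d1_real, v42):
    # _108 = d1$real_51 - _42;  _57 = _42 + d1$real_51;
    return d1_real - v42, v42 + d1_real


def stores_via_ctmp(d1_real, ctmp_real):
    # _55 = ctmp$real_43 + d1$real_51;  _57 = d1$real_51 - ctmp$real_43;
    return ctmp_real + d1_real, d1_real - ctmp_real


if __name__ == "__main__":
    a, b, d1 = 0.1, 0.7, 2.0
    ctmp = after_forwprop(a, b)
    print(complex_lowering(a, b) == ctmp)
    print(stores_via_42(d1, -ctmp) == stores_via_ctmp(d1, ctmp))
```

The form using _42 yields one subtraction and one addition on the same operands, which is the +- mix that the vectorizer can match, while the ctmp form needs the operands of one subtraction swapped.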
* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: rguenth at gcc dot gnu.org @ 2024-03-14 8:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|12.4 |13.3
CC| |rguenth at gcc dot gnu.org
* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: rguenth at gcc dot gnu.org @ 2024-03-14 10:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the missed feature is to implement swapping operands of a MINUS_EXPR
during SLP discovery by introducing a conditional negate (for example
by multiplying with { 1, -1 } or with two_operator negate, "nop" and blend).
Note that with GCC 12 and the +- mixed op we are able to use vaddsubpd,
that's in the end likely the perfect code gen for the testcase. I'm not
sure it's easy to get back to that with the "more optimized" scalar IL.
I'll note the negate could be also consumed by the constant in the
multiplication.
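The options mentioned in this comment can be modelled lane by lane. The following is a rough Python sketch, not vectorizer code: lists stand in for vector registers and all function names are invented for the illustration. It models the vaddsubpd lane behaviour (subtract in even lanes, add in odd lanes), a conditional negate done by multiplying with { 1, -1 }, and the variant where the sign is folded into the multiplication constant.

```python
def addsub(a, b):
    # Lane model of vaddsubpd: even lanes compute a - b, odd lanes a + b.
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]


def conditional_negate(v, signs):
    # Multiply each lane by +1.0 or -1.0, e.g. signs = [1.0, -1.0, ...].
    return [x * s for x, s in zip(v, signs)]


def mixed_minus(a, b):
    # Scalar IL shape: even lanes a - b, odd lanes b - a (swapped operands).
    return [x - y if i % 2 == 0 else y - x
            for i, (x, y) in enumerate(zip(a, b))]


def uniform_minus_then_negate(a, b):
    # One operand order in every lane, followed by a conditional negate.
    diff = [x - y for x, y in zip(a, b)]
    signs = [1.0 if i % 2 == 0 else -1.0 for i in range(len(diff))]
    return conditional_negate(diff, signs)


def scale_with_signed_constant(diff, c):
    # The negate consumed by the constant: multiply by {c, -c, c, -c}.
    return [x * (c if i % 2 == 0 else -c) for i, x in enumerate(diff)]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.5, -4.0]
    b = [0.5, 7.0, -1.25, 2.0]
    print(mixed_minus(a, b) == uniform_minus_then_negate(a, b))
    print(addsub(a, b))
```

In this model the swapped-operand lanes and the uniform lanes with a sign vector agree exactly, since negation and multiplication by -1.0 only flip the sign bit.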
* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: mjr19 at cam dot ac.uk @ 2024-03-15 20:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
--- Comment #4 from mjr19 at cam dot ac.uk ---
Created attachment 57713
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57713&action=edit
Second testcase, very similar to first
Thank you for looking into this. The real code in question has more than one
loop which suffers a slow-down with gfortran 13/14 when compared to 12, and I
suspect it is the same underlying issue in all cases.
I attach another test case, which seems very similar. The odd logic surrounding
the initialisation of ci is to replicate the fact that in the real code the
sign of ci depends on an argument which I have dropped, and so the compiler
cannot optimise it away completely.
For this case, gfortran 12 and ifort produce very similar performance, gfortran
13 is over 20% slower, and ifx slower still.
* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: mjr19 at cam dot ac.uk @ 2024-05-01 13:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
--- Comment #5 from mjr19 at cam dot ac.uk ---
Note that bug 114767 also turns out to be a case in which the inability to
alternate neg and nop along a vector leads to poor performance with some
operations on the complex type. That optimisation improvement request also
discusses that the ability to alternate add and nop could be beneficial.
Ifort can alternate neg and nop, at least in the simple case of
complex(kind(1d0)) :: c(*)
do i=1,n
c(i)=conjg(c(i))
enddo
Helped by aggressive default unrolling, it ends up being almost four times
faster than gfortran-14 on the machine I tested it on. On asking gfortran-14 to
unroll, the difference is reduced to about a factor of two.
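What "alternating neg and nop along a vector" means for conjg can be sketched in Python, assuming the usual interleaved storage of complex values as (real, imaginary) pairs of IEEE 754 doubles. This is a model of the idea only, with invented helper names, and says nothing about the code ifort or gfortran actually emit: the sign bit is flipped with an xor in the imaginary lanes and the real lanes are left untouched.

```python
import struct

SIGN_BIT = 0x8000000000000000  # sign bit of an IEEE 754 binary64 value


def flip_sign(x):
    # Negate a double by xor-ing its sign bit, as a vector xor would per lane.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ SIGN_BIT))
    return y


def conjg_interleaved(data):
    # data is [re0, im0, re1, im1, ...]; the implied mask is {0, SIGN_BIT}
    # repeated: nop on the real lanes, neg on the imaginary lanes.
    return [flip_sign(x) if i % 2 else x for i, x in enumerate(data)]


if __name__ == "__main__":
    c = [1.0, 2.0, -3.0, 0.5]
    print(conjg_interleaved(c))
```

Applying the function twice returns the original data, which is one way to see that the xor neither rounds nor traps.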
* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: jakub at gcc dot gnu.org @ 2024-05-21 9:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|13.3 |13.4
--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 13.3 is being released, retargeting bugs to GCC 13.4.
* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: mjr19 at cam dot ac.uk @ 2024-06-25 13:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
--- Comment #7 from mjr19 at cam dot ac.uk ---
The patch to GCC 15 in commit
r15-1508-g59221dc587f369695d9b0c2f73aedf8458931f0f from pr 68855 has made a
significant improvement to the optimisation of these examples at -O3, causing
the -Ofast version now to be slower than the -O3 version for both of the
attachments. For the two examples given, rough timings in ns/iteration on a
3GHz Kaby Lake are
m3spf
gf-12 -Ofast 26.5
gf-15 -O3 27.6
gf-14 -Ofast 34.8
gf-15 -Ofast 35.1
gf-14 -O3 43.8
gf-12 -O3 44.8
m4spf
gf-12 -Ofast 23.3
gf-15 -O3 23.8
gf-14 -Ofast 29.6
gf-15 -Ofast 29.7
gf-14 -O3 37.3
gf-12 -O3 37.6
All with the flag -mavx2; in both cases the fastest time is very similar to
ifort -O3. gf-15 is gfortran 15.0-20240623.
(I believe there is interest in the optimisation of these expressions. I am an
electronic structure physicist, and the major simulation codes in my area,
Abinit, CASTEP, QE, Siesta, VASP, are all written in Fortran, all use the
complex datatype, are likely to make use of conjugation and also multiplication
by +/-i, and use large amounts of time on academic supercomputers. The ability
to alternate neg and nop efficiently along a vector would be very useful if it
dealt with conjg and *(+/-i), and the obvious xor seems quite safe.)
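For multiplication by +/-i the same alternation applies after swapping the two lanes of each complex value. The following Python sketch assumes interleaved (real, imaginary) storage; the function name and the sign parameter are hypothetical and only serve the illustration. It uses the identities (re + i*im) * i = -im + i*re and (re + i*im) * (-i) = im - i*re.

```python
def times_i(data, sign=1):
    # data is [re0, im0, re1, im1, ...]; sign=+1 multiplies by i, -1 by -i.
    out = []
    for re, im in zip(data[0::2], data[1::2]):
        # Swap the pair, then negate exactly one of the two lanes.
        out += [-im, re] if sign > 0 else [im, -re]
    return out


if __name__ == "__main__":
    z = [1.0, 2.0, -3.0, 0.5]
    print(times_i(z))       # multiply each element by +i
    print(times_i(z, -1))   # multiply each element by -i
```

A full complex multiplication by (0, 1) can differ from this in the sign of zero results, so the sketch only claims equality of values, not of signed zeros.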
* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
From: mjr19 at cam dot ac.uk @ 2024-06-25 14:04 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324
--- Comment #8 from mjr19 at cam dot ac.uk ---
Oops -- the timings are not ns/iteration as claimed, nor even comparable
between the m3spf and m4spf examples, but they are consistent within each
example.
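Since the numbers are consistent within each example, ratios against the fastest build of the same example are still meaningful whatever the unit. A small Python sketch of that comparison, using the timings reported in comment #7 (the dictionary layout and function name are just for this illustration):

```python
# Timings from comment #7; the unit is unknown but consistent per example.
timings = {
    "m3spf": {"gf-12 -Ofast": 26.5, "gf-15 -O3": 27.6, "gf-14 -Ofast": 34.8,
              "gf-15 -Ofast": 35.1, "gf-14 -O3": 43.8, "gf-12 -O3": 44.8},
    "m4spf": {"gf-12 -Ofast": 23.3, "gf-15 -O3": 23.8, "gf-14 -Ofast": 29.6,
              "gf-15 -Ofast": 29.7, "gf-14 -O3": 37.3, "gf-12 -O3": 37.6},
}


def relative_to_fastest(rows):
    # Divide every timing by the smallest one in the same example.
    best = min(rows.values())
    return {name: t / best for name, t in rows.items()}


for case, rows in timings.items():
    ranked = sorted(relative_to_fastest(rows).items(), key=lambda kv: kv[1])
    for name, ratio in ranked:
        print(f"{case} {name:13s} {ratio:.2f}x")
```

Ratios must not be compared across m3spf and m4spf, for the reason given in the comment.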