[Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14

public inbox for gcc-bugs@sourceware.org

* [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14
@ 2024-03-13 12:52 mjr19 at cam dot ac.uk
  2024-03-13 19:45 ` [Bug tree-optimization/114324] [13/14 Regression] " pinskia at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-03-13 12:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

            Bug ID: 114324
           Summary: AVX2 vectorisation performance regression with
                    gfortran 13/14
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

Created attachment 57685
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57685&action=edit
Test case of loop showing performance regression

The attached loop, when compiled with "-Ofast -mavx2", runs over 20% slower on
gfortran 13 or (pre-release) 14 than it does on 12.x. Precise versions tested:
12.3.0, 13.1.0 and GCC 14 as downloaded on 11th March.

The precise slowdown depends on the CPU. Tested on Haswell and Kaby Lake desktops.

Adding "-fopenmp" changes the code produced, but 12.3 still beats later
compilers. The analysis below is without -fopenmp.

It appears (to me) that 12.x is using the full width of the ymm registers, and
has a loop of 17 vector instructions, and some scalar loop control, which
performs two iterations of the original Fortran loop.

13.x manages more aggressive unrolling, performing four iterations per pass,
but uses about 54 vector instructions, rather than the 34 one might naively
expect. More instructions does not necessarily mean slower, but here it does.
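
To put the two counts above on a common footing, the following C sketch
divides a loop's vector instruction count by the number of original Fortran
iterations that one pass covers. The function name is made up for this
illustration, and the inputs are the approximate counts quoted above.

```c
#include <assert.h>

/* Vector instructions spent per iteration of the original Fortran loop.
   Both arguments are counts read off the generated assembly. */
static double insns_per_iteration(int vector_insns, int iterations_per_pass)
{
    assert(iterations_per_pass > 0);
    return (double)vector_insns / iterations_per_pass;
}
```

Called as insns_per_iteration(17, 2) for 12.x and insns_per_iteration(54, 4)
for 13.x, this gives 8.5 against 13.5 vector instructions per iteration. The
naively expected 34 instructions for four iterations would have stayed at 8.5.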

I attach the test case to which I refer. I would be happy to add the trivial
timing program to show how I have been timing it. The full code is an FFT, but
the test case has been reduced to functional nonsense.

(I note that in other areas there are pleasing performance gains in gfortran
13.x. It is a pity that this partially cancels them.)

* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
@ 2024-03-13 19:45 ` pinskia at gcc dot gnu.org
  2024-03-14  8:32 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-13 19:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Target Milestone|---                         |12.4
   Last reconfirmed|                            |2024-03-13
             Status|UNCONFIRMED                 |NEW
            Summary|AVX2 vectorisation          |[13/14 Regression] AVX2
                   |performance regression with |vectorisation performance
                   |gfortran 13/14              |regression with gfortran
                   |                            |13/14
             Blocks|                            |53947
          Component|target                      |tree-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Definitely there are some vectorization changes happening.
Confirmed.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
  2024-03-13 19:45 ` [Bug tree-optimization/114324] [13/14 Regression] " pinskia at gcc dot gnu.org
@ 2024-03-14  8:32 ` rguenth at gcc dot gnu.org
  2024-03-14  8:32 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-14  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12 manages to fully SLP the loop resulting in a vectorization factor of two
while GCC 13 ends up with hybrid SLP and a vectorization factor of four.  The
IL into the vectorizer is almost the same besides

  REALPART_EXPR <(*a_28(D))[_13]> = _53;                      |   REALPART_EXPR <(*a_28(D))[_12]> = _53;
  IMAGPART_EXPR <(*a_28(D))[_13]> = _54;                      |   IMAGPART_EXPR <(*a_28(D))[_12]> = _54;
  _108 = d1$real_51 - _42;                                    |   _55 = ctmp$real_43 + d1$real_51;
  _56 = ctmp$imag_44 + d1$imag_52;                                _56 = ctmp$imag_44 + d1$imag_52;
  REALPART_EXPR <(*a_28(D))[_5]> = _108;                      |   REALPART_EXPR <(*a_28(D))[_6]> = _55;
  IMAGPART_EXPR <(*a_28(D))[_5]> = _56;                       |   IMAGPART_EXPR <(*a_28(D))[_6]> = _56;
  _57 = _42 + d1$real_51;                                     |   _57 = d1$real_51 - ctmp$real_43;
  _58 = d1$imag_52 - ctmp$imag_44;                                _58 = d1$imag_52 - ctmp$imag_44;
  REALPART_EXPR <(*a_28(D))[_9]> = _57;                           REALPART_EXPR <(*a_28(D))[_9]> = _57;
  IMAGPART_EXPR <(*a_28(D))[_9]> = _58;                           IMAGPART_EXPR <(*a_28(D))[_9]> = _58;

where in GCC 13 we ended up destroying the nice complex pattern by
merging the negation into the defining stmts which causes SLP discovery
to fail there:

t.f90:11:7: note:   Build SLP for _35 = IMAGPART_EXPR <(*a_28(D))[_9]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _21 = REALPART_EXPR <(*a_28(D))[_6]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: missed:   Build SLP failed: different interleaving chains in one node _21 = REALPART_EXPR <(*a_28(D))[_6]>;

since we got there from

t.f90:11:7: note:   Build SLP for _7 = _35 - _20;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _40 = _21 - _36;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)

there's nothing to "swap".

I'll note that complex lowering produces what GCC 13 has in the end, and
it seems to be PRE that produces the "desired" IL in GCC 12:

Inserted _107 = -_42;
Replaced _6 * 8.660254037844385965883020617184229195117950439453125e-1 with _107 in all uses of ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
gimple_simplified to _108 = d1$real_51 - _42;
_55 = _108;
gimple_simplified to _57 = _42 + d1$real_51;
Removing dead stmt ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

before PRE we have

  _41 = _20 - _4;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  _6 = _4 - _20;
  ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

GCC 13 seems to perform the same value numbering but in the end doesn't
insert.  This is because _42 is dead (also with GCC 12), so we don't
want to make it live again by expressing ctmp$real_43 as -_42, as that
wouldn't be profitable.  That was added by r13-6834-g41ade3399bd1ec on purpose.

From complex lowering we had

  _41 = _20 - _35;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  ctmp$real_43 = -_42;

and forwprop rightfully turned that into

  _7 = _35 - _20;
  ctmp$real_43 = _7 * 8.660254037844385965883020617184229195117950439453125e-1;

and PRE undid this in GCC 12, which the change now prohibits.

In this case this simplification is prohibitive to SLP vectorization and
we can't at the moment recover from it.
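
The two scalar shapes compared in this comment compute the same values and
differ only in where the negation ends up. A minimal C sketch of that
equivalence follows. The function and parameter names are hypothetical, x and
y stand in for the two loaded values being subtracted, and only the real-part
statements from the dumps are modelled.

```c
#include <assert.h>

#define K 0.8660254037844386  /* sqrt(3)/2, the constant in the dumps */

/* GCC 12 shape: one product t, then a subtract and an add that both read t.
   out[0] corresponds to "_108 = d1$real_51 - _42" and
   out[1] to "_57 = _42 + d1$real_51". */
static void butterfly_shared_product(double x, double y, double d, double out[2])
{
    double t = (x - y) * K;
    out[0] = d - t;
    out[1] = t + d;
}

/* GCC 13 shape: the negation is folded into the operands of the inner
   subtraction, so the outer add and subtract trade places.
   out[0] corresponds to "_55 = ctmp$real_43 + d1$real_51" and
   out[1] to "_57 = d1$real_51 - ctmp$real_43". */
static void butterfly_folded_negate(double x, double y, double d, double out[2])
{
    double ctmp = (y - x) * K;
    out[0] = ctmp + d;
    out[1] = d - ctmp;
}
```

Both functions return identical results for the same inputs, which is why
the scalar simplification is valid. What changes is only the pattern of
operations and operand order that SLP discovery has to match.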

* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
  2024-03-13 19:45 ` [Bug tree-optimization/114324] [13/14 Regression] " pinskia at gcc dot gnu.org
  2024-03-14  8:32 ` rguenth at gcc dot gnu.org
@ 2024-03-14  8:32 ` rguenth at gcc dot gnu.org
  2024-03-14 10:37 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-14  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.4                        |13.3
                 CC|                            |rguenth at gcc dot gnu.org

* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
                   ` (2 preceding siblings ...)
  2024-03-14  8:32 ` rguenth at gcc dot gnu.org
@ 2024-03-14 10:37 ` rguenth at gcc dot gnu.org
  2024-03-15 20:06 ` mjr19 at cam dot ac.uk
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-03-14 10:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the missed feature is to implement swapping operands of a MINUS_EXPR
during SLP discovery by introducing a conditional negate (for example
by multiplying with { 1, -1 } or with two_operator negate, "nop" and blend).

Note that with GCC 12 and the +- mixed op we are able to use vaddsubpd,
which in the end is likely the perfect code gen for the testcase.  I'm not
sure it's easy to get back to that with the "more optimized" scalar IL.

I'll note the negate could be also consumed by the constant in the
multiplication.
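
As a rough illustration of the conditional negate idea, the C sketch below
models the lanes of a vector as array elements. It is a scalar model only,
not vectorizer code, and the function names are invented for this example.
The first function follows the lane pattern of vaddsubpd (subtract in even
lanes, add in odd lanes). The second reaches the same result with a single
subtract after multiplying one operand by the repeating pattern { 1, -1 }.

```c
#include <assert.h>
#include <stddef.h>

/* Lane-wise model of an add/sub blend: even lanes subtract, odd lanes add. */
static void addsub_blend(const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Same result from one subtract per lane after a conditional negate,
   written here as a multiply by the repeating pattern { 1, -1 }. */
static void sub_with_conditional_negate(const double *a, const double *b,
                                        double *out, size_t n)
{
    static const double sign[2] = { 1.0, -1.0 };
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] - b[i] * sign[i % 2];
}
```

When b is itself the product of a value and a constant, as in the testcase,
the sign pattern could in principle be merged into that constant, which is
the point made in the last paragraph above.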

* [Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
                   ` (3 preceding siblings ...)
  2024-03-14 10:37 ` rguenth at gcc dot gnu.org
@ 2024-03-15 20:06 ` mjr19 at cam dot ac.uk
  2024-05-01 13:21 ` [Bug tree-optimization/114324] [13/14/15 " mjr19 at cam dot ac.uk
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-03-15 20:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #4 from mjr19 at cam dot ac.uk ---
Created attachment 57713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57713&action=edit
Second testcase, very similar to first

Thank you for looking into this. The real code in question has more than one
loop which suffers a slow-down with gfortran 13/14 when compared to 12, and I
suspect it is the same underlying issue in all cases.

I attach another test case, which seems very similar. The odd logic surrounding
the initialisation of ci is to replicate the fact that in the real code the
sign of ci depends on an argument which I have dropped, and so the compiler
cannot optimise it away completely.

For this case, gfortran 12 and ifort produce very similar performance, gfortran
13 is over 20% slower, and ifx slower still.

* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
                   ` (4 preceding siblings ...)
  2024-03-15 20:06 ` mjr19 at cam dot ac.uk
@ 2024-05-01 13:21 ` mjr19 at cam dot ac.uk
  2024-05-21  9:19 ` jakub at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-05-01 13:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #5 from mjr19 at cam dot ac.uk ---
Note that bug 114767 also turns out to be a case in which the inability to
alternate neg and nop along a vector leads to poor performance with some
operations on the complex type. That optimisation improvement request also
discusses that the ability to alternate add and nop could be beneficial.

Ifort can alternate neg and nop, at least in the simple case of

  complex(kind(1d0)) :: c(*)
  do i=1,n
     c(i)=conjg(c(i))
  enddo

Helped by aggressive default unrolling, it ends up being almost four times
faster than gfortran-14 on the machine I tested it on. On asking gfortran-14 to
unroll, the difference is reduced to about a factor of two.
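
For readers who think in C, here is a sketch of the same loop and of what
"alternating neg and nop" means for it. It relies on the C99 guarantee that
a double complex is laid out as a (real, imaginary) pair of doubles. The
function names are hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Direct C counterpart of the Fortran loop: c(i) = conjg(c(i)). */
static void conj_in_place(double complex *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = conj(c[i]);
}

/* The same operation on the interleaved (re, im, re, im, ...) layout:
   even lanes are left alone ("nop"), odd lanes are negated ("neg"). */
static void conj_interleaved(double *lanes, size_t n_complex)
{
    for (size_t i = 0; i < 2 * n_complex; i++)
        if (i % 2 == 1)
            lanes[i] = -lanes[i];
}
```

Seen this way, a vector of four doubles holds two complex values, and the
conjugate is a per-lane operation with the pattern nop, neg, nop, neg.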

* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
                   ` (5 preceding siblings ...)
  2024-05-01 13:21 ` [Bug tree-optimization/114324] [13/14/15 " mjr19 at cam dot ac.uk
@ 2024-05-21  9:19 ` jakub at gcc dot gnu.org
  2024-06-25 13:47 ` mjr19 at cam dot ac.uk
  2024-06-25 14:04 ` mjr19 at cam dot ac.uk
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-05-21  9:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|13.3                        |13.4

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 13.3 is being released, retargeting bugs to GCC 13.4.

* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
                   ` (6 preceding siblings ...)
  2024-05-21  9:19 ` jakub at gcc dot gnu.org
@ 2024-06-25 13:47 ` mjr19 at cam dot ac.uk
  2024-06-25 14:04 ` mjr19 at cam dot ac.uk
  8 siblings, 0 replies; 10+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-06-25 13:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #7 from mjr19 at cam dot ac.uk ---
The patch to GCC 15 in commit
r15-1508-g59221dc587f369695d9b0c2f73aedf8458931f0f  from pr 68855 has made a
significant improvement to the optimisation of these examples at -O3, causing
the -Ofast version now to be slower than the -O3 version for both of the
attachments. For the two examples given, rough timings in ns/iteration on a
3GHz Kaby Lake are

m3spf

gf-12  -Ofast   26.5
gf-15  -O3      27.6
gf-14  -Ofast   34.8
gf-15  -Ofast   35.1
gf-14  -O3      43.8
gf-12  -O3      44.8

m4spf

gf-12  -Ofast   23.3
gf-15  -O3      23.8
gf-14  -Ofast   29.6
gf-15  -Ofast   29.7
gf-14  -O3      37.3
gf-12  -O3      37.6

All with the flag -mavx2, and in both cases the fastest time is very similar to
ifort -O3. gf-15 is gfortran 15.0-20240623
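
To make the relative positions easier to read, the C sketch below takes the
m3spf figures from the table above and expresses one entry as a multiple of
the fastest entry. Only ratios within a single example are used. The struct
and function names are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

struct timing { const char *build; double t; };

/* m3spf timings as posted above; only their ratios are used here. */
static const struct timing m3spf[] = {
    { "gf-12 -Ofast", 26.5 }, { "gf-15 -O3",    27.6 },
    { "gf-14 -Ofast", 34.8 }, { "gf-15 -Ofast", 35.1 },
    { "gf-14 -O3",    43.8 }, { "gf-12 -O3",    44.8 },
};

/* Timing of entry i relative to the fastest entry in the table. */
static double relative_to_fastest(const struct timing *tab, size_t n, size_t i)
{
    assert(n > 0 && i < n);
    double best = tab[0].t;
    for (size_t k = 1; k < n; k++)
        if (tab[k].t < best)
            best = tab[k].t;
    return tab[i].t / best;
}
```

For example, relative_to_fastest(m3spf, 6, 2) puts gf-14 -Ofast at roughly
1.31 times the gf-12 -Ofast time, and index 1 puts gf-15 -O3 at roughly 1.04.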

(I believe there is interest in the optimisation of these expressions. I am an
electronic structure physicist, and the major simulation codes in my area,
Abinit, CASTEP, QE, Siesta, VASP, are all written in Fortran, all use the
complex datatype, are likely to make use of conjugation and also multiplication
by +/-i, and use large amounts of time on academic supercomputers. The ability
to alternate neg and nop efficiently along a vector would be very useful if it
dealt with conjg and *(+/-i), and the obvious xor seems quite safe.)
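
A minimal C sketch of the xor idea mentioned above, assuming that double is
an IEEE 754 binary64 value whose top bit is the sign bit. The helper names
are hypothetical, and the loops are scalar models of what a vector xor with
an alternating mask would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flip the sign of a double by XOR-ing its sign bit (assumes IEEE 754
   binary64). memcpy does the bit reinterpretation in well-defined C. */
static double flip_sign(double x)
{
    uint64_t bits;
    assert(sizeof x == sizeof bits);
    memcpy(&bits, &x, sizeof bits);
    bits ^= UINT64_C(0x8000000000000000);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* conjg on interleaved (re, im) pairs: XOR only the odd lanes. */
static void conj_xor(double *lanes, size_t n_complex)
{
    for (size_t i = 0; i < n_complex; i++)
        lanes[2 * i + 1] = flip_sign(lanes[2 * i + 1]);
}

/* Multiply by +i: (re, im) -> (-im, re), a lane swap plus one sign flip. */
static void mul_by_plus_i(double *lanes, size_t n_complex)
{
    for (size_t i = 0; i < n_complex; i++) {
        double re = lanes[2 * i], im = lanes[2 * i + 1];
        lanes[2 * i]     = flip_sign(im);
        lanes[2 * i + 1] = re;
    }
}
```

Multiplication by -i follows the same pattern with the sign flip moved to
the other lane, giving (im, -re).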

* [Bug tree-optimization/114324] [13/14/15 Regression] AVX2 vectorisation performance regression with gfortran 13/14
  2024-03-13 12:52 [Bug fortran/114324] New: AVX2 vectorisation performance regression with gfortran 13/14 mjr19 at cam dot ac.uk
                   ` (7 preceding siblings ...)
  2024-06-25 13:47 ` mjr19 at cam dot ac.uk
@ 2024-06-25 14:04 ` mjr19 at cam dot ac.uk
  8 siblings, 0 replies; 10+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-06-25 14:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #8 from mjr19 at cam dot ac.uk ---
Ooops -- timings not ns/iteration as claimed, nor even comparable between the
m3spf and m4spf examples, but they are consistent within each example.
