public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
@ 2024-04-18 12:13 mjr19 at cam dot ac.uk
  2024-04-18 13:05 ` [Bug tree-optimization/114767] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-04-18 12:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

            Bug ID: 114767
           Summary: gfortran AVX2 complex multiplication by (0d0,1d0)
                    suboptimal
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mjr19 at cam dot ac.uk
  Target Milestone: ---

Gfortran 14 shows considerable improvement over 13.1 on x86_64 AVX2 on the test
case

subroutine scale_i(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

  do i=1,n
     c(i)=c(i)*(0d0,1d0)
  enddo
end subroutine scale_i

Both vectorise well, and use an xor for the multiplication by -1 -- good.

But both progress by forming one vector containing the real parts of four
consecutive complex elements, and one of the imaginary parts. The imaginary
parts are then all xor'ed to swap their signs, and further permuting and
shuffling occurs to reassemble things into the correct interleaved order.
Gfortran-14 has reduced the amount of permuting and shuffling to achieve the
same result.

I think that it should be possible to do this with the vector registers holding
the complex data in their natural order. A single xor could switch the signs of
alternate elements, leaving the real parts untouched, and a single vpermilpd
(or the more general vpermpd) could then swap pairs of elements. This should
not only be faster, but use fewer registers too.

(This could be generalised to the case where the constant is a multiple of
(0d0,1d0), say (0d0,q), either by a final multiplication with a vector
containing q repeated, or by replacing the xor by a multiplication by a vector
containing q repeated but with alternating signs.)

Compilation tests with -mavx2 -O3 -ffast-math.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
@ 2024-04-18 13:05 ` rguenth at gcc dot gnu.org
  2024-04-18 13:58 ` mjr19 at cam dot ac.uk
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-18 13:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2024-04-18
             Blocks|                            |53947
          Component|fortran                     |tree-optimization
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
When you avoid -ffast-math you'll get

.L4:
        vmovupd (%rax), %ymm0
        addq    $32, %rax
        vmulpd  %ymm2, %ymm0, %ymm1
        vpermilpd       $5, %ymm0, %ymm0
        vaddsubpd       %ymm0, %ymm1, %ymm1
        vmovupd %ymm1, -32(%rax)
        cmpq    %rax, %rdx
        jne     .L4

this consumes the negation with the vaddsubpd but has a multiplication
by zero for the sake of preserving signed zeros.

The main issue in the way of optimal code is that we lower the operations
early, arriving at

  _6 = REALPART_EXPR <(*c_10(D))[_1]>;
  _5 = IMAGPART_EXPR <(*c_10(D))[_1]>;
  _14 = -_5;
  REALPART_EXPR <(*c_10(D))[_1]> = _14;
  IMAGPART_EXPR <(*c_10(D))[_1]> = _6;

and vectorizing that fails to use SLP which introduces an interleaving chain.
Negation isn't supported as "two-operators" op as there's no corresponding
"no negation" operation and we lack a way to insert a noop during SLP build.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
  2024-04-18 13:05 ` [Bug tree-optimization/114767] " rguenth at gcc dot gnu.org
@ 2024-04-18 13:58 ` mjr19 at cam dot ac.uk
  2024-04-18 17:35 ` roger at nextmovesoftware dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-04-18 13:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #2 from mjr19 at cam dot ac.uk ---
Ah, I see. An inability to alternate negation with noop also means that
conjugation is treated suboptimally.

  do i=1,n
     c(i)=conjg(c(i))
  enddo

Here gfortran-13 and -14 are differently suboptimal, and again it appears to be
because they don't wish to alternate noops and negations along a single vector. 

In this case -14 fails to vectorise, and loads just the imaginary values,
negating them and storing them, moving with a stride of two. Not a bad answer,
but one which would not generalise well should the loop contain other
operations on c(i).

In practice almost every vector operation can be trivially alternated with a
noop, as +, -, *, xor, or, and, all have special values which reduce the
operation to a noop and which could be used to pad things. Is there no scope
for making the SLP build more flexible here, as otherwise shuffling and
permuting can cost both time and registers?

I also note that it means that -ffast-math (or simply -fno-signed-zeros) slows
down these examples. This is unfortunate, as there are cases in which
-fno-signed-zeros gives a useful performance increase, but this can be reduced
or reversed when there are cases, perhaps in the same source file, when it
reduces performance.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
  2024-04-18 13:05 ` [Bug tree-optimization/114767] " rguenth at gcc dot gnu.org
  2024-04-18 13:58 ` mjr19 at cam dot ac.uk
@ 2024-04-18 17:35 ` roger at nextmovesoftware dot com
  2024-04-18 17:58 ` mjr19 at cam dot ac.uk
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-04-18 17:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #3 from Roger Sayle <roger at nextmovesoftware dot com> ---
Richard has already changed this from "gfortran" to "tree-optimization", but
for the record, the C equivalent of this test case (with the same issue) is:

void scale_i(_Complex double *c, int n)
{
    for (int i=0; i<n; i++)
      c[i] *= __builtin_complex(0.0,1.0);
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
                   ` (2 preceding siblings ...)
  2024-04-18 17:35 ` roger at nextmovesoftware dot com
@ 2024-04-18 17:58 ` mjr19 at cam dot ac.uk
  2024-04-18 18:01 ` roger at nextmovesoftware dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-04-18 17:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #4 from mjr19 at cam dot ac.uk ---
An issue which I suspect is related is shown by

subroutine zradd(c,n)
  integer :: i,n
  complex(kind(1d0)) :: c(*)

  do i=1,n
     c(i)=c(i)+1d0
  enddo
end subroutine

If compiled with gfortran-14 and -O3 -mavx2 it all looks very sensible.

If one adds -ffast-math, it looks a lot less sensible, and takes over 70%
longer to run. I think it has changed from promoting 1d0 to (1d0,0d0) and then
adding that (which one might argue that a strict interpretation of the Fortran
standard requires, but I am not certain that it does), to collecting all the
real parts in a vector, adding 1d0 to them, and avoiding adding 0d0 to the
imaginary parts. Unsurprisingly the gain in halving the number of additions is
more than offset by the required vperms and vshufs.

Ideally -ffast-math would have noticed that adding 0d0 to the imaginary part is
not necessary, but then concluded that doing so was faster than any alternative
method, and so done so anyway.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
                   ` (3 preceding siblings ...)
  2024-04-18 17:58 ` mjr19 at cam dot ac.uk
@ 2024-04-18 18:01 ` roger at nextmovesoftware dot com
  2024-04-19 13:44 ` mjr19 at cam dot ac.uk
  2024-05-14 15:30 ` mjr19 at cam dot ac.uk
  6 siblings, 0 replies; 8+ messages in thread
From: roger at nextmovesoftware dot com @ 2024-04-18 18:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #5 from Roger Sayle <roger at nextmovesoftware dot com> ---
Another interesting (simpler) case of -ffast-math pessimization is:
void foo(_Complex double *c)
{
    for (int i=0; i<16; i++)
      c[i] += __builtin_complex(1.0,0.0);
}

Again without -ffast-math we vectorize consecutive additions, but with
-ffast-math we (not so) cleverly avoid every second addition by producing
significantly larger code that shuffles the real/imaginary parts around.

This even suggests a missed-optimization for:
void bar(_Complex double *c, double x)
{
    for (int i=0; i<16; i++)
      c[i] += x;
}

which may be more efficiently implemented (when safe) by:
void bar(_Complex double *c, double x)
{
    for (int i=0; i<16; i++)
      c[i] += __builtin_complex(x,0.0);
}

i.e. insert/interleave a no-op zero addition, to simplify the vectorization.

The existence of a suitable identity operation (+0, *1.0, &~0, |0, ^0) can be
used to avoid shuffling/permuting values/lanes out of vectors, when its
possible for the vector operation to leave the other values unchanged.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
                   ` (4 preceding siblings ...)
  2024-04-18 18:01 ` roger at nextmovesoftware dot com
@ 2024-04-19 13:44 ` mjr19 at cam dot ac.uk
  2024-05-14 15:30 ` mjr19 at cam dot ac.uk
  6 siblings, 0 replies; 8+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-04-19 13:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #6 from mjr19 at cam dot ac.uk ---
I was starting to wonder whether this issue might be related to that in bug
114324, which is a slightly more complicated example in which multiplication by
a purely imaginary number destroys vectorisation.

In 114324 the problem seems to arise from a refusal to alternate no-ops and
negs along a vector, which is pretty much the issue here too.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug tree-optimization/114767] gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal
  2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
                   ` (5 preceding siblings ...)
  2024-04-19 13:44 ` mjr19 at cam dot ac.uk
@ 2024-05-14 15:30 ` mjr19 at cam dot ac.uk
  6 siblings, 0 replies; 8+ messages in thread
From: mjr19 at cam dot ac.uk @ 2024-05-14 15:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114767

--- Comment #7 from mjr19 at cam dot ac.uk ---
Another manifestation of this issue in GCC 13.1 and 14.1 is that the loop

  do i=1,n
     c(i)=a(i)*c(i)*(0d0,1d0)
  enddo

takes about twice as long to run as

  do i=1,n
     c(i)=a(i)*(0d0,1d0)*c(i)
  enddo

when compiled -Ofast -mavx2. In the second case the compiler manages to merge
its unnecessary desire to form separate vectors of real and imaginary
components to perform the sign flips on multiplying by i, with its much more
reasonable desire to form such vectors for the general complex-complex
multiplication.

One might also argue that, as the above expressions are mathematically
identical, at -Ofast the compiler ought to chose the faster anyway.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-05-14 15:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-18 12:13 [Bug fortran/114767] New: gfortran AVX2 complex multiplication by (0d0,1d0) suboptimal mjr19 at cam dot ac.uk
2024-04-18 13:05 ` [Bug tree-optimization/114767] " rguenth at gcc dot gnu.org
2024-04-18 13:58 ` mjr19 at cam dot ac.uk
2024-04-18 17:35 ` roger at nextmovesoftware dot com
2024-04-18 17:58 ` mjr19 at cam dot ac.uk
2024-04-18 18:01 ` roger at nextmovesoftware dot com
2024-04-19 13:44 ` mjr19 at cam dot ac.uk
2024-05-14 15:30 ` mjr19 at cam dot ac.uk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).