[Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
@ 2020-04-21  6:43 ` bisqwit at iki dot fi
  2020-04-21  7:07 ` rguenth at gcc dot gnu.org
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: bisqwit at iki dot fi @ 2020-04-21  6:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #11 from Joel Yliluoma <bisqwit at iki dot fi> ---
Looks like this issue has taken a step or two *backwards* in the past years.

Whereas the second function used to be vectorized properly, today it seems that
neither of them is.

Contrast this with Clang, which compiles *both* functions into a single
instruction:

  vaddps xmm0, xmm1, xmm0

or some variant thereof depending on the -m options.

Compiler Explorer link: https://godbolt.org/z/2AKhnt
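
For readers without the Compiler Explorer link at hand, the two functions
being compared can be sketched roughly as follows. This is a reconstruction
and not the exact test case: the names add1, add2, ss1 and ss2 appear
elsewhere in this thread, but the typedefs and function bodies below are
assumptions.

```c
/* Assumed shapes of the two functions under discussion.
   The vector_size attribute requires GCC or Clang. */
typedef float _Complex ss1;                         /* C99 complex float      */
typedef float ss2 __attribute__((vector_size(8)));  /* two-float generic vector */

/* Complex addition: real and imaginary parts are added separately. */
ss1 add1(ss1 a, ss1 b) { return a + b; }

/* Generic-vector addition: both lanes are added elementwise. */
ss2 add2(ss2 a, ss2 b) { return a + b; }
```

Compiling such a file with optimization enabled and inspecting the assembly
output is one way to see whether a packed add is emitted for either function.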

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
  2020-04-21  6:43 ` [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity bisqwit at iki dot fi
@ 2020-04-21  7:07 ` rguenth at gcc dot gnu.org
  2020-04-21  7:17 ` bisqwit at iki dot fi
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-21  7:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
                 CC|                            |uros at gcc dot gnu.org

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Joel Yliluoma from comment #11)
> Looks like this issue has taken a step or two *backwards* in the past years.
> 
> Whereas the second function used to be vectorized properly, today it seems that
> neither of them is.

Which version do you see vectorizing the second (add2) function?

> Contrast this with Clang, which compiles *both* functions into a single
> instruction:
> 
>   vaddps xmm0, xmm1, xmm0
> 
> or some variant thereof depending on the -m options.
> 
> Compiler Explorer link: https://godbolt.org/z/2AKhnt

The main issues on the GCC side are
  a) ABI details not exposed at the point of vectorization (several PRs about
     this exist)
  b) "Poor" support for two-element float vectors (an understatement: we have
     some support for MMX, but that's integer only, and I'm not sure we've
     enabled the 3dnow part to be emulated with SSE)

Oddly enough, even with -mmmx -m3dnow I see add2 lowered by veclower, so
the vector type or the vector add must be unsupported(?).

LLVM is known to support emulating smaller vectors just fine (and by
design is also aware of ABI details).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
  2020-04-21  6:43 ` [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity bisqwit at iki dot fi
  2020-04-21  7:07 ` rguenth at gcc dot gnu.org
@ 2020-04-21  7:17 ` bisqwit at iki dot fi
  2020-04-21  7:37 ` rguenth at gcc dot gnu.org
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: bisqwit at iki dot fi @ 2020-04-21  7:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #13 from Joel Yliluoma <bisqwit at iki dot fi> ---
GCC 4.1.2 is indicated in the bug report headers.
Luckily, Compiler Explorer has a copy of that exact version, and it indeed
vectorizes the second function: https://godbolt.org/z/DC_SSb

On my own system, the earliest I have is 4.6. The Compiler Explorer has 4.4,
and it, or anything newer than that, no longer vectorizes either function.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2020-04-21  7:17 ` bisqwit at iki dot fi
@ 2020-04-21  7:37 ` rguenth at gcc dot gnu.org
  2020-04-21  8:18 ` bisqwit at iki dot fi
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-21  7:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Joel Yliluoma from comment #13)
> GCC 4.1.2 is indicated in the bug report headers.
> Luckily, Compiler Explorer has a copy of that exact version, and it indeed
> vectorizes the second function: https://godbolt.org/z/DC_SSb
> 
> On my own system, the earliest I have is 4.6. The Compiler Explorer has 4.4,
> and it, or anything newer than that, no longer vectorizes either function.

Ah, OK - that's before GCC learned vectorization and is code-generated by
RTL expanding

  return {BIT_FIELD_REF <a, 128, 0> + BIT_FIELD_REF <b, 128, 0>};

so the only vector support was GCC's generic vectors (and intrinsics).  The
generated code is far from perfect though.  I also think LLVM's code
generation is bogus, since it appears the ABI does not guarantee zeroed
upper elements of the xmm0 argument, which means they could contain sNaNs:

typedef float ss2 __attribute__((vector_size(8)));
typedef float ss4 __attribute__((vector_size(16)));
ss2 add2(ss2 a, ss2 b);
void bar(ss4 a)
{
  volatile ss2 x;
  x = add2 ((ss2){a[0], a[1]}, (ss2){a[0], a[1]});
}

produces

bar:
.LFB1:  
        .cfi_startproc
        subq    $56, %rsp
        .cfi_def_cfa_offset 64
        movdqa  %xmm0, %xmm1
        call    add2
        movq    %xmm0, 24(%rsp)
        addq    $56, %rsp

which means we pass through 'a' unchanged.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2020-04-21  7:37 ` rguenth at gcc dot gnu.org
@ 2020-04-21  8:18 ` bisqwit at iki dot fi
  2020-04-21  8:23 ` jakub at gcc dot gnu.org
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: bisqwit at iki dot fi @ 2020-04-21  8:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #15 from Joel Yliluoma <bisqwit at iki dot fi> ---
(In reply to Richard Biener from comment #14)
> I also think LLVM's code generation is bogus since it appears the ABI
> does not guarantee zeroed upper elements of the xmm0 argument
> which means they could contain sNaNs:

Why would it matter that the unused portions of the register contain NaNs?


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2020-04-21  8:18 ` bisqwit at iki dot fi
@ 2020-04-21  8:23 ` jakub at gcc dot gnu.org
  2020-04-21  8:29 ` rguenther at suse dot de
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-21  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Joel Yliluoma from comment #15)
> (In reply to Richard Biener from comment #14)
> > I also think LLVM's code generation is bogus since it appears the ABI
> > does not guarantee zeroed upper elements of the xmm0 argument
> > which means they could contain sNaNs:
> 
> Why would it matter that the unused portions of the register contain NaNs?

Because it could then raise exceptions that shouldn't be raised?


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2020-04-21  8:23 ` jakub at gcc dot gnu.org
@ 2020-04-21  8:29 ` rguenther at suse dot de
  2020-04-21  8:32 ` jakub at gcc dot gnu.org
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: rguenther at suse dot de @ 2020-04-21  8:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #17 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 21 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
> 
> --- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Joel Yliluoma from comment #15)
> > (In reply to Richard Biener from comment #14)
> > > > I also think LLVM's code generation is bogus since it appears the ABI
> > > does not guarantee zeroed upper elements of the xmm0 argument
> > > which means they could contain sNaNs:
> > 
> > Why would it matter that the unused portions of the register contain NaNs?
> 
> Because it could then raise exceptions that shouldn't be raised?

Note that LLVM might actually zero the upper half at the caller
(in disagreement with GCC).  Maybe the psABI even specifies that this
should happen and GCC is wrong.  Either way, at the moment interoperating
GCC and LLVM code is prone to the above-mentioned issue.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2020-04-21  8:29 ` rguenther at suse dot de
@ 2020-04-21  8:32 ` jakub at gcc dot gnu.org
  2020-04-21  8:33 ` jakub at gcc dot gnu.org
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-21  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note, we could do movq %xmm0, %xmm0; movq %xmm1, %xmm1; addpd %xmm1, %xmm0 for
the #c4 first function.
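
The zero-then-add sequence suggested above can be sketched with SSE2
intrinsics. This is only an illustration under assumptions: the helper names
zero_upper_half and add2_zeroed are hypothetical, and the sketch uses the
single-precision packed add on the assumption that the elements are floats
(the comment itself writes addpd).

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 only */

/* Hypothetical helper: keep lanes 0 and 1, clear lanes 2 and 3.
   _mm_move_epi64 copies the low 64 bits and zeroes the high 64 bits,
   which corresponds to a register-to-register movq. */
static __m128 zero_upper_half(__m128 v)
{
    return _mm_castsi128_ps(_mm_move_epi64(_mm_castps_si128(v)));
}

/* Two-float add carried out with a full four-lane packed add.
   Because the upper lanes are zeroed first, they compute 0 + 0 and
   cannot raise a floating-point exception. */
__m128 add2_zeroed(__m128 a, __m128 b)
{
    return _mm_add_ps(zero_upper_half(a), zero_upper_half(b));
}
```

The cost of this approach is the two extra moves, which is the trade-off
against relying on the caller to have cleared the upper half.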


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2020-04-21  8:32 ` jakub at gcc dot gnu.org
@ 2020-04-21  8:33 ` jakub at gcc dot gnu.org
  2020-04-21  8:34 ` bisqwit at iki dot fi
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-21  8:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |hubicka at gcc dot gnu.org,
                   |                            |matz at gcc dot gnu.org

--- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
CCing Micha and Honza on the ABI question.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2020-04-21  8:33 ` jakub at gcc dot gnu.org
@ 2020-04-21  8:34 ` bisqwit at iki dot fi
  2020-04-21  8:43 ` jakub at gcc dot gnu.org
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: bisqwit at iki dot fi @ 2020-04-21  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #20 from Joel Yliluoma <bisqwit at iki dot fi> ---
(In reply to Jakub Jelinek from comment #16)
> (In reply to Joel Yliluoma from comment #15)
> > (In reply to Richard Biener from comment #14)
> > > I also think LLVM's code generation is bogus since it appears the ABI
> > > does not guarantee zeroed upper elements of the xmm0 argument
> > > which means they could contain sNaNs:
> > 
> > Why would it matter that the unused portions of the register contain NaNs?
> 
> Because it could then raise exceptions that shouldn't be raised?

Which exceptions would be generated by data in an unused portion of a register?
Does, for example, “addps” generate an exception if one or both of the operands
contain NaNs? Which instructions would generate exceptions?

I can only think of divps, when dividing by zero, but it does not seem that
even LLVM compiles the two-element vector division into divps.

If the register is passed as a parameter to a library function, the function
would not make judgments based on the values of the unused portions of the
register.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2020-04-21  8:34 ` bisqwit at iki dot fi
@ 2020-04-21  8:43 ` jakub at gcc dot gnu.org
  2020-04-21  8:47 ` rguenther at suse dot de
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-21  8:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Joel Yliluoma from comment #20)
> Which exceptions would be generated by data in an unused portion of a
> register?

addps adds 4 float elements; there is no "unused" portion.
If some of the elements contain garbage, the addition can trigger e.g.
FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW or FE_INEXACT (FE_DIVBYZERO obviously
isn't relevant to addition).
Please read the standard about floating point exceptions, fenv.h etc.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2020-04-21  8:43 ` jakub at gcc dot gnu.org
@ 2020-04-21  8:47 ` rguenther at suse dot de
  2020-04-21  8:51 ` bisqwit at iki dot fi
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: rguenther at suse dot de @ 2020-04-21  8:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #22 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 21 Apr 2020, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485
> 
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |hjl.tools at gmail dot com,
>                    |                            |hubicka at gcc dot gnu.org,
>                    |                            |matz at gcc dot gnu.org
> 
> --- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> CCing Micha and Honza on the ABI question.

The arguments are class SSE (__m64), but I fail to find clarification
as to whether "unused" parts of argument registers (the SSEUP part
of the %xmmN register) are supposed to be zeroed or have unspecified
contents.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2020-04-21  8:47 ` rguenther at suse dot de
@ 2020-04-21  8:51 ` bisqwit at iki dot fi
  2020-04-21  8:58 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: bisqwit at iki dot fi @ 2020-04-21  8:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #23 from Joel Yliluoma <bisqwit at iki dot fi> ---
(In reply to Jakub Jelinek from comment #21)
> (In reply to Joel Yliluoma from comment #20)
> > Which exceptions would be generated by data in an unused portion of a
> > register?
> 
> addps adds 4 float elements; there is no "unused" portion.
> If some of the elements contain garbage, the addition can trigger e.g.
> FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW or FE_INEXACT (FE_DIVBYZERO obviously
> isn't relevant to addition).
> Please read the standard about floating point exceptions, fenv.h etc.

There is an “unused” portion, for the purposes of the data use. It is the same
as with padding in structs; the memory is unused because no part of the program
relies on its contents, even though the CPU may load those portions into
registers when e.g. moving and copying the struct. The CPU won’t know whether
it’s used or not.

You mention FE_INVALID etc., but those are concepts within the C standard
library, not in the hardware. The C standard library will not make judgments on
the upper portions of the register. So if you have two float[2]s, and you add
them together into another float[2], and the compiler uses addps to achieve
this task, what is the mechanism that would supposedly generate an exception,
when no part of the software depends on or makes judgments about the
irrelevant parts of the register?


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2020-04-21  8:51 ` bisqwit at iki dot fi
@ 2020-04-21  8:58 ` jakub at gcc dot gnu.org
  2020-04-21  9:06 ` bisqwit at iki dot fi
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-04-21  8:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #24 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Bugzilla is not the right place to educate users.  Of course the C FE_*
exceptions map to real hardware exceptions; on x86, read e.g. about the MXCSR
register and, in the description of each instruction, about which exceptions
it can raise.
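
The mechanism can be observed directly. The sketch below is an illustration
under assumptions, not code from this bug: the function name
flags_after_addps is hypothetical. It clears the sticky exception flags in
MXCSR, performs one four-lane packed add, and returns the flags that the add
set. With default settings the exceptions are masked, so nothing traps, but
the status bits are still recorded.

```c
#include <xmmintrin.h>  /* SSE intrinsics and MXCSR helper macros; x86 only */

/* Return the MXCSR exception flags raised by one four-lane packed add.
   The inputs are read through volatile-qualified pointers so that the
   compiler cannot evaluate the addition at compile time. */
unsigned int flags_after_addps(const volatile float *x,
                               const volatile float *y)
{
    float lx[4], ly[4], out[4];
    volatile float sink[4];
    int i;

    _MM_SET_EXCEPTION_STATE(0);          /* clear the sticky flags */

    for (i = 0; i < 4; i++) {            /* plain copies, no arithmetic */
        lx[i] = x[i];
        ly[i] = y[i];
    }

    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(lx), _mm_loadu_ps(ly)));

    for (i = 0; i < 4; i++)              /* force the add to happen here */
        sink[i] = out[i];

    return _MM_GET_EXCEPTION_STATE();    /* e.g. _MM_EXCEPT_OVERFLOW */
}
```

If lanes 2 and 3 of both inputs hold a large value such as 3.0e38f, the sum
overflows and the overflow flag is set even though lanes 0 and 1 hold
harmless values. With zeroed upper lanes and exactly representable sums in
the lower lanes, no flag is set.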


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2020-04-21  8:58 ` jakub at gcc dot gnu.org
@ 2020-04-21  9:06 ` bisqwit at iki dot fi
  2021-08-16 21:29 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: bisqwit at iki dot fi @ 2020-04-21  9:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #25 from Joel Yliluoma <bisqwit at iki dot fi> ---
(In reply to Jakub Jelinek from comment #24)
> on x86 read e.g. about MXCSR register and in the description of each
> instruction on which Exceptions it can raise.

So the quick answer to #15 is that the addps instruction may raise exceptions.
OK, thanks for clearing that up. My bad. So it seems that LLVM relies on the
assumption that the upper portions of the register are zeroed, and this is what
you said in the first place.


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2020-04-21  9:06 ` bisqwit at iki dot fi
@ 2021-08-16 21:29 ` pinskia at gcc dot gnu.org
  2021-08-16 21:29 ` [Bug tree-optimization/31485] " pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-16 21:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #26 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note there might be a dup of this bug somewhere too.


* [Bug tree-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2021-08-16 21:29 ` pinskia at gcc dot gnu.org
@ 2021-08-16 21:29 ` pinskia at gcc dot gnu.org
  2022-02-07  8:48 ` rguenth at gcc dot gnu.org
  2023-10-01 18:45 ` pinskia at gcc dot gnu.org
  18 siblings, 0 replies; 29+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-16 21:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |tree-optimization

--- Comment #27 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
And there are two issues here, one is related to SLP not happening and the
other deals with the argument and return value passing.


* [Bug tree-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2021-08-16 21:29 ` [Bug tree-optimization/31485] " pinskia at gcc dot gnu.org
@ 2022-02-07  8:48 ` rguenth at gcc dot gnu.org
  2023-10-01 18:45 ` pinskia at gcc dot gnu.org
  18 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-07  8:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #28 from Richard Biener <rguenth at gcc dot gnu.org> ---
*** Bug 104406 has been marked as a duplicate of this bug. ***


* [Bug tree-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
       [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2022-02-07  8:48 ` rguenth at gcc dot gnu.org
@ 2023-10-01 18:45 ` pinskia at gcc dot gnu.org
  18 siblings, 0 replies; 29+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-10-01 18:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485

--- Comment #29 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630011.html


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (8 preceding siblings ...)
  2010-04-21 11:44 ` rguenther at suse dot de
@ 2010-04-21 18:34 ` irar at il dot ibm dot com
  9 siblings, 0 replies; 29+ messages in thread
From: irar at il dot ibm dot com @ 2010-04-21 18:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from irar at il dot ibm dot com  2010-04-21 18:33 -------
Thanks. So, it is not always profitable and requires a cost model.
I am now working on a cost model for basic block vectorization; I can look at
this once we have one.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485



* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (7 preceding siblings ...)
  2010-04-21 11:33 ` irar at il dot ibm dot com
@ 2010-04-21 11:44 ` rguenther at suse dot de
  2010-04-21 18:34 ` irar at il dot ibm dot com
  9 siblings, 0 replies; 29+ messages in thread
From: rguenther at suse dot de @ 2010-04-21 11:44 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #9 from rguenther at suse dot de  2010-04-21 11:44 -------
Subject: Re:  C complex numbers, amd64 SSE, missed
 optimization opportunity

On Wed, 21 Apr 2010, irar at il dot ibm dot com wrote:

> ------- Comment #8 from irar at il dot ibm dot com  2010-04-21 11:33 -------
> Yes, it's possible to add this to SLP. But I don't understand how 
> D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
> should be vectorized. D.3154_3 is complex and the rhs will be a vector
> {D.3163_8, D.3164_9} (btw, we have to change float to double, otherwise, we
> don't have complete vectors and this is not supported).

Depending on how D.3154_3 is used afterwards, it will be much like
an interleaved/strided store (if {D.3163_8, D.3164_9} is in xmm2 and the
complex is in the lower halves of the register pair xmm0 and xmm1,
we'd emit vec_extracts).  On the tree level we can probably
represent this as

 D.3154_3 = VIEW_CONVERT_EXPR <complex_double> (vec_temp_4);

where vec_temp_4 is the {D.3163_8, D.3164_9} vector.
Or similar, but with present known-to-work trees

 realpart = BIT_FIELD_REF <0, ..> (vec_temp_4);
 imagpart = BIT_FIELD_REF <64, ..> (vec_temp_4);
 D.3154_3 = COMPLEX_EXPR <realpart, imagpart>;

One could also see the COMPLEX_EXPR as a root for SLP induction
vectorization (I suppose we don't do SLP induction at the moment,
induction in the sense that we pick arbitrary scalars and combine
them into vectors).

Richard.
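
At the source level, the two tree representations above correspond roughly to
the following two ways of turning a two-double vector into a complex value.
This is a sketch under assumptions: the type name v2df and both function
names are hypothetical, and the code relies on GCC/Clang extensions (generic
vectors, __real__ and __imag__).

```c
#include <string.h>  /* memcpy */

typedef double v2df __attribute__((vector_size(16)));  /* two-double vector */

/* Analogue of the VIEW_CONVERT_EXPR form: reinterpret the 16 bytes.
   C99 lays out a complex value as an array of two reals (real part
   first), so lane 0 becomes the real part and lane 1 the imaginary. */
double _Complex complex_from_vector_view(v2df v)
{
    double _Complex z;
    memcpy(&z, &v, sizeof z);
    return z;
}

/* Analogue of the BIT_FIELD_REF + COMPLEX_EXPR form: extract each
   lane and rebuild the complex value from the two scalars. */
double _Complex complex_from_vector_parts(v2df v)
{
    double _Complex z;
    __real__ z = v[0];
    __imag__ z = v[1];
    return z;
}
```

Both functions return the same value; they differ only in whether the
conversion is expressed as one reinterpretation or as two extractions.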

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (6 preceding siblings ...)
  2010-04-17 11:11 ` rguenth at gcc dot gnu dot org
@ 2010-04-21 11:33 ` irar at il dot ibm dot com
  2010-04-21 11:44 ` rguenther at suse dot de
  2010-04-21 18:34 ` irar at il dot ibm dot com
  9 siblings, 0 replies; 29+ messages in thread
From: irar at il dot ibm dot com @ 2010-04-21 11:33 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from irar at il dot ibm dot com  2010-04-21 11:33 -------
Yes, it's possible to add this to SLP. But I don't understand how 
D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
should be vectorized. D.3154_3 is complex and the rhs will be a vector
{D.3163_8, D.3164_9} (btw, we have to change float to double, otherwise, we
don't have complete vectors and this is not supported).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485



* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (5 preceding siblings ...)
  2010-04-17  0:28 ` ddesics at gmail dot com
@ 2010-04-17 11:11 ` rguenth at gcc dot gnu dot org
  2010-04-21 11:33 ` irar at il dot ibm dot com
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2010-04-17 11:11 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from rguenth at gcc dot gnu dot org  2010-04-17 11:11 -------
We now have basic-block vectorization but it still works on memory accesses
(visible on the gimple level) only.  So it doesn't handle

add1 (ss1 a, ss1 b)
{
  float D.3164;
  float D.3163;
  float b$imag;
  float b$real;
  float a$imag;
  float a$real;
  ss1 D.3154;

<bb 2>:
  a$real_4 = REALPART_EXPR <a_1(D)>;
  a$imag_5 = IMAGPART_EXPR <a_1(D)>;
  b$real_6 = REALPART_EXPR <b_2(D)>;
  b$imag_7 = IMAGPART_EXPR <b_2(D)>;
  D.3163_8 = a$real_4 + b$real_6;
  D.3164_9 = a$imag_5 + b$imag_7;
  D.3154_3 = COMPLEX_EXPR <D.3163_8, D.3164_9>;
  return D.3154_3;

}

though maybe it could be taught to see REAL/IMAG_PART exprs as loads
and COMPLEX_EXPR as store.  Ira?
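
The distinction between the two forms can be illustrated at the source level.
In the sketch below (function names are hypothetical, and double is chosen
only so that two elements fill a 128-bit vector), the first function performs
its two adds between visible loads and stores, while the second receives and
returns its operands by value, so the parts are extracted and recombined
without any memory access. No claim is made here about which GCC releases
vectorize either form.

```c
/* Memory form: two adjacent loads per operand, two adjacent stores. */
void add_mem(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
}

/* Value form: the parts come from REALPART_EXPR/IMAGPART_EXPR and the
   result is rebuilt with COMPLEX_EXPR, with no loads or stores. */
double _Complex add_val(double _Complex a, double _Complex b)
{
    return a + b;
}
```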


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485



* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (4 preceding siblings ...)
  2008-08-02 13:19 ` rguenth at gcc dot gnu dot org
@ 2010-04-17  0:28 ` ddesics at gmail dot com
  2010-04-17 11:11 ` rguenth at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: ddesics at gmail dot com @ 2010-04-17  0:28 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from ddesics at gmail dot com  2010-04-17 00:28 -------
Has any work been done on this enhancement?  I'm using gcc 4.3.2, and I noticed
that there is still limited use of SSE instructions for complex arithmetic.  

Unless I'm missing something in my understanding, wouldn't the ideal for all
_Complex double additions with SSE2 be to use addpd, and movapd or movupd for
memory operations?
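
Written by hand with SSE2 intrinsics, the code shape asked about here would
look roughly like the sketch below. The function name cadd_sse2 is
hypothetical; the unaligned load and store are used because nothing in the
signature guarantees 16-byte alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 only */

/* Add two _Complex double values with one packed-double add.
   A complex double is laid out as two adjacent doubles (real part
   first), so it can be loaded as a two-lane vector. */
void cadd_sse2(const double _Complex *a, const double _Complex *b,
               double _Complex *out)
{
    __m128d va = _mm_loadu_pd((const double *)a);      /* movupd */
    __m128d vb = _mm_loadu_pd((const double *)b);      /* movupd */
    _mm_storeu_pd((double *)out, _mm_add_pd(va, vb));  /* addpd + movupd */
}
```

When the operands are known to be 16-byte aligned, _mm_load_pd and
_mm_store_pd (movapd) could be used instead.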


-- 

ddesics at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ddesics at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485



* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (3 preceding siblings ...)
  2008-08-02 13:01 ` ubizjak at gmail dot com
@ 2008-08-02 13:19 ` rguenth at gcc dot gnu dot org
  2010-04-17  0:28 ` ddesics at gmail dot com
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-08-02 13:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from rguenth at gcc dot gnu dot org  2008-08-02 13:18 -------
Doh, this is indeed completely broken ;)  I'll experiment with lowering
complex operations to vectorized form a bit.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
                   ` (2 preceding siblings ...)
  2008-08-02 12:23 ` rguenth at gcc dot gnu dot org
@ 2008-08-02 13:01 ` ubizjak at gmail dot com
  2008-08-02 13:19 ` rguenth at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: ubizjak at gmail dot com @ 2008-08-02 13:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from ubizjak at gmail dot com  2008-08-02 13:00 -------
(In reply to comment #3)
> Operations in loops should now be vectorized.  The original testcase is
> probably not worth vectorizing due to calling convention problems (_Complex T
> is not passed as a vector).

Not really. For some unknown reason, _Complex float is passed as a two-element
vector in a single SSE register. This introduces a (double!) store forwarding
penalty, since we have to split the value into a pair of SSE registers before
processing. This is wrong ABI design, as shown by comparing the generated code
for the following example:

--cut here--
_Complex float testf (_Complex float a, _Complex float b)
{
  return a + b;
}

_Complex double testd (_Complex double a, _Complex double b)
{
  return a + b;
}
--cut here--

testf:
        movq    %xmm0, -8(%rsp)
        movq    %xmm1, -16(%rsp)
        movss   -8(%rsp), %xmm0
        movss   -4(%rsp), %xmm2
        addss   -16(%rsp), %xmm0
        addss   -12(%rsp), %xmm2
        movss   %xmm0, -24(%rsp)
        movss   %xmm2, -20(%rsp)
        movq    -24(%rsp), %xmm0
        ret

testd:
        addsd   %xmm3, %xmm1
        addsd   %xmm2, %xmm0
        ret
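
As a point of comparison with the testf listing above, here is a minimal sketch that keeps the two floats packed and adds both lanes at once, using the GCC/Clang vector_size extension. The name testf_packed and the typedef v2sf are hypothetical, chosen for illustration. Whether a given compiler turns this into a single packed add depends on its version and flags, which this sketch does not claim to have measured.

```c
#include <complex.h>
#include <string.h>

/* GCC/Clang vector extension: two floats in one 8-byte vector.
   The typedef name is an assumption made for this sketch. */
typedef float v2sf __attribute__((vector_size(8)));

/* Hypothetical variant of testf that never splits the value into
   separate real and imaginary scalars. */
float _Complex testf_packed(float _Complex a, float _Complex b)
{
    v2sf va, vb, vr;
    float _Complex r;

    /* float _Complex is laid out like float[2], the same 8 bytes as v2sf. */
    memcpy(&va, &a, sizeof va);
    memcpy(&vb, &b, sizeof vb);

    vr = va + vb;               /* element-wise add of both lanes */

    memcpy(&r, &vr, sizeof r);
    return r;
}
```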


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
  2007-04-09 17:48 ` [Bug rtl-optimization/31485] " rguenth at gcc dot gnu dot org
  2008-07-29 22:07 ` victork at gcc dot gnu dot org
@ 2008-08-02 12:23 ` rguenth at gcc dot gnu dot org
  2008-08-02 13:01 ` ubizjak at gmail dot com
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-08-02 12:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rguenth at gcc dot gnu dot org  2008-08-02 12:21 -------
Operations in loops should now be vectorized.  The original testcase is
probably not worth vectorizing due to calling convention problems (_Complex T
is not passed as a vector).

Complex lowering could generate vectorized code directly though for operations
not in a loop.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2008-08-02 12:21:42
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
  2007-04-09 17:48 ` [Bug rtl-optimization/31485] " rguenth at gcc dot gnu dot org
@ 2008-07-29 22:07 ` victork at gcc dot gnu dot org
  2008-08-02 12:23 ` rguenth at gcc dot gnu dot org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: victork at gcc dot gnu dot org @ 2008-07-29 22:07 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from victork at gcc dot gnu dot org  2008-07-29 22:06 -------
Revision 138198 fixes loop-aware SLP vectorization for the addition of complex
numbers. So if the addition is done inside a loop, there is a good chance now
that it will be vectorized.
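
A minimal sketch of the kind of loop this refers to, assuming the element-wise addition of two arrays of complex numbers. The function name cadd_arrays is hypothetical, and whether a particular GCC release vectorizes it depends on the optimization flags and target options in use.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical example: out[i] = a[i] + b[i] over n complex elements.
   The restrict qualifiers tell the compiler the arrays do not overlap,
   which removes one common obstacle to vectorization. */
void cadd_arrays(size_t n, const float _Complex *restrict a,
                 const float _Complex *restrict b,
                 float _Complex *restrict out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

With GCC, -fopt-info-vec can be used to check whether the loop was actually vectorized in a given build.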


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485


* [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity
  2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
@ 2007-04-09 17:48 ` rguenth at gcc dot gnu dot org
  2008-07-29 22:07 ` victork at gcc dot gnu dot org
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-04-09 17:48 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from rguenth at gcc dot gnu dot org  2007-04-09 18:47 -------
Complex operations are lowered at the tree level, so this would require
vectorization of straight-line code.  Second, calling conventions are different.
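
To make "lowered" concrete, here is a hand-written picture of a complex addition split into two independent scalar additions. It is a sketch of the idea only, not GCC's internal representation. The function name cadd_lowered is hypothetical, and __real__ / __imag__ are GNU extensions (also accepted by Clang).

```c
#include <complex.h>

/* Hypothetical illustration: "a + b" written the way it looks once the
   complex operation has been split into its component operations. */
double _Complex cadd_lowered(double _Complex a, double _Complex b)
{
    double _Complex r;

    /* Two unrelated scalar adds; recombining them into one packed add
       would need vectorization of straight-line code. */
    __real__ r = __real__ a + __real__ b;
    __imag__ r = __imag__ a + __imag__ b;
    return r;
}
```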


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31485


Thread overview: 29+ messages
     [not found] <bug-31485-4@http.gcc.gnu.org/bugzilla/>
2020-04-21  6:43 ` [Bug rtl-optimization/31485] C complex numbers, amd64 SSE, missed optimization opportunity bisqwit at iki dot fi
2020-04-21  7:07 ` rguenth at gcc dot gnu.org
2020-04-21  7:17 ` bisqwit at iki dot fi
2020-04-21  7:37 ` rguenth at gcc dot gnu.org
2020-04-21  8:18 ` bisqwit at iki dot fi
2020-04-21  8:23 ` jakub at gcc dot gnu.org
2020-04-21  8:29 ` rguenther at suse dot de
2020-04-21  8:32 ` jakub at gcc dot gnu.org
2020-04-21  8:33 ` jakub at gcc dot gnu.org
2020-04-21  8:34 ` bisqwit at iki dot fi
2020-04-21  8:43 ` jakub at gcc dot gnu.org
2020-04-21  8:47 ` rguenther at suse dot de
2020-04-21  8:51 ` bisqwit at iki dot fi
2020-04-21  8:58 ` jakub at gcc dot gnu.org
2020-04-21  9:06 ` bisqwit at iki dot fi
2021-08-16 21:29 ` pinskia at gcc dot gnu.org
2021-08-16 21:29 ` [Bug tree-optimization/31485] " pinskia at gcc dot gnu.org
2022-02-07  8:48 ` rguenth at gcc dot gnu.org
2023-10-01 18:45 ` pinskia at gcc dot gnu.org
2007-04-05 11:29 [Bug rtl-optimization/31485] New: " bisqwit at iki dot fi
2007-04-09 17:48 ` [Bug rtl-optimization/31485] " rguenth at gcc dot gnu dot org
2008-07-29 22:07 ` victork at gcc dot gnu dot org
2008-08-02 12:23 ` rguenth at gcc dot gnu dot org
2008-08-02 13:01 ` ubizjak at gmail dot com
2008-08-02 13:19 ` rguenth at gcc dot gnu dot org
2010-04-17  0:28 ` ddesics at gmail dot com
2010-04-17 11:11 ` rguenth at gcc dot gnu dot org
2010-04-21 11:33 ` irar at il dot ibm dot com
2010-04-21 11:44 ` rguenther at suse dot de
2010-04-21 18:34 ` irar at il dot ibm dot com