public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/94863] New: Failure to use blendps over mov when possible
@ 2020-04-29 21:54 gabravier at gmail dot com
  2020-04-29 22:05 ` [Bug rtl-optimization/94863] " gabravier at gmail dot com
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: gabravier at gmail dot com @ 2020-04-29 21:54 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

            Bug ID: 94863
           Summary: Failure to use blendps over mov when possible
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

typedef double v2df __attribute__((vector_size(16)));

v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[0] = b[0];
    return result;
}

LLVM -O3 compiles this as follows:

move_sd(double __vector(2), double __vector(2)): # @move_sd(double __vector(2), double __vector(2))
  blendps xmm0, xmm1, 3 # xmm0 = xmm1[0,1],xmm0[2,3]
  ret

GCC gives this:

move_sd(double __vector(2), double __vector(2)):
  movsd xmm0, xmm1
  ret
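
To make the equivalence of the two outputs concrete, here is a small sketch
that models `blendps` lane by lane in portable C. `blendps_model` and
`same_bits` are names made up for this illustration, and the model is only my
reading of the LLVM comment above (immediate bit N set means 32-bit lane N
comes from the source operand), not a statement about how either compiler
represents the operation:

```c
#include <stdint.h>
#include <string.h>

typedef double v2df __attribute__((vector_size(16)));

/* Hypothetical model of `blendps dst, src, imm8`: for each of the four
   32-bit lanes, take the lane from src when the matching bit of imm8 is
   set, otherwise keep the lane from dst. */
static v2df blendps_model(v2df dst, v2df src, unsigned imm8)
{
    uint32_t d[4], s[4];
    v2df out;

    memcpy(d, &dst, sizeof d);
    memcpy(s, &src, sizeof s);
    for (int lane = 0; lane < 4; lane++)
        if (imm8 & (1u << lane))
            d[lane] = s[lane];
    memcpy(&out, d, sizeof out);
    return out;
}

/* The function from the report, unchanged. */
static v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[0] = b[0];
    return result;
}

/* Returns 1 when both vectors have identical bit patterns. */
static int same_bits(v2df x, v2df y)
{
    return memcmp(&x, &y, sizeof x) == 0;
}
```

With immediate 3, lanes 0 and 1 (the low double) come from the second operand
and lanes 2 and 3 (the high double) are kept, so
`same_bits(blendps_model(a, b, 3), move_sd(a, b))` should hold for any inputs.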

Using `blendps` here should be a worthwhile tradeoff. Here is a table of
reciprocal throughputs (cycles per instruction, lower is better) for various
CPU architectures, formatted as "arch-name: blendps-throughput,
movsd-throughput":

Wolfdale: 1, 0.33
Nehalem: 1, 1
Westmere: 1, 1
Sandy Bridge: 0.5, 1
Ivy Bridge: 0.5, 1
Haswell: 0.33, 1
Broadwell: 0.33, 1
Skylake: 0.33, 1
Skylake-X: 0.33, 1
Kaby Lake: 0.33, 1
Coffee Lake: 0.33, 1
Cannon Lake: 0.33, 0.33
Ice Lake: 0.33, 0.33
Zen+: 0.5, 0.25
Zen 2: 0.33, 0.25

Unless an important factor other than throughput affects this, using `blendps`
should improve performance or keep it identical on every architecture listed
except Wolfdale, Zen+ and Zen 2, where `movsd` has the better figure.
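
The table can be tallied mechanically. The sketch below copies the figures
exactly as listed above and assumes they are reciprocal throughputs, so the
smaller number wins; the struct, enum and function names are invented for the
illustration:

```c
#include <stddef.h>

/* One row of the table above: "arch: blendps, movsd". */
struct throughput {
    const char *arch;
    double blendps;
    double movsd;
};

static const struct throughput table[] = {
    {"Wolfdale", 1, 0.33},     {"Nehalem", 1, 1},
    {"Westmere", 1, 1},        {"Sandy Bridge", 0.5, 1},
    {"Ivy Bridge", 0.5, 1},    {"Haswell", 0.33, 1},
    {"Broadwell", 0.33, 1},    {"Skylake", 0.33, 1},
    {"Skylake-X", 0.33, 1},    {"Kaby Lake", 0.33, 1},
    {"Coffee Lake", 0.33, 1},  {"Cannon Lake", 0.33, 0.33},
    {"Ice Lake", 0.33, 0.33},  {"Zen+", 0.5, 0.25},
    {"Zen 2", 0.33, 0.25},
};

enum verdict { BLENDPS_BETTER, EQUAL, MOVSD_BETTER };

/* Assumption: lower figure is better (cycles per instruction). */
static enum verdict compare(const struct throughput *t)
{
    if (t->blendps < t->movsd)
        return BLENDPS_BETTER;
    if (t->blendps > t->movsd)
        return MOVSD_BETTER;
    return EQUAL;
}

/* Count the rows of the table that fall under the given verdict. */
static size_t count_verdict(enum verdict v)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (compare(&table[i]) == v)
            n++;
    return n;
}
```

By this tally `blendps` has the better figure on 8 of the 15 architectures,
the two are equal on 4, and `movsd` has the better figure on 3 (Wolfdale,
Zen+ and Zen 2).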


* [Bug rtl-optimization/94863] Failure to use blendps over mov when possible
  2020-04-29 21:54 [Bug rtl-optimization/94863] New: Failure to use blendps over mov when possible gabravier at gmail dot com
@ 2020-04-29 22:05 ` gabravier at gmail dot com
  2020-04-30  6:57 ` [Bug target/94863] " rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: gabravier at gmail dot com @ 2020-04-29 22:05 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #1 from Gabriel Ravier <gabravier at gmail dot com> ---
Note: The given outputs for LLVM and GCC are when compiling with `-O3 -msse4.1`


* [Bug target/94863] Failure to use blendps over mov when possible
  2020-04-29 21:54 [Bug rtl-optimization/94863] New: Failure to use blendps over mov when possible gabravier at gmail dot com
  2020-04-29 22:05 ` [Bug rtl-optimization/94863] " gabravier at gmail dot com
@ 2020-04-30  6:57 ` rguenth at gcc dot gnu.org
  2020-04-30  7:52 ` gabravier at gmail dot com
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-30  6:57 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Throughput aside, what do the port allocation and latency figures look like?
That said, GCC usually sides with the smaller instruction encoding when
latency doesn't differ. We usually don't look at throughput, since throughput
figures are of little use when looking at single instructions (and our
decision is made per individual instruction). Modeling the alternatives
during scheduling might make more sense, but that's not implemented.


* [Bug target/94863] Failure to use blendps over mov when possible
  2020-04-29 21:54 [Bug rtl-optimization/94863] New: Failure to use blendps over mov when possible gabravier at gmail dot com
  2020-04-29 22:05 ` [Bug rtl-optimization/94863] " gabravier at gmail dot com
  2020-04-30  6:57 ` [Bug target/94863] " rguenth at gcc dot gnu.org
@ 2020-04-30  7:52 ` gabravier at gmail dot com
  2021-04-26  1:18 ` pinskia at gcc dot gnu.org
  2024-04-14  2:28 ` pinskia at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: gabravier at gmail dot com @ 2020-04-30  7:52 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #3 from Gabriel Ravier <gabravier at gmail dot com> ---
For binary size, `movsd` takes 4 bytes and `blendps` takes 6 bytes.
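
For reference, the byte counts can be checked against the encodings of the
two instructions from the outputs in the report. The bytes below are my
reading of the instruction formats for `movsd xmm0, xmm1` and
`blendps xmm0, xmm1, 3` and should be confirmed against an assembler listing;
the array and function names are made up for the sketch:

```c
#include <stddef.h>

/* movsd xmm0, xmm1: F2 0F 10 /r, with ModRM C1 (reg = xmm0, rm = xmm1). */
static const unsigned char movsd_xmm0_xmm1[] = { 0xF2, 0x0F, 0x10, 0xC1 };

/* blendps xmm0, xmm1, 3: 66 0F 3A 0C /r ib, with ModRM C1 and immediate 03. */
static const unsigned char blendps_xmm0_xmm1_3[] = {
    0x66, 0x0F, 0x3A, 0x0C, 0xC1, 0x03
};

/* Extra code bytes paid for each movsd replaced by blendps. */
static size_t blendps_size_penalty(void)
{
    return sizeof blendps_xmm0_xmm1_3 - sizeof movsd_xmm0_xmm1;
}
```

The difference comes from the three-byte opcode map (`0F 3A`) and the
immediate byte that `blendps` needs.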

The port allocations for the instructions are as follows (same formatting as
for the throughputs):

Wolfdale: p5, p015
Nehalem: p5, p5
Westmere: p5, p5
Sandy Bridge: p05, p5
Ivy Bridge: p05, p5
Haswell: p015, p5
Broadwell: p015, p5
Skylake: p015, p5
Skylake-X: p015, p5
Kaby Lake: p015, p5
Coffee Lake: p015, p5
Cannon Lake: p015, p015
Ice Lake: p015, p015
Zen+: fp01, fp0123
Zen 2: fp013, fp0123

A value such as "p015" means that the instruction can be executed on port 0, 1
or 5. Also, on all of these architectures both instructions take a single uop.
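
The port lists line up with the throughput figures given earlier in the
thread. As a rough sketch, assuming a single-uop instruction whose only
bottleneck is the number of ports it can issue to, the best-case reciprocal
throughput is one over the port count. The helper names are invented for the
illustration:

```c
#include <ctype.h>

/* Count the ports named in a string such as "p015" or "fp0123":
   every digit stands for one port. */
static int port_count(const char *ports)
{
    int n = 0;
    for (; *ports; ports++)
        if (isdigit((unsigned char)*ports))
            n++;
    return n;
}

/* Best-case reciprocal throughput in cycles per instruction, assuming one
   uop and no bottleneck other than the ports. Returns 0.0 when the string
   names no port. */
static double best_recip_throughput(const char *ports)
{
    int n = port_count(ports);
    return n > 0 ? 1.0 / n : 0.0;
}
```

For example, "p5" gives 1, "p05" gives 0.5, "p015" gives about 0.33 and
"fp0123" gives 0.25, which matches the throughput figures quoted for the
corresponding architectures.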

The latency of both `blendps` and `movsd` is 1 on every single architecture I
could test.

Final note: the numbers are specifically for the `blendps xmm, xmm, imm8` and
the `movsd xmm, xmm` forms of those instructions.


* [Bug target/94863] Failure to use blendps over mov when possible
  2020-04-29 21:54 [Bug rtl-optimization/94863] New: Failure to use blendps over mov when possible gabravier at gmail dot com
                   ` (2 preceding siblings ...)
  2020-04-30  7:52 ` gabravier at gmail dot com
@ 2021-04-26  1:18 ` pinskia at gcc dot gnu.org
  2024-04-14  2:28 ` pinskia at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-04-26  1:18 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement


* [Bug target/94863] Failure to use blendps over mov when possible
  2020-04-29 21:54 [Bug rtl-optimization/94863] New: Failure to use blendps over mov when possible gabravier at gmail dot com
                   ` (3 preceding siblings ...)
  2021-04-26  1:18 ` pinskia at gcc dot gnu.org
@ 2024-04-14  2:28 ` pinskia at gcc dot gnu.org
  4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-14  2:28 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94863

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Gabriel Ravier from comment #1)
> Note: The given outputs for LLVM and GCC are when compiling with `-O3
> -msse4.1`

I think you have the opposite meaning with respect to `-msse4.1` here. They are
the same at -O3, but LLVM produces blendps with `-msse4.1`.

