[Bug target/108401] New: gcc defeats vector constant generation with intrinsics

public inbox for gcc-bugs@sourceware.org

* [Bug target/108401] New: gcc defeats vector constant generation with intrinsics
From: andysem at mail dot ru @ 2023-01-14  0:24 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

            Bug ID: 108401
           Summary: gcc defeats vector constant generation with intrinsics
           Product: gcc
           Version: 11.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Consider the following code:

#include <immintrin.h>

__m256i load_00FF()
{
    __m256i mm = _mm256_setzero_si256();
    return _mm256_srli_epi16(_mm256_cmpeq_epi64(mm, mm), 8);
}

This function generates a vector constant of alternating 0xFF and 0x00 bytes.
The code is written this way to avoid a load from memory, which may cause a
cache miss. The expected generated code is this:

        vpcmpeqq        ymm0, ymm0, ymm0
        vpsrlw  ymm0, ymm0, 8
        ret

which is almost exactly what gcc 8 generates (it uses vpcmpeqd instead of
vpcmpeqq, which is fine). However, gcc 9 through 11 generate a memory load
instead, defeating the attempt to avoid it:

        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        ret

and gcc 12 generates worse code:

        movabs  rax, 71777214294589695
        vmovq   xmm1, rax
        vpbroadcastq    ymm0, xmm1
        ret

In all cases, the compiler flags are: -O3 -march=haswell

Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9

I think the compiler should follow the intrinsics in the code more closely
since, despite the apparent equivalence, the choice of instructions can have
performance implications. The original code written by the developer is
better anyway, so it's not clear why the compiler is being so creative in this
case.
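
The relation between the intrinsic sequence and the immediate in the gcc 12
listing can be checked with plain scalar C. This is only a minimal sketch: the
helper names (srl16, pack4x16, load_00FF_qword, check_load_00FF) are made up for
illustration, and the code models a single 64-bit element of the 256-bit result
rather than calling the AVX2 intrinsics themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one 16-bit lane of a logical right shift (vpsrlw).
   The instruction zeroes the lane when the count exceeds 15. */
static uint16_t srl16(uint16_t lane, unsigned count)
{
    return count > 15 ? 0 : (uint16_t)(lane >> count);
}

/* Hypothetical helper: four identical 16-bit lanes packed into one qword. */
static uint64_t pack4x16(uint16_t lane)
{
    uint64_t v = lane;
    return v | (v << 16) | (v << 32) | (v << 48);
}

/* One 64-bit element of what load_00FF() computes:
   all-ones (the result of comparing a register with itself),
   then every 16-bit lane shifted right by 8. */
static uint64_t load_00FF_qword(void)
{
    uint16_t all_ones = 0xFFFF;
    return pack4x16(srl16(all_ones, 8));
}

/* Self-check: the decimal immediate from the gcc 12 listing is the same
   bit pattern as 0x00FF repeated in every 16-bit lane. */
static void check_load_00FF(void)
{
    assert(load_00FF_qword() == UINT64_C(0x00FF00FF00FF00FF));
    assert(load_00FF_qword() == UINT64_C(71777214294589695));
}
```

Calling check_load_00FF() passes silently, which confirms that the movabs
immediate and the cmpeq+shift sequence describe the same value.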


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: pinskia at gcc dot gnu.org @ 2023-01-16  5:06 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> and gcc 12 generates worse code:

It is not really worse; that depends on how fast moving between the register
sets is.


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: pinskia at gcc dot gnu.org @ 2023-01-16  5:23 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
r12-1958-gedafb35bdadf30 changed the behavior in GCC 12 to be better ...
(see the commit message, which shows that it is better than doing a memory load).


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: crazylht at gmail dot com @ 2023-01-16  5:28 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---

> and gcc 12 generates worse code:
> 
>         movabs  rax, 71777214294589695
>         vmovq   xmm1, rax
>         vpbroadcastq    ymm0, xmm1
>         ret
> 

This was done on purpose by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09, since a
microbenchmark shows that moving from an immediate is faster than loading from
memory.

> In all cases, the compiler flags are: -O3 -march=haswell
> 
> Code on godbolt.org: https://gcc.godbolt.org/z/sfT787PY9
> 
> I think the compiler should follow the intrinsics in the code more closely
> since, despite the apparent equivalence, the choice of instructions can have
> performance implications. The original code written by the developer is
> better anyway, so it's not clear why the compiler is being so creative in
> this case.


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: amonakov at gcc dot gnu.org @ 2023-01-16  7:00 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #3)
> > and gcc 12 generates worse code:
> > 
> >         movabs  rax, 71777214294589695
> >         vmovq   xmm1, rax
> >         vpbroadcastq    ymm0, xmm1
> >         ret
> > 
> 
> This was done on purpose by edafb35bdadf309ebb9d1eddc5549f9e1ad49c09, since a
> microbenchmark shows that moving from an immediate is faster than loading
> from memory.

But the bug is not asking you to reinstate loading from memory. The bug is
asking you to notice that the result can be constructed via cmpeq+psrlw, which
is even better than a broadcast (cmpeq with dst same as src is usually a
dependency-breaking instruction that does not occupy an execution port).


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: rguenth at gcc dot gnu.org @ 2023-01-16  7:47 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-16
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  We expand from

  return { 71777214294589695, 71777214294589695, 71777214294589695,
71777214294589695 };

where we could reduce the DImode broadcast to a HImode one (if that exists).
But sure, the x86 backend could generate this particular pattern the way the
intrinsics suggest.

I'll also note that -O0 produces quite bad code here.
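
The idea of reducing the DImode broadcast to a narrower one can be illustrated
with a small scalar check that finds the narrowest element a 64-bit constant
repeats. This is a sketch under assumptions, not GCC's implementation: the
function names narrowest_broadcast_width and check_narrowest_width are
hypothetical, and whether a matching narrow broadcast instruction is available
on the target is a separate question.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the narrowest element width in bits
   (8, 16, 32 or 64) such that `value` is that element repeated. */
static unsigned narrowest_broadcast_width(uint64_t value)
{
    unsigned width = 64;
    while (width > 8) {
        unsigned half = width / 2;
        uint64_t mask = (UINT64_C(1) << half) - 1;
        /* Invariant: value is a repetition of its low `width` bits, so
           comparing the two halves of that element is sufficient. */
        uint64_t lo = value & mask;
        uint64_t hi = (value >> half) & mask;
        if (lo != hi)
            break;
        width = half;
    }
    return width;
}

/* Self-check using the constant from the expanded vector in this comment. */
static void check_narrowest_width(void)
{
    /* 0x00FF repeated: a 16-bit (HImode) element is enough. */
    assert(narrowest_broadcast_width(UINT64_C(71777214294589695)) == 16);
    /* No repetition at all: the full 64-bit element is needed. */
    assert(narrowest_broadcast_width(UINT64_C(0x0123456789ABCDEF)) == 64);
}
```

For the constant in this bug the narrowest element is 16 bits wide, so a 32-bit
immediate would also be a valid, shorter encoding of the same pattern.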


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: andysem at mail dot ru @ 2023-01-16 10:04 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #6 from andysem at mail dot ru ---
(In reply to Andrew Pinski from comment #1)
> > and gcc 12 generates worse code:
> 
> It is not really worse; that depends on how fast moving between the
> register sets is.

I meant "worse" compared to vpcmpeq+vpsrlw pair.

(Side note about the broadcast version: it could have been smaller if it used a
32-bit constant and vpbroadcastd. vpcmpeq+vpsrlw would still be better in this
particular case, but if a broadcast is needed, code with a smaller footprint is
preferred.)


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: andysem at mail dot ru @ 2023-01-16 10:42 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #7 from andysem at mail dot ru ---
To be clear, I'm not asking the compiler to recognize the particular pattern of
alternating 0x00 and 0xFF bytes, because hardcoding this particular pattern
won't improve generated code in other cases.

Rather, I'm asking to tone down code transformations for intrinsics. If the
developer wrote a sequence of intrinsics to generate a constant, then they
probably wanted that sequence instead of a simple _mm_set1_epi32 or a load from
memory.

But, if you're going to improve constant generation, please make it so that it
can recognize not only the particular pattern described in this bug. More
importantly, it should recognize the all-ones case (as a single pcmpeq) as a
starting point. Then it can apply shifts to achieve the final result from the
all-ones vector - shifts of any width, length or direction, including
psrldq/pslldq. This would improve generated code in a wider range of cases.
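
The kind of recognition described above can be sketched for the simplest case:
a 16-bit lane value that is reachable from all-ones by one logical shift. This
is an illustration only, not a proposal for the actual backend code. The names
ones_shift_for_lane16 and check_ones_shift are hypothetical, and the sketch
ignores other lane widths and the byte shifts psrldq/pslldq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can `lane` be produced from 0xFFFF by a single
   logical shift of 0..15 bits? On success, *left gives the direction and
   *count the shift amount. Zero is deliberately not matched here, since
   it needs no all-ones starting point. */
static bool ones_shift_for_lane16(uint16_t lane, bool *left, unsigned *count)
{
    for (unsigned n = 0; n < 16; ++n) {
        if (lane == (uint16_t)(0xFFFFu >> n)) {
            *left = false;
            *count = n;
            return true;
        }
        if (lane == (uint16_t)(0xFFFFu << n)) {
            *left = true;
            *count = n;
            return true;
        }
    }
    return false;
}

/* Self-check with the lane value from this bug and two neighbours. */
static void check_ones_shift(void)
{
    bool left;
    unsigned count;
    /* 0x00FF: all-ones shifted right by 8, the pattern in this report. */
    assert(ones_shift_for_lane16(0x00FF, &left, &count) && !left && count == 8);
    /* 0xFF00: all-ones shifted left by 8. */
    assert(ones_shift_for_lane16(0xFF00, &left, &count) && left && count == 8);
    /* 0x0FF0: ones in the middle, not reachable with one shift. */
    assert(!ones_shift_for_lane16(0x0FF0, &left, &count));
}
```

A value matches exactly when its set bits form one contiguous run touching
either the low or the high end of the lane, which is why 0x0FF0 is rejected.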


* [Bug target/108401] gcc defeats vector constant generation with intrinsics
From: crazylht at gmail dot com @ 2023-01-17  5:42 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108401

--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---

> But, if you're going to improve constant generation, please make it so that
> it can recognize not only the particular pattern described in this bug. More
> importantly, it should recognize the all-ones case (as a single pcmpeq) as a
> starting point. Then it can apply shifts to achieve the final result from
> the all-ones vector - shifts of any width, length or direction, including
> psrldq/pslldq. This would improve generated code in a wider range of cases.
Yes, we will try to do that. Generally, folding intrinsics into compiler IR
helps performance, and for this case we need to optimize codegen for special
immediate broadcasts (all-ones + shift).
