[Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf
@ 2020-09-03 12:58 gabravier at gmail dot com
  2020-09-03 13:03 ` [Bug target/96918] " rguenth at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-09-03 12:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

            Bug ID: 96918
           Summary: Failure to optimize vector shift left+shift right+or
                    to pshuf
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

typedef __INT16_TYPE__ v8i16 __attribute__((vector_size(16)));
typedef char v16i8 __attribute__((vector_size(16)));

v8i16 bswap_epi16(v8i16 x)
{
        return __builtin_ia32_psllwi128(x, 8) | __builtin_ia32_psrlwi128(x, 8);
}

This can be optimized to `return (v8i16)__builtin_ia32_pshufb128((v16i8)x,
v16i8{1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14});`. This
transformation is done by LLVM, but not by GCC.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
@ 2020-09-03 13:03 ` rguenth at gcc dot gnu.org
  2020-09-03 13:13 ` gabravier at gmail dot com
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-03 13:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note it's not so interesting to handle testcases with __builtin_ia32_* which
are internal but I guess the same can be replicated using intrinsics?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
  2020-09-03 13:03 ` [Bug target/96918] " rguenth at gcc dot gnu.org
@ 2020-09-03 13:13 ` gabravier at gmail dot com
  2020-09-03 13:21 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-09-03 13:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #2 from Gabriel Ravier <gabravier at gmail dot com> ---
Yes, you can reproduce this with _mm_shuffle_epi8, _mm_slli_epi16 and
_mm_srli_epi16. I'm assuming GCC developers are more familiar with the internal
intrinsics than with the Intel-provided intrinsics, so I didn't bother
converting it to Intel intrinsics, is that wrong ?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
  2020-09-03 13:03 ` [Bug target/96918] " rguenth at gcc dot gnu.org
  2020-09-03 13:13 ` gabravier at gmail dot com
@ 2020-09-03 13:21 ` jakub at gcc dot gnu.org
  2020-09-03 14:30 ` gabravier at gmail dot com
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-09-03 13:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The __builtin_ia32_* builtins are implementation detail, we don't guarantee
them between releases, they can be added or removed, the only supported thing
are the _mm* intrinsics in the headers.  E.g. sometimes we replace a builtin
with just generic C code if it can do e.g. with generic vectors what the
builtin was doing before.
So, for testcases, unless we test for something that can't be reproduced with
the intrinsics and can be only reproduced with the builtins, we strongly prefer
using the supported intrinsics.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (2 preceding siblings ...)
  2020-09-03 13:21 ` jakub at gcc dot gnu.org
@ 2020-09-03 14:30 ` gabravier at gmail dot com
  2020-09-03 20:41 ` glisse at gcc dot gnu.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: gabravier at gmail dot com @ 2020-09-03 14:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #4 from Gabriel Ravier <gabravier at gmail dot com> ---
Oh, now I realise that those headers are actually in fact part of directly
GCC-provided headers and not some external package. I must say though, if those
functions are internal implementation detail and not intended to be stable, I
wonder why they are mentioned so many times in GCC docs (not gccint, the normal
manual), such as here:

'-mcrc32'
     This option enables built-in functions '__builtin_ia32_crc32qi',
     '__builtin_ia32_crc32hi', '__builtin_ia32_crc32si' and
     '__builtin_ia32_crc32di' to generate the 'crc32' machine
     instruction.

or here:

 The following built-in function is always available.

'void __builtin_ia32_pause (void)'
     Generates the 'pause' machine instruction with a compiler memory
     barrier.

or here:

 The following built-in functions are made available by '-mmmx'.  All of
them generate the machine instruction that is part of the name.

     v8qi __builtin_ia32_paddb (v8qi, v8qi)
     v4hi __builtin_ia32_paddw (v4hi, v4hi)
     v2si __builtin_ia32_paddd (v2si, v2si)
     v8qi __builtin_ia32_psubb (v8qi, v8qi)
     v4hi __builtin_ia32_psubw (v4hi, v4hi)
[long list continues, and this is also available for other switches like it]

Which seems to imply that these aren't implementation detail, unless them being
part of the manual is accidental. I don't mind using the Intel intrinsics, but
I still find this quite odd if they are intended to be internal. Could you
clarify this ?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (3 preceding siblings ...)
  2020-09-03 14:30 ` gabravier at gmail dot com
@ 2020-09-03 20:41 ` glisse at gcc dot gnu.org
  2020-09-03 20:57 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: glisse at gcc dot gnu.org @ 2020-09-03 20:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #5 from Marc Glisse <glisse at gcc dot gnu.org> ---
typedef unsigned short v8i16 __attribute__((vector_size(16)));

v8i16 bswap_epi16(v8i16 x)
{
        return (x << 8) | (x >> 8);
}

We do recognize a rotate already in GENERIC

  return x r<< 8;

But this is expanded to

        movdqa  %xmm0, %xmm1
        psrlw   $8, %xmm0
        psllw   $8, %xmm1
        por     %xmm1, %xmm0

probably the target could advertise a rotate insn for that mode, restricted to
an argument of 8?

IIRC, I didn't use vector extensions for the corresponding shift intrinsics
because for large shift amounts they set the result to 0. But for a constant
scalar, we could lower the builtin to a shift (or fold to 0).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (4 preceding siblings ...)
  2020-09-03 20:41 ` glisse at gcc dot gnu.org
@ 2020-09-03 20:57 ` jakub at gcc dot gnu.org
  2020-09-04  6:49 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: jakub at gcc dot gnu.org @ 2020-09-03 20:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Or the generic code could try to expand vector rotates by multiplies of
BITS_PER_UNIT as vector permutations, perhaps only if there is no optab for it.
 Or trying to expand both permutation and rotate and determine at expansion
time using costs which sequence is cheaper.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (5 preceding siblings ...)
  2020-09-03 20:57 ` jakub at gcc dot gnu.org
@ 2020-09-04  6:49 ` rguenth at gcc dot gnu.org
  2021-08-03  5:10 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-09-04  6:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #6)
> Or the generic code could try to expand vector rotates by multiplies of
> BITS_PER_UNIT as vector permutations, perhaps only if there is no optab for
> it.  Or trying to expand both permutation and rotate and determine at
> expansion time using costs which sequence is cheaper.

I guess for the specific case we could think of what is canonical for
GIMPLE and then deal with that during RTL expansion as you say.

The question is whether vector rotate or permutes are more likely to
be combined with earlier/later stmts.  Guess there's no clear answer to
that which means there would need to be special handling anyway of
which follows that we might stick to what the source did.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (6 preceding siblings ...)
  2020-09-04  6:49 ` rguenth at gcc dot gnu.org
@ 2021-08-03  5:10 ` pinskia at gcc dot gnu.org
  2021-08-03  5:36 ` crazylht at gmail dot com
  2021-08-03  5:51 ` crazylht at gmail dot com
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-03  5:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-08-03
             Status|UNCONFIRMED                 |NEW

--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (7 preceding siblings ...)
  2021-08-03  5:10 ` pinskia at gcc dot gnu.org
@ 2021-08-03  5:36 ` crazylht at gmail dot com
  2021-08-03  5:51 ` crazylht at gmail dot com
  9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2021-08-03  5:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
Or the backend add combine helper insn to match

Failed to match this instruction:
(set (reg:V8HI 90)
    (rotate:V8HI (reg:V8HI 91)
        (const_int 8 [0x8])))

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Bug target/96918] Failure to optimize vector shift left+shift right+or to pshuf
  2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
                   ` (8 preceding siblings ...)
  2021-08-03  5:36 ` crazylht at gmail dot com
@ 2021-08-03  5:51 ` crazylht at gmail dot com
  9 siblings, 0 replies; 11+ messages in thread
From: crazylht at gmail dot com @ 2021-08-03  5:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96918

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #9)
> Or the backend add combine helper insn to match
> 
> Failed to match this instruction:
> (set (reg:V8HI 90)
>     (rotate:V8HI (reg:V8HI 91)
>         (const_int 8 [0x8])))

letency of sequence in bswap_epi16 is 3, but 5 for vpshufb w/ memory operand.
it looks to me gcc's version is better.


bswap_epi16(short __vector(8)):
  vpsllw xmm1, xmm0, 8
  vpsrlw xmm0, xmm0, 8
  vpor xmm0, xmm1, xmm0
  ret

foo(char __vector(16)):
  vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
  ret

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-08-03  5:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 12:58 [Bug target/96918] New: Failure to optimize vector shift left+shift right+or to pshuf gabravier at gmail dot com
2020-09-03 13:03 ` [Bug target/96918] " rguenth at gcc dot gnu.org
2020-09-03 13:13 ` gabravier at gmail dot com
2020-09-03 13:21 ` jakub at gcc dot gnu.org
2020-09-03 14:30 ` gabravier at gmail dot com
2020-09-03 20:41 ` glisse at gcc dot gnu.org
2020-09-03 20:57 ` jakub at gcc dot gnu.org
2020-09-04  6:49 ` rguenth at gcc dot gnu.org
2021-08-03  5:10 ` pinskia at gcc dot gnu.org
2021-08-03  5:36 ` crazylht at gmail dot com
2021-08-03  5:51 ` crazylht at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).