[Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd
@ 2024-05-01 16:05 lee.imple at gmail dot com
  2024-05-01 17:21 ` [Bug target/114908] " pinskia at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: lee.imple at gmail dot com @ 2024-05-01 16:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

            Bug ID: 114908
           Summary: fails to optimize avx2 in-register permute written
                    with std::experimental::simd
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lee.imple at gmail dot com
  Target Milestone: ---

I am trying to write simd code with std::experimental::simd.
Here is the same function written in both std::experimental::simd and GNU
vector extension versions (available online at https://godbolt.org/z/dc169rY3o
).
The purpose is to permute the register from [w, x, y, z] into [0, w, x, y].

```c++
#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

using data_t = std::uint64_t;
constexpr std::size_t data_size = 4;

template <std::size_t N>
using simd_of = std::experimental::simd<data_t,
std::experimental::simd_abi::deduce_t<data_t, N>>;
using simd_t = simd_of<data_size>;

template <std::size_t N>
constexpr simd_of<N> zero = {};

// stdx version
simd_t permute_simd(simd_t data) {
    auto [carry, _] = split<data_size-1, 1>(data);
    return concat(zero<1>, carry);
}



typedef data_t vector_t [[gnu::vector_size(data_size * sizeof(data_t))]];
constexpr vector_t zero_v = {0};

// gnu vector extension version
vector_t permute_vector(vector_t data) {
    return __builtin_shufflevector(data, zero_v, 4, 0, 1, 2);
}
```

The code is compiled with the options `-O3 -march=x86-64-v3 -std=c++20`.
Although they should have the same functionality, generated assembly (by GCC)
is so different.

```asm
permute_simd(std::experimental::parallelism_v2::simd<unsigned long,
std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> >):
  pushq %rbp
  vpxor %xmm1, %xmm1, %xmm1
  movq %rsp, %rbp
  andq $-32, %rsp
  subq $8, %rsp
  vmovdqa %ymm0, -120(%rsp)
  vmovdqa %ymm1, -56(%rsp)
  movq -104(%rsp), %rax
  vmovdqa %xmm0, -56(%rsp)
  movq -48(%rsp), %rdx
  movq $0, -88(%rsp)
  movq %rax, -40(%rsp)
  movq -56(%rsp), %rax
  vmovdqa -56(%rsp), %ymm2
  vmovq %rax, %xmm0
  vmovdqa %ymm2, -24(%rsp)
  movq -8(%rsp), %rax
  vpinsrq $1, %rdx, %xmm0, %xmm0
  vmovdqu %xmm0, -80(%rsp)
  movq %rax, -64(%rsp)
  vmovdqa -88(%rsp), %ymm0
  leave
  ret
permute_vector(unsigned long __vector(4)):
  vpxor %xmm1, %xmm1, %xmm1
  vpermq $144, %ymm0, %ymm0
  vpblendd $3, %ymm1, %ymm0, %ymm0
  ret
```

However, Clang can optimize `permute_simd` into the same assembly as
`permute_vector`, so I think, instead of a bug in the std::experimental::simd,
it is a missed optimization in GCC.

```asm
permute_simd(std::experimental::parallelism_v2::simd<unsigned long,
std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> >): #
@permute_simd(std::experimental::parallelism_v2::simd<unsigned long,
std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> >)
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0         # ymm0 =
ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
permute_vector(unsigned long __vector(4)):                #
@permute_vector(unsigned long __vector(4))
        vpermpd $144, %ymm0, %ymm0              # ymm0 = ymm0[0,0,1,2]
        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0         # ymm0 =
ymm1[0,1],ymm0[2,3,4,5,6,7]
        retq
```

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug target/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
@ 2024-05-01 17:21 ` pinskia at gcc dot gnu.org
  2024-05-01 17:31 ` [Bug tree-optimization/114908] " pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-01 17:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           Keywords|                            |missed-optimization

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
  2024-05-01 17:21 ` [Bug target/114908] " pinskia at gcc dot gnu.org
@ 2024-05-01 17:31 ` pinskia at gcc dot gnu.org
  2024-05-02 11:12 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-01 17:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |tree-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-05-01

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
  __r = { 0, 0, 0, 0 };
  __builtin_memcpy (&__r, &data, 24);
  __r.2_4 = __r;
  _5 = VIEW_CONVERT_EXPR<_Rp>(__r.2_4);

That is a fancy way of doing a PERM like:
  _5 = VEC_PERM_EXPR <{ 0, 0, 0, 0 }, data_1(D), { 4, 5, 6, 0 }>;



And then it does:
  MEM[(struct _Head_base *)&D.183459 +
32B]._M_head_impl._M_data.D.184855._M_data = _5;
  MEM <unsigned long> [(struct simd *)&__r] = 0;
  __builtin_memcpy (&MEM <struct _MemberType> [(void *)&__r + 8B], &MEM[(const
struct simd &)&D.183459 + 32]._M_data, 24);
  SR.136_19 = MEM[(struct simd *)&__r];
  __r ={v} {CLOBBER(eos)};
  MEM[(struct simd *)&D.200690] = SR.136_19;

Which is another fancy way of doing a PERM like:
  SR.136_19 = VEC_PERM_EXPR <{ 0, 0, 0, 0 }, _5, { 0, 4, 5, 6 }>;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
  2024-05-01 17:21 ` [Bug target/114908] " pinskia at gcc dot gnu.org
  2024-05-01 17:31 ` [Bug tree-optimization/114908] " pinskia at gcc dot gnu.org
@ 2024-05-02 11:12 ` rguenth at gcc dot gnu.org
  2024-05-06  9:05 ` mkretz at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-02 11:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mkretz at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
             Target|x86-64-v3                   |x86_64-*-*

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The memcpy() calls are definitely a hindrance.  I suppose that
update-address-taken could replace some of them with BIT_INSERT_EXPRs but then
it doesn't
handle any calls right now.  Replacing the memcpy on its own would be possible
(special-casing just the "sub-vector" case) like

  __builtin_memcpy (&__r, &data, 24);

to

  _1 = __r;
  _2 = data;
  _3 = VEC_PERM <_2, _1, {0, 1, 2, 7 }>;
  __r = _3;

or if copying a single element using BIT_INSERT_EXPR.  OTOH that's not
good if __r stays in memory (the whole vector store might be good to
avoid STLF fails, but the read will be bad for the same reason).

The update-address-taken pass would know __r and data become registers.
We already have a similar case involving ATOMIC_COMPARE_EXCHANGE that
has delayed processing requring register arguments.  It might or might
not be a good example how to deal with this.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
                   ` (2 preceding siblings ...)
  2024-05-02 11:12 ` rguenth at gcc dot gnu.org
@ 2024-05-06  9:05 ` mkretz at gcc dot gnu.org
  2024-05-06  9:16 ` rguenther at suse dot de
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-05-06  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #3 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
The stdx::simd implementation in this area is old and mainly tuned to be
correct. I can rewrite the split and concat implementation to use
__builtin_shufflevector (which wasn't available in GCC at the time when I
originally implemented it). Doing so I can resolve this issue.

How do you want to handle this? Because it would certainly be nice if the
compiler can optimize this in the same way as Clang can. Should I try to come
up with a testcase that doesn't need stdx::simd and then improve stdx::simd
independently?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
                   ` (3 preceding siblings ...)
  2024-05-06  9:05 ` mkretz at gcc dot gnu.org
@ 2024-05-06  9:16 ` rguenther at suse dot de
  2024-05-06 10:00 ` mkretz at gcc dot gnu.org
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: rguenther at suse dot de @ 2024-05-06  9:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #4 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 6 May 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
> 
> --- Comment #3 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
> The stdx::simd implementation in this area is old and mainly tuned to be
> correct. I can rewrite the split and concat implementation to use
> __builtin_shufflevector (which wasn't available in GCC at the time when I
> originally implemented it). Doing so I can resolve this issue.
> 
> How do you want to handle this? Because it would certainly be nice if the
> compiler can optimize this in the same way as Clang can. Should I try to come
> up with a testcase that doesn't need stdx::simd and then improve stdx::simd
> independently?

Yes, that would be nice (best the testcase w/o stdx::simd being a C
testcase even, no C++ abstraction).

I do think stdx::simd should be improved to use __builtin_shufflevector
though.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
                   ` (4 preceding siblings ...)
  2024-05-06  9:16 ` rguenther at suse dot de
@ 2024-05-06 10:00 ` mkretz at gcc dot gnu.org
  2024-05-06 10:12 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-05-06 10:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #5 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
https://godbolt.org/z/P6cfbjT9f

#include <stdint.h>

typedef uint64_t T;

typedef T V [[gnu::vector_size(32)]];

typedef struct simd4 {
    V data;
} simd4;

typedef struct simd1 {
    T data;
} simd1;

typedef struct tup3_1 {
    simd4 a;
    simd1 b;
} tup3_1;

simd1 load1(const T* ptr) {
    simd1 ret = {ptr[0]};
    return ret;
}

simd4 load3(const T* ptr) {
    simd4 ret = {};
    __builtin_memcpy(&ret, ptr, 3 * sizeof(T));
    return ret;
}

tup3_1 split3_1(simd4 x) {
    const T* ptr = (T*)&x;
    tup3_1 ret = {load3(ptr), load1(ptr + 3)};
    return ret;
}

simd4 concat1_3(simd1 a, simd4 b) {
    simd4 ret = {};
    char* ptr = (char*)&ret;
    __builtin_memcpy(ptr, &a, sizeof(T));
    __builtin_memcpy(ptr + sizeof(T), &b, 3 * sizeof(T));
    return ret;
}

simd4 perm(simd4 data) {
    tup3_1 carry = split3_1(data);
    simd1 zero = {};
    return concat1_3(zero, carry.a);
}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
                   ` (5 preceding siblings ...)
  2024-05-06 10:00 ` mkretz at gcc dot gnu.org
@ 2024-05-06 10:12 ` rguenth at gcc dot gnu.org
  2024-05-06 10:30 ` mkretz at gcc dot gnu.org
  2024-05-07  3:09 ` lee.imple at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-06 10:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks, and it might be enough to handle

typedef unsigned long V [[gnu::vector_size(32)]];

V load3(const unsigned long* ptr)
{
  V ret = {};
  __builtin_memcpy(&ret, ptr, 3 * sizeof(unsigned long));
  return ret;
}

where with -O2 .optimized still has

  <bb 2> [local count: 1073741824]:
  ret = { 0, 0, 0, 0 };
  __builtin_memcpy (&ret, ptr_3(D), 24);
  _5 = ret;
  ret ={v} {CLOBBER(eos)};
  return _5;

and thus 'ret' not promoted to a register.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
                   ` (6 preceding siblings ...)
  2024-05-06 10:12 ` rguenth at gcc dot gnu.org
@ 2024-05-06 10:30 ` mkretz at gcc dot gnu.org
  2024-05-07  3:09 ` lee.imple at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: mkretz at gcc dot gnu.org @ 2024-05-06 10:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #7 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
I suspect resolving this is only one part of it. But I'm happy to be proven
wrong. :)

I opened PR114958 to track the simd implementation change.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd
  2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
                   ` (7 preceding siblings ...)
  2024-05-06 10:30 ` mkretz at gcc dot gnu.org
@ 2024-05-07  3:09 ` lee.imple at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: lee.imple at gmail dot com @ 2024-05-07  3:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #8 from Imple Lee <lee.imple at gmail dot com> ---
I tried another way to permute the register.
Although GCC does generate simd instructions, the generated code is
sub-optimal.
I opened PR114966 for that.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-05-07  3:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-01 16:05 [Bug target/114908] New: fails to optimize avx2 in-register permute written with std::experimental::simd lee.imple at gmail dot com
2024-05-01 17:21 ` [Bug target/114908] " pinskia at gcc dot gnu.org
2024-05-01 17:31 ` [Bug tree-optimization/114908] " pinskia at gcc dot gnu.org
2024-05-02 11:12 ` rguenth at gcc dot gnu.org
2024-05-06  9:05 ` mkretz at gcc dot gnu.org
2024-05-06  9:16 ` rguenther at suse dot de
2024-05-06 10:00 ` mkretz at gcc dot gnu.org
2024-05-06 10:12 ` rguenth at gcc dot gnu.org
2024-05-06 10:30 ` mkretz at gcc dot gnu.org
2024-05-07  3:09 ` lee.imple at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).