[Bug target/107916] New: PPC VSX code generation for OpenZFS

public inbox for gcc-bugs@sourceware.org

* [Bug target/107916] New: PPC VSX code generation for OpenZFS
From: dje at gcc dot gnu.org @ 2022-11-29 14:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

            Bug ID: 107916
           Summary: PPC VSX code generation for OpenZFS
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dje at gcc dot gnu.org
                CC: bergner at gcc dot gnu.org, segher at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc64le-*-linux

https://github.com/openzfs/zfs/pull/14234

GCC codegen https://gcc.godbolt.org/z/bhPo9sWsx

Clang codegen https://gcc.godbolt.org/z/4rTEe3WMG

Clang's inner loop is relatively compact and efficient:
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        lxvd2x 1, 0, 4
        addi 4, 4, 16
        xxswapd 1, 1
        xxmrghw 40, 0, 1
        xxmrglw 41, 0, 1
        vaddudm 7, 7, 8
        vaddudm 6, 6, 9
        vaddudm 1, 7, 1
        vaddudm 5, 6, 5
        vaddudm 0, 1, 0
        vaddudm 4, 5, 4
        vaddudm 3, 0, 3
        vaddudm 2, 4, 2
        bdnz .LBB0_2

GCC is rather less efficient.


* [Bug target/107916] PPC VSX code generation for OpenZFS
From: dje at gcc dot gnu.org @ 2022-11-29 14:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

David Edelsohn <dje at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-11-29
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from David Edelsohn <dje at gcc dot gnu.org> ---
Confirmed.


* [Bug middle-end/107916] vector_size(32) is inefficient for VSX on powerpc64
From: pinskia at gcc dot gnu.org @ 2022-11-29 14:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|PPC VSX code generation for |vector_size(32) is
                   |OpenZFS                     |inefficient for VSX on
                   |                            |powerpc64
          Component|target                      |middle-end

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Reduced testcase:
```
#include <stdint.h>

typedef uint32_t u32x4 __attribute__ ((vector_size (16)));

typedef uint32_t u32x8 __attribute__ ((vector_size (32)));
typedef uint64_t u64x4 __attribute__ ((vector_size (32)));

#pragma GCC push_options

#if defined(__x86_64__)

#ifdef __clang_major__
#pragma clang attribute push(__attribute__((target("avx2"))), \
  apply_to = function)
#else
#pragma GCC target ("avx2")
#endif

#elif defined(__powerpc64__)
#ifdef __clang_major__
#pragma clang attribute push(__attribute__((target("vsx,block-ops-unaligned-vsx,power8-vector"))), \
  apply_to = function)
#else
#pragma GCC target ("vsx,block-ops-unaligned-vsx,power8-vector,power9-vector")
#endif

#endif

void f(int n, u32x8 *a, u32x8 *b)
{
  u32x8 c = {0};
  for(int i = 0; i < n; i++)
     c+=*a;
  *b += c;
}
#ifdef __clang_major__
#if defined(__x86_64__) || defined(__powerpc64__)
#pragma clang attribute pop
#endif
#else
#pragma GCC pop_options
#endif
```
Basically, what is going wrong is that c is being pushed to the stack. I had
expected c's PHI node to be split during vector lowering.
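
For illustration only, here is a hand-split variant of the loop, assuming the
target supports 16-byte vectors natively. The name `f_split` and the use of
memcpy to move the two halves are my own choices, not something from the
OpenZFS code or the testcase. It carries the accumulator as two 16-byte
values, which is roughly the shape one would hope lowering could produce on
its own. Whether it actually avoids the stack round trip on a given target
and GCC version would need checking.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32x4 __attribute__ ((vector_size (16)));
typedef uint32_t u32x8 __attribute__ ((vector_size (32)));

/* Hypothetical hand-split version of f(): same arithmetic, but the
   loop-carried accumulator is two 16-byte vectors instead of one
   32-byte vector.  */
void f_split(int n, const u32x8 *a, u32x8 *b)
{
  u32x4 c_lo = {0}, c_hi = {0};

  for (int i = 0; i < n; i++)
    {
      u32x4 a_lo, a_hi;
      /* memcpy avoids casting between differently sized vector types.  */
      memcpy(&a_lo, a, sizeof a_lo);
      memcpy(&a_hi, (const char *)a + sizeof a_lo, sizeof a_hi);
      c_lo += a_lo;
      c_hi += a_hi;
    }

  /* *b += c, done one half at a time.  */
  u32x4 b_lo, b_hi;
  memcpy(&b_lo, b, sizeof b_lo);
  memcpy(&b_hi, (const char *)b + sizeof b_lo, sizeof b_hi);
  b_lo += c_lo;
  b_hi += c_hi;
  memcpy(b, &b_lo, sizeof b_lo);
  memcpy((char *)b + sizeof b_lo, &b_hi, sizeof b_hi);
}
```

This should give the same result as f() for the same inputs; only the type of
the value carried around the loop differs.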


* [Bug middle-end/107916] vector_size(32) is inefficient for VSX on powerpc64
From: pinskia at gcc dot gnu.org @ 2022-11-29 14:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
aarch64 has a similar issue too:
.L3:
        add     w1, w1, 1
        add     v0.4s, v5.4s, v2.4s
        add     v1.4s, v4.4s, v3.4s
        mov     v2.16b, v0.16b
        mov     v3.16b, v1.16b
        cmp     w0, w1
        bne     .L3
Though it is not as bad: there are just extra moves inside the loop, as there
is an OI mode there.

This is a generic vect lowering issue I think.


* [Bug middle-end/107916] vector_size(32) is inefficient for VSX on powerpc64
From: pinskia at gcc dot gnu.org @ 2022-11-29 15:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Reduced even further; compiling with just `-O2 -mvsx` is enough to show the
issue:
```
typedef unsigned u32x8 __attribute__ ((vector_size (32)));

void f(int n, u32x8 *a, u32x8 *b)
{
  u32x8 c = {0};
  for(int i = 0; i < n; i++)
     c+=*a;
  *b += c;
}
```

With the above you can see the issue on x86_64 with just -O2 (not turning on
AVX 512 or anything):
.L3:
        movdqa  xmm4, XMMWORD PTR [rsp-32]
        movdqa  xmm5, XMMWORD PTR [rsp-16]
        add     eax, 1
        paddd   xmm4, xmm2
        paddd   xmm5, xmm3
        movaps  XMMWORD PTR [rsp-32], xmm4
        movaps  XMMWORD PTR [rsp-16], xmm5
        cmp     edi, eax
        jne     .L3

See the extra load/stores.


* [Bug middle-end/107916] bigger vector_size than the target can handle causes extra load/stores inside loops
From: rguenth at gcc dot gnu.org @ 2022-11-30  8:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unknown                     |13.0

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Vector lowering only lowers "operations"; it doesn't touch data transfer,
which means {reg,mem} <-> {mem,reg} copies, even when performed as part of
PHI node copies.  In the end this means variables with unsupported vector
modes are expanded to stack variables.  Note there's a later forwprop pass
which deals with the loads/stores in most cases (but that's really an
afterthought); nothing handles the (loop) PHI node case, so we end up with

  <bb 4> [local count: 955630225]:
  # c_15 = PHI <c_12(4), { 0, 0, 0, 0, 0, 0, 0, 0 }(3)>
  # i_17 = PHI <i_13(4), 0(3)>
  _4 = BIT_FIELD_REF <c_15, 128, 0>;
  _6 = _4 + _5;
  _18 = BIT_FIELD_REF <c_15, 128, 128>;
  _19 = _14 + _18;
  c_12 = {_6, _19};
  i_13 = i_17 + 1;
  if (n_7(D) != i_13)
    goto <bb 4>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 5> [local count: 118111600]:
  # c_16 = PHI <c_12(4), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>

Vector lowering would need to work more like Complex lowering to improve
things here.  I'm not sure if stmt-by-stmt lowering of PHIs and
other reg-reg copies will give the desired results (esp. when backedges
are involved).
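
As a source-level analogy only (hand-written, not compiler output, and not a
description of how the pass is implemented), the two functions below show the
kind of decomposition meant by treating a loop-carried value componentwise.
The first carries one complex accumulator around the loop; the second carries
its real and imaginary parts as two separate scalars, so each part gets its
own loop-carried value. The function names are made up for this sketch, and
`__real__` / `__imag__` are GNU C extensions.

```c
/* One complex accumulator carried around the loop.  */
double _Complex sum_complex(int n, const double _Complex *a)
{
  double _Complex c = 0;
  for (int i = 0; i < n; i++)
    c += a[i];
  return c;
}

/* Componentwise form: the real and imaginary parts are separate
   loop-carried scalars, recombined only after the loop.  */
double _Complex sum_lowered(int n, const double _Complex *a)
{
  double c_re = 0.0, c_im = 0.0;
  for (int i = 0; i < n; i++)
    {
      c_re += __real__ a[i];
      c_im += __imag__ a[i];
    }
  double _Complex r;
  __real__ r = c_re;
  __imag__ r = c_im;
  return r;
}
```

The vector case would presumably want the same treatment, with the two
128-bit halves of c_15 playing the role of the real and imaginary parts.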


* [Bug middle-end/107916] bigger vector_size than the target can handle causes extra load/stores inside loops
From: pinskia at gcc dot gnu.org @ 2024-04-03 16:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 114570 has been marked as a duplicate of this bug. ***


* [Bug middle-end/107916] bigger vector_size than the target can handle causes extra load/stores inside loops
From: pinskia at gcc dot gnu.org @ 2024-04-03 16:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ajidala at gmail dot com

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 100745 has been marked as a duplicate of this bug. ***
