public inbox for gcc-bugs@sourceware.org
* [Bug target/104151] New: x86: excessive code generated for 128-bit byteswap
From: nekotekina at gmail dot com @ 2022-01-20 23:24 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

            Bug ID: 104151
           Summary: x86: excessive code generated for 128-bit byteswap
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I noticed that gcc generates a redundant sequence of instructions for
code that does a 128-bit byteswap implemented with two 64-bit byteswap
intrinsics. I narrowed it down to something like this:

#include <cstdint>
#include <cstring>

__uint128_t bswap(__uint128_t a)
{
    std::uint64_t x[2];
    memcpy(x, &a, 16);
    std::uint64_t y[2];
    y[0] = __builtin_bswap64(x[1]);
    y[1] = __builtin_bswap64(x[0]);
    memcpy(&a, y, 16);
    return a;
}

Produces:
https://godbolt.org/z/hEsPqvhv3

        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        movdqa  xmm0, XMMWORD PTR [rsp-24]
        palignr xmm0, xmm0, 8
        movdqa  xmm1, xmm0
        pshufb  xmm1, XMMWORD PTR .LC0[rip]
        movaps  XMMWORD PTR [rsp-24], xmm1
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]
        ret

Expected (alternatively, for SIMD types a single pshufb; clang can do it):

        mov     rdx, rdi
        mov     rax, rsi
        bswap   rdx
        bswap   rax
        ret
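
For reference, GCC 11.1 and later also provide __builtin_bswap128, which
comment #11 below notes emits the optimal code; a minimal sketch of using it
directly (only an option where that builtin is available):

__uint128_t bswap128(__uint128_t a)
{
    // Requires GCC >= 11.1; expected to expand to the two-bswap sequence above.
    return __builtin_bswap128(a);
}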


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: pinskia at gcc dot gnu.org @ 2022-01-20 23:41 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-01-20
             Status|UNCONFIRMED                 |NEW
            Summary|x86: excessive code         |[9/10/11/12 Regression]
                   |generated for 128-bit       |x86: excessive code
                   |byteswap                    |generated for 128-bit
                   |                            |byteswap
     Ever confirmed|0                           |1
             Blocks|                            |101926
   Target Milestone|---                         |12.0
      Known to work|                            |6.1.0
          Component|target                      |middle-end
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
GCC 11 and before does this at -O2 (GCC 6 and before could do it at -O3 too):

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx

The reason is that the SLP vectorizer is turned on at -O2 in GCC 12.
We get:

  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR<uint128_t>(_11);


The expansion of this could be done using move instructions ....

I notice for aarch64, SLP kicks in even more and does the following:

        fmov    d0, x0
        fmov    v0.d[1], x1
        ext     v0.16b, v0.16b, v0.16b, #8
        rev64   v0.16b, v0.16b
        umov    x0, v0.d[0]
        umov    x1, v0.d[1]

This is true for -O2 -mavx too:

        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        vmovdqa xmm1, XMMWORD PTR [rsp-24]
        vpalignr        xmm0, xmm1, xmm1, 8
        vpshufb xmm2, xmm0, XMMWORD PTR .LC0[rip]
        vmovdqa XMMWORD PTR [rsp-24], xmm2
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]

There seem to be many different little regressions when handling this code.
But I think it all comes down to the modeling of arguments and the return
value at the GIMPLE level, which breaks this.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be
improved


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: crazylht at gmail dot com @ 2022-01-21  1:03 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
With -fno-tree-vectorize, gcc also produces optimal code.

        mov     rax, rsi
        mov     rdx, rdi
        bswap   rax
        bswap   rdx
        ret

I guess it's related to the vectorizer cost model.
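
A possible per-function workaround along the same lines (a sketch; it relies
on GCC's optimize attribute mapping "no-tree-vectorize" to -fno-tree-vectorize,
and the resulting codegen is of course not guaranteed):

#include <cstdint>
#include <cstring>

// Sketch: keep the scalar bswap path by disabling tree vectorization for
// just this function instead of the whole translation unit.
__attribute__((optimize("no-tree-vectorize")))
__uint128_t bswap_novec(__uint128_t a)
{
    std::uint64_t x[2];
    std::memcpy(x, &a, 16);
    std::uint64_t y[2] = { __builtin_bswap64(x[1]), __builtin_bswap64(x[0]) };
    std::memcpy(&a, y, 16);
    return a;
}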


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: crazylht at gmail dot com @ 2022-01-21  1:25 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
172_1 1 times scalar_store costs 12 in body
173_2 1 times scalar_store costs 12 in body
174__builtin_bswap64 (_8) 1 times scalar_stmt costs 4 in body
175__builtin_bswap64 (_10) 1 times scalar_stmt costs 4 in body
176BIT_FIELD_REF <a_3(D), 64, 64> 1 times scalar_stmt costs 4 in body
177BIT_FIELD_REF <a_3(D), 64, 0> 1 times scalar_stmt costs 4 in body
1781<unknown> 1 times vec_perm costs 4 in body
179__builtin_bswap64 (_8) 1 times vector_stmt costs 4 in prologue
180__builtin_bswap64 (_8) 1 times vec_perm costs 4 in body
181_1 1 times vector_store costs 16 in body
182test.cc:9:10: note: Cost model analysis for part in loop 0:
183  Vector cost: 28
184  Scalar cost: 40
...
243  <bb 2> [local count: 1073741824]:
244  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
245  _11 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D));
246  _13 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D));
247  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
248  _14 = VIEW_CONVERT_EXPR<vector(16) char>(_12);
249  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12,
11, 10, 9, 8 }>;
250  _16 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_15);
251  _1 = __builtin_bswap64 (_8);
252  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
253  _2 = __builtin_bswap64 (_10);
254  MEM <vector(2) long unsigned int> [(long unsigned int *)&y] = _16;
255  _7 = MEM <uint128_t> [(char * {ref-all})&y];


1. According to the ABI, uint128 is passed in 2 GPRs, and there should be an
extra cost for _11 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D));
2. Why is there "1781<unknown> 1 times vec_perm costs 4 in body"? Should it be
2 times vec_perm costs?


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: crazylht at gmail dot com @ 2022-01-21  1:28 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
Also, there's a separate issue: codegen for the below is not optimal.
gimple:
_11 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D))
asm:
        mov     QWORD PTR [rsp-24], rdi
        mov     QWORD PTR [rsp-16], rsi
        movdqa  xmm0, XMMWORD PTR [rsp-24]


I think this issue has been recorded in several existing PRs.


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: crazylht at gmail dot com @ 2022-01-21  1:32 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
> 1. According to the ABI, uint128 is passed in 2 GPRs, and there should be an
> extra cost for _11 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D));

And no scalar cost is needed for
BIT_FIELD_REF <a_3(D), 64, 64> 1 times scalar_stmt costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 0> 1 times scalar_stmt costs 4 in body


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: rguenth at gcc dot gnu.org @ 2022-01-21  8:28 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
           Priority|P3                          |P2
             Target|                            |x86_64-*-*

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
With just SSE2 we get the store vectorized only:

bswap:
.LFB0:
        .cfi_startproc
        bswap   %rsi
        bswap   %rdi
        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

the

<unknown> 1 times vec_perm costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 64> 1 times scalar_stmt costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 0> 1 times scalar_stmt costs 4 in body

costs are what we cost building the initial vector from __int128
compared to splitting that into a low/high part.

  <bb 2> [local count: 1073741824]:
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _13 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR<vector(16) char>(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11,
10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _2 = __builtin_bswap64 (_10);
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&y] = _16;
  _7 = MEM <uint128_t> [(char * {ref-all})&y];

doesn't realize that it can maybe move the hi/lo swap across the two
permutes to before the store, otherwise it looks as expected.

Yes, the vectorizer doesn't account for ABI details on the function boundary
but it's very hard to do that in a sensible way.

Practically the worst part of the generated code is

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

because the store will fail to forward, causing a huge performance issue.
I wonder why we fail to merge those.  We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 98)
        (nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
        (vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
      REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
      REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
      REG_DEAD r87:TI
Can't combine i2 into i3

possibly because 27 is a partial def of r87.  We expand to

(insn 4 3 5 2 (set (reg:TI 88)
        (subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
        (reg:DI 90)) "t.c":2:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
        (reg:TI 88)) "t.c":2:1 -1
     (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
     (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
        (vec_select:V2DI (reg:V2DI 82 [ _11 ])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 -1
     (nil))

initially from

  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 95 [ a ])
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 96 [+8 ])
        (nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG) 
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

so that would be the best place to fix this up, realizing reg 87 dies
after insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?

On x86 the simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
    return *(v2di *)&a;
}

that produces

bswap:
.LFB0:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        stp     x0, x1, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
though and I'm not sure there's another way to move x0/x1 to q0.

Providing LRA with a way to move TImode to VnmImode would of course
also avoid the spilling, but getting rid of the TImode pseudo when
it's only there as an intermediary for moving two DImode vals to V2DImode
sounds like a useful transform to me.  combine is too late since
fwprop already merged the subreg with the following shuffle for the
larger testcase.

Alternatively, LRA could also be taught to spill to %xmm by somehow
telling it of the vastly increased cost of the double-spill, single-reload
sequence?  But I guess it would still need to be taught how to
reload V2DImode from a {DImode, DImode} pair in %xmm regs ...
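
For illustration, the direct two-DImode-to-V2DImode combine (which the
SSE2-only code earlier in this comment already performs via movq and
punpcklqdq) written as SSE2 intrinsics; a sketch only, with a made-up helper
name, assuming x86-64:

#include <immintrin.h>
#include <cstdint>

// Sketch: build a V2DI directly from two 64-bit values already in GPRs,
// avoiding the two-store/one-load stack round-trip entirely.
static inline __m128i combine_u64(std::uint64_t lo, std::uint64_t hi)
{
    __m128i vlo = _mm_cvtsi64_si128((long long)lo); // movq lo -> xmm
    __m128i vhi = _mm_cvtsi64_si128((long long)hi); // movq hi -> xmm
    return _mm_unpacklo_epi64(vlo, vhi);            // punpcklqdq
}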


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: rsandifo at gcc dot gnu.org @ 2022-01-21  9:11 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #7 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> Richard - I'm sure we can construct a similar case for aarch64 where
> argument passing and vector mode use cause spilling?
> 
> On x86 the simplest testcase showing this is
> 
> typedef unsigned long long v2di __attribute__((vector_size(16)));
> v2di bswap(__uint128_t a)
> {
>     return *(v2di *)&a;
> }
> 
> that produces
> 
> bswap:
> .LFB0:
>         .cfi_startproc
>         sub     sp, sp, #16
>         .cfi_def_cfa_offset 16
>         stp     x0, x1, [sp]
>         ldr     q0, [sp]
>         add     sp, sp, 16
>         .cfi_def_cfa_offset 0
>         ret
> 
> on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> though and I'm not sure there's another way to move x0/x1 to q0.
It looks like this is a deliberate choice for aarch64.  The generic
costing has:

  /* Avoid the use of slow int<->fp moves for spilling by setting
     their cost higher than memmov_cost.  */
  5, /* GP2FP  */

So in cases like the above, we're telling IRA that spilling to
memory and reloading is cheaper than moving between registers.
For -mtune=thunderx we generate:

        fmov    d0, x0
        ins     v0.d[1], x1
        ret

instead.


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: rguenth at gcc dot gnu.org @ 2022-01-21 10:18 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #7)
> (In reply to Richard Biener from comment #6)
> > Richard - I'm sure we can construct a similar case for aarch64 where
> > argument passing and vector mode use cause spilling?
> > 
> > On x86 the simplest testcase showing this is
> > 
> > typedef unsigned long long v2di __attribute__((vector_size(16)));
> > v2di bswap(__uint128_t a)
> > {
> >     return *(v2di *)&a;
> > }
> > 
> > that produces
> > 
> > bswap:
> > .LFB0:
> >         .cfi_startproc
> >         sub     sp, sp, #16
> >         .cfi_def_cfa_offset 16
> >         stp     x0, x1, [sp]
> >         ldr     q0, [sp]
> >         add     sp, sp, 16
> >         .cfi_def_cfa_offset 0
> >         ret
> > 
> > on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> > though and I'm not sure there's another way to move x0/x1 to q0.
> It looks like this is a deliberate choice for aarch64.  The generic
> costing has:
> 
>   /* Avoid the use of slow int<->fp moves for spilling by setting
>      their cost higher than memmov_cost.  */
>   5, /* GP2FP  */
> 
> So in cases like the above, we're telling IRA that spilling to
> memory and reloading is cheaper than moving between registers.
> For -mtune=thunderx we generate:
> 
>         fmov    d0, x0
>         ins     v0.d[1], x1
>         ret
> 
> instead.

Ah, interesting.  On x86 we disallow/pessimize GPR<->XMM moves with
some tunings as well; still, a sequence like

       movq    %rdi, -24(%rsp)
       movq    %rsi, -16(%rsp)
       movq    -24(%rsp), %xmm0
       movq    -16(%rsp), %xmm1
       unpckhpd  %xmm0, %xmm1   (fixme - that's wrong, but you get the idea)

instead of

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

would likely be faster.  Not sure if one can get LRA to produce this
two-staged reload with just appropriate costing though.  As said, the
key cost of the bad sequence is the failing store forwarding,
so it's specific to spilling a two-GPR TImode and reloading it as a
single FPR V2DImode.
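
A sketch of that forwarding-friendly two-piece reload in intrinsics form (the
helper name is made up, and punpcklqdq is used instead of the approximate
unpckhpd above): each 64-bit half is reloaded with a load that matches the
64-bit store that wrote it, so both loads can store-forward, and the halves
are then combined in registers:

#include <immintrin.h>
#include <cstdint>

// Sketch: reload a 16-byte slot that was written as two 8-byte stores
// using two 8-byte loads plus a register combine, instead of one
// 16-byte load that cannot store-forward.
static inline __m128i reload_v2di(const std::uint64_t *slot)
{
    __m128i lo = _mm_loadl_epi64((const __m128i *)(slot + 0)); // movq load
    __m128i hi = _mm_loadl_epi64((const __m128i *)(slot + 1)); // movq load
    return _mm_unpacklo_epi64(lo, hi);                         // punpcklqdq
}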


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: rguenth at gcc dot gnu.org @ 2022-01-21 10:29 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> (In reply to rsandifo@gcc.gnu.org from comment #7)
> > (In reply to Richard Biener from comment #6)
> > > Richard - I'm sure we can construct a similar case for aarch64 where
> > > argument passing and vector mode use cause spilling?
> > > 
> > > On x86 the simplest testcase showing this is
> > > 
> > > typedef unsigned long long v2di __attribute__((vector_size(16)));
> > > v2di bswap(__uint128_t a)
> > > {
> > >     return *(v2di *)&a;
> > > }
> > > 
> > > that produces
> > > 
> > > bswap:
> > > .LFB0:
> > >         .cfi_startproc
> > >         sub     sp, sp, #16
> > >         .cfi_def_cfa_offset 16
> > >         stp     x0, x1, [sp]
> > >         ldr     q0, [sp]
> > >         add     sp, sp, 16
> > >         .cfi_def_cfa_offset 0
> > >         ret
> > > 
> > > on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load
> > > though and I'm not sure there's another way to move x0/x1 to q0.
> > It looks like this is a deliberate choice for aarch64.  The generic
> > costing has:
> > 
> >   /* Avoid the use of slow int<->fp moves for spilling by setting
> >      their cost higher than memmov_cost.  */
> >   5, /* GP2FP  */
> > 
> > So in cases like the above, we're telling IRA that spilling to
> > memory and reloading is cheaper than moving between registers.
> > For -mtune=thunderx we generate:
> > 
> >         fmov    d0, x0
> >         ins     v0.d[1], x1
> >         ret
> > 
> > instead.
> 
> Ah, interesting.  On x86 we disallow/pessimize GPR<->XMM moves with
> some tunings as well; still, a sequence like
> 
>        movq    %rdi, -24(%rsp)
>        movq    %rsi, -16(%rsp)
>        movq    -24(%rsp), %xmm0
>        movq    -16(%rsp), %xmm1
>        unpckhpd  %xmm0, %xmm1   (fixme - that's wrong, but you get the idea)
> 
> instead of
> 
>         movq    %rdi, -24(%rsp)
>         movq    %rsi, -16(%rsp)
>         movdqa  -24(%rsp), %xmm0
> 
> would likely be faster.  Not sure if one can get LRA to produce this
> two-staged reload with just appropriate costing though.  As said, the
> key cost of the bad sequence is the failing store forwarding,
> so it's specific to spilling a two-GPR TImode and reloading it as a
> single FPR V2DImode.

And a speciality of aarch64 seems to be that it has arguments
passed in (reg:TI x0), which supposedly is a register pair.  On x86
there are no TImode register-pair registers I think; instead the
__int128 is passed as two 8-byte values in regular GPRs.  So on aarch64
we have the simpler

(insn 13 3 10 2 (set (reg:TI 95)
        (reg:TI 0 x0 [ a ])) "t.ii":3:2 58 {*movti_aarch64}
     (expr_list:REG_DEAD (reg:TI 0 x0 [ a ])
        (nil)))
(insn 10 13 11 2 (set (reg/i:V2DI 32 v0)
        (subreg:V2DI (reg:TI 95) 0)) "t.ii":5:2 1173 {*aarch64_simd_movv2di}
     (expr_list:REG_DEAD (reg:TI 95)
        (nil)))

before RA.


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: ubizjak at gmail dot com @ 2022-01-21 12:20 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Hongtao.liu from comment #4)
> Also, there's a separate issue: codegen for the below is not optimal.
> gimple:
> _11 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D))
> asm:
>         mov     QWORD PTR [rsp-24], rdi
>         mov     QWORD PTR [rsp-16], rsi
>         movdqa  xmm0, XMMWORD PTR [rsp-24]
> 
> 
> I think this issue has been recorded in several existing PRs.

Maybe this can be solved with secondary_reload when GPR and XMM regs are
involved.


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: jakub at gcc dot gnu.org @ 2022-01-28 12:20 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
With -O3 it regressed with
r7-2009-g8d4fc2d3d0c8f87bb3e182be1a618a511f8f9465.

__uint128_t bswap(__uint128_t a) { return __builtin_bswap128 (a); }

emits the optimal code, but __builtin_bswap128 is only available in GCC 11.1
and later.

One fix for this might be to handle
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _1 = __builtin_bswap64 (_8);
  y[0] = _1;
  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _2 = __builtin_bswap64 (_10);
  y[1] = _2;
  _7 = MEM <uint128_t> [(char * {ref-all})&y];
in bswap or store merging.
Though I'm afraid the current bswap infrastructure limits it to 64-bit size,
because it tracks the bytes in uint64_t vars and uses 8 bits to determine which
byte it is (0 for a value of zero, 1-8 for the byte index and 0xff for unknown).
While that is 10 different values right now, if we handled uint128_t we'd need
18 different values times 16 byte slots.
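
To illustrate the tracking scheme (an illustration only, not GCC's actual
encoding constants): give each byte a marker equal to its 1-based index, and a
bswap simply reverses the marker order, which is what the pass pattern-matches:

#include <cstdint>
#include <cstdio>

int main()
{
    // Markers 1..8, one per byte of a 64-bit symbolic number (LSB = 1).
    std::uint64_t ident = 0;
    for (int i = 0; i < 8; ++i)
        ident |= (std::uint64_t)(i + 1) << (8 * i);

    // A bswap just reverses the marker order; together with 0 ("known
    // zero") and 0xff ("unknown") that is the 10 values mentioned above.
    // A 128-bit version would need markers 1..16 plus 0 and 0xff, i.e.
    // 18 values spread over 16 byte slots.
    std::uint64_t swapped = __builtin_bswap64(ident);
    std::printf("%016llx -> %016llx\n",
                (unsigned long long) ident, (unsigned long long) swapped);
    return 0;
}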

Note, even:
unsigned long long
bswap (unsigned long long a)
{
  unsigned int x[2];
  __builtin_memcpy (x, &a, 8);
  unsigned int y[2];
  y[0] = __builtin_bswap32 (x[1]);
  y[1] = __builtin_bswap32 (x[0]);
  __builtin_memcpy (&a, y, 8);
  return a;
}

unsigned long long
bswap2 (unsigned long long a)
{
  return __builtin_bswap64 (a);
}
emits better code in the latter function rather than the former; store
merging isn't able to handle even that.
So if we want to handle it in store-merging, we should start with handling
  _8 = BIT_FIELD_REF <a_3(D), 32, 32>;
  _1 = __builtin_bswap32 (_8);
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR<unsigned long>(_11);
and
  _8 = BIT_FIELD_REF <a_3(D), 32, 32>;
  _1 = __builtin_bswap32 (_8);
  y[0] = _1;
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  y[1] = _2;
  _7 = MEM <unsigned long> [(char * {ref-all})&y];
and only once that is handled try
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _1 = __builtin_bswap64 (_8);
  _10 = (long long unsigned int) a_3(D);
  _2 = __builtin_bswap64 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR<uint128_t>(_11);
Doesn't look like stage4 material though.

So in the meantime perhaps some other improvements.


* [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
From: ubizjak at gmail dot com @ 2022-01-31 14:06 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #12 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Uroš Bizjak from comment #10)
> (In reply to Hongtao.liu from comment #4)
> > Also, there's a separate issue: codegen for the below is not optimal.
> > gimple:
> > _11 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(a_3(D))
> > asm:
> >         mov     QWORD PTR [rsp-24], rdi
> >         mov     QWORD PTR [rsp-16], rsi
> >         movdqa  xmm0, XMMWORD PTR [rsp-24]
> > 
> > 
> > I think this issue has been recorded in several existing PRs.
> 
> Maybe this can be solved with secondary_reload when GPR and XMM regs are
> involved.

PR104306 is an experiment.


* [Bug middle-end/104151] [9/10/11/12/13 Regression] x86: excessive code generated for 128-bit byteswap
From: jakub at gcc dot gnu.org @ 2022-05-06  8:32 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.0                        |12.2

--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 12.1 is being released, retargeting bugs to GCC 12.2.


* [Bug middle-end/104151] [10/11/12/13 Regression] x86: excessive code generated for 128-bit byteswap
From: pobrn at protonmail dot com @ 2022-09-06 22:04 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Barnabás Pőcze <pobrn at protonmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pobrn at protonmail dot com

--- Comment #15 from Barnabás Pőcze <pobrn at protonmail dot com> ---
Sorry, I haven't found a better issue. But I think the example below exhibits
the same or a very similar issue.

I would expect the following code

void f(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    p[0] = x >> 24;
    p[1] = x >> 16;
    p[2] = x >>  8;
    p[3] = x >>  0;

    p[4] = y >> 24;
    p[5] = y >> 16;
    p[6] = y >>  8;
    p[7] = y >>  0;
}

to be compiled to something along the lines of

f(unsigned char*, unsigned int, unsigned int):
        bswap   esi
        bswap   edx
        mov     DWORD PTR [rdi], esi
        mov     DWORD PTR [rdi+4], edx
        ret

however, I get scores of bitwise operations instead if `-fno-tree-vectorize` is
not specified.

https://gcc.godbolt.org/z/z51K6qorv
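
For comparison, a rewrite of the same function that the store-merging pass
already handles today (a sketch for little-endian targets; the function name
is made up and this is not the code behind the link above):

#include <cstdint>
#include <cstring>

// Sketch: express the two big-endian 32-bit stores explicitly; byte-for-byte
// equivalent to f() above on a little-endian target, and expected to compile
// to two bswaps plus two 32-bit stores.
void f2(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    std::uint32_t bx = __builtin_bswap32(x);
    std::uint32_t by = __builtin_bswap32(y);
    std::memcpy(p, &bx, 4);
    std::memcpy(p + 4, &by, 4);
}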


* [Bug middle-end/104151] [10/11/12/13 Regression] x86: excessive code generated for 128-bit byteswap
From: rguenth at gcc dot gnu.org @ 2022-09-07  8:18 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Barnabás Pőcze from comment #15)
> Sorry, I haven't found a better issue. But I think the example below
> exhibits the same or a very similar issue.
> 
> I would expect the following code
> 
> void f(unsigned char *p, std::uint32_t x, std::uint32_t y)
> {
>     p[0] = x >> 24;
>     p[1] = x >> 16;
>     p[2] = x >>  8;
>     p[3] = x >>  0;
> 
>     p[4] = y >> 24;
>     p[5] = y >> 16;
>     p[6] = y >>  8;
>     p[7] = y >>  0;
> }
> 
> to be compiled to something along the lines of
> 
> f(unsigned char*, unsigned int, unsigned int):
>         bswap   esi
>         bswap   edx
>         mov     DWORD PTR [rdi], esi
>         mov     DWORD PTR [rdi+4], edx
>         ret
> 
> however, I get scores of bitwise operations instead if `-fno-tree-vectorize`
> is not specified.
> 
> https://gcc.godbolt.org/z/z51K6qorv

Yes, here we vectorize the store:

  <bb 2> [local count: 1073741824]:
  _1 = x_15(D) >> 24;
  _2 = (unsigned char) _1;
  _3 = x_15(D) >> 16;
  _4 = (unsigned char) _3;
  _5 = x_15(D) >> 8;
  _6 = (unsigned char) _5;
  _7 = (unsigned char) x_15(D);
  _8 = y_22(D) >> 24;
  _9 = (unsigned char) _8;
  _10 = y_22(D) >> 16;
  _11 = (unsigned char) _10;
  _12 = y_22(D) >> 8;
  _13 = (unsigned char) _12;
  _14 = (unsigned char) y_22(D);
  _35 = {_2, _4, _6, _7, _9, _11, _13, _14};
  vectp.4_36 = p_17(D);
  MEM <vector(8) unsigned char> [(unsigned char *)vectp.4_36] = _35;

but without vectorizing, the store-merging pass (which comes after
vectorization) is able to detect two SImode bswaps.

Basically we fail to consider "generic" vectorization as an option here
and generic vectorization fails to consider using bswap for permutes
of "existing vectors".  Likewise we fail to consider _1, _3, etc.
as element accesses of the existing "vectors" x and y.  That would
work iff the shift + truncates were canonicalized as BIT_FIELD_REF,
but it's certainly possible to work with the existing IL here.

Note this issue is probably better tracked in a separate bugreport.


* [Bug middle-end/104151] [10/11/12/13/14 Regression] x86: excessive code generated for 128-bit byteswap
From: rguenth at gcc dot gnu.org @ 2023-05-08 12:23 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.3                        |12.4

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.


* [Bug middle-end/104151] [10/11/12/13/14 Regression] x86: excessive code generated for 128-bit byteswap
From: chfast at gmail dot com @ 2023-05-11 13:17 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Paweł Bylica <chfast at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chfast at gmail dot com

--- Comment #18 from Paweł Bylica <chfast at gmail dot com> ---
Not sure if this helps in any way, but this is a 256-bit variant:
https://godbolt.org/z/84fMTs1YP.
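
For context, a hypothetical 256-bit analog of the original reproducer (not
necessarily identical to the code behind the link):

#include <cstdint>

// Hypothetical 256-bit byteswap built from four 64-bit byteswaps,
// mirroring the structure of the 128-bit reproducer in this PR.
struct u256 { std::uint64_t w[4]; };

u256 bswap256(u256 a)
{
    u256 r;
    r.w[0] = __builtin_bswap64(a.w[3]);
    r.w[1] = __builtin_bswap64(a.w[2]);
    r.w[2] = __builtin_bswap64(a.w[1]);
    r.w[3] = __builtin_bswap64(a.w[0]);
    return r;
}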

