public inbox for gcc-bugs@sourceware.org
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
Date: Fri, 21 Jan 2022 08:28:35 +0000
Message-ID: <bug-104151-4-WBJNOFyN7e@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104151-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
           Priority|P3                          |P2
             Target|                            |x86_64-*-*

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
With just SSE2 we get the store vectorized only:

bswap:
.LFB0:
        .cfi_startproc
        bswap   %rsi
        bswap   %rdi
        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

the

<unknown> 1 times vec_perm costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 64> 1 times scalar_stmt costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 0> 1 times scalar_stmt costs 4 in body

costs are what we cost building the initial vector from __int128
compared to splitting that into a low/high part.

  <bb 2> [local count: 1073741824]:
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _13 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR<vector(16) char>(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _2 = __builtin_bswap64 (_10);
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&y] = _16;
  _7 = MEM <uint128_t> [(char * {ref-all})&y];

doesn't realize that it can maybe move the hi/lo swap across the two
permutes to before the store, otherwise it looks as expected.
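
As an aside on the two VEC_PERM_EXPRs in the dump above: the { 1, 0 } lane
swap followed by the per-lane byte reversal is equivalent to one 16-byte
reversal with selector { 15, 14, ..., 0 }.  The following is only a
plain-C model on byte arrays, with hypothetical helper names; it says nothing
about where in GCC such a merge would belong.

```c
#include <stdint.h>
#include <string.h>

/* Models VEC_PERM_EXPR <v, v, { 1, 0 }> on 64-bit lanes. */
static void swap_lanes(const uint8_t in[16], uint8_t out[16])
{
    memcpy(out, in + 8, 8);
    memcpy(out + 8, in, 8);
}

/* Models the byte permute { 7, ..., 0, 15, ..., 8 }. */
static void reverse_within_lanes(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[i] = in[7 - i];
        out[8 + i] = in[15 - i];
    }
}

/* A single permute with selector { 15, 14, ..., 0 }. */
static void reverse_all(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[15 - i];
}

/* Returns 1 when the two-step sequence matches the single reversal. */
int two_permutes_equal_full_reversal(void)
{
    uint8_t in[16], tmp[16], two_step[16], one_step[16];

    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(0xA0 + i);   /* distinct byte values */

    swap_lanes(in, tmp);
    reverse_within_lanes(tmp, two_step);
    reverse_all(in, one_step);
    return memcmp(two_step, one_step, 16) == 0;
}
```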

Yes, the vectorizer doesn't account for ABI details on the function boundary
but it's very hard to do that in a sensible way.

Practically the worst part of the generated code is

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

because the store will fail to forward, causing a huge performance issue.
I wonder why we fail to merge those.  We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 98)
        (nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
        (vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
      REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
      REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
      REG_DEAD r87:TI
Can't combine i2 into i3

possibly because 27 is a partial def of r87.  We expand to

(insn 4 3 5 2 (set (reg:TI 88)
        (subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
        (reg:DI 90)) "t.c":2:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
        (reg:TI 88)) "t.c":2:1 -1
     (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
     (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
        (vec_select:V2DI (reg:V2DI 82 [ _11 ])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 -1
     (nil))

initially from

  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 95 [ a ])
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 96 [+8 ])
        (nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG) 
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

so that would be the best place to fix this up, realizing reg 87 dies
after insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?

On x86 the simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
    return *(v2di *)&a;
}

that produces

bswap:
.LFB0:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        stp     x0, x1, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

on aarch64 for me.  Maybe the stp x0, x1 store can forward to the ldr load,
though, and I'm not sure there's another way to move x0/x1 to q0.

Providing LRA with a way to move TImode to VnmImode would of course
also avoid the spilling but getting rid of the TImode pseudo when
it's only there as an intermediary for moving two DImode vals to V2DImode
sounds like a useful transform to me.  combine is too late since
fwprop already merged the subreg with the following shuffle for the
larger testcase.
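
For illustration only, here is the "two DImode vals to V2DImode" idea at the
source level: the vector is built from the low and high halves directly
rather than by reinterpreting the __uint128_t through its address.  This is a
sketch, not a claim about the code any particular GCC version generates for
it.  It assumes a little-endian 64-bit target with __uint128_t and GNU
vector extensions, and the function names are made up.

```c
#include <string.h>

typedef unsigned long long v2di __attribute__((vector_size(16)));

/* Same reinterpretation as the testcase above, written with memcpy. */
v2di via_memory(__uint128_t a)
{
    v2di v;
    memcpy(&v, &a, sizeof v);
    return v;
}

/* Build the vector from the two 64-bit halves.
   Lane 0 = low half assumes a little-endian layout. */
v2di via_halves(__uint128_t a)
{
    v2di v = { (unsigned long long)a, (unsigned long long)(a >> 64) };
    return v;
}

/* Returns 1 when both constructions give the same lanes. */
int same_lanes(__uint128_t a)
{
    v2di m = via_memory(a);
    v2di h = via_halves(a);
    return m[0] == h[0] && m[1] == h[1];
}
```

Comparing the assembly for via_memory and via_halves at -O2 is one way to
see whether a given compiler still goes through the stack.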

Alternatively LRA could also be taught to spill to %xmm by somehow
telling it of the vastly increased cost of the double-spill, single-reload
sequence?  But I guess it would still need to be taught how to
reload V2DImode from a {DImode, DImode} pair in %xmm regs ...
