[Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
@ 2021-08-21 21:25 ` pinskia at gcc dot gnu.org
  2023-06-21 13:33 ` [Bug middle-end/88873] " rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-21 21:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |101926

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So we now vectorize both functions but we mess up foo's code gen.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be
improved


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
  2021-08-21 21:25 ` [Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type pinskia at gcc dot gnu.org
@ 2023-06-21 13:33 ` rguenth at gcc dot gnu.org
  2023-06-21 22:18 ` roger at nextmovesoftware dot com
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-21 13:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linkw at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |sayle at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
          Component|tree-optimization           |middle-end
           Keywords|                            |ra

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we "like"

v2df bar (v2df a, v2df b, v2df c)
{
  vector(2) double vect__4.19;
  vect__4.19_19 = .FMA (b_10(D), a_11(D), c_9(D)); [tail call]
  return vect__4.19_19;
}

but foo has the usual ABI issues:

struct s_t foo (struct s_t a, struct s_t b, struct s_t c)
{
  vector(2) double vect__4.13;
  vector(2) double vect__1.12;
  vector(2) double vect__3.9; 
  vector(2) double vect__2.6;
  struct s_t D.4355;
  vect__1.12_14 = MEM <vector(2) double> [(double *)&c];
  vect__2.6_12 = MEM <vector(2) double> [(double *)&b];
  vect__3.9_13 = MEM <vector(2) double> [(double *)&a];
  vect__4.13_15 = .FMA (vect__2.6_12, vect__3.9_13, vect__1.12_14);
  MEM <vector(2) double> [(double *)&D.4355] = vect__4.13_15;
  return D.4355;
}

where the argument passing / return value handling gets us

foo:
        vmovq   %xmm3, %rax
        vmovq   %xmm0, -24(%rsp)
        vpinsrq $1, %rax, %xmm2, %xmm7
        vmovq   %xmm5, %rax
        vmovq   %xmm1, -16(%rsp)
        vmovapd %xmm7, %xmm6
        vpinsrq $1, %rax, %xmm4, %xmm2
        vmovq   %xmm4, -40(%rsp)
        vfmadd132pd     -24(%rsp), %xmm2, %xmm6
        vmovq   %xmm5, -32(%rsp)
        vmovapd %xmm6, -56(%rsp)
        vmovsd  -48(%rsp), %xmm1
        vmovsd  -56(%rsp), %xmm0
        ret

That's very weird; we also seem to clean things up halfway but fail to
eliminate, for example, the useless vmovq   %xmm5, -32(%rsp) spill.

The IBM folks who want to use SRA-style analysis at RTL expansion time
might in the end deal with this as well.

We expand to

(insn 2 21 3 2 (set (reg:DF 91)
        (reg:DF 20 xmm0 [ a ])) "t2.c":8:1 -1
     (nil))
(insn 3 2 4 2 (set (reg:DF 92)
        (reg:DF 21 xmm1 [ a+8 ])) "t2.c":8:1 -1
     (nil))
(insn 4 3 5 2 (set (reg:TI 90)
        (const_int 0 [0])) "t2.c":8:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DF (reg:TI 90) 0)
        (reg:DF 91)) "t2.c":8:1 -1
     (nil))
(insn 6 5 7 2 (set (subreg:DF (reg:TI 90) 8)
        (reg:DF 92)) "t2.c":8:1 -1
     (nil))

so we're using TImode pseudos because the aggregate has TImode but the
accesses should tell us that V2DFmode would be a way better choice
(or alternatively V2DImode in case float modes are too dangerous).

The actual single use is then

(insn 23 20 24 2 (set (reg:V2DF 85 [ vect__4.13 ])
        (fma:V2DF (subreg:V2DF (reg/v:TI 93 [ b ]) 0)
            (subreg:V2DF (reg/v:TI 89 [ a ]) 0)
            (subreg:V2DF (reg/v:TI 97 [ c ]) 0))) "t2.c":9:18 -1
     (nil))

and of course IRA/LRA are not able to deal with this situation nicely,
possibly because of the subreg sets of the TImode pseudo which we
do not split (well, we can't).  We could possibly use STV to handle
some of this though(?)


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
  2021-08-21 21:25 ` [Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type pinskia at gcc dot gnu.org
  2023-06-21 13:33 ` [Bug middle-end/88873] " rguenth at gcc dot gnu.org
@ 2023-06-21 22:18 ` roger at nextmovesoftware dot com
  2023-07-10  8:09 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 8+ messages in thread
From: roger at nextmovesoftware dot com @ 2023-06-21 22:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #5 from Roger Sayle <roger at nextmovesoftware dot com> ---
I have a patch (series) that improves some of the TImode parameter passing
issues with the ABI.  I'll check/investigate whether this fixes DFmode in the
same way that it improves DImode.  I worry that the (hi<<64)|lo idiom might not
be applicable for FP (without SUBREGs), but something similar (with vec_merge)
may resolve this issue during RTL expansion.
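
For illustration, the (hi<<64)|lo idiom mentioned above can be sketched in C
as follows.  This is only a sketch, assuming GCC's unsigned __int128 on a
64-bit target and 64-bit IEEE doubles; the helper names concat_di and
concat_df are hypothetical and do not come from the patch series.  concat_df
shows why the FP case needs a bit-level reinterpretation first, which is the
role a SUBREG plays in RTL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;

/* Build a 128-bit value from two 64-bit halves: (hi << 64) | lo.  */
static u128 concat_di (uint64_t hi, uint64_t lo)
{
  return ((u128) hi << 64) | lo;
}

/* Hypothetical FP analogue: doubles cannot be shifted or OR-ed directly,
   so their bit patterns are first copied into 64-bit integers.  */
static u128 concat_df (double hi, double lo)
{
  uint64_t hbits, lbits;
  assert (sizeof (double) == sizeof (uint64_t));
  memcpy (&hbits, &hi, sizeof hbits);
  memcpy (&lbits, &lo, sizeof lbits);
  return concat_di (hbits, lbits);
}
```

The integer version maps directly onto shift/or on DImode values, whereas
the double version has no arithmetic equivalent without going through the
integer bit patterns.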


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2023-06-21 22:18 ` roger at nextmovesoftware dot com
@ 2023-07-10  8:09 ` cvs-commit at gcc dot gnu.org
  2023-07-12 11:33 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-10  8:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:12b78b0b42d53019eb2c500d386094194e90ad16

commit r14-2406-g12b78b0b42d53019eb2c500d386094194e90ad16
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Mon Jul 10 09:06:52 2023 +0100

    i386: Add new insvti_lowpart_1 and insvdi_lowpart_1 patterns.

    This patch implements another of Uros' suggestions, to investigate an
    insvti_lowpart_1 pattern to improve TImode parameter passing on x86_64.
    In PR 88873, the RTL the middle-end expands for passing V2DF in TImode
    is subtly different from what it does for V2DI in TImode, sufficiently so
    that my explanations for why insvti_lowpart_1 isn't required don't apply
    in this case.

    This patch adds an insvti_lowpart_1 pattern, complementing the existing
    insvti_highpart_1 pattern, and also a 32-bit variant, insvdi_lowpart_1.
    Because the middle-end represents 128-bit constants using CONST_WIDE_INT
    and 64-bit constants using CONST_INT, it's easiest to treat these as
    different patterns, rather than attempt <dwi> parameterization.

    This patch also includes a peephole2 (actually a pair) to transform
    xchg instructions into mov instructions, when one of the destinations
    is unused.  This optimization is required to produce the optimal code
    sequences below.

    For the 64-bit case:

    __int128 foo(__int128 x, unsigned long long y)
    {
      __int128 m = ~((__int128)~0ull);
      __int128 t = x & m;
      __int128 r = t | y;
      return r;
    }

    Before:
            xchgq   %rdi, %rsi
            movq    %rdx, %rax
            xorl    %esi, %esi
            xorl    %edx, %edx
            orq     %rsi, %rax
            orq     %rdi, %rdx
            ret

    After:
            movq    %rdx, %rax
            movq    %rsi, %rdx
            ret

    For the 32-bit case:

    long long bar(long long x, int y)
    {
      long long mask = ~0ull << 32;
      long long t = x & mask;
      long long r = t | (unsigned int)y;
      return r;
    }

    Before:
            pushl   %ebx
            movl    12(%esp), %edx
            xorl    %ebx, %ebx
            xorl    %eax, %eax
            movl    16(%esp), %ecx
            orl     %ebx, %edx
            popl    %ebx
            orl     %ecx, %eax
            ret

    After:
            movl    12(%esp), %eax
            movl    8(%esp), %edx
            ret

    2023-07-10  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            * config/i386/i386.md (peephole2): Transform xchg insn with a
            REG_UNUSED note to a (simple) move.
            (*insvti_lowpart_1): New define_insn_and_split.
            (*insvdi_lowpart_1): Likewise.

    gcc/testsuite/ChangeLog
            * gcc.target/i386/insvdi_lowpart-1.c: New test case.
            * gcc.target/i386/insvti_lowpart-1.c: Likewise.


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2023-07-10  8:09 ` cvs-commit at gcc dot gnu.org
@ 2023-07-12 11:33 ` rguenth at gcc dot gnu.org
  2023-07-14 17:13 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-12 11:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Didn't yet help for the original testcase in the description.  We RTL expand
from

  vect__1.11_14 = MEM <vector(2) double> [(double *)&c];
  vect__2.5_12 = MEM <vector(2) double> [(double *)&b];
  vect__3.8_13 = MEM <vector(2) double> [(double *)&a];
  vect__4.12_15 = .FMA (vect__2.5_12, vect__3.8_13, vect__1.11_14);
  MEM <vector(2) double> [(double *)&D.4349] = vect__4.12_15;
  return D.4349;

and get

(insn 2 21 3 2 (set (reg:DF 91)
        (reg:DF 20 xmm0 [ a ])) "t.c":8:1 -1
     (nil))
(insn 3 2 4 2 (set (reg:DF 92)
        (reg:DF 21 xmm1 [ a+8 ])) "t.c":8:1 -1
     (nil))
(insn 4 3 5 2 (set (reg:TI 90)
        (const_int 0 [0])) "t.c":8:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DF (reg:TI 90) 0)
        (reg:DF 91)) "t.c":8:1 -1
     (nil))
(insn 6 5 7 2 (set (subreg:DF (reg:TI 90) 8)
        (reg:DF 92)) "t.c":8:1 -1
     (nil))
(insn 7 6 8 2 (set (reg/v:TI 89 [ a ])
        (reg:TI 90)) "t.c":8:1 -1
     (nil))

...

(insn 23 20 24 2 (set (reg:V2DF 85 [ vect__4.12 ])
        (fma:V2DF (subreg:V2DF (reg/v:TI 93 [ b ]) 0)
            (subreg:V2DF (reg/v:TI 89 [ a ]) 0)
            (subreg:V2DF (reg/v:TI 97 [ c ]) 0))) "t.c":9:18 -1
     (nil))

so the ABI passes struct s_t in two %xmm regs but the backend gives it
TImode.  Nothing cleans this up, and we end up with horrible code.
The subreg pass is likely "confused" by the V2DFmode subreg of the TImode
pseudos; maybe it needs to learn to turn the TImode pseudo into a V2DFmode
one ...
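
As a source-level illustration of the V2DF view that the GIMPLE above takes
of struct s_t, here is a minimal sketch using GNU C vector extensions.  The
helper names load_v2df and foo_vec are hypothetical, and plain a * b + c
stands in for .FMA, so rounding may differ from a fused operation.

```c
#include <assert.h>
#include <string.h>

typedef struct { double x, y; } s_t;
typedef double v2df __attribute__ ((vector_size (2 * sizeof (double))));

/* Hypothetical helper: view the two doubles of an s_t as one vector,
   like MEM <vector(2) double> [(double *)&a] in the GIMPLE dump.  */
static v2df load_v2df (const s_t *p)
{
  v2df v;
  assert (sizeof (s_t) == sizeof (v2df));
  memcpy (&v, p, sizeof v);
  return v;
}

/* Elementwise multiply-add on the vector view of the three arguments.
   This is a stand-in for .FMA, not the vectorizer's actual output.  */
static s_t foo_vec (s_t a, s_t b, s_t c)
{
  v2df r = load_v2df (&a) * load_v2df (&b) + load_v2df (&c);
  s_t out;
  memcpy (&out, &r, sizeof out);
  return out;
}
```

The aggregate and the vector have the same size and element layout, which
is why a V2DFmode pseudo would be a more natural home for it than TImode.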


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2023-07-12 11:33 ` rguenth at gcc dot gnu.org
@ 2023-07-14 17:13 ` cvs-commit at gcc dot gnu.org
  2023-07-20  8:25 ` cvs-commit at gcc dot gnu.org
  2023-08-04 15:24 ` cvs-commit at gcc dot gnu.org
  7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-14 17:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:8911879415d6c2a7baad88235554a912887a1c5c

commit r14-2526-g8911879415d6c2a7baad88235554a912887a1c5c
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Jul 14 18:10:05 2023 +0100

    i386: Improved insv of DImode/DFmode {high,low}parts into TImode.

    This is the next piece towards a fix for (the x86_64 ABI issues affecting)
    PR 88873.  This patch generalizes the recent tweak to ix86_expand_move
    for setting the highpart of a TImode reg from a DImode source using
    *insvti_highpart_1, to handle both DImode and DFmode sources, and also
    use the recently added *insvti_lowpart_1 for setting the lowpart.

    Although this is another intermediate step (not yet a fix), towards
    enabling *insvti and *concat* patterns to be candidates for TImode STV
    (by using V2DI/V2DF instructions), it already improves things a little.

    For the test case from PR 88873

    typedef struct { double x, y; } s_t;
    typedef double v2df __attribute__ ((vector_size (2 * sizeof(double))));

    s_t foo (s_t a, s_t b, s_t c)
    {
      return (s_t) { fma(a.x, b.x, c.x), fma (a.y, b.y, c.y) };
    }

    With -O2 -march=cascadelake, GCC currently generates:

    Before (29 instructions):
            vmovq   %xmm2, -56(%rsp)
            movq    -56(%rsp), %rdx
            vmovq   %xmm4, -40(%rsp)
            movq    $0, -48(%rsp)
            movq    %rdx, -56(%rsp)
            movq    -40(%rsp), %rdx
            vmovq   %xmm0, -24(%rsp)
            movq    %rdx, -40(%rsp)
            movq    -24(%rsp), %rsi
            movq    -56(%rsp), %rax
            movq    $0, -32(%rsp)
            vmovq   %xmm3, -48(%rsp)
            movq    -48(%rsp), %rcx
            vmovq   %xmm5, -32(%rsp)
            vmovq   %rax, %xmm6
            movq    -40(%rsp), %rax
            movq    $0, -16(%rsp)
            movq    %rsi, -24(%rsp)
            movq    -32(%rsp), %rsi
            vpinsrq $1, %rcx, %xmm6, %xmm6
            vmovq   %rax, %xmm7
            vmovq   %xmm1, -16(%rsp)
            vmovapd %xmm6, %xmm3
            vpinsrq $1, %rsi, %xmm7, %xmm7
            vfmadd132pd     -24(%rsp), %xmm7, %xmm3
            vmovapd %xmm3, -56(%rsp)
            vmovsd  -48(%rsp), %xmm1
            vmovsd  -56(%rsp), %xmm0
            ret

    After (20 instructions):
            vmovq   %xmm2, -56(%rsp)
            movq    -56(%rsp), %rax
            vmovq   %xmm3, -48(%rsp)
            vmovq   %xmm4, -40(%rsp)
            movq    -48(%rsp), %rcx
            vmovq   %xmm5, -32(%rsp)
            vmovq   %rax, %xmm6
            movq    -40(%rsp), %rax
            movq    -32(%rsp), %rsi
            vpinsrq $1, %rcx, %xmm6, %xmm6
            vmovq   %xmm0, -24(%rsp)
            vmovq   %rax, %xmm7
            vmovq   %xmm1, -16(%rsp)
            vmovapd %xmm6, %xmm2
            vpinsrq $1, %rsi, %xmm7, %xmm7
            vfmadd132pd     -24(%rsp), %xmm7, %xmm2
            vmovapd %xmm2, -56(%rsp)
            vmovsd  -48(%rsp), %xmm1
            vmovsd  -56(%rsp), %xmm0
            ret

    2023-07-14  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            * config/i386/i386-expand.cc (ix86_expand_move): Generalize special
            case inserting of 64-bit values into a TImode register, to handle
            both DImode and DFmode using either *insvti_lowpart_1
            or *insvti_highpart_1.


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2023-07-14 17:13 ` cvs-commit at gcc dot gnu.org
@ 2023-07-20  8:25 ` cvs-commit at gcc dot gnu.org
  2023-08-04 15:24 ` cvs-commit at gcc dot gnu.org
  7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-20  8:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:097106972f243ddcbddbbddd9a6bcc076f58b453

commit r14-2668-g097106972f243ddcbddbbddd9a6bcc076f58b453
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Thu Jul 20 09:23:11 2023 +0100

    i386: More TImode parameter passing improvements.

    This patch is the next piece of a solution to the x86_64 ABI issues in
    PR 88873.  This splits the *concat<mode><dwi>3_3 define_insn_and_split
    into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT
    *concatsidi3_3.  This allows us to add an additional alternative to
    the 64-bit version, enabling the register allocator to perform this
    operation using SSE registers, which is implemented/split after reload
    using vec_concatv2di.

    To demonstrate the improvement, the test case from PR88873:

    typedef struct { double x, y; } s_t;

    s_t foo (s_t a, s_t b, s_t c)
    {
      return (s_t){ __builtin_fma(a.x, b.x, c.x),
                    __builtin_fma (a.y, b.y, c.y) };
    }

    when compiled with -O2 -march=cascadelake, currently generates:

    foo:    vmovq   %xmm2, -56(%rsp)
            movq    -56(%rsp), %rax
            vmovq   %xmm3, -48(%rsp)
            vmovq   %xmm4, -40(%rsp)
            movq    -48(%rsp), %rcx
            vmovq   %xmm5, -32(%rsp)
            vmovq   %rax, %xmm6
            movq    -40(%rsp), %rax
            movq    -32(%rsp), %rsi
            vpinsrq $1, %rcx, %xmm6, %xmm6
            vmovq   %xmm0, -24(%rsp)
            vmovq   %rax, %xmm7
            vmovq   %xmm1, -16(%rsp)
            vmovapd %xmm6, %xmm2
            vpinsrq $1, %rsi, %xmm7, %xmm7
            vfmadd132pd     -24(%rsp), %xmm7, %xmm2
            vmovapd %xmm2, -56(%rsp)
            vmovsd  -48(%rsp), %xmm1
            vmovsd  -56(%rsp), %xmm0
            ret

    with this change, we avoid many of the reloads via memory,

    foo:    vpunpcklqdq     %xmm3, %xmm2, %xmm7
            vpunpcklqdq     %xmm1, %xmm0, %xmm6
            vpunpcklqdq     %xmm5, %xmm4, %xmm2
            vmovdqa %xmm7, -24(%rsp)
            vmovdqa %xmm6, %xmm1
            movq    -16(%rsp), %rax
            vpinsrq $1, %rax, %xmm7, %xmm4
            vmovapd %xmm4, %xmm6
            vfmadd132pd     %xmm1, %xmm2, %xmm6
            vmovapd %xmm6, -24(%rsp)
            vmovsd  -16(%rsp), %xmm1
            vmovsd  -24(%rsp), %xmm0
            ret

    2023-07-20  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            * config/i386/i386-expand.cc (ix86_expand_move): Don't call
            force_reg, to use SUBREG rather than create a new pseudo when
            inserting DFmode fields into TImode with insvti_{high,low}part.
            * config/i386/i386.md (*concat<mode><dwi>3_3): Split into two
            define_insn_and_split...
            (*concatditi3_3): 64-bit implementation.  Provide alternative
            that allows register allocation to use SSE registers that is
            split into vec_concatv2di after reload.
            (*concatsidi3_3): 32-bit implementation.

    gcc/testsuite/ChangeLog
            * gcc.target/i386/pr88873.c: New test case.


* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
       [not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2023-07-20  8:25 ` cvs-commit at gcc dot gnu.org
@ 2023-08-04 15:24 ` cvs-commit at gcc dot gnu.org
  7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-08-04 15:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:faa2202ee7fcf039b2016ce5766a2927526c5f78

commit r14-2997-gfaa2202ee7fcf039b2016ce5766a2927526c5f78
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Aug 4 16:23:38 2023 +0100

    i386: Split SUBREGs of SSE vector registers into vec_select insns.

    This patch is the final piece in the series to improve the ABI issues
    affecting PR 88873.  The previous patches tackled inserting DFmode
    values into V2DFmode registers, by introducing insvti_{low,high}part
    patterns.  This patch improves the extraction of DFmode values from
    V2DFmode registers via TImode intermediates.

    I'd initially thought this would require new extvti_{low,high}part
    patterns to be defined, but all that's required is to recognize that
    the SUBREG idioms produced by combine are equivalent to (forms of)
    vec_select patterns.  The target-independent middle-end can't be sure
    that the appropriate vec_select instruction exists on the target,
    hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
    but the backend can provide a define_split stating where and when
    this is useful, for example, considering whether the operand is in
    memory, or whether !TARGET_SSE_MATH and the destination is i387.
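
The equivalence described above can be illustrated at the C level.  This is
a sketch only, assuming a little-endian target such as x86_64 and GNU C
vector extensions; subreg_df and vec_select_df are hypothetical names that
mimic the two RTL forms and are not GCC functions.

```c
#include <assert.h>
#include <string.h>

typedef double v2df __attribute__ ((vector_size (2 * sizeof (double))));

/* SUBREG-style access: reinterpret the 8 bytes at byte offset 0 or 8
   of the 16-byte vector as a double.  */
static double subreg_df (v2df v, unsigned byte_offset)
{
  double d;
  assert (byte_offset == 0 || byte_offset == 8);
  memcpy (&d, (const char *) &v + byte_offset, sizeof d);
  return d;
}

/* vec_select-style access: pick lane 0 (lowpart) or lane 1 (highpart).  */
static double vec_select_df (v2df v, int lane)
{
  assert (lane == 0 || lane == 1);
  return v[lane];
}
```

On such a target, byte offset 0 corresponds to lane 0 and byte offset 8 to
lane 1, so either form names the same double.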

    For pr88873.c, gcc -O2 -march=cascadelake currently generates:

    foo:    vpunpcklqdq     %xmm3, %xmm2, %xmm7
            vpunpcklqdq     %xmm1, %xmm0, %xmm6
            vpunpcklqdq     %xmm5, %xmm4, %xmm2
            vmovdqa %xmm7, -24(%rsp)
            vmovdqa %xmm6, %xmm1
            movq    -16(%rsp), %rax
            vpinsrq $1, %rax, %xmm7, %xmm4
            vmovapd %xmm4, %xmm6
            vfmadd132pd     %xmm1, %xmm2, %xmm6
            vmovapd %xmm6, -24(%rsp)
            vmovsd  -16(%rsp), %xmm1
            vmovsd  -24(%rsp), %xmm0
            ret

    with this patch, we now generate:

    foo:    vpunpcklqdq     %xmm1, %xmm0, %xmm6
            vpunpcklqdq     %xmm3, %xmm2, %xmm7
            vpunpcklqdq     %xmm5, %xmm4, %xmm2
            vmovdqa %xmm6, %xmm1
            vfmadd132pd     %xmm7, %xmm2, %xmm1
            vmovsd  %xmm1, %xmm1, %xmm0
            vunpckhpd       %xmm1, %xmm1, %xmm1
            ret

    The improvement is even more dramatic when compared to the original
    29 instructions shown in comment #8.  GCC 13, for example, required
    12 transfers to/from memory.

    2023-08-04  Roger Sayle  <roger@nextmovesoftware.com>

    gcc/ChangeLog
            * config/i386/sse.md (define_split): Convert highpart:DF extract
            from V2DFmode register into a sse2_storehpd instruction.
            (define_split): Likewise, convert lowpart:DF extract from V2DF
            register into a sse2_storelpd instruction.

    gcc/testsuite/ChangeLog
            * gcc.target/i386/pr88873.c: Tweak to check for improved code.
