public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
From: pinskia at gcc dot gnu.org @ 2021-08-25  0:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2021-08-25

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
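The function being discussed multiplies two 2-D affine transforms.  A
reduced sketch along these lines should match the shape of the code below
(the field names m11, dx and dy appear in the dumps later in the thread;
the remaining names and the exact layout are assumptions):

  struct xform { float m00, m01, m10, m11, dx, dy; };

  struct xform
  multiply (const struct xform *a, const struct xform *b)
  {
    struct xform r;
    /* 2x2 linear part: each result row is the corresponding row of *a
       applied to the 2x2 block of *b.  */
    r.m00 = a->m00 * b->m00 + a->m01 * b->m10;
    r.m01 = a->m00 * b->m01 + a->m01 * b->m11;
    r.m10 = a->m10 * b->m00 + a->m11 * b->m10;
    r.m11 = a->m10 * b->m01 + a->m11 * b->m11;
    /* Translation row: same pattern plus b's own translation.  */
    r.dx = a->dx * b->m00 + a->dy * b->m10 + b->dx;
    r.dy = a->dx * b->m01 + a->dy * b->m11 + b->dy;
    return r;
  }
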
Hmm, on aarch64 we do a decent job of vectorizing this (since GCC 11):
        ldp     d4, d0, [x1]
        ldr     d7, [x0, 16]
        ldp     d6, d5, [x0]
        fmul    v3.2s, v0.2s, v7.s[1]
        ldr     d1, [x1, 16]
        fmul    v2.2s, v0.2s, v6.s[1]
        fmul    v0.2s, v0.2s, v5.s[1]
        fmla    v3.2s, v4.2s, v7.s[0]
        fmla    v2.2s, v4.2s, v6.s[0]
        fmla    v0.2s, v4.2s, v5.s[0]
        fadd    v1.2s, v1.2s, v3.2s
        stp     d2, d0, [x8]
        str     d1, [x8, 16]

I suspect this is because V2SF does not exist on x86_64.
Using -Dfloat=double seems to produce better code for x86_64 (with -mavx2):
        vmovupd (%rdx), %ymm0
        vpermilpd       $0, (%rsi), %ymm1
        movq    %rdi, %rax
        vmovsd  32(%rsi), %xmm5
        vmovsd  40(%rsi), %xmm4
        vpermpd $68, %ymm0, %ymm2
        vpermpd $238, %ymm0, %ymm3
        vmulpd  %ymm2, %ymm1, %ymm2
        vpermilpd       $15, (%rsi), %ymm1
        vmulpd  %ymm3, %ymm1, %ymm1
        vaddpd  %ymm1, %ymm2, %ymm1
        vmulsd  %xmm5, %xmm0, %xmm2
        vmovupd %ymm1, (%rdi)
        vmovapd %xmm0, %xmm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmulsd  %xmm4, %xmm0, %xmm3
        vunpckhpd       %xmm1, %xmm1, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm0
        vmulsd  %xmm5, %xmm1, %xmm1
        vmulsd  %xmm4, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm2, %xmm2
        vaddsd  32(%rdx), %xmm2, %xmm2
        vaddsd  %xmm0, %xmm1, %xmm1
        vaddsd  40(%rdx), %xmm1, %xmm1
        vmovsd  %xmm2, 32(%rdi)
        vmovsd  %xmm1, 40(%rdi)


* [Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
From: rguenth at gcc dot gnu.org @ 2021-08-25  7:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
x86 actually does have V2SF; the issue is that an opportunity for V4SF
vectorization and one for V2SF arrive at the same load groups, which causes
a conflict (there are other PRs about this general issue), so we kill one part:

t.C:18:12: missed:   desired vector type conflicts with earlier one for _2 = b_35(D)->m11;
t.C:18:12: note:  removing SLP instance operations starting from: <retval>.dx = _27;

We also have a bunch of live lanes coming off the remaining vectorized piece,
which makes the code a bit awkward.

Unfortunately we have no way to force 64-bit vectors (V2SF) here to see whether
splitting up the V4SFmode partition would help (I guess it would, as can be
seen from using 'double').
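
In terms of the sketch in comment #4 the conflict is roughly this: the
four statements computing the 2x2 block form one SLP instance that wants
V4SF, while the dx/dy pair forms a second instance that wants V2SF, and
both feed from the same group of loads of b->m00..m11.  An annotated
excerpt of the sketch (an illustration of the grouping, not compiler
output):

  /* SLP instance 1: four lanes, wants V4SF.  */
  r.m00 = a->m00 * b->m00 + a->m01 * b->m10;
  r.m01 = a->m00 * b->m01 + a->m01 * b->m11;
  r.m10 = a->m10 * b->m00 + a->m11 * b->m10;
  r.m11 = a->m10 * b->m01 + a->m11 * b->m11;

  /* SLP instance 2: two lanes, wants V2SF, but reuses the same
     b->m00..m11 load group as instance 1 -- this is the instance that
     gets removed ("starting from: <retval>.dx").  */
  r.dx = a->dx * b->m00 + a->dy * b->m10 + b->dx;
  r.dy = a->dx * b->m01 + a->dy * b->m11 + b->dy;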


* [Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
From: rguenth at gcc dot gnu.org @ 2021-09-20 11:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have a patch that produces

  vect__1.5_42 = MEM <const vector(4) float> [(float *)a_34(D)];
  vect__1.7_47 = VEC_PERM_EXPR <vect__1.5_42, vect__1.5_42, { 0, 0, 2, 2 }>;
  vect__2.10_49 = MEM <const vector(4) float> [(float *)b_35(D)];
  vect__2.12_53 = VEC_PERM_EXPR <vect__2.10_49, vect__2.10_49, { 0, 1, 0, 1 }>;
  vect__3.13_54 = vect__1.7_47 * vect__2.12_53;
  vect__2.30_73 = MEM <const vector(2) float> [(float *)b_35(D)];
  vect__1.18_61 = VEC_PERM_EXPR <vect__1.5_42, vect__1.5_42, { 1, 1, 3, 3 }>;
  vect__2.23_68 = VEC_PERM_EXPR <vect__2.10_49, vect__2.10_49, { 2, 3, 2, 3 }>;
  vect__6.24_69 = vect__1.18_61 * vect__2.23_68;
  vect__7.25_70 = vect__3.13_54 + vect__6.24_69;
  vect__5.40_85 = MEM <const vector(2) float> [(float *)b_35(D) + 8B];
  MEM <vector(4) float> [(float *)&<retval>] = vect__7.25_70;
  vect__21.35_81 = MEM <const vector(2) float> [(float *)a_34(D) + 16B];
  vect__1.36_82 = VEC_PERM_EXPR <vect__21.35_81, vect__21.35_81, { 0, 0 }>;
  vect__22.37_83 = vect__2.30_73 * vect__1.36_82;
  vect__1.46_94 = VEC_PERM_EXPR <vect__21.35_81, vect__21.35_81, { 1, 1 }>;
  vect__24.47_95 = vect__5.40_85 * vect__1.46_94;
  vect__25.48_96 = vect__22.37_83 + vect__24.47_95;
  vect__26.51_98 = MEM <const vector(2) float> [(float *)b_35(D) + 16B];
  vect__27.52_100 = vect__25.48_96 + vect__26.51_98;
  MEM <vector(2) float> [(float *)&<retval> + 16B] = vect__27.52_100;

That means it ends up with some odd vector loads, but with SSE 4.2 it becomes:

        movups  (%rsi), %xmm5
        movups  (%rdx), %xmm1
        movq    %rdi, %rax
        movq    (%rdx), %xmm4
        movq    8(%rdx), %xmm3
        movsldup        %xmm5, %xmm0
        movaps  %xmm1, %xmm2
        movlhps %xmm1, %xmm2
        shufps  $238, %xmm1, %xmm1
        mulps   %xmm0, %xmm2
        movshdup        %xmm5, %xmm0
        mulps   %xmm1, %xmm0
        movq    16(%rsi), %xmm1
        addps   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        movsldup        %xmm1, %xmm0
        movshdup        %xmm1, %xmm1
        mulps   %xmm4, %xmm0
        mulps   %xmm3, %xmm1
        addps   %xmm1, %xmm0
        movq    16(%rdx), %xmm1
        addps   %xmm1, %xmm0
        movlps  %xmm0, 16(%rdi)

Alternatively, -mavx can do some of the required permutes as part of the
loads, and with -mfma we can use an FMA as well:

        vpermilps       $238, (%rdx), %xmm1
        vpermilps       $245, (%rsi), %xmm0
        movq    %rdi, %rax
        vpermilps       $160, (%rsi), %xmm3
        vpermilps       $68, (%rdx), %xmm4
        vmulps  %xmm1, %xmm0, %xmm0
        vmovq   (%rdx), %xmm2
        vfmadd231ps     %xmm4, %xmm3, %xmm0
        vmovq   8(%rdx), %xmm3
        vmovups %xmm0, (%rdi)
        vmovq   16(%rsi), %xmm0
        vmovsldup       %xmm0, %xmm1
        vmovshdup       %xmm0, %xmm0
        vmulps  %xmm3, %xmm0, %xmm0
        vfmadd132ps     %xmm1, %xmm0, %xmm2
        vmovq   16(%rdx), %xmm0
        vaddps  %xmm2, %xmm0, %xmm0
        vmovlps %xmm0, 16(%rdi)

I'm not sure whether the vmovups + vmovs{l,h}dup are any better than doing
two scalar loads plus dups, though - at least it might avoid a
store-to-load-forwarding (STLF) conflict with earlier, smaller stores.


* [Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
From: cvs-commit at gcc dot gnu.org @ 2021-09-27  8:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:6390c5047adb75960f86d56582e6322aaa4d9281

commit r12-3893-g6390c5047adb75960f86d56582e6322aaa4d9281
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Nov 18 09:36:57 2020 +0100

    Allow different vector types for stmt groups

    This allows vectorization (in practice non-loop vectorization) to
    have a stmt participate in vectorizations with different vector types.
    It allows us to remove vect_update_shared_vectype and replace it
    by pushing/popping STMT_VINFO_VECTYPE from SLP_TREE_VECTYPE around
    vect_analyze_stmt and vect_transform_stmt.

    For data-refs the situation is a bit more complicated since we
    analyze alignment info with a specific vector type in mind, which
    doesn't play well when that type changes.

    So the bulk of the change is passing down the actual vector type
    used for a vectorized access to the various accessors of alignment
    info, first and foremost dr_misalignment but also aligned_access_p,
    known_alignment_for_access_p, vect_known_alignment_in_bytes and
    vect_supportable_dr_alignment.  I took the liberty of replacing the
    ALL_CAPS macro accessors with lower-case function invocations.

    The actual behavioral changes are in dr_misalignment, which is now
    the place that factors in the negative step adjustment as well as
    handles alignment queries for a vector type with bigger alignment
    requirements than what we can (or have) analyzed.

    vect_slp_analyze_node_alignment makes use of this: upon receiving
    a vector type with a bigger alignment requirement it re-analyzes the
    DR with respect to it, but keeps an older, more precise result if
    possible.  In this context it might be possible to do the analysis
    just once and, instead of analyzing with respect to a specific
    desired alignment, look for the biggest alignment for which we can
    still compute a known misalignment.

    The ChangeLog includes the functional changes but not the bulk of
    the mechanical alignment accessor API changes - I hope that is
    acceptable.

    2021-09-17  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/97351
            PR tree-optimization/97352
            PR tree-optimization/82426
            * tree-vectorizer.h (dr_misalignment): Add vector type
            argument.
            (aligned_access_p): Likewise.
            (known_alignment_for_access_p): Likewise.
            (vect_supportable_dr_alignment): Likewise.
            (vect_known_alignment_in_bytes): Likewise.  Refactor.
            (DR_MISALIGNMENT): Remove.
            (vect_update_shared_vectype): Likewise.
            * tree-vect-data-refs.c (dr_misalignment): Refactor, handle
            a vector type with larger alignment requirement and apply
            the negative step adjustment here.
            (vect_calculate_target_alignment): Remove.
            (vect_compute_data_ref_alignment): Get explicit vector type
            argument, do not apply a negative step alignment adjustment
            here.
            (vect_slp_analyze_node_alignment): Re-analyze alignment
            when we re-visit the DR with a bigger desired alignment but
            keep more precise results from smaller alignments.
            * tree-vect-slp.c (vect_update_shared_vectype): Remove.
            (vect_slp_analyze_node_operations_1): Do not update the
            shared vector type on stmts.
            * tree-vect-stmts.c (vect_analyze_stmt): Push/pop the
            vector type of an SLP node to the representative stmt-info.
            (vect_transform_stmt): Likewise.

            * gcc.target/i386/vect-pr82426.c: New testcase.
            * gcc.target/i386/vect-pr97352.c: Likewise.
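
A minimal sketch of the interface shape the ChangeLog describes
(signatures inferred from the entries above rather than copied from the
GCC sources; dr_vec_info, tree and DR_MISALIGNMENT_UNKNOWN are
pre-existing GCC-internal names, declared here only as stand-ins so the
fragment is self-contained):

  struct dr_vec_info;               /* per-data-reference info (GCC) */
  typedef union tree_node *tree;    /* GCC's generic tree pointer */
  #define DR_MISALIGNMENT_UNKNOWN (-1)

  /* Misalignment of the access described by DR_INFO when it is
     vectorized with VECTYPE; the vector type is now an explicit
     argument instead of being taken from the DR itself.  */
  extern int dr_misalignment (struct dr_vec_info *dr_info, tree vectype);

  /* The former DR_MISALIGNMENT macro is gone; queries go through the
     function, keyed on the actual vector type of the access.  */
  static inline int
  aligned_access_p (struct dr_vec_info *dr_info, tree vectype)
  {
    return dr_misalignment (dr_info, vectype) == 0;
  }

  static inline int
  known_alignment_for_access_p (struct dr_vec_info *dr_info, tree vectype)
  {
    return dr_misalignment (dr_info, vectype) != DR_MISALIGNMENT_UNKNOWN;
  }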


* [Bug tree-optimization/82426] Missed tree-slp-vectorization on -O2 and -O3
From: rguenth at gcc dot gnu.org @ 2021-09-27  8:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82426

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
   Target Milestone|---                         |12.0
             Status|ASSIGNED                    |RESOLVED

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 12.

