[Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc
@ 2021-03-05 14:54 hubicka at gcc dot gnu.org
  2021-03-08  8:32 ` [Bug tree-optimization/99412] " rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-03-05 14:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

            Bug ID: 99412
           Summary: s352 benchmark of TSVC is vectorized by clang and not
                    by gcc
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

typedef float real_t;

#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D],b[LEN_1D];
int main ()
{

//    loop rerolling
//    unrolled dot product

    real_t dot;
    for (int nl = 0; nl < 8*iterations; nl++) {
        dot = (real_t)0.;
        for (int i = 0; i < LEN_1D; i += 5) {
            dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2]
                * b[i + 2] + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];
        }
    }

    return dot;
}


clang does:
main:                                   # @main
        .cfi_startproc
# %bb.0:
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
        vxorps  %xmm0, %xmm0, %xmm0
        movq    $-5, %rcx
        .p2align        4, 0x90
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovups b+20(,%rcx,4), %xmm1
        vmovss  b+36(,%rcx,4), %xmm2            # xmm2 = mem[0],zero,zero,zero
        vmulps  a+20(,%rcx,4), %xmm1, %xmm1
        vpermilpd       $1, %xmm1, %xmm3        # xmm3 = xmm1[1,0]
        vaddps  %xmm3, %xmm1, %xmm1
        vmovshdup       %xmm1, %xmm3            # xmm3 = xmm1[1,1,3,3]
        vaddss  %xmm3, %xmm1, %xmm1
        vfmadd231ss     a+36(,%rcx,4), %xmm2, %xmm1 # xmm1 = (xmm2 * mem) +
xmm1
        addq    $5, %rcx
        vaddss  %xmm0, %xmm1, %xmm0
        cmpq    $31995, %rcx                    # imm = 0x7CFB
        jb      .LBB0_2
# %bb.3:                                #   in Loop: Header=BB0_1 Depth=1
        incl    %eax
        cmpl    $800000, %eax                   # imm = 0xC3500
        jne     .LBB0_1
# %bb.4:
        vcvttss2si      %xmm0, %eax
        retq

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
@ 2021-03-08  8:32 ` rguenth at gcc dot gnu.org
  2021-06-09 12:54 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-08  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |tree-optimization
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Blocks|                            |53947
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
         Depends on|                            |97832
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2021-03-08

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
With -fno-tree-reassoc we detect the reduction chain and produce

.L3:
        vmovaps b(%rax), %ymm5
        vmovaps b+32(%rax), %ymm6
        addq    $160, %rax
        vfmadd231ps     a-160(%rax), %ymm5, %ymm1
        vmovaps b-96(%rax), %ymm7
        vfmadd231ps     a-128(%rax), %ymm6, %ymm0
        vmovaps b-64(%rax), %ymm5
        vmovaps b-32(%rax), %ymm6
        vfmadd231ps     a-96(%rax), %ymm7, %ymm2
        vfmadd231ps     a-64(%rax), %ymm5, %ymm3
        vfmadd231ps     a-32(%rax), %ymm6, %ymm4
        cmpq    $128000, %rax
        jne     .L3
        vaddps  %ymm1, %ymm0, %ymm0
        vaddps  %ymm2, %ymm0, %ymm0
        vaddps  %ymm3, %ymm0, %ymm0
        vaddps  %ymm4, %ymm0, %ymm0
        vextractf128    $0x1, %ymm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm1
        vmovhlps        %xmm1, %xmm1, %xmm0
        vaddps  %xmm1, %xmm0, %xmm0
        vshufps $85, %xmm0, %xmm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm0
        decl    %edx
        jne     .L2

we're not re-rolling and thus are forced to use a VF of 4 here.

Note that LLVM doesn't seem to veectorize the loop but instead vectorizes
the basic-block which isn't what TSVC looks for (but that would work for
non-fast-math).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
[Bug 97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than
-O3

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
  2021-03-08  8:32 ` [Bug tree-optimization/99412] " rguenth at gcc dot gnu.org
@ 2021-06-09 12:54 ` rguenth at gcc dot gnu.org
  2023-01-11 19:14 ` hubicka at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-06-09 12:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412
Bug 99412 depends on bug 97832, which changed state.

Bug 97832 Summary: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
  2021-03-08  8:32 ` [Bug tree-optimization/99412] " rguenth at gcc dot gnu.org
  2021-06-09 12:54 ` rguenth at gcc dot gnu.org
@ 2023-01-11 19:14 ` hubicka at gcc dot gnu.org
  2023-01-12  9:12 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: hubicka at gcc dot gnu.org @ 2023-01-11 19:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
This is also seen with zen4 comparing gcc and aocc. (about 2.3 times
differnece)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-01-11 19:14 ` hubicka at gcc dot gnu.org
@ 2023-01-12  9:12 ` rguenth at gcc dot gnu.org
  2023-01-12  9:30 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-12  9:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #2)
> This is also seen with zen4 comparing gcc and aocc. (about 2.3 times
> differnece)

Disabling

@@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
                          binary op are chosen wisely.  */
                       int len = ops.length ();
                       if (len >= 3)
                         swap_ops_for_binary_stmt (ops, len - 3, stmt);

will naturally create the reduction chain (or leave it in place) given the
current rank computation.  We do have (somewhat) robust fallback from
reduction chain to reduction (via reduction path support), so I think this
change would be OK.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-01-12  9:12 ` rguenth at gcc dot gnu.org
@ 2023-01-12  9:30 ` rguenth at gcc dot gnu.org
  2023-01-12 13:30 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-12  9:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #3)
> (In reply to Jan Hubicka from comment #2)
> > This is also seen with zen4 comparing gcc and aocc. (about 2.3 times
> > differnece)
> 
> Disabling
> 
> @@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
>                           binary op are chosen wisely.  */
>                        int len = ops.length ();
>                        if (len >= 3)
>                          swap_ops_for_binary_stmt (ops, len - 3, stmt);
> 
> will naturally create the reduction chain (or leave it in place) given the
> current rank computation.  We do have (somewhat) robust fallback from
> reduction chain to reduction (via reduction path support), so I think this
> change would be OK.

The code originated from r0-111616-gdf7b0cc4aae062, the reassoc rewrite by
Jeff back in 2005 for GCC 4.3 (or 4.2, don't have that tree around anymore).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-01-12  9:30 ` rguenth at gcc dot gnu.org
@ 2023-01-12 13:30 ` cvs-commit at gcc dot gnu.org
  2023-01-12 13:42 ` rguenth at gcc dot gnu.org
  2023-01-12 13:42 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-01-12 13:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b073f2b098ba7819450d6c14a0fb96cb1c09f242

commit r13-5122-gb073f2b098ba7819450d6c14a0fb96cb1c09f242
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jan 12 11:18:22 2023 +0100

    tree-optimization/99412 - reassoc and reduction chains

    With -ffast-math we end up associating reduction chains and break
    them - this is because of old code that tries to rectify reductions
    into a shape likened by the vectorizer.  Nowadays the rank compute
    produces correct association for reduction chains and the vectorizer
    has robust support to fall back to a regular reductions (via
    reduction path) when it turns out to be not a proper reduction chain.

    So this patch removes the special code in reassoc which makes
    the TSVC s352 vectorized with -Ofast (it is already without
    -ffast-math).

            PR tree-optimization/99412
            * tree-ssa-reassoc.cc (is_phi_for_stmt): Remove.
            (swap_ops_for_binary_stmt): Remove reduction handling.
            (rewrite_expr_tree_parallel): Adjust.
            (reassociate_bb): Likewise.
            * tree-parloops.cc (build_new_reduction): Handle MINUS_EXPR.

            * gcc.dg/vect/pr99412.c: New testcase.
            * gcc.dg/tree-ssa/reassoc-47.c: Adjust comment.
            * gcc.dg/tree-ssa/reassoc-48.c: Remove.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-01-12 13:30 ` cvs-commit at gcc dot gnu.org
@ 2023-01-12 13:42 ` rguenth at gcc dot gnu.org
  2023-01-12 13:42 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-12 13:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC now does

.L2:
        vmovaps b(%rax), %xmm6
        vmulps  a(%rax), %xmm6, %xmm0
        addq    $80, %rax
        vmovaps b-64(%rax), %xmm7
        vmovaps b-48(%rax), %xmm6
        vaddps  %xmm0, %xmm5, %xmm5
        vmulps  a-64(%rax), %xmm7, %xmm0
        vmovaps b-32(%rax), %xmm7
        vaddps  %xmm0, %xmm1, %xmm1
        vmulps  a-48(%rax), %xmm6, %xmm0
        vmovaps b-16(%rax), %xmm6
        vaddps  %xmm0, %xmm4, %xmm4
        vmulps  a-32(%rax), %xmm7, %xmm0
        vaddps  %xmm0, %xmm2, %xmm2
        vmulps  a-16(%rax), %xmm6, %xmm0
        vaddps  %xmm0, %xmm3, %xmm3
        cmpq    $128000, %rax
        jne     .L2

thus uses a VF of 4 with -Ofast.  Your LLVM snippet uses 5 lanes
instead of our 20 with four lanes in a V4SF and one lane in a scalar.
That's interesting but not something we support.

Re-rolling would mean using a single v4sf 4 lane vector here.  For
a pure SLP loop something like this should be possible without too
much hassle I think.  We'd just need to try ... (and think of if it's
worth in real life)

For the purpose of the Summary this is fixed now.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug tree-optimization/99412] s352 benchmark of TSVC is vectorized by clang and not by gcc
  2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-01-12 13:42 ` rguenth at gcc dot gnu.org
@ 2023-01-12 13:42 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-12 13:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |13.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-01-12 13:42 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-05 14:54 [Bug middle-end/99412] New: s352 benchmark of TSVC is vectorized by clang and not by gcc hubicka at gcc dot gnu.org
2021-03-08  8:32 ` [Bug tree-optimization/99412] " rguenth at gcc dot gnu.org
2021-06-09 12:54 ` rguenth at gcc dot gnu.org
2023-01-11 19:14 ` hubicka at gcc dot gnu.org
2023-01-12  9:12 ` rguenth at gcc dot gnu.org
2023-01-12  9:30 ` rguenth at gcc dot gnu.org
2023-01-12 13:30 ` cvs-commit at gcc dot gnu.org
2023-01-12 13:42 ` rguenth at gcc dot gnu.org
2023-01-12 13:42 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).