public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/71414] 2x slower than clang  summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
       [not found] <bug-71414-4@http.gcc.gnu.org/bugzilla/>
@ 2020-10-11 15:39 ` freddie at witherden dot org
  2023-06-06 20:42 ` drraph at gmail dot com
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: freddie at witherden dot org @ 2020-10-11 15:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Freddie Witherden <freddie at witherden dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |freddie at witherden dot org

--- Comment #11 from Freddie Witherden <freddie at witherden dot org> ---
I've been looking into this and the big difference appears to be that when
Clang unrolls the loop it does so using multiple accumulators (and indeed it
does this without needing to be told to unroll).  Given:

double acc(double *x, int n)
{
    double a = 0;
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a += x[i];
    return a;
}


and compiling with clang -march=native -Ofast -fopenmp -S, the core loop
reads as:

        vaddpd  (%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  32(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  64(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  96(%rdi,%rsi,8), %ymm3, %ymm3
        vaddpd  128(%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  160(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  192(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  224(%rdi,%rsi,8), %ymm3, %ymm3
        vaddpd  256(%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  288(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  320(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  352(%rdi,%rsi,8), %ymm3, %ymm3
        vaddpd  384(%rdi,%rsi,8), %ymm0, %ymm0
        vaddpd  416(%rdi,%rsi,8), %ymm1, %ymm1
        vaddpd  448(%rdi,%rsi,8), %ymm2, %ymm2
        vaddpd  480(%rdi,%rsi,8), %ymm3, %ymm3

which is heavily unrolled and uses four separate accumulators to hide the
latency of the vector adds.  Interestingly, one could argue that Clang is not
using enough accumulators, given that Skylake can dual-issue vector adds and
they have a latency of 4 cycles (implying you want 8 separate accumulators).
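
In C terms, what Clang generates is roughly equivalent to hand-writing the
reduction with several independent accumulators.  A sketch (illustrative
only, not compiler output; it assumes n is a multiple of 16 and omits
remainder handling):

#include <immintrin.h>

double acc_multi(const double *x, int n)
{
    __m256d a0 = _mm256_setzero_pd();
    __m256d a1 = _mm256_setzero_pd();
    __m256d a2 = _mm256_setzero_pd();
    __m256d a3 = _mm256_setzero_pd();

    /* Four independent dependency chains; on Skylake (vaddpd latency 4,
       two adds per cycle) eight chains would hide the latency fully.  */
    for (int i = 0; i < n; i += 16) {
        a0 = _mm256_add_pd(a0, _mm256_loadu_pd(x + i));
        a1 = _mm256_add_pd(a1, _mm256_loadu_pd(x + i + 4));
        a2 = _mm256_add_pd(a2, _mm256_loadu_pd(x + i + 8));
        a3 = _mm256_add_pd(a3, _mm256_loadu_pd(x + i + 12));
    }

    /* Combine the partial sums only once, after the loop.  */
    double t[4];
    _mm256_storeu_pd(t, _mm256_add_pd(_mm256_add_pd(a0, a1),
                                      _mm256_add_pd(a2, a3)));
    return t[0] + t[1] + t[2] + t[3];
}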

GCC 10 with gcc -march=skylake -Ofast -fopenmp -S test.c -funroll-loops
instead produces:

        vaddpd  -224(%r8), %ymm1, %ymm2
        vaddpd  -192(%r8), %ymm2, %ymm3
        vaddpd  -160(%r8), %ymm3, %ymm4
        vaddpd  -128(%r8), %ymm4, %ymm5
        vaddpd  -96(%r8), %ymm5, %ymm6
        vaddpd  -64(%r8), %ymm6, %ymm7
        vaddpd  -32(%r8), %ymm7, %ymm0

which, although unrolled, is not a useful unrolling due to the dependency
chain through the single accumulator.  Indeed, I would not be surprised if
the performance is similar to that of the non-unrolled code, as the
loop-related cruft can be hidden.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/71414] 2x slower than clang  summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
       [not found] <bug-71414-4@http.gcc.gnu.org/bugzilla/>
  2020-10-11 15:39 ` [Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions freddie at witherden dot org
@ 2023-06-06 20:42 ` drraph at gmail dot com
  2023-06-07  6:54 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 5+ messages in thread
From: drraph at gmail dot com @ 2023-06-06 20:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Raphael C <drraph at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |drraph at gmail dot com

--- Comment #12 from Raphael C <drraph at gmail dot com> ---
This problem has recently been discussed at:

https://stackoverflow.com/questions/76407241/why-is-cython-so-much-slower-than-numba-for-this-simple-loop

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/71414] 2x slower than clang  summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
       [not found] <bug-71414-4@http.gcc.gnu.org/bugzilla/>
  2020-10-11 15:39 ` [Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions freddie at witherden dot org
  2023-06-06 20:42 ` drraph at gmail dot com
@ 2023-06-07  6:54 ` rguenth at gcc dot gnu.org
  2023-06-07  7:24 ` crazylht at gmail dot com
  2023-06-07  7:44 ` rguenther at suse dot de
  4 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-07  6:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
The target now has the ability to tell the vectorizer to choose a larger VF
based on the cost info it got for the default VF, so the x86 backend could
make use of that.  For example with the following patch we'll unroll the
vectorized loops 4 times (of course the actual check for small reduction
loops and a register pressure estimate is missing).  That generates

.L4:
        vaddps  (%rax), %zmm1, %zmm1
        vaddps  64(%rax), %zmm2, %zmm2
        addq    $256, %rax
        vaddps  -128(%rax), %zmm0, %zmm0
        vaddps  -64(%rax), %zmm3, %zmm3
        cmpq    %rcx, %rax
        jne     .L4
        movq    %rdx, %rax
        andq    $-64, %rax
        vaddps  %zmm3, %zmm0, %zmm0
        vaddps  %zmm2, %zmm1, %zmm1
        vaddps  %zmm1, %zmm0, %zmm1
... more epilog ...

with -march=znver4 on current trunk.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d4ff56ee8dd..53c09bb9d9c 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
                              stmt_vec_info stmt_info, slp_tree node,
                              tree vectype, int misalign,
                              vect_cost_model_location where) override;
+  void finish_cost (const vector_costs *uncast_scalar_costs);
 };

+void
+ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
+{
+  auto *scalar_costs
+    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
+  m_suggested_unroll_factor = 4;
+  vector_costs::finish_cost (scalar_costs);
+}
+
 /* Implement targetm.vectorize.create_costs.  */

 static vector_costs *
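
The hardcoded factor of 4 is of course only for demonstration.  A real
change would presumably gate it on the loop actually containing a reduction
(and on register pressure); very roughly along these lines - untested, and
the accessors used (dyn_cast<loop_vec_info>, LOOP_VINFO_REDUCTIONS) are from
memory and would need checking:

void
ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
{
  auto *scalar_costs
    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
  /* Sketch: only suggest unrolling for loops containing a reduction;
     a real patch would also estimate register pressure and the likely
     trip count before picking a factor.  */
  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
    if (!LOOP_VINFO_REDUCTIONS (loop_vinfo).is_empty ())
      m_suggested_unroll_factor = 4;
  vector_costs::finish_cost (scalar_costs);
}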

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/71414] 2x slower than clang  summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
       [not found] <bug-71414-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2023-06-07  6:54 ` rguenth at gcc dot gnu.org
@ 2023-06-07  7:24 ` crazylht at gmail dot com
  2023-06-07  7:44 ` rguenther at suse dot de
  4 siblings, 0 replies; 5+ messages in thread
From: crazylht at gmail dot com @ 2023-06-07  7:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #13)
> The target now has the ability to tell the vectorizer to choose a larger VF
> based on the cost info it got for the default VF, so the x86 backend could
> make use of that.  For example with the following patch we'll unroll the
> vectorized loops 4 times (of course the actual check for small reduction
> loops and a register pressure estimate is missing).  That generates
> 
> .L4:
>         vaddps  (%rax), %zmm1, %zmm1
>         vaddps  64(%rax), %zmm2, %zmm2
>         addq    $256, %rax
>         vaddps  -128(%rax), %zmm0, %zmm0
>         vaddps  -64(%rax), %zmm3, %zmm3
>         cmpq    %rcx, %rax
>         jne     .L4
>         movq    %rdx, %rax
>         andq    $-64, %rax
>         vaddps  %zmm3, %zmm0, %zmm0
>         vaddps  %zmm2, %zmm1, %zmm1
>         vaddps  %zmm1, %zmm0, %zmm1
> ... more epilog ...
> 
> with -march=znver4 on current trunk.
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ff56ee8dd..53c09bb9d9c 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
>                               stmt_vec_info stmt_info, slp_tree node,
>                               tree vectype, int misalign,
>                               vect_cost_model_location where) override;
> +  void finish_cost (const vector_costs *uncast_scalar_costs);
>  };
>  
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> +{
> +  auto *scalar_costs
> +    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
> +  m_suggested_unroll_factor = 4;
> +  vector_costs::finish_cost (scalar_costs);

I remember we posted a patch for that:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html

One regression observed is that the VF of the epilogue loop will increase
(from xmm to ymm) after unrolling the vectorized loop, and this regressed
performance for low-trip-count loops (similar to -mprefer-vector-width=512).

Also, for the case in the PR, I'm trying to enable
-fvariable-expansion-in-unroller together with -funroll-loops, so that the
partial sums will break the reduction chain.
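
To illustrate the intent (hand-written, not actual unroller output):
variable expansion splits the accumulator into one copy per unrolled body,
so the copies no longer form a single long dependency chain, e.g.

double acc_expanded(const double *x, int n)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i;

    /* Each partial sum is an independent dependency chain.  */
    for (i = 0; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)       /* remainder */
        a0 += x[i];

    /* Partial sums are combined only after the loop.  */
    return (a0 + a1) + (a2 + a3);
}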

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Bug tree-optimization/71414] 2x slower than clang  summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
       [not found] <bug-71414-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2023-06-07  7:24 ` crazylht at gmail dot com
@ 2023-06-07  7:44 ` rguenther at suse dot de
  4 siblings, 0 replies; 5+ messages in thread
From: rguenther at suse dot de @ 2023-06-07  7:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 7 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
> 
> --- Comment #14 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #13)
> > The target now has the ability to tell the vectorizer to choose a larger VF
> > based on the cost info it got for the default VF, so the x86 backend could
> > make use of that.  For example with the following patch we'll unroll the
> > vectorized loops 4 times (of course the actual check for small reduction
> > loops and a register pressure estimate is missing).  That generates
> > 
> > .L4:
> >         vaddps  (%rax), %zmm1, %zmm1
> >         vaddps  64(%rax), %zmm2, %zmm2
> >         addq    $256, %rax
> >         vaddps  -128(%rax), %zmm0, %zmm0
> >         vaddps  -64(%rax), %zmm3, %zmm3
> >         cmpq    %rcx, %rax
> >         jne     .L4
> >         movq    %rdx, %rax
> >         andq    $-64, %rax
> >         vaddps  %zmm3, %zmm0, %zmm0
> >         vaddps  %zmm2, %zmm1, %zmm1
> >         vaddps  %zmm1, %zmm0, %zmm1
> > ... more epilog ...
> > 
> > with -march=znver4 on current trunk.
> > 
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..53c09bb9d9c 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
> >                               stmt_vec_info stmt_info, slp_tree node,
> >                               tree vectype, int misalign,
> >                               vect_cost_model_location where) override;
> > +  void finish_cost (const vector_costs *uncast_scalar_costs);
> >  };
> >  
> > +void
> > +ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> > +{
> > +  auto *scalar_costs
> > +    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
> > +  m_suggested_unroll_factor = 4;
> > +  vector_costs::finish_cost (scalar_costs);
> 
> I remember we posted a patch for that:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html
> 
> One regression observed is that the VF of the epilogue loop will increase
> (from xmm to ymm) after unrolling the vectorized loop, and this regressed
> performance for low-trip-count loops (similar to -mprefer-vector-width=512).

Ah, yeah.  We could resort to checking estimated_number_of_iterations
to guide us when profile feedback is available.  I'm also (again) working
on fully masked epilogues, which should reduce the impact on low-trip-count
loops.
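
For illustration only (hand-written with AVX-512 intrinsics, not vectorizer
output), a masked epilogue consumes the tail with a single masked load
instead of a separate lower-VF epilogue loop:

#include <immintrin.h>

float sum_masked_tail(const float *x, int n)
{
    __m512 acc = _mm512_setzero_ps();
    int i = 0;
    for (; i + 16 <= n; i += 16)
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(x + i));

    /* Mask covering the remaining n - i (< 16) elements; inactive
       lanes load as zero and do not affect the sum.  */
    __mmask16 k = (__mmask16) ((1u << (n - i)) - 1);
    acc = _mm512_add_ps(acc, _mm512_maskz_loadu_ps(k, x + i));

    float t[16], s = 0;
    _mm512_storeu_ps(t, acc);
    for (int j = 0; j < 16; j++)
        s += t[j];
    return s;
}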

> Also, for the case in the PR, I'm trying to enable
> -fvariable-expansion-in-unroller together with -funroll-loops, so that the
> partial sums will break the reduction chain.

Probably also a good idea - maybe -fvariable-expansion-in-unroller can
be made smarter and guided by register pressure?

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-06-07  7:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-71414-4@http.gcc.gnu.org/bugzilla/>
2020-10-11 15:39 ` [Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions freddie at witherden dot org
2023-06-06 20:42 ` drraph at gmail dot com
2023-06-07  6:54 ` rguenth at gcc dot gnu.org
2023-06-07  7:24 ` crazylht at gmail dot com
2023-06-07  7:44 ` rguenther at suse dot de
