public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/100076] New: eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on SKX/CLX
@ 2021-04-14  2:21 crazylht at gmail dot com
  2021-04-14  3:16 ` [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3 hjl.tools at gmail dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: crazylht at gmail dot com @ 2021-04-14  2:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

            Bug ID: 100076
           Summary: eembc/automotive/basefp01 has 30.3% regression compare
                    -O2 -ftree-vectorize with -O2 on SKX/CLX
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
                CC: hjl.tools at gmail dot com
  Target Milestone: ---

Refer to https://godbolt.org/z/e3nfz3xvW

cat testcase.c

int
t_run_test(double a)
{

        /* varsize, constantP and constantQ are not defined in this excerpt. */
        static double P1, Q1;
        static varsize polyX1[9];
        polyX1[1] = a;
        P1 = (varsize)constantP[0];
        polyX1[1] = a;

// Loop 1
        for( int i1 = 2 ; i1 <= 8 ; i1++ )
        {
            polyX1[i1] = polyX1[i1 - 1] * polyX1[1] ;
        }


        P1 = (varsize)constantP[0] ;
// Loop 2
        for( int i1 = 1 ; i1 <= 8 ; i1++ )
        {
            P1 += (varsize)constantP[i1] * polyX1[i1] ;
        }


        Q1 = (varsize)constantQ[0] ;
// Loop 3
        for( int i1 = 1 ; i1 <= 8 ; i1++ )
        {
            Q1 += (varsize)constantQ[i1] * polyX1[i1] ;
        }


        return a = a * P1 / Q1 ;

}

Loop 1 writes the array polyX1, which Loop 2 and Loop 3 then read. With
-O2 -ftree-vectorize, Loop 2 and Loop 3 are vectorized, but Loop 1 is not,
because it has a loop-carried dependence. As a result, polyX1 is written with
64-bit stores in Loop 1 and read with 128-bit loads in Loop 2 and Loop 3, which
causes store-forwarding stalls that hurt performance.
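
For anyone who wants to compile the kernel on its own, a minimal
self-contained sketch follows. It assumes varsize is double and uses made-up
coefficient tables (constantP all 2.0, constantQ all 1.0) so the result is easy
to check by hand; the real definitions are not part of the excerpt above.

```c
/* Sketch only: the varsize typedef and both coefficient tables are
   assumptions, not the benchmark's actual definitions. */
typedef double varsize;

static const varsize constantP[9] = { 2, 2, 2, 2, 2, 2, 2, 2, 2 };
static const varsize constantQ[9] = { 1, 1, 1, 1, 1, 1, 1, 1, 1 };

int
t_run_test(double a)
{
        static double P1, Q1;
        static varsize polyX1[9];

        polyX1[1] = a;

        /* Loop 1: each iteration reads the element stored by the previous
           one, so the powers are written one scalar at a time. */
        for (int i1 = 2; i1 <= 8; i1++)
                polyX1[i1] = polyX1[i1 - 1] * polyX1[1];

        /* Loop 2: reduction over polyX1, a vectorization candidate. */
        P1 = constantP[0];
        for (int i1 = 1; i1 <= 8; i1++)
                P1 += constantP[i1] * polyX1[i1];

        /* Loop 3: second reduction over the same freshly stored array. */
        Q1 = constantQ[0];
        for (int i1 = 1; i1 <= 8; i1++)
                Q1 += constantQ[i1] * polyX1[i1];

        /* As in the original, the double result is converted to int. */
        return a = a * P1 / Q1;
}
```

With these assumed tables, a = 2.0 gives P1 = 1022 and Q1 = 511, so the
function returns 4.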

* [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
From: hjl.tools at gmail dot com @ 2021-04-14  3:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

--- Comment #1 from H.J. Lu <hjl.tools at gmail dot com> ---
Is -O3 slower than -O3 -fno-tree-vectorize? If not, why?

* [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
From: crazylht at gmail dot com @ 2021-04-14  5:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to H.J. Lu from comment #1)
> Is -O3 slower than -O3 -fno-tree-vectorize? If not, why?

For this case -O3 is OK, because -O3 enables pass_cunroll, which completely
unrolls loop1/loop2/loop3, and a later pass_fre eliminates the redundant loads
of polyX1 in loop2 and loop3 for both -O3 and -O3 -fno-tree-vectorize.
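
To illustrate, here is a hand-written approximation of what complete
unrolling plus redundant-load elimination amounts to: the powers stay in local
scalars and the reductions consume them directly, so no wide load reads back a
narrow store. This is not actual compiler output; eval_unrolled and the
pointer parameters p and q are hypothetical names, and the result is returned
as a double rather than converted to int.

```c
/* Hand-written approximation, not GCC output. p and q each point to
   9 coefficients; the function name and signature are hypothetical. */
double
eval_unrolled(double a, const double *p, const double *q)
{
        /* Loop 1 fully unrolled: the powers live in scalars. */
        double x1 = a;
        double x2 = x1 * a;
        double x3 = x2 * a;
        double x4 = x3 * a;
        double x5 = x4 * a;
        double x6 = x5 * a;
        double x7 = x6 * a;
        double x8 = x7 * a;

        /* Loop 2 fully unrolled, same left-to-right summation order. */
        double P1 = p[0];
        P1 += p[1] * x1;
        P1 += p[2] * x2;
        P1 += p[3] * x3;
        P1 += p[4] * x4;
        P1 += p[5] * x5;
        P1 += p[6] * x6;
        P1 += p[7] * x7;
        P1 += p[8] * x8;

        /* Loop 3 fully unrolled, reusing the same scalars. */
        double Q1 = q[0];
        Q1 += q[1] * x1;
        Q1 += q[2] * x2;
        Q1 += q[3] * x3;
        Q1 += q[4] * x4;
        Q1 += q[5] * x5;
        Q1 += q[6] * x6;
        Q1 += q[7] * x7;
        Q1 += q[8] * x8;

        return a * P1 / Q1;
}
```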

* [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
From: rguenth at gcc dot gnu.org @ 2021-04-14  7:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
See also PR90579.  I wonder if there's a way to tell the CPU not to forward
to a load - does emitting an lfence between the scalar store and the vector
load fix the issue?

ISTR that the "bad" effect is not so much the delay between flushing the
store buffers to L1 and then loading from L1 but when the CPU speculates
there's no conflicting [not forwardable] store in the store buffer and thus
fetches a wrong value from L1 and thus we have to flush and restart the
pipeline after we discover the conflict late?

Otherwise it's really hard to address this kind of issue - for doubles
and SSE vectorization we might simply vectorize all loads using scalars
but that doesn't scale for larger VFs.  It might eventually be enough to
force peel a single iteration of all loops at the cost of code size
(and performance if there's no STLF issue).
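
As a rough illustration of "vectorize all loads using scalars" for doubles
with SSE, the sketch below builds a 128-bit vector from two 64-bit loads
instead of issuing one 128-bit load, so each load matches the width of an
earlier scalar store. The helper names dot2_wide and dot2_split are
hypothetical, and the plain-C fallback exists only so the snippet compiles on
targets without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>

/* One 128-bit load of x: this is the pattern that may be unable to
   forward from two separate, just-issued 64-bit stores. */
double
dot2_wide(const double *c, const double *x)
{
        __m128d v = _mm_mul_pd(_mm_loadu_pd(c), _mm_loadu_pd(x));
        double out[2];
        _mm_storeu_pd(out, v);
        return out[0] + out[1];
}

/* Two 64-bit loads of x combined into one vector: low half first,
   then the high half. */
double
dot2_split(const double *c, const double *x)
{
        __m128d xv = _mm_loadh_pd(_mm_load_sd(x), x + 1);
        __m128d v = _mm_mul_pd(_mm_loadu_pd(c), xv);
        double out[2];
        _mm_storeu_pd(out, v);
        return out[0] + out[1];
}
#else
/* Fallback for non-SSE2 targets: both helpers are plain scalar code. */
double
dot2_wide(const double *c, const double *x)
{
        return c[0] * x[0] + c[1] * x[1];
}

double
dot2_split(const double *c, const double *x)
{
        return c[0] * x[0] + c[1] * x[1];
}
#endif
```

Both helpers compute the same value; only the load pattern for x differs.
With a vectorization factor of 2 this costs one extra load per vector, and
the count grows with wider vectors.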

That said, CPU design folks should try to address this by making the
penalty smaller ;)

Can you share a runtime testcase?

* [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
From: crazylht at gmail dot com @ 2021-04-14  8:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 50590
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50590&action=edit
eembc_automotive_basefp01.cpp

* [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
From: rguenth at gcc dot gnu.org @ 2021-04-15  7:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-15
             Status|UNCONFIRMED                 |NEW

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note even when avoiding the STLF hit the vectorized version is slower.
You can use -mtune-ctl=^sse_unaligned_load_optimal to force loading
the lower/upper half of vectors separately.

The reason is that without -ffast-math we are using an in-order reduction
which doesn't save us much but instead just combines dependence chains
here.  We do have a related bug for this somewhere.
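
A sketch of the difference, written by hand in scalar C rather than taken
from compiler output: the in-order form keeps a single accumulator, so every
add waits for the previous one, while the reassociated form (the kind of
reordering -ffast-math permits) splits the sum into two independent chains.
The function names are hypothetical.

```c
/* In-order reduction: one dependence chain, same rounding as the
   original scalar loop. */
double
reduce_in_order(double init, const double *c, const double *x, int n)
{
        double acc = init;
        for (int i = 0; i < n; i++)
                acc += c[i] * x[i];
        return acc;
}

/* Reassociated reduction: two independent accumulators that are only
   combined at the end. The result may round differently from the
   in-order version. */
double
reduce_reassoc(double init, const double *c, const double *x, int n)
{
        double even = init, odd = 0.0;
        int i = 0;

        for (; i + 1 < n; i += 2) {
                even += c[i] * x[i];
                odd += c[i + 1] * x[i + 1];
        }
        if (i < n)      /* leftover element when n is odd */
                even += c[i] * x[i];
        return even + odd;
}
```

For inputs where every intermediate value is exactly representable, both
functions return the same result; in general they can differ in the last bits,
which is why the reassociated form is not used without -ffast-math.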

With -ffast-math the version with/without
-mtune-ctl=^sse_unaligned_load_optimal
is about the same speed, so STLF is a red herring here (on Zen2).

Still not vectorizing is a lot faster.

Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?

* [Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
From: crazylht at gmail dot com @ 2021-04-15  9:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #5)
> Note even when avoiding the STLF hit the vectorized version is slower.
> You can use -mtune-ctl=^sse_unaligned_load_optimal to force loading
> the lower/upper half of vectors separately.
> 
This leads to extra instructions (2 extra loads); if the vectorizer knew
that, it would find that the vectorized cost is larger than the scalar cost.

> The reason is that without -ffast-math we are using an in-order reduction
> which doesn't save us much but instead just combines dependence chains
> here.  We do have a related bug for this somewhere.
> 
> With -ffast-math the version with/without
> -mtune-ctl=^sse_unaligned_load_optimal
> is about the same speed, so STLF is a red herring here (on Zen2).
> 
> Still not vectorizing is a lot faster.
> 

Yes. Vectorization does not improve performance here (compare
-O2 -funroll-loops with -O2 -ftree-vectorize -funroll-loops), so I'm wondering
if we can adjust the heuristic or cost model so that the loop is not
vectorized.

> Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?

No, it doesn't help on CLX.
