[Bug tree-optimization/57796] New: AVX2 gather vectorization: code bloat and reduction of performance

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/57796] New: AVX2 gather vectorization: code bloat and reduction of performance
From: vincenzo.innocente at cern dot ch @ 2013-07-03  8:56 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

            Bug ID: 57796
           Summary: AVX2 gather vectorization: code bloat and reduction of
                    performance
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch

At least in the scimark2 sparse matrix multiplication, the use of gather
instructions results in code bloat and a substantial reduction in performance.

The test below was performed on an Intel 4770K at a fixed frequency of 3.495GHz
using gcc version 4.9.0 20130630 (experimental) [trunk revision 200570] (GCC).

The easiest way to reproduce is to download scimark2, then compile and run it as

mkdir scimark2TMP
cd scimark2TMP
wget http://math.nist.gov/scimark2/scimark2_1c.zip
unzip scimark2_1c.zip
gcc -v
gcc -Ofast -march=corei7-avx *.c -lm
./a.out 5 | grep "Sparse matmult"
./a.out -large 5 | grep "Sparse matmult"
gcc -Ofast -march=corei7-avx -mavx2 -mfma *.c -lm
./a.out 5 | grep "Sparse matmult"
./a.out -large 5 | grep "Sparse matmult"
gcc -Ofast -march=corei7-avx  -S SparseCompRow.c     -o SparseCompRow_avx.s
gcc -Ofast -march=corei7-avx -mavx2 -mfma -S SparseCompRow.c -o SparseCompRow_avx2.s
wc SparseCompRow_avx.s
wc SparseCompRow_avx2.s

My results:

gcc version 4.9.0 20130630 (experimental) [trunk revision 200570] (GCC) 
with -march=corei7-avx
Sparse matmult  Mflops:  2245.34    (N=1000, nz=5000)
Sparse matmult  Mflops:  2030.24    (N=100000, nz=1000000)
with -march=corei7-avx -mavx2 -mfma
Sparse matmult  Mflops:  1842.84    (N=1000, nz=5000)
Sparse matmult  Mflops:  1754.18    (N=100000, nz=1000000)
wc output (lines, words, bytes):
 113  269 2156 SparseCompRow_avx.s
 289  778 5910 SparseCompRow_avx2.s



* [Bug tree-optimization/57796] AVX2 gather vectorization: code bloat and reduction of performance
From: jakub at gcc dot gnu.org @ 2013-07-03  9:33 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |kyukhin at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The gather vectorization was written without access to hardware, so some
tuning may be needed.  But I don't have enough information on how it should be
tuned or what the exact costs are.



* [Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
From: ysrumyan at gmail dot com @ 2013-07-05 13:31 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

Yuri Rumyantsev <ysrumyan at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ysrumyan at gmail dot com

--- Comment #2 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
This issue is not related to avx2 tuning but rather to the estimation of
vectorization profitability. Note that avx does not support "gathers", so
the following innermost loop
                for (i=rowR; i<rowRp1; i++)
                    sum += x[ col[i] ] * val[i];
is not vectorized. I did a simple experiment and found that its iteration count
is 5, or 10 for the -large input, which does not look profitable for avx2
vectorization, i.e. the scalar version should be more profitable to execute. If
we slightly change this loop to
        int n = row[r+1] - row[r];
        int *col1 = col + row[r];                         
                for (i=0; i<n; i++)
                    sum += x[ col1[i] ] * val[i];
i.e. set the lower bound to zero, the performance drop for avx2 disappears:

with avx
Sparse matmult  Mflops:  2135.59    (N=1000, nz=5000)
with avx2
Sparse matmult  Mflops:  2309.64    (N=1000, nz=5000)
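
For reference, the two loop forms can be put side by side as below. This is
only a sketch, not code from scimark2: the function names and the val1 pointer
are made up for illustration. The rewritten snippet as posted indexes val[i]
from zero, so for both loops to compute the same sum val has to be offset by
row[r] as well, which the sketch assumes.

```c
#include <assert.h>

/* Original form: the index runs from rowR up to rowRp1. */
double row_dot_original(const double *x, const double *val,
                        const int *col, int rowR, int rowRp1)
{
    assert(rowR <= rowRp1);
    double sum = 0.0;
    for (int i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Rewritten form: the lower bound is zero and the trip count n is explicit.
   val1 is an assumption of this sketch, so that both forms read the same
   elements of val. */
double row_dot_zero_based(const double *x, const double *val,
                          const int *col, int rowR, int rowRp1)
{
    assert(rowR <= rowRp1);
    int n = rowRp1 - rowR;
    const int *col1 = col + rowR;
    const double *val1 = val + rowR;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[col1[i]] * val1[i];
    return sum;
}
```

Both functions take the row bounds directly, e.g.
row_dot_zero_based (x, val, col, row[r], row[r+1]).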



* [Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
From: jakub at gcc dot gnu.org @ 2013-07-05 13:49 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
By tuning I meant the vectorizer cost model.  If the desirability of gathers
vs. no vectorization at all doesn't depend only on the insns in the loop, but
also on how many iterations the loop has, then perhaps we'd need to runtime
version it or something.



* [Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
From: ysrumyan at gmail dot com @ 2013-07-05 14:14 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #4 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
(In reply to Jakub Jelinek from comment #3)
> By tuning I meant the vectorizer cost model.  If the desirability of
> gathers vs. no vectorization at all doesn't depend only on the insns in the
> loop, but also on how many iterations the loop has, then perhaps we'd need
> to runtime version it or something.

Jakub,

We do have runtime versioning, but for the original benchmark the vectorized
version of the loop is selected for execution, whereas if we change the lower
bound to 0 (as I did) the scalar version of the loop is run.
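
To make the idea of versioning on the trip count concrete, here is a
hand-written sketch. It is not what GCC emits: the threshold
MIN_PROFITABLE_TRIPS is a made-up number, and dot_blocked only stands in for
the gather-based loop by keeping four partial sums in plain C.

```c
#include <assert.h>

/* Hypothetical break-even trip count.  A real value would have to come from
   the vectorizer cost model, not from this constant. */
enum { MIN_PROFITABLE_TRIPS = 16 };

/* Plain scalar loop. */
double dot_scalar(const double *x, const double *val, const int *col, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Stand-in for the vectorized loop: four independent partial sums, as a
   4-lane gather and multiply-add would keep, followed by a scalar epilogue. */
double dot_blocked(const double *x, const double *val, const int *col, int n)
{
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    int i = 0;
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; k++)
            lane[k] += x[col[i + k]] * val[i + k];
    double sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Runtime versioning: test the trip count once, then run one of the loops. */
double dot_versioned(const double *x, const double *val, const int *col, int n)
{
    assert(n >= 0);
    if (n < MIN_PROFITABLE_TRIPS)
        return dot_scalar(x, val, col, n);
    return dot_blocked(x, val, col, n);
}
```

With trip counts of 5 or 10, as measured for this benchmark, such a check
would always take the scalar path. Note that the blocked version adds in a
different order, so its result can differ from the scalar one in the last
bits for general floating-point input.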



* [Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
From: vincenzo.innocente at cern dot ch @ 2014-06-19  8:17 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #5 from vincenzo Innocente <vincenzo.innocente at cern dot ch> ---
So with the latest 4.9
gcc version 4.10.0 20140611 (experimental) [trunk revision 211467] (GCC) 
the situation has not changed much (the scalar version is now faster!).
I think that the cost of gather instructions is still underestimated.



* [Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
From: rguenth at gcc dot gnu.org @ 2015-04-10 11:16 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another use-case for gathers is that of strided loads where we do

           for (j = 0; ; j += VF*stride)
             {
               tmp1 = array[j];
               tmp2 = array[j + stride];
               ...
               vectemp = {tmp1, tmp2, ...}
             }

but could as well do

           off = { 0, stride, ..., stride * N };
           for (j = 0; ; j += VF*stride)
             vectemp = gather (&array[j], off, -1);

This still needs a separate IV.  Currently the cost of strided loads is

      /* N scalar loads plus gathering them into a vector.  */
      tree vectype = STMT_VINFO_VECTYPE (stmt_info);
      inside_cost += record_stmt_cost (body_cost_vec,
                                       ncopies * TYPE_VECTOR_SUBPARTS (vectype),
                                       scalar_load, stmt_info, 0, vect_body);
      inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
                                       stmt_info, 0, vect_body);

where a good(?) approximation for gather loads could be just omitting
the vec_construct cost?  (well, a new target cost for gather would be
most appropriate I guess)
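
The arithmetic behind that suggestion can be sketched as follows. This is a
simplification, not the vectorizer's code: the per-statement costs are passed
in as plain integers, whereas in GCC they come from the target cost model, and
all names here are made up. nunits plays the role of
TYPE_VECTOR_SUBPARTS (vectype).

```c
#include <assert.h>

/* Current strided-load cost as described above: nunits scalar loads per
   vector copy, plus one vec_construct per copy.  The two unit costs are
   hypothetical inputs. */
int strided_load_cost(int ncopies, int nunits,
                      int scalar_load_cost, int vec_construct_cost)
{
    assert(ncopies >= 0 && nunits >= 0);
    return ncopies * nunits * scalar_load_cost
           + ncopies * vec_construct_cost;
}

/* Suggested approximation for a gather load: the same scalar loads with the
   vec_construct term omitted. */
int gather_load_cost_approx(int ncopies, int nunits, int scalar_load_cost)
{
    assert(ncopies >= 0 && nunits >= 0);
    return ncopies * nunits * scalar_load_cost;
}
```

Under this approximation a gather is always costed below the equivalent
strided load, by ncopies times the vec_construct cost. A dedicated target
cost for gathers, as mentioned above, would remove that fixed relationship.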



