public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/61481] New: Poor optimization of simple small-sized matrix routines with constant data
From: cas43 at cs dot stanford.edu @ 2014-06-12  2:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

            Bug ID: 61481
           Summary: Poor optimization of simple small-sized matrix
                    routines with constant data
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cas43 at cs dot stanford.edu

Created attachment 32927
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32927&action=edit
Output of the compilation (the attached test.s file)

The following function does some simple matrix operations.  All of the data is
constant, and N is small.  The function optimizes down to a plain return statement for
N=1 and N=2.  For N=3, optimization is incomplete after the tree optimizations but
benefits significantly from later passes.  For N=4, the final result is not good.

template<int N>
int foo()
{
    int x[N*N],y[N*N],z[N*N];
    for(int i=0;i<N*N;i++) x[i]=0;
    for(int i=0;i<N*N;i++) y[i]=0;
    for(int i=0;i<N*N;i++) z[i]=0;
    for(int i=0;i<N;i++) x[i*(N+1)]=1;
    for(int i=0;i<N;i++) y[i*(N+1)]=2;
    for(int i=0;i<N;i++) for(int j=0;j<N;j++) for(int k=0;k<N;k++)
        z[i*N+j]+=x[i*N+k]*y[k*N+j];
    int ret=0;
    for(int i=0;i<N;i++) ret+=z[i*(N+1)];
    return ret;
}
template int foo<1>();
template int foo<2>();
template int foo<3>();
template int foo<4>();
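
Since x is initialized to the N-by-N identity and y to twice the identity, z = x*y is
twice the identity and the traced sum is 2*N; a fully constant-folded build should
therefore be equivalent to the following hand-derived sketch (foo_expected is only an
illustrative name, not part of the testcase):

template<int N>
int foo_expected()
{
    // x = I, y = 2*I  =>  z = x*y = 2*I  =>  trace(z) = 2*N
    return 2*N;
}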


Compiled with: g++ test.cpp -c -S -O3

The full test.s file is attached, but the most relevant excerpts are summarized
below.

=== Asm produced for foo<1>(); ===

movl $2, %eax
ret

=== Asm produced for foo<2>(); ===

movl $4, %eax
ret

=== Asm produced for foo<3>(); ===

subq $32, %rsp
.cfi_def_cfa_offset 40
movl $6, %eax
addq $32, %rsp
.cfi_def_cfa_offset 8
ret

=== Asm produced for foo<4>(); ===

subq $96, %rsp
.cfi_def_cfa_offset 104
xorl %eax, %eax
movl $8, %ecx
leaq -104(%rsp), %rdx
pxor %xmm7, %xmm7
pxor %xmm10, %xmm10
movq %rdx, %rdi
rep; stosq
movl $1, -104(%rsp)
movl $1, -84(%rsp)
movl $1, -64(%rsp)
movl $1, -44(%rsp)
movdqa %xmm7, %xmm11
shufps $136, %xmm7, %xmm11
movdqa -104(%rsp), %xmm0
shufps $221, %xmm7, %xmm7
movdqa -88(%rsp), %xmm3
... lots and lots of SSE instructions ...
movdqa %xmm2, %xmm0
punpckhdq %xmm5, %xmm2
punpckldq %xmm5, %xmm0
movaps %xmm6, -120(%rsp)
movl -120(%rsp), %eax
addl 44(%rsp), %eax
movaps %xmm0, 56(%rsp)
movaps %xmm2, 72(%rsp)
addl 64(%rsp), %eax
addl 84(%rsp), %eax
addq $96, %rsp
.cfi_def_cfa_offset 8
ret



* [Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data
From: cas43 at cs dot stanford.edu @ 2014-06-12  2:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

--- Comment #1 from Craig Schroeder <cas43 at cs dot stanford.edu> ---
The compiler used in the above compilation was:

Using built-in specs.
COLLECT_GCC=/home/craig/new-gcc/i-trunk/bin/g++
COLLECT_LTO_WRAPPER=/home/craig/new-gcc/i-trunk/libexec/gcc/x86_64-unknown-linux-gnu/4.10.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../s-trunk/configure --prefix=/home/craig/new-gcc/i-trunk
Thread model: posix
gcc version 4.10.0 20140602 (experimental) (GCC)



* [Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data
From: rguenth at gcc dot gnu.org @ 2014-06-12  9:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-06-12
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
    for(int i=0;i<N;i++) for(int j=0;j<N;j++) for(int k=0;k<N;k++)
        z[i*N+j]+=x[i*N+k]*y[k*N+j];

This one probably hits the limits for loop unrolling.  It's not easy to prove
that fully unrolling to 4**3 (= 64) statements will be profitable.

A related testcase is calculix from SPEC CPU 2006, whose core loop is extracted
into testsuite/gfortran.dg/reassoc_4.f.  Fully unrolling that loop nest of depth
5 would be profitable, but we don't compute that (I have several patches
that try to make it so, but they are not effective).

A #pragma force_unroll (or similar) would probably help in such situations
(where you control the source, not for benchmarks of course).
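
(For illustration: later GCC releases gained #pragma GCC unroll N, which requests an
unroll factor for the loop that immediately follows; it was not available at the time of
this report.  A sketch of how the inner nest of the testcase could be annotated,
assuming that pragma and writing out the factor for N == 4:

    #pragma GCC unroll 4
    for(int i=0;i<N;i++) {
        #pragma GCC unroll 4
        for(int j=0;j<N;j++) {
            #pragma GCC unroll 4
            for(int k=0;k<N;k++)
                z[i*N+j]+=x[i*N+k]*y[k*N+j];
        }
    }

Each pragma applies only to the loop directly below it, so every level of the nest
needs its own annotation.)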

In this case the non-unrolled loops are vectorized before the "final" unrolling
is applied, which obfuscates the code enough that it no longer optimizes very well.
In particular, the permutations are hard to get rid of.

There are other cases where unrolling prevents vectorization and thus
reduces speed.

I think you are expecting too much from the compiler here ;)  But yes,
forcefully unrolling everything early would get you what you want.

This is also something for the vectorizer cost model, as I doubt the vectorization
of the outer loop is profitable the way we do it (given the low number of
iterations and the high setup cost):

t.ii:10:8: note: Cost model analysis:
  Vector inside of loop cost: 68
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar iteration cost: 44
  Scalar outside cost: 0
  Vector outside cost: 16
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

so it computes that, given that we execute 4 scalar iterations in parallel,
this is all profitable.
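
(A back-of-the-envelope reading of those numbers, assuming the straightforward
scalar-vs-vector comparison and a vectorization factor of 4, i.e. one vector iteration
covering four scalar iterations:

  scalar cost for n iterations  ~ 44 * n
  vector cost for n iterations  ~ 16 + 68 * (n / 4)  =  16 + 17 * n
  break-even:  44 * n >= 16 + 17 * n  =>  n >= 16/27  =>  n = 1

which is where the "Calculated minimum iters for profitability: 1" above comes from.
The IV maintenance and permute costs discussed below are not part of that estimate.)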

The outside cost is from

  vect_cst_.16_95 = { 2, 2, 2, 2 };
  vect_cst_.19_81 = {pretmp_82, pretmp_82, pretmp_82, pretmp_82};
  vect_cst_.22_74 = {pretmp_68, pretmp_68, pretmp_68, pretmp_68};
  vect_cst_.25_53 = {pretmp_158, pretmp_158, pretmp_158, pretmp_158};
  vect_cst_.28_18 = {pretmp_160, pretmp_160, pretmp_160, pretmp_160};
  vect_cst_.31_2 = { 2, 2, 2, 2 };
  vect_cst_.34_168 = {pretmp_88, pretmp_88, pretmp_88, pretmp_88};
  vect_cst_.37_69 = {pretmp_47, pretmp_47, pretmp_47, pretmp_47};
  vect_cst_.40_32 = {pretmp_54, pretmp_54, pretmp_54, pretmp_54};
  vect_cst_.43_11 = {pretmp_151, pretmp_151, pretmp_151, pretmp_151};
  vect_cst_.46_143 = { 2, 2, 2, 2 };
  vect_cst_.49_55 = {pretmp_171, pretmp_171, pretmp_171, pretmp_171};
  vect_cst_.52_83 = {pretmp_8, pretmp_8, pretmp_8, pretmp_8};
  vect_cst_.55_66 = {pretmp_122, pretmp_122, pretmp_122, pretmp_122};
  vect_cst_.58_217 = {pretmp_115, pretmp_115, pretmp_115, pretmp_115};
  vect_cst_.61_220 = { 2, 2, 2, 2 };

that is, 16 splats for vector constants (yeah, some duplicates here).

What the vector inside cost doesn't measure is IV maintenance; also, the
costs of the permutes

  vect_perm_even_57 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 0, 2, 4, 6 }>;
  vect_perm_odd_94 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 1, 3, 5, 7 }>;
  vect_perm_even_198 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 0, 2, 4, 6 }>;
  vect_perm_odd_212 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 1, 3, 5, 7 }>;
  vect_perm_even_222 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 0, 2, 4, 6 }>;
  vect_perm_odd_242 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 1, 3, 5, 7 }>;
  vect_perm_even_256 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 0, 2, 4, 6 }>;
  vect_perm_odd_266 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 1, 3, 5, 7 }>;
...
  vect_perm_even_195 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 0, 2, 4, 6 }>;
  vect_perm_odd_194 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 1, 3, 5, 7 }>;
  vect_perm_even_193 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 0, 2, 4, 6 }>;
  vect_perm_odd_192 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 1, 3, 5, 7 }>;
  vect_perm_even_191 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 0, 2, 4, 6 }>;
  vect_perm_odd_161 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 1, 3, 5, 7 }>;
  vect_perm_even_129 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 0, 2, 4, 6 }>;
  vect_perm_odd_107 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 1, 3, 5, 7 }>;

are underestimated.

Note that, given we only access a subset of all computed values, we have
a hard time removing the computations that are not necessary from the
vectorized code:

  MEM[(int *)&z] = vect_inter_high_239;
  MEM[(int *)&z + 16B] = vect_inter_low_240;
  MEM[(int *)&z + 32B] = vect_inter_high_250;
  MEM[(int *)&z + 48B] = vect_inter_low_251;
  _73 = z[0];
  _79 = z[5];
  ret_80 = _73 + _79;
  _85 = z[10];
  ret_86 = ret_80 + _85;
  _91 = z[15];
  ret_92 = ret_86 + _91;

Eventually FRE could look through z[15] -> MEM[(int *)&z + 48B] ->
VEC_PERM_EXPR <vect_inter_low_236, vect_inter_low_238, { 2, 6, 3, 7 }>
and avoid that last permutation, but FRE doesn't run after vectorization.

We could also trick DCE into partially removing the dead vector stores
by storing a vector extract to just the relevant field.  But there is nothing
to clean that up after that (well, apart from RTL).

You should adjust your code to make z an array holding only the diagonal and
compute just that, as sketched below.
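
A minimal sketch of that suggestion, derived from the testcase in the original report;
foo_diag and zdiag are only illustrative names:

template<int N>
int foo_diag()
{
    int x[N*N], y[N*N], zdiag[N];       // keep only the diagonal of z
    for(int i=0;i<N*N;i++) x[i]=0;
    for(int i=0;i<N*N;i++) y[i]=0;
    for(int i=0;i<N;i++) x[i*(N+1)]=1;
    for(int i=0;i<N;i++) y[i*(N+1)]=2;
    for(int i=0;i<N;i++)
    {
        zdiag[i]=0;
        for(int k=0;k<N;k++)            // only z[i*N+i] is ever read, so compute just that
            zdiag[i]+=x[i*N+k]*y[k*N+i];
    }
    int ret=0;
    for(int i=0;i<N;i++) ret+=zdiag[i];
    return ret;
}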



* [Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data
From: cas43 at cs dot stanford.edu @ 2014-06-12 21:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

--- Comment #3 from Craig Schroeder <cas43 at cs dot stanford.edu> ---
Created attachment 32930
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32930&action=edit
More practical (but more complex) example

I am trying to optimize an auto-differentiation utility.  The example above was
the result of simplifying a very complex program down to a simple one that
exhibits many of the same problems.  Auto-differentiation frequently produces
calculations that are trivial or unused, but shaving those computations out in
the code would be quite difficult.

Attached is a portion of an application of it, stripped down to the routines that
are actually used and with some unneeded templatization removed.

As in the simpler example, the relative order of loop unrolling and sccp seems
to be the source of the problem.  After flattening, there are three loops, none of
them nested.  None are unrolled during prog.cpp.058t.cunrolli.  All three are
unrolled during prog.cpp.118t.cunroll, but that is too late for the
prog.cpp.103t.sccp step.
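
(For reference, those numbered files are GCC's per-pass tree dumps; they can be produced
with -fdump-tree-all, or per pass via the dump-file suffixes, e.g.

g++ prog.cpp -c -O3 -fdump-tree-cunrolli -fdump-tree-cunroll -fdump-tree-sccp

The exact pass numbers, such as 058t and 118t, vary between GCC versions.)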



* [Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data
From: pinskia at gcc dot gnu.org @ 2021-12-25  7:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
   Target Milestone|---                         |10.0

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The original testcase is now fully optimized in GCC 10.
The second testcase invokes undefined behavior and gets optimized down to
__builtin_unreachable.

File a well-defined testcase for the other one and we can see what still needs to be
optimized.

So closing as fixed for GCC 10; the late FRE pass which was added is exactly what
optimizes this case.

