public inbox for gcc-bugs@sourceware.org
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data
Date: Thu, 12 Jun 2014 09:16:00 -0000	[thread overview]
Message-ID: <bug-61481-4-DBXQsLRoA4@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-61481-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-06-12
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          z[i*N+j] += x[i*N+k] * y[k*N+j];

this one probably hits the loop unroller's size limits.  It's not easy to
prove that fully unrolling to 4**3 statements will be profitable.
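
For reference, full unrolling of the N == 4 nest is 64 straight-line
multiply-accumulate statements; a purely illustrative sketch of the first
few (not compiler output):

  /* i = 0, j = 0: the four k iterations, fully unrolled.  */
  z[0] += x[0] * y[0];
  z[0] += x[1] * y[4];
  z[0] += x[2] * y[8];
  z[0] += x[3] * y[12];
  /* ... 60 more such statements for the remaining (i, j, k)
     triples, all visible to straight-line optimizers.  */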

A related testcase is calculix from SPEC CPU 2006; its core loop is
extracted into testsuite/gfortran.dg/reassoc_4.f.  Fully unrolling the
loop nest of depth 5 would be profitable, but we don't compute that (I
have several patches that try to make it so, but they are not effective).

A #pragma force_unroll (or similar) would probably help in such situations
(where you control the source, not for benchmarks of course).
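
GCC later gained exactly such a knob: #pragma GCC unroll, added in GCC 8
(so after this comment was written).  A minimal sketch assuming GCC 8 or
newer:

  /* Ask for each level of the nest to be unrolled 4 times (GCC 8+).  */
  #pragma GCC unroll 4
  for (int i = 0; i < N; i++) {
    #pragma GCC unroll 4
    for (int j = 0; j < N; j++) {
      #pragma GCC unroll 4
      for (int k = 0; k < N; k++)
        z[i*N+j] += x[i*N+k] * y[k*N+j];
    }
  }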

In this case the non-unrolled loops are vectorized before "final" unrolling
is applied, which obfuscates the code enough that it is no longer optimized
very well.  In particular, the permutations are hard to get rid of.

There are other cases where unrolling prevents vectorization and thus
reduces speed.

I think you are expecting too much from the compiler here ;)  But yes,
forcefully unrolling everything early would get you what you want.

This is also something for the vectorizer cost model, as I doubt the
vectorization of the outer loop is profitable the way we do it (given the
low number of iterations and the high setup cost):

t.ii:10:8: note: Cost model analysis:
  Vector inside of loop cost: 68
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar iteration cost: 44
  Scalar outside cost: 0
  Vector outside cost: 16
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

so it computes that, given that we execute 4 scalar iterations in parallel,
this is all profitable.
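
Spelling the arithmetic out (a simplified sketch with the dumped numbers,
not the exact in-tree formula in vect_estimate_min_profitable_iters):

  int vf = 4;            /* vectorization factor */
  int vec_inside = 68;   /* cost of one vector loop iteration */
  int vec_outside = 16;  /* prologue 16 + epilogue 0 */
  int scalar_iter = 44;  /* cost of one scalar iteration */
  /* One vector iteration replaces vf scalar iterations, saving
     scalar_iter * vf - vec_inside = 176 - 68 = 108 units, while the
     one-off setup costs vec_outside = 16 units.  In scalar-iteration
     terms the break-even point is ceil (vec_outside * vf / 108)
     = ceil (64 / 108) = 1, matching the dump.  */
  int min_iters = (vec_outside * vf + scalar_iter * vf - vec_inside - 1)
                  / (scalar_iter * vf - vec_inside);  /* == 1 */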

The outside cost is from

  vect_cst_.16_95 = { 2, 2, 2, 2 };
  vect_cst_.19_81 = {pretmp_82, pretmp_82, pretmp_82, pretmp_82};
  vect_cst_.22_74 = {pretmp_68, pretmp_68, pretmp_68, pretmp_68};
  vect_cst_.25_53 = {pretmp_158, pretmp_158, pretmp_158, pretmp_158};
  vect_cst_.28_18 = {pretmp_160, pretmp_160, pretmp_160, pretmp_160};
  vect_cst_.31_2 = { 2, 2, 2, 2 };
  vect_cst_.34_168 = {pretmp_88, pretmp_88, pretmp_88, pretmp_88};
  vect_cst_.37_69 = {pretmp_47, pretmp_47, pretmp_47, pretmp_47};
  vect_cst_.40_32 = {pretmp_54, pretmp_54, pretmp_54, pretmp_54};
  vect_cst_.43_11 = {pretmp_151, pretmp_151, pretmp_151, pretmp_151};
  vect_cst_.46_143 = { 2, 2, 2, 2 };
  vect_cst_.49_55 = {pretmp_171, pretmp_171, pretmp_171, pretmp_171};
  vect_cst_.52_83 = {pretmp_8, pretmp_8, pretmp_8, pretmp_8};
  vect_cst_.55_66 = {pretmp_122, pretmp_122, pretmp_122, pretmp_122};
  vect_cst_.58_217 = {pretmp_115, pretmp_115, pretmp_115, pretmp_115};
  vect_cst_.61_220 = { 2, 2, 2, 2 };

that is, 16 splats for vector constants (yeah, some duplicates here).

What the vector inside cost doesn't measure is IV maintenance; also,
the costs of the permutes

  vect_perm_even_57 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 0, 2, 4, 6 }>;
  vect_perm_odd_94 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 1, 3, 5, 7 }>;
  vect_perm_even_198 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 0, 2, 4, 6 }>;
  vect_perm_odd_212 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 1, 3, 5, 7 }>;
  vect_perm_even_222 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 0, 2, 4, 6 }>;
  vect_perm_odd_242 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 1, 3, 5, 7 }>;
  vect_perm_even_256 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 0, 2, 4, 6 }>;
  vect_perm_odd_266 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 1, 3, 5, 7 }>;
...
  vect_perm_even_195 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 0, 2, 4, 6 }>;
  vect_perm_odd_194 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 1, 3, 5, 7 }>;
  vect_perm_even_193 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 0, 2, 4, 6 }>;
  vect_perm_odd_192 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 1, 3, 5, 7 }>;
  vect_perm_even_191 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 0, 2, 4, 6 }>;
  vect_perm_odd_161 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 1, 3, 5, 7 }>;
  vect_perm_even_129 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 0, 2, 4, 6 }>;
  vect_perm_odd_107 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 1, 3, 5, 7 }>;

are underestimated.

Note that, since we only access a subset of all the computed values, we
have a hard time removing the unnecessary computations from the
vectorized code:

  MEM[(int *)&z] = vect_inter_high_239;
  MEM[(int *)&z + 16B] = vect_inter_low_240;
  MEM[(int *)&z + 32B] = vect_inter_high_250;
  MEM[(int *)&z + 48B] = vect_inter_low_251;
  _73 = z[0];
  _79 = z[5];
  ret_80 = _73 + _79;
  _85 = z[10];
  ret_86 = ret_80 + _85;
  _91 = z[15];
  ret_92 = ret_86 + _91;

eventually FRE could look through z[15] -> MEM[(int *)&z + 48B] ->
VEC_PERM_EXPR <vect_inter_low_236, vect_inter_low_238, { 2, 6, 3, 7 }>
and avoid that last permutation, but FRE doesn't run after vectorization.

We could also trick DCE into partially removing the dead vector stores
by storing a vector extract to the relevant field.  But there is nothing
to clean that up afterwards (well, apart from RTL).

You should adjust your code to make z an array holding just the diagonal
and compute only that.
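
A minimal sketch of that rewrite (hypothetical names, assuming the
testcase sums the diagonal of the 4x4 product, matching the
z[0] + z[5] + z[10] + z[15] reduction above):

  enum { N = 4 };

  int trace_of_product (const int *x, const int *y)
  {
    int diag[N];                       /* only the diagonal is needed */
    for (int i = 0; i < N; i++)
      {
        diag[i] = 0;
        for (int k = 0; k < N; k++)    /* compute z[i*N+i] only */
          diag[i] += x[i*N+k] * y[k*N+i];
      }
    int ret = 0;
    for (int i = 0; i < N; i++)
      ret += diag[i];
    return ret;
  }

This computes N*N multiply-adds instead of N*N*N and stores nothing that
is later thrown away, so there are no dead vector stores to clean up.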


Thread overview: 5+ messages
2014-06-12  2:37 [Bug tree-optimization/61481] New: " cas43 at cs dot stanford.edu
2014-06-12  2:38 ` [Bug tree-optimization/61481] " cas43 at cs dot stanford.edu
2014-06-12  9:16 ` rguenth at gcc dot gnu.org [this message]
2014-06-12 21:22 ` cas43 at cs dot stanford.edu
2021-12-25  7:46 ` pinskia at gcc dot gnu.org
