public inbox for gcc-bugs@sourceware.org
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data
Date: Thu, 12 Jun 2014 09:16:00 -0000
Message-ID: <bug-61481-4-DBXQsLRoA4@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-61481-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What            |Removed      |Added
----------------------------------------------------------------------------
           Keywords        |             |missed-optimization
           Status          |UNCONFIRMED  |NEW
           Last reconfirmed |             |2014-06-12
           Ever confirmed   |0            |1
           Severity        |normal       |enhancement

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        z[i*N+j] += x[i*N+k] * y[k*N+j];

This one probably hits limits for loop unrolling: it's not easy to prove that fully unrolling to 4**3 statements will be profitable. A related testcase is calculix from SPEC CPU 2006, whose core loop is extracted in testsuite/gfortran.dg/reassoc_4.f. Fully unrolling that loop nest of depth 5 would be profitable, but we don't compute that (I have several patches that try to make it so, but they are not effective). A #pragma force_unroll (or similar) would probably help in such situations (where you control the source -- not for benchmarks, of course).

In this case the non-unrolled loops are vectorized before "final" unrolling is applied, which obfuscates the code enough that it does not get optimized very well. In particular the permutations are hard to get rid of. There are also other cases where unrolling prevents vectorization and thus reduces speed.

I think you are expecting too much from the compiler here ;) But yes, forcefully unrolling everything early would get you what you want.
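[Editor's note: a sketch of the testcase with the closest thing GCC grew later to the force_unroll pragma wished for above -- `#pragma GCC unroll n`, available since GCC 8 (and also honored by clang). The function name `matmul4` is made up for illustration; the pragma is a hint to the compiler and does not change the computed result.]

```c
#define N 4

/* The testcase above: 4x4 integer matrix multiply, z += x * y.  With N a
   compile-time constant, full unrolling of the nest yields 4**3 = 64
   multiply-add statements. */
void matmul4(const int *x, const int *y, int *z)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            /* GCC 8+ / clang: force the innermost loop to be fully
               unrolled; older compilers warn about or ignore the pragma. */
#pragma GCC unroll 4
            for (int k = 0; k < N; k++)
                z[i*N + j] += x[i*N + k] * y[k*N + j];
        }
}
```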
This is also something for the vectorizer cost model, as I doubt the vectorization of the outer loop is profitable the way we do it (given the low number of iterations and the high setup cost):

t.ii:10:8: note: Cost model analysis:
  Vector inside of loop cost: 68
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar iteration cost: 44
  Scalar outside cost: 0
  Vector outside cost: 16
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

so it computes that, given that we execute 4 scalar iterations in parallel, this is all profitable. The outside cost is from

  vect_cst_.16_95 = { 2, 2, 2, 2 };
  vect_cst_.19_81 = {pretmp_82, pretmp_82, pretmp_82, pretmp_82};
  vect_cst_.22_74 = {pretmp_68, pretmp_68, pretmp_68, pretmp_68};
  vect_cst_.25_53 = {pretmp_158, pretmp_158, pretmp_158, pretmp_158};
  vect_cst_.28_18 = {pretmp_160, pretmp_160, pretmp_160, pretmp_160};
  vect_cst_.31_2 = { 2, 2, 2, 2 };
  vect_cst_.34_168 = {pretmp_88, pretmp_88, pretmp_88, pretmp_88};
  vect_cst_.37_69 = {pretmp_47, pretmp_47, pretmp_47, pretmp_47};
  vect_cst_.40_32 = {pretmp_54, pretmp_54, pretmp_54, pretmp_54};
  vect_cst_.43_11 = {pretmp_151, pretmp_151, pretmp_151, pretmp_151};
  vect_cst_.46_143 = { 2, 2, 2, 2 };
  vect_cst_.49_55 = {pretmp_171, pretmp_171, pretmp_171, pretmp_171};
  vect_cst_.52_83 = {pretmp_8, pretmp_8, pretmp_8, pretmp_8};
  vect_cst_.55_66 = {pretmp_122, pretmp_122, pretmp_122, pretmp_122};
  vect_cst_.58_217 = {pretmp_115, pretmp_115, pretmp_115, pretmp_115};
  vect_cst_.61_220 = { 2, 2, 2, 2 };

that is, 16 splats for vector constants (yes, some duplicates here).
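[Editor's note: a rough sketch of the arithmetic behind "Calculated minimum iters for profitability: 1". GCC's real computation lives in vect_estimate_min_profitable_iters() and also accounts for peeling, runtime checks, etc.; this only reproduces the basic comparison for the numbers in the dump, assuming a vectorization factor of 4.]

```c
/* Smallest scalar iteration count n at which vectorizing pays off, i.e.
   the first n with soc + sic*n  >  voc + vic*(n/vf).  Rearranged:
   n > (voc - soc) * vf / (sic*vf - vic); integer-divide and step past
   the bound.  A simplification of GCC's actual formula. */
int min_profitable_iters(int vic, int voc, int sic, int soc, int vf)
{
    return (voc - soc) * vf / (sic * vf - vic) + 1;
}

/* With the dump's numbers -- vector inside 68, vector outside 16,
   scalar iteration 44, scalar outside 0, VF 4 -- this gives
   64 / 108 + 1 = 1, matching the dump. */
```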
What the vector inside cost doesn't measure is IV maintenance; also the costs of the permutes

  vect_perm_even_57 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 0, 2, 4, 6 }>;
  vect_perm_odd_94 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 1, 3, 5, 7 }>;
  vect_perm_even_198 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 0, 2, 4, 6 }>;
  vect_perm_odd_212 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 1, 3, 5, 7 }>;
  vect_perm_even_222 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 0, 2, 4, 6 }>;
  vect_perm_odd_242 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 1, 3, 5, 7 }>;
  vect_perm_even_256 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 0, 2, 4, 6 }>;
  vect_perm_odd_266 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 1, 3, 5, 7 }>;
  ...
  vect_perm_even_195 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 0, 2, 4, 6 }>;
  vect_perm_odd_194 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 1, 3, 5, 7 }>;
  vect_perm_even_193 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 0, 2, 4, 6 }>;
  vect_perm_odd_192 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 1, 3, 5, 7 }>;
  vect_perm_even_191 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 0, 2, 4, 6 }>;
  vect_perm_odd_161 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 1, 3, 5, 7 }>;
  vect_perm_even_129 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 0, 2, 4, 6 }>;
  vect_perm_odd_107 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 1, 3, 5, 7 }>;

are underestimated.
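[Editor's note: a scalar model of what one of those VEC_PERM_EXPR pairs does, for readers unfamiliar with the GIMPLE. VEC_PERM_EXPR <a, b, sel> concatenates a and b and picks the elements named by sel; the { 0, 2, 4, 6 } / { 1, 3, 5, 7 } pairs above deinterleave two vectors into even and odd element groups, and chaining two rounds extracts every 4th element. The function name is made up for illustration.]

```c
/* Model of VEC_PERM_EXPR <a, b, { 0, 2, 4, 6 }> and
   VEC_PERM_EXPR <a, b, { 1, 3, 5, 7 }> on 4-element int vectors:
   concatenate a and b, then take the even (or odd) positions.  Applied
   twice, this realizes a stride-4 access pattern -- exactly the permute
   chains whose cost the model underestimates. */
void perm_even_odd(const int a[4], const int b[4], int even[4], int odd[4])
{
    int cat[8];
    for (int i = 0; i < 4; i++) { cat[i] = a[i]; cat[4 + i] = b[i]; }
    for (int i = 0; i < 4; i++) {
        even[i] = cat[2 * i];      /* selector { 0, 2, 4, 6 } */
        odd[i]  = cat[2 * i + 1];  /* selector { 1, 3, 5, 7 } */
    }
}
```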
Note that, given we only access a subset of all computed values, we have a hard time removing the unnecessary computations from the vectorized code:

  MEM[(int *)&z] = vect_inter_high_239;
  MEM[(int *)&z + 16B] = vect_inter_low_240;
  MEM[(int *)&z + 32B] = vect_inter_high_250;
  MEM[(int *)&z + 48B] = vect_inter_low_251;
  _73 = z[0];
  _79 = z[5];
  ret_80 = _73 + _79;
  _85 = z[10];
  ret_86 = ret_80 + _85;
  _91 = z[15];
  ret_92 = ret_86 + _91;

Eventually FRE could look through

  z[15] -> MEM[(int *)&z + 48B]
        -> VEC_PERM_EXPR <vect_inter_low_236, vect_inter_low_238, { 2, 6, 3, 7 }>

and avoid that last permutation, but FRE doesn't run after vectorization. We could also trick DCE into partially removing the dead vector stores by storing a vector extract to the relevant field, but there is nothing to clean that up after that (well, apart from RTL). You should adjust your code to make z an array holding only the diagonal, and compute just that.
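[Editor's note: a sketch of the source-level rewrite suggested in the last sentence. Since the reduction above only consumes z[0] + z[5] + z[10] + z[15] -- the diagonal of the product -- the full matrix multiply can be skipped entirely. The function name is made up for illustration.]

```c
#define N 4

/* Only the diagonal of z = x * y is summed in the end, so compute
   trace(x*y) directly: N*N multiply-adds instead of N**3, no dead
   stores for DCE to struggle with, and nothing for the vectorizer
   to obfuscate. */
int trace_of_product(const int *x, const int *y)
{
    int ret = 0;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            ret += x[i*N + k] * y[k*N + i];  /* (x*y)[i][i] contribution */
    return ret;
}
```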
Thread overview: 5+ messages [~2014-06-12 9:16 UTC | newest]

  2014-06-12  2:37  [Bug tree-optimization/61481] New: Poor optimization of simple small-sized matrix routines with constant data  cas43 at cs dot stanford.edu
  2014-06-12  2:38  ` cas43 at cs dot stanford.edu
  2014-06-12  9:16  ` rguenth at gcc dot gnu.org  [this message]
  2014-06-12 21:22  ` cas43 at cs dot stanford.edu
  2021-12-25  7:46  ` pinskia at gcc dot gnu.org