public inbox for gcc-bugs@sourceware.org
From: "linkw at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/100794] New: suboptimal code due to missing pre2 when vectorization fails
Date: Thu, 27 May 2021 07:48:11 +0000
Message-ID: <bug-100794-4@http.gcc.gnu.org/bugzilla/> (raw)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794

            Bug ID: 100794
           Summary: suboptimal code due to missing pre2 when vectorization
                    fails
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9.  The
baseline options are -O2 -mcpu=power9 -ffast-math, while the test line is
-O2 -mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.
A reduced C test case is below:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);
        FS[k2][j][i]
          = cff
            * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
               + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
               + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
               + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}

O2 fast:

  <bb 8> [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
  # ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

O2 fast vect (very-cheap):

  <bb 6> [local count: 955630225]:
  # doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
  # ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
  # ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];       // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];      // redundant load
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

The root cause is this: in the baseline version, PRE manages to reuse load
results from previous iterations, which saves several loads.  In the test
line version, because of the check below:

      /* Inhibit the use of an inserted PHI on a loop header when
         the address of the memory reference is a simple induction
         variable.  In other cases the vectorizer won't do anything
         anyway (either it's loop invariant or a complicated
         expression).  */
      if (sprime
          && TREE_CODE (sprime) == SSA_NAME
          && do_pre
          && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

PRE does not perform this optimization, to avoid introducing a loop-carried
dependence.  That makes sense.  But unfortunately the expected downstream
loop vectorization is not performed on the given loop, because the
"very-cheap" cost model does not allow the vectorizer to peel for niters.
No later pass tries to optimize this, so it eventually results in suboptimal
code.  Rerunning PRE once after loop vectorization does fix the degradation,
but I am not sure that is practical, since iterating PRE seems quite
time-consuming.  Alternatively, we could tag this kind of loop and later
rerun PRE only on the tagged ones, but it also seems impractical to predict
whether a given loop can be loop-vectorized later.  I am also not sure
whether some existing pass could be taught to handle this.
Thread overview: 13+ messages

2021-05-27  7:48 linkw at gcc dot gnu.org [this message]
2021-05-28  6:39 ` [Bug tree-optimization/100794] " rguenth at gcc dot gnu.org
2021-05-28  7:29 ` linkw at gcc dot gnu.org
2021-05-28  7:36 ` rguenther at suse dot de
2021-05-28 11:30 ` linkw at gcc dot gnu.org
2021-05-28 12:23 ` rguenther at suse dot de
2021-05-31  6:05 ` linkw at gcc dot gnu.org
2021-05-31  6:06 ` linkw at gcc dot gnu.org
2021-05-31  6:08 ` linkw at gcc dot gnu.org
2021-05-31  6:18 ` linkw at gcc dot gnu.org
2021-05-31  6:19 ` linkw at gcc dot gnu.org
2021-06-08  5:30 ` cvs-commit at gcc dot gnu.org
2021-06-09  2:20 ` linkw at gcc dot gnu.org