From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
From: Martin Jambor
To: Richard Biener
Cc: GCC Patches
Subject: Re: [PR 81616] Deferring FMA transformations in tight loops
User-Agent: Notmuch/0.25.1 (https://notmuchmail.org) Emacs/25.3.1 (x86_64-suse-linux-gnu)
Date: Fri, 22 Dec 2017 00:01:00 -0000
X-SW-Source: 2017-12/txt/msg01473.txt.bz2

Hi,

On Mon, Dec 18 2017, Richard Biener wrote:
> On Fri, Dec 15, 2017 at 3:19 PM, Martin Jambor wrote:
>>
>> Hello,
>>
>> the patch below prevents creation of fused-multiply-add instructions
>> in the widening_mul gimple pass on the Zen-based AMD CPUs and as a
>> result fixes regressions of native znver1 tuning when compared to
>> generic tuning in:
>>
>> - the matrix.c testcase of PR 81616 (straightforward matrix
>>   multiplication) at -O2 and -O3, which is currently 60% (!),
>>
>> - SPEC 2006 454.calculix at -O2, which is currently over 20%, and
>>
>> - SPEC 2017 510.parest at -O2 and -Ofast, which is currently also
>>   about 20% in both
>> cases.
>>
>> The basic idea is to detect loops in the following form:
>>
>>   # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
>>   ...
>>   _65 = _14 * _16;
>>   accumulator_66 = _65 + accumulator_111;
>>
>> and to prevent creating an FMA for it.  Because at least in the parest
>> and calculix cases it has to, it also deals with more than one chain of
>> FMA candidates that feed the next one's addend:
>>
>>   # accumulator_111 = PHI <0.0(5), accumulator_66(6)>
>>   ...
>>   _65 = _14 * _16;
>>   accumulator_55 = _65 + accumulator_111;
>>   _67 = _24 * _36;
>>   accumulator_66 = _67 + accumulator_55;
>>
>> Unfortunately, to really get rid of the calculix regression, the
>> algorithm cannot just look at one BB at a time but also has to work for
>> cases like the following:
>>
>>  1 void mult(void)
>>  2 {
>>  3   int i, j, k, l;
>>  4
>>  5   for(i=0; i<SIZE; i++)
>>  6     {
>>  7       for(j=0; j<SIZE; j++)
>>  8         {
>>  9           for(l=0; l<SIZE; l++)
>> 10             {
>> 11               c[i][j] += a[i][l] * b[k][l];
>> 12               for(k=1; k<10; ++k)
>> 13                 {
>> 14                   c[i][j] += a[i][l+k] * b[k][l+k];
>> 15                 }
>> 16
>> 17             }
>> 18         }
>> 19     }
>> 20 }
>>
>> where the FMA on line 14 feeds into the one on line 11 in an
>> encompassing loop.  Therefore I have changed the structure of the pass
>> to work in reverse dominance order and it keeps a hash set of results
>> of rejected FMA candidates which it checks when looking at PHI nodes
>> of the current BB.  Without this reorganization, calculix was still 8%
>> slower with native tuning than with generic one.
>>
>> When the deferring mechanism realizes that the FMA candidates in the
>> current BB do not all form one tight-loop chain like in the examples
>> above, it goes back to all the previously deferred candidates (in the
>> current BB only) and performs the transformation.
>>
>> The main reason is to keep the patch conservative (and also simple),
>> but it also means that the following function is not affected and is
>> still 20% slower when compiled with native march and tuning compared
>> to the generic one:
>>
>>  1 void mult(struct s *p1, struct s *p2)
>>  2 {
>>  3   int i, j, k;
>>  4
>>  5   for(i=0; i<SIZE; i++)
>>  6     {
>>  7       for(j=0; j<SIZE; j++)
>>  8         {
>>  9           for(k=0; k<SIZE; k++)
>> 10             {
>> 11               p1->c[i][j] += p1->a[i][k] * p1->b[k][j];
>> 12               p2->c[i][j] += p2->a[i][k] * p2->b[k][j];
>> 13             }
>> 14         }
>> 15     }
>> 16 }
>
> So here we apply store-motion to p[12]->c[i][j] so we see a
> reduction in SSA?  That is, you are really looking for
> SSA SCCs that involve only FMA candidates?  There's SCC
> finding code in tree-ssa-sccvn.c, DFS (), but a straight-forward
> use->def walk over the adds should find an SCC just involving the
> adds?  That is, rather than doing all the deferring stuff I wonder
> if it is easier to simply gather the SCC if there is any and add
> "disqualifications", like to each other member?  So when seeing
>
>   x += a1 * b1;
>   x += a2 * b2;
>   x += a3 * b3;
>   ...

That is certainly an alternative but I am not sure it would result in
simpler code, at least with the identical intended behavior, because I
would have to then walk backwards from the additions to
multiplications, skipping a potential negation, to check that they form
a viable FMA.  And remembering disqualifications is likely to be
similarly complex to remembering deferred transformations.

> we'd still generate a FMA for each 2nd statement?  Basically

When I manually unroll the matrix multiplication twice, not creating
every second FMA improves run-time only by 20%, whereas removing all of
them improves it by 60%.  When I tried to unroll the loop more,
specifically eight times, the regression disappeared even with an
unpatched compiler; I assume the loads hid the latency issue.

> do some scheduling of as many mults as needed to cover
> the latency of the first FMA.
>
> To go back to the Zen specific issue - so FMA has a higher overall
> latency than mul + add (?) and (or?)

No, if I read the tables I'm using for this right, the FMA latency is
5 cycles, whereas add + mul have 7 in total.

> if there's more than one mul being
> accumulated then the 2nd mul can execute in parallel with the first
> add (in case dependences allow).

That is how I understand the issue.

> I wonder if there's an easy way to
> query the DFA for the latency of FMA vs. ADD and/or if the --param
> (or target hook?) should rather tell us sth about this latency difference?

Well, I guess that ideally this would really be an
instruction-selection decision based on whether the addend is likely to
be ready or not.  But I am not sure how to model this at the gimple
level and I did not think I had enough time to investigate this for
GCC 8, so I basically re-used (most of) the observations that led to
Venkat's https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00437.html but
in order to prevent FMA generation, not to split it.

>
> Zen IIRC can execute two FMA128 in parallel (thus one FMA256 per cycle).
> That makes it an odd observation that avx256 can "fix" the issue.  WRT
> the above suggestion the situation needs to be clarified.

I don't have a good explanation for this, I will try to ask AMD.

>
>> I suppose that the best optimization for the above would be to split
>> the loops, but one could probably construct at least an artificial
>> testcase where the FMAs would keep enough locality that it is not the
>> case.  The mechanism can be easily extended to keep track of not just
>> one chain but a few, preferably as a followup, if people think it
>> makes sense.
>>
>> An interesting observation is that the matrix multiplication does not
>> suffer the penalty when compiled with -O3 -mprefer-vector-width=256.
>> Apparently the 256 bit vector processing can hide the latency penalty
>> when internally it is split into two halves.  The same goes for 512
>> bit vectors.
>> That is why the patch leaves those be - well, there is a param for
>> the threshold which is set to zero for everybody but znver1.  If
>> maintainers of any other architecture suspect that their FMAs might
>> suffer a similar latency problem, they can easily try tweaking that
>> parameter and see what happens with the matrix multiplication
>> example.
>>
>> I have bootstrapped and tested the patch on x86_64-linux (as it is
>> and also with the param set to 256 by default to make it trigger).
>> I have also measured run-times of all benchmarks in SPEC 2006 FP and
>> SPEC 2017 FPrate and the only changes are the big improvements of
>> calculix and parest.
>>
>> After I address any comments and/or suggestions, would it be OK for
>> trunk?
>
> What does the patch do to bwaves whose innermost loop is a sequence
> of FMAs (slightly complicated by reassoc widening the sum)?

It splits the chain of FMAs at -O2 and -O3 (not at -Ofast where
reassociation does that).  It actually speeds up the testcase by over
5% at -O2 and over 10% at -O3.

>
> So for bwaves we currently have loads feeding the multiplies which might
> cause enough latency to make the FMA issue not matter and/or there
> are intermediate non-FMA adds that might hide the issue.  Forcing
> a full FMA chain is slowing Zen down significantly though (see my report).

Apparently so.

>
> Another issue is: are FMAs not discovered anyway by combine later?
> Possibly not on x86 since all patterns use FMA rather than (plus (mult ...)).

I don't think so, but I may easily be wrong when it comes to such late
passes.

>
> One minor issue below (otherwise I was just looking at the high-level parts).

...

>> +static gphi *
>> +result_of_phi (tree op)
>> +{
>> +  if (TREE_CODE (op) != SSA_NAME)
>> +    return NULL;
>
>   return dyn_cast <gphi *> (SSA_NAME_DEF_STMT (op));

Fixed on my private branch.

> ?  Maybe inline into the single(?) user.

There are two.  I can do that, but the (TREE_CODE (op) != SSA_NAME)
part would make the code IMHO uglier.

...
>> +/* After processing statements of a BB and recording STATE, return
>> +   true if the initial phi is fed by the last FMA candidate result
>> +   or one such result from previously processed BBs marked in
>> +   LAST_RESULT_SET.  */
>> +
>> +static bool
>> +last_fma_candidate_feeds_initial_phi (fma_deferring_state *state,
>> +				      hash_set<tree> *last_result_set)
>> +{
>> +  ssa_op_iter iter;
>> +  use_operand_p use;
>> +  FOR_EACH_PHI_ARG (use, state->m_initial_phi, iter, SSA_OP_USE)
>> +    {
>> +      tree t = USE_FROM_PTR (use);
>> +      if (t == state->m_last_result
>> +	  || last_result_set->contains (t))
>> +	return true;
>> +    }
>
> maybe query the other way around,
>
>   if (single_imm_use (state->m_last_result, ....)
>       && def_stmt is PHI?

I am not sure I understand.  There are usually two uses of
state->m_last_result, one in the PHI node and at least one final use
of the computed sum after the loop.  Also, I need to check the
arguments whether they are in the last_result_set.

Thanks for the comments,

Martin