* SPEC 456.hmmer vectorization question @ 2017-03-06 22:37 Steve Ellcey 2017-03-07 13:45 ` Michael Matz 0 siblings, 1 reply; 6+ messages in thread From: Steve Ellcey @ 2017-03-06 22:37 UTC (permalink / raw) To: gcc; +Cc: law, matz I was looking at the spec 456.hmmer benchmark and this email string from Jeff Law and Micheal Matz: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html and was wondering if anyone was looking at what more it would take for GCC to vectorize the loop in P7Viterbi. There is a big performance win to be had here if it can be done but the alias checking needed seems rather extensive. Steve Ellcey sellcey@cavium.com ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: SPEC 456.hmmer vectorization question 2017-03-06 22:37 SPEC 456.hmmer vectorization question Steve Ellcey @ 2017-03-07 13:45 ` Michael Matz 2017-03-08 19:41 ` Steve Ellcey 0 siblings, 1 reply; 6+ messages in thread From: Michael Matz @ 2017-03-07 13:45 UTC (permalink / raw) To: Steve Ellcey; +Cc: gcc, law Hi Steve, On Mon, 6 Mar 2017, Steve Ellcey wrote: > I was looking at the spec 456.hmmer benchmark and this email string > from Jeff Law and Micheal Matz: > > https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html > > and was wondering if anyone was looking at what more it would take > for GCC to vectorize the loop in P7Viterbi. It takes what I wrote in there. There are two important things that need to happen to get the best performance (at least from an analysis I did in 2011, but nothing material should have changed since then): (1) loop distribution to make some memory streams vectorizable (and leave the others in non-vectorized form). (1a) loop splitting based on conditional (to remove the k<M conditional) (2) a predictive commoning (or loop carried store reuse) on the dc[] stream Non of these is valid if the loop streams can't be disambiguated, and as this is C only adding explicit restrict qualifiers would give you that, or runtime disambiguation, like ICC is doing, that's part (0). Part (1a) is implemented by gimple loop splitting, but it's not effective on hmmer because of missing part (0). Even then performance is not on par with ICC as long as parts (1) and (2) are missing as well (measured by splitting the loops by hand and adding restrict qualifiers explicitely). The inner loop split by streams and conditinional looks like so: for (k = 1; k < M; k++) { mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; } for (k = 1; k < M; k++) { dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; } for (k = 1; k < M; k++) { ic[k] = mpp[k] + tpmi[k]; if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; ic[k] += is[k]; if (ic[k] < -INFTY) ic[k] = -INFTY; } /* last iteration of original loop */ k = M; mc[k] = mpp[k-1] + tpmm[k-1]; if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc; if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc; if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc; mc[k] += ms[k]; if (mc[k] < -INFTY) mc[k] = -INFTY; dc[k] = dc[k-1] + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; if (dc[k] < -INFTY) dc[k] = -INFTY; (Note again, that this is only valid with disambiguation). Adding restrict qualifiers at the top of routine like so: #define R __restrict int * R mc, * R dc, * R ic; /* pointers to rows of mmx, dmx, imx */ int * R ms, * R is; /* pointers to msc[i], isc[i] */ int * R mpp, * R mpc, * R ip; /* ptrs to mmx[i-1], mmx[i], imx[i-1] */ int * R bp; /* ptr into bsc[] */ int * R ep; /* ptr into esc[] */ int * R dpp; /* ptr into dmx[i-1] (previous row) */ int * R tpmm, * R tpmi, * R tpmd, * R tpim, * R tpii, * R tpdm, * R tpdd; /* ptrs into tsc */ helps to vectorize this. To get the final rest of performance also this transformation needs to happen on the dc[] loop: dctemp=dc[0]; for (k = 1; k < M; k++) { dctemp = dctemp + tpdd[k-1]; if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc; if (dctemp < -INFTY) dctemp = -INFTY; dc[k] = dctemp; } Our loop distribution should actually already be able to split off the three memory streams when restrict is added everywhere, at the 2011 time frame it didn't do it nevertheless (and I haven't looked if it would be able to do that now). predictive commoning could do the dc[] transformation (part (2)), except that it can't without disambiguation. That adding restrict doesn't help here is PR50419, but ultimately it would have to work on the disambiguated loop (without the restrict pointer). So really the prerequisite to optimize hmmer is loop disambiguation, even with the many streams (and hence conditionals) that are there. And it needs to happen well before the loop vectorizer, because loop splitting and distribution, _and_ predictive commoning have the disambiguation as prerequisite in this testcase. After that loop distribution needs to be looked at why it doesn't want to distribute the streams, and then a variant of PR50419 needs to be fixed based on disambiguation info (not based on restrict). For that we need infrastructure that would enable us to disambiguate mem accesses after loop nest versioning happened in the "good" version. Ciao, Michael. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: SPEC 456.hmmer vectorization question 2017-03-07 13:45 ` Michael Matz @ 2017-03-08 19:41 ` Steve Ellcey 2017-03-09 8:02 ` Richard Biener 0 siblings, 1 reply; 6+ messages in thread From: Steve Ellcey @ 2017-03-08 19:41 UTC (permalink / raw) To: Michael Matz; +Cc: gcc, law On Tue, 2017-03-07 at 14:45 +0100, Michael Matz wrote: > Hi Steve, > > On Mon, 6 Mar 2017, Steve Ellcey wrote: > > > > > I was looking at the spec 456.hmmer benchmark and this email string > > from Jeff Law and Micheal Matz: > > > >  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html > > > > and was wondering if anyone was looking at what more it would take > > for GCC to vectorize the loop in P7Viterbi. > It takes what I wrote in there.  There are two important things that need > to happen to get the best performance (at least from an analysis I did in > 2011, but nothing material should have changed since then): I guess I was hoping that some progress had been made since then, but it sounds like it hasn't. > (1) loop distribution to make some memory streams vectorizable (and leave >     the others in non-vectorized form). > (1a) loop splitting based on conditional (to remove the k > (2) a predictive commoning (or loop carried store reuse) on the dc[] >     stream > > None of these is valid if the loop streams can't be disambiguated, and as > this is C only adding explicit restrict qualifiers would give you that, or > runtime disambiguation, like ICC is doing, that's part (0). So it sounds like the loop would have to be split up using runtime disambiguation before we could do any of the optimizations.  Would that check and split be something that could or should be done using the graphite framework or would it be a seperate pass done before the graphite phase is called?  I am not sure how one would determine what loops would be worth splitting and which ones would not during such a phase. Steve Ellcey sellcey@cavium.com ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: SPEC 456.hmmer vectorization question 2017-03-08 19:41 ` Steve Ellcey @ 2017-03-09 8:02 ` Richard Biener 2017-03-09 8:12 ` Jakub Jelinek 0 siblings, 1 reply; 6+ messages in thread From: Richard Biener @ 2017-03-09 8:02 UTC (permalink / raw) To: Steve Ellcey; +Cc: Michael Matz, GCC Development, Jeff Law On Wed, Mar 8, 2017 at 8:41 PM, Steve Ellcey <sellcey@caviumnetworks.com> wrote: > On Tue, 2017-03-07 at 14:45 +0100, Michael Matz wrote: >> Hi Steve, >> >> On Mon, 6 Mar 2017, Steve Ellcey wrote: >> >> > >> > I was looking at the spec 456.hmmer benchmark and this email string >> > from Jeff Law and Micheal Matz: >> > >> > https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html >> > >> > and was wondering if anyone was looking at what more it would take >> > for GCC to vectorize the loop in P7Viterbi. > >> It takes what I wrote in there. There are two important things that need >> to happen to get the best performance (at least from an analysis I did in >> 2011, but nothing material should have changed since then): > > I guess I was hoping that some progress had been made since then, but > it sounds like it hasn't. > >> (1) loop distribution to make some memory streams vectorizable (and leave >> the others in non-vectorized form). >> (1a) loop splitting based on conditional (to remove the k >> (2) a predictive commoning (or loop carried store reuse) on the dc[] >> stream >> >> None of these is valid if the loop streams can't be disambiguated, and as >> this is C only adding explicit restrict qualifiers would give you that, or >> runtime disambiguation, like ICC is doing, that's part (0). > > So it sounds like the loop would have to be split up using runtime > disambiguation before we could do any of the optimizations. Would that > check and split be something that could or should be done using the > graphite framework or would it be a seperate pass done before the > graphite phase is called? I am not sure how one would determine what > loops would be worth splitting and which ones would not during such a > phase. It would need to be done before graphite, and yes, the question is when to do this (given the non-trival text size and runtime cost). One option is to do sth similar like we do with IFN_LOOP_VECTORIZED, that is, after followup transforms decide whether the specialized version received any important optimization. Another option is to add value profile counters for aliasing and only do this with FDO when we know at runtime there is no aliasing. Richard. > Steve Ellcey > sellcey@cavium.com ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: SPEC 456.hmmer vectorization question 2017-03-09 8:02 ` Richard Biener @ 2017-03-09 8:12 ` Jakub Jelinek 2017-03-09 8:19 ` Richard Biener 0 siblings, 1 reply; 6+ messages in thread From: Jakub Jelinek @ 2017-03-09 8:12 UTC (permalink / raw) To: Richard Biener; +Cc: Steve Ellcey, Michael Matz, GCC Development, Jeff Law On Thu, Mar 09, 2017 at 09:02:38AM +0100, Richard Biener wrote: > It would need to be done before graphite, and yes, the question is when > to do this (given the non-trival text size and runtime cost). One option is > to do sth similar like we do with IFN_LOOP_VECTORIZED, that is, after > followup transforms decide whether the specialized version received any > important optimization. Another option is to add value profile counters > for aliasing and only do this with FDO when we know at runtime there > is no aliasing. It doesn't have to be either/or. If we have FDO, we can do it unconditionally if we have gathered into that there is likely no aliasing, and optimize the other loop (for the case of aliasing) for size. If we don't have FDO, we could do the IFN_LOOP_VERSIONED way. For IFN_LOOP_VERSIONED, if we check all aliasing cases we could then either use the OpenMP/Cilk/ivdep pragma loop properties (loop->safelen etc.), or even have something stronger (that would say that there aren't any inter-iteration memory dependencies). Jakub ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: SPEC 456.hmmer vectorization question 2017-03-09 8:12 ` Jakub Jelinek @ 2017-03-09 8:19 ` Richard Biener 0 siblings, 0 replies; 6+ messages in thread From: Richard Biener @ 2017-03-09 8:19 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Steve Ellcey, Michael Matz, GCC Development, Jeff Law On Thu, Mar 9, 2017 at 9:12 AM, Jakub Jelinek <jakub@redhat.com> wrote: > On Thu, Mar 09, 2017 at 09:02:38AM +0100, Richard Biener wrote: >> It would need to be done before graphite, and yes, the question is when >> to do this (given the non-trival text size and runtime cost). One option is >> to do sth similar like we do with IFN_LOOP_VECTORIZED, that is, after >> followup transforms decide whether the specialized version received any >> important optimization. Another option is to add value profile counters >> for aliasing and only do this with FDO when we know at runtime there >> is no aliasing. > > It doesn't have to be either/or. If we have FDO, we can do it > unconditionally if we have gathered into that there is likely no aliasing, > and optimize the other loop (for the case of aliasing) for size. > If we don't have FDO, we could do the IFN_LOOP_VERSIONED way. > For IFN_LOOP_VERSIONED, if we check all aliasing cases we could then either > use the OpenMP/Cilk/ivdep pragma loop properties (loop->safelen etc.), > or even have something stronger (that would say that there aren't > any inter-iteration memory dependencies). We can use MR_DEPENDENCE_* to partition the dependences properly as well. For loop distribution we can also check profitability before adding any dependence related edges and version according to them. Of course that needs a meaningful cost model... Similarly you can run the ISL optimizer as if there were no dependences and compare the resulting code to the original one with a cost model. This is what the vectorizer does before doing the versioning. For enablement transforms cost modeling is of course hard unless you can chain analysis parts of multiple passes (basically integrate loop passes into "one"). Of course this breaks down once you consider not disambiguating all unknown dependences but only a few (in case the transform can still handle some of those cases - the vectorizer for example cannot deal with any unknown dependences). (breaks down in complexity) Richard. > > Jakub ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-03-09 8:19 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-03-06 22:37 SPEC 456.hmmer vectorization question Steve Ellcey 2017-03-07 13:45 ` Michael Matz 2017-03-08 19:41 ` Steve Ellcey 2017-03-09 8:02 ` Richard Biener 2017-03-09 8:12 ` Jakub Jelinek 2017-03-09 8:19 ` Richard Biener
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).