SPEC 456.hmmer vectorization question

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* SPEC 456.hmmer vectorization question
@ 2017-03-06 22:37 Steve Ellcey
  2017-03-07 13:45 ` Michael Matz
  0 siblings, 1 reply; 6+ messages in thread
From: Steve Ellcey @ 2017-03-06 22:37 UTC (permalink / raw)
  To: gcc; +Cc: law, matz

I was looking at the spec 456.hmmer benchmark and this email string
from Jeff Law and Micheal Matz:

  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html

and was wondering if anyone was looking at what more it would take
for GCC to vectorize the loop in P7Viterbi.  There is a big performance
win to be had here if it can be done but the alias checking needed
seems rather extensive.

Steve Ellcey
sellcey@cavium.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SPEC 456.hmmer vectorization question
  2017-03-06 22:37 SPEC 456.hmmer vectorization question Steve Ellcey
@ 2017-03-07 13:45 ` Michael Matz
  2017-03-08 19:41   ` Steve Ellcey
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Matz @ 2017-03-07 13:45 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: gcc, law

Hi Steve,

On Mon, 6 Mar 2017, Steve Ellcey wrote:

> I was looking at the spec 456.hmmer benchmark and this email string
> from Jeff Law and Micheal Matz:
> 
>   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
> 
> and was wondering if anyone was looking at what more it would take
> for GCC to vectorize the loop in P7Viterbi.

It takes what I wrote in there.  There are two important things that need 
to happen to get the best performance (at least from an analysis I did in 
2011, but nothing material should have changed since then):

(1) loop distribution to make some memory streams vectorizable (and leave 
    the others in non-vectorized form).
(1a) loop splitting based on conditional (to remove the k<M conditional)
(2) a predictive commoning (or loop carried store reuse) on the dc[] 
    stream

Non of these is valid if the loop streams can't be disambiguated, and as 
this is C only adding explicit restrict qualifiers would give you that, or 
runtime disambiguation, like ICC is doing, that's part (0).

Part (1a) is implemented by gimple loop splitting, but it's not effective 
on hmmer because of missing part (0).  Even then performance is not on par 
with ICC as long as parts (1) and (2) are missing as well (measured by 
splitting the loops by hand and adding restrict qualifiers explicitely).

The inner loop split by streams and conditinional looks like so:

    for (k = 1; k < M; k++) {
      mc[k] = mpp[k-1]   + tpmm[k-1];
      if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;
    }

    for (k = 1; k < M; k++) {
      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;
    }

    for (k = 1; k < M; k++) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }
    /* last iteration of original loop */
    k = M;
      mc[k] = mpp[k-1]   + tpmm[k-1];
      if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;

      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;

(Note again, that this is only valid with disambiguation).  Adding 
restrict qualifiers at the top of routine like so:

#define R __restrict
  int  * R mc, * R dc, * R ic;        /* pointers to rows of mmx, dmx, imx */
  int  * R ms, * R is;             /* pointers to msc[i], isc[i] */
  int  * R mpp, * R mpc, * R ip;      /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
  int  * R bp;               /* ptr into bsc[] */
  int  * R ep;                  /* ptr into esc[] */
  int  * R dpp;                 /* ptr into dmx[i-1] (previous row) */
  int  * R tpmm, * R tpmi, * R tpmd, * R tpim, * R tpii, * R tpdm, * R tpdd; /* ptrs into tsc */

helps to vectorize this.  To get the final rest of performance also this 
transformation needs to happen on the dc[] loop:

    dctemp=dc[0];
    for (k = 1; k < M; k++) {
      dctemp = dctemp + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc;
      if (dctemp < -INFTY) dctemp = -INFTY;
      dc[k] = dctemp;
    }

Our loop distribution should actually already be able to split off the 
three memory streams when restrict is added everywhere, at the 2011 time 
frame it didn't do it nevertheless (and I haven't looked if it would be 
able to do that now).

predictive commoning could do the dc[] transformation (part (2)), except 
that it can't without disambiguation.  That adding restrict doesn't help 
here is PR50419, but ultimately it would have to work on the disambiguated 
loop (without the restrict pointer).

So really the prerequisite to optimize hmmer is loop disambiguation, even 
with the many streams (and hence conditionals) that are there.  And it 
needs to happen well before the loop vectorizer, because loop splitting 
and distribution, _and_ predictive commoning have the disambiguation as 
prerequisite in this testcase.

After that loop distribution needs to be looked at why it doesn't want to 
distribute the streams, and then a variant of PR50419 needs to be fixed 
based on disambiguation info (not based on restrict).  For that we need 
infrastructure that would enable us to disambiguate mem accesses after 
loop nest versioning happened in the "good" version.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SPEC 456.hmmer vectorization question
  2017-03-07 13:45 ` Michael Matz
@ 2017-03-08 19:41   ` Steve Ellcey
  2017-03-09  8:02     ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Steve Ellcey @ 2017-03-08 19:41 UTC (permalink / raw)
  To: Michael Matz; +Cc: gcc, law

On Tue, 2017-03-07 at 14:45 +0100, Michael Matz wrote:
> Hi Steve,
> 
> On Mon, 6 Mar 2017, Steve Ellcey wrote:
> 
> > 
> > I was looking at the spec 456.hmmer benchmark and this email string
> > from Jeff Law and Micheal Matz:
> > 
> > Â  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
> > 
> > and was wondering if anyone was looking at what more it would take
> > for GCC to vectorize the loop in P7Viterbi.

> It takes what I wrote in there.Â Â There are two important things that needÂ 
> to happen to get the best performance (at least from an analysis I did inÂ 
> 2011, but nothing material should have changed since then):

I guess I was hoping that some progress had been made since then, but
it sounds like it hasn't.

> (1) loop distribution to make some memory streams vectorizable (and leaveÂ 
> Â Â Â Â the others in non-vectorized form).
> (1a) loop splitting based on conditional (to remove the k
> (2) a predictive commoning (or loop carried store reuse) on the dc[]Â 
> Â Â Â Â stream
> 
> None of these is valid if the loop streams can't be disambiguated, and asÂ 
> this is C only adding explicit restrict qualifiers would give you that, orÂ 
> runtime disambiguation, like ICC is doing, that's part (0).

So it sounds like the loop would have to be split up using runtime
disambiguation before we could do any of the optimizations. Â Would that
check and split be something that could or should be done using the
graphite framework or would it be a seperate pass done before the
graphite phase is called? Â I am not sure how one would determine what
loops would be worth splitting and which ones would not during such a
phase.

Steve Ellcey
sellcey@cavium.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SPEC 456.hmmer vectorization question
  2017-03-08 19:41   ` Steve Ellcey
@ 2017-03-09  8:02     ` Richard Biener
  2017-03-09  8:12       ` Jakub Jelinek
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2017-03-09  8:02 UTC (permalink / raw)
  To: Steve Ellcey; +Cc: Michael Matz, GCC Development, Jeff Law

On Wed, Mar 8, 2017 at 8:41 PM, Steve Ellcey <sellcey@caviumnetworks.com> wrote:
> On Tue, 2017-03-07 at 14:45 +0100, Michael Matz wrote:
>> Hi Steve,
>>
>> On Mon, 6 Mar 2017, Steve Ellcey wrote:
>>
>> >
>> > I was looking at the spec 456.hmmer benchmark and this email string
>> > from Jeff Law and Micheal Matz:
>> >
>> >   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
>> >
>> > and was wondering if anyone was looking at what more it would take
>> > for GCC to vectorize the loop in P7Viterbi.
>
>> It takes what I wrote in there.  There are two important things that need
>> to happen to get the best performance (at least from an analysis I did in
>> 2011, but nothing material should have changed since then):
>
> I guess I was hoping that some progress had been made since then, but
> it sounds like it hasn't.
>
>> (1) loop distribution to make some memory streams vectorizable (and leave
>>     the others in non-vectorized form).
>> (1a) loop splitting based on conditional (to remove the k
>> (2) a predictive commoning (or loop carried store reuse) on the dc[]
>>     stream
>>
>> None of these is valid if the loop streams can't be disambiguated, and as
>> this is C only adding explicit restrict qualifiers would give you that, or
>> runtime disambiguation, like ICC is doing, that's part (0).
>
> So it sounds like the loop would have to be split up using runtime
> disambiguation before we could do any of the optimizations.  Would that
> check and split be something that could or should be done using the
> graphite framework or would it be a seperate pass done before the
> graphite phase is called?  I am not sure how one would determine what
> loops would be worth splitting and which ones would not during such a
> phase.

It would need to be done before graphite, and yes, the question is when
to do this (given the non-trival text size and runtime cost).  One option is
to do sth similar like we do with IFN_LOOP_VECTORIZED, that is, after
followup transforms decide whether the specialized version received any
important optimization.  Another option is to add value profile counters
for aliasing and only do this with FDO when we know at runtime there
is no aliasing.

Richard.

> Steve Ellcey
> sellcey@cavium.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SPEC 456.hmmer vectorization question
  2017-03-09  8:02     ` Richard Biener
@ 2017-03-09  8:12       ` Jakub Jelinek
  2017-03-09  8:19         ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Jakub Jelinek @ 2017-03-09  8:12 UTC (permalink / raw)
  To: Richard Biener; +Cc: Steve Ellcey, Michael Matz, GCC Development, Jeff Law

On Thu, Mar 09, 2017 at 09:02:38AM +0100, Richard Biener wrote:
> It would need to be done before graphite, and yes, the question is when
> to do this (given the non-trival text size and runtime cost).  One option is
> to do sth similar like we do with IFN_LOOP_VECTORIZED, that is, after
> followup transforms decide whether the specialized version received any
> important optimization.  Another option is to add value profile counters
> for aliasing and only do this with FDO when we know at runtime there
> is no aliasing.

It doesn't have to be either/or.  If we have FDO, we can do it
unconditionally if we have gathered into that there is likely no aliasing,
and optimize the other loop (for the case of aliasing) for size.
If we don't have FDO, we could do the IFN_LOOP_VERSIONED way.
For IFN_LOOP_VERSIONED, if we check all aliasing cases we could then either
use the OpenMP/Cilk/ivdep pragma loop properties (loop->safelen etc.),
or even have something stronger (that would say that there aren't
any inter-iteration memory dependencies).

	Jakub

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SPEC 456.hmmer vectorization question
  2017-03-09  8:12       ` Jakub Jelinek
@ 2017-03-09  8:19         ` Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2017-03-09  8:19 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Steve Ellcey, Michael Matz, GCC Development, Jeff Law

On Thu, Mar 9, 2017 at 9:12 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Mar 09, 2017 at 09:02:38AM +0100, Richard Biener wrote:
>> It would need to be done before graphite, and yes, the question is when
>> to do this (given the non-trival text size and runtime cost).  One option is
>> to do sth similar like we do with IFN_LOOP_VECTORIZED, that is, after
>> followup transforms decide whether the specialized version received any
>> important optimization.  Another option is to add value profile counters
>> for aliasing and only do this with FDO when we know at runtime there
>> is no aliasing.
>
> It doesn't have to be either/or.  If we have FDO, we can do it
> unconditionally if we have gathered into that there is likely no aliasing,
> and optimize the other loop (for the case of aliasing) for size.
> If we don't have FDO, we could do the IFN_LOOP_VERSIONED way.
> For IFN_LOOP_VERSIONED, if we check all aliasing cases we could then either
> use the OpenMP/Cilk/ivdep pragma loop properties (loop->safelen etc.),
> or even have something stronger (that would say that there aren't
> any inter-iteration memory dependencies).

We can use MR_DEPENDENCE_* to partition the dependences properly
as well.

For loop distribution we can also check profitability before adding any
dependence related edges and version according to them.  Of course
that needs a meaningful cost model...

Similarly you can run the ISL optimizer as if there were no dependences
and compare the resulting code to the original one with a cost model.

This is what the vectorizer does before doing the versioning.  For enablement
transforms cost modeling is of course hard unless you can chain analysis
parts of multiple passes (basically integrate loop passes into "one").

Of course this breaks down once you consider not disambiguating all
unknown dependences but only a few (in case the transform can still
handle some of those cases - the vectorizer for example cannot deal
with any unknown dependences).  (breaks down in complexity)

Richard.

>
>         Jakub

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-03-09  8:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-06 22:37 SPEC 456.hmmer vectorization question Steve Ellcey
2017-03-07 13:45 ` Michael Matz
2017-03-08 19:41   ` Steve Ellcey
2017-03-09  8:02     ` Richard Biener
2017-03-09  8:12       ` Jakub Jelinek
2017-03-09  8:19         ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).