From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-192707-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 92136 invoked by alias); 7 Mar 2017 13:45:50 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 92126 invoked by uid 89); 7 Mar 2017 13:45:49 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2 spammy=bsc, isc, k1, esc
X-HELO: mx2.suse.de
Received: from mx2.suse.de (HELO mx2.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 07 Mar 2017 13:45:48 +0000
Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254])	by mx2.suse.de (Postfix) with ESMTP id 6A3F1AD51;	Tue,  7 Mar 2017 13:45:46 +0000 (UTC)
Date: Tue, 07 Mar 2017 13:45:00 -0000
From: Michael Matz <matz@suse.de>
To: Steve Ellcey <sellcey@cavium.com>
cc: gcc@gcc.gnu.org, law@redhat.com
Subject: Re: SPEC 456.hmmer vectorization question
In-Reply-To: <201703062237.v26MbW5e008866@sellcey-dt.caveonetworks.com>
Message-ID: <alpine.LSU.2.20.1703071423440.13579@wotan.suse.de>
References: <201703062237.v26MbW5e008866@sellcey-dt.caveonetworks.com>
User-Agent: Alpine 2.20 (LSU 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-IsSubscribed: yes
X-SW-Source: 2017-03/txt/msg00015.txt.bz2

Hi Steve,

On Mon, 6 Mar 2017, Steve Ellcey wrote:

> I was looking at the spec 456.hmmer benchmark and this email string
> from Jeff Law and Micheal Matz:
> 
>   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
> 
> and was wondering if anyone was looking at what more it would take
> for GCC to vectorize the loop in P7Viterbi.

It takes what I wrote in there.  There are two important things that need 
to happen to get the best performance (at least from an analysis I did in 
2011, but nothing material should have changed since then):

(1) loop distribution to make some memory streams vectorizable (and leave 
    the others in non-vectorized form).
(1a) loop splitting based on conditional (to remove the k<M conditional)
(2) a predictive commoning (or loop carried store reuse) on the dc[] 
    stream
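
As a small illustration of what (1a) means, here is a minimal sketch on a
made-up loop (this is not the hmmer source; toy_fused, toy_split and
toy_split_matches are hypothetical names of mine, and the arithmetic is
arbitrary).  It only shows the shape of the transformation: the k < M test
leaves the loop body, and the iterations where it is false are peeled
off behind the main loop.

```c
#include <assert.h>

/* Before: the k < M test sits inside the loop body. */
static void toy_fused(int *a, int *b, const int *x, int M)
{
  int k;
  for (k = 1; k <= M; k++) {
    a[k] = x[k-1] + 1;
    if (k < M)
      b[k] = x[k] * 2;
  }
}

/* After splitting on k < M: the first loop runs only iterations where the
   condition is true, the second loop handles the rest (here just k == M). */
static void toy_split(int *a, int *b, const int *x, int M)
{
  int k;
  for (k = 1; k < M; k++) {
    a[k] = x[k-1] + 1;
    b[k] = x[k] * 2;
  }
  for (; k <= M; k++)
    a[k] = x[k-1] + 1;
}

/* Returns 1 if both forms produce identical a[] and b[]. */
int toy_split_matches(void)
{
  enum { M = 6 };
  const int x[M] = {3, -1, 4, 1, -5, 9};
  int a1[M + 1] = {0}, b1[M + 1] = {0};
  int a2[M + 1] = {0}, b2[M + 1] = {0};
  int k;

  toy_fused(a1, b1, x, M);
  toy_split(a2, b2, x, M);
  for (k = 0; k <= M; k++)
    if (a1[k] != a2[k] || b1[k] != b2[k])
      return 0;
  assert(b1[M] == 0);  /* b[M] is never written in either form */
  return 1;
}
```

In this toy a[], b[] and x[] are distinct local arrays, so the question of
aliasing doesn't come up; in hmmer it does.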

None of these is valid if the loop streams can't be disambiguated.  As 
this is C, only explicit restrict qualifiers or runtime disambiguation 
(which is what ICC does) would give you that; call that part (0).

Part (1a) is implemented by gimple loop splitting, but it's not effective 
on hmmer because of missing part (0).  Even then performance is not on par 
with ICC as long as parts (1) and (2) are missing as well (measured by 
splitting the loops by hand and adding restrict qualifiers explicitly).

The inner loop, split by streams and conditional, looks like so:

    for (k = 1; k < M; k++) {
      mc[k] = mpp[k-1]   + tpmm[k-1];
      if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;
    }

    for (k = 1; k < M; k++) {
      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;
    }

    for (k = 1; k < M; k++) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }
    /* last iteration of original loop */
    k = M;
      mc[k] = mpp[k-1]   + tpmm[k-1];
      if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
      if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;

      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;

(Note again that this is only valid with disambiguation.)  Adding 
restrict qualifiers at the top of the routine like so:

#define R __restrict
  int  * R mc, * R dc, * R ic;        /* pointers to rows of mmx, dmx, imx */
  int  * R ms, * R is;             /* pointers to msc[i], isc[i] */
  int  * R mpp, * R mpc, * R ip;      /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
  int  * R bp;               /* ptr into bsc[] */
  int  * R ep;                  /* ptr into esc[] */
  int  * R dpp;                 /* ptr into dmx[i-1] (previous row) */
  int  * R tpmm, * R tpmi, * R tpmd, * R tpim, * R tpii, * R tpdm, * R tpdd; /* ptrs into tsc */

helps to vectorize this.  To get the remaining performance, this 
transformation also needs to happen on the dc[] loop:

    dctemp=dc[0];
    for (k = 1; k < M; k++) {
      dctemp = dctemp + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc;
      if (dctemp < -INFTY) dctemp = -INFTY;
      dc[k] = dctemp;
    }
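
For what it's worth, the equivalence of the two dc[] forms can be checked
with a small harness like the following.  This is only a sketch under my
own assumptions: the value of INFTY, the input data and the names dc_mem,
dc_scalar and dc_rewrite_matches are made up for illustration.  It relies
on dc[] not overlapping mc[], tpdd[] or tpmd[], which is exactly the
disambiguation requirement.

```c
#include <assert.h>

#define INFTY 987654321  /* assumed value, chosen so the sums below fit in int */

/* dc[] loop in the split form: dc[k-1] is read back from memory. */
static void dc_mem(int *dc, const int *mc, const int *tpdd,
                   const int *tpmd, int M)
{
  int k, sc;
  for (k = 1; k < M; k++) {
    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;
  }
}

/* Same loop with the loop-carried value kept in a scalar. */
static void dc_scalar(int *dc, const int *mc, const int *tpdd,
                      const int *tpmd, int M)
{
  int k, sc, dctemp;
  dctemp = dc[0];
  for (k = 1; k < M; k++) {
    dctemp = dctemp + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc;
    if (dctemp < -INFTY) dctemp = -INFTY;
    dc[k] = dctemp;
  }
}

/* Returns 1 if both forms fill dc[] identically on the sample data. */
int dc_rewrite_matches(void)
{
  enum { M = 8 };
  const int mc[M]   = {5, -INFTY, 7, 2, -INFTY, -INFTY, 1, 0};
  const int tpdd[M] = {-3, -INFTY, -1, -2, -INFTY, -4, -1, 0};
  const int tpmd[M] = {-2, -1, -INFTY, -5, -INFTY, -3, -2, 0};
  int a[M] = {0}, b[M] = {0};
  int k;

  a[0] = b[0] = -INFTY;
  dc_mem(a, mc, tpdd, tpmd, M);
  dc_scalar(b, mc, tpdd, tpmd, M);
  for (k = 0; k < M; k++)
    if (a[k] != b[k])
      return 0;
  assert(a[0] == -INFTY);  /* dc[0] is only read, never written */
  return 1;
}
```

The sample data is picked so that both the max and the -INFTY clamp are
exercised at least once.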

Our loop distribution should actually already be able to split off the 
three memory streams when restrict is added everywhere.  Back in 2011 it 
nevertheless didn't do so (and I haven't checked whether it can by 
now).

Predictive commoning could do the dc[] transformation (part (2)), except 
that it can't without disambiguation.  That adding restrict doesn't help 
here is PR50419, but ultimately it would have to work on the disambiguated 
loop (without the restrict pointers).

So really the prerequisite to optimize hmmer is loop disambiguation, even 
with the many streams (and hence conditionals) that are there.  And it 
needs to happen well before the loop vectorizer, because loop splitting 
and distribution, _and_ predictive commoning have the disambiguation as 
prerequisite in this testcase.

After that, we need to look at why loop distribution doesn't want to 
distribute the streams, and then a variant of PR50419 needs to be fixed 
based on disambiguation info (not based on restrict).  For that we need 
infrastructure that lets us disambiguate memory accesses in the "good" 
version after loop nest versioning has happened.
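
To make "runtime disambiguation" and the "good" version a bit more
concrete, here is a hand-written sketch of loop versioning on a made-up
one-stream kernel.  It is not what GCC or ICC actually emit; disjoint and
toy_versioned are hypothetical names, and comparing pointers via uintptr_t
assumes a flat address space.  The return value (1 for the "good" version,
0 for the fallback) exists only so the choice can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-element int ranges starting at a and b don't overlap.
   Assumes a flat address space, so the integer comparison is meaningful. */
static int disjoint(const int *a, const int *b, size_t n)
{
  uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
  uintptr_t bytes = (uintptr_t)n * sizeof(int);
  return pa + bytes <= pb || pb + bytes <= pa;
}

/* Toy kernel out[k] = x[k-1] + t[k-1] for k in 1..M, versioned at runtime.
   Returns 1 if the "good" (no-alias) version ran, 0 for the fallback. */
int toy_versioned(int *out, const int *x, const int *t, int M)
{
  int k;
  assert(M >= 0);
  /* Writes touch out[1..M], reads touch x[0..M-1] and t[0..M-1]. */
  if (disjoint(out + 1, x, (size_t)M) && disjoint(out + 1, t, (size_t)M)) {
    /* "Good" version: the check above justifies the restrict qualifiers. */
    int * __restrict o = out;
    const int * __restrict xs = x;
    const int * __restrict ts = t;
    for (k = 1; k <= M; k++)
      o[k] = xs[k-1] + ts[k-1];
    return 1;
  }
  /* Fallback: possibly overlapping streams, plain sequential semantics. */
  for (k = 1; k <= M; k++)
    out[k] = x[k-1] + t[k-1];
  return 0;
}
```

In the sketch the no-alias knowledge is carried into the "good" loop by
writing restrict by hand; the infrastructure mentioned above would have to
carry the same information inside the compiler.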


Ciao,
Michael.