Date: Tue, 07 Mar 2017 13:45:00 -0000
From: Michael Matz
To: Steve Ellcey
cc: gcc@gcc.gnu.org, law@redhat.com
Subject: Re: SPEC 456.hmmer vectorization question
In-Reply-To: <201703062237.v26MbW5e008866@sellcey-dt.caveonetworks.com>
References: <201703062237.v26MbW5e008866@sellcey-dt.caveonetworks.com>

Hi Steve,

On Mon, 6 Mar 2017, Steve Ellcey wrote:

> I was looking at the spec 456.hmmer benchmark and this email string
> from Jeff Law and Micheal Matz:
>
>   https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
>
> and was wondering if anyone was looking at what more it would take
> for GCC to vectorize the loop in P7Viterbi.

It takes what I wrote in there.
There are two important things that need to happen to get the best
performance (at least from an analysis I did in 2011, but nothing material
should have changed since then):

(1) loop distribution to make some memory streams vectorizable (and leave
    the others in non-vectorized form).
(1a) loop splitting based on conditional (to remove the k < M conditional)
[...]

    for (k = 1; k < M; k++) {
      mc[k] = mpp[k-1] + tpmm[k-1];
      if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
      if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
      if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
      mc[k] += ms[k];
      if (mc[k] < -INFTY) mc[k] = -INFTY;
    }

    for (k = 1; k < M; k++) {
      dc[k] = dc[k-1] + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
      if (dc[k] < -INFTY) dc[k] = -INFTY;
    }

    for (k = 1; k < M; k++) {
      ic[k] = mpp[k] + tpmi[k];
      if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
      ic[k] += is[k];
      if (ic[k] < -INFTY) ic[k] = -INFTY;
    }

    /* last iteration of original loop */
    k = M;
    mc[k] = mpp[k-1] + tpmm[k-1];
    if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
    if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
    mc[k] += ms[k];
    if (mc[k] < -INFTY) mc[k] = -INFTY;

    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;

(Note again that this is only valid with disambiguation.)  Adding restrict
qualifiers at the top of the routine, like so:

    #define R __restrict
    int * R mc, * R dc, * R ic;       /* pointers to rows of mmx, dmx, imx */
    int * R ms, * R is;               /* pointers to msc[i], isc[i] */
    int * R mpp, * R mpc, * R ip;     /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
    int * R bp;                       /* ptr into bsc[] */
    int * R ep;                       /* ptr into esc[] */
    int * R dpp;                      /* ptr into dmx[i-1] (previous row) */
    int * R tpmm, * R tpmi, * R tpmd, * R tpim, * R tpii, * R tpdm, * R tpdd;
                                      /* ptrs into tsc */

helps to vectorize this.
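To make the hand transformation checkable, here is a minimal sketch that runs a fused loop and the distributed form on the same data and compares the results.  Several things in it are assumptions rather than anything taken from the benchmark source: the shape of the fused loop (inferred from the three split loops plus the "last iteration" block), the INFTY value, the model length M, the in_t/out_t layout and the helper names mc_step, dc_step, ic_step and check_equivalence.  The streams live in distinct arrays, which stands in for the disambiguation that restrict would promise.

```c
#include <assert.h>
#include <string.h>

#define INFTY 987654321            /* assumed "minus infinity" magnitude */
enum { M = 16, N = M + 1 };        /* hypothetical model length */
enum { MS, IS, MPP, IP, DPP, BP, TPMM, TPIM, TPDM, TPMD, TPDD, TPMI, TPII, NIN };

typedef struct { int s[NIN][N]; int xmb; } in_t;    /* read-only streams */
typedef struct { int mc[N], dc[N], ic[N]; } out_t;  /* written streams */

/* One iteration of the mc[] stream. */
static void mc_step(const in_t *in, out_t *o, int k) {
    int sc;
    o->mc[k] = in->s[MPP][k-1] + in->s[TPMM][k-1];
    if ((sc = in->s[IP][k-1] + in->s[TPIM][k-1]) > o->mc[k]) o->mc[k] = sc;
    if ((sc = in->s[DPP][k-1] + in->s[TPDM][k-1]) > o->mc[k]) o->mc[k] = sc;
    if ((sc = in->xmb + in->s[BP][k]) > o->mc[k]) o->mc[k] = sc;
    o->mc[k] += in->s[MS][k];
    if (o->mc[k] < -INFTY) o->mc[k] = -INFTY;
}

/* One iteration of the dc[] stream; reads dc[k-1] and mc[k-1]. */
static void dc_step(const in_t *in, out_t *o, int k) {
    int sc;
    o->dc[k] = o->dc[k-1] + in->s[TPDD][k-1];
    if ((sc = o->mc[k-1] + in->s[TPMD][k-1]) > o->dc[k]) o->dc[k] = sc;
    if (o->dc[k] < -INFTY) o->dc[k] = -INFTY;
}

/* One iteration of the ic[] stream; depends on inputs only. */
static void ic_step(const in_t *in, out_t *o, int k) {
    int sc;
    o->ic[k] = in->s[MPP][k] + in->s[TPMI][k];
    if ((sc = in->s[IP][k] + in->s[TPII][k]) > o->ic[k]) o->ic[k] = sc;
    o->ic[k] += in->s[IS][k];
    if (o->ic[k] < -INFTY) o->ic[k] = -INFTY;
}

/* Assumed original form: one loop, with the k < M conditional inside. */
static void row_fused(const in_t *in, out_t *o) {
    for (int k = 1; k <= M; k++) {
        mc_step(in, o, k);
        dc_step(in, o, k);
        if (k < M) ic_step(in, o, k);
    }
}

/* Split and distributed form: three loops plus the peeled last iteration. */
static void row_distributed(const in_t *in, out_t *o) {
    for (int k = 1; k < M; k++) mc_step(in, o, k);
    for (int k = 1; k < M; k++) dc_step(in, o, k);
    for (int k = 1; k < M; k++) ic_step(in, o, k);
    mc_step(in, o, M);
    dc_step(in, o, M);
}

/* Returns 1 if both forms produce identical mc/dc/ic rows. */
int check_equivalence(void) {
    in_t in;
    out_t a, b;
    unsigned x = 12345u;
    for (int i = 0; i < NIN; i++)
        for (int k = 0; k < N; k++) {
            x = x * 1103515245u + 12345u;            /* small LCG, test data only */
            in.s[i][k] = (int)((x >> 16) % 2001u) - 1000;
        }
    in.xmb = -5;
    memset(&a, 0, sizeof a);
    a.mc[0] = a.dc[0] = a.ic[0] = -INFTY;
    b = a;
    row_fused(&in, &a);
    row_distributed(&in, &b);
    assert(a.dc[M] >= -INFTY);                       /* clamp invariant */
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Calling check_equivalence() from a small main() should return 1.  The comparison only holds because no written stream aliases a read one; if, say, dc pointed into tpmd, reordering the loops would change the values read.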
To get the last bit of performance, this transformation also needs to
happen on the dc[] loop:

    dctemp = dc[0];
    for (k = 1; k < M; k++) {
      dctemp = dctemp + tpdd[k-1];
      if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc;
      if (dctemp < -INFTY) dctemp = -INFTY;
      dc[k] = dctemp;
    }

Our loop distribution should actually already be able to split off the
three memory streams when restrict is added everywhere, but in the 2011
time frame it didn't do so (and I haven't looked at whether it would be
able to now).  Predictive commoning could do the dc[] transformation
(part (2)), except that it can't without disambiguation.  That adding
restrict doesn't help here is PR50419, but ultimately it would have to work
on the disambiguated loop (without the restrict pointers).

So really the prerequisite to optimizing hmmer is loop disambiguation, even
with the many streams (and hence conditionals) that are there.  And it
needs to happen well before the loop vectorizer, because loop splitting,
loop distribution, _and_ predictive commoning all have the disambiguation
as a prerequisite in this testcase.  After that, someone needs to look at
why loop distribution doesn't want to distribute the streams, and then a
variant of PR50419 needs to be fixed based on disambiguation info (not
based on restrict).  For that we need infrastructure that would let us
disambiguate memory accesses in the "good" version after loop nest
versioning has happened.

Ciao,
Michael.
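P.S. To illustrate what "versioning plus disambiguation" means for the dc[] loop, here is a rough hand-written sketch, not what GCC generates.  A runtime overlap check picks between a "good" version, where dc[k-1] is carried in a scalar, and the untouched loop.  The function names (overlaps, dc_plain, dc_commoned, dc_versioned, check_versioning), the INFTY value and the test data are all made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define INFTY 987654321   /* assumed "minus infinity" magnitude */

/* Nonzero if the n-element int ranges starting at a and b overlap. */
static int overlaps(const int *a, const int *b, int n) {
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t len = (uintptr_t)n * sizeof(int);
    return pa < pb + len && pb < pa + len;
}

/* Fallback version: the loop as written, correct under any aliasing. */
static void dc_plain(int *dc, const int *mc, const int *tpdd,
                     const int *tpmd, int n) {
    int sc;
    for (int k = 1; k < n; k++) {
        dc[k] = dc[k-1] + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
        if (dc[k] < -INFTY) dc[k] = -INFTY;
    }
}

/* "Good" version: dc[k-1] is kept in a scalar, which is only valid when
   the stores to dc[] cannot change mc[], tpdd[] or tpmd[]. */
static void dc_commoned(int * __restrict dc, const int * __restrict mc,
                        const int * __restrict tpdd,
                        const int * __restrict tpmd, int n) {
    int sc, dctemp = dc[0];
    for (int k = 1; k < n; k++) {
        dctemp = dctemp + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc;
        if (dctemp < -INFTY) dctemp = -INFTY;
        dc[k] = dctemp;
    }
}

/* Versioned entry point; returns 1 if the "good" version was taken. */
int dc_versioned(int *dc, const int *mc, const int *tpdd,
                 const int *tpmd, int n) {
    if (!overlaps(dc, mc, n) && !overlaps(dc, tpdd, n) &&
        !overlaps(dc, tpmd, n)) {
        dc_commoned(dc, mc, tpdd, tpmd, n);
        return 1;
    }
    dc_plain(dc, mc, tpdd, tpmd, n);
    return 0;
}

/* Returns 1 if both the disjoint and the aliased case behave as expected. */
int check_versioning(void) {
    enum { LEN = 8 };
    int tpdd[LEN] = {-3, -1, -4, -1, -5, -9, -2, -6};
    int tpmd[LEN] = {-2, -7, -1, -8, -2, -8, -1, -8};
    int mc[LEN]   = {10, 3, 25, 7, 0, 40, 2, 9};
    int want[LEN] = {-INFTY}, got[LEN] = {-INFTY};
    int a[LEN], b[LEN];

    /* Disjoint streams: the "good" version runs and matches the plain loop. */
    dc_plain(want, mc, tpdd, tpmd, LEN);
    int fast = dc_versioned(got, mc, tpdd, tpmd, LEN);
    int ok1 = fast == 1 && memcmp(want, got, sizeof want) == 0;

    /* dc aliases mc: the check must fall back to the plain loop. */
    memcpy(a, mc, sizeof a);
    memcpy(b, mc, sizeof b);
    dc_plain(a, a, tpdd, tpmd, LEN);
    fast = dc_versioned(b, b, tpdd, tpmd, LEN);
    int ok2 = fast == 0 && memcmp(a, b, sizeof a) == 0;

    return ok1 && ok2;
}
```

The point of the sketch is the placement of the check: once the code is inside the branch where the streams are known to be disjoint, the later transformations can all rely on that fact instead of each re-deriving it.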