Date: Mon, 22 Jun 2015 14:00:00 -0000
From: Julian Brown <Julian_Brown@mentor.com>
To: Jakub Jelinek
CC: Bernd Schmidt, Thomas Schwinge, Nathan Sidwell
Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points
Message-ID: <20150622145549.481d4549@octopus>
In-Reply-To: <20150619122557.GO10247@tucnak.redhat.com>
References: <20150528150635.7bd5db23@octopus> <20150528142011.GN10247@tucnak.redhat.com> <87pp5kg3js.fsf@schwinge.name> <20150528150802.GO10247@tucnak.redhat.com> <5583E68A.9020608@codesourcery.com> <20150619122557.GO10247@tucnak.redhat.com>

On Fri, 19 Jun 2015 14:25:57 +0200, Jakub Jelinek wrote:

> On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> > On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> >
> > > I understand it is more work, I'd just like to ask that when
> > > designing stuff for the OpenACC offloading you (plural) try to
> > > take the other offloading devices and host fallback into account.
> >
> > The problem is that many of the transformations we need to do are
> > really GPU specific, and with the current structure of omplow/ompexp
> > they are being done in the host compiler.  The offloading scheme we
> > decided on does not give us the means to write out multiple versions
> > of an offloaded function where each target gets a different one.  For
> > that reason I think we should postpone these lowering decisions until
> > we're in the accel compiler, where they could be controlled by target
> > hooks, and over the last two weeks I've been doing some experiments
> > to see how that could be achieved.
>
> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for
> the OpenACC needs.
>
> Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+
> simd?
>
> Why can't the execution model (normal, vector-single and
> worker-single) be simply attributes on functions or cgraph node flags,
> and the kind of #acc loop simply be flags on struct loop, like
> OpenMP simd / Cilk+ simd already is?
One problem is that (at least on the GPU hardware we've considered so far)
we're somewhat constrained in how much control we have over how the
underlying hardware executes code: it's possible to draw up a scheme where
OpenACC source-level control-flow semantics are reflected directly in the
PTX assembly output (e.g. to say "all threads in a CTA/warp will be
coherent after such-and-such a loop"), and lowering OpenACC directives
quite early seems to make that relatively tractable. (Even if the
resulting code is relatively un-optimisable because of the abnormal edges
inserted to keep the CFG from becoming "ill-formed".)

If arbitrary optimisations are done between OMP-lowering time and somewhere
around vectorisation (say), it's less clear whether that correspondence can
be maintained. For example, if some loop optimisation physically separates
the code executed by half the threads in a warp from the code executed by
the other half, we can no longer easily determine where that warp will
reconverge, and operations that rely on coherent warps (e.g. CTA
synchronisation) become impossible. A similar issue exists for warps
within a CTA.

So, essentially, I don't know how "late" loop lowering would interact with:

(a) Maintaining a CFG that will work with PTX.

(b) Predication for worker-single and/or vector-single modes. (Actually,
    all currently-proposed schemes have problems representing data
    dependencies properly for variables and compiler-generated temporaries
    between predicated regions.)

Julian
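
As a concrete illustration of (a) and (b), here is a minimal CUDA sketch.
It is not taken from the gomp4 patches: the kernel, its name `sketch', its
parameters, and the use of the (later, CUDA 9) `__shfl_sync' intrinsic are
all illustrative assumptions. It shows a single-lane ("vector-single"-style)
region whose result must be broadcast to the rest of the warp at the
reconvergence point, followed by a divergent conditional whose arms must
rejoin before a CTA-wide barrier is legal:

    /* Assumptions: one-dimensional CTA whose size is a multiple of 32,
       and an `out' buffer with at least blockDim.x elements.  */
    __global__ void sketch (int *out, int n)
    {
      int lane = threadIdx.x & 31;       /* lane index within the warp */

      /* (b) "vector-single"-style predication: only lane 0 executes the
         single-lane region, so the value it computes must be broadcast
         to the other lanes where the warp reconverges.  The temporary
         `v' is the kind of cross-region data dependency the predication
         schemes need to represent.  */
      int v = 0;
      if (lane == 0)
        v = n * 2;                          /* single-lane region */
      v = __shfl_sync (0xffffffffu, v, 0);  /* broadcast at reconvergence */

      /* (a) CTA synchronisation needs coherent warps: every thread in
         the CTA must reach the barrier.  That only holds because both
         arms of the divergent `if' below rejoin before the barrier; if
         a loop optimisation physically separated the two arms, the
         reconvergence point (and hence the barrier) would no longer be
         well-defined.  */
      if (threadIdx.x < n)
        out[threadIdx.x] = v;
      else
        out[threadIdx.x] = -v;
      /* implicit warp reconvergence point */
      __syncthreads ();                     /* CTA-wide barrier */
    }

The point of the sketch is that both the `__shfl_sync' broadcast and the
`__syncthreads' barrier are only valid because the compiler (and hardware)
can tell exactly where the warp reconverges; an optimisation that moved
either arm of the conditional onto a physically separate path would
invalidate that guarantee.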