Date: Mon, 22 Jun 2015 15:18:00 -0000
From: Julian Brown
To: Jakub Jelinek
CC: Bernd Schmidt, Thomas Schwinge, Nathan Sidwell
Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points

On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered
> > so far) we're somewhat constrained in how much control we have
> > over how the underlying hardware executes code: it's possible to
> > draw up a scheme where OpenACC source-level control-flow semantics
> > are reflected directly in the PTX assembly output (e.g. to say
> > "all threads in a CTA/warp will be coherent after such-and-such a
> > loop"), and lowering OpenACC directives quite early seems to make
> > that relatively tractable.  (Even if the resulting code is
> > relatively un-optimisable due to the abnormal edges inserted to
> > make sure that the CFG doesn't become "ill-formed".)
> >
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained.  Say if the code executed by
> > half the threads in a warp becomes physically separated from the
> > code executed by the other half of the threads in a warp due to
> > some loop optimisation, we can no longer easily determine where
> > that warp will reconverge, and certain other operations (relying
> > on coherent warps -- e.g. CTA synchronisation) become impossible.
> > A similar issue exists for warps within a CTA.
> >
> > So, essentially -- I don't know how "late" loop lowering would
> > interact with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes
> > (actually all currently-proposed schemes have problems with proper
> > representation of data-dependencies for variables and
> > compiler-generated temporaries between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at
> all.  In the proposed scheme, you essentially have the whole
> function in e.g. worker-single or vector-single mode, which you need
> to be able to handle properly in any case, because users can write
> such routines themselves.  And then you can have a loop in such a
> function that has some special attribute, a hint that it is
> desirable to vectorize it (for PTX the PTX way) or use vector-single
> mode for it in a worker-single function.  So, the special pass then
> of course needs to handle all the needed broadcasting and reduction
> required to change the mode from e.g. worker-single to
> vector-single, but the convergence points still would be either on
> the boundary of such loops to be vectorized or parallelized, or
> wherever else they appear in normal vector-single or worker-single
> functions (around calls to certain functions?).

I think most of my concerns are centred around loops (with the
markings you suggest) that might be split into parts: if that cannot
happen for loops that are annotated as you describe, maybe things
will work out OK.  (Apologies for my ignorance here, this isn't a
part of the compiler that I know anything about.)

Julian
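
For concreteness, a minimal, hypothetical OpenACC C sketch of the
shape of routine the scheme above would have to handle (the function
name, parameters and pragmas here are invented for illustration, not
taken from the patch series or the thread): a routine whose body runs
with a single active vector lane, containing one loop marked for
vector execution, so that the loop entry and exit are the natural
convergence points being discussed.

  /* Hypothetical sketch only: the routine body executes in "single"
     mode (one active vector lane); the marked loop is executed by
     all lanes.  In the scheme described above, the loop entry is
     where values such as 'b' and 'n' would need broadcasting to the
     other lanes, and the loop exit is where the warp is expected to
     reconverge.  */

  #pragma acc routine worker
  void
  scale (float *a, float b, int n)
  {
    int i;

  #pragma acc loop vector
    for (i = 0; i < n; i++)
      a[i] *= b;        /* vector-partitioned region */
  }

If a later pass were allowed to split or duplicate the marked loop,
those two boundaries would no longer be single, well-defined points
at which the warp is known to be coherent -- which is the concern
raised above.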