Date: Mon, 22 Jun 2015 15:18:00 -0000
From: Julian Brown
To: Jakub Jelinek
CC: Bernd Schmidt, Thomas Schwinge, Nathan Sidwell
Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points

On Mon, 22 Jun 2015 16:24:56 +0200
Jakub Jelinek wrote:

> On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote:
> > One problem is that (at least on the GPU hardware we've considered
> > so far) we're somewhat constrained in how much control we have
> > over how the underlying hardware executes code: it's possible to
> > draw up a scheme where OpenACC source-level control-flow semantics
> > are reflected directly in the PTX assembly output (e.g. to say
> > "all threads in a CTA/warp will be coherent after such-and-such a
> > loop"), and lowering OpenACC directives quite early seems to make
> > that relatively tractable.  (Even if the resulting code is
> > relatively un-optimisable due to the abnormal edges inserted to
> > make sure that the CFG doesn't become "ill-formed".)
> >
> > If arbitrary optimisations are done between OMP-lowering time and
> > somewhere around vectorisation (say), it's less clear if that
> > correspondence can be maintained.  Say if the code executed by
> > half the threads in a warp becomes physically separated from the
> > code executed by the other half of the threads in a warp due to
> > some loop optimisation, we can no longer easily determine where
> > that warp will reconverge, and certain other operations (relying
> > on coherent warps -- e.g. CTA synchronisation) become impossible.
> > A similar issue exists for warps within a CTA.
> >
> > So, essentially -- I don't know how "late" loop lowering would
> > interact with:
> >
> > (a) Maintaining a CFG that will work with PTX.
> >
> > (b) Predication for worker-single and/or vector-single modes
> > (actually all currently-proposed schemes have problems with proper
> > representation of data-dependencies for variables and
> > compiler-generated temporaries between predicated regions.)
>
> I don't understand why lowering the way you suggest helps here at
> all.  In the proposed scheme, you essentially have the whole
> function in e.g. worker-single or vector-single mode, which you need
> to be able to handle properly in any case, because users can write
> such routines themselves.  And then you can have a loop in such a
> function that has some special attribute, a hint that it is
> desirable to vectorize it (for PTX the PTX way) or use vector-single
> mode for it in a worker-single function.  So, the special pass then
> of course needs to handle all the needed broadcasting and reduction
> required to change the mode from e.g. worker-single to
> vector-single, but the convergence points still would be either on
> the boundary of such loops to be vectorized or parallelized, or
> wherever else they appear in normal vector-single or worker-single
> functions (around calls to certain functions?).

I think most of my concerns are centred around loops (with the
markings you suggest) that might be split into parts: if that cannot
happen for loops that are annotated as you describe, maybe things
will work out OK.  (Apologies for my ignorance here, this isn't a
part of the compiler that I know anything about.)

Julian
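
For concreteness, a minimal, hypothetical OpenACC C sketch of the
shape of routine the scheme above would have to handle (the function
name, parameters and pragmas here are invented for illustration, not
taken from the patch series or the thread): a routine whose body runs
with a single active vector lane, containing one loop marked for
vector execution, so that the loop entry and exit are the natural
convergence points being discussed.

  /* Hypothetical sketch only: the routine body executes in "single"
     mode (one active vector lane); the marked loop is executed by
     all lanes.  In the scheme described above, the loop entry is
     where values such as 'b' and 'n' would need broadcasting to the
     other lanes, and the loop exit is where the warp is expected to
     reconverge.  */

  #pragma acc routine worker
  void
  scale (float *a, float b, int n)
  {
    int i;

  #pragma acc loop vector
    for (i = 0; i < n; i++)
      a[i] *= b;        /* vector-partitioned region */
  }

If a later pass were allowed to split or duplicate the marked loop,
those two boundaries would no longer be single, well-defined points
at which the warp is known to be coherent -- which is the concern
raised above.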