Date: Thu, 28 May 2015 15:28:00 -0000
From: Jakub Jelinek
To: Thomas Schwinge
Cc: gcc-patches@gcc.gnu.org, Bernd Schmidt, Nathan Sidwell, Julian Brown
Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points
Message-ID: <20150528150802.GO10247@tucnak.redhat.com>
References: <20150528150635.7bd5db23@octopus> <20150528142011.GN10247@tucnak.redhat.com> <87pp5kg3js.fsf@schwinge.name>
In-Reply-To: <87pp5kg3js.fsf@schwinge.name>

On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
> > I think much better would be to have a function attribute (or cgraph
> > flag) that would be set for functions you want to compile this way
> > (plus a targetm flag that the targets want to support it that way),
> > plus a flag in loop structure for the acc loop vector loops
> > (perhaps the current OpenMP simd loop flags are good enough for that),
> > and lower it somewhere around the vectorization pass or so.
>
> Moving the loop lowering/expansion later is along the same lines as we've
> been thinking. Figuring out how the OpenMP simd implementation works is
> another thing I wanted to look into.

The OpenMP simd expansion is actually a quite simple thing. Basically,
the simd loop is expanded in ompexp as a normal loop, with some flags in
the loop structure (which are pretty much optimization hints). There is
a flag saying that the user would really like this loop vectorized, and
another field saying (based on what the user told the compiler) what
vectorization factor is safe to use regardless of the compiler's own
analysis.
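To make that concrete, a minimal sketch, assuming 2015-era GCC
internals, where those hints are the safelen and force_vectorize fields
of struct loop in gcc/cfgloop.h:

  /* A simd loop with a user-asserted safe vectorization factor
     and a privatized variable.  */
  void
  foo (float *a, float *b, int n)
  {
    float t;
  #pragma omp simd safelen(8) private(t)
    for (int i = 0; i < n; i++)
      {
        t = b[i] * 2.0f;
        a[i] = t + 1.0f;
      }
  }

After ompexp this is again just a normal loop, carrying roughly
loop->safelen = 8 and loop->force_vectorize = true as hints for the
vectorizer; the privatized t is rewritten into the per-simd-lane array
form (indexed via the .GOMP_SIMD_LANE internal function) described in
the next paragraph.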
There are some complications with privatization clauses: some variables
are represented in GIMPLE as arrays with the maximum vectorization
factor of elements, indexed by an internal function (the simd lane).
The vectorizer then either turns those back into scalars (if the loop
isn't vectorized), or vectorizes the loop and, for addressable
variables, keeps them as arrays with the actual vectorization factor of
elements.

I admit I don't know too much about OpenACC, but I'd think doing
something similar (i.e. some loop-structure hint or request that a
particular loop be vectorized, and perhaps something about lexical
forward/backward dependencies in the loop) could work. Then for
XeonPhi or the host fallback you'd just use the normal vectorizer, and
for PTX you could, at about the same point in the pipeline, instead of
vectorizing lower the code so that a single worker thread does
everything except the simd-marked loops, which would be lowered to run
on all threads in the warp.

> Not disagreeing, but: we have to start somewhere. GPU offloading and all
> its peculiarities is still entering unknown territory in GCC; we're still
> learning, and shall try to converge the emerging different
> implementations in the future. Doing the completely generic (agnostic of
> specific offloading device) implementation right now is a challenging
> task, hence the work on a "nvptx-specific prototype" first, to put it
> this way.

I understand it is more work; I'd just like to ask that when designing
stuff for the OpenACC offloading you (plural) try to take the other
offloading devices and the host fallback into account. E.g. XeonPhi is
not hard to understand: it is pretty much just a many-core x86_64 chip,
where offloading means arranging to run something on the other device,
and the emulation mode emulates that quite faithfully by running the
code in a different process. This stuff is already about what happens
in the offloaded code, so the considerations for it are similar to
those for host code (especially hosts that can vectorize).

As far as OpenMP / PTX goes, I'll try to find time for it again soon
(busy with OpenMP 4.1 work so far), but e.g. the above stuff (having a
single thread in a warp do most of the non-vectorized work, and only
using the other threads in the warp for vectorization) is definitely
something OpenMP will benefit from too.

	Jakub