Date: Thu, 28 May 2015 15:28:00 -0000
From: Jakub Jelinek
To: Thomas Schwinge
Cc: gcc-patches@gcc.gnu.org, Bernd Schmidt, Nathan Sidwell, Julian Brown
Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points
Message-ID: <20150528150802.GO10247@tucnak.redhat.com>
References: <20150528150635.7bd5db23@octopus> <20150528142011.GN10247@tucnak.redhat.com> <87pp5kg3js.fsf@schwinge.name>
In-Reply-To: <87pp5kg3js.fsf@schwinge.name>

On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
> > I think much better would be to have a function attribute (or cgraph
> > flag) that would be set for functions you want to compile this way
> > (plus a targetm flag that the targets want to support it that way),
> > plus a flag in loop structure for the acc loop vector loops
> > (perhaps the current OpenMP simd loop flags are good enough for that),
> > and lower it somewhere around the vectorization pass or so.
>
> Moving the loop lowering/expansion later is along the same lines as we've
> been thinking. Figuring out how the OpenMP simd implementation works is
> another thing I wanted to look into.

The OpenMP simd expansion is actually a quite simple thing. Basically,
the simd loop is expanded in ompexp as a normal loop, with some flags in
the loop structure (which are pretty much optimization hints). There is
a flag saying that the user would really like this loop vectorized, and
another field saying (based on what the user told the compiler) what
vectorization factor is safe to use regardless of the compiler's own
analysis.
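To make that concrete, a minimal sketch, assuming 2015-era GCC
internals, where those hints are the safelen and force_vectorize fields
of struct loop in gcc/cfgloop.h:

  /* A simd loop with a user-asserted safe vectorization factor
     and a privatized variable.  */
  void
  foo (float *a, float *b, int n)
  {
    float t;
  #pragma omp simd safelen(8) private(t)
    for (int i = 0; i < n; i++)
      {
        t = b[i] * 2.0f;
        a[i] = t + 1.0f;
      }
  }

After ompexp this is again just a normal loop, carrying roughly
loop->safelen = 8 and loop->force_vectorize = true as hints for the
vectorizer; the privatized t is rewritten into the per-simd-lane array
form (indexed via the .GOMP_SIMD_LANE internal function) described in
the next paragraph.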
There are some complications with privatization clauses: some variables
are represented in GIMPLE as arrays with the maximum vectorization
factor of elements, indexed by an internal function (the simd lane).
The vectorizer then either turns those back into scalars (if the loop
isn't vectorized), or vectorizes the loop and, for addressable
variables, keeps them as arrays with the actual vectorization factor of
elements.

I admit I don't know too much about OpenACC, but I'd think doing
something similar (i.e. some loop-structure hint or request that a
particular loop be vectorized, and perhaps something about lexical
forward/backward dependencies in the loop) could work. Then for
XeonPhi or the host fallback you'd just use the normal vectorizer, and
for PTX you could, at about the same point in the pipeline, instead of
vectorizing lower the code so that a single worker thread does
everything except the simd-marked loops, which would be lowered to run
on all threads in the warp.

> Not disagreeing, but: we have to start somewhere. GPU offloading and all
> its peculiarities is still entering unknown territory in GCC; we're still
> learning, and shall try to converge the emerging different
> implementations in the future. Doing the completely generic (agnostic of
> specific offloading device) implementation right now is a challenging
> task, hence the work on a "nvptx-specific prototype" first, to put it
> this way.

I understand it is more work; I'd just like to ask that when designing
stuff for the OpenACC offloading you (plural) try to take the other
offloading devices and the host fallback into account. E.g. XeonPhi is
not hard to understand: it is pretty much just a many-core x86_64 chip,
where offloading means arranging to run something on the other device,
and the emulation mode emulates that quite faithfully by running the
code in a different process. This stuff is already about what happens
in the offloaded code, so the considerations for it are similar to
those for host code (especially hosts that can vectorize).

As far as OpenMP / PTX goes, I'll try to find time for it again soon
(busy with OpenMP 4.1 work so far), but e.g. the above stuff (having a
single thread in a warp do most of the non-vectorized work, and only
using the other threads in the warp for vectorization) is definitely
something OpenMP will benefit from too.

	Jakub