Date: Fri, 19 Jun 2015 14:10:00 -0000
From: Jakub Jelinek
To: Bernd Schmidt
Cc: Thomas Schwinge, gcc-patches@gcc.gnu.org, Nathan Sidwell, Julian Brown
Subject: Re: [gomp4] Preserve NVPTX "reconvergence" points
Message-ID: <20150619134529.GP10247@tucnak.redhat.com>
References: <20150528150635.7bd5db23@octopus> <20150528142011.GN10247@tucnak.redhat.com> <87pp5kg3js.fsf@schwinge.name> <20150528150802.GO10247@tucnak.redhat.com> <5583E68A.9020608@codesourcery.com> <20150619122557.GO10247@tucnak.redhat.com> <5584132A.6080108@codesourcery.com>
In-Reply-To: <5584132A.6080108@codesourcery.com>

On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> >they are also very much OpenMP or OpenACC specific, rather than
> >representing language-neutral behavior, so there is a problem that
> >you'd need M x N different expansions of those constructs, which is
> >not really maintainable (M being the number of supported offloading
> >standards, right now 2, and N the number of different offloading
> >devices (host, XeonPhi, PTX, HSA, ...)).
>
> Well, that's a problem we have anyway, independent of how we implement
> all these devices and standards.  I don't see how that's relevant to
> the discussion.

It is relevant, because if you lower early (omplower/ompexp) into some
IL form common to all the offloading standards, then it is M + N.
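To make the M + N arithmetic concrete, here is a minimal sketch in C
(an editor's illustration, not GCC source; the struct and field names
below are loosely modeled on GCC's struct loop and are otherwise
assumptions):

    /* Two different inputs, one language-neutral mid-end form:
     *
     *   OpenACC:  #pragma acc loop vector    OpenMP:  #pragma omp simd
     *             for (i = 0; i < n; i++)             for (i = 0; i < n; i++)
     *               a[i] += b[i];                       a[i] += b[i];
     *
     * Both frontends would record the same annotations on the loop:  */
    struct loop_hints
    {
      int safelen;          /* iterations that may safely run concurrently */
      int force_vectorize;  /* vectorize even if the cost model declines */
      unsigned simduid;     /* key identifying the loop's "omp simd arrays" */
    };
    /* M frontends each emit this form once; N device-specific passes
       (host vectorizer, PTX predication/lowering, ...) each consume it
       once: M + N lowerings instead of M x N.  */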
> >I wonder why struct loop flags and other info, together with function
> >attributes and/or cgraph flags and other info, aren't sufficient for
> >the OpenACC needs.
> >Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+
> >simd?  Why can't the execution model (normal, vector-single and
> >worker-single) simply be attributes on functions or cgraph node flags,
> >and the kind of #acc loop simply flags on struct loop, as OpenMP simd
> >/ Cilk+ simd already is?
>
> We haven't looked at Cilk+ or anything like that.  You suggest using
> attributes and flags, but at what point do you intend to lower the IR
> to actually represent what's going on?

I think around where the vectorizer is, perhaps before the loop
optimization pass queue (or after it; some investigation is needed).

> >The vector-level parallelism is something where on the
> >host/host_noshm/XeonPhi (dunno about HSA) you want vectorization to
> >happen, and that is already implemented in the vectorizer pass;
> >implementing it again elsewhere is highly undesirable.  For PTX the
> >implementation is of course different, and the vectorizer is likely
> >not the right pass to handle it, but why can't the same struct loop
> >flags be used by the pass that handles the conditionalization of
> >execution for 2 of the 3 modes above?
>
> Agreed on wanting the vectorizer to handle things for "normal"
> machines; that is one of the motivations for pushing the lowering past
> the offload LTO writeout stage.  The problem with OpenACC on GPUs is
> that the predication really changes the CFG and the data flow - I fear
> unpredictable effects if we let any optimizers run before lowering
> OpenACC to the point where we actually represent what's going on in
> the function.

I actually believe having some optimization passes between ompexp and
the lowering of the IR into the form PTX wants is highly desirable; the
form with the worker-single or vector-single mode lowered will contain
too complex a CFG for many optimizations to be really effective,
especially if it uses abnormal edges.  E.g. inlining would supposedly
have a harder job, etc.

What exact unpredictable effects do you fear?  If the loop remains in
the IL (isn't optimized away as unreachable, or isn't removed, e.g. as
a non-loop - say, if it contains a noreturn call), the flags on struct
loop should still be there.

For the loop clauses (reduction always, and private/lastprivate if
addressable, etc.) of OpenMP simd / Cilk+ simd we use special arrays
indexed by internal functions.  During vectorization these are shrunk
(but in theory could be expanded too) to the right vectorization factor
if the loop is vectorized, with the accesses within the loop vectorized
using SIMD; if not vectorized, they are shrunk to 1 element.  So the
PTX IL lowering pass could use the same arrays ("omp simd array"
attribute) to transform the decls into thread-local vars, as opposed to
vars shared by the whole CTA (a sketch of this transformation follows
below).

	Jakub
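To make the "omp simd array" mechanism described above concrete, here
is a minimal sketch in plain C (an illustration under assumptions: GCC
uses internal functions for the lane index and vectorization factor,
which are open-coded here as i % VF and a placeholder VF; the function
and variable names are hypothetical):

    /* Original loop: */
    float sum_reduce (const float *a, int n)
    {
      float sum = 0.f;
    #pragma omp simd reduction(+:sum)
      for (int i = 0; i < n; i++)
        sum += a[i];
      return sum;
    }

    /* Conceptual lowered form: the reduction variable becomes an array
       with one element per simd lane.  The vectorizer later shrinks VF
       to the real vectorization factor (or to 1 if the loop is not
       vectorized); a PTX lowering could instead give each CUDA thread
       of the CTA its own thread-local element.  */
    #define VF 8                     /* placeholder vectorization factor */
    float sum_reduce_lowered (const float *a, int n)
    {
      float sum_arr[VF] = { 0.f };   /* the "omp simd array" */
      for (int i = 0; i < n; i++)
        sum_arr[i % VF] += a[i];     /* i % VF stands in for the lane index */
      float sum = 0.f;
      for (int l = 0; l < VF; l++)   /* final reduction across the lanes */
        sum += sum_arr[l];
      return sum;
    }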