From: Thomas Schwinge
To: Bernd Schmidt, Jakub Jelinek
CC: Tom de Vries
Subject: Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
Date: Wed, 10 Feb 2016 17:39:00 -0000
Message-ID: <87y4as79fw.fsf@hertz.schwinge.homeip.net>
In-Reply-To: <56BB674A.7050401@redhat.com>

Hi!

On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt wrote:
> On 02/10/2016 05:23 PM, Thomas Schwinge wrote:
> > Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
> > constructs' code offloaded; the user wants his code to execute as fast as
> > possible.
> >
> > If you consider the whole of OpenACC kernels code offloading as a
> > compiler optimization, then it's fine for GCC to abort this
> > "optimization" if it's reasonably clear that this transformation (code
> > offloading) will not be profitable -- just like what GCC does with other
> > possible code optimizations/transformations.
>
> Yes, but if a single kernel (which might not even get executed at
> run-time) can inhibit offloading for the whole program, then we're not
> making an intelligent decision, and IMO violating user expectations.

Sure, I agree it's a pretty "rough-grained" decision.  (Owed to the
non-shared-memory offloading architecture -- shared-memory offloading
indeed can make such decisions case by case.)

> IIUC it's also disabling offloading for parallels rather than just
> kernels, which we previously said shouldn't happen.

Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
understood the earlier discussion to apply to parallel-only codes, where
the "avoid offloading" flag will never be set.  In mixed parallel/kernels
code with one un-parallelized kernels construct, offloading would also
(have to be) disabled for the parallel constructs (for the same data
consistency reasons explained before).  The majority of codes I've seen
use either parallel or kernels constructs, typically not both.
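
The whole-program, all-or-nothing nature of the decision described above
can be sketched in a few lines of C.  This is only an illustration of
the policy as stated in this thread, not GCC's actual implementation:
the type `struct offload_region`, the enum `construct_kind` and the
function `avoid_offloading` are hypothetical names invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the offloaded regions in one program.
   None of these names exist in GCC; they only illustrate the policy.  */
enum construct_kind { CONSTRUCT_PARALLEL, CONSTRUCT_KERNELS };

struct offload_region {
    enum construct_kind kind;
    bool parallelized;  /* Only meaningful for kernels constructs.  */
};

/* Return true if offloading should be avoided for the whole program:
   a single un-parallelized kernels construct is enough.  Parallel
   constructs never set the flag themselves, but they are affected by
   it in mixed parallel/kernels code.  */
bool avoid_offloading(const struct offload_region *regions, size_t count)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (regions[i].kind == CONSTRUCT_KERNELS && !regions[i].parallelized)
            return true;
    }
    return false;
}
```

In this model a parallel-only program never triggers the flag, while a
mixed program with one un-parallelized kernels construct falls back to
the host as a whole, including its parallel constructs.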

> > As I've said before,
> > profiling the execution times of several real-world codes has shown that
> > under the assumption that parloops fails to parallelize one kernel (one
> > out of possibly many), this one kernel has always been a "hot spot", and
> > avoiding offloading in this case has always helped prevent performance
> > degradation below host-fallback performance.
>
> IMO a warning for the specific kernel that's problematic would be better

That's something Tom suggested, , and which motivated my patch, in going
one step further:

> so that users can selectively apply -fopenacc to files where it is
> profitable.

This puts it into the hands of the user to selectively mark kernels
constructs as suitable for GCC's current parloops processing (for
example, by disabling OpenACC/offloading on a per-file basis) -- which is
something we wanted to avoid, given the idea that in the future, GCC will
improve, and will be able to handle kernels constructs better, and the
user would then have to re-visit/un-do their earlier changes with each
GCC release, instead of just recompiling their code.

> > It's of course unfortunate that we have to disable our offloading
> > machinery for a lot of codes using OpenACC kernels, but given the current
> > state of OpenACC kernels parallelization analysis (parloops), doing so is
> > still profitable for a user, compared to regressed performance with
> > single-threaded offloaded execution.
>
> How often does this occur on real-world code?

Quite a lot for code using the kernels construct, as discussed before,
given that parloops fails to handle a lot of constructs in real-world
code.

> Will we end up supporting
> OpenACC by not doing offloading at all in the usual case?

This whole discussion does not at all apply to the body of OpenACC code
using the parallel instead of the kernels construct, which will be
parallelized/offloaded just fine.
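
For readers less familiar with the two constructs being contrasted, here
is a minimal sketch in C of the same loop written once with the kernels
construct and once with the parallel construct.  The function names
`scale_add_kernels` and `scale_add_parallel` are made up for this
example, and nothing here claims that any particular GCC release does or
does not parallelize this specific loop.  With kernels, the compiler has
to prove on its own (in GCC, via parloops) that the loop can be run in
parallel; with parallel loop, the programmer asserts it.

```c
#include <assert.h>

/* kernels construct: the compiler's own analysis decides whether the
   loop inside the region can be parallelized.  */
void scale_add_kernels(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}

/* parallel construct: the programmer states that the loop iterations
   are independent, so no such analysis is needed.  */
void scale_add_parallel(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiled without -fopenacc, the pragmas are ignored and both functions
run as ordinary host code, which is also what host fallback amounts to
from the user's point of view.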

> The way you
> describe it, it sounds like we should recommend that -fopenacc not be
> used in gcc-6 and restore the previous invoke.texi language that marks
> it as experimental.

Huh?  Like, at random, discouraging users from using GCC's SIMD
vectorizer just because that one fails to vectorize some code that it
could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
vectorizer is much more mature than the OpenACC kernels/parloops
handling; it's seen many more years of development.)

Certainly we should document that there is still a lot of room for
improvement in OpenACC kernels handling (just like it's the case for a
lot of other generic compiler optimizations) -- and we're doing exactly
that on .  I don't follow how that translates to discouraging use of
-fopenacc however?


Grüße
 Thomas