From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-421159-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 20231 invoked by alias); 10 Feb 2016 17:39:22 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 20215 invoked by uid 89); 10 Feb 2016 17:39:21 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.9 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=owed, hands, translates, sharedmemory
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 10 Feb 2016 17:39:19 +0000
Received: from nat-ies.mentorg.com ([192.94.31.2] helo=SVR-IES-FEM-01.mgc.mentorg.com)	by relay1.mentorg.com with esmtp 	id 1aTYjU-0001hr-69 from Thomas_Schwinge@mentor.com ; Wed, 10 Feb 2016 09:39:16 -0800
Received: from hertz.schwinge.homeip.net (137.202.0.76) by SVR-IES-FEM-01.mgc.mentorg.com (137.202.0.104) with Microsoft SMTP Server id 14.3.224.2; Wed, 10 Feb 2016 17:37:56 +0000
From: Thomas Schwinge <thomas@codesourcery.com>
To: Bernd Schmidt <bschmidt@redhat.com>, Jakub Jelinek <jakub@redhat.com>
CC: <gcc-patches@gcc.gnu.org>, Tom de Vries <vries@codesourcery.com>
Subject: Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
In-Reply-To: <56BB674A.7050401@redhat.com>
References: <87r3hac1w9.fsf@hertz.schwinge.homeip.net> <569D2059.4010105@mentor.com> <87d1subnu5.fsf@hertz.schwinge.homeip.net> <87a8nyawph.fsf@hertz.schwinge.homeip.net> <20160122083625.GL3017@tucnak.redhat.com> <56A22C2E.6000408@redhat.com> <20160122132538.GT3017@tucnak.redhat.com> <56A22F37.5010505@redhat.com> <87zivg8rcy.fsf@hertz.schwinge.homeip.net> <87h9hg9450.fsf@hertz.schwinge.homeip.net> <56BB3A5E.6000506@redhat.com> <87d1s48w97.fsf@hertz.schwinge.homeip.net> <56BB56EC.90707@redhat.com> <8737t08rgi.fsf@hertz.schwinge.homeip.net> <56BB674A.7050401@redhat.com>
User-Agent: Notmuch/0.9-101-g81dad07 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)
Date: Wed, 10 Feb 2016 17:39:00 -0000
Message-ID: <87y4as79fw.fsf@hertz.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2016-02/txt/msg00710.txt.bz2

Hi!

On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 05:23 PM, Thomas Schwinge wrote:
> > Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
> > constructs' code offloaded; the user wants his code to execute as fast as
> > possible.
> >
> > If you consider the whole of OpenACC kernels code offloading as a
> > compiler optimization, then it's fine for GCC to abort this
> > "optimization" if it's reasonably clear that this transformation (code
> > offloading) will not be profitable -- just like what GCC does with other
> > possible code optimizations/transformations.
>
> Yes, but if a single kernel (which might not even get executed at
> run-time) can inhibit offloading for the whole program, then we're not
> making an intelligent decision, and IMO violating user expectations.

Sure, I agree it's a pretty coarse-grained decision.  (Owing to the
non-shared-memory offloading architecture -- shared-memory offloading
indeed can make such decisions case by case.)

> IIUC it's also disabling offloading for parallels rather than just
> kernels, which we previously said shouldn't happen.

Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
understood the earlier discussion to apply to parallel-only codes, where
the "avoid offloading" flag will never be set.  In mixed parallel/kernels
code with one un-parallelized kernels construct, offloading would also
(have to) be disabled for the parallel constructs (for the same data
consistency reasons explained before).  The majority of codes I've seen
use either parallel or kernels constructs, typically not both.
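
To make the coarse-grained behaviour concrete, here is a minimal sketch
of the decision as described above.  It is only a model: `struct region`,
`enum construct_kind` and `avoid_offloading` are hypothetical names for
illustration, not GCC data structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one OpenACC compute construct; not a GCC type.  */
enum construct_kind { ACC_PARALLEL, ACC_KERNELS };

struct region
{
  enum construct_kind kind;
  bool parallelized;  /* For kernels: did the parallelizer succeed?  */
};

/* Return true if the whole program should fall back to host execution.
   A single un-parallelized kernels construct is enough; parallel
   constructs never set the flag themselves.  */
bool
avoid_offloading (const struct region *regions, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (regions[i].kind == ACC_KERNELS && !regions[i].parallelized)
      return true;
  return false;
}
```

In this model a parallel-only program never triggers the fallback, while
a mixed program with one un-parallelized kernels construct does, and the
fallback then covers its parallel constructs too.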

> > As I've said before,
> > profiling the execution times of several real-world codes has shown that
> > under the assumption that parloops fails to parallelize one kernel (one
> > out of possibly many), this one kernel has always been a "hot spot", and
> > avoiding offloading in this case has always helped prevent performance
> > degradation below host-fallback performance.
>
> IMO a warning for the specific kernel that's problematic would be better

That's something Tom suggested,
<http://news.gmane.org/find-root.php?message_id=%3C569D2059.4010105%40mentor.com%3E>,
and which motivated my patch, in going one step further:

> so that users can selectively apply -fopenacc to files where it is
> profitable.

This puts it into the hands of the user to selectively mark kernels
constructs as suitable for GCC's current parloops processing (for
example, by disabling OpenACC/offloading on a per-file basis) -- which is
something we wanted to avoid, given the idea that in the future, GCC will
improve, and will be able to handle kernels constructs better, and the
user would then have to re-visit/un-do their earlier changes with each
GCC release, instead of just recompiling their code.

> > It's of course unfortunate that we have to disable our offloading
> > machinery for a lot of codes using OpenACC kernels, but given the current
> > state of OpenACC kernels parallelization analysis (parloops), doing so is
> > still profitable for a user, compared to regressed performance with
> > single-threaded offloaded execution.
>
> How often does this occur on real-world code?

Quite a lot for code using the kernels construct, as discussed before,
given that parloops fails to handle a lot of constructs in real-world
code.

> Will we end up supporting
> OpenACC by not doing offloading at all in the usual case?

This whole discussion does not at all apply to the body of OpenACC code
using the parallel instead of the kernels construct, which will be
parallelized/offloaded just fine.
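
For illustration, a minimal sketch of the two styles (the function names
and data clauses are my own example, not taken from any of the profiled
codes).  With the parallel construct the programmer states that the loop
iterations are independent; with the kernels construct the compiler has
to establish that itself, which is where parloops comes in.

```c
/* parallel construct: the loop is parallelized as written.  */
void
saxpy_parallel (int n, float a, const float *x, float *y)
{
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

/* kernels construct: the compiler must prove that the loop is
   parallelizable before it can distribute the iterations.  */
void
saxpy_kernels (int n, float a, const float *x, float *y)
{
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; built without -fopenacc, the
pragmas are ignored and the loops simply run on the host.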

> The way you
> describe it, it sounds like we should recommend that -fopenacc not be
> used in gcc-6 and restore the previous invoke.texi language that marks
> it as experimental.

Huh?  Like, at random, discouraging users from using GCC's SIMD
vectorizer just because that one fails to vectorize some code that it
could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
vectorizer is much more mature than the OpenACC kernels/parloops
handling; it's seen many more years of development.)

Certainly we should document that there is still a lot of room for
improvement in OpenACC kernels handling (just like it's the case for a
lot of other generic compiler optimizations) -- and we're doing exactly
that on <https://gcc.gnu.org/wiki/OpenACC>.  I don't follow how that
translates to discouraging use of -fopenacc however?


Grüße
 Thomas