From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-421149-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 117715 invoked by alias); 10 Feb 2016 16:23:40 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
From: Thomas Schwinge <thomas@codesourcery.com>
To: Bernd Schmidt <bschmidt@redhat.com>, Jakub Jelinek <jakub@redhat.com>
CC: <gcc-patches@gcc.gnu.org>, Tom de Vries <vries@codesourcery.com>
Subject: Re: Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
In-Reply-To: <56BB56EC.90707@redhat.com>
References: <87r3hac1w9.fsf@hertz.schwinge.homeip.net> <569D2059.4010105@mentor.com> <87d1subnu5.fsf@hertz.schwinge.homeip.net> <87a8nyawph.fsf@hertz.schwinge.homeip.net> <20160122083625.GL3017@tucnak.redhat.com> <56A22C2E.6000408@redhat.com> <20160122132538.GT3017@tucnak.redhat.com> <56A22F37.5010505@redhat.com> <87zivg8rcy.fsf@hertz.schwinge.homeip.net> <87h9hg9450.fsf@hertz.schwinge.homeip.net> <56BB3A5E.6000506@redhat.com> <87d1s48w97.fsf@hertz.schwinge.homeip.net> <56BB56EC.90707@redhat.com>
User-Agent: Notmuch/0.9-101-g81dad07 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)
Date: Wed, 10 Feb 2016 16:23:00 -0000
Message-ID: <8737t08rgi.fsf@hertz.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2016-02/txt/msg00700.txt.bz2

Hi!

On Wed, 10 Feb 2016 16:27:40 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 03:39 PM, Thomas Schwinge wrote:
>
> > Yes, we need a hammer that big: we have to ensure consistency between
> > data regions on the device and code offloading to the device, as
> > otherwise we'll very easily run into inconsistencies, because of the
> > non-shared memory.  In the general case, it's "all or nothing": you
> > either have to offload all kernels or none of them.
>
> That's unfortunately not the impression I got from the earlier
> discussion

:-(

> and this seems to imply that one unprofitable kernel would
> disable all the others

Correct.

> - IMO this is not acceptable.

Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
constructs' code offloaded; the user wants his code to execute as fast as
possible.

If you consider the whole of OpenACC kernels code offloading as a
compiler optimization, then it's fine for GCC to abort this
"optimization" if it's reasonably clear that this transformation (code
offloading) will not be profitable -- just like what GCC does with other
possible code optimizations/transformations.  As I've said before,
profiling the execution times of several real-world codes has shown that
under the assumption that parloops fails to parallelize one kernel (one
out of possibly many), this one kernel has always been a "hot spot", and
avoiding offloading in this case has always helped prevent performance
degradation below host-fallback performance.
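
A minimal sketch of the all-or-nothing decision described above, for
illustration only.  This is not the code from the "avoid offloading" patch:
`struct kernel_info` and `should_offload` are hypothetical names, and the
`parallelized` flag merely stands in for whatever parloops reports for each
kernels construct.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-kernel record; not an actual GCC data structure. */
struct kernel_info {
    const char *name;   /* label for diagnostics only */
    bool parallelized;  /* true if the parallelization analysis succeeded */
};

/* All-or-nothing policy: offload only if every kernel was parallelized.
   A single un-parallelized kernel disables offloading for all of them,
   so that device data regions and offloaded code stay consistent. */
bool should_offload(const struct kernel_info *kernels, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!kernels[i].parallelized)
            return false;
    }
    return true;
}
```

With this sketch, a program whose kernels are all parallelized is offloaded,
while a program with even one single-threaded kernel falls back to host
execution as a whole.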

It's of course unfortunate that we have to disable our offloading
machinery for a lot of codes using OpenACC kernels, but given the current
state of OpenACC kernels parallelization analysis (parloops), doing so is
still profitable for a user, compared to regressed performance with
single-threaded offloaded execution.

Of course...

> There need to be
> more compiler smarts to figure out whether a kernel is a valid candidate
> for skipping the offloading.

... that would be better, obviously.  But, I suggest we work on that
incrementally, after fixing the performance regression with my "avoid
offloading" patch.

I have difficulty coming up with an algorithm/parametrization to have
the compiler/runtime decide whether offloading will be profitable given
input parameters such as a ratio of parallelized/single-threaded kernels.
So I'm all ears for suggestions in that regard.  Consider: if we encounter
a single-threaded kernel, the compiler (parloops) has just given up
"understanding" the user's code.  And again, implementing such heuristics
to me sounds like incremental follow-up projects, quite possibly in
combination with generally improving OpenACC kernels handling/parloops.


Grüße
 Thomas