From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-414387-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 23578 invoked by alias); 17 Nov 2015 15:18:57 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 23566 invoked by uid 89); 17 Nov 2015 15:18:57 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-3.3 required=5.0 tests=AWL,BAYES_00,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2
X-HELO: fencepost.gnu.org
Received: from fencepost.gnu.org (HELO fencepost.gnu.org) (208.118.235.10) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 17 Nov 2015 15:18:55 +0000
Received: from eggs.gnu.org ([2001:4830:134:3::10]:38659)	by fencepost.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256)	(Exim 4.82)	(envelope-from <rguenther@suse.de>)	id 1Zyi20-0004TD-UM	for gcc-patches@gnu.org; Tue, 17 Nov 2015 10:18:53 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)	(envelope-from <rguenther@suse.de>)	id 1Zyi1u-0007NR-Nn	for gcc-patches@gnu.org; Tue, 17 Nov 2015 10:18:52 -0500
Received: from mx2.suse.de ([195.135.220.15]:37764)	by eggs.gnu.org with esmtp (Exim 4.71)	(envelope-from <rguenther@suse.de>)	id 1Zyi1u-0007MU-AH	for gcc-patches@gnu.org; Tue, 17 Nov 2015 10:18:46 -0500
Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254])	by mx2.suse.de (Postfix) with ESMTP id 4FA16AAB2;	Tue, 17 Nov 2015 15:18:20 +0000 (UTC)
Date: Tue, 17 Nov 2015 15:18:00 -0000
From: Richard Biener <rguenther@suse.de>
To: Tom de Vries <Tom_deVries@mentor.com>
cc: Richard Biener <richard.guenther@gmail.com>,     "gcc-patches@gnu.org" <gcc-patches@gnu.org>,     Jakub Jelinek <jakub@redhat.com>
Subject: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
In-Reply-To: <564B3F69.50600@mentor.com>
Message-ID: <alpine.LSU.2.11.1511171618210.4884@t29.fhfr.qr>
References: <5640BD31.2060602@mentor.com> <5640FB07.6010008@mentor.com> <alpine.LSU.2.11.1511111159040.4884@t29.fhfr.qr> <5649C41A.40403@mentor.com> <alpine.LSU.2.11.1511161341420.4884@t29.fhfr.qr> <564A64B3.7080305@mentor.com> <CAFiYyc0i3BX=9Ae1w-a=ySUwLbXNLE-TbKLOjqYRGExPJJ_A_Q@mail.gmail.com> <564B3F69.50600@mentor.com>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x (no timestamps) [generic]
X-Received-From: 195.135.220.15
X-SW-Source: 2015-11/txt/msg02103.txt.bz2

On Tue, 17 Nov 2015, Tom de Vries wrote:

> On 17/11/15 11:05, Richard Biener wrote:
> > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com>
> > wrote:
> > > On 16/11/15 13:45, Richard Biener wrote:
> > > > > > 
> > > > > > +             NEXT_PASS (pass_scev_cprop);
> > > > > > > > 
> > > > > > > > What's that for?  It's supposed to help removing loops - I don't
> > > > > > > > expect kernels to vanish.
> > > > > 
> > > > > > 
> > > > > > I'm using pass_scev_cprop for the "final value replacement"
> > > > > > functionality.
> > > > > > Added comment.
> > > 
> > > 
> > > > That functionality is intented to enable loop removal.
> > > 
> > > 
> > > Let me try to explain in a bit more detail.
> > > 
> > > 
> > > I.
> > > 
> > > Consider a parloops testcase test.c, with a use of the final value of the
> > > iteration variable (return i):
> > > ...
> > > unsigned int
> > > foo (int n, int *a)
> > > {
> > >    int i;
> > >    for (i = 0; i < n; ++i)
> > >      a[i] = 1;
> > > 
> > >    return i;
> > > }
> > > ...
> > > 
> > > Say we compile with:
> > > ...
> > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > > ...
> > > 
> > > We can see here in the parloops dump-file that the loop was parallelized:
> > > ...
> > >    SUCCESS: may be parallelized
> > > ...
> > > 
> > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> > > in the parloops dump-file:
> > > ...
> > > phi is i_1 = PHI <i_10(4)>
> > > arg of phi to exit:   value i_10 used outside loop
> > >    checking if it a part of reduction pattern:
> > >    FAILED: it is not a part of reduction.
> > > ...
> > > 
> > > Auto-parallelization fails in this case because there is a loop exit phi
> > > (the one in bb 6 defining i_1) which is not part of a reduction:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    # i_1 = PHI <i_10(4)>
> > >    _20 = (unsigned int) i_1;
> > > ...
> > > 
> > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> > > ...
> > > final value replacement:
> > >    i_1 = PHI <i_10(4)>
> > >    with
> > >    i_1 = n_4(D);
> > > ...
> > > 
> > > And the resulting loop no longer has any loop exit phis, so
> > > auto-parallelization succeeds:
> > > ...
> > >    <bb 4>:
> > >    # i_13 = PHI <0(3), i_10(5)>
> > >    _5 = (long unsigned int) i_13;
> > >    _6 = _5 * 4;
> > >    _8 = a_7(D) + _6;
> > >    *_8 = 1;
> > >    i_10 = i_13 + 1;
> > >    if (n_4(D) > i_10)
> > >      goto <bb 5>;
> > >    else
> > >      goto <bb 6>;
> > > 
> > >    <bb 5>:
> > >    goto <bb 4>;
> > > 
> > >    <bb 6>:
> > >    _20 = (unsigned int) n_4(D);
> > > ...
> > > 
> > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument
> > > defined
> > > outside loop", for a slightly different testcase where despite the final
> > > value replacement autopar still fails. ]
> > > 
> > > 
> > > II.
> > > 
> > > Now, back to oacc kernels.
> > > 
> > > Consider test-case kernels-loop-n.f95 (will add this one to the
> > > test-cases):
> > > ...
> > > module test
> > > contains
> > >    subroutine foo(n)
> > >      implicit none
> > >      integer :: n
> > >      integer, dimension (0:n-1) :: a, b, c
> > >      integer                    :: i, ii
> > >      do i = 0, n - 1
> > >         a(i) = i * 2
> > >      end do
> > > 
> > >      do i = 0, n -1
> > >         b(i) = i * 4
> > >      end do
> > > 
> > >      !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> > >      do ii = 0, n - 1
> > >         c(ii) = a(ii) + b(ii)
> > >      end do
> > >      !$acc end kernels
> > > 
> > >      do i = 0, n - 1
> > >         if (c(i) .ne. a(i) + b(i)) call abort
> > >      end do
> > > 
> > >    end subroutine foo
> > > end module test
> > > ...
> > > 
> > > The loop at the start of the kernels pass group contains an in-memory
> > > iteration variable, with a store to '*_9 = _38'.
> > > ...
> > >    <bb 4>:
> > >    _13 = *.omp_data_i_4(D).c;
> > >    c.21_14 = *_13;
> > >    _16 = *_9;
> > >    _17 = (integer(kind=8)) _16;
> > >    _18 = *.omp_data_i_4(D).a;
> > >    a.22_19 = *_18;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _24 = *.omp_data_i_4(D).b;
> > >    b.23_25 = *_24;
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = _16 + 1;
> > >    *_9 = _38;
> > >    if (_8 == _16)
> > >      goto <bb 3>;
> > >    else
> > >      goto <bb 4>;
> > > ...
> > > 
> > > After pass_lim/pass_copy_prop, we've rewritten that into using a local
> > > iteration variable, but we've generated a read of the final value of the
> > > iteration variable outside the loop, which means auto-parallelization will
> > > fail:
> > > ...
> > >    <bb 5>:
> > >    # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
> > >    _17 = (integer(kind=8)) D__lsm.29_12;
> > >    _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > >    _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > >    _30 = _23 + _29;
> > >    MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > >    _38 = D__lsm.29_12 + 1;
> > >    if (_8 == D__lsm.29_12)
> > >      goto <bb 6>;
> > >    else
> > >      goto <bb 7>;
> > > 
> > >    <bb 6>:
> > >    # D__lsm.29_27 = PHI <_38(5)>
> > >    *_9 = D__lsm.29_27;
> > >    goto <bb 3>;
> > 
> > So this store is not actually necessary?
> 
> a.
> In the case of this example, the store is dead.
> 
> There is a corresponding load at the point that we split off the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   D.3635 = .omp_data_arr.25.ii;
>   ii = *D.3635;
> ...
> 
> This load is later removed, given that ii is unused after the region. But once
> the region is split off,  there's nothing in the context of the store to
> suggest that it's dead.
> 
> And to get rid of the load of ii before the region is split off, we would have
> to implement some sort of liveness analysis on pre-ssa code.
> 
> b.
> There's the case where there is an explicit use of ii after the region, in
> which case the store is not dead.
> 
> c.
> And there's the case were we use a data clause on the region, f.i. 'create
> (ii)' to indicate that the variable is neither copied in nor copied out of the
> region (the default for a scalar in a kernels region is 'copy', meaning
> copy-in-and-out).
> 
> [ This means the value of ii after the region is uninitialized. So even if
> there's a read from ii after the region, we cannot consider it connected to
> the store, given that the value written by the store on the accelerator will
> not be copied back to the host. ]
> 
> In this case, we already don't have any load of ii after the region:
> ...
>   <bb 9>:
>   #pragma omp return
> 
>   <bb 10>:
>   .omp_data_sizes.28 = {CLOBBER};
>   .omp_data_arr.27 = {CLOBBER};
> ...
> 
> We could insert clobbers for the bits of .omp_data_arr at the end of the
> region to indicate that those are not used. That might enable dse to get rid
> of the dead store.
> 
> 
> But, I think we want a generic solution that handles cases a, b and c, which
> means we have to solve the most difficult case, which is b, where the store is
> not dead.
> 
> >  Or just in an inconvenient place?
> 
> I don't think the place of the store is inconvenient, it would be worse to
> have the store in the loop.
> 
> What is inconvenient about the store is the fact that it reads the final value
> of the iteration variable (which inhibits parloops).
> 
> > >    <bb 7>:
> > >    goto <bb 5>;
> > > ...
> > > 
> > > This makes it similar to the parloops example above, and that's why I've
> > > added pass_scev_cprop in the kernels pass group.
> > > 
> > > [ And for some kernels test-cases with constant loop bound, it's not the
> > > final value replacement bit that does the substitution, but the first bit
> > > in
> > > scev_const_prop using resolve_mixers. So that's a related reason to use
> > > pass_scev_cprop. ]
> > 
> > IMHO autopar needs to handle induction itself.
> 
> I'm not sure what you mean. Could you elaborate?  Autopar handles induction
> variables, but it doesn't handle exit phis reading the final value of the
> induction variable. Is that what you want fixed? How?

Yes.  Perform final value replacement.

> > And the above LIM example
> > is none for why you need two LIM passes...
> 
> Indeed. I'm planning a separate reply to explain in more detail the need for
> the two pass_lims.

Thanks.