From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 23578 invoked by alias); 17 Nov 2015 15:18:57 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 23566 invoked by uid 89); 17 Nov 2015 15:18:57 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-3.3 required=5.0 tests=AWL,BAYES_00,RP_MATCHES_RCVD,SPF_PASS autolearn=ham version=3.3.2 X-HELO: fencepost.gnu.org Received: from fencepost.gnu.org (HELO fencepost.gnu.org) (208.118.235.10) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Tue, 17 Nov 2015 15:18:55 +0000 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38659) by fencepost.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1Zyi20-0004TD-UM for gcc-patches@gnu.org; Tue, 17 Nov 2015 10:18:53 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Zyi1u-0007NR-Nn for gcc-patches@gnu.org; Tue, 17 Nov 2015 10:18:52 -0500 Received: from mx2.suse.de ([195.135.220.15]:37764) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Zyi1u-0007MU-AH for gcc-patches@gnu.org; Tue, 17 Nov 2015 10:18:46 -0500 Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 4FA16AAB2; Tue, 17 Nov 2015 15:18:20 +0000 (UTC) Date: Tue, 17 Nov 2015 15:18:00 -0000 From: Richard Biener To: Tom de Vries cc: Richard Biener , "gcc-patches@gnu.org" , Jakub Jelinek Subject: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def In-Reply-To: <564B3F69.50600@mentor.com> Message-ID: References: <5640BD31.2060602@mentor.com> <5640FB07.6010008@mentor.com> <5649C41A.40403@mentor.com> <564A64B3.7080305@mentor.com> <564B3F69.50600@mentor.com> User-Agent: Alpine 2.11 (LSU 23 2013-08-11) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x (no timestamps) [generic] X-Received-From: 195.135.220.15 X-SW-Source: 2015-11/txt/msg02103.txt.bz2 On Tue, 17 Nov 2015, Tom de Vries wrote: > On 17/11/15 11:05, Richard Biener wrote: > > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries > > wrote: > > > On 16/11/15 13:45, Richard Biener wrote: > > > > > > > > > > > > + NEXT_PASS (pass_scev_cprop); > > > > > > > > > > > > > > > > What's that for? It's supposed to help removing loops - I don't > > > > > > > > expect kernels to vanish. > > > > > > > > > > > > > > > > > I'm using pass_scev_cprop for the "final value replacement" > > > > > > functionality. > > > > > > Added comment. > > > > > > > > > > That functionality is intented to enable loop removal. > > > > > > > > > Let me try to explain in a bit more detail. > > > > > > > > > I. > > > > > > Consider a parloops testcase test.c, with a use of the final value of the > > > iteration variable (return i): > > > ... > > > unsigned int > > > foo (int n, int *a) > > > { > > > int i; > > > for (i = 0; i < n; ++i) > > > a[i] = 1; > > > > > > return i; > > > } > > > ... > > > > > > Say we compile with: > > > ... > > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details > > > ... > > > > > > We can see here in the parloops dump-file that the loop was parallelized: > > > ... > > > SUCCESS: may be parallelized > > > ... > > > > > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find > > > in the parloops dump-file: > > > ... > > > phi is i_1 = PHI > > > arg of phi to exit: value i_10 used outside loop > > > checking if it a part of reduction pattern: > > > FAILED: it is not a part of reduction. > > > ... > > > > > > Auto-parallelization fails in this case because there is a loop exit phi > > > (the one in bb 6 defining i_1) which is not part of a reduction: > > > ... > > > : > > > # i_13 = PHI <0(3), i_10(5)> > > > _5 = (long unsigned int) i_13; > > > _6 = _5 * 4; > > > _8 = a_7(D) + _6; > > > *_8 = 1; > > > i_10 = i_13 + 1; > > > if (n_4(D) > i_10) > > > goto ; > > > else > > > goto ; > > > > > > : > > > goto ; > > > > > > : > > > # i_1 = PHI > > > _20 = (unsigned int) i_1; > > > ... > > > > > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file: > > > ... > > > final value replacement: > > > i_1 = PHI > > > with > > > i_1 = n_4(D); > > > ... > > > > > > And the resulting loop no longer has any loop exit phis, so > > > auto-parallelization succeeds: > > > ... > > > : > > > # i_13 = PHI <0(3), i_10(5)> > > > _5 = (long unsigned int) i_13; > > > _6 = _5 * 4; > > > _8 = a_7(D) + _6; > > > *_8 = 1; > > > i_10 = i_13 + 1; > > > if (n_4(D) > i_10) > > > goto ; > > > else > > > goto ; > > > > > > : > > > goto ; > > > > > > : > > > _20 = (unsigned int) n_4(D); > > > ... > > > > > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument > > > defined > > > outside loop", for a slightly different testcase where despite the final > > > value replacement autopar still fails. ] > > > > > > > > > II. > > > > > > Now, back to oacc kernels. > > > > > > Consider test-case kernels-loop-n.f95 (will add this one to the > > > test-cases): > > > ... > > > module test > > > contains > > > subroutine foo(n) > > > implicit none > > > integer :: n > > > integer, dimension (0:n-1) :: a, b, c > > > integer :: i, ii > > > do i = 0, n - 1 > > > a(i) = i * 2 > > > end do > > > > > > do i = 0, n -1 > > > b(i) = i * 4 > > > end do > > > > > > !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1)) > > > do ii = 0, n - 1 > > > c(ii) = a(ii) + b(ii) > > > end do > > > !$acc end kernels > > > > > > do i = 0, n - 1 > > > if (c(i) .ne. a(i) + b(i)) call abort > > > end do > > > > > > end subroutine foo > > > end module test > > > ... > > > > > > The loop at the start of the kernels pass group contains an in-memory > > > iteration variable, with a store to '*_9 = _38'. > > > ... > > > : > > > _13 = *.omp_data_i_4(D).c; > > > c.21_14 = *_13; > > > _16 = *_9; > > > _17 = (integer(kind=8)) _16; > > > _18 = *.omp_data_i_4(D).a; > > > a.22_19 = *_18; > > > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > > > _24 = *.omp_data_i_4(D).b; > > > b.23_25 = *_24; > > > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > > > _30 = _23 + _29; > > > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > > > _38 = _16 + 1; > > > *_9 = _38; > > > if (_8 == _16) > > > goto ; > > > else > > > goto ; > > > ... > > > > > > After pass_lim/pass_copy_prop, we've rewritten that into using a local > > > iteration variable, but we've generated a read of the final value of the > > > iteration variable outside the loop, which means auto-parallelization will > > > fail: > > > ... > > > : > > > # D__lsm.29_12 = PHI > > > _17 = (integer(kind=8)) D__lsm.29_12; > > > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17]; > > > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17]; > > > _30 = _23 + _29; > > > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30; > > > _38 = D__lsm.29_12 + 1; > > > if (_8 == D__lsm.29_12) > > > goto ; > > > else > > > goto ; > > > > > > : > > > # D__lsm.29_27 = PHI <_38(5)> > > > *_9 = D__lsm.29_27; > > > goto ; > > > > So this store is not actually necessary? > > a. > In the case of this example, the store is dead. > > There is a corresponding load at the point that we split off the region: > ... > : > #pragma omp return > > : > D.3635 = .omp_data_arr.25.ii; > ii = *D.3635; > ... > > This load is later removed, given that ii is unused after the region. But once > the region is split off, there's nothing in the context of the store to > suggest that it's dead. > > And to get rid of the load of ii before the region is split off, we would have > to implement some sort of liveness analysis on pre-ssa code. > > b. > There's the case where there is an explicit use of ii after the region, in > which case the store is not dead. > > c. > And there's the case were we use a data clause on the region, f.i. 'create > (ii)' to indicate that the variable is neither copied in nor copied out of the > region (the default for a scalar in a kernels region is 'copy', meaning > copy-in-and-out). > > [ This means the value of ii after the region is uninitialized. So even if > there's a read from ii after the region, we cannot consider it connected to > the store, given that the value written by the store on the accelerator will > not be copied back to the host. ] > > In this case, we already don't have any load of ii after the region: > ... > : > #pragma omp return > > : > .omp_data_sizes.28 = {CLOBBER}; > .omp_data_arr.27 = {CLOBBER}; > ... > > We could insert clobbers for the bits of .omp_data_arr at the end of the > region to indicate that those are not used. That might enable dse to get rid > of the dead store. > > > But, I think we want a generic solution that handles cases a, b and c, which > means we have to solve the most difficult case, which is b, where the store is > not dead. > > > Or just in an inconvenient place? > > I don't think the place of the store is inconvenient, it would be worse to > have the store in the loop. > > What is inconvenient about the store is the fact that it reads the final value > of the iteration variable (which inhibits parloops). > > > > : > > > goto ; > > > ... > > > > > > This makes it similar to the parloops example above, and that's why I've > > > added pass_scev_cprop in the kernels pass group. > > > > > > [ And for some kernels test-cases with constant loop bound, it's not the > > > final value replacement bit that does the substitution, but the first bit > > > in > > > scev_const_prop using resolve_mixers. So that's a related reason to use > > > pass_scev_cprop. ] > > > > IMHO autopar needs to handle induction itself. > > I'm not sure what you mean. Could you elaborate? Autopar handles induction > variables, but it doesn't handle exit phis reading the final value of the > induction variable. Is that what you want fixed? How? Yes. Perform final value replacement. > > And the above LIM example > > is none for why you need two LIM passes... > > Indeed. I'm planning a separate reply to explain in more detail the need for > the two pass_lims. Thanks.