From: Richard Biener <rguenther@suse.de>
To: Tom de Vries <Tom_deVries@mentor.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
"gcc-patches@gnu.org" <gcc-patches@gnu.org>,
Jakub Jelinek <jakub@redhat.com>
Subject: Re: [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def
Date: Tue, 17 Nov 2015 15:18:00 -0000 [thread overview]
Message-ID: <alpine.LSU.2.11.1511171618210.4884@t29.fhfr.qr> (raw)
In-Reply-To: <564B3F69.50600@mentor.com>
On Tue, 17 Nov 2015, Tom de Vries wrote:
> On 17/11/15 11:05, Richard Biener wrote:
> > On Tue, Nov 17, 2015 at 12:20 AM, Tom de Vries <Tom_deVries@mentor.com>
> > wrote:
> > > On 16/11/15 13:45, Richard Biener wrote:
> > > > > >
> > > > > > + NEXT_PASS (pass_scev_cprop);
> > > > > > > >
> > > > > > > > What's that for? It's supposed to help removing loops - I don't
> > > > > > > > expect kernels to vanish.
> > > > >
> > > > > >
> > > > > > I'm using pass_scev_cprop for the "final value replacement"
> > > > > > functionality.
> > > > > > Added comment.
> > >
> > >
> > > > That functionality is intented to enable loop removal.
> > >
> > >
> > > Let me try to explain in a bit more detail.
> > >
> > >
> > > I.
> > >
> > > Consider a parloops testcase test.c, with a use of the final value of the
> > > iteration variable (return i):
> > > ...
> > > unsigned int
> > > foo (int n, int *a)
> > > {
> > > int i;
> > > for (i = 0; i < n; ++i)
> > > a[i] = 1;
> > >
> > > return i;
> > > }
> > > ...
> > >
> > > Say we compile with:
> > > ...
> > > $ gcc -S -O2 test.c -ftree-parallelize-loops=2 -fdump-tree-all-details
> > > ...
> > >
> > > We can see here in the parloops dump-file that the loop was parallelized:
> > > ...
> > > SUCCESS: may be parallelized
> > > ...
> > >
> > > Now say that we run with -fno-tree-scev-cprop in addition. Instead we find
> > > in the parloops dump-file:
> > > ...
> > > phi is i_1 = PHI <i_10(4)>
> > > arg of phi to exit: value i_10 used outside loop
> > > checking if it a part of reduction pattern:
> > > FAILED: it is not a part of reduction.
> > > ...
> > >
> > > Auto-parallelization fails in this case because there is a loop exit phi
> > > (the one in bb 6 defining i_1) which is not part of a reduction:
> > > ...
> > > <bb 4>:
> > > # i_13 = PHI <0(3), i_10(5)>
> > > _5 = (long unsigned int) i_13;
> > > _6 = _5 * 4;
> > > _8 = a_7(D) + _6;
> > > *_8 = 1;
> > > i_10 = i_13 + 1;
> > > if (n_4(D) > i_10)
> > > goto <bb 5>;
> > > else
> > > goto <bb 6>;
> > >
> > > <bb 5>:
> > > goto <bb 4>;
> > >
> > > <bb 6>:
> > > # i_1 = PHI <i_10(4)>
> > > _20 = (unsigned int) i_1;
> > > ...
> > >
> > > With -ftree-scev-cprop, we find in the pass_scev_cprop dump-file:
> > > ...
> > > final value replacement:
> > > i_1 = PHI <i_10(4)>
> > > with
> > > i_1 = n_4(D);
> > > ...
> > >
> > > And the resulting loop no longer has any loop exit phis, so
> > > auto-parallelization succeeds:
> > > ...
> > > <bb 4>:
> > > # i_13 = PHI <0(3), i_10(5)>
> > > _5 = (long unsigned int) i_13;
> > > _6 = _5 * 4;
> > > _8 = a_7(D) + _6;
> > > *_8 = 1;
> > > i_10 = i_13 + 1;
> > > if (n_4(D) > i_10)
> > > goto <bb 5>;
> > > else
> > > goto <bb 6>;
> > >
> > > <bb 5>:
> > > goto <bb 4>;
> > >
> > > <bb 6>:
> > > _20 = (unsigned int) n_4(D);
> > > ...
> > >
> > > [ I've filed PR68373 - "autopar fails on loop exit phi with argument
> > > defined
> > > outside loop", for a slightly different testcase where despite the final
> > > value replacement autopar still fails. ]
> > >
> > >
> > > II.
> > >
> > > Now, back to oacc kernels.
> > >
> > > Consider test-case kernels-loop-n.f95 (will add this one to the
> > > test-cases):
> > > ...
> > > module test
> > > contains
> > > subroutine foo(n)
> > > implicit none
> > > integer :: n
> > > integer, dimension (0:n-1) :: a, b, c
> > > integer :: i, ii
> > > do i = 0, n - 1
> > > a(i) = i * 2
> > > end do
> > >
> > > do i = 0, n -1
> > > b(i) = i * 4
> > > end do
> > >
> > > !$acc kernels copyin (a(0:n-1), b(0:n-1)) copyout (c(0:n-1))
> > > do ii = 0, n - 1
> > > c(ii) = a(ii) + b(ii)
> > > end do
> > > !$acc end kernels
> > >
> > > do i = 0, n - 1
> > > if (c(i) .ne. a(i) + b(i)) call abort
> > > end do
> > >
> > > end subroutine foo
> > > end module test
> > > ...
> > >
> > > The loop at the start of the kernels pass group contains an in-memory
> > > iteration variable, with a store to '*_9 = _38'.
> > > ...
> > > <bb 4>:
> > > _13 = *.omp_data_i_4(D).c;
> > > c.21_14 = *_13;
> > > _16 = *_9;
> > > _17 = (integer(kind=8)) _16;
> > > _18 = *.omp_data_i_4(D).a;
> > > a.22_19 = *_18;
> > > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > > _24 = *.omp_data_i_4(D).b;
> > > b.23_25 = *_24;
> > > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > > _30 = _23 + _29;
> > > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > > _38 = _16 + 1;
> > > *_9 = _38;
> > > if (_8 == _16)
> > > goto <bb 3>;
> > > else
> > > goto <bb 4>;
> > > ...
> > >
> > > After pass_lim/pass_copy_prop, we've rewritten that into using a local
> > > iteration variable, but we've generated a read of the final value of the
> > > iteration variable outside the loop, which means auto-parallelization will
> > > fail:
> > > ...
> > > <bb 5>:
> > > # D__lsm.29_12 = PHI <D__lsm.29_15(4), _38(7)>
> > > _17 = (integer(kind=8)) D__lsm.29_12;
> > > _23 = MEM[(integer(kind=4)[0:D.3488] *)a.22_19][_17];
> > > _29 = MEM[(integer(kind=4)[0:D.3484] *)b.23_25][_17];
> > > _30 = _23 + _29;
> > > MEM[(integer(kind=4)[0:D.3480] *)c.21_14][_17] = _30;
> > > _38 = D__lsm.29_12 + 1;
> > > if (_8 == D__lsm.29_12)
> > > goto <bb 6>;
> > > else
> > > goto <bb 7>;
> > >
> > > <bb 6>:
> > > # D__lsm.29_27 = PHI <_38(5)>
> > > *_9 = D__lsm.29_27;
> > > goto <bb 3>;
> >
> > So this store is not actually necessary?
>
> a.
> In the case of this example, the store is dead.
>
> There is a corresponding load at the point that we split off the region:
> ...
> <bb 9>:
> #pragma omp return
>
> <bb 10>:
> D.3635 = .omp_data_arr.25.ii;
> ii = *D.3635;
> ...
>
> This load is later removed, given that ii is unused after the region. But once
> the region is split off, there's nothing in the context of the store to
> suggest that it's dead.
>
> And to get rid of the load of ii before the region is split off, we would have
> to implement some sort of liveness analysis on pre-ssa code.
>
> b.
> There's the case where there is an explicit use of ii after the region, in
> which case the store is not dead.
>
> c.
> And there's the case were we use a data clause on the region, f.i. 'create
> (ii)' to indicate that the variable is neither copied in nor copied out of the
> region (the default for a scalar in a kernels region is 'copy', meaning
> copy-in-and-out).
>
> [ This means the value of ii after the region is uninitialized. So even if
> there's a read from ii after the region, we cannot consider it connected to
> the store, given that the value written by the store on the accelerator will
> not be copied back to the host. ]
>
> In this case, we already don't have any load of ii after the region:
> ...
> <bb 9>:
> #pragma omp return
>
> <bb 10>:
> .omp_data_sizes.28 = {CLOBBER};
> .omp_data_arr.27 = {CLOBBER};
> ...
>
> We could insert clobbers for the bits of .omp_data_arr at the end of the
> region to indicate that those are not used. That might enable dse to get rid
> of the dead store.
>
>
> But, I think we want a generic solution that handles cases a, b and c, which
> means we have to solve the most difficult case, which is b, where the store is
> not dead.
>
> > Or just in an inconvenient place?
>
> I don't think the place of the store is inconvenient, it would be worse to
> have the store in the loop.
>
> What is inconvenient about the store is the fact that it reads the final value
> of the iteration variable (which inhibits parloops).
>
> > > <bb 7>:
> > > goto <bb 5>;
> > > ...
> > >
> > > This makes it similar to the parloops example above, and that's why I've
> > > added pass_scev_cprop in the kernels pass group.
> > >
> > > [ And for some kernels test-cases with constant loop bound, it's not the
> > > final value replacement bit that does the substitution, but the first bit
> > > in
> > > scev_const_prop using resolve_mixers. So that's a related reason to use
> > > pass_scev_cprop. ]
> >
> > IMHO autopar needs to handle induction itself.
>
> I'm not sure what you mean. Could you elaborate? Autopar handles induction
> variables, but it doesn't handle exit phis reading the final value of the
> induction variable. Is that what you want fixed? How?
Yes. Perform final value replacement.
> > And the above LIM example
> > is none for why you need two LIM passes...
>
> Indeed. I'm planning a separate reply to explain in more detail the need for
> the two pass_lims.
Thanks.
next prev parent reply other threads:[~2015-11-17 15:18 UTC|newest]
Thread overview: 133+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-11-09 15:35 [PATCH series, 16] Use parloops to parallelize oacc kernels regions Tom de Vries
2015-11-09 15:44 ` [PATCH, 1/16] Insert new exit block only when needed in transform_to_exit_first_loop_alt Tom de Vries
2015-11-11 10:50 ` Richard Biener
2015-11-09 15:45 ` [PATCH, 2/16] Make create_parallel_loop return void Tom de Vries
2015-11-11 10:50 ` Richard Biener
2015-11-09 15:51 ` [PATCH, 3/16] Ignore reduction clause on kernels directive Tom de Vries
2015-11-24 12:25 ` [PING][PATCH, " Tom de Vries
2016-01-18 14:24 ` [PING^2][PATCH, " Tom de Vries
2016-01-18 14:26 ` Jakub Jelinek
2015-11-09 16:10 ` [PATCH, 4/16] Implement -foffload-alias Tom de Vries
2015-11-11 10:53 ` Richard Biener
2015-11-11 11:01 ` Jakub Jelinek
2015-11-12 16:04 ` Tom de Vries
2015-11-13 8:46 ` Richard Biener
2015-11-13 11:03 ` Tom de Vries
2015-11-13 11:30 ` Richard Biener
2015-11-13 11:39 ` Jakub Jelinek
2015-11-21 12:24 ` Tom de Vries
2015-11-23 11:46 ` Richard Biener
2015-11-27 11:44 ` Tom de Vries
2015-11-27 12:14 ` Tom de Vries
2015-12-02 9:59 ` Jakub Jelinek
2016-03-14 13:16 ` Tom de Vries
2016-03-14 23:18 ` Tom de Vries
2015-12-02 9:46 ` Jakub Jelinek
2015-12-02 13:11 ` Tom de Vries
2015-12-11 12:45 ` Tom de Vries
2015-12-11 13:00 ` Richard Biener
2015-12-13 16:38 ` Tom de Vries
2015-12-14 13:26 ` Richard Biener
2015-12-14 15:44 ` Tom de Vries
2015-12-16 13:16 ` Richard Biener
2015-12-16 14:43 ` Tom de Vries
2015-12-17 12:03 ` [gomp4] " Thomas Schwinge
2015-12-03 11:53 ` Tom de Vries
2015-11-09 16:31 ` [PATCH, 5/16] Add in_oacc_kernels_region in struct loop Tom de Vries
2015-11-11 10:57 ` Richard Biener
2015-11-16 11:39 ` Tom de Vries
2015-11-16 11:39 ` Tom de Vries
2015-11-16 12:41 ` Richard Biener
2015-11-09 17:39 ` [PATCH, 6/16] Add pass_oacc_kernels Tom de Vries
2015-11-11 10:59 ` Richard Biener
2015-11-19 13:51 ` Tom de Vries
2015-11-24 12:17 ` Tom de Vries
2015-11-25 10:42 ` Richard Biener
2016-02-05 12:06 ` Use plain -fopenacc to enable OpenACC kernels processing (was: [PATCH, 6/16] Add pass_oacc_kernels) Thomas Schwinge
2016-02-10 14:40 ` Use plain -fopenacc to enable OpenACC kernels processing Thomas Schwinge
2016-02-15 16:54 ` Tom de Vries
2016-02-23 15:19 ` Thomas Schwinge
2015-11-09 18:14 ` [PATCH, 7/16] Add pass_dominator_oacc_kernels Tom de Vries
2015-11-11 11:05 ` Richard Biener
2015-11-16 12:04 ` Tom de Vries
2015-11-09 18:34 ` [PATCH, 8/16] Add pass_ch_oacc_kernels Tom de Vries
2015-11-11 20:29 ` Tom de Vries
2015-11-30 12:12 ` [gomp4] Use pass_ch instead of pass_ch_oacc_kernels (was: [PATCH, 8/16] Add pass_ch_oacc_kernels) Thomas Schwinge
2015-11-09 19:53 ` [PATCH, 9/16] Add pass_parallelize_loops_oacc_kernels Tom de Vries
2015-11-16 11:59 ` Tom de Vries
2015-11-24 12:27 ` Tom de Vries
2015-12-13 16:58 ` [PIING][PATCH, " Tom de Vries
2015-12-14 15:23 ` Richard Biener
2016-01-16 22:41 ` [Committed] Move pass_expand_omp_ssa out of pass_parallelize_loops Tom de Vries
2016-01-18 12:59 ` [Committed] Allow pass_parallelize_loops to be run outside the loop pipeline Tom de Vries
2016-01-18 13:07 ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Tom de Vries
2016-01-18 13:30 ` [committed] Add pass_parallelize_loops to pass_oacc_kernels Tom de Vries
2016-01-20 8:54 ` [committed] Add oacc_kernels_p argument to pass_parallelize_loops Thomas Schwinge
2016-01-20 10:31 ` Tom de Vries
2015-11-09 19:59 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
2015-11-11 11:03 ` Richard Biener
2015-11-16 11:55 ` Tom de Vries
2015-11-16 12:45 ` Richard Biener
2015-11-16 23:21 ` Tom de Vries
2015-11-17 10:05 ` Richard Biener
2015-11-17 14:54 ` Tom de Vries
2015-11-17 15:18 ` Richard Biener [this message]
2015-11-17 15:39 ` Tom de Vries
2015-11-17 22:21 ` [PATCH, PR68373 ] Call scev_const_prop in pass_parallelize_loops::execute Tom de Vries
2015-11-19 9:36 ` Tom de Vries
2015-11-20 10:15 ` Richard Biener
2015-11-18 8:30 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Richard Biener
2015-11-18 16:22 ` Bernhard Reutner-Fischer
2015-11-20 12:53 ` [committed, trivial] Fix typo and trailing whitespace in dump-file strings in parloops Tom de Vries
2015-11-19 0:35 ` [PATCH, 10/16] Add pass_oacc_kernels pass group in passes.def Tom de Vries
2015-11-20 10:28 ` Richard Biener
2015-11-21 8:42 ` Tom de Vries
2015-11-23 11:31 ` Richard Biener
2015-11-23 15:53 ` Tom de Vries
2015-11-23 16:38 ` Richard Biener
2015-11-19 10:31 ` Tom de Vries
2015-11-20 10:37 ` Richard Biener
2015-11-20 13:27 ` Tom de Vries
2015-11-20 13:29 ` Richard Biener
2015-11-20 16:34 ` Tom de Vries
2015-11-23 10:11 ` Richard Biener
2015-11-24 12:22 ` Tom de Vries
2015-11-24 13:19 ` Richard Biener
2015-11-24 14:33 ` Tom de Vries
2015-11-24 14:36 ` Richard Biener
2015-11-24 15:05 ` Tom de Vries
2015-11-25 10:43 ` Richard Biener
2015-11-25 10:44 ` Richard Biener
2015-11-30 17:48 ` [gomp4] " Thomas Schwinge
2015-11-22 23:37 ` [PATCH] Don't reapply loops flags if unnecessary in loop_optimizer_init Tom de Vries
2015-11-23 10:33 ` Richard Biener
2015-11-23 11:27 ` Tom de Vries
2015-11-09 20:02 ` [PATCH, 11/16] Update testcases after adding kernels pass group Tom de Vries
2015-11-11 11:03 ` Richard Biener
2015-11-12 14:32 ` Tom de Vries
2015-11-12 14:43 ` Richard Biener
2015-11-12 15:42 ` David Malcolm
2015-11-13 9:44 ` Richard Biener
2015-11-09 20:06 ` [PATCH, 12/16] Handle acc loop directive Tom de Vries
2015-11-24 12:30 ` [PING][PATCH, " Tom de Vries
2016-01-18 14:27 ` [PING^2][PATCH, " Tom de Vries
2016-01-26 12:38 ` [PING^3][PATCH, " Tom de Vries
2016-01-26 12:50 ` Jakub Jelinek
2016-02-12 11:11 ` Tom de Vries
2016-02-22 10:55 ` Tom de Vries
2016-02-22 10:58 ` Jakub Jelinek
2016-02-29 3:27 ` Tom de Vries
2016-03-07 8:22 ` [PING][PATCH, " Tom de Vries
2016-03-14 6:21 ` [PING^2][PATCH, " Tom de Vries
2015-11-09 20:08 ` [PATCH, 13/16] Add c-c++-common/goacc/kernels-*.c Tom de Vries
2016-01-18 13:33 ` [committed] Add oacc kernels tests in goacc Tom de Vries
2015-11-09 20:09 ` [PATCH, 14/16] Add gfortran.dg/goacc/kernels-*.f95 Tom de Vries
2015-11-09 20:11 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
2016-01-18 13:39 ` [comitted] Add oacc kernels test in libgomp Tom de Vries
2016-03-09 9:18 ` [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c Tom de Vries
2016-03-18 12:46 ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc (was: [PATCH, 15/16] Add libgomp.oacc-c-c++-common/kernels-*.c) Thomas Schwinge
2016-04-05 9:13 ` Scan for parallelization of the oacc kernels test-cases in gfortran.dg/goacc Tom de Vries
2016-04-07 15:26 ` Thomas Schwinge
2015-11-09 20:12 ` [PATCH, 16/16] Add libgomp.oacc-fortran/kernels-*.f95 Tom de Vries
2016-03-09 9:19 ` Tom de Vries
2016-03-16 13:12 ` Thomas Schwinge
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LSU.2.11.1511171618210.4884@t29.fhfr.qr \
--to=rguenther@suse.de \
--cc=Tom_deVries@mentor.com \
--cc=gcc-patches@gnu.org \
--cc=jakub@redhat.com \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).