From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 109900 invoked by alias); 12 Nov 2015 11:16:25 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 109849 invoked by uid 89); 12 Nov 2015 11:16:24 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=ham version=3.3.2 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Thu, 12 Nov 2015 11:16:23 +0000 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by mx1.redhat.com (Postfix) with ESMTPS id 090EAF8062 for ; Thu, 12 Nov 2015 11:16:22 +0000 (UTC) Received: from tucnak.zalov.cz (ovpn-116-34.ams2.redhat.com [10.36.116.34]) by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id tACBGKH4027551 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO) for ; Thu, 12 Nov 2015 06:16:21 -0500 Received: from tucnak.zalov.cz (localhost [127.0.0.1]) by tucnak.zalov.cz (8.15.2/8.15.2) with ESMTP id tACBGICW002651 for ; Thu, 12 Nov 2015 12:16:19 +0100 Received: (from jakub@localhost) by tucnak.zalov.cz (8.15.2/8.15.2/Submit) id tACBGHVe002650 for gcc-patches@gcc.gnu.org; Thu, 12 Nov 2015 12:16:17 +0100 Date: Thu, 12 Nov 2015 11:16:00 -0000 From: Jakub Jelinek To: GCC Patches Subject: Re: [hsa 4/12] OpenMP lowering/expansion changes (gridification) Message-ID: <20151112111617.GZ5675@tucnak.redhat.com> Reply-To: Jakub Jelinek References: <20151105215108.GC9264@virgil.suse.cz> <20151105215733.GG9264@virgil.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151105215733.GG9264@virgil.suse.cz> User-Agent: Mutt/1.5.23 (2014-03-12) X-IsSubscribed: yes X-SW-Source: 2015-11/txt/msg01483.txt.bz2 On Thu, Nov 05, 2015 at 10:57:33PM +0100, Martin Jambor wrote: > the patch in this email contains the changes to make our OpenMP > lowering and expansion machinery produce GPU kernels for a certain > limited class of loops. The plan is to make that class quite a big > bigger, but only the following is ready for submission now. > > Basically, whenever the compiler configured for HSAIL generation > encounters the following pattern: > > #pragma omp target > #pragma omp teams thread_limit(workgroup_size) // thread_limit is optional > #pragma omp distribute parallel for firstprivate(n) private(i) other_sharing_clauses() > for (i = 0; i < n; i++) > some_loop_body Do you support only lb 0 or any constant? Only step 1? Can the b be constant, or just a variable? If you need the number of iterations computed before GOMP_target_ext, supposedly you also need to check that n can't change in between target and the distribute (e.g. if it is addressable or global var) and there are some statements in between. What about schedule or dist_schedule clauses? Only schedule(auto) or missing schedule guarantees you you can distribute the work among the threads any way the compiler wants. dist_schedule is always static, but could have different chunk_size. The current int num_threads, int thread_limit GOMP_target_ext arguments perhaps could be changed to something like int num_args, long *args, where args[0] would be the current num_threads and args[1] current thread_limit, and if any offloading target that might benefit from knowing the number of iterations of distribute parallel for that is the only important statement inside, you could perhaps pass it as args[2] and pass 3 instead of 2 to num_args. That could be something kind of generic rather than HSA specific, and extensible. But, looking at your kernel_launch structure, you want something like multiple dimensions and compute each dimension separately rather than combine (collapse) all dimensions together, which is what OpenMP expansion does right now. > While we have also been experimenting quite a bit with dynamic > parallelism, we have only been able to achieve any good performance > via this process of gridification. The user can be notified whether a > particular target construct was gridified or not via our process of > dumping notes, which however only appear in the detailed dump. I am > seriously considering emitting some kind of warning, when HSA-enabled > compiler is about to produce a non-gridified target code. But then it would warn pretty much on all of libgomp testsuite with target constructs in them... > @@ -547,13 +548,13 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR, > --- a/gcc/fortran/types.def > +++ b/gcc/fortran/types.def > @@ -145,6 +145,7 @@ DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I2_INT, BT_VOID, BT_VOLATILE_PTR, BT_I2, BT > DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I4_INT, BT_VOID, BT_VOLATILE_PTR, BT_I4, BT_INT) > DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I8_INT, BT_VOID, BT_VOLATILE_PTR, BT_I8, BT_INT) > DEF_FUNCTION_TYPE_3 (BT_FN_VOID_VPTR_I16_INT, BT_VOID, BT_VOLATILE_PTR, BT_I16, BT_INT) > +DEF_FUNCTION_TYPE_3 (BT_FN_VOID_PTR_INT_PTR, BT_VOID, BT_PTR, BT_INT, BT_PTR) > > DEF_FUNCTION_TYPE_4 (BT_FN_VOID_OMPFN_PTR_UINT_UINT, > BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT, BT_UINT) > @@ -215,9 +216,9 @@ DEF_FUNCTION_TYPE_7 (BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_UINT_PTR, > DEF_FUNCTION_TYPE_8 (BT_FN_VOID_OMPFN_PTR_UINT_LONG_LONG_LONG_LONG_UINT, > BT_VOID, BT_PTR_FN_VOID_PTR, BT_PTR, BT_UINT, > BT_LONG, BT_LONG, BT_LONG, BT_LONG, BT_UINT) > -DEF_FUNCTION_TYPE_8 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR, > +DEF_FUNCTION_TYPE_9 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR, > BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE, BT_PTR, > - BT_PTR, BT_PTR, BT_UINT, BT_PTR) > + BT_PTR, BT_PTR, BT_UINT, BT_PTR, BT_PTR) You'd need to move it if you add arguments (but as I said on the other patch, this won't really apply on top of the trunk anyway). > --- a/gcc/gimple.h > +++ b/gcc/gimple.h > @@ -153,6 +153,7 @@ enum gf_mask { > GF_OMP_FOR_KIND_TASKLOOP = 2, > GF_OMP_FOR_KIND_CILKFOR = 3, > GF_OMP_FOR_KIND_OACC_LOOP = 4, > + GF_OMP_FOR_KIND_KERNEL_BODY = 5, > /* Flag for SIMD variants of OMP_FOR kinds. */ > GF_OMP_FOR_SIMD = 1 << 3, > GF_OMP_FOR_KIND_SIMD = GF_OMP_FOR_SIMD | 0, > @@ -621,8 +622,24 @@ struct GTY((tag("GSS_OMP_FOR"))) > /* [ WORD 11 ] > Pre-body evaluated before the loop body begins. */ > gimple_seq pre_body; > + > + /* [ WORD 12 ] > + If set, this statement is part of a gridified kernel, its clauses need to > + be scanned and lowered but the statement should be discarded after > + lowering. */ > + bool kernel_phony; A bool flag is better put as a GF_OMP_* flag, there are still bits left there. > @@ -642,6 +659,26 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT"))) > /* [ WORD 10 ] > Shared data argument. */ > tree data_arg; > + > + /* TODO: Revisit placement of the following two fields. On one hand, we > + currently only use them on target construct. On the other, use on > + parallel construct is also possible in the future. */ > + > + /* [ WORD 11 ] */ > + /* Number of elements in kernel_iter array. */ > + size_t dimensions; > + > + /* [ WORD 12 ] */ > + /* If target also contains a GPU kernel, it should be run with the > + following grid sizes. */ > + struct gimple_omp_target_grid_dim > + * GTY((length ("%h.dimensions"))) kernel_dim; > + > + /* [ WORD 13 ] */ > + /* If set, this statement is part of a gridified kernel, its clauses need to > + be scanned and lowered but the statement should be discarded after > + lowering. */ > + bool kernel_phony; I really don't like sticking any other arguments into these gimple structures. Add some artificial clause and add it to the construct's clauses instead? > --- a/gcc/omp-builtins.def > +++ b/gcc/omp-builtins.def > @@ -302,8 +302,12 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start", > BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST) > DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end", > BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST) > +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_REGISTER, "GOMP_offload_register", > + BT_FN_VOID_PTR_INT_PTR, ATTR_NOTHROW_LIST) > +DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_UNREGISTER, "GOMP_offload_unregister", > + BT_FN_VOID_PTR_INT_PTR, ATTR_NOTHROW_LIST) These two are deprecated, use the *_ver ones instead. > DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TARGET, "GOMP_target_41", > - BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR, > + BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_UINT_PTR_PTR, > ATTR_NOTHROW_LIST) This won't really apply to trunk. Jakub