* [gomp4] OpenACC async re-work
@ 2017-06-24 7:54 Chung-Lin Tang
2017-06-26 22:45 ` Cesar Philippidis
0 siblings, 1 reply; 9+ messages in thread
From: Chung-Lin Tang @ 2017-06-24 7:54 UTC (permalink / raw)
To: gcc-patches, Cesar Philippidis, Thomas Schwinge
[-- Attachment #1: Type: text/plain, Size: 11842 bytes --]
Hi Cesar, Thomas,
This patch is the re-implementation of OpenACC async we talked about.
The changes are rather large, so I am putting it here for a few days before
actually committing them to gomp-4_0-branch. Would appreciate if you guys
take a look.
To overall describe the highlights of the changes:
(1) Instead of essentially implementing the entire OpenACC async support
inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
plugin functions. Most of the OpenACC specific logic is pulled into
libgomp/oacc-async.c
(2) CUDA events are no longer used. The limitation of no CUDA calls inside
CUDA callbacks were a problem for resource freeing, but we now stash
them onto the ptx_device and free them later.
(3) For 'wait + async', we now add a local thread synchronize, instead
of just ordering the streams.
(4) To work with the (3) change, some front end changes were added to
propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
represent a 'wait all'.
Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
this after the weekend (or Tues.)
Thanks,
Chung-Lin
2017-06-24 Chung-Lin Tang <cltang@codesourcery.com>
gcc/c/
* c-parser.c (c_parser_oacc_clause_wait): Add representation of wait
clause without argument as 'wait (GOMP_ASYNC_NOVAL)', adjust comments.
gcc/cp/
* parser.c (cp_parser_oacc_clause_wait): Add representation of wait
clause without argument as 'wait (GOMP_ASYNC_NOVAL)', adjust comments.
gcc/fortran/
* trans-openmp.c (gfc_trans_omp_clauses_1): Add representation of wait
clause without argument as 'wait (GOMP_ASYNC_NOVAL)'.
gcc/
* omp-low.c (expand_omp_target): Add middle-end support for handling
OMP_CLAUSE_WAIT clause with a GOMP_ASYNC_NOVAL(-1) as the argument.
gcc/testsuite/
* c-c++-common/goacc/dtype-1.c: Adjust testcase.
* gfortran.dg/goacc/dtype-1.f95: Likewise.
include/
* gomp-constants.h (GOMP_LAUNCH_OP_MASK): Define.
(GOMP_LAUNCH_PACK): Add bitwise-and of GOMP_LAUNCH_OP_MASK.
(GOMP_LAUNCH_OP): Likewise.
libgomp/
* libgomp-plugin.h (struct goacc_asyncqueue): Declare.
(struct goacc_asyncqueue_list): Likewise.
(goacc_aq): Likewise.
(goacc_aq_list): Likewise.
(GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
(GOMP_OFFLOAD_openacc_async_test): Remove.
(GOMP_OFFLOAD_openacc_async_test_all): Remove.
(GOMP_OFFLOAD_openacc_async_wait): Remove.
(GOMP_OFFLOAD_openacc_async_wait_async): Remove.
(GOMP_OFFLOAD_openacc_async_wait_all): Remove.
(GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
(GOMP_OFFLOAD_openacc_async_set_async): Remove.
(GOMP_OFFLOAD_openacc_exec): Adjust declaration.
(GOMP_OFFLOAD_openacc_cuda_get_stream): Likewise.
(GOMP_OFFLOAD_openacc_cuda_set_stream): Likewise.
(GOMP_OFFLOAD_openacc_async_exec): Declare.
(GOMP_OFFLOAD_openacc_async_construct): Declare.
(GOMP_OFFLOAD_openacc_async_destruct): Declare.
(GOMP_OFFLOAD_openacc_async_test): Declare.
(GOMP_OFFLOAD_openacc_async_synchronize): Declare.
(GOMP_OFFLOAD_openacc_async_serialize): Declare.
(GOMP_OFFLOAD_openacc_async_queue_callback): Declare.
(GOMP_OFFLOAD_openacc_async_host2dev): Declare.
(GOMP_OFFLOAD_openacc_async_dev2host): Declare.
* libgomp.h (struct acc_dispatch_t): Define 'async' sub-struct.
(gomp_acc_insert_pointer): Adjust declaration.
(gomp_copy_host2dev): New declaration.
(gomp_copy_dev2host): Likewise.
(gomp_map_vars_async): Likewise.
(gomp_unmap_tgt): Likewise.
(gomp_unmap_vars_async): Likewise.
(gomp_fini_device): Likewise.
* oacc-async.c (get_goacc_thread): New function.
(get_goacc_thread_device): New function.
(lookup_goacc_asyncqueue): New function.
(get_goacc_asyncqueue): New function.
(acc_async_test): Adjust code to use new async design.
(acc_async_test_all): Likewise.
(acc_wait): Likewise.
(acc_wait_async): Likewise.
* oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
design.
(acc_set_cuda_stream): Likewise.
* oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
(host_openacc_register_async_cleanup): Remove.
(host_openacc_async_exec): New function.
(host_openacc_async_test): Adjust parameters.
(host_openacc_async_test_all): Remove.
(host_openacc_async_wait): Remove.
(host_openacc_async_wait_async): Remove.
(host_openacc_async_wait_all): Remove.
(host_openacc_async_wait_all_async): Remove.
(host_openacc_async_set_async): Remove.
(host_openacc_async_synchronize): New function.
(host_openacc_async_serialize): New function.
(host_openacc_async_host2dev): New function.
(host_openacc_async_dev2host): New function.
(host_openacc_async_queue_callback): New function.
(host_openacc_async_construct): New function.
(host_openacc_async_destruct): New function.
(struct gomp_device_descr host_dispatch): Remove initialization of old
interface, add intialization of new async sub-struct.
* oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
(goacc_attach_host_thread_to_device): Remove old async code usage.
* oacc-int.h (goacc_init_asyncqueues): New declaration.
(goacc_fini_asyncqueues): Likewise.
(goacc_async_copyout_unmap_vars): Likewise.
(goacc_async_free): Likewise.
(get_goacc_asyncqueue): Likewise.
(lookup_goacc_asyncqueue): Likewise.
* oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
design.
(acc_is_present): Explicitly use 1/0 as return value;
(present_create_copy): Adjust code to use new async design.
(delete_copyout): Likewise.
(update_dev_host): Likewise.
(gomp_acc_insert_pointer): Add async parameter, adjust code to use new
async design.
(gomp_acc_remove_pointer): Adjust code to use new async design.
* oacc-parallel.c (GOACC_parallel_keyed): Adjust code to use new async
design, adjust profiling bits, interpret launch op as signed 16-bit
field.
(GOACC_enter_exit_data): Handle -1 as waits num, adjust code to use new
async design.
(goacc_wait): Adjust code to use new async design.
(GOACC_update): Likewise.
* oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.
* target.c (goacc_device_copy_async): New function.
(gomp_copy_host2dev): Remove 'static', add goacc_asyncqueue parameter,
add goacc_device_copy_async case.
(gomp_copy_dev2host): Likewise.
(gomp_map_vars_existing): Add goacc_asyncqueue parameter, adjust code.
(gomp_map_pointer): Likewise.
(gomp_map_fields_existing): Likewise.
(gomp_map_vars): Add function for compatiblity.
(gomp_map_vars_async): Adapt from gomp_map_vars, add goacc_asyncqueue
parameter.
(gomp_unmap_tgt): Remove statis, add attribute_hidden.
(gomp_unmap_vars): Add function for compatiblity.
(gomp_unmap_vars_async): Adapt from gomp_unmap_vars, add
goacc_asyncqueue parameter.
(gomp_fini_device): New function.
(gomp_exit_data): Adjust gomp_copy_dev2host call.
(gomp_load_plugin_for_device): Remove old interface, adjust to load
new async interface.
(gomp_target_fini): Adjust code to call gomp_fini_device.
* plugin/plugin-nvptx.c (struct cuda_map): Remove.
(struct ptx_stream): Remove.
(struct nvptx_thread): Remove current_stream field.
(cuda_map_create): Remove.
(cuda_map_destroy): Remove.
(map_init): Remove.
(map_fini): Remove.
(map_pop): Remove.
(map_push): Remove.
(struct goacc_asyncqueue): Define.
(struct nvptx_callback): Define.
(struct ptx_free_block): Define.
(struct ptx_device): Remove null_stream, active_streams, async_streams,
stream_lock, and next fields.
(enum ptx_event_type): Remove.
(struct ptx_event): Remove.
(ptx_event_lock): Remove.
(ptx_events): Remove.
(init_streams_for_device): Remove.
(fini_streams_for_device): Remove.
(select_stream_for_async): Remove.
(nvptx_init): Remove ptx_events and ptx_event_lock references.
(nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
case.
(nvptx_open_device): Add free_blocks initialization, remove
init_streams_for_device call.
(nvptx_close_device): Remove fini_streams_for_device call, add
free_blocks destruct code.
(event_gc): Remove.
(event_add): Remove.
(nvptx_exec): Adjust parameters and code.
(nvptx_free): Likewise.
(nvptx_host2dev): Remove.
(nvptx_dev2host): Remove.
(nvptx_set_async): Remove.
(nvptx_async_test): Remove.
(nvptx_async_test_all): Remove.
(nvptx_wait): Remove.
(nvptx_wait_async): Remove.
(nvptx_wait_all): Remove.
(nvptx_wait_all_async): Remove.
(nvptx_get_cuda_stream): Remove.
(nvptx_set_cuda_stream): Remove.
(GOMP_OFFLOAD_alloc): Adjust code.
(GOMP_OFFLOAD_free): Likewise.
(GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
(GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
(GOMP_OFFLOAD_openacc_async_test_all): Remove.
(GOMP_OFFLOAD_openacc_async_wait): Remove.
(GOMP_OFFLOAD_openacc_async_wait_async): Remove.
(GOMP_OFFLOAD_openacc_async_wait_all): Remove.
(GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
(GOMP_OFFLOAD_openacc_async_set_async): Remove.
(cuda_free_argmem): New function.
(GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
(GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
(GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
(GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
(GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
(GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
(cuda_callback_wrapper): New function.
(cuda_memcpy_sanity_check): New function.
(GOMP_OFFLOAD_host2dev): Remove and re-implement.
(GOMP_OFFLOAD_dev2host): Remove and re-implement.
(GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
(GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.
* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
(cb_enter_data_start): Adjust testcase.
* testsuite/libgomp.oacc-c-c++-common/data-2-lib.c: Adjust testcase.
* testsuite/libgomp.oacc-c-c++-common/data-2.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/data-3.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/lib-71.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/lib-77.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/lib-79.c: Likewise.
* testsuite/libgomp.oacc-c-c++-common/lib-81.c: Likewise.
* testsuite/libgomp.oacc-fortran/lib-12.f90: Likewise.
[-- Attachment #2: 204.diff --]
[-- Type: text/plain, Size: 123168 bytes --]
Index: gcc/c/c-parser.c
===================================================================
--- gcc/c/c-parser.c (revision 249620)
+++ gcc/c/c-parser.c (working copy)
@@ -11941,7 +11941,7 @@ c_parser_oacc_clause_tile (c_parser *parser, tree
}
/* OpenACC:
- wait ( int-expr-list ) */
+ wait [( int-expr-list )] */
static tree
c_parser_oacc_clause_wait (c_parser *parser, tree list)
@@ -11950,7 +11950,15 @@ c_parser_oacc_clause_wait (c_parser *parser, tree
if (c_parser_peek_token (parser)->type == CPP_OPEN_PAREN)
list = c_parser_oacc_wait_list (parser, clause_loc, list);
+ else
+ {
+ tree c = build_omp_clause (clause_loc, OMP_CLAUSE_WAIT);
+ OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+ OMP_CLAUSE_CHAIN (c) = list;
+ list = c;
+ }
+
return list;
}
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c (revision 249620)
+++ gcc/cp/parser.c (working copy)
@@ -30619,7 +30619,7 @@ cp_parser_oacc_wait_list (cp_parser *parser, locat
}
/* OpenACC:
- wait ( int-expr-list ) */
+ wait [( int-expr-list )] */
static tree
cp_parser_oacc_clause_wait (cp_parser *parser, tree list)
@@ -30626,11 +30626,17 @@ cp_parser_oacc_clause_wait (cp_parser *parser, tre
{
location_t location = cp_lexer_peek_token (parser->lexer)->location;
- if (cp_lexer_peek_token (parser->lexer)->type != CPP_OPEN_PAREN)
- return list;
+ if (cp_lexer_peek_token (parser->lexer)->type == CPP_OPEN_PAREN)
+ list = cp_parser_oacc_wait_list (parser, location, list);
+ else
+ {
+ tree c = build_omp_clause (location, OMP_CLAUSE_WAIT);
+
+ OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+ OMP_CLAUSE_CHAIN (c) = list;
+ list = c;
+ }
- list = cp_parser_oacc_wait_list (parser, location, list);
-
return list;
}
Index: gcc/fortran/trans-openmp.c
===================================================================
--- gcc/fortran/trans-openmp.c (revision 249620)
+++ gcc/fortran/trans-openmp.c (working copy)
@@ -2962,6 +2962,13 @@ gfc_trans_omp_clauses_1 (stmtblock_t *block, gfc_o
omp_clauses = c;
}
}
+ else if (clauses->wait)
+ {
+ c = build_omp_clause (where.lb->location, OMP_CLAUSE_WAIT);
+ OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+ OMP_CLAUSE_CHAIN (c) = omp_clauses;
+ omp_clauses = c;
+ }
if (clauses->num_gangs_expr)
{
tree num_gangs_var
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c (revision 249620)
+++ gcc/omp-low.c (working copy)
@@ -14226,9 +14226,18 @@ expand_omp_target (struct omp_region *region)
/* ... push a placeholder. */
args.safe_push (integer_zero_node);
+ bool noval_seen = false;
+ tree noval = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+
for (; c; c = OMP_CLAUSE_CHAIN (c))
if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
{
+ if (tree_int_cst_compare (OMP_CLAUSE_WAIT_EXPR (c), noval) == 0)
+ {
+ noval_seen = true;
+ continue;
+ }
+
args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
integer_type_node,
OMP_CLAUSE_WAIT_EXPR (c)));
@@ -14235,7 +14244,12 @@ expand_omp_target (struct omp_region *region)
num_waits++;
}
- if (!tagging || num_waits)
+ if (noval_seen && num_waits == 0)
+ args[t_wait_idx] =
+ (tagging
+ ? oacc_launch_pack (GOMP_LAUNCH_WAIT, NULL_TREE, GOMP_ASYNC_NOVAL)
+ : noval);
+ else if (!tagging || num_waits)
{
tree len;
Index: gcc/testsuite/c-c++-common/goacc/dtype-1.c
===================================================================
--- gcc/testsuite/c-c++-common/goacc/dtype-1.c (revision 249620)
+++ gcc/testsuite/c-c++-common/goacc/dtype-1.c (working copy)
@@ -96,11 +96,11 @@ test ()
/* { dg-final { scan-tree-dump-times "oacc_parallel device_type\\(\\*\\) \\\[ wait\\(10\\) vector_length\\(10\\) num_workers\\(10\\) num_gangs\\(10\\) async\\(10\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(3\\) vector_length\\(128\\) num_workers\\(300\\) num_gangs\\(300\\) async\\(3\\) \\\] wait\\(1\\) vector_length\\(1\\) num_workers\\(1\\) num_gangs\\(1\\) async\\(1\\)" 1 "omplower" } } */
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) \\\]" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(-1\\) async\\(-1\\) \\\]" 1 "omplower" } } */
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(1\\) async\\(1\\) \\\] async\\(-1\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(1\\) async\\(1\\) \\\] wait\\(-1\\) async\\(-1\\)" 1 "omplower" } } */
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ wait\\(0\\) async\\(0\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(2\\) async\\(2\\) \\\] async\\(-1\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ wait\\(0\\) async\\(0\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(2\\) async\\(2\\) \\\] wait\\(-1\\) async\\(-1\\)" 1 "omplower" } } */
/* { dg-final { scan-tree-dump-times "acc loop device_type\\(nvidia\\) \\\[ tile\\(1\\) gang \\\] private\\(i1\\.0\\) private\\(i1\\)" 1 "omplower" } } */
Index: gcc/testsuite/gfortran.dg/goacc/dtype-1.f95
===================================================================
--- gcc/testsuite/gfortran.dg/goacc/dtype-1.f95 (revision 249620)
+++ gcc/testsuite/gfortran.dg/goacc/dtype-1.f95 (working copy)
@@ -175,13 +175,13 @@ end subroutine sr5b
! { dg-final { scan-tree-dump-times "oacc_parallel device_type\\(\\*\\) \\\[ async\\(10\\) wait\\(10\\) num_gangs\\(10\\) num_workers\\(10\\) vector_length\\(10\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(3\\) wait\\(3\\) num_gangs\\(300\\) num_workers\\(300\\) vector_length\\(128\\) \\\] async\\(1\\) wait\\(1\\) num_gangs\\(1\\) num_workers\\(1\\) vector_length\\(1\\)" 1 "omplower" } }
-! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) \\\]" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) wait\\(-1\\) \\\]" 1 "omplower" } }
! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(1\\) wait\\(1\\) \\\]" 1 "omplower" } }
! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia\\) \\\[ async\\(2\\) wait\\(2\\) \\\]" 1 "omplower" } }
-! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(1\\) wait\\(1\\) \\\] async\\(-1\\)" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(1\\) wait\\(1\\) \\\] async\\(-1\\) wait\\(-1\\)" 1 "omplower" } }
! { dg-final { scan-tree-dump-times "acc loop device_type\\(nvidia\\) \\\[ tile\\(1\\) gang \\\] private\\(i1\\) private\\(i1\\.1\\)" 1 "omplower" } }
Index: include/gomp-constants.h
===================================================================
--- include/gomp-constants.h (revision 249620)
+++ include/gomp-constants.h (working copy)
@@ -249,13 +249,14 @@ enum gomp_map_kind
#define GOMP_LAUNCH_CODE_SHIFT 28
#define GOMP_LAUNCH_DEVICE_SHIFT 16
#define GOMP_LAUNCH_OP_SHIFT 0
+#define GOMP_LAUNCH_OP_MASK 0xffff
#define GOMP_LAUNCH_PACK(CODE,DEVICE,OP) \
(((CODE) << GOMP_LAUNCH_CODE_SHIFT) \
| ((DEVICE) << GOMP_LAUNCH_DEVICE_SHIFT) \
- | ((OP) << GOMP_LAUNCH_OP_SHIFT))
+ | (((OP) & GOMP_LAUNCH_OP_MASK) << GOMP_LAUNCH_OP_SHIFT))
#define GOMP_LAUNCH_CODE(X) (((X) >> GOMP_LAUNCH_CODE_SHIFT) & 0xf)
#define GOMP_LAUNCH_DEVICE(X) (((X) >> GOMP_LAUNCH_DEVICE_SHIFT) & 0xfff)
-#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
+#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & GOMP_LAUNCH_OP_MASK)
#define GOMP_LAUNCH_OP_MAX 0xffff
/* Bitmask to apply in order to find out the intended device of a target
Index: libgomp/libgomp-plugin.h
===================================================================
--- libgomp/libgomp-plugin.h (revision 249620)
+++ libgomp/libgomp-plugin.h (working copy)
@@ -55,6 +55,20 @@ enum offload_target_type
OFFLOAD_TARGET_TYPE_HSA = 7
};
+/* Opaque type to represent plugin-dependent implementation of an
+ OpenACC asynchronous queue. */
+struct goacc_asyncqueue;
+
+/* Used to keep a list of active asynchronous queues. */
+struct goacc_asyncqueue_list
+{
+ struct goacc_asyncqueue *aq;
+ struct goacc_asyncqueue_list *next;
+};
+
+typedef struct goacc_asyncqueue *goacc_aq;
+typedef struct goacc_asyncqueue_list *goacc_aq_list;
+
/* Auxiliary struct, used for transferring pairs of addresses from plugin
to libgomp. */
struct addr_pair
@@ -99,22 +113,31 @@ extern bool GOMP_OFFLOAD_dev2dev (int, void *, con
extern bool GOMP_OFFLOAD_can_run (void *);
extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *);
+
extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
- void **, int, unsigned *, void *);
-extern void GOMP_OFFLOAD_openacc_register_async_cleanup (void *, int);
-extern int GOMP_OFFLOAD_openacc_async_test (int);
-extern int GOMP_OFFLOAD_openacc_async_test_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait (int);
-extern void GOMP_OFFLOAD_openacc_async_wait_async (int, int);
-extern void GOMP_OFFLOAD_openacc_async_wait_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait_all_async (int);
-extern void GOMP_OFFLOAD_openacc_async_set_async (int);
+ void **, unsigned *, void *);
+extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
+ void **, unsigned *, void *,
+ struct goacc_asyncqueue *);
+extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
+extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
+ struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
+ void (*)(void *), void *);
+extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t,
+ struct goacc_asyncqueue *);
+extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t,
+ struct goacc_asyncqueue *);
extern void *GOMP_OFFLOAD_openacc_create_thread_data (int);
extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
extern void *GOMP_OFFLOAD_openacc_cuda_get_current_device (void);
extern void *GOMP_OFFLOAD_openacc_cuda_get_current_context (void);
-extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (int);
-extern int GOMP_OFFLOAD_openacc_cuda_set_stream (int, void *);
+extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *,
+ void *);
#ifdef __cplusplus
}
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h (revision 249620)
+++ libgomp/libgomp.h (working copy)
@@ -870,19 +870,23 @@ typedef struct acc_dispatch_t
/* Execute. */
__typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
- /* Async cleanup callback registration. */
- __typeof (GOMP_OFFLOAD_openacc_register_async_cleanup)
- *register_async_cleanup_func;
+ struct {
+ gomp_mutex_t lock;
+ int nasyncqueue;
+ struct goacc_asyncqueue **asyncqueue;
+ struct goacc_asyncqueue_list *active;
+
+ __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_test) *test_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_synchronize) *synchronize_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_serialize) *serialize_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
- /* Asynchronous routines. */
- __typeof (GOMP_OFFLOAD_openacc_async_test) *async_test_func;
- __typeof (GOMP_OFFLOAD_openacc_async_test_all) *async_test_all_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait) *async_wait_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait_async) *async_wait_async_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait_all) *async_wait_all_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait_all_async)
- *async_wait_all_async_func;
- __typeof (GOMP_OFFLOAD_openacc_async_set_async) *async_set_async_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
+ } async;
/* Create/destroy TLS data. */
__typeof (GOMP_OFFLOAD_openacc_create_thread_data) *create_thread_data_func;
@@ -974,17 +978,31 @@ enum gomp_map_vars_kind
GOMP_MAP_VARS_ENTER_DATA
};
-extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *);
+extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int);
extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int);
extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
unsigned short *);
-
+extern void gomp_copy_host2dev (struct gomp_device_descr *,
+ struct goacc_asyncqueue *,
+ void *, const void *, size_t);
+extern void gomp_copy_dev2host (struct gomp_device_descr *,
+ struct goacc_asyncqueue *,
+ void *, const void *, size_t);
extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
size_t, void **, void **,
size_t *, void *, bool,
enum gomp_map_vars_kind);
+extern struct target_mem_desc *gomp_map_vars_async (struct gomp_device_descr *,
+ struct goacc_asyncqueue *,
+ size_t, void **, void **,
+ size_t *, void *, bool,
+ enum gomp_map_vars_kind);
+extern void gomp_unmap_tgt (struct target_mem_desc *);
extern void gomp_unmap_vars (struct target_mem_desc *, bool);
+extern void gomp_unmap_vars_async (struct target_mem_desc *, bool,
+ struct goacc_asyncqueue *);
extern void gomp_init_device (struct gomp_device_descr *);
+extern bool gomp_fini_device (struct gomp_device_descr *);
extern void gomp_unload_device (struct gomp_device_descr *);
extern bool gomp_offload_target_available_p (int);
extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key);
Index: libgomp/oacc-async.c
===================================================================
--- libgomp/oacc-async.c (revision 249620)
+++ libgomp/oacc-async.c (working copy)
@@ -27,10 +27,85 @@
<http://www.gnu.org/licenses/>. */
#include <assert.h>
+#include <string.h>
#include "openacc.h"
#include "libgomp.h"
#include "oacc-int.h"
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+ struct goacc_thread *thr = goacc_thread ();
+ if (!thr || !thr->dev)
+ gomp_fatal ("no device active");
+ return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+ struct goacc_thread *thr = goacc_thread ();
+
+ if (!thr || !thr->dev)
+ gomp_fatal ("no device active");
+
+ return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+ /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream. */
+ if (async == acc_async_noval)
+ async = thr->default_async;
+
+ if (async == acc_async_sync)
+ return NULL;
+
+ if (async < 0)
+ gomp_fatal ("bad async %d", async);
+
+ struct gomp_device_descr *dev = thr->dev;
+
+ if (!create
+ && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+ return NULL;
+
+ gomp_mutex_lock (&dev->openacc.async.lock);
+ if (async >= dev->openacc.async.nasyncqueue)
+ {
+ int diff = async + 1 - dev->openacc.async.nasyncqueue;
+ dev->openacc.async.asyncqueue
+ = gomp_realloc (dev->openacc.async.asyncqueue,
+ sizeof (goacc_aq) * (async + 1));
+ memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+ 0, sizeof (goacc_aq) * diff);
+ dev->openacc.async.nasyncqueue = async + 1;
+ }
+
+ if (!dev->openacc.async.asyncqueue[async])
+ {
+ dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+ /* Link new async queue into active list. */
+ goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+ n->aq = dev->openacc.async.asyncqueue[async];
+ n->next = dev->openacc.async.active;
+ dev->openacc.async.active = n;
+ }
+ gomp_mutex_unlock (&dev->openacc.async.lock);
+ return dev->openacc.async.asyncqueue[async];
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+ struct goacc_thread *thr = get_goacc_thread ();
+ return lookup_goacc_asyncqueue (thr, true, async);
+}
+
int
acc_async_test (int async)
{
@@ -54,15 +129,14 @@ acc_async_test (int async)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- int res = thr->dev->openacc.async_test_func (async);
-
if (profiling_setup_p)
{
thr->prof_info = NULL;
thr->api_info = NULL;
}
-
- return res;
+
+ goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+ return thr->dev->openacc.async.test_func (aq);
}
int
@@ -69,7 +143,6 @@ int
acc_async_test_all (void)
{
struct goacc_thread *thr = goacc_thread ();
-
acc_prof_info prof_info;
acc_api_info api_info;
bool profiling_setup_p
@@ -79,8 +152,6 @@ acc_async_test_all (void)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- int res = thr->dev->openacc.async_test_all_func ();
-
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -87,7 +158,17 @@ acc_async_test_all (void)
thr->api_info = NULL;
}
- return res;
+ int ret = 1;
+ /*struct goacc_thread *thr = get_goacc_thread ();*/
+ gomp_mutex_lock (&thr->dev->openacc.async.lock);
+ for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+ if (!thr->dev->openacc.async.test_func (l->aq))
+ {
+ ret = 0;
+ break;
+ }
+ gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+ return ret;
}
void
@@ -113,7 +194,8 @@ acc_wait (int async)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_func (async);
+ goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+ thr->dev->openacc.async.synchronize_func (aq);
if (profiling_setup_p)
{
@@ -153,8 +235,16 @@ acc_wait_async (int async1, int async2)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_async_func (async1, async2);
+ goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+ goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+ if (!aq1)
+ gomp_fatal ("invalid async 1");
+ if (aq1 == aq2)
+ gomp_fatal ("identical parameters");
+ thr->dev->openacc.async.synchronize_func (aq1);
+ thr->dev->openacc.async.serialize_func (aq1, aq2);
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -176,8 +266,13 @@ acc_wait_all (void)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_all_func ();
+ struct gomp_device_descr *dev = get_goacc_thread_device ();
+ gomp_mutex_lock (&dev->openacc.async.lock);
+ for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+ dev->openacc.async.synchronize_func (l->aq);
+ gomp_mutex_unlock (&dev->openacc.async.lock);
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -219,8 +314,17 @@ acc_wait_all_async (int async)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_all_async_func (async);
+ goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
+ gomp_mutex_lock (&thr->dev->openacc.async.lock);
+ for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+ {
+ thr->dev->openacc.async.synchronize_func (l->aq);
+ if (waiting_queue)
+ thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+ }
+ gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -251,10 +355,72 @@ acc_set_default_async (int async)
if (async < acc_async_sync)
gomp_fatal ("invalid async argument: %d", async);
- struct goacc_thread *thr = goacc_thread ();
+ struct goacc_thread *thr = get_goacc_thread ();
+ thr->default_async = async;
+}
- if (!thr || !thr->dev)
- gomp_fatal ("no device active");
+static void
+goacc_async_unmap_tgt (void *ptr)
+{
+ struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
- thr->default_async = async;
+ if (tgt->refcount > 1)
+ tgt->refcount--;
+ else
+ gomp_unmap_tgt (tgt);
}
+
+attribute_hidden void
+goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt,
+ struct goacc_asyncqueue *aq)
+{
+ struct gomp_device_descr *devicep = tgt->device_descr;
+
+ /* Increment reference to delay freeing of device memory until callback
+ has triggered. */
+ tgt->refcount++;
+ gomp_unmap_vars_async (tgt, true, aq);
+ devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
+ (void *) tgt);
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq, void *ptr)
+{
+ if (!aq)
+ free (ptr);
+ else
+ devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+ gomp_mutex_init (&devicep->openacc.async.lock);
+ devicep->openacc.async.nasyncqueue = 0;
+ devicep->openacc.async.asyncqueue = NULL;
+ devicep->openacc.async.active = NULL;
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+ bool ret = true;
+ if (devicep->openacc.async.nasyncqueue > 0)
+ {
+ goacc_aq_list next;
+ for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+ {
+ ret &= devicep->openacc.async.destruct_func (l->aq);
+ next = l->next;
+ free (l);
+ }
+ free (devicep->openacc.async.asyncqueue);
+ devicep->openacc.async.nasyncqueue = 0;
+ devicep->openacc.async.asyncqueue = NULL;
+ devicep->openacc.async.active = NULL;
+ }
+ gomp_mutex_destroy (&devicep->openacc.async.lock);
+ return ret;
+}
Index: libgomp/oacc-cuda.c
===================================================================
--- libgomp/oacc-cuda.c (revision 249620)
+++ libgomp/oacc-cuda.c (working copy)
@@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
prof_info.async_queue = prof_info.async;
}
- void *ret = NULL;
if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
- ret = thr->dev->openacc.cuda.get_stream_func (async);
-
- if (profiling_setup_p)
{
- thr->prof_info = NULL;
- thr->api_info = NULL;
+ goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+ return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
}
-
- return ret;
+ return NULL;
}
int
@@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream)
int ret = -1;
if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
- ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
+ {
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ gomp_mutex_lock (&thr->dev->openacc.async.lock);
+ ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+ gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+ }
if (profiling_setup_p)
{
Index: libgomp/oacc-host.c
===================================================================
--- libgomp/oacc-host.c (revision 249620)
+++ libgomp/oacc-host.c (working copy)
@@ -140,7 +140,6 @@ host_openacc_exec (void (*fn) (void *),
size_t mapnum __attribute__ ((unused)),
void **hostaddrs,
void **devaddrs __attribute__ ((unused)),
- int async __attribute__ ((unused)),
unsigned *dims __attribute__ ((unused)),
void *targ_mem_desc __attribute__ ((unused)))
{
@@ -148,49 +147,81 @@ host_openacc_exec (void (*fn) (void *),
}
static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
- int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+ size_t mapnum __attribute__ ((unused)),
+ void **hostaddrs,
+ void **devaddrs __attribute__ ((unused)),
+ unsigned *dims __attribute__ ((unused)),
+ void *targ_mem_desc __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq __attribute__ ((unused)))
{
+ fn (hostaddrs);
}
static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
{
return 1;
}
-static int
-host_openacc_async_test_all (void)
+static void
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
{
- return 1;
}
static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+ __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq2
+ __attribute__ ((unused)))
{
}
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
- int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+ void *dst __attribute__ ((unused)),
+ const void *src __attribute__ ((unused)),
+ size_t n __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
{
+ return true;
}
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+ void *dst __attribute__ ((unused)),
+ const void *src __attribute__ ((unused)),
+ size_t n __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
{
+ return true;
}
static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+ __attribute__ ((unused)),
+ void (*callback_fn)(void *)
+ __attribute__ ((unused)),
+ void *userptr __attribute__ ((unused)))
{
}
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
{
+ return NULL;
}
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
+{
+ return true;
+}
+
static void *
host_openacc_create_thread_data (int ord __attribute__ ((unused)))
{
@@ -235,16 +266,18 @@ static struct gomp_device_descr host_dispatch =
.exec_func = host_openacc_exec,
- .register_async_cleanup_func = host_openacc_register_async_cleanup,
+ .async = {
+ .construct_func = host_openacc_async_construct,
+ .destruct_func = host_openacc_async_destruct,
+ .test_func = host_openacc_async_test,
+ .synchronize_func = host_openacc_async_synchronize,
+ .serialize_func = host_openacc_async_serialize,
+ .queue_callback_func = host_openacc_async_queue_callback,
+ .exec_func = host_openacc_async_exec,
+ .dev2host_func = host_openacc_async_dev2host,
+ .host2dev_func = host_openacc_async_host2dev,
+ },
- .async_test_func = host_openacc_async_test,
- .async_test_all_func = host_openacc_async_test_all,
- .async_wait_func = host_openacc_async_wait,
- .async_wait_async_func = host_openacc_async_wait_async,
- .async_wait_all_func = host_openacc_async_wait_all,
- .async_wait_all_async_func = host_openacc_async_wait_all_async,
- .async_set_async_func = host_openacc_async_set_async,
-
.create_thread_data_func = host_openacc_create_thread_data,
.destroy_thread_data_func = host_openacc_destroy_thread_data,
Index: libgomp/oacc-init.c
===================================================================
--- libgomp/oacc-init.c (revision 249620)
+++ libgomp/oacc-init.c (working copy)
@@ -390,7 +390,7 @@ acc_shutdown_1 (acc_device_t d)
if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
{
devices_active = true;
- ret &= acc_dev->fini_device_func (acc_dev->target_id);
+ ret &= gomp_fini_device (acc_dev);
acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
}
gomp_mutex_unlock (&acc_dev->lock);
@@ -513,8 +513,6 @@ goacc_attach_host_thread_to_device (int ord)
= acc_dev->openacc.create_thread_data_func (ord);
thr->default_async = acc_async_default;
-
- acc_dev->openacc.async_set_async_func (acc_async_sync);
}
/* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
Index: libgomp/oacc-int.h
===================================================================
--- libgomp/oacc-int.h (revision 249620)
+++ libgomp/oacc-int.h (working copy)
@@ -109,6 +109,15 @@ void goacc_restore_bind (void);
void goacc_lazy_initialize (void);
void goacc_host_init (void);
+void goacc_init_asyncqueues (struct gomp_device_descr *);
+bool goacc_fini_asyncqueues (struct gomp_device_descr *);
+void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
+ struct goacc_asyncqueue *);
+void goacc_async_free (struct gomp_device_descr *,
+ struct goacc_asyncqueue *, void *);
+struct goacc_asyncqueue *get_goacc_asyncqueue (int);
+struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread *, bool, int);
+
void goacc_profiling_initialize (void);
bool goacc_profiling_setup_p (struct goacc_thread *,
acc_prof_info *, acc_api_info *);
Index: libgomp/oacc-mem.c
===================================================================
--- libgomp/oacc-mem.c (revision 249620)
+++ libgomp/oacc-mem.c (working copy)
@@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h,
goto out;
}
- if (async > acc_async_sync)
- thr->dev->openacc.async_set_async_func (async);
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ if (from)
+ gomp_copy_dev2host (thr->dev, aq, h, d, s);
+ else
+ gomp_copy_host2dev (thr->dev, aq, d, h, s);
- bool ret = (from
- ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
- : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
- if (async > acc_async_sync)
- thr->dev->openacc.async_set_async_func (acc_async_sync);
-
- if (!ret)
- gomp_fatal ("error in %s", libfnname);
-
out:
if (profiling_setup_p)
{
@@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s)
gomp_mutex_unlock (&acc_dev->lock);
- return n != NULL;
+ return (n ? 1 : 0);
}
/* Create a mapping for host [H,+S] -> device [D,+S] */
@@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s
gomp_mutex_unlock (&acc_dev->lock);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
+ goacc_aq aq = get_goacc_asyncqueue (async);
- tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
- GOMP_MAP_VARS_OPENACC);
+ tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+ &kinds, true, GOMP_MAP_VARS_OPENACC);
/* Initialize dynamic refcount. */
tgt->list[0].key->dynamic_refcount = 1;
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
gomp_mutex_lock (&acc_dev->lock);
d = tgt->to_free;
@@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int
if (f & FLAG_COPYOUT)
{
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
- acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ gomp_copy_dev2host (acc_dev, aq, h, d, s);
}
gomp_remove_var (acc_dev, n);
}
@@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in
d = (void *) (n->tgt->tgt_start + n->tgt_offset
+ (uintptr_t) h - n->host_start);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
+ goacc_aq aq = get_goacc_asyncqueue (async);
if (is_dev)
- acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+ gomp_copy_host2dev (acc_dev, aq, d, h, s);
else
- acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
+ gomp_copy_dev2host (acc_dev, aq, h, d, s);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
gomp_mutex_unlock (&acc_dev->lock);
-
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m
void
gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
- void *kinds)
+ void *kinds, int async)
{
struct target_mem_desc *tgt;
struct goacc_thread *thr = goacc_thread ();
@@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
}
gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__);
- tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
- NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+ NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__);
/* Initialize dynamic refcount. */
@@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
t->list[i].copy_from = force_copyfrom ? 1 : 0;
break;
}
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
- gomp_unmap_vars (t, true);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+ /* If running synchronously, unmap immediately. */
+ if (async < acc_async_noval)
+ gomp_unmap_vars (t, true);
+ else
+ {
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ goacc_async_copyout_unmap_vars (t, aq);
+ }
}
gomp_mutex_unlock (&acc_dev->lock);
Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c (revision 249620)
+++ libgomp/oacc-parallel.c (working copy)
@@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
fn (hostaddrs);
goto out;
}
-
+ else if (profiling_dispatch_p)
+ api_info.device_api = acc_device_api_cuda;
+
/* Default: let the runtime choose. */
for (i = 0; i != GOMP_DIM_MAX; i++)
dims[i] = 0;
@@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void
case GOMP_LAUNCH_WAIT:
{
- unsigned num_waits = GOMP_LAUNCH_OP (tag);
+ /* Be careful to cast the op field as a signed 16-bit, and
+ sign-extend to full integer. */
+ int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
- if (num_waits)
+ if (num_waits > 0)
goacc_wait (async, num_waits, &ap);
+ else if (num_waits == acc_async_noval)
+ acc_wait_all_async (async);
break;
}
@@ -274,8 +280,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void
}
va_end (ap);
- acc_dev->openacc.async_set_async_func (async);
-
if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
{
k.host_start = (uintptr_t) fn;
@@ -307,8 +311,11 @@ GOACC_parallel_keyed (int device, void (*fn) (void
goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
&api_info);
}
- tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
- GOMP_MAP_VARS_OPENACC);
+
+ goacc_aq aq = get_goacc_asyncqueue (async);
+
+ tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+ true, GOMP_MAP_VARS_OPENACC);
if (profiling_dispatch_p)
{
prof_info.event_type = acc_ev_enter_data_end;
@@ -329,14 +336,10 @@ GOACC_parallel_keyed (int device, void (*fn) (void
devaddrs[i] = NULL;
}
- acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
- async, dims, tgt);
-
- /* If running synchronously, unmap immediately. */
- bool copyfrom = true;
- if (async < acc_async_noval)
+ if (aq == NULL)
{
- unmap:
+ acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+ dims, tgt);
if (profiling_dispatch_p)
{
prof_info.event_type = acc_ev_exit_data_start;
@@ -346,7 +349,8 @@ GOACC_parallel_keyed (int device, void (*fn) (void
goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
&api_info);
}
- gomp_unmap_vars (tgt, copyfrom);
+ /* If running synchronously, unmap immediately. */
+ gomp_unmap_vars (tgt, true);
if (profiling_dispatch_p)
{
prof_info.event_type = acc_ev_exit_data_end;
@@ -358,27 +362,11 @@ GOACC_parallel_keyed (int device, void (*fn) (void
}
else
{
- bool async_unmap = false;
- for (size_t i = 0; i < tgt->list_count; i++)
- {
- splay_tree_key k = tgt->list[i].key;
- if (k && k->refcount == 1)
- {
- async_unmap = true;
- break;
- }
- }
- if (async_unmap)
- tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
- else
- {
- copyfrom = false;
- goto unmap;
- }
+ acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+ dims, tgt, aq);
+ goacc_async_copyout_unmap_vars (tgt, aq);
}
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
out:
if (profiling_dispatch_p)
{
@@ -724,7 +712,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
goto out;
}
- if (num_waits)
+ if (num_waits > 0)
{
va_list ap;
@@ -732,9 +720,9 @@ GOACC_enter_exit_data (int device, size_t mapnum,
goacc_wait (async, num_waits, &ap);
va_end (ap);
}
+ else if (num_waits == acc_async_noval)
+ acc_wait_all_async (async);
- acc_dev->openacc.async_set_async_func (async);
-
/* In c, non-pointers and arrays are represented by a single data clause.
Dynamically allocated arrays and subarrays are represented by a data
clause followed by an internal GOMP_MAP_POINTER.
@@ -783,7 +771,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
&sizes[i], &kinds[i]);
else
gomp_acc_insert_pointer (pointer, &hostaddrs[i],
- &sizes[i], &kinds[i]);
+ &sizes[i], &kinds[i], async);
/* Increment 'i' by two because OpenACC requires fortran
arrays to be contiguous, so each PSET is associated with
one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -808,9 +796,9 @@ GOACC_enter_exit_data (int device, size_t mapnum,
if (acc_is_present (hostaddrs[i], sizes[i]))
{
if (finalize)
- acc_delete_finalize (hostaddrs[i], sizes[i]);
+ acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
else
- acc_delete (hostaddrs[i], sizes[i]);
+ acc_delete_async (hostaddrs[i], sizes[i], async);
}
break;
case GOMP_MAP_DECLARE_DEALLOCATE:
@@ -817,9 +805,9 @@ GOACC_enter_exit_data (int device, size_t mapnum,
case GOMP_MAP_FROM:
case GOMP_MAP_FORCE_FROM:
if (finalize)
- acc_copyout_finalize (hostaddrs[i], sizes[i]);
+ acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
else
- acc_copyout (hostaddrs[i], sizes[i]);
+ acc_copyout_async (hostaddrs[i], sizes[i], async);
break;
default:
gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -844,8 +832,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
}
}
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
out:
if (profiling_dispatch_p)
{
@@ -868,18 +854,22 @@ goacc_wait (int async, int num_waits, va_list *ap)
while (num_waits--)
{
int qid = va_arg (*ap, int);
-
- if (acc_async_test (qid))
+ goacc_aq aq = get_goacc_asyncqueue (qid);
+ if (acc_dev->openacc.async.test_func (aq))
continue;
-
if (async == acc_async_sync)
- acc_wait (qid);
+ acc_dev->openacc.async.synchronize_func (aq);
else if (qid == async)
- ;/* If we're waiting on the same asynchronous queue as we're
+ /* If we're waiting on the same asynchronous queue as we're
launching on, the queue itself will order work as
required, so there's no need to wait explicitly. */
+ ;
else
- acc_dev->openacc.async_wait_async_func (qid, async);
+ {
+ goacc_aq aq2 = get_goacc_asyncqueue (async);
+ acc_dev->openacc.async.synchronize_func (aq);
+ acc_dev->openacc.async.serialize_func (aq, aq2);
+ }
}
}
@@ -957,7 +947,7 @@ GOACC_update (int device, size_t mapnum,
goto out;
}
- if (num_waits)
+ if (num_waits > 0)
{
va_list ap;
@@ -965,9 +955,9 @@ GOACC_update (int device, size_t mapnum,
goacc_wait (async, num_waits, &ap);
va_end (ap);
}
+ else if (num_waits == acc_async_noval)
+ acc_wait_all_async (async);
- acc_dev->openacc.async_set_async_func (async);
-
bool update_device = false;
for (i = 0; i < mapnum; ++i)
{
@@ -1007,7 +997,7 @@ GOACC_update (int device, size_t mapnum,
/* Fallthru */
case GOMP_MAP_FORCE_TO:
update_device = true;
- acc_update_device (hostaddrs[i], sizes[i]);
+ acc_update_device_async (hostaddrs[i], sizes[i], async);
break;
case GOMP_MAP_FROM:
@@ -1019,7 +1009,7 @@ GOACC_update (int device, size_t mapnum,
/* Fallthru */
case GOMP_MAP_FORCE_FROM:
update_device = false;
- acc_update_self (hostaddrs[i], sizes[i]);
+ acc_update_self_async (hostaddrs[i], sizes[i], async);
break;
default:
@@ -1028,8 +1018,6 @@ GOACC_update (int device, size_t mapnum,
}
}
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
out:
if (profiling_dispatch_p)
{
@@ -1075,7 +1063,7 @@ GOACC_wait (int async, int num_waits, ...)
else if (async == acc_async_sync)
acc_wait_all ();
else if (async == acc_async_noval)
- thr->dev->openacc.async_wait_all_async_func (acc_async_noval);
+ acc_wait_all_async (async);
if (profiling_setup_p)
{
Index: libgomp/oacc-plugin.c
===================================================================
--- libgomp/oacc-plugin.c (revision 249620)
+++ libgomp/oacc-plugin.c (working copy)
@@ -30,17 +30,6 @@
#include "oacc-plugin.h"
#include "oacc-int.h"
-void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
-{
- struct target_mem_desc *tgt = ptr;
- struct gomp_device_descr *devicep = tgt->device_descr;
-
- devicep->openacc.async_set_async_func (async);
- gomp_unmap_vars (tgt, true);
- devicep->openacc.async_set_async_func (acc_async_sync);
-}
-
/* Return the target-specific part of the TLS data for the current thread. */
void *
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c (revision 249620)
+++ libgomp/plugin/plugin-nvptx.c (working copy)
@@ -96,21 +96,19 @@ cuda_error (CUresult r)
static unsigned int instantiated_devices = 0;
static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
-struct cuda_map
+/* NVPTX/CUDA specific definition of asynchronous queues. */
+struct goacc_asyncqueue
{
- CUdeviceptr d;
- size_t size;
- bool active;
- struct cuda_map *next;
+ CUstream cuda_stream;
+ pthread_mutex_t lock;
};
-struct ptx_stream
+struct nvptx_callback
{
- CUstream stream;
- pthread_t host_thread;
- bool multithreaded;
- struct cuda_map *map;
- struct ptx_stream *next;
+ void (*fn) (void *);
+ void *ptr;
+ struct goacc_asyncqueue *aq;
+ struct nvptx_callback *next;
};
/* Thread-specific data for PTX. */
@@ -117,179 +115,12 @@ static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTE
struct nvptx_thread
{
- struct ptx_stream *current_stream;
+ /* We currently have this embedded inside the plugin because libgomp manages
+ devices through integer target_ids. This might be better if using an
+ opaque target-specific pointer directly from gomp_device_descr. */
struct ptx_device *ptx_dev;
};
-static struct cuda_map *
-cuda_map_create (struct goacc_thread *thr, size_t size)
-{
- struct cuda_map *map = GOMP_PLUGIN_malloc (sizeof (struct cuda_map));
-
- assert (map);
-
- map->next = NULL;
- map->size = size;
- map->active = false;
-
- CUDA_CALL_ERET (NULL, cuMemAlloc, &map->d, size);
- assert (map->d);
-
- bool profiling_dispatch_p
- = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_alloc;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- /* Always implicit for "data mapping arguments for cuLaunchKernel". */
- data_event_info.data_event.implicit = 1;
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = size;
- data_event_info.data_event.host_ptr = NULL;
- data_event_info.data_event.device_ptr = (void *) map->d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- return map;
-}
-
-static void
-cuda_map_destroy (struct goacc_thread *thr, struct cuda_map *map)
-{
- CUDA_CALL_ASSERT (cuMemFree, map->d);
-
- bool profiling_dispatch_p
- = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_free;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- /* Always implicit for "data mapping arguments for cuLaunchKernel". */
- data_event_info.data_event.implicit = 1;
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = map->size;
- data_event_info.data_event.host_ptr = NULL;
- data_event_info.data_event.device_ptr = (void *) map->d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- free (map);
-}
-
-/* The following map_* routines manage the CUDA device memory that
- contains the data mapping arguments for cuLaunchKernel. Each
- asynchronous PTX stream may have multiple pending kernel
- invocations, which are launched in a FIFO order. As such, the map
- routines maintains a queue of cuLaunchKernel arguments.
-
- Calls to map_push and map_pop must be guarded by ptx_event_lock.
- Likewise, calls to map_init and map_fini are guarded by
- ptx_dev_lock inside GOMP_OFFLOAD_init_device and
- GOMP_OFFLOAD_fini_device, respectively. */
-
-static bool
-map_init (struct goacc_thread *thr, struct ptx_stream *s)
-{
- int size = getpagesize ();
-
- assert (s);
-
- s->map = cuda_map_create (thr, size);
-
- return true;
-}
-
-static bool
-map_fini (struct goacc_thread *thr, struct ptx_stream *s)
-{
- assert (s->map->next == NULL);
- assert (!s->map->active);
-
- cuda_map_destroy (thr, s->map);
-
- return true;
-}
-
-static void
-map_pop (struct goacc_thread *thr, struct ptx_stream *s)
-{
- struct cuda_map *next;
-
- assert (s != NULL);
-
- if (s->map->next == NULL)
- {
- s->map->active = false;
- return;
- }
-
- next = s->map->next;
- cuda_map_destroy (thr, s->map);
- s->map = next;
-}
-
-static CUdeviceptr
-map_push (struct goacc_thread *thr, struct ptx_stream *s, size_t size)
-{
- struct cuda_map *map = NULL, *t = NULL;
-
- assert (s);
- assert (s->map);
-
- /* Each PTX stream requires a separate data region to store the
- launch arguments for cuLaunchKernel. Allocate a new
- cuda_map and push it to the end of the list. */
- if (s->map->active)
- {
- map = cuda_map_create (thr, size);
-
- for (t = s->map; t->next != NULL; t = t->next)
- ;
-
- t->next = map;
- }
- else if (s->map->size < size)
- {
- cuda_map_destroy (thr, s->map);
- map = cuda_map_create (thr, size);
- }
- else
- map = s->map;
-
- s->map = map;
- s->map->active = true;
-
- return s->map->d;
-}
-
/* Target data function launch information. */
struct targ_fn_launch
@@ -342,22 +173,18 @@ struct ptx_image_data
struct ptx_image_data *next;
};
+struct ptx_free_block
+{
+ void *ptr;
+ struct ptx_free_block *next;
+};
+
struct ptx_device
{
CUcontext ctx;
bool ctx_shared;
CUdevice dev;
- struct ptx_stream *null_stream;
- /* All non-null streams associated with this device (actually context),
- either created implicitly or passed in from the user (via
- acc_set_cuda_stream). */
- struct ptx_stream *active_streams;
- struct {
- struct ptx_stream **arr;
- int size;
- } async_streams;
- /* A lock for use when manipulating the above stream list and array. */
- pthread_mutex_t stream_lock;
+
int ord;
bool overlap;
bool map;
@@ -381,32 +208,13 @@ struct ptx_device
struct ptx_image_data *images; /* Images loaded on device. */
pthread_mutex_t image_lock; /* Lock for above list. */
-
- struct ptx_device *next;
-};
-enum ptx_event_type
-{
- PTX_EVT_MEM,
- PTX_EVT_KNL,
- PTX_EVT_SYNC,
- PTX_EVT_ASYNC_CLEANUP
-};
+ struct ptx_free_block *free_blocks;
+ pthread_mutex_t free_blocks_lock;
-struct ptx_event
-{
- CUevent *evt;
- int type;
- void *addr;
- int ord;
- int val;
-
- struct ptx_event *next;
+ struct ptx_device *next;
};
-static pthread_mutex_t ptx_event_lock;
-static struct ptx_event *ptx_events;
-
static struct ptx_device **ptx_devices;
static inline struct nvptx_thread *
@@ -415,190 +223,6 @@ nvptx_thread (void)
return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
}
-static bool
-init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
-{
- int i;
- struct ptx_stream *null_stream
- = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
- null_stream->stream = NULL;
- null_stream->host_thread = pthread_self ();
- null_stream->multithreaded = true;
- if (!map_init (NULL, null_stream))
- return false;
-
- ptx_dev->null_stream = null_stream;
- ptx_dev->active_streams = NULL;
- pthread_mutex_init (&ptx_dev->stream_lock, NULL);
-
- if (concurrency < 1)
- concurrency = 1;
-
- /* This is just a guess -- make space for as many async streams as the
- current device is capable of concurrently executing. This can grow
- later as necessary. No streams are created yet. */
- ptx_dev->async_streams.arr
- = GOMP_PLUGIN_malloc (concurrency * sizeof (struct ptx_stream *));
- ptx_dev->async_streams.size = concurrency;
-
- for (i = 0; i < concurrency; i++)
- ptx_dev->async_streams.arr[i] = NULL;
-
- return true;
-}
-
-static bool
-fini_streams_for_device (struct ptx_device *ptx_dev)
-{
- free (ptx_dev->async_streams.arr);
-
- bool ret = true;
- while (ptx_dev->active_streams != NULL)
- {
- struct ptx_stream *s = ptx_dev->active_streams;
- ptx_dev->active_streams = ptx_dev->active_streams->next;
-
- ret &= map_fini (NULL, s);
-
- CUresult r = cuStreamDestroy (s->stream);
- if (r != CUDA_SUCCESS)
- {
- GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r));
- ret = false;
- }
- free (s);
- }
-
- ret &= map_fini (NULL, ptx_dev->null_stream);
- free (ptx_dev->null_stream);
- return ret;
-}
-
-/* Select a stream for (OpenACC-semantics) ASYNC argument for the current
- thread THREAD (and also current device/context). If CREATE is true, create
- the stream if it does not exist (or use EXISTING if it is non-NULL), and
- associate the stream with the same thread argument. Returns stream to use
- as result. */
-
-static struct ptx_stream *
-select_stream_for_async (int async, pthread_t thread, bool create,
- CUstream existing)
-{
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
- /* Local copy of TLS variable. */
- struct ptx_device *ptx_dev = nvthd->ptx_dev;
- struct ptx_stream *stream = NULL;
- int orig_async = async;
-
- /* The special value acc_async_noval (-1) maps to the thread-specific
- default async stream. */
- if (async == acc_async_noval)
- async = GOMP_PLUGIN_acc_thread_default_async ();
-
- if (create)
- pthread_mutex_lock (&ptx_dev->stream_lock);
-
- /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the
- null stream, and in fact better performance may be obtainable if it doesn't
- (because the null stream enforces overly-strict synchronisation with
- respect to other streams for legacy reasons, and that's probably not
- needed with OpenACC). Maybe investigate later. */
- if (async == acc_async_sync)
- stream = ptx_dev->null_stream;
- else if (async >= 0 && async < ptx_dev->async_streams.size
- && ptx_dev->async_streams.arr[async] && !(create && existing))
- stream = ptx_dev->async_streams.arr[async];
- else if (async >= 0 && create)
- {
- if (async >= ptx_dev->async_streams.size)
- {
- int i, newsize = ptx_dev->async_streams.size * 2;
-
- if (async >= newsize)
- newsize = async + 1;
-
- ptx_dev->async_streams.arr
- = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr,
- newsize * sizeof (struct ptx_stream *));
-
- for (i = ptx_dev->async_streams.size; i < newsize; i++)
- ptx_dev->async_streams.arr[i] = NULL;
-
- ptx_dev->async_streams.size = newsize;
- }
-
- /* Create a new stream on-demand if there isn't one already, or if we're
- setting a particular async value to an existing (externally-provided)
- stream. */
- if (!ptx_dev->async_streams.arr[async] || existing)
- {
- CUresult r;
- struct ptx_stream *s
- = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
- if (existing)
- s->stream = existing;
- else
- {
- r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT);
- if (r != CUDA_SUCCESS)
- {
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- GOMP_PLUGIN_fatal ("cuStreamCreate error: %s",
- cuda_error (r));
- }
- }
-
- /* If CREATE is true, we're going to be queueing some work on this
- stream. Associate it with the current host thread. */
- s->host_thread = thread;
- s->multithreaded = false;
-
- if (!map_init (thr, s))
- {
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- GOMP_PLUGIN_fatal ("map_init fail");
- }
-
- s->next = ptx_dev->active_streams;
- ptx_dev->active_streams = s;
- ptx_dev->async_streams.arr[async] = s;
- }
-
- stream = ptx_dev->async_streams.arr[async];
- }
- else if (async < 0)
- {
- if (create)
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- GOMP_PLUGIN_fatal ("bad async %d", async);
- }
-
- if (create)
- {
- assert (stream != NULL);
-
- /* If we're trying to use the same stream from different threads
- simultaneously, set stream->multithreaded to true. This affects the
- behaviour of acc_async_test_all and acc_wait_all, which are supposed to
- only wait for asynchronous launches from the same host thread they are
- invoked on. If multiple threads use the same async value, we make note
- of that here and fall back to testing/waiting for all threads in those
- functions. */
- if (thread != stream->host_thread)
- stream->multithreaded = true;
-
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- }
- else if (stream && !stream->multithreaded
- && !pthread_equal (stream->host_thread, thread))
- GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async);
-
- return stream;
-}
-
/* Initialize the device. Return TRUE on success, else FALSE. PTX_DEV_LOCK
should be locked on entry and remains locked on exit. */
@@ -611,9 +235,6 @@ nvptx_init (void)
return true;
CUDA_CALL (cuInit, 0);
- ptx_events = NULL;
- pthread_mutex_init (&ptx_event_lock, NULL);
-
CUDA_CALL (cuDeviceGetCount, &ndevs);
ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
* ndevs);
@@ -632,6 +253,11 @@ nvptx_attach_host_thread_to_device (int n)
CUcontext thd_ctx;
r = cuCtxGetDevice (&dev);
+ if (r == CUDA_ERROR_NOT_PERMITTED)
+ {
+ /* Assume we're in a CUDA callback, just return true. */
+ return true;
+ }
if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
{
GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
@@ -759,6 +385,9 @@ nvptx_open_device (int n)
ptx_dev->images = NULL;
pthread_mutex_init (&ptx_dev->image_lock, NULL);
+ ptx_dev->free_blocks = NULL;
+ pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
+
GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
"\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
"\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
@@ -775,9 +404,6 @@ nvptx_open_device (int n)
ptx_dev->max_registers_per_multiprocessor,
ptx_dev->max_shared_memory_per_multiprocessor);
- if (!init_streams_for_device (ptx_dev, async_engines))
- return NULL;
-
return ptx_dev;
}
@@ -787,9 +413,15 @@ nvptx_close_device (struct ptx_device *ptx_dev)
if (!ptx_dev)
return true;
- if (!fini_streams_for_device (ptx_dev))
- return false;
-
+ for (struct ptx_free_block *b = ptx_dev->free_blocks; b;)
+ {
+ struct ptx_free_block *b_next = b->next;
+ CUDA_CALL (cuMemFree, (CUdeviceptr) b->ptr);
+ free (b);
+ b = b_next;
+ }
+
+ pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
pthread_mutex_destroy (&ptx_dev->image_lock);
if (!ptx_dev->ctx_shared)
@@ -905,134 +537,14 @@ link_ptx (CUmodule *module, const struct targ_ptx_
}
static void
-event_gc (bool memmap_lockable)
-{
- struct ptx_event *ptx_event = ptx_events;
- struct ptx_event *async_cleanups = NULL;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- pthread_mutex_lock (&ptx_event_lock);
-
- while (ptx_event != NULL)
- {
- CUresult r;
- struct ptx_event *e = ptx_event;
-
- ptx_event = ptx_event->next;
-
- if (e->ord != nvthd->ptx_dev->ord)
- continue;
-
- r = cuEventQuery (*e->evt);
- if (r == CUDA_SUCCESS)
- {
- bool append_async = false;
- CUevent *te;
-
- te = e->evt;
-
- switch (e->type)
- {
- case PTX_EVT_MEM:
- case PTX_EVT_SYNC:
- break;
-
- case PTX_EVT_KNL:
- map_pop (thr, e->addr);
- break;
-
- case PTX_EVT_ASYNC_CLEANUP:
- {
- /* The function gomp_plugin_async_unmap_vars needs to claim the
- memory-map splay tree lock for the current device, so we
- can't call it when one of our callers has already claimed
- the lock. In that case, just delay the GC for this event
- until later. */
- if (!memmap_lockable)
- continue;
-
- append_async = true;
- }
- break;
- }
-
- cuEventDestroy (*te);
- free ((void *)te);
-
- /* Unlink 'e' from ptx_events list. */
- if (ptx_events == e)
- ptx_events = ptx_events->next;
- else
- {
- struct ptx_event *e_ = ptx_events;
- while (e_->next != e)
- e_ = e_->next;
- e_->next = e_->next->next;
- }
-
- if (append_async)
- {
- e->next = async_cleanups;
- async_cleanups = e;
- }
- else
- free (e);
- }
- }
-
- pthread_mutex_unlock (&ptx_event_lock);
-
- /* We have to do these here, after ptx_event_lock is released. */
- while (async_cleanups)
- {
- struct ptx_event *e = async_cleanups;
- async_cleanups = async_cleanups->next;
-
- GOMP_PLUGIN_async_unmap_vars (e->addr, e->val);
- free (e);
- }
-}
-
-static void
-event_add (enum ptx_event_type type, CUevent *e, void *h, int val)
-{
- struct ptx_event *ptx_event;
- struct nvptx_thread *nvthd = nvptx_thread ();
-
- assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC
- || type == PTX_EVT_ASYNC_CLEANUP);
-
- ptx_event = GOMP_PLUGIN_malloc (sizeof (struct ptx_event));
- ptx_event->type = type;
- ptx_event->evt = e;
- ptx_event->addr = h;
- ptx_event->ord = nvthd->ptx_dev->ord;
- ptx_event->val = val;
-
- pthread_mutex_lock (&ptx_event_lock);
-
- ptx_event->next = ptx_events;
- ptx_events = ptx_event;
-
- pthread_mutex_unlock (&ptx_event_lock);
-}
-
-static void
nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
- int async, unsigned *dims, void *targ_mem_desc)
+ unsigned *dims, void *targ_mem_desc,
+ CUdeviceptr dp, CUstream stream)
{
struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
CUfunction function;
- CUresult r;
int i;
- struct ptx_stream *dev_str;
void *kargs[1];
- void *hp;
- CUdeviceptr dp;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
- const char *maybe_abort_msg = "(perhaps abort was called)";
int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
int dev_size = nvptx_thread ()->ptx_dev->multiprocessor_count;
@@ -1044,9 +556,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
function = targ_fn->fn;
- dev_str = select_stream_for_async (async, pthread_self (), false, NULL);
- assert (dev_str == nvthd->current_stream);
-
/* Initialize the launch dimensions. Typically this is constant,
provided by the device compiler, but we must permit runtime
values. */
@@ -1175,61 +684,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
threads_per_block);
}
- /* This reserves a chunk of a pre-allocated page of memory mapped on both
- the host and the device. HP is a host pointer to the new chunk, and DP is
- the corresponding device pointer. */
- pthread_mutex_lock (&ptx_event_lock);
- dp = map_push (thr, dev_str, mapnum * sizeof (void *));
- pthread_mutex_unlock (&ptx_event_lock);
-
- GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);
-
- /* Copy the array of arguments to the mapped page. */
- hp = alloca(sizeof(void *) * mapnum);
- for (i = 0; i < mapnum; i++)
- ((void **) hp)[i] = devaddrs[i] != 0 ? devaddrs[i] : hostaddrs[i];
-
- /* Copy the (device) pointers to arguments to the device (dp and hp might in
- fact have the same value on a unified-memory system). */
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_start;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- /* Always implicit for "data mapping arguments for cuLaunchKernel". */
- data_event_info.data_event.implicit = 1;
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = mapnum * sizeof (void *);
- data_event_info.data_event.host_ptr = hp;
- data_event_info.data_event.device_ptr = (void *) dp;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, hp,
- mapnum * sizeof (void *));
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_end;
- data_event_info.data_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
GOMP_PLUGIN_debug (0, " %s: kernel %s: launch"
" gangs=%u, workers=%u, vectors=%u\n",
__FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
@@ -1241,7 +695,11 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
// num_workers ntid.y
// vector length ntid.x
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ acc_prof_info *prof_info = thr->prof_info;
acc_event_info enqueue_launch_event_info;
+ acc_api_info *api_info = thr->api_info;
+ bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
if (profiling_dispatch_p)
{
prof_info->event_type = acc_ev_enqueue_launch_start;
@@ -1269,11 +727,13 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
api_info);
}
+
kargs[0] = &dp;
CUDA_CALL_ASSERT (cuLaunchKernel, function,
dims[GOMP_DIM_GANG], 1, 1,
dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
- 0, dev_str->stream, kargs, 0);
+ 0, stream, kargs, 0);
+
if (profiling_dispatch_p)
{
prof_info->event_type = acc_ev_enqueue_launch_end;
@@ -1283,91 +743,8 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
api_info);
}
- acc_event_info wait_event_info;
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_start;
-
- wait_event_info.other_event.event_type = prof_info->event_type;
- wait_event_info.other_event.valid_bytes
- = _ACC_OTHER_EVENT_INFO_VALID_BYTES;
- wait_event_info.other_event.parent_construct
- /* TODO = compute_construct_event_info.other_event.parent_construct */
- = acc_construct_parallel; //TODO: kernels...
- wait_event_info.other_event.implicit = 1;
- wait_event_info.other_event.tool_info = NULL;
-
- api_info->device_api = acc_device_api_cuda;
- }
-#ifndef DISABLE_ASYNC
- if (async < acc_async_noval)
- {
- if (profiling_dispatch_p)
- {
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- r = cuStreamSynchronize (dev_str->stream);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_end;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- if (r == CUDA_ERROR_LAUNCH_FAILED)
- GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
- maybe_abort_msg);
- else if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
- }
- else
- {
- CUevent *e;
-
- e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-
- r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
- if (r == CUDA_ERROR_LAUNCH_FAILED)
- GOMP_PLUGIN_fatal ("cuEventCreate error: %s %s\n", cuda_error (r),
- maybe_abort_msg);
- else if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
-
- event_gc (true);
-
- CUDA_CALL_ASSERT (cuEventRecord, *e, dev_str->stream);
-
- event_add (PTX_EVT_KNL, e, (void *)dev_str, 0);
- }
-#else
- if (profiling_dispatch_p)
- {
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- r = cuCtxSynchronize ();
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_end;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- if (r == CUDA_ERROR_LAUNCH_FAILED)
- GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
- maybe_abort_msg);
- else if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-#endif
-
GOMP_PLUGIN_debug (0, " %s: kernel %s: finished\n", __FUNCTION__,
targ_fn->launch->fn);
-
-#ifndef DISABLE_ASYNC
- if (async < acc_async_noval)
-#endif
- map_pop (thr, dev_str);
}
void * openacc_get_current_cuda_context (void);
@@ -1410,8 +787,21 @@ nvptx_alloc (size_t s)
}
static bool
-nvptx_free (void *p)
+nvptx_free (void *p, struct ptx_device *ptx_dev)
{
+ /* Assume callback context if this is null. */
+ if (GOMP_PLUGIN_goacc_thread () == NULL)
+ {
+ struct ptx_free_block *n
+ = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
+ n->ptr = p;
+ pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+ n->next = ptx_dev->free_blocks;
+ ptx_dev->free_blocks = n;
+ pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+ return true;
+ }
+
CUdeviceptr pb;
size_t ps;
@@ -1423,478 +813,9 @@ static bool
}
CUDA_CALL (cuMemFree, (CUdeviceptr) p);
-
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- acc_prof_info *prof_info = thr->prof_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_free;
-
- acc_event_info data_event_info;
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- data_event_info.data_event.implicit = 1; //TODO
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = ps;
- data_event_info.data_event.host_ptr = NULL;
- data_event_info.data_event.device_ptr = p;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
return true;
}
-
-static bool
-nvptx_host2dev (void *d, const void *h, size_t s)
-{
- CUdeviceptr pb;
- size_t ps;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- if (!s)
- return true;
- if (!d)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
-
- CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
- if (!pb)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
- if (!h)
- {
- GOMP_PLUGIN_error ("invalid host address");
- return false;
- }
- if (d == h)
- {
- GOMP_PLUGIN_error ("invalid host or device address");
- return false;
- }
- if ((void *)(d + s) > (void *)(pb + ps))
- {
- GOMP_PLUGIN_error ("invalid size");
- return false;
- }
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_start;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- data_event_info.data_event.implicit = 1; //TODO
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = s;
- data_event_info.data_event.host_ptr = /* TODO */ (void *) h;
- data_event_info.data_event.device_ptr = d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
-#ifndef DISABLE_ASYNC
- if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
- {
- CUevent *e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
- CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
- event_gc (false);
- CUDA_CALL (cuMemcpyHtoDAsync,
- (CUdeviceptr) d, h, s, nvthd->current_stream->stream);
- CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
- event_add (PTX_EVT_MEM, e, (void *)h, 0);
- }
- else
-#endif
- CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) d, h, s);
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_end;
- data_event_info.data_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- return true;
-}
-
-static bool
-nvptx_dev2host (void *h, const void *d, size_t s)
-{
- CUdeviceptr pb;
- size_t ps;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- if (!s)
- return true;
- if (!d)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
-
- CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
- if (!pb)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
- if (!h)
- {
- GOMP_PLUGIN_error ("invalid host address");
- return false;
- }
- if (d == h)
- {
- GOMP_PLUGIN_error ("invalid host or device address");
- return false;
- }
- if ((void *)(d + s) > (void *)(pb + ps))
- {
- GOMP_PLUGIN_error ("invalid size");
- return false;
- }
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_download_start;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- data_event_info.data_event.implicit = 1; //TODO
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = s;
- data_event_info.data_event.host_ptr = h;
- data_event_info.data_event.device_ptr = /* TODO */ (void *) d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
-#ifndef DISABLE_ASYNC
- if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
- {
- CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
- CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
- event_gc (false);
- CUDA_CALL (cuMemcpyDtoHAsync,
- h, (CUdeviceptr) d, s, nvthd->current_stream->stream);
- CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
- event_add (PTX_EVT_MEM, e, (void *)h, 0);
- }
- else
-#endif
- CUDA_CALL (cuMemcpyDtoH, h, (CUdeviceptr) d, s);
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_download_end;
- data_event_info.data_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- return true;
-}
-
-static void
-nvptx_set_async (int async)
-{
- struct nvptx_thread *nvthd = nvptx_thread ();
- nvthd->current_stream
- = select_stream_for_async (async, pthread_self (), true, NULL);
-}
-
-static int
-nvptx_async_test (int async)
-{
- CUresult r;
- struct ptx_stream *s;
-
- s = select_stream_for_async (async, pthread_self (), false, NULL);
-
- if (!s)
- GOMP_PLUGIN_fatal ("unknown async %d", async);
-
- r = cuStreamQuery (s->stream);
- if (r == CUDA_SUCCESS)
- {
- /* The oacc-parallel.c:goacc_wait function calls this hook to determine
- whether all work has completed on this stream, and if so omits the call
- to the wait hook. If that happens, event_gc might not get called
- (which prevents variables from getting unmapped and their associated
- device storage freed), so call it here. */
- event_gc (true);
- return 1;
- }
- else if (r == CUDA_ERROR_NOT_READY)
- return 0;
-
- GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
- return 0;
-}
-
-static int
-nvptx_async_test_all (void)
-{
- struct ptx_stream *s;
- pthread_t self = pthread_self ();
- struct nvptx_thread *nvthd = nvptx_thread ();
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
- {
- if ((s->multithreaded || pthread_equal (s->host_thread, self))
- && cuStreamQuery (s->stream) == CUDA_ERROR_NOT_READY)
- {
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
- return 0;
- }
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
- event_gc (true);
-
- return 1;
-}
-
-static void
-nvptx_wait (int async)
-{
- struct ptx_stream *s;
-
- s = select_stream_for_async (async, pthread_self (), false, NULL);
- if (!s)
- GOMP_PLUGIN_fatal ("unknown async %d", async);
-
- GOMP_PLUGIN_debug (0, " %s: waiting on async=%d\n", __FUNCTION__, async);
-
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- bool profiling_dispatch_p
- = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
- acc_event_info wait_event_info;
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_wait_start;
-
- wait_event_info.other_event.event_type = prof_info->event_type;
- wait_event_info.other_event.valid_bytes
- = _ACC_OTHER_EVENT_INFO_VALID_BYTES;
- wait_event_info.other_event.parent_construct
- /* TODO = compute_construct_event_info.other_event.parent_construct */
- = acc_construct_parallel; //TODO: kernels...
- wait_event_info.other_event.implicit = 1;
- wait_event_info.other_event.tool_info = NULL;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_wait_end;
-
- wait_event_info.other_event.event_type = prof_info->event_type;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
-
- event_gc (true);
-}
-
-static void
-nvptx_wait_async (int async1, int async2)
-{
- CUevent *e;
- struct ptx_stream *s1, *s2;
- pthread_t self = pthread_self ();
-
- /* The stream that is waiting (rather than being waited for) doesn't
- necessarily have to exist already. */
- s2 = select_stream_for_async (async2, self, true, NULL);
-
- s1 = select_stream_for_async (async1, self, false, NULL);
- if (!s1)
- GOMP_PLUGIN_fatal ("invalid async 1\n");
-
- if (s1 == s2)
- GOMP_PLUGIN_fatal ("identical parameters");
-
- e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
- CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
- event_gc (true);
-
- CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream);
-
- event_add (PTX_EVT_SYNC, e, NULL, 0);
-
- CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0);
-}
-
-static void
-nvptx_wait_all (void)
-{
- CUresult r;
- struct ptx_stream *s;
- pthread_t self = pthread_self ();
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info wait_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- wait_event_info.other_event.valid_bytes
- = _ACC_OTHER_EVENT_INFO_VALID_BYTES;
- wait_event_info.other_event.parent_construct
- /* TODO = compute_construct_event_info.other_event.parent_construct */
- = acc_construct_parallel; //TODO: kernels...
- wait_event_info.other_event.implicit = 1;
- wait_event_info.other_event.tool_info = NULL;
-
- api_info->device_api = acc_device_api_cuda;
- }
-
- /* Wait for active streams initiated by this thread (or by multiple threads)
- to complete. */
- for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
- {
- if (s->multithreaded || pthread_equal (s->host_thread, self))
- {
- r = cuStreamQuery (s->stream);
- if (r == CUDA_SUCCESS)
- continue;
- else if (r != CUDA_ERROR_NOT_READY)
- GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_start;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info,
- &wait_event_info,
- api_info);
- }
- CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_end;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info,
- &wait_event_info,
- api_info);
- }
- }
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
- event_gc (true);
-}
-
-static void
-nvptx_wait_all_async (int async)
-{
- struct ptx_stream *waiting_stream, *other_stream;
- CUevent *e;
- struct nvptx_thread *nvthd = nvptx_thread ();
- pthread_t self = pthread_self ();
-
- /* The stream doing the waiting. This could be the first mention of the
- stream, so create it if necessary. */
- waiting_stream
- = select_stream_for_async (async, pthread_self (), true, NULL);
-
- /* Launches on the null stream already block on other streams in the
- context. */
- if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream)
- return;
-
- event_gc (true);
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- for (other_stream = nvthd->ptx_dev->active_streams;
- other_stream != NULL;
- other_stream = other_stream->next)
- {
- if (!other_stream->multithreaded
- && !pthread_equal (other_stream->host_thread, self))
- continue;
-
- e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
- CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
- /* Record an event on the waited-for stream. */
- CUDA_CALL_ASSERT (cuEventRecord, *e, other_stream->stream);
-
- event_add (PTX_EVT_SYNC, e, NULL, 0);
-
- CUDA_CALL_ASSERT (cuStreamWaitEvent, waiting_stream->stream, *e, 0);
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-}
-
static void *
nvptx_get_current_cuda_device (void)
{
@@ -1917,70 +838,6 @@ nvptx_get_current_cuda_context (void)
return nvthd->ptx_dev->ctx;
}
-static void *
-nvptx_get_cuda_stream (int async)
-{
- struct ptx_stream *s;
- struct nvptx_thread *nvthd = nvptx_thread ();
-
- if (!nvthd || !nvthd->ptx_dev)
- return NULL;
-
- s = select_stream_for_async (async, pthread_self (), false, NULL);
-
- return s ? s->stream : NULL;
-}
-
-static int
-nvptx_set_cuda_stream (int async, void *stream)
-{
- struct ptx_stream *oldstream;
- pthread_t self = pthread_self ();
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- if (async < 0)
- GOMP_PLUGIN_fatal ("bad async %d", async);
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- /* We have a list of active streams and an array mapping async values to
- entries of that list. We need to take "ownership" of the passed-in stream,
- and add it to our list, removing the previous entry also (if there was one)
- in order to prevent resource leaks. Note the potential for surprise
- here: maybe we should keep track of passed-in streams and leave it up to
- the user to tidy those up, but that doesn't work for stream handles
- returned from acc_get_cuda_stream above... */
-
- oldstream = select_stream_for_async (async, self, false, NULL);
-
- if (oldstream)
- {
- if (nvthd->ptx_dev->active_streams == oldstream)
- nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next;
- else
- {
- struct ptx_stream *s = nvthd->ptx_dev->active_streams;
- while (s->next != oldstream)
- s = s->next;
- s->next = s->next->next;
- }
-
- CUDA_CALL_ASSERT (cuStreamDestroy, oldstream->stream);
-
- if (!map_fini (thr, oldstream))
- GOMP_PLUGIN_fatal ("error when freeing host memory");
-
- free (oldstream);
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
- (void) select_stream_for_async (async, self, true, (CUstream) stream);
-
- return 1;
-}
-
/* Plugin entry points. */
const char *
@@ -2223,6 +1080,23 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
{
if (!nvptx_attach_host_thread_to_device (ord))
return NULL;
+
+ struct ptx_device *ptx_dev = ptx_devices[ord];
+ struct ptx_free_block *blocks, *tmp;
+
+ pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+ blocks = ptx_dev->free_blocks;
+ ptx_dev->free_blocks = NULL;
+ pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+
+ while (blocks)
+ {
+ tmp = blocks->next;
+ nvptx_free (blocks->ptr, ptx_dev);
+ free (blocks);
+ blocks = tmp;
+ }
+
return nvptx_alloc (size);
}
@@ -2230,84 +1104,165 @@ bool
GOMP_OFFLOAD_free (int ord, void *ptr)
{
return (nvptx_attach_host_thread_to_device (ord)
- && nvptx_free (ptr));
+ && nvptx_free (ptr, ptx_devices[ord]));
}
-bool
-GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
-{
- return (nvptx_attach_host_thread_to_device (ord)
- && nvptx_dev2host (dst, src, n));
-}
-
-bool
-GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
-{
- return (nvptx_attach_host_thread_to_device (ord)
- && nvptx_host2dev (dst, src, n));
-}
-
-void (*device_run) (int n, void *fn_ptr, void *vars) = NULL;
-
void
GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
void **hostaddrs, void **devaddrs,
- int async, unsigned *dims, void *targ_mem_desc)
+ unsigned *dims, void *targ_mem_desc)
{
- nvptx_exec (fn, mapnum, hostaddrs, devaddrs, async, dims, targ_mem_desc);
-}
+ GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);
-void
-GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc, int async)
-{
- struct nvptx_thread *nvthd = nvptx_thread ();
- CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
+ void **hp = NULL;
+ CUdeviceptr dp = 0;
- CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
- CUDA_CALL_ASSERT (cuEventRecord, *e, nvthd->current_stream->stream);
- event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc, async);
-}
+ if (mapnum > 0)
+ {
+ hp = alloca (mapnum * sizeof (void *));
+ for (int i = 0; i < mapnum; i++)
+ hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+ CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+ }
-int
-GOMP_OFFLOAD_openacc_async_test (int async)
-{
- return nvptx_async_test (async);
-}
+ /* Copy the (device) pointers to arguments to the device (dp and hp might in
+ fact have the same value on a unified-memory system). */
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ acc_prof_info *prof_info = thr->prof_info;
+ acc_event_info data_event_info;
+ acc_api_info *api_info = thr->api_info;
+ bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_start;
-int
-GOMP_OFFLOAD_openacc_async_test_all (void)
-{
- return nvptx_async_test_all ();
-}
+ data_event_info.data_event.event_type = prof_info->event_type;
+ data_event_info.data_event.valid_bytes
+ = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+ data_event_info.data_event.parent_construct
+ = acc_construct_parallel; //TODO
+ /* Always implicit for "data mapping arguments for cuLaunchKernel". */
+ data_event_info.data_event.implicit = 1;
+ data_event_info.data_event.tool_info = NULL;
+ data_event_info.data_event.var_name = NULL; //TODO
+ data_event_info.data_event.bytes = mapnum * sizeof (void *);
+ data_event_info.data_event.host_ptr = hp;
+ data_event_info.data_event.device_ptr = (void *) dp;
-void
-GOMP_OFFLOAD_openacc_async_wait (int async)
-{
- nvptx_wait (async);
-}
+ api_info->device_api = acc_device_api_cuda;
-void
-GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2)
-{
- nvptx_wait_async (async1, async2);
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ if (mapnum > 0)
+ CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
+ mapnum * sizeof (void *));
+
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_end;
+ data_event_info.data_event.event_type = prof_info->event_type;
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+ dp, NULL);
+
+ CUresult r = cuStreamSynchronize (NULL);
+ const char *maybe_abort_msg = "(perhaps abort was called)";
+ if (r == CUDA_ERROR_LAUNCH_FAILED)
+ GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+ maybe_abort_msg);
+ else if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+ CUDA_CALL_ASSERT (cuMemFree, dp);
}
-void
-GOMP_OFFLOAD_openacc_async_wait_all (void)
+static void
+cuda_free_argmem (void *ptr)
{
- nvptx_wait_all ();
+ void **block = (void **) ptr;
+ nvptx_free (block[0], (struct ptx_device *) block[1]);
+ free (block);
}
void
-GOMP_OFFLOAD_openacc_async_wait_all_async (int async)
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+ void **hostaddrs, void **devaddrs,
+ unsigned *dims, void *targ_mem_desc,
+ struct goacc_asyncqueue *aq)
{
- nvptx_wait_all_async (async);
-}
+ GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);
-void
-GOMP_OFFLOAD_openacc_async_set_async (int async)
-{
- nvptx_set_async (async);
+ void **hp = NULL;
+ CUdeviceptr dp = 0;
+ void **block = NULL;
+
+ if (mapnum > 0)
+ {
+ block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+ hp = block + 2;
+ for (int i = 0; i < mapnum; i++)
+ hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+ CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+ }
+
+ /* Copy the (device) pointers to arguments to the device (dp and hp might in
+ fact have the same value on a unified-memory system). */
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ acc_prof_info *prof_info = thr->prof_info;
+ acc_event_info data_event_info;
+ acc_api_info *api_info = thr->api_info;
+ bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_start;
+
+ data_event_info.data_event.event_type = prof_info->event_type;
+ data_event_info.data_event.valid_bytes
+ = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+ data_event_info.data_event.parent_construct
+ = acc_construct_parallel; //TODO
+ /* Always implicit for "data mapping arguments for cuLaunchKernel". */
+ data_event_info.data_event.implicit = 1;
+ data_event_info.data_event.tool_info = NULL;
+ data_event_info.data_event.var_name = NULL; //TODO
+ data_event_info.data_event.bytes = mapnum * sizeof (void *);
+ data_event_info.data_event.host_ptr = hp;
+ data_event_info.data_event.device_ptr = (void *) dp;
+
+ api_info->device_api = acc_device_api_cuda;
+
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ if (mapnum > 0)
+ {
+ CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
+ mapnum * sizeof (void *), aq->cuda_stream);
+ block[0] = (void *) dp;
+
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
+ block[1] = (void *) nvthd->ptx_dev;
+ }
+
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_end;
+ data_event_info.data_event.event_type = prof_info->event_type;
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+ dp, aq->cuda_stream);
+
+ if (mapnum > 0)
+ GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
}
void *
@@ -2329,7 +1284,6 @@ GOMP_OFFLOAD_openacc_create_thread_data (int ord)
if (!thd_ctx)
CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx);
- nvthd->current_stream = ptx_dev->null_stream;
nvthd->ptx_dev = ptx_dev;
return (void *) nvthd;
@@ -2354,17 +1308,169 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (voi
}
/* NOTE: This returns a CUstream, not a ptx_stream pointer. */
-
void *
-GOMP_OFFLOAD_openacc_cuda_get_stream (int async)
+GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
{
- return nvptx_get_cuda_stream (async);
+ return (void *) aq->cuda_stream;
}
/* NOTE: This takes a CUstream, not a ptx_stream pointer. */
+int
+GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
+{
+ if (aq->cuda_stream)
+ {
+ CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+ CUDA_CALL_ASSERT (cuStreamDestroy, aq->cuda_stream);
+ }
+ aq->cuda_stream = (CUstream) stream;
+ return 1;
+}
+
+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+ struct goacc_asyncqueue *aq
+ = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+ CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
+ return aq;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+{
+ CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
+ free (aq);
+ return true;
+}
+
int
-GOMP_OFFLOAD_openacc_cuda_set_stream (int async, void *stream)
+GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
{
- return nvptx_set_cuda_stream (async, stream);
+ CUresult r = cuStreamQuery (aq->cuda_stream);
+ if (r == CUDA_SUCCESS)
+ return 1;
+ if (r == CUDA_ERROR_NOT_READY)
+ return 0;
+
+ GOMP_PLUGIN_error ("cuStreamQuery error: %s", cuda_error (r));
+ return -1;
}
+
+void
+GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+{
+ CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
+ struct goacc_asyncqueue *aq2)
+{
+ CUevent e;
+ CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
+ CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
+ CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+}
+
+static void
+cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
+{
+ if (res != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+ struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
+ cb->fn (cb->ptr);
+ free (ptr);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
+ void (*callback_fn)(void *),
+ void *userptr)
+{
+ struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
+ b->fn = callback_fn;
+ b->ptr = userptr;
+ b->aq = aq;
+ CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
+ cuda_callback_wrapper, (void *) b, 0);
+}
+
+static bool
+cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
+{
+ CUdeviceptr pb;
+ size_t ps;
+ if (!s)
+ return true;
+ if (!d)
+ {
+ GOMP_PLUGIN_error ("invalid device address");
+ return false;
+ }
+ CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
+ if (!pb)
+ {
+ GOMP_PLUGIN_error ("invalid device address");
+ return false;
+ }
+ if (!h)
+ {
+ GOMP_PLUGIN_error ("invalid host address");
+ return false;
+ }
+ if (d == h)
+ {
+ GOMP_PLUGIN_error ("invalid host or device address");
+ return false;
+ }
+ if ((void *)(d + s) > (void *)(pb + ps))
+ {
+ GOMP_PLUGIN_error ("invalid size");
+ return false;
+ }
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (src, dst, n))
+ return false;
+ CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (dst, src, n))
+ return false;
+ CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
+ size_t n, struct goacc_asyncqueue *aq)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (src, dst, n))
+ return false;
+ CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
+ size_t n, struct goacc_asyncqueue *aq)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (dst, src, n))
+ return false;
+ CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
+ return true;
+}
Index: libgomp/target.c
===================================================================
--- libgomp/target.c (revision 249620)
+++ libgomp/target.c (working copy)
@@ -187,18 +187,44 @@ gomp_device_copy (struct gomp_device_descr *device
}
}
-static void
+static inline void
+goacc_device_copy_async (struct gomp_device_descr *devicep,
+ bool (*copy_func) (int, void *, const void *, size_t,
+ struct goacc_asyncqueue *),
+ const char *dst, void *dstaddr,
+ const char *src, const void *srcaddr,
+ size_t size, struct goacc_asyncqueue *aq)
+{
+ if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
+ {
+ gomp_mutex_unlock (&devicep->lock);
+ gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
+ src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
+ }
+}
+
+attribute_hidden void
gomp_copy_host2dev (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq,
void *d, const void *h, size_t sz)
{
- gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
+ if (aq)
+ goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+ "dev", d, "host", h, sz, aq);
+ else
+ gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
}
-static void
+attribute_hidden void
gomp_copy_dev2host (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq,
void *h, const void *d, size_t sz)
{
- gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
+ if (aq)
+ goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
+ "host", h, "dev", d, sz, aq);
+ else
+ gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
}
static void
@@ -216,7 +242,8 @@ gomp_free_device_memory (struct gomp_device_descr
Helper function of gomp_map_vars. */
static inline void
-gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
+gomp_map_vars_existing (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq, splay_tree_key oldn,
splay_tree_key newn, struct target_var_desc *tgt_var,
unsigned char kind)
{
@@ -238,7 +265,7 @@ static inline void
}
if (GOMP_MAP_ALWAYS_TO_P (kind))
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
+ newn->host_start - oldn->host_start),
(void *) newn->host_start,
@@ -256,8 +283,8 @@ get_kind (bool short_mapkind, void *kinds, int idx
}
static void
-gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr,
- uintptr_t target_offset, uintptr_t bias)
+gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+ uintptr_t host_ptr, uintptr_t target_offset, uintptr_t bias)
{
struct gomp_device_descr *devicep = tgt->device_descr;
struct splay_tree_s *mem_map = &devicep->mem_map;
@@ -268,7 +295,7 @@ static void
{
cur_node.tgt_offset = (uintptr_t) NULL;
/* FIXME: see comment about coalescing host/dev transfers below. */
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start + target_offset),
(void *) &cur_node.tgt_offset,
sizeof (void *));
@@ -291,7 +318,7 @@ static void
to initialize the pointer with. */
cur_node.tgt_offset -= bias;
/* FIXME: see comment about coalescing host/dev transfers below. */
- gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + target_offset),
+ gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
(void *) &cur_node.tgt_offset, sizeof (void *));
}
@@ -329,9 +356,9 @@ gomp_map_pset (struct target_mem_desc *tgt, uintpt
}
static void
-gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
- size_t first, size_t i, void **hostaddrs,
- size_t *sizes, void *kinds)
+gomp_map_fields_existing (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+ splay_tree_key n, size_t first, size_t i,
+ void **hostaddrs, size_t *sizes, void *kinds)
{
struct gomp_device_descr *devicep = tgt->device_descr;
struct splay_tree_s *mem_map = &devicep->mem_map;
@@ -348,7 +375,7 @@ static void
&& n2->tgt == n->tgt
&& n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
{
- gomp_map_vars_existing (devicep, n2, &cur_node,
+ gomp_map_vars_existing (devicep, aq, n2, &cur_node,
&tgt->list[i], kind & typemask);
return;
}
@@ -364,7 +391,7 @@ static void
&& n2->host_start - n->host_start
== n2->tgt_offset - n->tgt_offset)
{
- gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
kind & typemask);
return;
}
@@ -376,7 +403,7 @@ static void
&& n2->tgt == n->tgt
&& n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
{
- gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
kind & typemask);
return;
}
@@ -547,6 +574,18 @@ gomp_map_vars (struct gomp_device_descr *devicep,
void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
{
+ struct target_mem_desc *tgt;
+ tgt = gomp_map_vars_async (devicep, NULL, mapnum, hostaddrs, devaddrs,
+ sizes, kinds, short_mapkind, pragma_kind);
+ return tgt;
+}
+
+attribute_hidden struct target_mem_desc *
+gomp_map_vars_async (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq, size_t mapnum,
+ void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
+ bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
+{
size_t i, tgt_align, tgt_size, not_found_cnt = 0;
bool has_firstprivate = false;
const int rshift = short_mapkind ? 8 : 3;
@@ -665,7 +704,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
continue;
}
for (i = first; i <= last; i++)
- gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+ gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
sizes, kinds);
i--;
continue;
@@ -722,7 +761,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
else
n = splay_tree_lookup (mem_map, &cur_node);
if (n && n->refcount != REFCOUNT_LINK)
- gomp_map_vars_existing (devicep, n, &cur_node, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n, &cur_node, &tgt->list[i],
kind & typemask);
else
{
@@ -790,7 +829,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
if (n)
{
assert (n->refcount != REFCOUNT_LINK);
- gomp_map_vars_existing (devicep, n, &cur_node, row_desc,
+ gomp_map_vars_existing (devicep, aq, n, &cur_node, row_desc,
kind & typemask);
}
else
@@ -866,7 +905,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
tgt_size = (tgt_size + align - 1) & ~(align - 1);
tgt->list[i].offset = tgt_size;
len = sizes[i];
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start + tgt_size),
(void *) hostaddrs[i], len);
tgt_size += len;
@@ -900,7 +939,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
continue;
}
for (i = first; i <= last; i++)
- gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+ gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
sizes, kinds);
i--;
continue;
@@ -920,7 +959,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i - 1);
if (cur_node.tgt_offset)
cur_node.tgt_offset -= sizes[i];
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (n->tgt->tgt_start
+ n->tgt_offset
+ cur_node.host_start
@@ -950,7 +989,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
k->host_end = k->host_start + sizeof (void *);
splay_tree_key n = splay_tree_lookup (mem_map, k);
if (n && n->refcount != REFCOUNT_LINK)
- gomp_map_vars_existing (devicep, n, k, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n, k, &tgt->list[i],
kind & typemask);
else
{
@@ -1006,7 +1045,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
/* FIXME: Perhaps add some smarts, like if copying
several adjacent fields from host to target, use some
host buffer to avoid sending each var individually. */
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1013,7 +1052,8 @@ gomp_map_vars (struct gomp_device_descr *devicep,
k->host_end - k->host_start);
break;
case GOMP_MAP_POINTER:
- gomp_map_pointer (tgt, (uintptr_t) *(void **) k->host_start,
+ gomp_map_pointer (tgt, aq,
+ (uintptr_t) *(void **) k->host_start,
k->tgt_offset, sizes[i]);
break;
case GOMP_MAP_TO_PSET:
@@ -1042,7 +1082,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
sizes[j]);
tptr = *(uintptr_t *) hostaddrs[i];
*(uintptr_t *) hostaddrs[i]= toffset;
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1052,7 +1092,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
found_pointer = true;
}
if (!found_pointer)
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1079,7 +1119,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
break;
case GOMP_MAP_FORCE_DEVICEPTR:
assert (k->host_end - k->host_start == sizeof (void *));
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1096,9 +1136,8 @@ gomp_map_vars (struct gomp_device_descr *devicep,
/* Set link pointer on target to the device address of the
mapped object. */
void *tgt_addr = (void *) (tgt->tgt_start + k->tgt_offset);
- devicep->host2dev_func (devicep->target_id,
- (void *) n->tgt_offset,
- &tgt_addr, sizeof (void *));
+ gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
+ &tgt_addr, sizeof (void *));
}
array++;
}
@@ -1142,7 +1181,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
if (n)
{
assert (n->refcount != REFCOUNT_LINK);
- gomp_map_vars_existing (devicep, n, &cur_node, row_desc,
+ gomp_map_vars_existing (devicep, aq, n, &cur_node, row_desc,
kind & typemask);
target_row_addr = n->tgt->tgt_start + n->tgt_offset;
}
@@ -1166,7 +1205,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
row_desc->copy_from
= GOMP_MAP_COPY_FROM_P (kind & typemask);
row_desc->always_copy_from
- = GOMP_MAP_COPY_FROM_P (kind & typemask);
+ = GOMP_MAP_ALWAYS_FROM_P (kind & typemask);
row_desc->offset = 0;
row_desc->length = da->data_row_size;
@@ -1175,7 +1214,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
splay_tree_insert (mem_map, array);
if (GOMP_MAP_COPY_TO_P (kind & typemask))
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) tgt->tgt_start + k->tgt_offset,
(void *) k->host_start,
da->data_row_size);
@@ -1191,9 +1230,11 @@ gomp_map_vars (struct gomp_device_descr *devicep,
{
void *ptrblock = gomp_dynamic_array_create_ptrblock
(da, target_ptrblock, target_data_rows + row_start);
- gomp_copy_host2dev (devicep, target_ptrblock, ptrblock,
+ gomp_copy_host2dev (devicep, aq, target_ptrblock, ptrblock,
da->ptrblock_size);
- free (ptrblock);
+ /* Freeing of the ptrblock must be scheduled after the host2dev
+ copy completes. */
+ goacc_async_free (devicep, aq, ptrblock);
}
row_start += da->data_row_num;
@@ -1213,7 +1254,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
{
cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
/* FIXME: see above FIXME comment. */
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start + i * sizeof (void *)),
(void *) &cur_node.tgt_offset, sizeof (void *));
}
@@ -1232,7 +1273,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
return tgt;
}
-static void
+attribute_hidden void
gomp_unmap_tgt (struct target_mem_desc *tgt)
{
/* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region. */
@@ -1267,6 +1308,13 @@ gomp_remove_var (struct gomp_device_descr *devicep
attribute_hidden void
gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
{
+ gomp_unmap_vars_async (tgt, do_copyfrom, NULL);
+}
+
+attribute_hidden void
+gomp_unmap_vars_async (struct target_mem_desc *tgt, bool do_copyfrom,
+ struct goacc_asyncqueue *aq)
+{
struct gomp_device_descr *devicep = tgt->device_descr;
if (tgt->list_count == 0)
@@ -1302,7 +1350,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool
if ((do_unmap && do_copyfrom && tgt->list[i].copy_from)
|| tgt->list[i].always_copy_from)
- gomp_copy_dev2host (devicep,
+ gomp_copy_dev2host (devicep, aq,
(void *) (k->host_start + tgt->list[i].offset),
(void *) (k->tgt->tgt_start + k->tgt_offset
+ tgt->list[i].offset),
@@ -1368,9 +1416,9 @@ gomp_update (struct gomp_device_descr *devicep, si
size_t size = cur_node.host_end - cur_node.host_start;
if (GOMP_MAP_COPY_TO_P (kind & typemask))
- gomp_copy_host2dev (devicep, devaddr, hostaddr, size);
+ gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size);
if (GOMP_MAP_COPY_FROM_P (kind & typemask))
- gomp_copy_dev2host (devicep, hostaddr, devaddr, size);
+ gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
}
}
gomp_mutex_unlock (&devicep->lock);
@@ -1691,9 +1739,21 @@ gomp_init_device (struct gomp_device_descr *device
false);
}
+ /* Initialize OpenACC asynchronous queues. */
+ goacc_init_asyncqueues (devicep);
+
devicep->state = GOMP_DEVICE_INITIALIZED;
}
+attribute_hidden bool
+gomp_fini_device (struct gomp_device_descr *devicep)
+{
+ devicep->state = GOMP_DEVICE_FINALIZED;
+ bool ret = goacc_fini_asyncqueues (devicep);
+ ret &= devicep->fini_device_func (devicep->target_id);
+ return ret;
+}
+
attribute_hidden void
gomp_unload_device (struct gomp_device_descr *devicep)
{
@@ -2222,7 +2282,7 @@ gomp_exit_data (struct gomp_device_descr *devicep,
if ((kind == GOMP_MAP_FROM && k->refcount == 0)
|| kind == GOMP_MAP_ALWAYS_FROM)
- gomp_copy_dev2host (devicep, (void *) cur_node.host_start,
+ gomp_copy_dev2host (devicep, NULL, (void *) cur_node.host_start,
(void *) (k->tgt->tgt_start + k->tgt_offset
+ cur_node.host_start
- k->host_start),
@@ -2848,20 +2908,20 @@ gomp_load_plugin_for_device (struct gomp_device_de
if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
{
if (!DLSYM_OPT (openacc.exec, openacc_exec)
- || !DLSYM_OPT (openacc.register_async_cleanup,
- openacc_register_async_cleanup)
- || !DLSYM_OPT (openacc.async_test, openacc_async_test)
- || !DLSYM_OPT (openacc.async_test_all, openacc_async_test_all)
- || !DLSYM_OPT (openacc.async_wait, openacc_async_wait)
- || !DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async)
- || !DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all)
- || !DLSYM_OPT (openacc.async_wait_all_async,
- openacc_async_wait_all_async)
- || !DLSYM_OPT (openacc.async_set_async, openacc_async_set_async)
|| !DLSYM_OPT (openacc.create_thread_data,
openacc_create_thread_data)
|| !DLSYM_OPT (openacc.destroy_thread_data,
- openacc_destroy_thread_data))
+ openacc_destroy_thread_data)
+ || !DLSYM_OPT (openacc.async.construct, openacc_async_construct)
+ || !DLSYM_OPT (openacc.async.destruct, openacc_async_destruct)
+ || !DLSYM_OPT (openacc.async.test, openacc_async_test)
+ || !DLSYM_OPT (openacc.async.synchronize, openacc_async_synchronize)
+ || !DLSYM_OPT (openacc.async.serialize, openacc_async_serialize)
+ || !DLSYM_OPT (openacc.async.queue_callback,
+ openacc_async_queue_callback)
+ || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+ || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
+ || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
{
/* Require all the OpenACC handlers if we have
GOMP_OFFLOAD_CAP_OPENACC_200. */
@@ -2912,10 +2972,7 @@ gomp_target_fini (void)
struct gomp_device_descr *devicep = &devices[i];
gomp_mutex_lock (&devicep->lock);
if (devicep->state == GOMP_DEVICE_INITIALIZED)
- {
- ret = devicep->fini_device_func (devicep->target_id);
- devicep->state = GOMP_DEVICE_FINALIZED;
- }
+ ret = gomp_fini_device (devicep);
gomp_mutex_unlock (&devicep->lock);
if (!ret)
gomp_fatal ("device finalization failed");
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c (working copy)
@@ -206,11 +206,6 @@ void cb_enter_data_start (acc_prof_info *prof_info
assert (event_info->other_event.implicit == 1);
assert (event_info->other_event.tool_info == NULL);
- if (acc_device_type == acc_device_host
- || state < 100) //TODO
- assert (api_info->device_api == acc_device_api_none);
- else
- assert (api_info->device_api == acc_device_api_cuda);
assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
assert (api_info->device_type == prof_info->device_type);
assert (api_info->vendor == -1);
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c (working copy)
@@ -151,7 +151,7 @@ main (int argc, char **argv)
d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
#pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N], N) \
- async (4)
+ wait (1, 2, 3) async (4)
for (int ii = 0; ii < N; ii++)
e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c (working copy)
@@ -162,7 +162,7 @@ main (int argc, char **argv)
d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
#pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
- wait (1) async (4)
+ wait (1, 2, 3) async (4)
for (int ii = 0; ii < N; ii++)
e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c (working copy)
@@ -138,7 +138,7 @@ main (int argc, char **argv)
d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
#pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
- wait (1,5) async (4)
+ wait (1, 2, 3, 5) async (4)
for (int ii = 0; ii < N; ii++)
e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c (working copy)
@@ -46,16 +46,22 @@ main (int argc, char **argv)
abort ();
}
- fprintf (stderr, "CheCKpOInT\n");
- if (acc_async_test (1) != 0)
+ if (acc_async_test (0) != 0)
{
fprintf (stderr, "asynchronous operation not running\n");
abort ();
}
+ /* Test unseen async number. */
+ if (acc_async_test (1) != 1)
+ {
+ fprintf (stderr, "acc_async_test failed on unseen number\n");
+ abort ();
+ }
+
sleep (1);
- if (acc_async_test (1) != 1)
+ if (acc_async_test (0) != 1)
{
fprintf (stderr, "found asynchronous operation still running\n");
abort ();
@@ -65,7 +71,3 @@ main (int argc, char **argv)
return 0;
}
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c (working copy)
@@ -72,14 +72,13 @@ main (int argc, char **argv)
abort ();
}
- fprintf (stderr, "CheCKpOInT\n");
- acc_wait (1);
+ acc_wait (0);
gettimeofday (&tv2, NULL);
t2 = ((tv2.tv_sec - tv1.tv_sec) * 1000000) + (tv2.tv_usec - tv1.tv_usec);
- if (t2 > t1)
+ if (t2 - t1 > 100)
{
fprintf (stderr, "too long 1\n");
abort ();
@@ -87,7 +86,7 @@ main (int argc, char **argv)
gettimeofday (&tv1, NULL);
- acc_wait (1);
+ acc_wait (0);
gettimeofday (&tv2, NULL);
@@ -103,7 +102,3 @@ main (int argc, char **argv)
return 0;
}
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c (working copy)
@@ -84,6 +84,7 @@ main (int argc, char **argv)
for (i = 0; i < N; i++)
{
+ stream = (CUstream) acc_get_cuda_stream (i & 1);
r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, NULL, 0);
if (r != CUDA_SUCCESS)
{
@@ -92,10 +93,10 @@ main (int argc, char **argv)
}
}
- acc_wait_async (0, 1);
-
if (acc_async_test (0) != 0)
abort ();
+
+ acc_wait_async (0, 1);
if (acc_async_test (1) != 0)
abort ();
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c (working copy)
@@ -109,7 +109,7 @@ main (int argc, char **argv)
for (i = 0; i <= N; i++)
{
- if (acc_async_test (i) != 0)
+ if (acc_async_test (i) == 0)
abort ();
}
Index: libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90
===================================================================
--- libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 (revision 249620)
+++ libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 (working copy)
@@ -1,4 +1,5 @@
! { dg-do run }
+! { dg-xfail-run-if "n/a" { openacc_host_selected } }
program main
use openacc
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-06-24 7:54 [gomp4] OpenACC async re-work Chung-Lin Tang
@ 2017-06-26 22:45 ` Cesar Philippidis
2017-06-27 10:56 ` Chung-Lin Tang
0 siblings, 1 reply; 9+ messages in thread
From: Cesar Philippidis @ 2017-06-26 22:45 UTC (permalink / raw)
To: Chung-Lin Tang, gcc-patches, Thomas Schwinge
I still need more time to review this, but ...
On 06/24/2017 12:54 AM, Chung-Lin Tang wrote:
> Hi Cesar, Thomas,
> This patch is the re-implementation of OpenACC async we talked about.
> The changes are rather large, so I am putting it here for a few days before
> actually committing them to gomp-4_0-branch. Would appreciate if you guys
> take a look.
>
> To overall describe the highlights of the changes:
>
> (1) Instead of essentially implementing the entire OpenACC async support
> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
> plugin functions. Most of the OpenACC specific logic is pulled into
> libgomp/oacc-async.c
I'm not sure if plugins need to maintain backwards compatibility.
However, I don't see any changes inside libgomp.map, so maybe it's not
required.
> (2) CUDA events are no longer used. The limitation of no CUDA calls inside
> CUDA callbacks were a problem for resource freeing, but we now stash
> them onto the ptx_device and free them later.
Yay!
> (3) For 'wait + async', we now add a local thread synchronize, instead
> of just ordering the streams.
>
> (4) To work with the (3) change, some front end changes were added to
> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
> represent a 'wait all'.
What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
easier to make that change in the gimplifier?
> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
> this after the weekend (or Tues.)
> * plugin/plugin-nvptx.c (struct cuda_map): Remove.
> (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
> (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
These two functions seem extremely similar. I wonder if you should
consolidate them.
Overall, I like how you were able eliminate the externally managed map_*
data structure which was used to pass in arguments to nvptx_exec.
Although I wonder if we should just pass in those individual arguments
directly to cuLaunchKernel. But that's a big change in itself.
Cesar
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-06-26 22:45 ` Cesar Philippidis
@ 2017-06-27 10:56 ` Chung-Lin Tang
2017-06-28 22:31 ` Cesar Philippidis
0 siblings, 1 reply; 9+ messages in thread
From: Chung-Lin Tang @ 2017-06-27 10:56 UTC (permalink / raw)
To: Cesar Philippidis, gcc-patches, Thomas Schwinge
On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>> (1) Instead of essentially implementing the entire OpenACC async support
>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>> plugin functions. Most of the OpenACC specific logic is pulled into
>> libgomp/oacc-async.c
> I'm not sure if plugins need to maintain backwards compatibility.
> However, I don't see any changes inside libgomp.map, so maybe it's not
> required.
This patch is pretty large, but only inner workings (including libgomp vs. plugin interface) were modified.
No user API compatibility was altered.
>> (3) For 'wait + async', we now add a local thread synchronize, instead
>> of just ordering the streams.
>>
>> (4) To work with the (3) change, some front end changes were added to
>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>> represent a 'wait all'.
> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
> easier to make that change in the gimplifier?
Actually, we were basically throwing away argument-less wait clauses in front-ends
before this patch; i.e. '#pragma acc parallel async' and '#pragma acc parallel wait async'
were internally the same.
The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' clause representation
convention.
>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
>> this after the weekend (or Tues.)
>> * plugin/plugin-nvptx.c (struct cuda_map): Remove.
>> (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>> (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
> These two functions seem extremely similar. I wonder if you should
> consolidate them.
It would be nice to have a proper set of pthreads based host fallback hooks
for the openacc.async substruct later. Ideally, an accelerator plugin can
just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based
GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of
the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines
would make this less clean.
> Overall, I like how you were able eliminate the externally managed map_*
> data structure which was used to pass in arguments to nvptx_exec.
> Although I wonder if we should just pass in those individual arguments
> directly to cuLaunchKernel. But that's a big change in itself.
I didn't think of that when working on the current patch, maybe later.
Thanks,
Chung-Lin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-06-27 10:56 ` Chung-Lin Tang
@ 2017-06-28 22:31 ` Cesar Philippidis
2017-07-25 12:51 ` Chung-Lin Tang
0 siblings, 1 reply; 9+ messages in thread
From: Cesar Philippidis @ 2017-06-28 22:31 UTC (permalink / raw)
To: Chung-Lin Tang, gcc-patches, Thomas Schwinge
On 06/27/2017 03:56 AM, Chung-Lin Tang wrote:
> On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>>> (1) Instead of essentially implementing the entire OpenACC async support
>>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>>> plugin functions. Most of the OpenACC specific logic is pulled into
>>> libgomp/oacc-async.c
>> I'm not sure if plugins need to maintain backwards compatibility.
>> However, I don't see any changes inside libgomp.map, so maybe it's not
>> required.
>
> This patch is pretty large, but only inner workings (including libgomp vs. plugin interface) were modified.
> No user API compatibility was altered.
>
>>> (3) For 'wait + async', we now add a local thread synchronize, instead
>>> of just ordering the streams.
>>>
>>> (4) To work with the (3) change, some front end changes were added to
>>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>>> represent a 'wait all'.
>> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
>> easier to make that change in the gimplifier?
>
> Actually, we were basically throwing away argument-less wait clauses in front-ends
> before this patch; i.e. '#pragma acc parallel async' and '#pragma acc parallel wait async'
> were internally the same.
>
> The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' clause representation
> convention.
So then then wait was implied before? Or maybe that's why 'wait async'
didn't work.
>>> Patch was tested to have no regressions on gomp-4_0-branch. I'll commit
>>> this after the weekend (or Tues.)
>>> * plugin/plugin-nvptx.c (struct cuda_map): Remove.
>>> (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>>> (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
>> These two functions seem extremely similar. I wonder if you should
>> consolidate them.
>
> It would be nice to have a proper set of pthreads based host fallback hooks
> for the openacc.async substruct later. Ideally, an accelerator plugin can
> just implement GOMP_OFFLOAD_openacc_exec, and the default host pthreads-based
> GOMP_OFFLOAD_openacc_async_exec can be implemented in terms of
> the synchronous GOMP_OFFLOAD_openacc_exec. Combining the two hook routines
> would make this less clean.
After looking at this some more, I like how your patch simplifies
things. This small bit of somewhat duplicated code is much better than
what we had before. So I'm ok with it.
>> Overall, I like how you were able eliminate the externally managed map_*
>> data structure which was used to pass in arguments to nvptx_exec.
>> Although I wonder if we should just pass in those individual arguments
>> directly to cuLaunchKernel. But that's a big change in itself.
>
> I didn't think of that when working on the current patch, maybe later.
Here's some more comments regarding the code below. One high-level
comment regarding the usage of async-specific locks. Can't you get by
with using the global device lock, instead of a special async queue or
would that cause a deadlock?
> Index: libgomp/oacc-async.c
> ===================================================================
> --- libgomp/oacc-async.c (revision 249620)
> +++ libgomp/oacc-async.c (working copy)
> @@ -27,10 +27,85 @@
> <http://www.gnu.org/licenses/>. */
>
> #include <assert.h>
> +#include <string.h>
> #include "openacc.h"
> #include "libgomp.h"
> #include "oacc-int.h"
>
> +static struct goacc_thread *
> +get_goacc_thread (void)
> +{
> + struct goacc_thread *thr = goacc_thread ();
> + if (!thr || !thr->dev)
> + gomp_fatal ("no device active");
> + return thr;
> +}
> +
> +static struct gomp_device_descr *
> +get_goacc_thread_device (void)
> +{
> + struct goacc_thread *thr = goacc_thread ();
> +
> + if (!thr || !thr->dev)
> + gomp_fatal ("no device active");
> +
> + return thr->dev;
> +}
These two functions can be made public because a lot of other functioncs
can use them too. I don't know where to stash them though. You can
change that later though.
> +attribute_hidden struct goacc_asyncqueue *
> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int
async)
> +{
> + /* The special value acc_async_noval (-1) maps to the thread-specific
> + default async stream. */
> + if (async == acc_async_noval)
> + async = thr->default_async;
Is the default async queue device independent? I thought the default
async queue is defined in the acc_async_t enum. Maybe set
async = acc_async_default
?
> + if (async == acc_async_sync)
> + return NULL;
> +
> + if (async < 0)
> + gomp_fatal ("bad async %d", async);
> +
> + struct gomp_device_descr *dev = thr->dev;
> +
> + if (!create
> + && (async >= dev->openacc.async.nasyncqueue
> + || !dev->openacc.async.asyncqueue[async]))
> + return NULL;
> +
> + gomp_mutex_lock (&dev->openacc.async.lock);
Is this lock sufficient? What happens if the device is released?
> + if (async >= dev->openacc.async.nasyncqueue)
> + {
Not your fault, but I wonder if we would be better off just hard-capping
the number of async queues. What happens if the user does something like
wait (1<<30)? That can be addressed later.
> + int diff = async + 1 - dev->openacc.async.nasyncqueue;
> + dev->openacc.async.asyncqueue
> + = gomp_realloc (dev->openacc.async.asyncqueue,
> + sizeof (goacc_aq) * (async + 1));
> + memset (dev->openacc.async.asyncqueue +
dev->openacc.async.nasyncqueue,
> + 0, sizeof (goacc_aq) * diff);
> + dev->openacc.async.nasyncqueue = async + 1;
> + }
> +
> + if (!dev->openacc.async.asyncqueue[async])
> + {
> + dev->openacc.async.asyncqueue[async] =
dev->openacc.async.construct_func ();
> +
> + /* Link new async queue into active list. */
> + goacc_aq_list n = gomp_malloc (sizeof (struct
goacc_asyncqueue_list));
> + n->aq = dev->openacc.async.asyncqueue[async];
> + n->next = dev->openacc.async.active;
> + dev->openacc.async.active = n;
> + }
> + gomp_mutex_unlock (&dev->openacc.async.lock);
> + return dev->openacc.async.asyncqueue[async];
> +}
> +
> +attribute_hidden struct goacc_asyncqueue *
> +get_goacc_asyncqueue (int async)
> +{
> + struct goacc_thread *thr = get_goacc_thread ();
> + return lookup_goacc_asyncqueue (thr, true, async);
> +}
> +
> int
> acc_async_test (int async)
> {
> @@ -54,15 +129,14 @@ acc_async_test (int async)
> if (!thr || !thr->dev)
> gomp_fatal ("no device active");
>
> - int res = thr->dev->openacc.async_test_func (async);
> -
> if (profiling_setup_p)
> {
> thr->prof_info = NULL;
> thr->api_info = NULL;
> }
> -
> - return res;
> +
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> + return thr->dev->openacc.async.test_func (aq);
I'm not sure how the profling stuff works. Should the profling state be
state be set after calling thr->dev->openacc.async.test_func?
> }
>
> int
> @@ -69,7 +143,6 @@ int
> acc_async_test_all (void)
> {
> struct goacc_thread *thr = goacc_thread ();
> -
> acc_prof_info prof_info;
> acc_api_info api_info;
> bool profiling_setup_p
> @@ -79,8 +152,6 @@ acc_async_test_all (void)
> if (!thr || !thr->dev)
> gomp_fatal ("no device active");
>
> - int res = thr->dev->openacc.async_test_all_func ();
> -
> if (profiling_setup_p)
> {
> thr->prof_info = NULL;
> @@ -87,7 +158,17 @@ acc_async_test_all (void)
> thr->api_info = NULL;
> }
>
> - return res;
> + int ret = 1;
> + /*struct goacc_thread *thr = get_goacc_thread ();*/
> + gomp_mutex_lock (&thr->dev->openacc.async.lock);
> + for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
> + if (!thr->dev->openacc.async.test_func (l->aq))
> + {
> + ret = 0;
> + break;
> + }
> + gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> + return ret;
Likewise.
> }
>
> void
> @@ -113,7 +194,8 @@ acc_wait (int async)
> if (!thr || !thr->dev)
> gomp_fatal ("no device active");
>
> - thr->dev->openacc.async_wait_func (async);
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
> + thr->dev->openacc.async.synchronize_func (aq);
>
> if (profiling_setup_p)
> {
> Index: libgomp/oacc-cuda.c
> ===================================================================
> --- libgomp/oacc-cuda.c (revision 249620)
> +++ libgomp/oacc-cuda.c (working copy)
> @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
> prof_info.async_queue = prof_info.async;
> }
>
> - void *ret = NULL;
> if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
> - ret = thr->dev->openacc.cuda.get_stream_func (async);
> -
> - if (profiling_setup_p)
> {
> - thr->prof_info = NULL;
> - thr->api_info = NULL;
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> + return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
Again, strange ordering fo profiling_setup_p.
By the way, why not use get_goacc_thread here and other places in this
function? Again, that's a problem for another day.
> }
> -
> - return ret;
> + return NULL;
> }
>
> int
> @@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream)
>
> int ret = -1;
> if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
> - ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
> + {
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + gomp_mutex_lock (&thr->dev->openacc.async.lock);
> + ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
> + gomp_mutex_unlock (&thr->dev->openacc.async.lock);
> + }
>
> if (profiling_setup_p)
> {
> Index: libgomp/oacc-int.h
> ===================================================================
> --- libgomp/oacc-int.h (revision 249620)
> +++ libgomp/oacc-int.h (working copy)
> @@ -109,6 +109,15 @@ void goacc_restore_bind (void);
> void goacc_lazy_initialize (void);
> void goacc_host_init (void);
>
> +void goacc_init_asyncqueues (struct gomp_device_descr *);
> +bool goacc_fini_asyncqueues (struct gomp_device_descr *);
> +void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
> + struct goacc_asyncqueue *);
> +void goacc_async_free (struct gomp_device_descr *,
> + struct goacc_asyncqueue *, void *);
> +struct goacc_asyncqueue *get_goacc_asyncqueue (int);
> +struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread
*, bool, int);
> +
> void goacc_profiling_initialize (void);
> bool goacc_profiling_setup_p (struct goacc_thread *,
> acc_prof_info *, acc_api_info *);
> Index: libgomp/oacc-mem.c
> ===================================================================
> --- libgomp/oacc-mem.c (revision 249620)
> +++ libgomp/oacc-mem.c (working copy)
> @@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h,
> goto out;
> }
>
> - if (async > acc_async_sync)
> - thr->dev->openacc.async_set_async_func (async);
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + if (from)
> + gomp_copy_dev2host (thr->dev, aq, h, d, s);
> + else
> + gomp_copy_host2dev (thr->dev, aq, d, h, s);
>
> - bool ret = (from
> - ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
> - : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
> -
> - if (async > acc_async_sync)
> - thr->dev->openacc.async_set_async_func (acc_async_sync);
> -
> - if (!ret)
> - gomp_fatal ("error in %s", libfnname);
> -
> out:
> if (profiling_setup_p)
> {
> @@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s)
>
> gomp_mutex_unlock (&acc_dev->lock);
>
> - return n != NULL;
> + return (n ? 1 : 0);
> }
>
> /* Create a mapping for host [H,+S] -> device [D,+S] */
> @@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s
>
> gomp_mutex_unlock (&acc_dev->lock);
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> + goacc_aq aq = get_goacc_asyncqueue (async);
Do you want to call async_set_async_func outside of the protection of
acc_dev->lock?
> - tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s,
&kinds, true,
> - GOMP_MAP_VARS_OPENACC);
> + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs,
NULL, &s,
> + &kinds, true, GOMP_MAP_VARS_OPENACC);
> /* Initialize dynamic refcount. */
> tgt->list[0].key->dynamic_refcount = 1;
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
> gomp_mutex_lock (&acc_dev->lock);
>
> d = tgt->to_free;
> @@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int
>
> if (f & FLAG_COPYOUT)
> {
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> - acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + gomp_copy_dev2host (acc_dev, aq, h, d, s);
> }
> gomp_remove_var (acc_dev, n);
> }
> @@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in
> d = (void *) (n->tgt->tgt_start + n->tgt_offset
> + (uintptr_t) h - n->host_start);
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> + goacc_aq aq = get_goacc_asyncqueue (async);
>
> if (is_dev)
> - acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
> + gomp_copy_host2dev (acc_dev, aq, d, h, s);
> else
> - acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
> + gomp_copy_dev2host (acc_dev, aq, h, d, s);
>
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> -
Why did you remove this, but not add a clal to set_goacc_asyncqueue?
Maybe it's redundant.
> gomp_mutex_unlock (&acc_dev->lock);
> -
> +
> if (profiling_setup_p)
> {
> thr->prof_info = NULL;
> @@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m
>
> void
> gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
> - void *kinds)
> + void *kinds, int async)
> {
> struct target_mem_desc *tgt;
> struct goacc_thread *thr = goacc_thread ();
> @@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
> }
>
> gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__);
> - tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
> - NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
> + NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
> gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__);
>
> /* Initialize dynamic refcount. */
> @@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
> t->list[i].copy_from = force_copyfrom ? 1 : 0;
> break;
> }
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (async);
> - gomp_unmap_vars (t, true);
> - if (async > acc_async_sync)
> - acc_dev->openacc.async_set_async_func (acc_async_sync);
> +
> + /* If running synchronously, unmap immediately. */
> + if (async < acc_async_noval)
> + gomp_unmap_vars (t, true);
> + else
> + {
> + goacc_aq aq = get_goacc_asyncqueue (async);
> + goacc_async_copyout_unmap_vars (t, aq);
> + }
> }
>
> gomp_mutex_unlock (&acc_dev->lock);
> Index: libgomp/oacc-parallel.c
> ===================================================================
> --- libgomp/oacc-parallel.c (revision 249620)
> +++ libgomp/oacc-parallel.c (working copy)
> @@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
> fn (hostaddrs);
> goto out;
> }
> -
> + else if (profiling_dispatch_p)
> + api_info.device_api = acc_device_api_cuda;
> +
That seems target specific. Does that belong in the generic code path?
> /* Default: let the runtime choose. */
> for (i = 0; i != GOMP_DIM_MAX; i++)
> dims[i] = 0;
> @@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>
> case GOMP_LAUNCH_WAIT:
> {
> - unsigned num_waits = GOMP_LAUNCH_OP (tag);
> + /* Be careful to cast the op field as a signed 16-bit, and
> + sign-extend to full integer. */
> + int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
>
> - if (num_waits)
> + if (num_waits > 0)
> goacc_wait (async, num_waits, &ap);
> + else if (num_waits == acc_async_noval)
> + acc_wait_all_async (async);
> break;
> }
>
Cesar
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-06-28 22:31 ` Cesar Philippidis
@ 2017-07-25 12:51 ` Chung-Lin Tang
2017-07-25 14:09 ` Cesar Philippidis
2017-10-10 18:37 ` Thomas Schwinge
0 siblings, 2 replies; 9+ messages in thread
From: Chung-Lin Tang @ 2017-07-25 12:51 UTC (permalink / raw)
To: Cesar Philippidis, gcc-patches, Thomas Schwinge
[-- Attachment #1: Type: text/plain, Size: 7248 bytes --]
On 2017/6/29 6:31 AM, Cesar Philippidis wrote:
> On 06/27/2017 03:56 AM, Chung-Lin Tang wrote:
>> On 2017/6/27 6:45 AM, Cesar Philippidis wrote:
>>>> (1) Instead of essentially implementing the entire OpenACC async support
>>>> inside the plugin, we now use an opaque 'goacc_asyncqueue' implemented
>>>> by the plugin, along with core 'test', 'synchronize', 'serialize', etc.
>>>> plugin functions. Most of the OpenACC specific logic is pulled into
>>>> libgomp/oacc-async.c
>>> I'm not sure if plugins need to maintain backwards compatibility.
>>> However, I don't see any changes inside libgomp.map, so maybe it's not
>>> required.
>>
>> This patch is pretty large, but only inner workings (including libgomp vs. plugin interface) were modified.
>> No user API compatibility was altered.
>>
>>>> (3) For 'wait + async', we now add a local thread synchronize, instead
>>>> of just ordering the streams.
>>>>
>>>> (4) To work with the (3) change, some front end changes were added to
>>>> propagate argument-less wait clauses as 'wait(GOACC_ASYNC_NOVAL)' to
>>>> represent a 'wait all'.
>>> What's the significance of GOMP_ASYNC_NOVAL? Wouldn't it have been
>>> easier to make that change in the gimplifier?
>>
>> Actually, we were basically throwing away argument-less wait clauses in front-ends
>> before this patch; i.e. '#pragma acc parallel async' and '#pragma acc parallel wait async'
>> were internally the same.
>>
>> The use of GOMP_ASYNC_NOVAL (-1) was just following the current 'async' clause representation
>> convention.
>
> So then then wait was implied before? Or maybe that's why 'wait async'
> didn't work.
Not really, 'wait async' didn't work because we were not synchronizing
the local thread.
>>> Overall, I like how you were able eliminate the externally managed map_*
>>> data structure which was used to pass in arguments to nvptx_exec.
>>> Although I wonder if we should just pass in those individual arguments
>>> directly to cuLaunchKernel. But that's a big change in itself.
>>
>> I didn't think of that when working on the current patch, maybe later.
>
> Here's some more comments regarding the code below. One high-level
> comment regarding the usage of async-specific locks. Can't you get by
> with using the global device lock, instead of a special async queue or
> would that cause a deadlock?
I did not try to re-use the device lock here. I think using that might
cause more ordering/nesting problems (gomp mutexes are not pthread mutexes,
no 'try' mechanism here).
The intention of the async queue locks are just very short temporal
protection when looking up a struct goacc_asyncqueue. Also see below comments.
>> +attribute_hidden struct goacc_asyncqueue *
>> +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int
> async)
>> +{
>> + /* The special value acc_async_noval (-1) maps to the thread-specific
>> + default async stream. */
>> + if (async == acc_async_noval)
>> + async = thr->default_async;
>
> Is the default async queue device independent? I thought the default
> async queue is defined in the acc_async_t enum. Maybe set
>
> async = acc_async_default
It's now thread dependent and user settable in OpenACC 2.5
(i.e. the acc_set/get_default_async user API routines)
>> + if (async == acc_async_sync)
>> + return NULL;
>> +
>> + if (async < 0)
>> + gomp_fatal ("bad async %d", async);
>> +
>> + struct gomp_device_descr *dev = thr->dev;
>> +
>> + if (!create
>> + && (async >= dev->openacc.async.nasyncqueue
>> + || !dev->openacc.async.asyncqueue[async]))
>> + return NULL;
>> +
>> + gomp_mutex_lock (&dev->openacc.async.lock);
> Is this lock sufficient? What happens if the device is released?
>
>> + if (async >= dev->openacc.async.nasyncqueue)
>> + {
> Not your fault, but I wonder if we would be better off just hard-capping
> the number of async queues. What happens if the user does something like
> wait (1<<30)? That can be addressed later.
Yeah, looks like we should use some sort of map or hash (above some threshold)
instead.
>> int
>> acc_async_test (int async)
>> {
>> @@ -54,15 +129,14 @@ acc_async_test (int async)
>> if (!thr || !thr->dev)
>> gomp_fatal ("no device active");
>>
>> - int res = thr->dev->openacc.async_test_func (async);
>> -
>> if (profiling_setup_p)
>> {
>> thr->prof_info = NULL;
>> thr->api_info = NULL;
>> }
>> -
>> - return res;
>> +
>> + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
>> + return thr->dev->openacc.async.test_func (aq);
> I'm not sure how the profling stuff works. Should the profling state be
> state be set after calling thr->dev->openacc.async.test_func?
I've updated the patch to place those profiling code (which seems to be
end of routine cleanups) to the very end.
>> - void *ret = NULL;
>> if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
>> - ret = thr->dev->openacc.cuda.get_stream_func (async);
>> -
>> - if (profiling_setup_p)
>> {
>> - thr->prof_info = NULL;
>> - thr->api_info = NULL;
>> + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
>> + return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
> Again, strange ordering fo profiling_setup_p.
>
> By the way, why not use get_goacc_thread here and other places in this
> function? Again, that's a problem for another day.
get_goacc_thread immediately gomp_fatal()s if the 'thr' or device is NULL,
while the original behavior here fails more elegantly. The supposed behavior
can be discussed later, but here I'm just trying to preserve things
as much as possible.
>> - if (async > acc_async_sync)
>> - acc_dev->openacc.async_set_async_func (async);
>> + goacc_aq aq = get_goacc_asyncqueue (async);
>>
>> if (is_dev)
>> - acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
>> + gomp_copy_host2dev (acc_dev, aq, d, h, s);
>> else
>> - acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
>> + gomp_copy_dev2host (acc_dev, aq, h, d, s);
>>
>> - if (async > acc_async_sync)
>> - acc_dev->openacc.async_set_async_func (acc_async_sync);
>> -
> Why did you remove this, but not add a clal to set_goacc_asyncqueue?
> Maybe it's redundant.
get_goacc_asyncqueue acquires the async queue lock internally. This is
related to your above comment on sharing the device lock. IMHO, this
may cause more complex lock ordering issues, but might be revisited later
to see if that simplifies things.
>> Index: libgomp/oacc-parallel.c
>> ===================================================================
>> --- libgomp/oacc-parallel.c (revision 249620)
>> +++ libgomp/oacc-parallel.c (working copy)
>> @@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
>> fn (hostaddrs);
>> goto out;
>> }
>> -
>> + else if (profiling_dispatch_p)
>> + api_info.device_api = acc_device_api_cuda;
>> +
> That seems target specific. Does that belong in the generic code path?
I don't really understand the profiling code. This change appeared to
fix regressions. Thomas might know better.
Attached is the updated version of the patch, re-tested.
Thomas, do you need some more time to look over it? Or should I commit it first?
Thanks,
Chung-Lin
[-- Attachment #2: 204-v2.patch --]
[-- Type: text/x-patch, Size: 123433 bytes --]
Index: gcc/c/c-parser.c
===================================================================
--- gcc/c/c-parser.c (revision 250497)
+++ gcc/c/c-parser.c (working copy)
@@ -11941,7 +11941,7 @@ c_parser_oacc_clause_tile (c_parser *parser, tree
}
/* OpenACC:
- wait ( int-expr-list ) */
+ wait [( int-expr-list )] */
static tree
c_parser_oacc_clause_wait (c_parser *parser, tree list)
@@ -11950,7 +11950,15 @@ c_parser_oacc_clause_wait (c_parser *parser, tree
if (c_parser_peek_token (parser)->type == CPP_OPEN_PAREN)
list = c_parser_oacc_wait_list (parser, clause_loc, list);
+ else
+ {
+ tree c = build_omp_clause (clause_loc, OMP_CLAUSE_WAIT);
+ OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+ OMP_CLAUSE_CHAIN (c) = list;
+ list = c;
+ }
+
return list;
}
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c (revision 250497)
+++ gcc/cp/parser.c (working copy)
@@ -30619,7 +30619,7 @@ cp_parser_oacc_wait_list (cp_parser *parser, locat
}
/* OpenACC:
- wait ( int-expr-list ) */
+ wait [( int-expr-list )] */
static tree
cp_parser_oacc_clause_wait (cp_parser *parser, tree list)
@@ -30626,11 +30626,17 @@ cp_parser_oacc_clause_wait (cp_parser *parser, tre
{
location_t location = cp_lexer_peek_token (parser->lexer)->location;
- if (cp_lexer_peek_token (parser->lexer)->type != CPP_OPEN_PAREN)
- return list;
+ if (cp_lexer_peek_token (parser->lexer)->type == CPP_OPEN_PAREN)
+ list = cp_parser_oacc_wait_list (parser, location, list);
+ else
+ {
+ tree c = build_omp_clause (location, OMP_CLAUSE_WAIT);
+
+ OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+ OMP_CLAUSE_CHAIN (c) = list;
+ list = c;
+ }
- list = cp_parser_oacc_wait_list (parser, location, list);
-
return list;
}
Index: gcc/fortran/trans-openmp.c
===================================================================
--- gcc/fortran/trans-openmp.c (revision 250497)
+++ gcc/fortran/trans-openmp.c (working copy)
@@ -2962,6 +2962,13 @@ gfc_trans_omp_clauses_1 (stmtblock_t *block, gfc_o
omp_clauses = c;
}
}
+ else if (clauses->wait)
+ {
+ c = build_omp_clause (where.lb->location, OMP_CLAUSE_WAIT);
+ OMP_CLAUSE_DECL (c) = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+ OMP_CLAUSE_CHAIN (c) = omp_clauses;
+ omp_clauses = c;
+ }
if (clauses->num_gangs_expr)
{
tree num_gangs_var
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c (revision 250497)
+++ gcc/omp-low.c (working copy)
@@ -14226,9 +14226,18 @@ expand_omp_target (struct omp_region *region)
/* ... push a placeholder. */
args.safe_push (integer_zero_node);
+ bool noval_seen = false;
+ tree noval = build_int_cst (integer_type_node, GOMP_ASYNC_NOVAL);
+
for (; c; c = OMP_CLAUSE_CHAIN (c))
if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
{
+ if (tree_int_cst_compare (OMP_CLAUSE_WAIT_EXPR (c), noval) == 0)
+ {
+ noval_seen = true;
+ continue;
+ }
+
args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
integer_type_node,
OMP_CLAUSE_WAIT_EXPR (c)));
@@ -14235,7 +14244,12 @@ expand_omp_target (struct omp_region *region)
num_waits++;
}
- if (!tagging || num_waits)
+ if (noval_seen && num_waits == 0)
+ args[t_wait_idx] =
+ (tagging
+ ? oacc_launch_pack (GOMP_LAUNCH_WAIT, NULL_TREE, GOMP_ASYNC_NOVAL)
+ : noval);
+ else if (!tagging || num_waits)
{
tree len;
Index: gcc/testsuite/c-c++-common/goacc/dtype-1.c
===================================================================
--- gcc/testsuite/c-c++-common/goacc/dtype-1.c (revision 250497)
+++ gcc/testsuite/c-c++-common/goacc/dtype-1.c (working copy)
@@ -96,11 +96,11 @@ test ()
/* { dg-final { scan-tree-dump-times "oacc_parallel device_type\\(\\*\\) \\\[ wait\\(10\\) vector_length\\(10\\) num_workers\\(10\\) num_gangs\\(10\\) async\\(10\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(3\\) vector_length\\(128\\) num_workers\\(300\\) num_gangs\\(300\\) async\\(3\\) \\\] wait\\(1\\) vector_length\\(1\\) num_workers\\(1\\) num_gangs\\(1\\) async\\(1\\)" 1 "omplower" } } */
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) \\\]" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(-1\\) async\\(-1\\) \\\]" 1 "omplower" } } */
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(1\\) async\\(1\\) \\\] async\\(-1\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ wait\\(1\\) async\\(1\\) \\\] wait\\(-1\\) async\\(-1\\)" 1 "omplower" } } */
-/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ wait\\(0\\) async\\(0\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(2\\) async\\(2\\) \\\] async\\(-1\\)" 1 "omplower" } } */
+/* { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ wait\\(0\\) async\\(0\\) \\\] device_type\\(nvidia\\) \\\[ wait\\(2\\) async\\(2\\) \\\] wait\\(-1\\) async\\(-1\\)" 1 "omplower" } } */
/* { dg-final { scan-tree-dump-times "acc loop device_type\\(nvidia\\) \\\[ tile\\(1\\) gang \\\] private\\(i1\\.0\\) private\\(i1\\)" 1 "omplower" } } */
Index: gcc/testsuite/gfortran.dg/goacc/dtype-1.f95
===================================================================
--- gcc/testsuite/gfortran.dg/goacc/dtype-1.f95 (revision 250497)
+++ gcc/testsuite/gfortran.dg/goacc/dtype-1.f95 (working copy)
@@ -175,13 +175,13 @@ end subroutine sr5b
! { dg-final { scan-tree-dump-times "oacc_parallel device_type\\(\\*\\) \\\[ async\\(10\\) wait\\(10\\) num_gangs\\(10\\) num_workers\\(10\\) vector_length\\(10\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(3\\) wait\\(3\\) num_gangs\\(300\\) num_workers\\(300\\) vector_length\\(128\\) \\\] async\\(1\\) wait\\(1\\) num_gangs\\(1\\) num_workers\\(1\\) vector_length\\(1\\)" 1 "omplower" } }
-! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) \\\]" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(-1\\) wait\\(-1\\) \\\]" 1 "omplower" } }
! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(nvidia\\) \\\[ async\\(1\\) wait\\(1\\) \\\]" 1 "omplower" } }
! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia\\) \\\[ async\\(2\\) wait\\(2\\) \\\]" 1 "omplower" } }
-! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(1\\) wait\\(1\\) \\\] async\\(-1\\)" 1 "omplower" } }
+! { dg-final { scan-tree-dump-times "oacc_kernels device_type\\(\\*\\) \\\[ async\\(0\\) wait\\(0\\) \\\] device_type\\(nvidia_ptx\\) \\\[ async\\(1\\) wait\\(1\\) \\\] async\\(-1\\) wait\\(-1\\)" 1 "omplower" } }
! { dg-final { scan-tree-dump-times "acc loop device_type\\(nvidia\\) \\\[ tile\\(1\\) gang \\\] private\\(i1\\) private\\(i1\\.1\\)" 1 "omplower" } }
Index: include/gomp-constants.h
===================================================================
--- include/gomp-constants.h (revision 250497)
+++ include/gomp-constants.h (working copy)
@@ -249,13 +249,14 @@ enum gomp_map_kind
#define GOMP_LAUNCH_CODE_SHIFT 28
#define GOMP_LAUNCH_DEVICE_SHIFT 16
#define GOMP_LAUNCH_OP_SHIFT 0
+#define GOMP_LAUNCH_OP_MASK 0xffff
#define GOMP_LAUNCH_PACK(CODE,DEVICE,OP) \
(((CODE) << GOMP_LAUNCH_CODE_SHIFT) \
| ((DEVICE) << GOMP_LAUNCH_DEVICE_SHIFT) \
- | ((OP) << GOMP_LAUNCH_OP_SHIFT))
+ | (((OP) & GOMP_LAUNCH_OP_MASK) << GOMP_LAUNCH_OP_SHIFT))
#define GOMP_LAUNCH_CODE(X) (((X) >> GOMP_LAUNCH_CODE_SHIFT) & 0xf)
#define GOMP_LAUNCH_DEVICE(X) (((X) >> GOMP_LAUNCH_DEVICE_SHIFT) & 0xfff)
-#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
+#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & GOMP_LAUNCH_OP_MASK)
#define GOMP_LAUNCH_OP_MAX 0xffff
/* Bitmask to apply in order to find out the intended device of a target
Index: libgomp/libgomp-plugin.h
===================================================================
--- libgomp/libgomp-plugin.h (revision 250497)
+++ libgomp/libgomp-plugin.h (working copy)
@@ -55,6 +55,20 @@ enum offload_target_type
OFFLOAD_TARGET_TYPE_HSA = 7
};
+/* Opaque type to represent plugin-dependent implementation of an
+ OpenACC asynchronous queue. */
+struct goacc_asyncqueue;
+
+/* Used to keep a list of active asynchronous queues. */
+struct goacc_asyncqueue_list
+{
+ struct goacc_asyncqueue *aq;
+ struct goacc_asyncqueue_list *next;
+};
+
+typedef struct goacc_asyncqueue *goacc_aq;
+typedef struct goacc_asyncqueue_list *goacc_aq_list;
+
/* Auxiliary struct, used for transferring pairs of addresses from plugin
to libgomp. */
struct addr_pair
@@ -99,22 +113,31 @@ extern bool GOMP_OFFLOAD_dev2dev (int, void *, con
extern bool GOMP_OFFLOAD_can_run (void *);
extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *);
+
extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
- void **, int, unsigned *, void *);
-extern void GOMP_OFFLOAD_openacc_register_async_cleanup (void *, int);
-extern int GOMP_OFFLOAD_openacc_async_test (int);
-extern int GOMP_OFFLOAD_openacc_async_test_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait (int);
-extern void GOMP_OFFLOAD_openacc_async_wait_async (int, int);
-extern void GOMP_OFFLOAD_openacc_async_wait_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait_all_async (int);
-extern void GOMP_OFFLOAD_openacc_async_set_async (int);
+ void **, unsigned *, void *);
+extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
+ void **, unsigned *, void *,
+ struct goacc_asyncqueue *);
+extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
+extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
+ struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
+ void (*)(void *), void *);
+extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t,
+ struct goacc_asyncqueue *);
+extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t,
+ struct goacc_asyncqueue *);
extern void *GOMP_OFFLOAD_openacc_create_thread_data (int);
extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
extern void *GOMP_OFFLOAD_openacc_cuda_get_current_device (void);
extern void *GOMP_OFFLOAD_openacc_cuda_get_current_context (void);
-extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (int);
-extern int GOMP_OFFLOAD_openacc_cuda_set_stream (int, void *);
+extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *,
+ void *);
#ifdef __cplusplus
}
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h (revision 250497)
+++ libgomp/libgomp.h (working copy)
@@ -870,19 +870,23 @@ typedef struct acc_dispatch_t
/* Execute. */
__typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
- /* Async cleanup callback registration. */
- __typeof (GOMP_OFFLOAD_openacc_register_async_cleanup)
- *register_async_cleanup_func;
+ struct {
+ gomp_mutex_t lock;
+ int nasyncqueue;
+ struct goacc_asyncqueue **asyncqueue;
+ struct goacc_asyncqueue_list *active;
+
+ __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_test) *test_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_synchronize) *synchronize_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_serialize) *serialize_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
- /* Asynchronous routines. */
- __typeof (GOMP_OFFLOAD_openacc_async_test) *async_test_func;
- __typeof (GOMP_OFFLOAD_openacc_async_test_all) *async_test_all_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait) *async_wait_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait_async) *async_wait_async_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait_all) *async_wait_all_func;
- __typeof (GOMP_OFFLOAD_openacc_async_wait_all_async)
- *async_wait_all_async_func;
- __typeof (GOMP_OFFLOAD_openacc_async_set_async) *async_set_async_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
+ __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
+ } async;
/* Create/destroy TLS data. */
__typeof (GOMP_OFFLOAD_openacc_create_thread_data) *create_thread_data_func;
@@ -974,17 +978,31 @@ enum gomp_map_vars_kind
GOMP_MAP_VARS_ENTER_DATA
};
-extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *);
+extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int);
extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int);
extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
unsigned short *);
-
+extern void gomp_copy_host2dev (struct gomp_device_descr *,
+ struct goacc_asyncqueue *,
+ void *, const void *, size_t);
+extern void gomp_copy_dev2host (struct gomp_device_descr *,
+ struct goacc_asyncqueue *,
+ void *, const void *, size_t);
extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
size_t, void **, void **,
size_t *, void *, bool,
enum gomp_map_vars_kind);
+extern struct target_mem_desc *gomp_map_vars_async (struct gomp_device_descr *,
+ struct goacc_asyncqueue *,
+ size_t, void **, void **,
+ size_t *, void *, bool,
+ enum gomp_map_vars_kind);
+extern void gomp_unmap_tgt (struct target_mem_desc *);
extern void gomp_unmap_vars (struct target_mem_desc *, bool);
+extern void gomp_unmap_vars_async (struct target_mem_desc *, bool,
+ struct goacc_asyncqueue *);
extern void gomp_init_device (struct gomp_device_descr *);
+extern bool gomp_fini_device (struct gomp_device_descr *);
extern void gomp_unload_device (struct gomp_device_descr *);
extern bool gomp_offload_target_available_p (int);
extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key);
Index: libgomp/oacc-async.c
===================================================================
--- libgomp/oacc-async.c (revision 250497)
+++ libgomp/oacc-async.c (working copy)
@@ -27,10 +27,85 @@
<http://www.gnu.org/licenses/>. */
#include <assert.h>
+#include <string.h>
#include "openacc.h"
#include "libgomp.h"
#include "oacc-int.h"
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+ struct goacc_thread *thr = goacc_thread ();
+ if (!thr || !thr->dev)
+ gomp_fatal ("no device active");
+ return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+ struct goacc_thread *thr = goacc_thread ();
+
+ if (!thr || !thr->dev)
+ gomp_fatal ("no device active");
+
+ return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+ /* The special value acc_async_noval (-1) maps to the thread-specific
+ default async stream. */
+ if (async == acc_async_noval)
+ async = thr->default_async;
+
+ if (async == acc_async_sync)
+ return NULL;
+
+ if (async < 0)
+ gomp_fatal ("bad async %d", async);
+
+ struct gomp_device_descr *dev = thr->dev;
+
+ if (!create
+ && (async >= dev->openacc.async.nasyncqueue
+ || !dev->openacc.async.asyncqueue[async]))
+ return NULL;
+
+ gomp_mutex_lock (&dev->openacc.async.lock);
+ if (async >= dev->openacc.async.nasyncqueue)
+ {
+ int diff = async + 1 - dev->openacc.async.nasyncqueue;
+ dev->openacc.async.asyncqueue
+ = gomp_realloc (dev->openacc.async.asyncqueue,
+ sizeof (goacc_aq) * (async + 1));
+ memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+ 0, sizeof (goacc_aq) * diff);
+ dev->openacc.async.nasyncqueue = async + 1;
+ }
+
+ if (!dev->openacc.async.asyncqueue[async])
+ {
+ dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+ /* Link new async queue into active list. */
+ goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+ n->aq = dev->openacc.async.asyncqueue[async];
+ n->next = dev->openacc.async.active;
+ dev->openacc.async.active = n;
+ }
+ gomp_mutex_unlock (&dev->openacc.async.lock);
+ return dev->openacc.async.asyncqueue[async];
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+ struct goacc_thread *thr = get_goacc_thread ();
+ return lookup_goacc_asyncqueue (thr, true, async);
+}
+
int
acc_async_test (int async)
{
@@ -39,6 +114,9 @@ acc_async_test (int async)
struct goacc_thread *thr = goacc_thread ();
+ if (!thr || !thr->dev)
+ gomp_fatal ("no device active");
+
acc_prof_info prof_info;
acc_api_info api_info;
bool profiling_setup_p
@@ -51,11 +129,9 @@ acc_async_test (int async)
prof_info.async_queue = prof_info.async;
}
- if (!thr || !thr->dev)
- gomp_fatal ("no device active");
+ goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+ int res = thr->dev->openacc.async.test_func (aq);
- int res = thr->dev->openacc.async_test_func (async);
-
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -69,6 +145,8 @@ int
acc_async_test_all (void)
{
struct goacc_thread *thr = goacc_thread ();
+ if (!thr || !thr->dev)
+ gomp_fatal ("no device active");
acc_prof_info prof_info;
acc_api_info api_info;
@@ -76,18 +154,22 @@ acc_async_test_all (void)
= __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
false);
- if (!thr || !thr->dev)
- gomp_fatal ("no device active");
+ int ret = 1;
+ gomp_mutex_lock (&thr->dev->openacc.async.lock);
+ for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+ if (!thr->dev->openacc.async.test_func (l->aq))
+ {
+ ret = 0;
+ break;
+ }
+ gomp_mutex_unlock (&thr->dev->openacc.async.lock);
- int res = thr->dev->openacc.async_test_all_func ();
-
if (profiling_setup_p)
{
thr->prof_info = NULL;
thr->api_info = NULL;
}
-
- return res;
+ return ret;
}
void
@@ -113,7 +195,8 @@ acc_wait (int async)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_func (async);
+ goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+ thr->dev->openacc.async.synchronize_func (aq);
if (profiling_setup_p)
{
@@ -153,8 +236,16 @@ acc_wait_async (int async1, int async2)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_async_func (async1, async2);
+ goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+ goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+ if (!aq1)
+ gomp_fatal ("invalid async 1");
+ if (aq1 == aq2)
+ gomp_fatal ("identical parameters");
+ thr->dev->openacc.async.synchronize_func (aq1);
+ thr->dev->openacc.async.serialize_func (aq1, aq2);
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -176,8 +267,13 @@ acc_wait_all (void)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_all_func ();
+ struct gomp_device_descr *dev = get_goacc_thread_device ();
+ gomp_mutex_lock (&dev->openacc.async.lock);
+ for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+ dev->openacc.async.synchronize_func (l->aq);
+ gomp_mutex_unlock (&dev->openacc.async.lock);
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -219,8 +315,17 @@ acc_wait_all_async (int async)
if (!thr || !thr->dev)
gomp_fatal ("no device active");
- thr->dev->openacc.async_wait_all_async_func (async);
+ goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
+ gomp_mutex_lock (&thr->dev->openacc.async.lock);
+ for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+ {
+ thr->dev->openacc.async.synchronize_func (l->aq);
+ if (waiting_queue)
+ thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+ }
+ gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -251,10 +356,72 @@ acc_set_default_async (int async)
if (async < acc_async_sync)
gomp_fatal ("invalid async argument: %d", async);
- struct goacc_thread *thr = goacc_thread ();
+ struct goacc_thread *thr = get_goacc_thread ();
+ thr->default_async = async;
+}
- if (!thr || !thr->dev)
- gomp_fatal ("no device active");
+static void
+goacc_async_unmap_tgt (void *ptr)
+{
+ struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
- thr->default_async = async;
+ if (tgt->refcount > 1)
+ tgt->refcount--;
+ else
+ gomp_unmap_tgt (tgt);
}
+
+attribute_hidden void
+goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt,
+ struct goacc_asyncqueue *aq)
+{
+ struct gomp_device_descr *devicep = tgt->device_descr;
+
+ /* Increment reference to delay freeing of device memory until callback
+ has triggered. */
+ tgt->refcount++;
+ gomp_unmap_vars_async (tgt, true, aq);
+ devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
+ (void *) tgt);
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq, void *ptr)
+{
+ if (!aq)
+ free (ptr);
+ else
+ devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+ gomp_mutex_init (&devicep->openacc.async.lock);
+ devicep->openacc.async.nasyncqueue = 0;
+ devicep->openacc.async.asyncqueue = NULL;
+ devicep->openacc.async.active = NULL;
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+ bool ret = true;
+ if (devicep->openacc.async.nasyncqueue > 0)
+ {
+ goacc_aq_list next;
+ for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+ {
+ ret &= devicep->openacc.async.destruct_func (l->aq);
+ next = l->next;
+ free (l);
+ }
+ free (devicep->openacc.async.asyncqueue);
+ devicep->openacc.async.nasyncqueue = 0;
+ devicep->openacc.async.asyncqueue = NULL;
+ devicep->openacc.async.active = NULL;
+ }
+ gomp_mutex_destroy (&devicep->openacc.async.lock);
+ return ret;
+}
Index: libgomp/oacc-cuda.c
===================================================================
--- libgomp/oacc-cuda.c (revision 250497)
+++ libgomp/oacc-cuda.c (working copy)
@@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
prof_info.async_queue = prof_info.async;
}
- void *ret = NULL;
if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
- ret = thr->dev->openacc.cuda.get_stream_func (async);
-
- if (profiling_setup_p)
{
- thr->prof_info = NULL;
- thr->api_info = NULL;
+ goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+ return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
}
-
- return ret;
+ return NULL;
}
int
@@ -138,7 +133,12 @@ acc_set_cuda_stream (int async, void *stream)
int ret = -1;
if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
- ret = thr->dev->openacc.cuda.set_stream_func (async, stream);
+ {
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ gomp_mutex_lock (&thr->dev->openacc.async.lock);
+ ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+ gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+ }
if (profiling_setup_p)
{
Index: libgomp/oacc-host.c
===================================================================
--- libgomp/oacc-host.c (revision 250497)
+++ libgomp/oacc-host.c (working copy)
@@ -140,7 +140,6 @@ host_openacc_exec (void (*fn) (void *),
size_t mapnum __attribute__ ((unused)),
void **hostaddrs,
void **devaddrs __attribute__ ((unused)),
- int async __attribute__ ((unused)),
unsigned *dims __attribute__ ((unused)),
void *targ_mem_desc __attribute__ ((unused)))
{
@@ -148,49 +147,81 @@ host_openacc_exec (void (*fn) (void *),
}
static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
- int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+ size_t mapnum __attribute__ ((unused)),
+ void **hostaddrs,
+ void **devaddrs __attribute__ ((unused)),
+ unsigned *dims __attribute__ ((unused)),
+ void *targ_mem_desc __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq __attribute__ ((unused)))
{
+ fn (hostaddrs);
}
static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
{
return 1;
}
-static int
-host_openacc_async_test_all (void)
+static void
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
{
- return 1;
}
static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+ __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq2
+ __attribute__ ((unused)))
{
}
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
- int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+ void *dst __attribute__ ((unused)),
+ const void *src __attribute__ ((unused)),
+ size_t n __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
{
+ return true;
}
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+ void *dst __attribute__ ((unused)),
+ const void *src __attribute__ ((unused)),
+ size_t n __attribute__ ((unused)),
+ struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
{
+ return true;
}
static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+ __attribute__ ((unused)),
+ void (*callback_fn)(void *)
+ __attribute__ ((unused)),
+ void *userptr __attribute__ ((unused)))
{
}
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
{
+ return NULL;
}
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+ __attribute__ ((unused)))
+{
+ return true;
+}
+
static void *
host_openacc_create_thread_data (int ord __attribute__ ((unused)))
{
@@ -235,16 +266,18 @@ static struct gomp_device_descr host_dispatch =
.exec_func = host_openacc_exec,
- .register_async_cleanup_func = host_openacc_register_async_cleanup,
+ .async = {
+ .construct_func = host_openacc_async_construct,
+ .destruct_func = host_openacc_async_destruct,
+ .test_func = host_openacc_async_test,
+ .synchronize_func = host_openacc_async_synchronize,
+ .serialize_func = host_openacc_async_serialize,
+ .queue_callback_func = host_openacc_async_queue_callback,
+ .exec_func = host_openacc_async_exec,
+ .dev2host_func = host_openacc_async_dev2host,
+ .host2dev_func = host_openacc_async_host2dev,
+ },
- .async_test_func = host_openacc_async_test,
- .async_test_all_func = host_openacc_async_test_all,
- .async_wait_func = host_openacc_async_wait,
- .async_wait_async_func = host_openacc_async_wait_async,
- .async_wait_all_func = host_openacc_async_wait_all,
- .async_wait_all_async_func = host_openacc_async_wait_all_async,
- .async_set_async_func = host_openacc_async_set_async,
-
.create_thread_data_func = host_openacc_create_thread_data,
.destroy_thread_data_func = host_openacc_destroy_thread_data,
Index: libgomp/oacc-init.c
===================================================================
--- libgomp/oacc-init.c (revision 250497)
+++ libgomp/oacc-init.c (working copy)
@@ -390,7 +390,7 @@ acc_shutdown_1 (acc_device_t d)
if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
{
devices_active = true;
- ret &= acc_dev->fini_device_func (acc_dev->target_id);
+ ret &= gomp_fini_device (acc_dev);
acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
}
gomp_mutex_unlock (&acc_dev->lock);
@@ -513,8 +513,6 @@ goacc_attach_host_thread_to_device (int ord)
= acc_dev->openacc.create_thread_data_func (ord);
thr->default_async = acc_async_default;
-
- acc_dev->openacc.async_set_async_func (acc_async_sync);
}
/* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
Index: libgomp/oacc-int.h
===================================================================
--- libgomp/oacc-int.h (revision 250497)
+++ libgomp/oacc-int.h (working copy)
@@ -109,6 +109,15 @@ void goacc_restore_bind (void);
void goacc_lazy_initialize (void);
void goacc_host_init (void);
+void goacc_init_asyncqueues (struct gomp_device_descr *);
+bool goacc_fini_asyncqueues (struct gomp_device_descr *);
+void goacc_async_copyout_unmap_vars (struct target_mem_desc *,
+ struct goacc_asyncqueue *);
+void goacc_async_free (struct gomp_device_descr *,
+ struct goacc_asyncqueue *, void *);
+struct goacc_asyncqueue *get_goacc_asyncqueue (int);
+struct goacc_asyncqueue *lookup_goacc_asyncqueue (struct goacc_thread *, bool, int);
+
void goacc_profiling_initialize (void);
bool goacc_profiling_setup_p (struct goacc_thread *,
acc_prof_info *, acc_api_info *);
Index: libgomp/oacc-mem.c
===================================================================
--- libgomp/oacc-mem.c (revision 250497)
+++ libgomp/oacc-mem.c (working copy)
@@ -224,19 +224,12 @@ memcpy_tofrom_device (bool from, void *d, void *h,
goto out;
}
- if (async > acc_async_sync)
- thr->dev->openacc.async_set_async_func (async);
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ if (from)
+ gomp_copy_dev2host (thr->dev, aq, h, d, s);
+ else
+ gomp_copy_host2dev (thr->dev, aq, d, h, s);
- bool ret = (from
- ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
- : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
- if (async > acc_async_sync)
- thr->dev->openacc.async_set_async_func (acc_async_sync);
-
- if (!ret)
- gomp_fatal ("error in %s", libfnname);
-
out:
if (profiling_setup_p)
{
@@ -381,7 +374,7 @@ acc_is_present (void *h, size_t s)
gomp_mutex_unlock (&acc_dev->lock);
- return n != NULL;
+ return (n ? 1 : 0);
}
/* Create a mapping for host [H,+S] -> device [D,+S] */
@@ -613,17 +606,13 @@ present_create_copy (unsigned f, void *h, size_t s
gomp_mutex_unlock (&acc_dev->lock);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
+ goacc_aq aq = get_goacc_asyncqueue (async);
- tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
- GOMP_MAP_VARS_OPENACC);
+ tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+ &kinds, true, GOMP_MAP_VARS_OPENACC);
/* Initialize dynamic refcount. */
tgt->list[0].key->dynamic_refcount = 1;
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
gomp_mutex_lock (&acc_dev->lock);
d = tgt->to_free;
@@ -798,11 +787,8 @@ delete_copyout (unsigned f, void *h, size_t s, int
if (f & FLAG_COPYOUT)
{
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
- acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ gomp_copy_dev2host (acc_dev, aq, h, d, s);
}
gomp_remove_var (acc_dev, n);
}
@@ -904,19 +890,15 @@ update_dev_host (int is_dev, void *h, size_t s, in
d = (void *) (n->tgt->tgt_start + n->tgt_offset
+ (uintptr_t) h - n->host_start);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
+ goacc_aq aq = get_goacc_asyncqueue (async);
if (is_dev)
- acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+ gomp_copy_host2dev (acc_dev, aq, d, h, s);
else
- acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
+ gomp_copy_dev2host (acc_dev, aq, h, d, s);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
gomp_mutex_unlock (&acc_dev->lock);
-
+
if (profiling_setup_p)
{
thr->prof_info = NULL;
@@ -978,7 +960,7 @@ gomp_acc_declare_allocate (bool allocate, size_t m
void
gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
- void *kinds)
+ void *kinds, int async)
{
struct target_mem_desc *tgt;
struct goacc_thread *thr = goacc_thread ();
@@ -1008,8 +990,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hos
}
gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__);
- tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
- NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+ NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__);
/* Initialize dynamic refcount. */
@@ -1098,11 +1081,15 @@ gomp_acc_remove_pointer (void *h, size_t s, bool f
t->list[i].copy_from = force_copyfrom ? 1 : 0;
break;
}
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (async);
- gomp_unmap_vars (t, true);
- if (async > acc_async_sync)
- acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+ /* If running synchronously, unmap immediately. */
+ if (async < acc_async_noval)
+ gomp_unmap_vars (t, true);
+ else
+ {
+ goacc_aq aq = get_goacc_asyncqueue (async);
+ goacc_async_copyout_unmap_vars (t, aq);
+ }
}
gomp_mutex_unlock (&acc_dev->lock);
Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c (revision 250497)
+++ libgomp/oacc-parallel.c (working copy)
@@ -215,7 +215,9 @@ GOACC_parallel_keyed (int device, void (*fn) (void
fn (hostaddrs);
goto out;
}
-
+ else if (profiling_dispatch_p)
+ api_info.device_api = acc_device_api_cuda;
+
/* Default: let the runtime choose. */
for (i = 0; i != GOMP_DIM_MAX; i++)
dims[i] = 0;
@@ -260,10 +262,14 @@ GOACC_parallel_keyed (int device, void (*fn) (void
case GOMP_LAUNCH_WAIT:
{
- unsigned num_waits = GOMP_LAUNCH_OP (tag);
+ /* Be careful to cast the op field as a signed 16-bit, and
+ sign-extend to full integer. */
+ int num_waits = ((signed short) GOMP_LAUNCH_OP (tag));
- if (num_waits)
+ if (num_waits > 0)
goacc_wait (async, num_waits, &ap);
+ else if (num_waits == acc_async_noval)
+ acc_wait_all_async (async);
break;
}
@@ -274,8 +280,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void
}
va_end (ap);
- acc_dev->openacc.async_set_async_func (async);
-
if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
{
k.host_start = (uintptr_t) fn;
@@ -307,8 +311,11 @@ GOACC_parallel_keyed (int device, void (*fn) (void
goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
&api_info);
}
- tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
- GOMP_MAP_VARS_OPENACC);
+
+ goacc_aq aq = get_goacc_asyncqueue (async);
+
+ tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+ true, GOMP_MAP_VARS_OPENACC);
if (profiling_dispatch_p)
{
prof_info.event_type = acc_ev_enter_data_end;
@@ -329,14 +336,10 @@ GOACC_parallel_keyed (int device, void (*fn) (void
devaddrs[i] = NULL;
}
- acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
- async, dims, tgt);
-
- /* If running synchronously, unmap immediately. */
- bool copyfrom = true;
- if (async < acc_async_noval)
+ if (aq == NULL)
{
- unmap:
+ acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+ dims, tgt);
if (profiling_dispatch_p)
{
prof_info.event_type = acc_ev_exit_data_start;
@@ -346,7 +349,8 @@ GOACC_parallel_keyed (int device, void (*fn) (void
goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
&api_info);
}
- gomp_unmap_vars (tgt, copyfrom);
+ /* If running synchronously, unmap immediately. */
+ gomp_unmap_vars (tgt, true);
if (profiling_dispatch_p)
{
prof_info.event_type = acc_ev_exit_data_end;
@@ -358,27 +362,11 @@ GOACC_parallel_keyed (int device, void (*fn) (void
}
else
{
- bool async_unmap = false;
- for (size_t i = 0; i < tgt->list_count; i++)
- {
- splay_tree_key k = tgt->list[i].key;
- if (k && k->refcount == 1)
- {
- async_unmap = true;
- break;
- }
- }
- if (async_unmap)
- tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
- else
- {
- copyfrom = false;
- goto unmap;
- }
+ acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+ dims, tgt, aq);
+ goacc_async_copyout_unmap_vars (tgt, aq);
}
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
out:
if (profiling_dispatch_p)
{
@@ -724,7 +712,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
goto out;
}
- if (num_waits)
+ if (num_waits > 0)
{
va_list ap;
@@ -732,9 +720,9 @@ GOACC_enter_exit_data (int device, size_t mapnum,
goacc_wait (async, num_waits, &ap);
va_end (ap);
}
+ else if (num_waits == acc_async_noval)
+ acc_wait_all_async (async);
- acc_dev->openacc.async_set_async_func (async);
-
/* In c, non-pointers and arrays are represented by a single data clause.
Dynamically allocated arrays and subarrays are represented by a data
clause followed by an internal GOMP_MAP_POINTER.
@@ -783,7 +771,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
&sizes[i], &kinds[i]);
else
gomp_acc_insert_pointer (pointer, &hostaddrs[i],
- &sizes[i], &kinds[i]);
+ &sizes[i], &kinds[i], async);
/* Increment 'i' by two because OpenACC requires fortran
arrays to be contiguous, so each PSET is associated with
one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -808,9 +796,9 @@ GOACC_enter_exit_data (int device, size_t mapnum,
if (acc_is_present (hostaddrs[i], sizes[i]))
{
if (finalize)
- acc_delete_finalize (hostaddrs[i], sizes[i]);
+ acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
else
- acc_delete (hostaddrs[i], sizes[i]);
+ acc_delete_async (hostaddrs[i], sizes[i], async);
}
break;
case GOMP_MAP_DECLARE_DEALLOCATE:
@@ -817,9 +805,9 @@ GOACC_enter_exit_data (int device, size_t mapnum,
case GOMP_MAP_FROM:
case GOMP_MAP_FORCE_FROM:
if (finalize)
- acc_copyout_finalize (hostaddrs[i], sizes[i]);
+ acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
else
- acc_copyout (hostaddrs[i], sizes[i]);
+ acc_copyout_async (hostaddrs[i], sizes[i], async);
break;
default:
gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -844,8 +832,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
}
}
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
out:
if (profiling_dispatch_p)
{
@@ -868,18 +854,22 @@ goacc_wait (int async, int num_waits, va_list *ap)
while (num_waits--)
{
int qid = va_arg (*ap, int);
-
- if (acc_async_test (qid))
+ goacc_aq aq = get_goacc_asyncqueue (qid);
+ if (acc_dev->openacc.async.test_func (aq))
continue;
-
if (async == acc_async_sync)
- acc_wait (qid);
+ acc_dev->openacc.async.synchronize_func (aq);
else if (qid == async)
- ;/* If we're waiting on the same asynchronous queue as we're
+ /* If we're waiting on the same asynchronous queue as we're
launching on, the queue itself will order work as
required, so there's no need to wait explicitly. */
+ ;
else
- acc_dev->openacc.async_wait_async_func (qid, async);
+ {
+ goacc_aq aq2 = get_goacc_asyncqueue (async);
+ acc_dev->openacc.async.synchronize_func (aq);
+ acc_dev->openacc.async.serialize_func (aq, aq2);
+ }
}
}
@@ -957,7 +947,7 @@ GOACC_update (int device, size_t mapnum,
goto out;
}
- if (num_waits)
+ if (num_waits > 0)
{
va_list ap;
@@ -965,9 +955,9 @@ GOACC_update (int device, size_t mapnum,
goacc_wait (async, num_waits, &ap);
va_end (ap);
}
+ else if (num_waits == acc_async_noval)
+ acc_wait_all_async (async);
- acc_dev->openacc.async_set_async_func (async);
-
bool update_device = false;
for (i = 0; i < mapnum; ++i)
{
@@ -1007,7 +997,7 @@ GOACC_update (int device, size_t mapnum,
/* Fallthru */
case GOMP_MAP_FORCE_TO:
update_device = true;
- acc_update_device (hostaddrs[i], sizes[i]);
+ acc_update_device_async (hostaddrs[i], sizes[i], async);
break;
case GOMP_MAP_FROM:
@@ -1019,7 +1009,7 @@ GOACC_update (int device, size_t mapnum,
/* Fallthru */
case GOMP_MAP_FORCE_FROM:
update_device = false;
- acc_update_self (hostaddrs[i], sizes[i]);
+ acc_update_self_async (hostaddrs[i], sizes[i], async);
break;
default:
@@ -1028,8 +1018,6 @@ GOACC_update (int device, size_t mapnum,
}
}
- acc_dev->openacc.async_set_async_func (acc_async_sync);
-
out:
if (profiling_dispatch_p)
{
@@ -1075,7 +1063,7 @@ GOACC_wait (int async, int num_waits, ...)
else if (async == acc_async_sync)
acc_wait_all ();
else if (async == acc_async_noval)
- thr->dev->openacc.async_wait_all_async_func (acc_async_noval);
+ acc_wait_all_async (async);
if (profiling_setup_p)
{
Index: libgomp/oacc-plugin.c
===================================================================
--- libgomp/oacc-plugin.c (revision 250497)
+++ libgomp/oacc-plugin.c (working copy)
@@ -30,17 +30,6 @@
#include "oacc-plugin.h"
#include "oacc-int.h"
-void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
-{
- struct target_mem_desc *tgt = ptr;
- struct gomp_device_descr *devicep = tgt->device_descr;
-
- devicep->openacc.async_set_async_func (async);
- gomp_unmap_vars (tgt, true);
- devicep->openacc.async_set_async_func (acc_async_sync);
-}
-
/* Return the target-specific part of the TLS data for the current thread. */
void *
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c (revision 250497)
+++ libgomp/plugin/plugin-nvptx.c (working copy)
@@ -96,21 +96,19 @@ cuda_error (CUresult r)
static unsigned int instantiated_devices = 0;
static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
-struct cuda_map
+/* NVPTX/CUDA specific definition of asynchronous queues. */
+struct goacc_asyncqueue
{
- CUdeviceptr d;
- size_t size;
- bool active;
- struct cuda_map *next;
+ CUstream cuda_stream;
+ pthread_mutex_t lock;
};
-struct ptx_stream
+struct nvptx_callback
{
- CUstream stream;
- pthread_t host_thread;
- bool multithreaded;
- struct cuda_map *map;
- struct ptx_stream *next;
+ void (*fn) (void *);
+ void *ptr;
+ struct goacc_asyncqueue *aq;
+ struct nvptx_callback *next;
};
/* Thread-specific data for PTX. */
@@ -117,179 +115,12 @@ static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTE
struct nvptx_thread
{
- struct ptx_stream *current_stream;
+ /* We currently have this embedded inside the plugin because libgomp manages
+ devices through integer target_ids. This might be better if using an
+ opaque target-specific pointer directly from gomp_device_descr. */
struct ptx_device *ptx_dev;
};
-static struct cuda_map *
-cuda_map_create (struct goacc_thread *thr, size_t size)
-{
- struct cuda_map *map = GOMP_PLUGIN_malloc (sizeof (struct cuda_map));
-
- assert (map);
-
- map->next = NULL;
- map->size = size;
- map->active = false;
-
- CUDA_CALL_ERET (NULL, cuMemAlloc, &map->d, size);
- assert (map->d);
-
- bool profiling_dispatch_p
- = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_alloc;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- /* Always implicit for "data mapping arguments for cuLaunchKernel". */
- data_event_info.data_event.implicit = 1;
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = size;
- data_event_info.data_event.host_ptr = NULL;
- data_event_info.data_event.device_ptr = (void *) map->d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- return map;
-}
-
-static void
-cuda_map_destroy (struct goacc_thread *thr, struct cuda_map *map)
-{
- CUDA_CALL_ASSERT (cuMemFree, map->d);
-
- bool profiling_dispatch_p
- = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_free;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- /* Always implicit for "data mapping arguments for cuLaunchKernel". */
- data_event_info.data_event.implicit = 1;
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = map->size;
- data_event_info.data_event.host_ptr = NULL;
- data_event_info.data_event.device_ptr = (void *) map->d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- free (map);
-}
-
-/* The following map_* routines manage the CUDA device memory that
- contains the data mapping arguments for cuLaunchKernel. Each
- asynchronous PTX stream may have multiple pending kernel
- invocations, which are launched in a FIFO order. As such, the map
- routines maintains a queue of cuLaunchKernel arguments.
-
- Calls to map_push and map_pop must be guarded by ptx_event_lock.
- Likewise, calls to map_init and map_fini are guarded by
- ptx_dev_lock inside GOMP_OFFLOAD_init_device and
- GOMP_OFFLOAD_fini_device, respectively. */
-
-static bool
-map_init (struct goacc_thread *thr, struct ptx_stream *s)
-{
- int size = getpagesize ();
-
- assert (s);
-
- s->map = cuda_map_create (thr, size);
-
- return true;
-}
-
-static bool
-map_fini (struct goacc_thread *thr, struct ptx_stream *s)
-{
- assert (s->map->next == NULL);
- assert (!s->map->active);
-
- cuda_map_destroy (thr, s->map);
-
- return true;
-}
-
-static void
-map_pop (struct goacc_thread *thr, struct ptx_stream *s)
-{
- struct cuda_map *next;
-
- assert (s != NULL);
-
- if (s->map->next == NULL)
- {
- s->map->active = false;
- return;
- }
-
- next = s->map->next;
- cuda_map_destroy (thr, s->map);
- s->map = next;
-}
-
-static CUdeviceptr
-map_push (struct goacc_thread *thr, struct ptx_stream *s, size_t size)
-{
- struct cuda_map *map = NULL, *t = NULL;
-
- assert (s);
- assert (s->map);
-
- /* Each PTX stream requires a separate data region to store the
- launch arguments for cuLaunchKernel. Allocate a new
- cuda_map and push it to the end of the list. */
- if (s->map->active)
- {
- map = cuda_map_create (thr, size);
-
- for (t = s->map; t->next != NULL; t = t->next)
- ;
-
- t->next = map;
- }
- else if (s->map->size < size)
- {
- cuda_map_destroy (thr, s->map);
- map = cuda_map_create (thr, size);
- }
- else
- map = s->map;
-
- s->map = map;
- s->map->active = true;
-
- return s->map->d;
-}
-
/* Target data function launch information. */
struct targ_fn_launch
@@ -342,22 +173,18 @@ struct ptx_image_data
struct ptx_image_data *next;
};
+struct ptx_free_block
+{
+ void *ptr;
+ struct ptx_free_block *next;
+};
+
struct ptx_device
{
CUcontext ctx;
bool ctx_shared;
CUdevice dev;
- struct ptx_stream *null_stream;
- /* All non-null streams associated with this device (actually context),
- either created implicitly or passed in from the user (via
- acc_set_cuda_stream). */
- struct ptx_stream *active_streams;
- struct {
- struct ptx_stream **arr;
- int size;
- } async_streams;
- /* A lock for use when manipulating the above stream list and array. */
- pthread_mutex_t stream_lock;
+
int ord;
bool overlap;
bool map;
@@ -381,32 +208,13 @@ struct ptx_device
struct ptx_image_data *images; /* Images loaded on device. */
pthread_mutex_t image_lock; /* Lock for above list. */
-
- struct ptx_device *next;
-};
-enum ptx_event_type
-{
- PTX_EVT_MEM,
- PTX_EVT_KNL,
- PTX_EVT_SYNC,
- PTX_EVT_ASYNC_CLEANUP
-};
+ struct ptx_free_block *free_blocks;
+ pthread_mutex_t free_blocks_lock;
-struct ptx_event
-{
- CUevent *evt;
- int type;
- void *addr;
- int ord;
- int val;
-
- struct ptx_event *next;
+ struct ptx_device *next;
};
-static pthread_mutex_t ptx_event_lock;
-static struct ptx_event *ptx_events;
-
static struct ptx_device **ptx_devices;
static inline struct nvptx_thread *
@@ -415,190 +223,6 @@ nvptx_thread (void)
return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
}
-static bool
-init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
-{
- int i;
- struct ptx_stream *null_stream
- = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
- null_stream->stream = NULL;
- null_stream->host_thread = pthread_self ();
- null_stream->multithreaded = true;
- if (!map_init (NULL, null_stream))
- return false;
-
- ptx_dev->null_stream = null_stream;
- ptx_dev->active_streams = NULL;
- pthread_mutex_init (&ptx_dev->stream_lock, NULL);
-
- if (concurrency < 1)
- concurrency = 1;
-
- /* This is just a guess -- make space for as many async streams as the
- current device is capable of concurrently executing. This can grow
- later as necessary. No streams are created yet. */
- ptx_dev->async_streams.arr
- = GOMP_PLUGIN_malloc (concurrency * sizeof (struct ptx_stream *));
- ptx_dev->async_streams.size = concurrency;
-
- for (i = 0; i < concurrency; i++)
- ptx_dev->async_streams.arr[i] = NULL;
-
- return true;
-}
-
-static bool
-fini_streams_for_device (struct ptx_device *ptx_dev)
-{
- free (ptx_dev->async_streams.arr);
-
- bool ret = true;
- while (ptx_dev->active_streams != NULL)
- {
- struct ptx_stream *s = ptx_dev->active_streams;
- ptx_dev->active_streams = ptx_dev->active_streams->next;
-
- ret &= map_fini (NULL, s);
-
- CUresult r = cuStreamDestroy (s->stream);
- if (r != CUDA_SUCCESS)
- {
- GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r));
- ret = false;
- }
- free (s);
- }
-
- ret &= map_fini (NULL, ptx_dev->null_stream);
- free (ptx_dev->null_stream);
- return ret;
-}
-
-/* Select a stream for (OpenACC-semantics) ASYNC argument for the current
- thread THREAD (and also current device/context). If CREATE is true, create
- the stream if it does not exist (or use EXISTING if it is non-NULL), and
- associate the stream with the same thread argument. Returns stream to use
- as result. */
-
-static struct ptx_stream *
-select_stream_for_async (int async, pthread_t thread, bool create,
- CUstream existing)
-{
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
- /* Local copy of TLS variable. */
- struct ptx_device *ptx_dev = nvthd->ptx_dev;
- struct ptx_stream *stream = NULL;
- int orig_async = async;
-
- /* The special value acc_async_noval (-1) maps to the thread-specific
- default async stream. */
- if (async == acc_async_noval)
- async = GOMP_PLUGIN_acc_thread_default_async ();
-
- if (create)
- pthread_mutex_lock (&ptx_dev->stream_lock);
-
- /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the
- null stream, and in fact better performance may be obtainable if it doesn't
- (because the null stream enforces overly-strict synchronisation with
- respect to other streams for legacy reasons, and that's probably not
- needed with OpenACC). Maybe investigate later. */
- if (async == acc_async_sync)
- stream = ptx_dev->null_stream;
- else if (async >= 0 && async < ptx_dev->async_streams.size
- && ptx_dev->async_streams.arr[async] && !(create && existing))
- stream = ptx_dev->async_streams.arr[async];
- else if (async >= 0 && create)
- {
- if (async >= ptx_dev->async_streams.size)
- {
- int i, newsize = ptx_dev->async_streams.size * 2;
-
- if (async >= newsize)
- newsize = async + 1;
-
- ptx_dev->async_streams.arr
- = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr,
- newsize * sizeof (struct ptx_stream *));
-
- for (i = ptx_dev->async_streams.size; i < newsize; i++)
- ptx_dev->async_streams.arr[i] = NULL;
-
- ptx_dev->async_streams.size = newsize;
- }
-
- /* Create a new stream on-demand if there isn't one already, or if we're
- setting a particular async value to an existing (externally-provided)
- stream. */
- if (!ptx_dev->async_streams.arr[async] || existing)
- {
- CUresult r;
- struct ptx_stream *s
- = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
- if (existing)
- s->stream = existing;
- else
- {
- r = cuStreamCreate (&s->stream, CU_STREAM_DEFAULT);
- if (r != CUDA_SUCCESS)
- {
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- GOMP_PLUGIN_fatal ("cuStreamCreate error: %s",
- cuda_error (r));
- }
- }
-
- /* If CREATE is true, we're going to be queueing some work on this
- stream. Associate it with the current host thread. */
- s->host_thread = thread;
- s->multithreaded = false;
-
- if (!map_init (thr, s))
- {
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- GOMP_PLUGIN_fatal ("map_init fail");
- }
-
- s->next = ptx_dev->active_streams;
- ptx_dev->active_streams = s;
- ptx_dev->async_streams.arr[async] = s;
- }
-
- stream = ptx_dev->async_streams.arr[async];
- }
- else if (async < 0)
- {
- if (create)
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- GOMP_PLUGIN_fatal ("bad async %d", async);
- }
-
- if (create)
- {
- assert (stream != NULL);
-
- /* If we're trying to use the same stream from different threads
- simultaneously, set stream->multithreaded to true. This affects the
- behaviour of acc_async_test_all and acc_wait_all, which are supposed to
- only wait for asynchronous launches from the same host thread they are
- invoked on. If multiple threads use the same async value, we make note
- of that here and fall back to testing/waiting for all threads in those
- functions. */
- if (thread != stream->host_thread)
- stream->multithreaded = true;
-
- pthread_mutex_unlock (&ptx_dev->stream_lock);
- }
- else if (stream && !stream->multithreaded
- && !pthread_equal (stream->host_thread, thread))
- GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async);
-
- return stream;
-}
-
/* Initialize the device. Return TRUE on success, else FALSE. PTX_DEV_LOCK
should be locked on entry and remains locked on exit. */
@@ -611,9 +235,6 @@ nvptx_init (void)
return true;
CUDA_CALL (cuInit, 0);
- ptx_events = NULL;
- pthread_mutex_init (&ptx_event_lock, NULL);
-
CUDA_CALL (cuDeviceGetCount, &ndevs);
ptx_devices = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device *)
* ndevs);
@@ -632,6 +253,11 @@ nvptx_attach_host_thread_to_device (int n)
CUcontext thd_ctx;
r = cuCtxGetDevice (&dev);
+ if (r == CUDA_ERROR_NOT_PERMITTED)
+ {
+ /* Assume we're in a CUDA callback, just return true. */
+ return true;
+ }
if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
{
GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
@@ -759,6 +385,9 @@ nvptx_open_device (int n)
ptx_dev->images = NULL;
pthread_mutex_init (&ptx_dev->image_lock, NULL);
+ ptx_dev->free_blocks = NULL;
+ pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
+
GOMP_PLUGIN_debug (0, "Nvidia device %d:\n\tGPU_OVERLAP = %d\n"
"\tCAN_MAP_HOST_MEMORY = %d\n\tCONCURRENT_KERNELS = %d\n"
"\tCOMPUTE_MODE = %d\n\tINTEGRATED = %d\n"
@@ -775,9 +404,6 @@ nvptx_open_device (int n)
ptx_dev->max_registers_per_multiprocessor,
ptx_dev->max_shared_memory_per_multiprocessor);
- if (!init_streams_for_device (ptx_dev, async_engines))
- return NULL;
-
return ptx_dev;
}
@@ -787,9 +413,15 @@ nvptx_close_device (struct ptx_device *ptx_dev)
if (!ptx_dev)
return true;
- if (!fini_streams_for_device (ptx_dev))
- return false;
-
+ for (struct ptx_free_block *b = ptx_dev->free_blocks; b;)
+ {
+ struct ptx_free_block *b_next = b->next;
+ CUDA_CALL (cuMemFree, (CUdeviceptr) b->ptr);
+ free (b);
+ b = b_next;
+ }
+
+ pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
pthread_mutex_destroy (&ptx_dev->image_lock);
if (!ptx_dev->ctx_shared)
@@ -913,134 +545,14 @@ link_ptx (CUmodule *module, const struct targ_ptx_
}
static void
-event_gc (bool memmap_lockable)
-{
- struct ptx_event *ptx_event = ptx_events;
- struct ptx_event *async_cleanups = NULL;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- pthread_mutex_lock (&ptx_event_lock);
-
- while (ptx_event != NULL)
- {
- CUresult r;
- struct ptx_event *e = ptx_event;
-
- ptx_event = ptx_event->next;
-
- if (e->ord != nvthd->ptx_dev->ord)
- continue;
-
- r = cuEventQuery (*e->evt);
- if (r == CUDA_SUCCESS)
- {
- bool append_async = false;
- CUevent *te;
-
- te = e->evt;
-
- switch (e->type)
- {
- case PTX_EVT_MEM:
- case PTX_EVT_SYNC:
- break;
-
- case PTX_EVT_KNL:
- map_pop (thr, e->addr);
- break;
-
- case PTX_EVT_ASYNC_CLEANUP:
- {
- /* The function gomp_plugin_async_unmap_vars needs to claim the
- memory-map splay tree lock for the current device, so we
- can't call it when one of our callers has already claimed
- the lock. In that case, just delay the GC for this event
- until later. */
- if (!memmap_lockable)
- continue;
-
- append_async = true;
- }
- break;
- }
-
- cuEventDestroy (*te);
- free ((void *)te);
-
- /* Unlink 'e' from ptx_events list. */
- if (ptx_events == e)
- ptx_events = ptx_events->next;
- else
- {
- struct ptx_event *e_ = ptx_events;
- while (e_->next != e)
- e_ = e_->next;
- e_->next = e_->next->next;
- }
-
- if (append_async)
- {
- e->next = async_cleanups;
- async_cleanups = e;
- }
- else
- free (e);
- }
- }
-
- pthread_mutex_unlock (&ptx_event_lock);
-
- /* We have to do these here, after ptx_event_lock is released. */
- while (async_cleanups)
- {
- struct ptx_event *e = async_cleanups;
- async_cleanups = async_cleanups->next;
-
- GOMP_PLUGIN_async_unmap_vars (e->addr, e->val);
- free (e);
- }
-}
-
-static void
-event_add (enum ptx_event_type type, CUevent *e, void *h, int val)
-{
- struct ptx_event *ptx_event;
- struct nvptx_thread *nvthd = nvptx_thread ();
-
- assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC
- || type == PTX_EVT_ASYNC_CLEANUP);
-
- ptx_event = GOMP_PLUGIN_malloc (sizeof (struct ptx_event));
- ptx_event->type = type;
- ptx_event->evt = e;
- ptx_event->addr = h;
- ptx_event->ord = nvthd->ptx_dev->ord;
- ptx_event->val = val;
-
- pthread_mutex_lock (&ptx_event_lock);
-
- ptx_event->next = ptx_events;
- ptx_events = ptx_event;
-
- pthread_mutex_unlock (&ptx_event_lock);
-}
-
-static void
nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
- int async, unsigned *dims, void *targ_mem_desc)
+ unsigned *dims, void *targ_mem_desc,
+ CUdeviceptr dp, CUstream stream)
{
struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
CUfunction function;
- CUresult r;
int i;
- struct ptx_stream *dev_str;
void *kargs[1];
- void *hp;
- CUdeviceptr dp;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
- const char *maybe_abort_msg = "(perhaps abort was called)";
int cpu_size = nvptx_thread ()->ptx_dev->max_threads_per_multiprocessor;
int block_size = nvptx_thread ()->ptx_dev->max_threads_per_block;
int dev_size = nvptx_thread ()->ptx_dev->multiprocessor_count;
@@ -1052,9 +564,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
function = targ_fn->fn;
- dev_str = select_stream_for_async (async, pthread_self (), false, NULL);
- assert (dev_str == nvthd->current_stream);
-
/* Initialize the launch dimensions. Typically this is constant,
provided by the device compiler, but we must permit runtime
values. */
@@ -1185,61 +694,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
threads_per_block);
}
- /* This reserves a chunk of a pre-allocated page of memory mapped on both
- the host and the device. HP is a host pointer to the new chunk, and DP is
- the corresponding device pointer. */
- pthread_mutex_lock (&ptx_event_lock);
- dp = map_push (thr, dev_str, mapnum * sizeof (void *));
- pthread_mutex_unlock (&ptx_event_lock);
-
- GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);
-
- /* Copy the array of arguments to the mapped page. */
- hp = alloca(sizeof(void *) * mapnum);
- for (i = 0; i < mapnum; i++)
- ((void **) hp)[i] = devaddrs[i] != 0 ? devaddrs[i] : hostaddrs[i];
-
- /* Copy the (device) pointers to arguments to the device (dp and hp might in
- fact have the same value on a unified-memory system). */
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_start;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- /* Always implicit for "data mapping arguments for cuLaunchKernel". */
- data_event_info.data_event.implicit = 1;
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = mapnum * sizeof (void *);
- data_event_info.data_event.host_ptr = hp;
- data_event_info.data_event.device_ptr = (void *) dp;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, hp,
- mapnum * sizeof (void *));
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_end;
- data_event_info.data_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
GOMP_PLUGIN_debug (0, " %s: kernel %s: launch"
" gangs=%u, workers=%u, vectors=%u\n",
__FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
@@ -1251,7 +705,11 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
// num_workers ntid.y
// vector length ntid.x
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ acc_prof_info *prof_info = thr->prof_info;
acc_event_info enqueue_launch_event_info;
+ acc_api_info *api_info = thr->api_info;
+ bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
if (profiling_dispatch_p)
{
prof_info->event_type = acc_ev_enqueue_launch_start;
@@ -1279,11 +737,13 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
api_info);
}
+
kargs[0] = &dp;
CUDA_CALL_ASSERT (cuLaunchKernel, function,
dims[GOMP_DIM_GANG], 1, 1,
dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
- 0, dev_str->stream, kargs, 0);
+ 0, stream, kargs, 0);
+
if (profiling_dispatch_p)
{
prof_info->event_type = acc_ev_enqueue_launch_end;
@@ -1293,91 +753,8 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
api_info);
}
- acc_event_info wait_event_info;
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_start;
-
- wait_event_info.other_event.event_type = prof_info->event_type;
- wait_event_info.other_event.valid_bytes
- = _ACC_OTHER_EVENT_INFO_VALID_BYTES;
- wait_event_info.other_event.parent_construct
- /* TODO = compute_construct_event_info.other_event.parent_construct */
- = acc_construct_parallel; //TODO: kernels...
- wait_event_info.other_event.implicit = 1;
- wait_event_info.other_event.tool_info = NULL;
-
- api_info->device_api = acc_device_api_cuda;
- }
-#ifndef DISABLE_ASYNC
- if (async < acc_async_noval)
- {
- if (profiling_dispatch_p)
- {
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- r = cuStreamSynchronize (dev_str->stream);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_end;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- if (r == CUDA_ERROR_LAUNCH_FAILED)
- GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
- maybe_abort_msg);
- else if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
- }
- else
- {
- CUevent *e;
-
- e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-
- r = cuEventCreate (e, CU_EVENT_DISABLE_TIMING);
- if (r == CUDA_ERROR_LAUNCH_FAILED)
- GOMP_PLUGIN_fatal ("cuEventCreate error: %s %s\n", cuda_error (r),
- maybe_abort_msg);
- else if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
-
- event_gc (true);
-
- CUDA_CALL_ASSERT (cuEventRecord, *e, dev_str->stream);
-
- event_add (PTX_EVT_KNL, e, (void *)dev_str, 0);
- }
-#else
- if (profiling_dispatch_p)
- {
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- r = cuCtxSynchronize ();
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_end;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- if (r == CUDA_ERROR_LAUNCH_FAILED)
- GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
- maybe_abort_msg);
- else if (r != CUDA_SUCCESS)
- GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-#endif
-
GOMP_PLUGIN_debug (0, " %s: kernel %s: finished\n", __FUNCTION__,
targ_fn->launch->fn);
-
-#ifndef DISABLE_ASYNC
- if (async < acc_async_noval)
-#endif
- map_pop (thr, dev_str);
}
void * openacc_get_current_cuda_context (void);
@@ -1420,8 +797,21 @@ nvptx_alloc (size_t s)
}
static bool
-nvptx_free (void *p)
+nvptx_free (void *p, struct ptx_device *ptx_dev)
{
+ /* Assume callback context if this is null. */
+ if (GOMP_PLUGIN_goacc_thread () == NULL)
+ {
+ struct ptx_free_block *n
+ = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
+ n->ptr = p;
+ pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+ n->next = ptx_dev->free_blocks;
+ ptx_dev->free_blocks = n;
+ pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+ return true;
+ }
+
CUdeviceptr pb;
size_t ps;
@@ -1433,478 +823,9 @@ static bool
}
CUDA_CALL (cuMemFree, (CUdeviceptr) p);
-
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- acc_prof_info *prof_info = thr->prof_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_free;
-
- acc_event_info data_event_info;
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- data_event_info.data_event.implicit = 1; //TODO
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = ps;
- data_event_info.data_event.host_ptr = NULL;
- data_event_info.data_event.device_ptr = p;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
return true;
}
-
-static bool
-nvptx_host2dev (void *d, const void *h, size_t s)
-{
- CUdeviceptr pb;
- size_t ps;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- if (!s)
- return true;
- if (!d)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
-
- CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
- if (!pb)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
- if (!h)
- {
- GOMP_PLUGIN_error ("invalid host address");
- return false;
- }
- if (d == h)
- {
- GOMP_PLUGIN_error ("invalid host or device address");
- return false;
- }
- if ((void *)(d + s) > (void *)(pb + ps))
- {
- GOMP_PLUGIN_error ("invalid size");
- return false;
- }
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_start;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- data_event_info.data_event.implicit = 1; //TODO
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = s;
- data_event_info.data_event.host_ptr = /* TODO */ (void *) h;
- data_event_info.data_event.device_ptr = d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
-#ifndef DISABLE_ASYNC
- if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
- {
- CUevent *e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
- CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
- event_gc (false);
- CUDA_CALL (cuMemcpyHtoDAsync,
- (CUdeviceptr) d, h, s, nvthd->current_stream->stream);
- CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
- event_add (PTX_EVT_MEM, e, (void *)h, 0);
- }
- else
-#endif
- CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) d, h, s);
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_upload_end;
- data_event_info.data_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- return true;
-}
-
-static bool
-nvptx_dev2host (void *h, const void *d, size_t s)
-{
- CUdeviceptr pb;
- size_t ps;
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- if (!s)
- return true;
- if (!d)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
-
- CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
- if (!pb)
- {
- GOMP_PLUGIN_error ("invalid device address");
- return false;
- }
- if (!h)
- {
- GOMP_PLUGIN_error ("invalid host address");
- return false;
- }
- if (d == h)
- {
- GOMP_PLUGIN_error ("invalid host or device address");
- return false;
- }
- if ((void *)(d + s) > (void *)(pb + ps))
- {
- GOMP_PLUGIN_error ("invalid size");
- return false;
- }
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info data_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_download_start;
-
- data_event_info.data_event.event_type = prof_info->event_type;
- data_event_info.data_event.valid_bytes
- = _ACC_DATA_EVENT_INFO_VALID_BYTES;
- data_event_info.data_event.parent_construct
- = acc_construct_parallel; //TODO
- data_event_info.data_event.implicit = 1; //TODO
- data_event_info.data_event.tool_info = NULL;
- data_event_info.data_event.var_name = NULL; //TODO
- data_event_info.data_event.bytes = s;
- data_event_info.data_event.host_ptr = h;
- data_event_info.data_event.device_ptr = /* TODO */ (void *) d;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
-#ifndef DISABLE_ASYNC
- if (nvthd->current_stream != nvthd->ptx_dev->null_stream)
- {
- CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
- CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
- event_gc (false);
- CUDA_CALL (cuMemcpyDtoHAsync,
- h, (CUdeviceptr) d, s, nvthd->current_stream->stream);
- CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
- event_add (PTX_EVT_MEM, e, (void *)h, 0);
- }
- else
-#endif
- CUDA_CALL (cuMemcpyDtoH, h, (CUdeviceptr) d, s);
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_enqueue_download_end;
- data_event_info.data_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
- api_info);
- }
-
- return true;
-}
-
-static void
-nvptx_set_async (int async)
-{
- struct nvptx_thread *nvthd = nvptx_thread ();
- nvthd->current_stream
- = select_stream_for_async (async, pthread_self (), true, NULL);
-}
-
-static int
-nvptx_async_test (int async)
-{
- CUresult r;
- struct ptx_stream *s;
-
- s = select_stream_for_async (async, pthread_self (), false, NULL);
-
- if (!s)
- GOMP_PLUGIN_fatal ("unknown async %d", async);
-
- r = cuStreamQuery (s->stream);
- if (r == CUDA_SUCCESS)
- {
- /* The oacc-parallel.c:goacc_wait function calls this hook to determine
- whether all work has completed on this stream, and if so omits the call
- to the wait hook. If that happens, event_gc might not get called
- (which prevents variables from getting unmapped and their associated
- device storage freed), so call it here. */
- event_gc (true);
- return 1;
- }
- else if (r == CUDA_ERROR_NOT_READY)
- return 0;
-
- GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
- return 0;
-}
-
-static int
-nvptx_async_test_all (void)
-{
- struct ptx_stream *s;
- pthread_t self = pthread_self ();
- struct nvptx_thread *nvthd = nvptx_thread ();
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
- {
- if ((s->multithreaded || pthread_equal (s->host_thread, self))
- && cuStreamQuery (s->stream) == CUDA_ERROR_NOT_READY)
- {
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
- return 0;
- }
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
- event_gc (true);
-
- return 1;
-}
-
-static void
-nvptx_wait (int async)
-{
- struct ptx_stream *s;
-
- s = select_stream_for_async (async, pthread_self (), false, NULL);
- if (!s)
- GOMP_PLUGIN_fatal ("unknown async %d", async);
-
- GOMP_PLUGIN_debug (0, " %s: waiting on async=%d\n", __FUNCTION__, async);
-
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- bool profiling_dispatch_p
- = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
- acc_event_info wait_event_info;
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_wait_start;
-
- wait_event_info.other_event.event_type = prof_info->event_type;
- wait_event_info.other_event.valid_bytes
- = _ACC_OTHER_EVENT_INFO_VALID_BYTES;
- wait_event_info.other_event.parent_construct
- /* TODO = compute_construct_event_info.other_event.parent_construct */
- = acc_construct_parallel; //TODO: kernels...
- wait_event_info.other_event.implicit = 1;
- wait_event_info.other_event.tool_info = NULL;
-
- api_info->device_api = acc_device_api_cuda;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
- CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
- if (profiling_dispatch_p)
- {
- acc_prof_info *prof_info = thr->prof_info;
- acc_api_info *api_info = thr->api_info;
-
- prof_info->event_type = acc_ev_wait_end;
-
- wait_event_info.other_event.event_type = prof_info->event_type;
-
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &wait_event_info,
- api_info);
- }
-
- event_gc (true);
-}
-
-static void
-nvptx_wait_async (int async1, int async2)
-{
- CUevent *e;
- struct ptx_stream *s1, *s2;
- pthread_t self = pthread_self ();
-
- /* The stream that is waiting (rather than being waited for) doesn't
- necessarily have to exist already. */
- s2 = select_stream_for_async (async2, self, true, NULL);
-
- s1 = select_stream_for_async (async1, self, false, NULL);
- if (!s1)
- GOMP_PLUGIN_fatal ("invalid async 1\n");
-
- if (s1 == s2)
- GOMP_PLUGIN_fatal ("identical parameters");
-
- e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
- CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
- event_gc (true);
-
- CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream);
-
- event_add (PTX_EVT_SYNC, e, NULL, 0);
-
- CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0);
-}
-
-static void
-nvptx_wait_all (void)
-{
- CUresult r;
- struct ptx_stream *s;
- pthread_t self = pthread_self ();
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- acc_prof_info *prof_info = thr->prof_info;
- acc_event_info wait_event_info;
- acc_api_info *api_info = thr->api_info;
- bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
- if (profiling_dispatch_p)
- {
- wait_event_info.other_event.valid_bytes
- = _ACC_OTHER_EVENT_INFO_VALID_BYTES;
- wait_event_info.other_event.parent_construct
- /* TODO = compute_construct_event_info.other_event.parent_construct */
- = acc_construct_parallel; //TODO: kernels...
- wait_event_info.other_event.implicit = 1;
- wait_event_info.other_event.tool_info = NULL;
-
- api_info->device_api = acc_device_api_cuda;
- }
-
- /* Wait for active streams initiated by this thread (or by multiple threads)
- to complete. */
- for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
- {
- if (s->multithreaded || pthread_equal (s->host_thread, self))
- {
- r = cuStreamQuery (s->stream);
- if (r == CUDA_SUCCESS)
- continue;
- else if (r != CUDA_ERROR_NOT_READY)
- GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_start;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info,
- &wait_event_info,
- api_info);
- }
- CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
- if (profiling_dispatch_p)
- {
- prof_info->event_type = acc_ev_wait_end;
- wait_event_info.other_event.event_type = prof_info->event_type;
- GOMP_PLUGIN_goacc_profiling_dispatch (prof_info,
- &wait_event_info,
- api_info);
- }
- }
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
- event_gc (true);
-}
-
-static void
-nvptx_wait_all_async (int async)
-{
- struct ptx_stream *waiting_stream, *other_stream;
- CUevent *e;
- struct nvptx_thread *nvthd = nvptx_thread ();
- pthread_t self = pthread_self ();
-
- /* The stream doing the waiting. This could be the first mention of the
- stream, so create it if necessary. */
- waiting_stream
- = select_stream_for_async (async, pthread_self (), true, NULL);
-
- /* Launches on the null stream already block on other streams in the
- context. */
- if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream)
- return;
-
- event_gc (true);
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- for (other_stream = nvthd->ptx_dev->active_streams;
- other_stream != NULL;
- other_stream = other_stream->next)
- {
- if (!other_stream->multithreaded
- && !pthread_equal (other_stream->host_thread, self))
- continue;
-
- e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
- CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
- /* Record an event on the waited-for stream. */
- CUDA_CALL_ASSERT (cuEventRecord, *e, other_stream->stream);
-
- event_add (PTX_EVT_SYNC, e, NULL, 0);
-
- CUDA_CALL_ASSERT (cuStreamWaitEvent, waiting_stream->stream, *e, 0);
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-}
-
static void *
nvptx_get_current_cuda_device (void)
{
@@ -1927,70 +848,6 @@ nvptx_get_current_cuda_context (void)
return nvthd->ptx_dev->ctx;
}
-static void *
-nvptx_get_cuda_stream (int async)
-{
- struct ptx_stream *s;
- struct nvptx_thread *nvthd = nvptx_thread ();
-
- if (!nvthd || !nvthd->ptx_dev)
- return NULL;
-
- s = select_stream_for_async (async, pthread_self (), false, NULL);
-
- return s ? s->stream : NULL;
-}
-
-static int
-nvptx_set_cuda_stream (int async, void *stream)
-{
- struct ptx_stream *oldstream;
- pthread_t self = pthread_self ();
- struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
- struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
-
- if (async < 0)
- GOMP_PLUGIN_fatal ("bad async %d", async);
-
- pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
- /* We have a list of active streams and an array mapping async values to
- entries of that list. We need to take "ownership" of the passed-in stream,
- and add it to our list, removing the previous entry also (if there was one)
- in order to prevent resource leaks. Note the potential for surprise
- here: maybe we should keep track of passed-in streams and leave it up to
- the user to tidy those up, but that doesn't work for stream handles
- returned from acc_get_cuda_stream above... */
-
- oldstream = select_stream_for_async (async, self, false, NULL);
-
- if (oldstream)
- {
- if (nvthd->ptx_dev->active_streams == oldstream)
- nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next;
- else
- {
- struct ptx_stream *s = nvthd->ptx_dev->active_streams;
- while (s->next != oldstream)
- s = s->next;
- s->next = s->next->next;
- }
-
- CUDA_CALL_ASSERT (cuStreamDestroy, oldstream->stream);
-
- if (!map_fini (thr, oldstream))
- GOMP_PLUGIN_fatal ("error when freeing host memory");
-
- free (oldstream);
- }
-
- pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
- (void) select_stream_for_async (async, self, true, (CUstream) stream);
-
- return 1;
-}
-
/* Plugin entry points. */
const char *
@@ -2233,6 +1090,23 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
{
if (!nvptx_attach_host_thread_to_device (ord))
return NULL;
+
+ struct ptx_device *ptx_dev = ptx_devices[ord];
+ struct ptx_free_block *blocks, *tmp;
+
+ pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+ blocks = ptx_dev->free_blocks;
+ ptx_dev->free_blocks = NULL;
+ pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+
+ while (blocks)
+ {
+ tmp = blocks->next;
+ nvptx_free (blocks->ptr, ptx_dev);
+ free (blocks);
+ blocks = tmp;
+ }
+
return nvptx_alloc (size);
}
@@ -2240,84 +1114,165 @@ bool
GOMP_OFFLOAD_free (int ord, void *ptr)
{
return (nvptx_attach_host_thread_to_device (ord)
- && nvptx_free (ptr));
+ && nvptx_free (ptr, ptx_devices[ord]));
}
-bool
-GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
-{
- return (nvptx_attach_host_thread_to_device (ord)
- && nvptx_dev2host (dst, src, n));
-}
-
-bool
-GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
-{
- return (nvptx_attach_host_thread_to_device (ord)
- && nvptx_host2dev (dst, src, n));
-}
-
-void (*device_run) (int n, void *fn_ptr, void *vars) = NULL;
-
void
GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
void **hostaddrs, void **devaddrs,
- int async, unsigned *dims, void *targ_mem_desc)
+ unsigned *dims, void *targ_mem_desc)
{
- nvptx_exec (fn, mapnum, hostaddrs, devaddrs, async, dims, targ_mem_desc);
-}
+ GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);
-void
-GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc, int async)
-{
- struct nvptx_thread *nvthd = nvptx_thread ();
- CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
+ void **hp = NULL;
+ CUdeviceptr dp = 0;
- CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
- CUDA_CALL_ASSERT (cuEventRecord, *e, nvthd->current_stream->stream);
- event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc, async);
-}
+ if (mapnum > 0)
+ {
+ hp = alloca (mapnum * sizeof (void *));
+ for (int i = 0; i < mapnum; i++)
+ hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+ CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+ }
-int
-GOMP_OFFLOAD_openacc_async_test (int async)
-{
- return nvptx_async_test (async);
-}
+ /* Copy the (device) pointers to arguments to the device (dp and hp might in
+ fact have the same value on a unified-memory system). */
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ acc_prof_info *prof_info = thr->prof_info;
+ acc_event_info data_event_info;
+ acc_api_info *api_info = thr->api_info;
+ bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_start;
-int
-GOMP_OFFLOAD_openacc_async_test_all (void)
-{
- return nvptx_async_test_all ();
-}
+ data_event_info.data_event.event_type = prof_info->event_type;
+ data_event_info.data_event.valid_bytes
+ = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+ data_event_info.data_event.parent_construct
+ = acc_construct_parallel; //TODO
+ /* Always implicit for "data mapping arguments for cuLaunchKernel". */
+ data_event_info.data_event.implicit = 1;
+ data_event_info.data_event.tool_info = NULL;
+ data_event_info.data_event.var_name = NULL; //TODO
+ data_event_info.data_event.bytes = mapnum * sizeof (void *);
+ data_event_info.data_event.host_ptr = hp;
+ data_event_info.data_event.device_ptr = (void *) dp;
-void
-GOMP_OFFLOAD_openacc_async_wait (int async)
-{
- nvptx_wait (async);
-}
+ api_info->device_api = acc_device_api_cuda;
-void
-GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2)
-{
- nvptx_wait_async (async1, async2);
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ if (mapnum > 0)
+ CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
+ mapnum * sizeof (void *));
+
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_end;
+ data_event_info.data_event.event_type = prof_info->event_type;
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+ dp, NULL);
+
+ CUresult r = cuStreamSynchronize (NULL);
+ const char *maybe_abort_msg = "(perhaps abort was called)";
+ if (r == CUDA_ERROR_LAUNCH_FAILED)
+ GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+ maybe_abort_msg);
+ else if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+ CUDA_CALL_ASSERT (cuMemFree, dp);
}
-void
-GOMP_OFFLOAD_openacc_async_wait_all (void)
+static void
+cuda_free_argmem (void *ptr)
{
- nvptx_wait_all ();
+ void **block = (void **) ptr;
+ nvptx_free (block[0], (struct ptx_device *) block[1]);
+ free (block);
}
void
-GOMP_OFFLOAD_openacc_async_wait_all_async (int async)
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+ void **hostaddrs, void **devaddrs,
+ unsigned *dims, void *targ_mem_desc,
+ struct goacc_asyncqueue *aq)
{
- nvptx_wait_all_async (async);
-}
+ GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__);
-void
-GOMP_OFFLOAD_openacc_async_set_async (int async)
-{
- nvptx_set_async (async);
+ void **hp = NULL;
+ CUdeviceptr dp = 0;
+ void **block = NULL;
+
+ if (mapnum > 0)
+ {
+ block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+ hp = block + 2;
+ for (int i = 0; i < mapnum; i++)
+ hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+ CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+ }
+
+ /* Copy the (device) pointers to arguments to the device (dp and hp might in
+ fact have the same value on a unified-memory system). */
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ acc_prof_info *prof_info = thr->prof_info;
+ acc_event_info data_event_info;
+ acc_api_info *api_info = thr->api_info;
+ bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_start;
+
+ data_event_info.data_event.event_type = prof_info->event_type;
+ data_event_info.data_event.valid_bytes
+ = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+ data_event_info.data_event.parent_construct
+ = acc_construct_parallel; //TODO
+ /* Always implicit for "data mapping arguments for cuLaunchKernel". */
+ data_event_info.data_event.implicit = 1;
+ data_event_info.data_event.tool_info = NULL;
+ data_event_info.data_event.var_name = NULL; //TODO
+ data_event_info.data_event.bytes = mapnum * sizeof (void *);
+ data_event_info.data_event.host_ptr = hp;
+ data_event_info.data_event.device_ptr = (void *) dp;
+
+ api_info->device_api = acc_device_api_cuda;
+
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ if (mapnum > 0)
+ {
+ CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
+ mapnum * sizeof (void *), aq->cuda_stream);
+ block[0] = (void *) dp;
+
+ struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+ struct nvptx_thread *nvthd = (struct nvptx_thread *) thr->target_tls;
+ block[1] = (void *) nvthd->ptx_dev;
+ }
+
+ if (profiling_dispatch_p)
+ {
+ prof_info->event_type = acc_ev_enqueue_upload_end;
+ data_event_info.data_event.event_type = prof_info->event_type;
+ GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+ api_info);
+ }
+
+ nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+ dp, aq->cuda_stream);
+
+ if (mapnum > 0)
+ GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
}
void *
@@ -2339,7 +1294,6 @@ GOMP_OFFLOAD_openacc_create_thread_data (int ord)
if (!thd_ctx)
CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx);
- nvthd->current_stream = ptx_dev->null_stream;
nvthd->ptx_dev = ptx_dev;
return (void *) nvthd;
@@ -2364,17 +1318,169 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (voi
}
/* NOTE: This returns a CUstream, not a ptx_stream pointer. */
-
void *
-GOMP_OFFLOAD_openacc_cuda_get_stream (int async)
+GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
{
- return nvptx_get_cuda_stream (async);
+ return (void *) aq->cuda_stream;
}
/* NOTE: This takes a CUstream, not a ptx_stream pointer. */
+int
+GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
+{
+ if (aq->cuda_stream)
+ {
+ CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+ CUDA_CALL_ASSERT (cuStreamDestroy, aq->cuda_stream);
+ }
+ aq->cuda_stream = (CUstream) stream;
+ return 1;
+}
+
+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+ struct goacc_asyncqueue *aq
+ = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+ CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
+ return aq;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+{
+ CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
+ free (aq);
+ return true;
+}
+
int
-GOMP_OFFLOAD_openacc_cuda_set_stream (int async, void *stream)
+GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
{
- return nvptx_set_cuda_stream (async, stream);
+ CUresult r = cuStreamQuery (aq->cuda_stream);
+ if (r == CUDA_SUCCESS)
+ return 1;
+ if (r == CUDA_ERROR_NOT_READY)
+ return 0;
+
+ GOMP_PLUGIN_error ("cuStreamQuery error: %s", cuda_error (r));
+ return -1;
}
+
+void
+GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+{
+ CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
+ struct goacc_asyncqueue *aq2)
+{
+ CUevent e;
+ CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
+ CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
+ CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+}
+
+static void
+cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
+{
+ if (res != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+ struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
+ cb->fn (cb->ptr);
+ free (ptr);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
+ void (*callback_fn)(void *),
+ void *userptr)
+{
+ struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
+ b->fn = callback_fn;
+ b->ptr = userptr;
+ b->aq = aq;
+ CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
+ cuda_callback_wrapper, (void *) b, 0);
+}
+
+static bool
+cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
+{
+ CUdeviceptr pb;
+ size_t ps;
+ if (!s)
+ return true;
+ if (!d)
+ {
+ GOMP_PLUGIN_error ("invalid device address");
+ return false;
+ }
+ CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
+ if (!pb)
+ {
+ GOMP_PLUGIN_error ("invalid device address");
+ return false;
+ }
+ if (!h)
+ {
+ GOMP_PLUGIN_error ("invalid host address");
+ return false;
+ }
+ if (d == h)
+ {
+ GOMP_PLUGIN_error ("invalid host or device address");
+ return false;
+ }
+ if ((void *)(d + s) > (void *)(pb + ps))
+ {
+ GOMP_PLUGIN_error ("invalid size");
+ return false;
+ }
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (src, dst, n))
+ return false;
+ CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (dst, src, n))
+ return false;
+ CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
+ size_t n, struct goacc_asyncqueue *aq)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (src, dst, n))
+ return false;
+ CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
+ return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
+ size_t n, struct goacc_asyncqueue *aq)
+{
+ if (!nvptx_attach_host_thread_to_device (ord)
+ || !cuda_memcpy_sanity_check (dst, src, n))
+ return false;
+ CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
+ return true;
+}
Index: libgomp/target.c
===================================================================
--- libgomp/target.c (revision 250497)
+++ libgomp/target.c (working copy)
@@ -187,18 +187,44 @@ gomp_device_copy (struct gomp_device_descr *device
}
}
-static void
+static inline void
+goacc_device_copy_async (struct gomp_device_descr *devicep,
+ bool (*copy_func) (int, void *, const void *, size_t,
+ struct goacc_asyncqueue *),
+ const char *dst, void *dstaddr,
+ const char *src, const void *srcaddr,
+ size_t size, struct goacc_asyncqueue *aq)
+{
+ if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
+ {
+ gomp_mutex_unlock (&devicep->lock);
+ gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
+ src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
+ }
+}
+
+attribute_hidden void
gomp_copy_host2dev (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq,
void *d, const void *h, size_t sz)
{
- gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
+ if (aq)
+ goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+ "dev", d, "host", h, sz, aq);
+ else
+ gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
}
-static void
+attribute_hidden void
gomp_copy_dev2host (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq,
void *h, const void *d, size_t sz)
{
- gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
+ if (aq)
+ goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
+ "host", h, "dev", d, sz, aq);
+ else
+ gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
}
static void
@@ -216,7 +242,8 @@ gomp_free_device_memory (struct gomp_device_descr
Helper function of gomp_map_vars. */
static inline void
-gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
+gomp_map_vars_existing (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq, splay_tree_key oldn,
splay_tree_key newn, struct target_var_desc *tgt_var,
unsigned char kind)
{
@@ -238,7 +265,7 @@ static inline void
}
if (GOMP_MAP_ALWAYS_TO_P (kind))
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
+ newn->host_start - oldn->host_start),
(void *) newn->host_start,
@@ -256,8 +283,8 @@ get_kind (bool short_mapkind, void *kinds, int idx
}
static void
-gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr,
- uintptr_t target_offset, uintptr_t bias)
+gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+ uintptr_t host_ptr, uintptr_t target_offset, uintptr_t bias)
{
struct gomp_device_descr *devicep = tgt->device_descr;
struct splay_tree_s *mem_map = &devicep->mem_map;
@@ -268,7 +295,7 @@ static void
{
cur_node.tgt_offset = (uintptr_t) NULL;
/* FIXME: see comment about coalescing host/dev transfers below. */
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start + target_offset),
(void *) &cur_node.tgt_offset,
sizeof (void *));
@@ -291,7 +318,7 @@ static void
to initialize the pointer with. */
cur_node.tgt_offset -= bias;
/* FIXME: see comment about coalescing host/dev transfers below. */
- gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + target_offset),
+ gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
(void *) &cur_node.tgt_offset, sizeof (void *));
}
@@ -329,9 +356,9 @@ gomp_map_pset (struct target_mem_desc *tgt, uintpt
}
static void
-gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
- size_t first, size_t i, void **hostaddrs,
- size_t *sizes, void *kinds)
+gomp_map_fields_existing (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+ splay_tree_key n, size_t first, size_t i,
+ void **hostaddrs, size_t *sizes, void *kinds)
{
struct gomp_device_descr *devicep = tgt->device_descr;
struct splay_tree_s *mem_map = &devicep->mem_map;
@@ -348,7 +375,7 @@ static void
&& n2->tgt == n->tgt
&& n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
{
- gomp_map_vars_existing (devicep, n2, &cur_node,
+ gomp_map_vars_existing (devicep, aq, n2, &cur_node,
&tgt->list[i], kind & typemask);
return;
}
@@ -364,7 +391,7 @@ static void
&& n2->host_start - n->host_start
== n2->tgt_offset - n->tgt_offset)
{
- gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
kind & typemask);
return;
}
@@ -376,7 +403,7 @@ static void
&& n2->tgt == n->tgt
&& n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
{
- gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
kind & typemask);
return;
}
@@ -547,6 +574,18 @@ gomp_map_vars (struct gomp_device_descr *devicep,
void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
{
+ struct target_mem_desc *tgt;
+ tgt = gomp_map_vars_async (devicep, NULL, mapnum, hostaddrs, devaddrs,
+ sizes, kinds, short_mapkind, pragma_kind);
+ return tgt;
+}
+
+attribute_hidden struct target_mem_desc *
+gomp_map_vars_async (struct gomp_device_descr *devicep,
+ struct goacc_asyncqueue *aq, size_t mapnum,
+ void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
+ bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
+{
size_t i, tgt_align, tgt_size, not_found_cnt = 0;
bool has_firstprivate = false;
const int rshift = short_mapkind ? 8 : 3;
@@ -665,7 +704,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
continue;
}
for (i = first; i <= last; i++)
- gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+ gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
sizes, kinds);
i--;
continue;
@@ -722,7 +761,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
else
n = splay_tree_lookup (mem_map, &cur_node);
if (n && n->refcount != REFCOUNT_LINK)
- gomp_map_vars_existing (devicep, n, &cur_node, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n, &cur_node, &tgt->list[i],
kind & typemask);
else
{
@@ -790,7 +829,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
if (n)
{
assert (n->refcount != REFCOUNT_LINK);
- gomp_map_vars_existing (devicep, n, &cur_node, row_desc,
+ gomp_map_vars_existing (devicep, aq, n, &cur_node, row_desc,
kind & typemask);
}
else
@@ -866,7 +905,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
tgt_size = (tgt_size + align - 1) & ~(align - 1);
tgt->list[i].offset = tgt_size;
len = sizes[i];
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start + tgt_size),
(void *) hostaddrs[i], len);
tgt_size += len;
@@ -900,7 +939,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
continue;
}
for (i = first; i <= last; i++)
- gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+ gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
sizes, kinds);
i--;
continue;
@@ -920,7 +959,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i - 1);
if (cur_node.tgt_offset)
cur_node.tgt_offset -= sizes[i];
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (n->tgt->tgt_start
+ n->tgt_offset
+ cur_node.host_start
@@ -950,7 +989,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
k->host_end = k->host_start + sizeof (void *);
splay_tree_key n = splay_tree_lookup (mem_map, k);
if (n && n->refcount != REFCOUNT_LINK)
- gomp_map_vars_existing (devicep, n, k, &tgt->list[i],
+ gomp_map_vars_existing (devicep, aq, n, k, &tgt->list[i],
kind & typemask);
else
{
@@ -1006,7 +1045,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
/* FIXME: Perhaps add some smarts, like if copying
several adjacent fields from host to target, use some
host buffer to avoid sending each var individually. */
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1013,7 +1052,8 @@ gomp_map_vars (struct gomp_device_descr *devicep,
k->host_end - k->host_start);
break;
case GOMP_MAP_POINTER:
- gomp_map_pointer (tgt, (uintptr_t) *(void **) k->host_start,
+ gomp_map_pointer (tgt, aq,
+ (uintptr_t) *(void **) k->host_start,
k->tgt_offset, sizes[i]);
break;
case GOMP_MAP_TO_PSET:
@@ -1042,7 +1082,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
sizes[j]);
tptr = *(uintptr_t *) hostaddrs[i];
*(uintptr_t *) hostaddrs[i]= toffset;
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1052,7 +1092,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
found_pointer = true;
}
if (!found_pointer)
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1079,7 +1119,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
break;
case GOMP_MAP_FORCE_DEVICEPTR:
assert (k->host_end - k->host_start == sizeof (void *));
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start
+ k->tgt_offset),
(void *) k->host_start,
@@ -1096,9 +1136,8 @@ gomp_map_vars (struct gomp_device_descr *devicep,
/* Set link pointer on target to the device address of the
mapped object. */
void *tgt_addr = (void *) (tgt->tgt_start + k->tgt_offset);
- devicep->host2dev_func (devicep->target_id,
- (void *) n->tgt_offset,
- &tgt_addr, sizeof (void *));
+ gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
+ &tgt_addr, sizeof (void *));
}
array++;
}
@@ -1142,7 +1181,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
if (n)
{
assert (n->refcount != REFCOUNT_LINK);
- gomp_map_vars_existing (devicep, n, &cur_node, row_desc,
+ gomp_map_vars_existing (devicep, aq, n, &cur_node, row_desc,
kind & typemask);
target_row_addr = n->tgt->tgt_start + n->tgt_offset;
}
@@ -1166,7 +1205,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
row_desc->copy_from
= GOMP_MAP_COPY_FROM_P (kind & typemask);
row_desc->always_copy_from
- = GOMP_MAP_COPY_FROM_P (kind & typemask);
+ = GOMP_MAP_ALWAYS_FROM_P (kind & typemask);
row_desc->offset = 0;
row_desc->length = da->data_row_size;
@@ -1175,7 +1214,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
splay_tree_insert (mem_map, array);
if (GOMP_MAP_COPY_TO_P (kind & typemask))
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) tgt->tgt_start + k->tgt_offset,
(void *) k->host_start,
da->data_row_size);
@@ -1191,9 +1230,11 @@ gomp_map_vars (struct gomp_device_descr *devicep,
{
void *ptrblock = gomp_dynamic_array_create_ptrblock
(da, target_ptrblock, target_data_rows + row_start);
- gomp_copy_host2dev (devicep, target_ptrblock, ptrblock,
+ gomp_copy_host2dev (devicep, aq, target_ptrblock, ptrblock,
da->ptrblock_size);
- free (ptrblock);
+ /* Freeing of the ptrblock must be scheduled after the host2dev
+ copy completes. */
+ goacc_async_free (devicep, aq, ptrblock);
}
row_start += da->data_row_num;
@@ -1213,7 +1254,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
{
cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
/* FIXME: see above FIXME comment. */
- gomp_copy_host2dev (devicep,
+ gomp_copy_host2dev (devicep, aq,
(void *) (tgt->tgt_start + i * sizeof (void *)),
(void *) &cur_node.tgt_offset, sizeof (void *));
}
@@ -1232,7 +1273,7 @@ gomp_map_vars (struct gomp_device_descr *devicep,
return tgt;
}
-static void
+attribute_hidden void
gomp_unmap_tgt (struct target_mem_desc *tgt)
{
/* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region. */
@@ -1267,6 +1308,13 @@ gomp_remove_var (struct gomp_device_descr *devicep
attribute_hidden void
gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
{
+ gomp_unmap_vars_async (tgt, do_copyfrom, NULL);
+}
+
+attribute_hidden void
+gomp_unmap_vars_async (struct target_mem_desc *tgt, bool do_copyfrom,
+ struct goacc_asyncqueue *aq)
+{
struct gomp_device_descr *devicep = tgt->device_descr;
if (tgt->list_count == 0)
@@ -1302,7 +1350,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool
if ((do_unmap && do_copyfrom && tgt->list[i].copy_from)
|| tgt->list[i].always_copy_from)
- gomp_copy_dev2host (devicep,
+ gomp_copy_dev2host (devicep, aq,
(void *) (k->host_start + tgt->list[i].offset),
(void *) (k->tgt->tgt_start + k->tgt_offset
+ tgt->list[i].offset),
@@ -1368,9 +1416,9 @@ gomp_update (struct gomp_device_descr *devicep, si
size_t size = cur_node.host_end - cur_node.host_start;
if (GOMP_MAP_COPY_TO_P (kind & typemask))
- gomp_copy_host2dev (devicep, devaddr, hostaddr, size);
+ gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size);
if (GOMP_MAP_COPY_FROM_P (kind & typemask))
- gomp_copy_dev2host (devicep, hostaddr, devaddr, size);
+ gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
}
}
gomp_mutex_unlock (&devicep->lock);
@@ -1691,9 +1739,21 @@ gomp_init_device (struct gomp_device_descr *device
false);
}
+ /* Initialize OpenACC asynchronous queues. */
+ goacc_init_asyncqueues (devicep);
+
devicep->state = GOMP_DEVICE_INITIALIZED;
}
+attribute_hidden bool
+gomp_fini_device (struct gomp_device_descr *devicep)
+{
+ devicep->state = GOMP_DEVICE_FINALIZED;
+ bool ret = goacc_fini_asyncqueues (devicep);
+ ret &= devicep->fini_device_func (devicep->target_id);
+ return ret;
+}
+
attribute_hidden void
gomp_unload_device (struct gomp_device_descr *devicep)
{
@@ -2222,7 +2282,7 @@ gomp_exit_data (struct gomp_device_descr *devicep,
if ((kind == GOMP_MAP_FROM && k->refcount == 0)
|| kind == GOMP_MAP_ALWAYS_FROM)
- gomp_copy_dev2host (devicep, (void *) cur_node.host_start,
+ gomp_copy_dev2host (devicep, NULL, (void *) cur_node.host_start,
(void *) (k->tgt->tgt_start + k->tgt_offset
+ cur_node.host_start
- k->host_start),
@@ -2848,20 +2908,20 @@ gomp_load_plugin_for_device (struct gomp_device_de
if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
{
if (!DLSYM_OPT (openacc.exec, openacc_exec)
- || !DLSYM_OPT (openacc.register_async_cleanup,
- openacc_register_async_cleanup)
- || !DLSYM_OPT (openacc.async_test, openacc_async_test)
- || !DLSYM_OPT (openacc.async_test_all, openacc_async_test_all)
- || !DLSYM_OPT (openacc.async_wait, openacc_async_wait)
- || !DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async)
- || !DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all)
- || !DLSYM_OPT (openacc.async_wait_all_async,
- openacc_async_wait_all_async)
- || !DLSYM_OPT (openacc.async_set_async, openacc_async_set_async)
|| !DLSYM_OPT (openacc.create_thread_data,
openacc_create_thread_data)
|| !DLSYM_OPT (openacc.destroy_thread_data,
- openacc_destroy_thread_data))
+ openacc_destroy_thread_data)
+ || !DLSYM_OPT (openacc.async.construct, openacc_async_construct)
+ || !DLSYM_OPT (openacc.async.destruct, openacc_async_destruct)
+ || !DLSYM_OPT (openacc.async.test, openacc_async_test)
+ || !DLSYM_OPT (openacc.async.synchronize, openacc_async_synchronize)
+ || !DLSYM_OPT (openacc.async.serialize, openacc_async_serialize)
+ || !DLSYM_OPT (openacc.async.queue_callback,
+ openacc_async_queue_callback)
+ || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+ || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
+ || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
{
/* Require all the OpenACC handlers if we have
GOMP_OFFLOAD_CAP_OPENACC_200. */
@@ -2912,10 +2972,7 @@ gomp_target_fini (void)
struct gomp_device_descr *devicep = &devices[i];
gomp_mutex_lock (&devicep->lock);
if (devicep->state == GOMP_DEVICE_INITIALIZED)
- {
- ret = devicep->fini_device_func (devicep->target_id);
- devicep->state = GOMP_DEVICE_FINALIZED;
- }
+ ret = gomp_fini_device (devicep);
gomp_mutex_unlock (&devicep->lock);
if (!ret)
gomp_fatal ("device finalization failed");
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c (working copy)
@@ -206,11 +206,6 @@ void cb_enter_data_start (acc_prof_info *prof_info
assert (event_info->other_event.implicit == 1);
assert (event_info->other_event.tool_info == NULL);
- if (acc_device_type == acc_device_host
- || state < 100) //TODO
- assert (api_info->device_api == acc_device_api_none);
- else
- assert (api_info->device_api == acc_device_api_cuda);
assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
assert (api_info->device_type == prof_info->device_type);
assert (api_info->vendor == -1);
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c (working copy)
@@ -151,7 +151,7 @@ main (int argc, char **argv)
d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
#pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N], N) \
- async (4)
+ wait (1, 2, 3) async (4)
for (int ii = 0; ii < N; ii++)
e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c (working copy)
@@ -162,7 +162,7 @@ main (int argc, char **argv)
d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
#pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
- wait (1) async (4)
+ wait (1, 2, 3) async (4)
for (int ii = 0; ii < N; ii++)
e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c (working copy)
@@ -138,7 +138,7 @@ main (int argc, char **argv)
d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
#pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
- wait (1,5) async (4)
+ wait (1, 2, 3, 5) async (4)
for (int ii = 0; ii < N; ii++)
e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c (working copy)
@@ -46,16 +46,22 @@ main (int argc, char **argv)
abort ();
}
- fprintf (stderr, "CheCKpOInT\n");
- if (acc_async_test (1) != 0)
+ if (acc_async_test (0) != 0)
{
fprintf (stderr, "asynchronous operation not running\n");
abort ();
}
+ /* Test unseen async number. */
+ if (acc_async_test (1) != 1)
+ {
+ fprintf (stderr, "acc_async_test failed on unseen number\n");
+ abort ();
+ }
+
sleep (1);
- if (acc_async_test (1) != 1)
+ if (acc_async_test (0) != 1)
{
fprintf (stderr, "found asynchronous operation still running\n");
abort ();
@@ -65,7 +71,3 @@ main (int argc, char **argv)
return 0;
}
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c (working copy)
@@ -72,14 +72,13 @@ main (int argc, char **argv)
abort ();
}
- fprintf (stderr, "CheCKpOInT\n");
- acc_wait (1);
+ acc_wait (0);
gettimeofday (&tv2, NULL);
t2 = ((tv2.tv_sec - tv1.tv_sec) * 1000000) + (tv2.tv_usec - tv1.tv_usec);
- if (t2 > t1)
+ if (t2 - t1 > 100)
{
fprintf (stderr, "too long 1\n");
abort ();
@@ -87,7 +86,7 @@ main (int argc, char **argv)
gettimeofday (&tv1, NULL);
- acc_wait (1);
+ acc_wait (0);
gettimeofday (&tv2, NULL);
@@ -103,7 +102,3 @@ main (int argc, char **argv)
return 0;
}
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c (working copy)
@@ -84,6 +84,7 @@ main (int argc, char **argv)
for (i = 0; i < N; i++)
{
+ stream = (CUstream) acc_get_cuda_stream (i & 1);
r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, NULL, 0);
if (r != CUDA_SUCCESS)
{
@@ -92,10 +93,10 @@ main (int argc, char **argv)
}
}
- acc_wait_async (0, 1);
-
if (acc_async_test (0) != 0)
abort ();
+
+ acc_wait_async (0, 1);
if (acc_async_test (1) != 0)
abort ();
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c (working copy)
@@ -109,7 +109,7 @@ main (int argc, char **argv)
for (i = 0; i <= N; i++)
{
- if (acc_async_test (i) != 0)
+ if (acc_async_test (i) == 0)
abort ();
}
Index: libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90
===================================================================
--- libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 (revision 250497)
+++ libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 (working copy)
@@ -1,4 +1,5 @@
! { dg-do run }
+! { dg-xfail-run-if "n/a" { openacc_host_selected } }
program main
use openacc
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-07-25 12:51 ` Chung-Lin Tang
@ 2017-07-25 14:09 ` Cesar Philippidis
2017-07-25 15:03 ` Chung-Lin Tang
2017-10-10 18:37 ` Thomas Schwinge
1 sibling, 1 reply; 9+ messages in thread
From: Cesar Philippidis @ 2017-07-25 14:09 UTC (permalink / raw)
To: Chung-Lin Tang, gcc-patches, Thomas Schwinge
On 07/25/2017 05:51 AM, Chung-Lin Tang wrote:
> On 2017/6/29 6:31 AM, Cesar Philippidis wrote:
> Attached is the updated version of the patch, re-tested.
>
> Thomas, do you need some more time to look over it? Or should I commit it first?
I'm not too concerned about the profiling stuff because that should be
considered beta quality (because it's still not fully implemented).
We'll probably need to revisit the locking stuff later, but this patch
is OK for og7. I suspect that we may have to make other locking tweaks
in the nvptx plugin anyway.
Cesar
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-07-25 14:09 ` Cesar Philippidis
@ 2017-07-25 15:03 ` Chung-Lin Tang
0 siblings, 0 replies; 9+ messages in thread
From: Chung-Lin Tang @ 2017-07-25 15:03 UTC (permalink / raw)
To: Cesar Philippidis, gcc-patches, Thomas Schwinge
On 2017/7/25 10:09 PM, Cesar Philippidis wrote:
> On 07/25/2017 05:51 AM, Chung-Lin Tang wrote:
>> On 2017/6/29 6:31 AM, Cesar Philippidis wrote:
>
>> Attached is the updated version of the patch, re-tested.
>>
>> Thomas, do you need some more time to look over it? Or should I commit it first?
>
> I'm not too concerned about the profiling stuff because that should be
> considered beta quality (because it's still not fully implemented).
>
> We'll probably need to revisit the locking stuff later, but this patch
> is OK for og7. I suspect that we may have to make other locking tweaks
> in the nvptx plugin anyway.
>
> Cesar
>
Patch has been committed to gomp-4_0-branch.
Chung-Lin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-07-25 12:51 ` Chung-Lin Tang
2017-07-25 14:09 ` Cesar Philippidis
@ 2017-10-10 18:37 ` Thomas Schwinge
2017-10-10 21:05 ` Cesar Philippidis
1 sibling, 1 reply; 9+ messages in thread
From: Thomas Schwinge @ 2017-10-10 18:37 UTC (permalink / raw)
To: Chung-Lin Tang, Cesar Philippidis, gcc-patches
Hi!
Reported by Cesar for a test case similar to the one below, where we
observe:
acc_prof-cuda-1.exe: [...]/libgomp/oacc-profiling.c:592: goacc_profiling_dispatch_p: Assertion `thr->prof_info == NULL' failed.
This is because of:
On Tue, 25 Jul 2017 20:51:05 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> --- libgomp/oacc-cuda.c (revision 250497)
> +++ libgomp/oacc-cuda.c (working copy)
> @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
> prof_info.async_queue = prof_info.async;
> }
>
> - void *ret = NULL;
> if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
> - ret = thr->dev->openacc.cuda.get_stream_func (async);
> -
> - if (profiling_setup_p)
> {
> - thr->prof_info = NULL;
> - thr->api_info = NULL;
> + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> + return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
> }
> -
> - return ret;
> + return NULL;
> }
Pushed to openacc-gcc-7-branch:
commit db149741171147fa86a9bfe708a9082f508115ac
Author: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue Oct 10 19:25:19 2017 +0200
acc_get_cuda_stream: Clean up data of the OpenACC Profiling Interface
libgomp/
* oacc-cuda.c (acc_get_cuda_stream): Clean up data of the OpenACC
Profiling Interface.
* testsuite/libgomp.oacc-c-c++-common/acc_prof-cuda-1.c: New file.
---
libgomp/oacc-cuda.c | 13 +++++++++++--
.../libgomp.oacc-c-c++-common/acc_prof-cuda-1.c | 16 ++++++++++++++++
2 files changed, 27 insertions(+), 2 deletions(-)
diff --git libgomp/oacc-cuda.c libgomp/oacc-cuda.c
index 1fbe77d..0ac93e9 100644
--- libgomp/oacc-cuda.c
+++ libgomp/oacc-cuda.c
@@ -99,12 +99,21 @@ acc_get_cuda_stream (int async)
prof_info.async_queue = prof_info.async;
}
+ void *ret = NULL;
if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
{
goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
- return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
+ if (aq)
+ ret = thr->dev->openacc.cuda.get_stream_func (aq);
}
- return NULL;
+
+ if (profiling_setup_p)
+ {
+ thr->prof_info = NULL;
+ thr->api_info = NULL;
+ }
+
+ return ret;
}
int
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-cuda-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-cuda-1.c
new file mode 100644
index 0000000..63f5e49
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-cuda-1.c
@@ -0,0 +1,16 @@
+/* TODO: This is to eventually test dispatch of events to callbacks. */
+
+#include <openacc.h>
+
+int main()
+{
+ acc_init(acc_device_default);
+
+ (void) acc_get_cuda_stream(acc_async_default);
+ /* The following used to crash the runtime due to acc_get_cuda_stream not
+ cleaning up data of the OpenACC Profiling Interface. */
+#pragma acc data
+ ;
+
+ return 0;
+}
Grüße
Thomas
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [gomp4] OpenACC async re-work
2017-10-10 18:37 ` Thomas Schwinge
@ 2017-10-10 21:05 ` Cesar Philippidis
0 siblings, 0 replies; 9+ messages in thread
From: Cesar Philippidis @ 2017-10-10 21:05 UTC (permalink / raw)
To: Thomas Schwinge, Chung-Lin Tang, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 2700 bytes --]
On 10/10/2017 11:08 AM, Thomas Schwinge wrote:
> Reported by Cesar for a test case similar to the one below, where we
> observe:
>
> acc_prof-cuda-1.exe: [...]/libgomp/oacc-profiling.c:592: goacc_profiling_dispatch_p: Assertion `thr->prof_info == NULL' failed.
>
> This is because of:
>
> On Tue, 25 Jul 2017 20:51:05 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
>> --- libgomp/oacc-cuda.c (revision 250497)
>> +++ libgomp/oacc-cuda.c (working copy)
>> @@ -99,17 +99,12 @@ acc_get_cuda_stream (int async)
>> prof_info.async_queue = prof_info.async;
>> }
>>
>> - void *ret = NULL;
>> if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
>> - ret = thr->dev->openacc.cuda.get_stream_func (async);
>> -
>> - if (profiling_setup_p)
>> {
>> - thr->prof_info = NULL;
>> - thr->api_info = NULL;
>> + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
>> + return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
>> }
>> -
>> - return ret;
>> + return NULL;
>> }
>
> Pushed to openacc-gcc-7-branch:
>
> commit db149741171147fa86a9bfe708a9082f508115ac
> Author: Thomas Schwinge <thomas@codesourcery.com>
> Date: Tue Oct 10 19:25:19 2017 +0200
>
> acc_get_cuda_stream: Clean up data of the OpenACC Profiling Interface
>
> libgomp/
> * oacc-cuda.c (acc_get_cuda_stream): Clean up data of the OpenACC
> Profiling Interface.
> * testsuite/libgomp.oacc-c-c++-common/acc_prof-cuda-1.c: New file.
> ---
> libgomp/oacc-cuda.c | 13 +++++++++++--
> .../libgomp.oacc-c-c++-common/acc_prof-cuda-1.c | 16 ++++++++++++++++
> 2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git libgomp/oacc-cuda.c libgomp/oacc-cuda.c
> index 1fbe77d..0ac93e9 100644
> --- libgomp/oacc-cuda.c
> +++ libgomp/oacc-cuda.c
> @@ -99,12 +99,21 @@ acc_get_cuda_stream (int async)
> prof_info.async_queue = prof_info.async;
> }
>
> + void *ret = NULL;
> if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
> {
> goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
> - return aq ? thr->dev->openacc.cuda.get_stream_func (aq) : NULL;
> + if (aq)
> + ret = thr->dev->openacc.cuda.get_stream_func (aq);
> }
> - return NULL;
> +
> + if (profiling_setup_p)
> + {
> + thr->prof_info = NULL;
> + thr->api_info = NULL;
> + }
> +
> + return ret;
> }
Thanks!
I wonder if it makes sense to apply my patch to og7. It's the workaround
that I was using in case the async queue doesn't exist. Basically, in
that case, goacc_wait turns into a nop.
Cesar
[-- Attachment #2: og7-async.diff --]
[-- Type: text/x-patch, Size: 523 bytes --]
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 1172d739ec7..85ed2327795 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -855,6 +855,9 @@ goacc_wait (int async, int num_waits, va_list *ap)
{
int qid = va_arg (*ap, int);
goacc_aq aq = get_goacc_asyncqueue (qid);
+ /* There's nothing to do if an async queue doesn't exit. */
+ if (aq == NULL)
+ return;
if (acc_dev->openacc.async.test_func (aq))
continue;
if (async == acc_async_sync)
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2017-10-10 21:01 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-24 7:54 [gomp4] OpenACC async re-work Chung-Lin Tang
2017-06-26 22:45 ` Cesar Philippidis
2017-06-27 10:56 ` Chung-Lin Tang
2017-06-28 22:31 ` Cesar Philippidis
2017-07-25 12:51 ` Chung-Lin Tang
2017-07-25 14:09 ` Cesar Philippidis
2017-07-25 15:03 ` Chung-Lin Tang
2017-10-10 18:37 ` Thomas Schwinge
2017-10-10 21:05 ` Cesar Philippidis
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).