The attached patch calls CUDA's cuMemcpy2D and cuMemcpy3D for omp_target_memcpy_rect[_async] when dim=2 or dim=3. This should speed up the data transfer for noncontiguous data.

While at it, I ended up adding support for copying from one device to another device; while potentially slow, it is still better than not being able to copy, and with shared memory it shouldn't be that bad.

Comments, suggestions, remarks? If there are none, I will commit it...

Disclaimer: While I have done correctness tests (on a system with two nvptx GPUs), I have not done any performance tests. (I also tested it without offloading configured, but that's rather boring.)

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company; Managing Directors: Thomas Heurung, Frank Thürauf; Registered office: München; Registry court: München, HRB 106955
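PS: For readers less familiar with rectangular copies, here is a minimal host-only sketch of what a dim=2 copy amounts to when done row by row. It is not code from the patch and does not call CUDA or libgomp; copy_rect_2d and demo_copy_rect are hypothetical names made up for this illustration. The point is only that a noncontiguous 2D block needs one contiguous transfer per row unless a single pitched 2D copy such as cuMemcpy2D handles the whole rectangle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libgomp or CUDA): copy a rectangle of
   ROWS rows, each WIDTH_BYTES wide, from SRC to DST.  Each buffer has its
   own pitch, i.e. the number of bytes between the starts of two
   consecutive rows.  One memcpy is issued per row.  */
static void
copy_rect_2d (void *dst, size_t dst_pitch,
              const void *src, size_t src_pitch,
              size_t width_bytes, size_t rows)
{
  char *d = dst;
  const char *s = src;

  /* A row of the rectangle must fit into a row of either buffer.  */
  assert (width_bytes <= dst_pitch && width_bytes <= src_pitch);

  for (size_t r = 0; r < rows; r++)
    memcpy (d + r * dst_pitch, s + r * src_pitch, width_bytes);
}

/* Illustration: copy the 2x3 block starting at row 1, column 2 of a 4x6
   matrix into the top-left corner of a zeroed 3x4 matrix.  */
static void
demo_copy_rect (int dst[3][4])
{
  int src[4][6];

  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 6; j++)
      src[i][j] = 10 * i + j;

  memset (dst, 0, 3 * sizeof dst[0]);

  copy_rect_2d (&dst[0][0], sizeof dst[0],
                &src[1][2], sizeof src[0],
                3 * sizeof (int), 2);
}
```

With a host-to-device transfer, each iteration of that loop would be a separate transfer; a single 2D call is given the two pitches and the rectangle size once, which is where the expected speed-up for noncontiguous data comes from.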