The attached patch calls CUDA's cuMemcopy2D and cuMemcpy3D
for omp_target_memcpy_rect[,_async} for dim=2/dim=3. This should
speed up the data transfer for noncontiguous data.

While being there, I ended up adding support for device to other device
copying; while potentially slow, it is still better than not being able to
copy - and with shared-memory, it shouldn't be that bad.

Comments, suggestions, remarks?
If there are none, will commit it...

Disclaimer: While I have done correctness tests (system with two nvptx GPUs,
I have not done any performance tests. (I also tested it without offloading
configured, but that's rather boring.)

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955