The patch introduces the following OpenACC/PTX-specific built-ins:

  * GOACC_ntid
  * GOACC_tid
  * GOACC_nctaid
  * GOACC_ctaid
  * acc_on_device
  * GOACC_get_thread_num
  * GOACC_get_num_threads

Of these functions, the only one that is part of the OpenACC spec is acc_on_device. The other functions are helpers for omp-low.c. In particular, I'm using GOACC_get_thread_num and GOACC_get_num_threads to determine the number of accelerator threads available to the reduction clause. Currently, GOACC_get_num_threads returns num_gangs * vector_length, but that value is subject to change later on. It's probably premature to include the PTX built-ins right now, but I'd like to keep the middle end of our internal OpenACC branch in sync with gomp-4_0-branch.

This patch also allows OpenACC reductions to process the array holding partial reductions on the accelerator, instead of copying that array back to the host. Currently, this only happens when num_gangs = 1. For PTX targets, we're going to need to use another kernel to process the array of partial results, because PTX lacks inter-CTA synchronization (we're currently mapping gangs to CTAs). That's why I was working on the routine clause recently.

Is this OK for gomp-4_0-branch?

Thanks,
Cesar
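
For illustration only, the partial-reduction scheme described above can be sketched in plain C. This is not code from the patch. The names get_num_threads and reduce_sum and the constants NUM_GANGS and VECTOR_LENGTH are hypothetical stand-ins, and the accelerator threads are simulated with an ordinary sequential loop. It assumes a sum reduction in which each thread owns one slot of the partial-result array and the thread count is num_gangs * vector_length.

```c
/* Hypothetical launch geometry; in the real compiler these would come
   from the num_gangs and vector_length clauses.  */
enum { NUM_GANGS = 4, VECTOR_LENGTH = 8 };

/* Stand-in for what GOACC_get_num_threads is described as returning:
   num_gangs * vector_length.  */
static int get_num_threads(void)
{
  return NUM_GANGS * VECTOR_LENGTH;
}

/* Sum data[0..n) through a per-thread array of partial results.  */
static long reduce_sum(const int *data, int n)
{
  long partial[NUM_GANGS * VECTOR_LENGTH] = { 0 };
  int nthreads = get_num_threads();

  /* Phase 1: each (simulated) thread accumulates a strided slice of
     the input into its own slot, indexed by its thread number.  */
  for (int tid = 0; tid < nthreads; tid++)
    for (int i = tid; i < n; i += nthreads)
      partial[tid] += data[i];

  /* Phase 2: combine the partial results.  This is the step that can
     stay on the accelerator only if all threads can synchronize, which
     is why the multi-gang PTX case would need a second kernel.  */
  long sum = 0;
  for (int tid = 0; tid < nthreads; tid++)
    sum += partial[tid];
  return sum;
}
```

Calling reduce_sum on an array holding 1 through 100 yields 5050, the same result as a direct sum. Only the order of accumulation differs.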