This patch adds default compute dimension handling. Users rarely specify compute dimensions, expecting the toolchain to DTRT. More savvy users would like to specify global defaults. This patch permits both.

While the vector and worker dimensions are constrained by the target GPU implementation, the number of gangs is arbitrary. The number that can compute in parallel depends on the physical number present on your accelerator board -- but that's hidden behind the runtime API, which will schedule logical instances onto the physical devices in an arbitrary order. Without this patch, we're reliant on the user specifying 'num_gangs(G)' with a suitable 'G' on each offload region. General code tends not to do that. Further, if one relies on automatic partitioning of a parallel region via '#pragma acc loop auto' (we default to 'auto' there, if nothing overrides it), the user has no way of knowing which set of partitions is being used, so it would be unwise to specify a particular axis with non-unity size. Hence this patch.

We add a '-fopenacc-dim=G:W:V' option, where G, W & V are integer constants. A particular entry may be omitted to get the default value. I envision extending this to device_type support with something like DEV_T:G:W:V as comma-separated tuples. If the option is omitted -- or the dimensions are not completely specified -- the backend gets to pick defaults. For PTX we already force V to 32, and bound W at 32 (but permit smaller values). This patch sets the W & G defaults to 32. Explicitly specified values go through backend range checking.

The backend validate_dims hook is extended to handle these cases (with a NULL fndecl arg), and it is also changed to not fill in defaults (except when determining the global default). The loop partitioning code in the oacc dev lower pass is rearranged to return the mask of partition axes used, and that pass then selects a suitable default value for any unspecified axis -- either the global default, or the minimum permitted value.
The outcome is that the naive user will get multiple compute elements for '#pragma acc loop' use in a parallel region, whereas before they had to specify the number of elements to guarantee that (and, as mentioned above, would then want to specify which axis each loop should be partitioned over).

ok?

nathan