This patch adds default compute dimension handling. Users rarely specify compute dimensions, expecting the toolchain to DTRT. More savvy users would like to specify global defaults. This patch permits both.

While the vector and worker dimensions are constrained by the target GPU implementation, the number of gangs is arbitrary. The number that can compute in parallel depends on the physical number present on your accelerator board -- but that's hidden behind the runtime API, which will schedule logical instances onto the physical devices in an arbitrary order. Without this patch, we're reliant on the user specifying 'num_gangs(G)' with a suitable 'G' on each offload region. General code tends not to do that. Further, if one relies on automatic partitioning of a parallel region via '#pragma acc loop auto' (we default to 'auto' there, if nothing overrides it), the user has no way of knowing which set of partitions is being used, so it would be unwise to specify a particular axis with non-unity size. Hence this patch.

We add a '-fopenacc-dim=G:W:V' option, where G, W & V are integer constants. A particular entry may be omitted to get the default value. I envision extending this to device_type support with something like DEV_T:G:W:V as comma-separated tuples. If the option is omitted -- or the dimensions are not completely specified -- the backend gets to pick defaults. For PTX we already force V to 32, and bound W at 32 (but permit smaller values). This patch sets the W & G defaults to 32. Explicitly specified values go through backend range checking.

The backend validate_dims hook is extended to handle these cases (with a NULL fndecl arg), and it is also changed to not fill in defaults (except when determining the global default). The loop partitioning code in the oacc dev lower pass is rearranged to return the mask of partition axes used, and that pass then selects a suitable default value for any unspecified axis -- either the global default, or the minimum permitted value.
The outcome is that the naive user will get multiple compute elements for '#pragma acc loop' use in a parallel region, whereas before they had to specify the number of elements to guarantee that (and, as mentioned above, would then want to specify which axis each loop should be partitioned over).

ok?

nathan