I've committed this patch to change the way stacks are initialized on amdgcn. The patch only touches GCN files, or the GCN-only portions of libgomp files, so I'm allowing it despite stage 4, both because I want the ABI change done for GCC 13 and because it enables Tobias's reverse-offload patch, which has already been approved, I think.

The stacks used to be placed in the "private segment" provided for the purpose by the GPU drivers, but those addresses are not accessible from the host, not even via the HSA API, which was a problem for reverse offload. The new scheme allocates stack space the same way we allocate heap space, except that each kernel gets its own instance. We were already doing that for the "team arena" ephemeral heap, so I have unified the two implementations.

While the change does not alter the procedure call standard, it does alter the kernel entry ABI, so any code using the compiler builtins for kernel properties must be rebuilt. A recent version of Newlib is required (version 4.3.0.20230120 has the necessary changes).

Benchmarking shows no significant change in performance.

The __builtin_apply tests fail because they attempt to access memory in parent stack frames (I think), which causes a memory fault when those frames don't exist (stack underflow); if I modify the testcase to include extra call depth it passes fine. In any case, the behaviour of __builtin_apply has not changed; only the device has become less forgiving.

I will back-port this to OG12 shortly.

Andrew