On 03/09/2018 05:55 PM, Cesar Philippidis wrote:
> On 03/09/2018 08:21 AM, Tom de Vries wrote:
>> On 03/09/2018 12:31 AM, Cesar Philippidis wrote:
>>> Nvidia Volta GPUs now support warp-level synchronization.
>>
>> Well, let's try to make that statement a bit more precise.
>>
>> All Nvidia architectures have supported synchronization of threads in a
>> warp on a very basic level: by means of convergence (and unfortunately,
>> we've seen that this is very error-prone).
>>
>> What is new in ptx 6.0 combined with sm_70 is the ability to sync
>> divergent threads without having to converge, f.i. by using new
>> instructions bar.warp.sync and barrier.sync.
>
> Yes. The major difference between sm_70 GPU architectures and earlier
> GPUs is that sm_70 allows the user to explicitly synchronize divergent
> warps. At least on Maxwell and Pascal, the PTX-to-SASS compiler uses two
> instructions to branch, SYNC and BRA. I think SYNC guarantees that a
> warp is convergent at the SYNC point, whereas BRA makes no such
> guarantees.
>

If you want to understand the interplay of sync (or the .s suffix), branch
and ssy, please read
https://people.engr.ncsu.edu/hzhou/ispass_15-poster.pdf .

> What's worse, once a warp has become divergent on sm_60 and earlier
> GPUs, there's no way to reliably reconverge it. So, to avoid that
> problem, it is critical that the PTX-to-SASS compiler use SYNC
> instructions when possible. Fortunately, bar.warp.sync resolves the
> divergent warp problem on sm_70+.
>
>>> As such, the semantics of legacy bar.sync instructions have slightly
>>> changed on newer GPUs.
>>
>> Before in ptx 3.1, we have for bar.sync:
>> ...
>> Barriers are executed on a per-warp basis as if all the threads in a
>> warp are active. Thus, if any thread in a warp executes a bar
>> instruction, it is as if all the threads in the warp have executed
>> the bar instruction. All threads in the warp are stalled until the
>> barrier completes, and the arrival count for the barrier is incremented
>> by the warp size (not the number of active threads in the warp). In
>> conditionally executed code, a bar instruction should only be used if it
>> is known that all threads evaluate the condition identically (the warp
>> does not diverge).
>> ...
>>
>> But in ptx 6.0, we have:
>> ...
>> bar.sync is equivalent to barrier.sync.aligned
>> ...
>> and:
>> ...
>> Instruction barrier has optional .aligned modifier. When specified, it
>> indicates that all threads in CTA will execute the same barrier
>> instruction. In conditionally executed code, an aligned barrier
>> instruction should only be used if it is known that all threads in
>> CTA evaluate the condition identically, otherwise behavior is undefined.
>> ...
>>
>> So, in ptx 3.1 bar.sync should be executed in convergent mode (all the
>> threads in each warp executing the same). But in ptx 6.0, bar.sync
>> should be executed in the mode that the whole CTA is executing the same
>> code.
>>
>> So going from the description of ptx, it seems indeed that the semantics
>> of bar.sync have changed. That is however surprising, since it would
>> break the forward compatibility that AFAIU is the idea behind ptx.
>>
>> So for now my hope is that this is a documentation error.
>
> I spent a lot of time debugging deadlocks with the vector length changes
> and I have seen no changes in the SASS code generated by the newer Nvidia
> drivers when compared to the older ones, at least with respect to the
> barrier instructions. This isn't the first time I've seen
> inconsistencies with thread synchronization in Nvidia's documentation.
> For the longest time, the "CUDA Programming Guide" provided slightly
> conflicting semantics for the __syncthreads() function, which ultimately
> gets implemented as bar.sync in PTX.
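
To make the conditionally-executed-barrier rule concrete, here is a
minimal CUDA-level sketch (hand-written purely for illustration, not
taken from the patch or the test case; __syncthreads is the source-level
construct that ends up as bar.sync):
...
/* Bad: only part of the warp reaches the barrier, i.e. the barrier sits
   in conditionally executed code where the threads do not evaluate the
   condition identically -- what ptx 3.1 advises against and what the
   ptx 6.0 .aligned form declares undefined.  */
__global__ void broken (int *a)
{
  if (threadIdx.x < 16)
    {
      a[threadIdx.x] = 1;
      __syncthreads ();   /* lowers to bar.sync while the warp is divergent */
    }
}

/* Safe: every thread in the CTA executes the same bar.sync; only the
   store is conditional.  */
__global__ void safe (int *a)
{
  if (threadIdx.x < 16)
    a[threadIdx.x] = 1;
  __syncthreads ();
}
...
The GCC-generated code below has essentially the shape of the first
kernel: the bar.sync ends up behind a neutering branch, so not all
threads in the warp reach it.
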
>>> The PTX JIT will now, occasionally, emit a warpsync instruction
>>> immediately before a bar.sync for Volta GPUs. That implies that warps
>>> must be convergent on entry to those thread barriers.
>>>
>>
>> That warps must be convergent on entry to bar.sync is already required
>> by ptx 3.1.
>>
>> [ And bar.warp.sync does not force convergence, so if the warpsync
>> instruction you mention is equivalent to bar.warp.sync then your
>> reasoning is incorrect. ]
>
> I'm under the impression that bar.warp.sync converges all of the
> non-exited threads in a warp.

I have not played around with the instruction yet, so I'm not sure, but
what I read from the docs is that bar.warp.sync converges all of the
non-exited threads in a warp if and only if it's positioned at a point
post-dominating a divergent branch.

Consider this case:
...
  if (tid.x == 0)
    {
      A;
      bar.warp.sync 32;
      B;
    }
  else
    {
      C;
      bar.warp.sync 32;
      D;
    }
...
AFAIU, this allows bar.warp.sync to synchronize the threads in the warp,
_without_ converging.

> You'd still need to use bar.sync or some variant of the new barrier
> instruction to converge the entire CTA. But at the moment, we're still
> generating code that's backwards compatible with sm_30.
>
>>> The problem in og7, and trunk, is that GCC emits barrier instructions at
>>> the wrong spots. E.g., consider the following OpenACC parallel region:
>>>
>>>    #pragma acc parallel loop worker
>>>    for (i = 0; i < 10; i++)
>>>      a[i] = i;
>>>
>>> At -O2, GCC generates the following PTX code:
>>>
>>>         {
>>>                 .reg.u32        %y;
>>>                 mov.u32 %y, %tid.y;
>>>                 setp.ne.u32     %r76, %y, 0;
>>>         }
>>>         {
>>>                 .reg.u32        %x;
>>>                 mov.u32 %x, %tid.x;
>>>                 setp.ne.u32     %r75, %x, 0;
>>>         }
>>>         @%r76   bra.uni $L6;
>>>         @%r75   bra     $L7;
>>>                 mov.u64 %r67, %ar0;
>>>         // fork 2;
>>>                 cvta.shared.u64 %r74, __oacc_bcast;
>>>                 st.u64  [%r74], %r67;
>>> $L7:
>>> $L6:
>>>         @%r75   bra     $L5;
>>>         // forked 2;
>>>                 bar.sync        0;
>>>                 cvta.shared.u64 %r73, __oacc_bcast;
>>>                 ld.u64  %r67, [%r73];
>>>                 mov.u32 %r62, %ntid.y;
>>>                 mov.u32 %r63, %tid.y;
>>>                 setp.gt.s32     %r68, %r63, 9;
>>>         @%r68   bra     $L2;
>>>                 mov.u32 %r55, %r63;
>>>                 cvt.s64.s32     %r69, %r62;
>>>                 shl.b64 %r59, %r69, 2;
>>>                 cvt.s64.s32     %r70, %r55;
>>>                 shl.b64 %r71, %r70, 2;
>>>                 add.u64 %r58, %r67, %r71;
>>> $L3:
>>>                 st.u32  [%r58], %r55;
>>>                 add.u32 %r55, %r55, %r62;
>>>                 add.u64 %r58, %r58, %r59;
>>>                 setp.le.s32     %r72, %r55, 9;
>>>         @%r72   bra     $L3;
>>> $L2:
>>>                 bar.sync        1;
>>>         // joining 2;
>>> $L5:
>>>         // join 2;
>>>         ret;
>>>
>>> Note the bar.sync instructions placed immediately after the forked
>>> comment and before the joining comment. The problem here is that the
>>> branch above the forked comment guarantees that the warps are not
>>> synchronous (when vector_length > 1, which is always the case).
>>
>> This is already advised against in ptx 3.1, so yes, we should fix this.
>>
>>> Likewise, the bar.sync instruction before joining should be placed
>>> after label L5 in order to allow all of the threads in the warp to
>>> reach it.
>>>
>>
>> Agreed.
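
To illustrate the intended placement in CUDA terms -- a hand-written
sketch only, with made-up kernel and variable names, not what the nvptx
backend actually emits:
...
__global__ void worker_loop (int *a)
{
  __shared__ int *bcast;

  /* Only one thread sets up the broadcast buffer ("fork").  */
  if (threadIdx.y == 0 && threadIdx.x == 0)
    bcast = a;

  /* All threads: this is where bar.sync 0 has to go, before the
     vector-neutering branch, so the whole CTA reaches it.  */
  __syncthreads ();

  /* Vector-neutering: only vector lane 0 runs the worker-partitioned
     loop body.  */
  if (threadIdx.x == 0)
    {
      int *ap = bcast;
      for (int i = threadIdx.y; i < 10; i += blockDim.y)
        ap[i] = i;
    }

  /* All threads again: this is where bar.sync 1 has to go, after the
     point where the neutered threads rejoin ("join").  */
  __syncthreads ();
}
...
The two __syncthreads calls correspond to the two bar.sync moves shown
in the diffs below.
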
>>> The attached patch teaches the nvptx backend to make those adjustments.
>>
>> Can you show me a diff of the ptx for the test-case above for trunk?
>
> --- w-old.s	2018-03-08 15:19:47.139516578 -0800
> +++ w.s	2018-03-09 08:42:52.217057332 -0800
> @@ -46,9 +46,9 @@
>                  st.u64  [%r74], %r67;
>  $L7:
>  $L6:
> -        @%r75   bra     $L5;
>          // forked 2;
>                  bar.sync        0;
> +        @%r75   bra     $L5;
>                  cvta.shared.u64 %r73, __oacc_bcast;
>                  ld.u64  %r67, [%r73];
>                  mov.u32 %r62, %ntid.y;
> @@ -68,9 +68,9 @@
>                  setp.le.s32     %r72, %r55, 9;
>          @%r72   bra     $L3;
>  $L2:
> -                bar.sync        1;
>          // joining 2;
>  $L5:
> +                bar.sync        1;
>          // join 2;
>          ret;
>  }
>

At -O0, yes. At -O2, we have:
...
diff -u -a 1 2
--- 1	2018-03-19 14:13:44.074834552 +0100
+++ 2	2018-03-19 14:15:06.075301168 +0100
@@ -42,20 +42,20 @@
 st.u64 [%r32],%r25;
 $L7:
 $L6:
-@ %r33 bra $L5;
 // forked 2;
 bar.sync 0;
+@ %r33 bra $L5;
 cvta.shared.u64 %r31,__worker_bcast;
 ld.u64 %r25,[%r31];
 mov.u32 %r24,%tid.y;
 setp.le.s32 %r26,%r24,9;
 @ %r26 bra $L2;
 $L3:
-bar.sync 1;
 // joining 2;
 $L5:
-@ %r34 bra.uni $L8;
+bar.sync 1;
 @ %r33 bra $L9;
+@ %r34 bra.uni $L8;
 // join 2;
 $L9:
 $L8:
...

Note that this changes the ordering of the vector-neutering jump and the
worker-neutering jump at the end. In principle, this should not be
harmful, but it violates the invariant that the vector-neutering
branch-around code should be as short-lived as possible. So, this needs
to be fixed.

I've found this issue by adding verification of the neutering, as
attached below.

Thanks,
- Tom