On 03/09/2018 05:55 PM, Cesar Philippidis wrote:
> On 03/09/2018 08:21 AM, Tom de Vries wrote:
>> On 03/09/2018 12:31 AM, Cesar Philippidis wrote:
>>> Nvidia Volta GPUs now support warp-level synchronization.
>>
>> Well, let's try to make that statement a bit more precise.
>>
>> All Nvidia architectures have supported synchronization of threads in a
>> warp on a very basic level: by means of convergence (and unfortunately,
>> we've seen that this is very error-prone).
>>
>> What is new in ptx 6.0 combined with sm_70 is the ability to sync
>> divergent threads without having to converge, f.i. by using new
>> instructions bar.warp.sync and barrier.sync.
>
> Yes. The major difference between sm_70 GPU architectures and earlier
> GPUs is that sm_70 allows the user to explicitly synchronize divergent
> warps. At least on Maxwell and Pascal, the PTX-to-SASS compiler uses two
> instructions to branch, SYNC and BRA. I think SYNC guarantees that a
> warp is convergent at the SYNC point, whereas BRA makes no such
> guarantees.
>

If you want to understand the interplay of sync (or the .s suffix), branch
and ssy, please read
https://people.engr.ncsu.edu/hzhou/ispass_15-poster.pdf .

> What's worse, once a warp has become divergent on sm_60 and earlier
> GPUs, there's no way to reliably reconverge it. So, to avoid that
> problem, it is critical that the PTX-to-SASS compiler use SYNC
> instructions when possible. Fortunately, bar.warp.sync resolves the
> divergent warp problem on sm_70+.
>
>>> As such, the semantics of legacy bar.sync instructions have slightly
>>> changed on newer GPUs.
>>
>> Before in ptx 3.1, we have for bar.sync:
>> ...
>> Barriers are executed on a per-warp basis as if all the threads in a
>> warp are active. Thus, if any thread in a warp executes a bar
>> instruction, it is as if all the threads in the warp have executed
>> the bar instruction. All threads in the warp are stalled until the
>> barrier completes, and the arrival count for the barrier is incremented
>> by the warp size (not the number of active threads in the warp). In
>> conditionally executed code, a bar instruction should only be used if it
>> is known that all threads evaluate the condition identically (the warp
>> does not diverge).
>> ...
>>
>> But in ptx 6.0, we have:
>> ...
>> bar.sync is equivalent to barrier.sync.aligned
>> ...
>> and:
>> ...
>> Instruction barrier has optional .aligned modifier. When specified, it
>> indicates that all threads in CTA will execute the same barrier
>> instruction. In conditionally executed code, an aligned barrier
>> instruction should only be used if it is known that all threads in
>> CTA evaluate the condition identically, otherwise behavior is undefined.
>> ...
>>
>> So, in ptx 3.1 bar.sync should be executed in convergent mode (all the
>> threads in each warp executing the same). But in ptx 6.0, bar.sync
>> should be executed in the mode that the whole CTA is executing the same
>> code.
>>
>> So going from the description of ptx, it seems indeed that the semantics
>> of bar.sync have changed. That is however surprising, since it would
>> break the forward compatibility that AFAIU is the idea behind ptx.
>>
>> So for now my hope is that this is a documentation error.
>
> I spent a lot of time debugging deadlocks with the vector length changes
> and I have seen no changes in the SASS code generated by the newer Nvidia
> drivers when compared to the older ones, at least with respect to the
> barrier instructions. This isn't the first time I've seen
> inconsistencies with thread synchronization in Nvidia's documentation.
> For the longest time, the "CUDA Programming Guide" provided slightly
> conflicting semantics for the __syncthreads() function, which ultimately
> gets implemented as bar.sync in PTX.
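
To make the conditionally-executed-barrier rule concrete, here is a
minimal CUDA-level sketch (hand-written purely for illustration, not
taken from the patch or the test case; __syncthreads is the source-level
construct that ends up as bar.sync):
...
/* Bad: only part of the warp reaches the barrier, i.e. the barrier sits
   in conditionally executed code where the threads do not evaluate the
   condition identically -- what ptx 3.1 advises against and what the
   ptx 6.0 .aligned form declares undefined.  */
__global__ void broken (int *a)
{
  if (threadIdx.x < 16)
    {
      a[threadIdx.x] = 1;
      __syncthreads ();   /* lowers to bar.sync while the warp is divergent */
    }
}

/* Safe: every thread in the CTA executes the same bar.sync; only the
   store is conditional.  */
__global__ void safe (int *a)
{
  if (threadIdx.x < 16)
    a[threadIdx.x] = 1;
  __syncthreads ();
}
...
The GCC-generated code below has essentially the shape of the first
kernel: the bar.sync ends up behind a neutering branch, so not all
threads in the warp reach it.
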
>>> The PTX JIT will now, occasionally, emit a warpsync instruction
>>> immediately before a bar.sync for Volta GPUs. That implies that warps
>>> must be convergent on entry to those thread barriers.
>>>
>>
>> That warps must be convergent on entry to bar.sync is already required
>> by ptx 3.1.
>>
>> [ And bar.warp.sync does not force convergence, so if the warpsync
>> instruction you mention is equivalent to bar.warp.sync then your
>> reasoning is incorrect. ]
>
> I'm under the impression that bar.warp.sync converges all of the
> non-exited threads in a warp.

I have not played around with the instruction yet, so I'm not sure, but
what I read from the docs is that bar.warp.sync converges all of the
non-exited threads in a warp if and only if it's positioned at a point
post-dominating a divergent branch.

Consider this case:
...
  if (tid.x == 0)
    {
      A;
      bar.warp.sync 32;
      B;
    }
  else
    {
      C;
      bar.warp.sync 32;
      D;
    }
...
AFAIU, this allows bar.warp.sync to synchronize the threads in the warp,
_without_ converging.

> You'd still need to use bar.sync or some variant of the new barrier
> instruction to converge the entire CTA. But at the moment, we're still
> generating code that's backwards compatible with sm_30.
>
>>> The problem in og7, and trunk, is that GCC emits barrier instructions at
>>> the wrong spots. E.g., consider the following OpenACC parallel region:
>>>
>>>    #pragma acc parallel loop worker
>>>    for (i = 0; i < 10; i++)
>>>      a[i] = i;
>>>
>>> At -O2, GCC generates the following PTX code:
>>>
>>>         {
>>>                 .reg.u32        %y;
>>>                 mov.u32 %y, %tid.y;
>>>                 setp.ne.u32     %r76, %y, 0;
>>>         }
>>>         {
>>>                 .reg.u32        %x;
>>>                 mov.u32 %x, %tid.x;
>>>                 setp.ne.u32     %r75, %x, 0;
>>>         }
>>>         @%r76   bra.uni $L6;
>>>         @%r75   bra     $L7;
>>>                 mov.u64 %r67, %ar0;
>>>         // fork 2;
>>>                 cvta.shared.u64 %r74, __oacc_bcast;
>>>                 st.u64  [%r74], %r67;
>>> $L7:
>>> $L6:
>>>         @%r75   bra     $L5;
>>>         // forked 2;
>>>                 bar.sync        0;
>>>                 cvta.shared.u64 %r73, __oacc_bcast;
>>>                 ld.u64  %r67, [%r73];
>>>                 mov.u32 %r62, %ntid.y;
>>>                 mov.u32 %r63, %tid.y;
>>>                 setp.gt.s32     %r68, %r63, 9;
>>>         @%r68   bra     $L2;
>>>                 mov.u32 %r55, %r63;
>>>                 cvt.s64.s32     %r69, %r62;
>>>                 shl.b64 %r59, %r69, 2;
>>>                 cvt.s64.s32     %r70, %r55;
>>>                 shl.b64 %r71, %r70, 2;
>>>                 add.u64 %r58, %r67, %r71;
>>> $L3:
>>>                 st.u32  [%r58], %r55;
>>>                 add.u32 %r55, %r55, %r62;
>>>                 add.u64 %r58, %r58, %r59;
>>>                 setp.le.s32     %r72, %r55, 9;
>>>         @%r72   bra     $L3;
>>> $L2:
>>>                 bar.sync        1;
>>>         // joining 2;
>>> $L5:
>>>         // join 2;
>>>         ret;
>>>
>>> Note the bar.sync instructions placed immediately after the forked
>>> comment and before the joining comment. The problem here is that the
>>> branch above the forked comment guarantees that the warps are not
>>> synchronous (when vector_length > 1, which is always the case).
>>
>> This is already advised against in ptx 3.1, so yes, we should fix this.
>>
>>> Likewise, the bar.sync instruction before joining should be placed
>>> after label L5 in order to allow all of the threads in the warp to
>>> reach it.
>>>
>>
>> Agreed.
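
To illustrate the intended placement in CUDA terms -- a hand-written
sketch only, with made-up kernel and variable names, not what the nvptx
backend actually emits:
...
__global__ void worker_loop (int *a)
{
  __shared__ int *bcast;

  /* Only one thread sets up the broadcast buffer ("fork").  */
  if (threadIdx.y == 0 && threadIdx.x == 0)
    bcast = a;

  /* All threads: this is where bar.sync 0 has to go, before the
     vector-neutering branch, so the whole CTA reaches it.  */
  __syncthreads ();

  /* Vector-neutering: only vector lane 0 runs the worker-partitioned
     loop body.  */
  if (threadIdx.x == 0)
    {
      int *ap = bcast;
      for (int i = threadIdx.y; i < 10; i += blockDim.y)
        ap[i] = i;
    }

  /* All threads again: this is where bar.sync 1 has to go, after the
     point where the neutered threads rejoin ("join").  */
  __syncthreads ();
}
...
The two __syncthreads calls correspond to the two bar.sync moves shown
in the diffs below.
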
>>> The attached patch teaches the nvptx backend to make those adjustments.
>>
>> Can you show me a diff of the ptx for the test-case above for trunk?
>
> --- w-old.s	2018-03-08 15:19:47.139516578 -0800
> +++ w.s	2018-03-09 08:42:52.217057332 -0800
> @@ -46,9 +46,9 @@
>                  st.u64  [%r74], %r67;
>  $L7:
>  $L6:
> -        @%r75   bra     $L5;
>          // forked 2;
>                  bar.sync        0;
> +        @%r75   bra     $L5;
>                  cvta.shared.u64 %r73, __oacc_bcast;
>                  ld.u64  %r67, [%r73];
>                  mov.u32 %r62, %ntid.y;
> @@ -68,9 +68,9 @@
>                  setp.le.s32     %r72, %r55, 9;
>          @%r72   bra     $L3;
>  $L2:
> -                bar.sync        1;
>          // joining 2;
>  $L5:
> +                bar.sync        1;
>          // join 2;
>          ret;
>  }
>

At -O0, yes. At -O2, we have:
...
diff -u -a 1 2
--- 1	2018-03-19 14:13:44.074834552 +0100
+++ 2	2018-03-19 14:15:06.075301168 +0100
@@ -42,20 +42,20 @@
 st.u64 [%r32],%r25;
 $L7:
 $L6:
-@ %r33 bra $L5;
 // forked 2;
 bar.sync 0;
+@ %r33 bra $L5;
 cvta.shared.u64 %r31,__worker_bcast;
 ld.u64 %r25,[%r31];
 mov.u32 %r24,%tid.y;
 setp.le.s32 %r26,%r24,9;
 @ %r26 bra $L2;
 $L3:
-bar.sync 1;
 // joining 2;
 $L5:
-@ %r34 bra.uni $L8;
+bar.sync 1;
 @ %r33 bra $L9;
+@ %r34 bra.uni $L8;
 // join 2;
 $L9:
 $L8:
...

Note that this changes the ordering of the vector-neutering jump and the
worker-neutering jump at the end. In principle, this should not be
harmful, but it violates the invariant that the vector-neutering
branch-around code should be as short-lived as possible. So, this needs
to be fixed.

I've found this issue by adding verification of the neutering, as
attached below.

Thanks,
- Tom