[og7] Update nvptx_fork/join barrier placement

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Cesar Philippidis <cesar@codesourcery.com>
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	Tom de Vries	<Tom_deVries@mentor.com>
Subject: [og7] Update nvptx_fork/join barrier placement
Date: Thu, 08 Mar 2018 23:31:00 -0000	[thread overview]
Message-ID: <600a90eb-fbc6-1b35-a3d3-f34915473951@codesourcery.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 2994 bytes --]

Nvidia Volta GPUs now support warp-level synchronization. As such, the
semantics of legacy bar.sync instructions have slightly changed on newer
GPUs. The PTX JIT will now, occasionally, emit a warpsync instruction
immediately before a bar.sync for Volta GPUs. That implies that warps
must be convergent on entry to those threads barriers.

The problem in og7, and trunk, is that GCC emits barrier instructions at
the wrong spots. E.g., consider the following OpenACC parallel region:

  #pragma acc parallel loop worker
  for (i = 0; i < 10; i++)
    a[i] = i;

At -O2, GCC generates the following PTX code:

        {
                .reg.u32        %y;
                mov.u32 %y, %tid.y;
                setp.ne.u32     %r76, %y, 0;
        }
        {
                .reg.u32        %x;
                mov.u32 %x, %tid.x;
                setp.ne.u32     %r75, %x, 0;
        }
        @%r76   bra.uni $L6;
        @%r75   bra     $L7;
                mov.u64 %r67, %ar0;
        // fork 2;
                cvta.shared.u64 %r74, __oacc_bcast;
                st.u64  [%r74], %r67;
$L7:
$L6:
        @%r75   bra     $L5;
        // forked 2;
                bar.sync        0;
                cvta.shared.u64 %r73, __oacc_bcast;
                ld.u64  %r67, [%r73];
                mov.u32 %r62, %ntid.y;
                mov.u32 %r63, %tid.y;
                setp.gt.s32     %r68, %r63, 9;
        @%r68   bra     $L2;
                mov.u32 %r55, %r63;
                cvt.s64.s32     %r69, %r62;
                shl.b64 %r59, %r69, 2;
                cvt.s64.s32     %r70, %r55;
                shl.b64 %r71, %r70, 2;
                add.u64 %r58, %r67, %r71;
$L3:
                st.u32  [%r58], %r55;
                add.u32 %r55, %r55, %r62;
                add.u64 %r58, %r58, %r59;
                setp.le.s32     %r72, %r55, 9;
        @%r72   bra     $L3;
$L2:
                bar.sync        1;
        // joining 2;
$L5:
        // join 2;
        ret;

Note the bar.sync instructions placed immediately after the forked
comment and before the joining comment. The problem here is that branch
above the forked comment guarantees that the warps are not synchronous
(when vector_length > 1, which is always the case). Likewise, bar.sync
instruction before joining should be placed after label L5 in order to
allow all of the threads in the warp to reach it.

The attached patch teaches the nvptx to make those adjustments. It
doesn't cause any regressions on legacy GPUs, but it does resolve quite
a few failures with Volta in the libgomp execution tests. Therefore,
this patch doesn't include any new test cases. Part of this patch came
from my vector_length patch set that I posted last week. However, that
patch set didn't consider the placement of the joining barrier.

I've applied this patch to openacc-gcc-7-branch.

Tom, is a similar patch OK for trunk? The major difference between trunk
and og7 is that og7 changed the name of nvptx_warp_sync to nvptx_cta_sync.

Cesar

[-- Attachment #2: og7-barriers.diff --]
[-- Type: text/x-patch, Size: 1953 bytes --]

2018-03-08  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_single): Adjust placement of nvptx_fork
	and nvptx_join nutering labels.
	(nvptx_process_pars): Place the CTA barrier at the beginning of the
	join block.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b16cf59575c..efc6161a6b0 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4056,6 +4056,15 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	return;
     }
 
+  /* NVPTX_BARSYNC barriers are placed immediately before NVPTX_JOIN
+     in order to ensure that all of the threads in a CTA reach the
+     barrier.  Don't nueter BLOCK if head is NVPTX_BARSYNC and tail is
+     NVPTX_JOIN.  */
+  if (from == to
+      && recog_memoized (head) == CODE_FOR_nvptx_barsync
+      && recog_memoized (tail) == CODE_FOR_nvptx_join)
+    return;
+
   /* Insert the vector test inside the worker test.  */
   unsigned mode;
   rtx_insn *before = tail;
@@ -4103,7 +4112,17 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  br = gen_br_true (pred, label);
 	else
 	  br = gen_br_true_uni (pred, label);
-	emit_insn_before (br, head);
+
+	if (recog_memoized (head) == CODE_FOR_nvptx_forked
+	    && recog_memoized (NEXT_INSN (head)) == CODE_FOR_nvptx_barsync)
+	  {
+	    head = NEXT_INSN (head);
+	    emit_insn_after (br, head);
+	  }
+	else if (recog_memoized (head) == CODE_FOR_nvptx_barsync)
+	  emit_insn_after (br, head);
+	else
+	  emit_insn_before (br, head);
 
 	LABEL_NUSES (label)++;
 	if (tail_branch)
@@ -4325,7 +4344,7 @@ nvptx_process_pars (parallel *par)
 	{
 	  /* Insert begin and end synchronizations.  */
 	  emit_insn_after (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (true), par->joining_insn);
+	  emit_insn_before (nvptx_cta_sync (true), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))

next             reply	other threads:[~2018-03-08 23:31 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-08 23:31 Cesar Philippidis [this message]
2018-03-09 16:22 ` Tom de Vries
2018-03-09 16:55   ` Cesar Philippidis
2018-03-19 14:30     ` Tom de Vries
2018-03-19 15:24       ` Cesar Philippidis
2018-03-19 15:44         ` Tom de Vries
2018-03-19 17:04         ` Tom de Vries
2018-03-19 18:09           ` Cesar Philippidis
2018-03-20 10:41           ` [nvptx, PR84952, committed] Fix bar.sync position Tom de Vries

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=600a90eb-fbc6-1b35-a3d3-f34915473951@codesourcery.com \
    --to=cesar@codesourcery.com \
    --cc=Tom_deVries@mentor.com \
    --cc=gcc-patches@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).