From: Tom de Vries
To: GCC Patches
Subject: [nvptx, committed] Add extra initialization of broadcasted condition variables
Date: Tue, 11 Jul 2017 12:37:00 -0000
Message-ID: <2bf62176-d7f2-0502-35f4-67cd54930621@mentor.com>

Hi,

we've run into a PTX JIT bug with cuda driver version 381.22 for sm_61 at
-O1 and higher.

This patch adds a workaround, guarded by a macro, enabling the workaround
by default.

Tested on x86_64 with nvidia accelerator.

Committed.

Thanks,
- Tom

Add extra initialization of broadcasted condition variables

2017-07-11  Tom de Vries

        * config/nvptx/nvptx.c (WORKAROUND_PTXJIT_BUG): New macro.
        (bb_first_real_insn): New function.
        (nvptx_single): Add extra initialization of broadcasted condition
        variables.

---
 gcc/config/nvptx/nvptx.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index daeec27..c8847a5 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -74,6 +74,8 @@
 /* This file should be included last.  */
 #include "target-def.h"
 
+#define WORKAROUND_PTXJIT_BUG 1
+
 /* The various PTX memory areas an object might reside in. */
 enum nvptx_data_area
 {
@@ -3844,6 +3846,24 @@ nvptx_wsync (bool after)
   return gen_nvptx_barsync (GEN_INT (after));
 }
 
+#if WORKAROUND_PTXJIT_BUG
+/* Return first real insn in BB, or return NULL_RTX if BB does not contain
+   real insns.  */
+
+static rtx_insn *
+bb_first_real_insn (basic_block bb)
+{
+  rtx_insn *insn;
+
+  /* Find first insn of from block.  */
+  FOR_BB_INSNS (bb, insn)
+    if (INSN_P (insn))
+      return insn;
+
+  return 0;
+}
+#endif
+
 /* Single neutering according to MASK.  FROM is the incoming block and
    TO is the outgoing block.  These may be the same block. Insert at
    start of FROM:
@@ -3958,6 +3978,39 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
           if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
             {
               /* Vector mode only, do a shuffle.  */
+#if WORKAROUND_PTXJIT_BUG
+              /* The branch condition %rcond is propagated like this:
+
+                 {
+                    .reg .u32 %x;
+                    mov.u32 %x,%tid.x;
+                    setp.ne.u32 %rnotvzero,%x,0;
+                 }
+
+                 @%rnotvzero bra Lskip;
+                 setp.. %rcond,op1,op2;
+                 Lskip:
+                 selp.u32 %rcondu32,1,0,%rcond;
+                 shfl.idx.b32 %rcondu32,%rcondu32,0,31;
+                 setp.ne.u32 %rcond,%rcondu32,0;
+
+                 There seems to be a bug in the ptx JIT compiler (observed at driver
+                 version 381.22, at -O1 and higher for sm_61), that drops the shfl
+                 unless %rcond is initialized to something before 'bra Lskip'.  The
+                 bug is not observed with ptxas from cuda 8.0.61.
+
+                 It is true that the code is non-trivial: at Lskip, %rcond is
+                 uninitialized in threads 1-31, and after the selp the same holds
+                 for %rcondu32.  But shfl propagates the defined value in thread 0
+                 to threads 1-31, so after the shfl %rcondu32 is defined in threads
+                 0-31, and after the setp.ne %rcond is defined in threads 0-31.
+
+                 There is nothing in the PTX spec to suggest that this is wrong, or
+                 to explain why the extra initialization is needed.  So, we classify
+                 it as a JIT bug, and the extra initialization as workaround.  */
+              emit_insn_before (gen_movbi (pvar, const0_rtx),
+                                bb_first_real_insn (from));
+#endif
               emit_insn_before (nvptx_gen_vcast (pvar), tail);
             }
           else
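
For illustration only, the propagation sequence described in the nvptx_single
comment can be modeled on the host in plain C. This is a sketch under
assumptions, not code from the patch and not GPU code: a warp is modeled as an
array of 32 lanes, the helper names warp_shfl_idx and broadcast_cond are
hypothetical, and the comparison op1 != op2 merely stands in for the
unspecified "setp.." operation. It shows why the extra initialization does not
change the result: lane 0's value overwrites every lane at the shuffle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32 /* Lanes per warp, i.e. threads 0-31 in the comment.  */

/* Hypothetical host-side model of "shfl.idx.b32 d,a,SRC,31": every lane
   ends up with the value that lane SRC held before the shuffle.  */
static void
warp_shfl_idx (uint32_t vals[WARP_SIZE], unsigned src)
{
  assert (src < WARP_SIZE);
  uint32_t v = vals[src];
  for (unsigned lane = 0; lane < WARP_SIZE; lane++)
    vals[lane] = v;
}

/* Hypothetical model of the sequence with the workaround applied.
   On return, RCOND holds the same value in all lanes.  */
static void
broadcast_cond (bool rcond[WARP_SIZE], uint32_t op1, uint32_t op2)
{
  uint32_t rcondu32[WARP_SIZE];

  /* Workaround: give %rcond a defined value in every lane up front.  */
  for (unsigned lane = 0; lane < WARP_SIZE; lane++)
    rcond[lane] = false;

  /* "@%rnotvzero bra Lskip": only lane 0 evaluates the condition.
     The != comparison is an assumption standing in for "setp..".  */
  rcond[0] = (op1 != op2);

  /* "selp.u32 %rcondu32,1,0,%rcond".  */
  for (unsigned lane = 0; lane < WARP_SIZE; lane++)
    rcondu32[lane] = rcond[lane] ? 1 : 0;

  /* "shfl.idx.b32 %rcondu32,%rcondu32,0,31": broadcast from lane 0.  */
  warp_shfl_idx (rcondu32, 0);

  /* "setp.ne.u32 %rcond,%rcondu32,0".  */
  for (unsigned lane = 0; lane < WARP_SIZE; lane++)
    rcond[lane] = (rcondu32[lane] != 0);
}
```

Calling broadcast_cond with unequal operands leaves every element of the array
true, and with equal operands leaves every element false, regardless of what
lanes 1-31 held before the shuffle.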