Date: Thu, 22 Oct 2015 16:42:00 -0000
From: Alexander Monakov
To: Jakub Jelinek
Cc: Bernd Schmidt, gcc-patches@gcc.gnu.org, Dmitry Melnik
Subject: Re: [gomp4 00/14] NVPTX: further porting
In-Reply-To: <20151022095442.GN478@tucnak.redhat.com>
References: <1445366076-16082-1-git-send-email-amonakov@ispras.ru> <562779F9.9070800@redhat.com> <20151022095442.GN478@tucnak.redhat.com>

On Thu, 22 Oct 2015, Jakub Jelinek wrote:

> Does that apply also to threads within a warp? I.e. is .local local to each
> thread in the warp, or to the whole warp, and if the former, how can say at
> the start of a SIMD region or at its end the local vars be broadcast to
> other threads and collected back? One thing is scalar vars, another
> pointers, or references to various types, or even bigger indirection.

.local is indeed local to each warp member, not the warp as a whole.
What the OpenACC/PTX implementation does is copy the whole stack frame plus live registers; the implementation is in nvptx.c:nvptx_propagate.  I see two possible alternative approaches for OpenMP/PTX.

The first approach is to try and follow the OpenACC scheme.  In OpenMP that will be more complicated.  First, we won't have a single stack frame, so we'll need to emit stack propagation at call sites.  Second, we'll have to ensure that each libgomp function that can appear in a call chain from the target region entry to a simd loop runs in "vector-neutered" mode, that is, threads 1-31 in each warp follow the branches that thread 0 executes.

The second approach is to run all threads in the warp all the time, making sure they execute the same code with the same data, and thus build up the same local state.  In this case we'd need to ensure this invariant: if all threads in the warp have the same state prior to executing an instruction, they also have the same state after executing that instruction (with global state changed as if only one thread had executed it).  Most instructions are safe w.r.t. this invariant.  Atomics break it, so to maintain the invariant for atomics we'd need to conditionally execute each atomic in only one thread, and then copy the register holding the result to the other threads.

Apart from atomics, I see only two more hazards: calls and user asm.  For calls, I think the solution is to execute the call in all threads, demanding that callees uphold the invariant.  To ensure that, we'd need to recompile newlib and other libraries in that mode.  Finally, a few callees are out of our control since they are provided by the driver: malloc, free, vprintf.  Those we can treat like atomics.

What do you think?  Does that sound correct?  Was something like this considered (and rejected?) for OpenACC?

Thanks.
Alexander