From: Cesar Philippidis <Cesar_Philippidis@mentor.com>
To: Jakub Jelinek, Thomas Schwinge, Alexander Monakov, GCC Patches
Subject: Re: [PATCH] omp-low.c split
Date: Tue, 13 Dec 2016 14:48:00 -0000

On 12/13/2016 04:42 AM, Martin Jambor wrote:
>> And this as
>> well.  But omp-grid.c is fine too.
>
> ...I prefer omp-grid.c because I plan to use gridification also for
> GCN targets, though hopefully only as an optimization rather than a
> hard requirement ...and in fact I still think it is a good
> optimization of simple loops for execution on all CUDA-like
> environments with block/thread grids because it removes conditions
> which the run-time can handle better.

Regarding gridification, is your Cauldron talk from 2015 still current, or have there been any significant changes?

When we first started with OpenACC, we used a lot of the existing lower_omp_for* infrastructure to handle ACC LOOP constructs. But there were a couple of problems with that. First, the chunk partitioning caused a lot of overhead, and second, because of the OpenACC execution model it made more sense to write our own functions (lower_oacc_head_tail / lower_oacc_reductions). In fact, during lowering GCC only marks where the loops are. All of those markers get replaced and the loops get optimized during the oaccdevlow pass, which runs in the target compiler.

Right now one of the significant bottlenecks we're experiencing on nvptx targets is I/O. First, prior to launching a PTX kernel, libgomp transfers each data mapping individually and synchronously. I'm debating whether it makes sense to transfer all of those data mappings to the accelerator asynchronously prior to the PTX kernel launch, obviously with an explicit synchronization barrier just before launching the kernel.

Another bottleneck involves firstprivate variables. Often those variables are 'scalars', and consequently they shouldn't need explicit data mappings. I noticed that Jakub introduced a special GOMP_MAP_FIRSTPRIVATE_INT, which omits data mappings for integral types whose precision is less than or equal to that of pointers. It would probably be beneficial to extend this to reals.
The last observation is that OpenMP code in general passes a struct with all of the data mappings to the various OMP regions/offloaded code. That's fine, but for offloading targets, particularly nvptx, it would probably be slightly more efficient if those OMP regions took actual function arguments instead of a single struct. At least on nvptx targets, in order to pass that struct to the accelerator, the runtime must first allocate device memory for it, then copy all of the struct's contents to the device each time prior to launching a PTX kernel. A lot of this could be bypassed because cuLaunchKernel accepts a variable number of kernel arguments. Obviously, those arguments need to be transferred to the accelerator one way or another, so I'm not sure yet how beneficial this optimization would end up being.

To be clear, I'm not proposing any of these changes for gcc7. Any changes to the above will go to gomp-4_0-branch first, then we'll port them over to gcc8.

What type of performance problems are you experiencing with HSA?

Cesar