Subject: GSoC - Accelerating Fortran DO CONCURRENT in GCC
From: Wileam Yonatan Phan
To: gcc@gcc.gnu.org, fortran@gcc.gnu.org
Cc: rouson@lbl.gov, mjambor@suse.cz, tobias@codesourcery.com, thomas@codesourcery.com, jlarkin@nvidia.com
Date: Thu, 21 Apr 2022 15:26:39 -0400

Hi everyone,

I submitted a very short proposal for the GCC GSoC this year, specifically to work on DO CONCURRENT GPU offloading support. I found out about this literally three days ago (Apr 18) from Thomas Schwinge's post on the OpenACC community Slack. I wish I’d come across this sooner than mere hours before the GSoC proposal deadline on Apr 19. But I guess almost late is better than late -- hopefully y’all will forgive me for this transgression.

The submitted version of the proposal can be accessed here on my personal website:
https://wyphan.github.io/assets/pdf/20220419-AcceleratingFortranDoConcurrentInGCC-GSoC2022.pdf

I personally think that DO CONCURRENT GPU offloading is an ambitious but very doable project, especially since the plan of action has been laid out:

1. Support Fortran 2018 DO CONCURRENT locality specifiers (LOCAL, LOCAL_INIT, SHARED) and the DEFAULT clause in the parser.
2. Support Fortran 202X DO CONCURRENT REDUCTION clause in the parser.
3. Implement actual parallelization controlled by a `-fdo-concurrent=` compiler flag with 5 backends (serial, openmp, parallel, openmp-target, openacc).

I’ll be honest: the last two backends in step 3 (OpenMP target offload and OpenACC) get me excited. At the moment DO CONCURRENT GPU offloading is exclusive to NVIDIA nvfortran and (obviously) NVIDIA GPUs.

I think gfortran holds a special place here. GCC can already offload OpenACC to AMD GPUs. The timing couldn't be more perfect -- the upcoming Frontier exascale system at ORNL will use AMD GPUs, and the _only_ compilers that will support OpenACC on that platform are GCC and Cray! (Though I recall reading that Cray already pulled the plug on OpenACC for C/C++ in cc and CC, such that it only works in ftn.) This is a strong point to capitalize on for GCC _and_ AMD, which I wish more people knew about, especially while OpenMP target offload support is still maturing across all compilers.

Some background on myself: I recently graduated with an MS in Physics from U of Tennessee, Knoxville. My thesis work involved porting my advisor's Gordon Bell-winning (2010) density response code (based on the Elk FP-LAPW DFT package) to Summit at ORNL. The code is modern Fortran (mostly 90/95 but uses some 03/08 features) with MPI and OpenMP (CPU-only); I added OpenACC and calls to the MAGMA [icl.utk.edu/magma] library.
After participating in the OLCF GPU Hackathon 2020 with the code, as well as adding further optimizations, we were able to reach up to 12x wall clock time speedup for a test case inspired by the one used in the CPU-only Gordon Bell version [doi:10.1109/SC.2010.55]. We ported the hotspot identified by initial CPU-only profiling, a nested loop within _one single subroutine_. The defining characteristic of this hotspot is a small-ish (~200x~200 times ~200x~50) batched (~500 to ~4000 batch size) double complex matrix-matrix multiply (ZGEMM), for which we call the MAGMA library (esp. since MAGMA can interface with both cuBLAS and rocBLAS). OpenACC is used to manage the device memory and host<->device transfers. I also wrote 3 OpenACC kernels to support the batching mechanism (one to index the batches, one to fill in the batches with input data, and one to fill in the results from the batches).

As an OpenACC practitioner, I was originally attracted to its simplicity and portability, but have become an advocate for it due to its competitive performance against native programming models (NVIDIA CUDA and AMD HIP). This practical knowledge and experience have led to joint research with Emeritus Professor Lenore Mullin (SUNY Albany) that was presented as a “lessons learned”-style talk during the OpenACC Summit 2021. This talk is about our experience using OpenACC to port her FFT algorithm [arXiv:0811.2535] to target NVIDIA GPUs, starting from a CPU-only OpenMP Fortran code. Since the talk, we've also collaborated on porting her GEMM code [NREL/CP-2C00-80232], starting from a CPU-only C code, to target NVIDIA V100 and A100 GPUs.

Previously, I submitted a bug report to the GCC Bugzilla, which unfortunately turned out to be not-a-bug. Currently I’m working with Damian Rouson at Sourcery Institute on isolating gfortran bugs in derived type finalization and helping him with reproducer codes.

The main motivation for me with this GSoC project is to learn not only how to break the compiler, but also how to fix it.

I also started a thread on the Fortran language Discourse forum to discuss this topic further:
https://fortran-lang.discourse.group/t/gsoc-2022-accelerating-fortran-do-concurrent-in-gcc/3269

Once again, sorry if I started off on the wrong foot. I’m just trying not to cramp too hard and move along with this project.

Thanks,
Wileam Y. Phan
GitHub: @wyphan
https://phan.codes/