From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=VaXC=BF=intel.com=lipeng.zhu@sourceware.org>
Received: from mga12.intel.com (mga12.intel.com [192.55.52.136])
	by sourceware.org (Postfix) with ESMTPS id 880703857722;
	Tue, 16 May 2023 07:08:35 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 880703857722
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=intel.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1684220915; x=1715756915;
  h=message-id:date:mime-version:subject:from:to:cc:
   references:in-reply-to:content-transfer-encoding;
  bh=VHXREn9e+/Cir3+y5xsj5nIS9y2URFgvIkXEav+t3e0=;
  b=aQfZOyBUAAwzkrLxnY/q57q1RjIeRAn3CpBn1cYEJR0Kh0IiSOWK6Lgp
   RlFu7tMd0rzSWiOgBDqaGQqrJNHVXCq4hGGJ+VXVcsM43hY/+yjcD4/GJ
   XdkENDdKRDWV6hrN0WYudOLHm+QZuIVwXeaDMaLL8wY3ou+ime36f1x9g
   YOItK18CRsdUbga+npO1z/2xIqr+a2SZgFY7IJMbDjD+RauquBSHe4f68
   x5j5lWynY+Cmy3Wjvnc4mFTtgYvxCFhbNegMkTluhjx3/M0zfy+2EFO1U
   nAdz3KQqZmsIFsvaAq4Vl3HG8+DUiNh19dtzS7hAFGTWzZemdVDP4zK61
   A==;
X-IronPort-AV: E=McAfee;i="6600,9927,10711"; a="331020421"
X-IronPort-AV: E=Sophos;i="5.99,278,1677571200"; 
   d="scan'208";a="331020421"
Received: from orsmga003.jf.intel.com ([10.7.209.27])
  by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 May 2023 00:08:34 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10711"; a="651720399"
X-IronPort-AV: E=Sophos;i="5.99,278,1677571200"; 
   d="scan'208";a="651720399"
Received: from zhulipeng-win.ccr.corp.intel.com (HELO [10.238.2.93]) ([10.238.2.93])
  by orsmga003-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 May 2023 00:08:25 -0700
Message-ID: <dbaa667c-c0f0-ab53-ca27-a4239b55907c@intel.com>
Date: Tue, 16 May 2023 15:08:23 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Firefox/102.0 Thunderbird/102.10.0
Subject: Re: [PATCH v4] libgfortran: Replace mutex with rwlock
From: "Zhu, Lipeng" <lipeng.zhu@intel.com>
To: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>
Cc: fortran@gcc.gnu.org, gcc-patches@gcc.gnu.org, hongjiu.lu@intel.com,
 tianyou.li@intel.com, pan.deng@intel.com, wangyang.guo@intel.com
References: <81c359ae-ab69-7f03-f113-4b865441de44@intel.com>
Content-Language: en-US
In-Reply-To: <81c359ae-ab69-7f03-f113-4b865441de44@intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,BODY_8BITS,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,NICE_REPLY_A,SPF_HELO_PASS,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <fortran.gcc.gnu.org>


On 5/9/2023 10:32 AM, Zhu, Lipeng wrote:
> 
> 
> On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:
>> On Mon,  8 May 2023 17:44:43 +0800
>> Lipeng Zhu <lipeng.zhu@intel.com> wrote:
>>
>>> This patch try to introduce the rwlock and split the read/write to
>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>>> to step into the insert_unit function is around 30%, in most
>>> instances, we can get the unit in the phase of reading the unit_cache
>>> or unit_root tree. So split the read/write phase by rwlock would be an
>>> approach to make it more parallel.
>>>
>>> BTW, the IPC metrics can gain around 9x in our test server with 220
>>> cores. The benchmark we used is https://github.com/rwesson/NEAT
>>
>> See commentary typos below.
>> You did not state if you regression tested the patch?
> I use valgrind --tool=helgrind or --tool=drd to test 'make 
> check-fortran'. Is it necessary to add an additional unit test for this 
> patch?
> 
>> Other than that it LGTM but i cannot approve it.
> Thank you for your kind help for this patch, is there anything that I 
> can do or can you help to push this patch forward?
> 
Hi Bernhard,

Is there any other refinement that need I to do for this patch?

Thanks.


>>
>>> diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index
>>> ad226c8e856..0033cc74252 100644
>>> --- a/libgfortran/io/async.h
>>> +++ b/libgfortran/io/async.h
>>> @@ -210,6 +210,128 @@
>>>       DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", 
>>> aio_prefix, #mutex,
>> mutex); \
> Thanks, corrected in Patch v5.
> 
>>>     } while (0)
>>> +#ifdef __GTHREAD_RWLOCK_INIT
>>> +#define RWLOCK_DEBUG_ADD(rwlock) do {        \
>>> +    aio_rwlock_debug *n;                \
>>> +    n = xmalloc (sizeof(aio_rwlock_debug));    \
>>
>> Missing space before the open brace: sizeof (
>>
> Thanks, corrected in Patch v5.
> 
>>> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
>>> 82664dc5f98..62f1db21d34 100644
>>> --- a/libgfortran/io/unit.c
>>> +++ b/libgfortran/io/unit.c
>>> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME
>>> respectively.  If not, see
>>>   /* IO locking rules:
>>> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and 
>>> UNIT_CACHE.
>>> +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and 
>>> UNIT_CACHE.
>>> +   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
>>> +   and UNIT_CACHE to increase CPU efficiency.
>>
>> s/spilt/split. Maybe:
>>
>> Using an rwlock improves efficiency by allowing us to separate readers 
>> and writers of both UNIT_ROOT
>> and UNIT_CACHE.
>>
> Thanks, corrected in Patch v5.
> 
>>> @@ -350,6 +356,17 @@ retry:
>>>         if (c == 0)
>>>       break;
>>>       }
>>> +  /* We did not find a unit in the cache nor in the unit list, 
>>> create a new
>>> +    (locked) unit and insert into the unit list and cache.
>>> +    Manipulating either or both the unit list and the unit cache 
>>> requires to
>>> +    hold a write-lock [for obvious reasons]:
>>> +    1. By separating the read/write lock, it will greatly reduce the 
>>> contention
>>> +       at the read part, while write part is not always necessary or 
>>> most
>>> +       unlikely once the unit hit in cache.
>>
>> +    By separating the read/write lock, we will greatly reduce the 
>> contention
>> +    on the read part, while the write part is unlikely once the unit 
>> hits
>> +    the cache.
>>
>>> +    2. We try to balance the implementation complexity and the 
>>> performance
>>> +       gains that fit into current cases we observed by just using a
>>> +       pthread_rwlock. */
>>
>> Let's drop 2.
> 
> Got it, thanks!
>> thanks,