From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=G6fy=A5=intel.com=lipeng.zhu@sourceware.org>
Received: from mga14.intel.com (mga14.intel.com [192.55.52.115])
	by sourceware.org (Postfix) with ESMTPS id 694423858D35;
	Mon,  8 May 2023 09:31:33 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 694423858D35
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=intel.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=intel.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1683538293; x=1715074293;
  h=message-id:date:mime-version:subject:to:cc:from:
   content-transfer-encoding;
  bh=PNJY1/RSGSEURlxF4trDxTdjlk1+uZtrOxv5hFmrsKw=;
  b=TZVMtKr+3dNJVJGpPYT5M2jmpxy6JnMOZaH4tivTkpac//9W7SfesaKN
   TPOJa2SVhbEU/YlQvhDrQVPf05u3OKI1yO9bhgDqevao7uikGwvAhsQLX
   V9kv2aHo5GnpUuMuxx6SXK+JQxxwxkepoXi4jPquUuvj4jT5y8Kdi9PMO
   PpAVVokQjE7xmjNpwR63c+7vNdaZFkRLJstpCttiQ7C1tZl6F+ZIU9cR7
   1RYbXjyedlKJYtFZKEJ0aLUPgTlplWrxxA2cOucMFmlBSNDCTWxDH7GW8
   jszsyJDkvGEtj7myGehL1wSBuerqV4o1iv73L6SXPwQbYlK2MJcSk6xWy
   A==;
X-IronPort-AV: E=McAfee;i="6600,9927,10703"; a="349631676"
X-IronPort-AV: E=Sophos;i="5.99,258,1677571200"; 
   d="scan'208";a="349631676"
Received: from fmsmga007.fm.intel.com ([10.253.24.52])
  by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 May 2023 02:31:31 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10703"; a="701351898"
X-IronPort-AV: E=Sophos;i="5.99,258,1677571200"; 
   d="scan'208";a="701351898"
Received: from zhulipeng-win.ccr.corp.intel.com (HELO [10.238.2.93]) ([10.238.2.93])
  by fmsmga007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 May 2023 02:31:30 -0700
Message-ID: <f29e541f-3029-6f2e-3a1f-798f9503a511@intel.com>
Date: Mon, 8 May 2023 17:31:27 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Firefox/102.0 Thunderbird/102.10.0
Subject: Re: [PATCH v3] libgfortran: Replace mutex with rwlock
Content-Language: en-US
To: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>
Cc: tkoenig@netcologne.de, fortran@gcc.gnu.org, gcc-patches@gcc.gnu.org,
 "Lu, Hongjiu" <hongjiu.lu@intel.com>, "Li, Tianyou" <tianyou.li@intel.com>,
 "Deng, Pan" <pan.deng@intel.com>, "Guo, Wangyang" <wangyang.guo@intel.com>
From: "Zhu, Lipeng" <lipeng.zhu@intel.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-10.7 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <fortran.gcc.gnu.org>


On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:
> Hi!
> 
> [please do not top-post]
> 
Sure, thanks for the reminder.

> On Thu, 20 Apr 2023 21:13:08 +0800
> "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
> 
>> Hi Bernhard,
>>
>> Thanks for your questions and suggestions.
>> The rwlock could allow multiple threads to have concurrent read-only
>> access to the cache/unit list, only a single writer is allowed.
> 
> right.
> 
>> Write lock will not be acquired until all read lock are released.
> 
> So i must have confused rwlock with something else, something that allows self to hold a read-lock and
> upgrade that to a write-lock, purposely starving all successive incoming readers. I.e. just toggle your
> RD_TO_WRLOCK impl, here, atomically. This proved to be benefical in some situations in the past.
> Doesn't seem to work with your rwlock, does it
> 
Pthread API doesn't support the upgrade rdlock to wrlock directly, the 
calling thread may deadlock if at the time the call is made it holds the 
read-write lock (whether a read or write lock). 
https://linux.die.net/man/3/pthread_rwlock_wrlock.
That is why we implement RD_TO_WRLOCK like this: release read lock 
firstly and then acquire write lock.

>> And I didn't change the mutex scope when refactor the code, only make
>> a more fine-grained distinction for the read/write cache/unit list.
> 
> Yes of course, i can see you did that.
> 
>> I complete the comment according to your template, I will insert the
>> comment in the source code in next version patch with other refinement
>> by your suggestions.
>> "
>> Now we did not get a unit in cache and unit list, so we need to create
>> a new unit, and update it to cache and unit list.
> 
> s/Now/By now/ or s/Now w/W/ and s/get/find/ "
> We did not find a unit in the cache nor in the unit list, create a new
> (locked) unit and insert into the unit list and cache.
> Manipulating either or both the unit list and the unit cache requires to hold a write-lock [for obvious
> reasons]"
> 
> Superfluous when talking about pthread_rwlock_wrlock since that implies that even the process
> acquiring the wrlock has to first release it's very own rdlock.
> 
Thanks for the correction, I will update the comments in next v4 patch.

>> Prior to update the cache and list, we need to release all read locks,
>> and then immediately to acquire write lock, thus ensure the exclusive
>> update to the cache and unit list.
>> Either way, we will manipulate the cache and/or the unit list so we
>> must take a write lock now.
>> We don't take the write bit in *addition* to the read lock because:
>> 1. It will needlessly complicate releasing the respective lock;
> 
> Under pthread_rwlock_wrlock it will deadlock, so that's wrong?
> Drop that point then? If not, what's your reasoning / observation?
> 
> Under my lock, you hold the R, additionally take the W and then immediately release the R because you
> yourself won't read, just write.
> But mine's not the pthread_rwlock you talk about, admittedly.
> 
Yes, pthread_rwlock doesn't support upgrade rdlock to wrlock directly, 
so we can’t hold rdlock while trying to acquire wrlock. I will drop the 
first point and update the comment accordingly.

>> 2. By separate the read/write lock, it will greatly reduce the
>> contention at the read part, while write part is not always necessary
>> or most unlikely once the unit hit in cache;
> 
> We know that.
> 
>> 3. We try to balance the implementation complexity and the performance
>> gains that fit into current cases we observed.
> 
> .. by just using a pthread_rwlock. And that's the top point iff you keep it at that. That's a fair step, sure.
> BTW, did you look at the RELEASE semantics, respectively the note that one day (and now is that very
> day), we might improve on the release semantics? Can of course be incremental AFAIC
> 
Not quite understand your question about looking at the RELEASE 
semantics. Can you be more specific?

>> "
> 
> If folks agree on this first step then you have my OK with a catchy malloc and the discussion recorded
> here on the list. A second step would be RELEASE.
> And, depending on the underlying capabilities of available locks, further tweaks, obviously.
> 
> PS: and, please, don't top-post
> 
> thanks,
> 
>>
>> Best Regards,
>> Zhu, Lipeng
>>
>> On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:
>>> On 19 April 2023 09:06:28 CEST, Lipeng Zhu via Fortran <fortran@gcc.gnu.org> wrote:
>>>> This patch try to introduce the rwlock and split the read/write to
>>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>>> increase CPU efficiency. In the get_gfc_unit function, the
>>>> percentage to step into the insert_unit function is around 30%, in
>>>> most instances, we can get the unit in the phase of reading the
>>>> unit_cache or unit_root tree. So split the read/write phase by
>>>> rwlock would be an approach to make it more parallel.
>>>>
>>>> BTW, the IPC metrics can gain around 9x in our test server with 220
>>>> cores. The benchmark we used is https://github.com/rwesson/NEAT
>>>>   
>>>    
>>>> +#define RD_TO_WRLOCK(rwlock) \
>>>> +  RWUNLOCK (rwlock);\
>>>> +  WRLOCK (rwlock);
>>>> +#endif
>>>> +
>>>
>>>    
>>>> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
>>>> 82664dc5f98..4312c5f36de 100644
>>>> --- a/libgfortran/io/unit.c
>>>> +++ b/libgfortran/io/unit.c
>>>    
>>>> @@ -329,7 +335,7 @@ get_gfc_unit (int n, int do_create)
>>>>     int c, created = 0;
>>>>
>>>>     NOTE ("Unit n=%d, do_create = %d", n, do_create);
>>>> -  LOCK (&unit_lock);
>>>> +  RDLOCK (&unit_rwlock);
>>>>
>>>> retry:
>>>>     for (c = 0; c < CACHE_SIZE; c++) @@ -350,6 +356,7 @@ retry:
>>>>         if (c == 0)
>>>> 	break;
>>>>       }
>>>> +  RD_TO_WRLOCK (&unit_rwlock);
>>>
>>> So I'm trying to convince myself why it's safe to unlock and only then take the write lock.
>>>
>>> Can you please elaborate/confirm why that's ok?
>>>
>>> I wouldn't mind a comment like
>>> We can release the unit and cache read lock now. We might have to allocate a (locked) unit, below in
>>> do_create.
>>> Either way, we will manipulate the cache and/or the unit list so we have to take a write lock now.
>>>
>>> We don't take the write bit in *addition* to the read lock because..
>>>
>>> (that needlessly complicates releasing the respective locks / it triggers too much contention when
> we..
>>> / ...?)
>>>
>>> thanks,
>>>    
>>>>
>>>>     if (p == NULL && do_create)
>>>>       {
>>>> @@ -368,8 +375,8 @@ retry:
>>>>     if (created)
>>>>       {
>>>>         /* Newly created units have their lock held already
>>>> -	 from insert_unit.  Just unlock UNIT_LOCK and return.  */
>>>> -      UNLOCK (&unit_lock);
>>>> +	 from insert_unit.  Just unlock UNIT_RWLOCK and return.  */
>>>> +      RWUNLOCK (&unit_rwlock);
>>>>         return p;
>>>>       }
>>>>
>>>> @@ -380,7 +387,7 @@ found:
>>>>         if (! TRYLOCK (&p->lock))
>>>> 	{
>>>> 	  /* assert (p->closed == 0); */
>>>> -	  UNLOCK (&unit_lock);
>>>> +	  RWUNLOCK (&unit_rwlock);
>>>> 	  return p;
>>>> 	}
>>>>
>>>> @@ -388,14 +395,14 @@ found:
>>>>       }
>>>>
>>>>
>>>> -  UNLOCK (&unit_lock);
>>>> +  RWUNLOCK (&unit_rwlock);
>>>>
>>>>     if (p != NULL && (p->child_dtio == 0))
>>>>       {
>>>>         LOCK (&p->lock);
>>>>         if (p->closed)
>>>> 	{
>>>> -	  LOCK (&unit_lock);
>>>> +	  WRLOCK (&unit_rwlock);
>>>> 	  UNLOCK (&p->lock);
>>>> 	  if (predec_waiting_locked (p) == 0)
>>>> 	    destroy_unit_mutex (p);
>>>> @@ -593,8 +600,8 @@ init_units (void)
>>>> #endif
>>>> #endif
>>>>
>>>> -#ifndef __GTHREAD_MUTEX_INIT
>>>> -  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_lock);
>>>> +#if (!defined(__GTHREAD_RWLOCK_INIT) &&
>>>> +!defined(__GTHREAD_MUTEX_INIT))
>>>> +  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_rwlock);
>>>> #endif
>>>>
>>>>     if (sizeof (max_offset) == 8)
>>>    
>