public inbox for fortran@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
@ 2023-05-09  2:32 Zhu, Lipeng
  2023-05-16  7:08 ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-05-09  2:32 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, hongjiu.lu, tianyou.li, pan.deng, wangyang.guo



On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:
> On Mon,  8 May 2023 17:44:43 +0800
> Lipeng Zhu <lipeng.zhu@intel.com> wrote:
> 
>> This patch try to introduce the rwlock and split the read/write to
>> unit_root tree and unit_cache with rwlock instead of the mutex to
>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>> to step into the insert_unit function is around 30%, in most
>> instances, we can get the unit in the phase of reading the unit_cache
>> or unit_root tree. So split the read/write phase by rwlock would be an
>> approach to make it more parallel.
>>
>> BTW, the IPC metrics can gain around 9x in our test server with 220
>> cores. The benchmark we used is https://github.com/rwesson/NEAT
> 
> See commentary typos below.
> You did not state if you regression tested the patch?
I use valgrind --tool=helgrind or --tool=drd to test 'make 
check-fortran'. Is it necessary to add an additional unit test for this 
patch?

> Other than that it LGTM but i cannot approve it.
Thank you for your kind help for this patch, is there anything that I 
can do or can you help to push this patch forward?

> 
>> diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index
>> ad226c8e856..0033cc74252 100644
>> --- a/libgfortran/io/async.h
>> +++ b/libgfortran/io/async.h
>> @@ -210,6 +210,128 @@
>>       DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex,
> mutex); \
Thanks, corrected in Patch v5.

>>     } while (0)
>>   
>> +#ifdef __GTHREAD_RWLOCK_INIT
>> +#define RWLOCK_DEBUG_ADD(rwlock) do {		\
>> +    aio_rwlock_debug *n;				\
>> +    n = xmalloc (sizeof(aio_rwlock_debug));	\
> 
> Missing space before the open brace: sizeof (
> 
Thanks, corrected in Patch v5.

>> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
>> 82664dc5f98..62f1db21d34 100644
>> --- a/libgfortran/io/unit.c
>> +++ b/libgfortran/io/unit.c
>> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME
>> respectively.  If not, see
>>   
>>   
>>   /* IO locking rules:
>> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
>> +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
>> +   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
>> +   and UNIT_CACHE to increase CPU efficiency.
> 
> s/spilt/split. Maybe:
> 
> Using an rwlock improves efficiency by allowing us to separate readers and writers of both UNIT_ROOT
> and UNIT_CACHE.
> 
Thanks, corrected in Patch v5.

>> @@ -350,6 +356,17 @@ retry:
>>         if (c == 0)
>>   	break;
>>       }
>> +  /* We did not find a unit in the cache nor in the unit list, create a new
>> +    (locked) unit and insert into the unit list and cache.
>> +    Manipulating either or both the unit list and the unit cache requires to
>> +    hold a write-lock [for obvious reasons]:
>> +    1. By separating the read/write lock, it will greatly reduce the contention
>> +       at the read part, while write part is not always necessary or most
>> +       unlikely once the unit hit in cache.
> 
> +    By separating the read/write lock, we will greatly reduce the contention
> +    on the read part, while the write part is unlikely once the unit hits
> +    the cache.
> 
>> +    2. We try to balance the implementation complexity and the performance
>> +       gains that fit into current cases we observed by just using a
>> +       pthread_rwlock. */
> 
> Let's drop 2.

Got it, thanks!
> thanks,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-05-09  2:32 [PATCH v4] libgfortran: Replace mutex with rwlock Zhu, Lipeng
@ 2023-05-16  7:08 ` Zhu, Lipeng
  2023-05-23  2:53   ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-05-16  7:08 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, hongjiu.lu, tianyou.li, pan.deng, wangyang.guo



On 5/9/2023 10:32 AM, Zhu, Lipeng wrote:
> 
> 
> On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:
>> On Mon,  8 May 2023 17:44:43 +0800
>> Lipeng Zhu <lipeng.zhu@intel.com> wrote:
>>
>>> This patch try to introduce the rwlock and split the read/write to
>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>>> to step into the insert_unit function is around 30%, in most
>>> instances, we can get the unit in the phase of reading the unit_cache
>>> or unit_root tree. So split the read/write phase by rwlock would be an
>>> approach to make it more parallel.
>>>
>>> BTW, the IPC metrics can gain around 9x in our test server with 220
>>> cores. The benchmark we used is https://github.com/rwesson/NEAT
>>
>> See commentary typos below.
>> You did not state if you regression tested the patch?
> I use valgrind --tool=helgrind or --tool=drd to test 'make 
> check-fortran'. Is it necessary to add an additional unit test for this 
> patch?
> 
>> Other than that it LGTM but i cannot approve it.
> Thank you for your kind help for this patch, is there anything that I 
> can do or can you help to push this patch forward?
> 
Hi Bernhard,

Is there any other refinement that need I to do for this patch?

Thanks.


>>
>>> diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index
>>> ad226c8e856..0033cc74252 100644
>>> --- a/libgfortran/io/async.h
>>> +++ b/libgfortran/io/async.h
>>> @@ -210,6 +210,128 @@
>>>       DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", 
>>> aio_prefix, #mutex,
>> mutex); \
> Thanks, corrected in Patch v5.
> 
>>>     } while (0)
>>> +#ifdef __GTHREAD_RWLOCK_INIT
>>> +#define RWLOCK_DEBUG_ADD(rwlock) do {        \
>>> +    aio_rwlock_debug *n;                \
>>> +    n = xmalloc (sizeof(aio_rwlock_debug));    \
>>
>> Missing space before the open brace: sizeof (
>>
> Thanks, corrected in Patch v5.
> 
>>> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
>>> 82664dc5f98..62f1db21d34 100644
>>> --- a/libgfortran/io/unit.c
>>> +++ b/libgfortran/io/unit.c
>>> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME
>>> respectively.  If not, see
>>>   /* IO locking rules:
>>> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and 
>>> UNIT_CACHE.
>>> +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and 
>>> UNIT_CACHE.
>>> +   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
>>> +   and UNIT_CACHE to increase CPU efficiency.
>>
>> s/spilt/split. Maybe:
>>
>> Using an rwlock improves efficiency by allowing us to separate readers 
>> and writers of both UNIT_ROOT
>> and UNIT_CACHE.
>>
> Thanks, corrected in Patch v5.
> 
>>> @@ -350,6 +356,17 @@ retry:
>>>         if (c == 0)
>>>       break;
>>>       }
>>> +  /* We did not find a unit in the cache nor in the unit list, 
>>> create a new
>>> +    (locked) unit and insert into the unit list and cache.
>>> +    Manipulating either or both the unit list and the unit cache 
>>> requires to
>>> +    hold a write-lock [for obvious reasons]:
>>> +    1. By separating the read/write lock, it will greatly reduce the 
>>> contention
>>> +       at the read part, while write part is not always necessary or 
>>> most
>>> +       unlikely once the unit hit in cache.
>>
>> +    By separating the read/write lock, we will greatly reduce the 
>> contention
>> +    on the read part, while the write part is unlikely once the unit 
>> hits
>> +    the cache.
>>
>>> +    2. We try to balance the implementation complexity and the 
>>> performance
>>> +       gains that fit into current cases we observed by just using a
>>> +       pthread_rwlock. */
>>
>> Let's drop 2.
> 
> Got it, thanks!
>> thanks,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-05-16  7:08 ` Zhu, Lipeng
@ 2023-05-23  2:53   ` Zhu, Lipeng
  2023-05-24 19:18     ` Thomas Koenig
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-05-23  2:53 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, hongjiu.lu, tianyou.li, pan.deng, wangyang.guo



On 5/16/2023 3:08 PM, Zhu, Lipeng wrote:
> 
> 
> On 5/9/2023 10:32 AM, Zhu, Lipeng wrote:
>>
>>
>> On 1/1/1970 8:00 AM, Bernhard Reutner-Fischer wrote:
>>> On Mon,  8 May 2023 17:44:43 +0800
>>> Lipeng Zhu <lipeng.zhu@intel.com> wrote:
>>>
>>>> This patch try to introduce the rwlock and split the read/write to
>>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>>>> to step into the insert_unit function is around 30%, in most
>>>> instances, we can get the unit in the phase of reading the unit_cache
>>>> or unit_root tree. So split the read/write phase by rwlock would be an
>>>> approach to make it more parallel.
>>>>
>>>> BTW, the IPC metrics can gain around 9x in our test server with 220
>>>> cores. The benchmark we used is https://github.com/rwesson/NEAT
>>>
>>> See commentary typos below.
>>> You did not state if you regression tested the patch?
>> I use valgrind --tool=helgrind or --tool=drd to test 'make 
>> check-fortran'. Is it necessary to add an additional unit test for 
>> this patch?
>>
>>> Other than that it LGTM but i cannot approve it.
>> Thank you for your kind help for this patch, is there anything that I 
>> can do or can you help to push this patch forward?
>>
> Hi Bernhard,
> 
> Is there any other refinement that need I to do for this patch?
> 
> Thanks.
> 
> 

May I know any comment or concern on this patch, thanks for your time :)

>>>
>>>> diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h index
>>>> ad226c8e856..0033cc74252 100644
>>>> --- a/libgfortran/io/async.h
>>>> +++ b/libgfortran/io/async.h
>>>> @@ -210,6 +210,128 @@
>>>>       DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s 
>>>> %78p\n", aio_prefix, #mutex,
>>> mutex); \
>> Thanks, corrected in Patch v5.
>>
>>>>     } while (0)
>>>> +#ifdef __GTHREAD_RWLOCK_INIT
>>>> +#define RWLOCK_DEBUG_ADD(rwlock) do {        \
>>>> +    aio_rwlock_debug *n;                \
>>>> +    n = xmalloc (sizeof(aio_rwlock_debug));    \
>>>
>>> Missing space before the open brace: sizeof (
>>>
>> Thanks, corrected in Patch v5.
>>
>>>> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c index
>>>> 82664dc5f98..62f1db21d34 100644
>>>> --- a/libgfortran/io/unit.c
>>>> +++ b/libgfortran/io/unit.c
>>>> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME
>>>> respectively.  If not, see
>>>>   /* IO locking rules:
>>>> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and 
>>>> UNIT_CACHE.
>>>> +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and 
>>>> UNIT_CACHE.
>>>> +   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
>>>> +   and UNIT_CACHE to increase CPU efficiency.
>>>
>>> s/spilt/split. Maybe:
>>>
>>> Using an rwlock improves efficiency by allowing us to separate 
>>> readers and writers of both UNIT_ROOT
>>> and UNIT_CACHE.
>>>
>> Thanks, corrected in Patch v5.
>>
>>>> @@ -350,6 +356,17 @@ retry:
>>>>         if (c == 0)
>>>>       break;
>>>>       }
>>>> +  /* We did not find a unit in the cache nor in the unit list, 
>>>> create a new
>>>> +    (locked) unit and insert into the unit list and cache.
>>>> +    Manipulating either or both the unit list and the unit cache 
>>>> requires to
>>>> +    hold a write-lock [for obvious reasons]:
>>>> +    1. By separating the read/write lock, it will greatly reduce 
>>>> the contention
>>>> +       at the read part, while write part is not always necessary 
>>>> or most
>>>> +       unlikely once the unit hit in cache.
>>>
>>> +    By separating the read/write lock, we will greatly reduce the 
>>> contention
>>> +    on the read part, while the write part is unlikely once the unit 
>>> hits
>>> +    the cache.
>>>
>>>> +    2. We try to balance the implementation complexity and the 
>>>> performance
>>>> +       gains that fit into current cases we observed by just using a
>>>> +       pthread_rwlock. */
>>>
>>> Let's drop 2.
>>
>> Got it, thanks!
>>> thanks,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-05-23  2:53   ` Zhu, Lipeng
@ 2023-05-24 19:18     ` Thomas Koenig
  2023-08-18  3:06       ` Zhu, Lipeng
  2023-08-18  3:18       ` [PATCH v6] " Zhu, Lipeng
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Koenig @ 2023-05-24 19:18 UTC (permalink / raw)
  To: Zhu, Lipeng, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, hongjiu.lu, tianyou.li, pan.deng,
	wangyang.guo, Jakub Jelinek

Hi Lipeng,

> May I know any comment or concern on this patch, thanks for your time 😄

Thanks for your patience in getting this reviewed.

A few remarks / questions.

Which strategy is used in this implementation, read-preferring or
write-preferring?  And if read-preferring is used, is there
a danger of deadlock if people do unreasonable things?
Maybe you could explain that, also in a comment in the code.

Can you add some sort of torture test case(s) which does a lot of
opening/closing/reading/writing, possibly with asynchronous
I/O and/or pthreads, to catch possible problems?  If there is a
system dependency or some race condition, chances are that regression
testers will catch this.

With this, the libgfortran parts are OK, unless somebody else has more
comments, so give this a couple of days.  I cannot approve the libgcc
parts, that would be somebody else (Jakub?)

Best regards

	Thomas



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-05-24 19:18     ` Thomas Koenig
@ 2023-08-18  3:06       ` Zhu, Lipeng
  2023-09-14  8:33         ` Zhu, Lipeng
  2023-08-18  3:18       ` [PATCH v6] " Zhu, Lipeng
  1 sibling, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-08-18  3:06 UTC (permalink / raw)
  To: Thomas Koenig, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo,
	Wangyang, Jakub Jelinek

Hi Thomas,

> 
> Hi Lipeng,
> 
> > May I know any comment or concern on this patch, thanks for your time
> > 😄
> 
> Thanks for your patience in getting this reviewed.
> 
> A few remarks / questions.
> 
> Which strategy is used in this implementation, read-preferring or write-
> preferring?  And if read-preferring is used, is there a danger of deadlock if
> people do unreasonable things?
> Maybe you could explain that, also in a comment in the code.
> 

Yes, the implementation use the read-preferring strategy, and comments in code.
When adding the test cases, I didn’t meet the situation which may cause the deadlock.
Maybe you can give more guidance about that.

> Can you add some sort of torture test case(s) which does a lot of
> opening/closing/reading/writing, possibly with asynchronous I/O and/or
> pthreads, to catch possible problems?  If there is a system dependency or
> some race condition, chances are that regression testers will catch this.
> 

Sure, as your comments, in the patch V6, I added 3 test cases with OpenMP to test different cases in concurrency respectively:
1. find and create unit very frequently to stress read lock and write lock.
2. only access the unit which exist in cache to stress read lock.
3. access the same unit in concurrency.
For the third test case, it also help to find a bug:  When unit can't be found in cache nor unit list in read phase, then threads will try to acquire write lock to insert the same unit, this will cause duplicate key error.  
To fix this bug, I get the unit from unit list once again before insert in write lock. More details you can refer the patch v6.

> With this, the libgfortran parts are OK, unless somebody else has more
> comments, so give this a couple of days.  I cannot approve the libgcc parts,
> that would be somebody else (Jakub?)
> 
> Best regards
> 
> 	Thomas
> 

Best Regards,
Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v6] libgfortran: Replace mutex with rwlock
  2023-05-24 19:18     ` Thomas Koenig
  2023-08-18  3:06       ` Zhu, Lipeng
@ 2023-08-18  3:18       ` Zhu, Lipeng
  2023-12-08 10:19         ` Jakub Jelinek
  1 sibling, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-08-18  3:18 UTC (permalink / raw)
  To: tkoenig, rep.dot.nop
  Cc: lipeng.zhu, fortran, gcc-patches, hongjiu.lu, tianyou.li,
	pan.deng, wangyang.guo, jakub, Zhu

From: Lipeng Zhu <lipeng.zhu@intel.com>

This patch try to introduce the rwlock and split the read/write to
unit_root tree and unit_cache with rwlock instead of the mutex to
increase CPU efficiency. In the get_gfc_unit function, the percentage
to step into the insert_unit function is around 30%, in most instances,
we can get the unit in the phase of reading the unit_cache or unit_root
tree. So split the read/write phase by rwlock would be an approach to
make it more parallel.

BTW, the IPC metrics can gain around 9x in our test
server with 220 cores. The benchmark we used is
https://github.com/rwesson/NEAT

libgcc/ChangeLog:

    * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro
    (__gthrw): New function
    (__gthread_rwlock_rdlock): New function
    (__gthread_rwlock_tryrdlock): New function
    (__gthread_rwlock_wrlock): New function
    (__gthread_rwlock_trywrlock): New function
    (__gthread_rwlock_unlock): New function

libgfortran/ChangeLog:

    * io/async.c (DEBUG_LINE): New
    * io/async.h (RWLOCK_DEBUG_ADD): New macro
    (CHECK_RDLOCK): New macro
    (CHECK_WRLOCK): New macro
    (TAIL_RWLOCK_DEBUG_QUEUE): New macro
    (IN_RWLOCK_DEBUG_QUEUE): New macro
    (RDLOCK): New macro
    (WRLOCK): New macro
    (RWUNLOCK): New macro
    (RD_TO_WRLOCK): New macro
    (INTERN_RDLOCK): New macro
    (INTERN_WRLOCK): New macro
    (INTERN_RWUNLOCK): New macro
    * io/io.h (internal_proto): Define unit_rwlock
    * io/transfer.c (st_read_done_worker): Relace unit_lock with unit_rwlock
    (st_write_done_worker): Relace unit_lock with unit_rwlock
    * io/unit.c (get_gfc_unit): Relace unit_lock with unit_rwlock
    (if): Relace unit_lock with unit_rwlock
    (close_unit_1): Relace unit_lock with unit_rwlock
    (close_units): Relace unit_lock with unit_rwlock
    (newunit_alloc): Relace unit_lock with unit_rwlock
    * io/unix.c (flush_all_units): Relace unit_lock with unit_rwlock

---
v1 -> v2:
Limit the pthread_rwlock usage in libgcc only when __cplusplus isn't defined.

v2 -> v3:
Rebase the patch with trunk branch.

v3 -> v4:
Update the comments.

v4 -> v5:
Fix typos and code formatter.

v5 -> v6:
Add unit tests.

Reviewed-by: Hongjiu Lu <hongjiu.lu@intel.com>
Reviewed-by: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>
Reviewed-by: Thomas Koenig <tkoenig@netcologne.de>
Signed-off-by: Zhu, Lipeng <lipeng.zhu@intel.com>
---
 libgcc/gthr-posix.h                           |  60 +++++++
 libgfortran/io/async.c                        |   4 +
 libgfortran/io/async.h                        | 154 +++++++++++++++++-
 libgfortran/io/io.h                           |  15 +-
 libgfortran/io/transfer.c                     |   8 +-
 libgfortran/io/unit.c                         | 116 ++++++++-----
 libgfortran/io/unix.c                         |  16 +-
 .../testsuite/libgomp.fortran/rwlock_1.f90    |  33 ++++
 .../testsuite/libgomp.fortran/rwlock_2.f90    |  22 +++
 .../testsuite/libgomp.fortran/rwlock_3.f90    |  18 ++
 10 files changed, 387 insertions(+), 59 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_1.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_2.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_3.f90

diff --git a/libgcc/gthr-posix.h b/libgcc/gthr-posix.h
index aebcfdd9f4c..73283082997 100644
--- a/libgcc/gthr-posix.h
+++ b/libgcc/gthr-posix.h
@@ -48,6 +48,9 @@ typedef pthread_t __gthread_t;
 typedef pthread_key_t __gthread_key_t;
 typedef pthread_once_t __gthread_once_t;
 typedef pthread_mutex_t __gthread_mutex_t;
+#ifndef __cplusplus
+typedef pthread_rwlock_t __gthread_rwlock_t;
+#endif
 typedef pthread_mutex_t __gthread_recursive_mutex_t;
 typedef pthread_cond_t __gthread_cond_t;
 typedef struct timespec __gthread_time_t;
@@ -58,6 +61,9 @@ typedef struct timespec __gthread_time_t;
 
 #define __GTHREAD_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER
 #define __GTHREAD_MUTEX_INIT_FUNCTION __gthread_mutex_init_function
+#ifndef __cplusplus
+#define __GTHREAD_RWLOCK_INIT PTHREAD_RWLOCK_INITIALIZER
+#endif
 #define __GTHREAD_ONCE_INIT PTHREAD_ONCE_INIT
 #if defined(PTHREAD_RECURSIVE_MUTEX_INITIALIZER)
 #define __GTHREAD_RECURSIVE_MUTEX_INIT PTHREAD_RECURSIVE_MUTEX_INITIALIZER
@@ -135,6 +141,13 @@ __gthrw(pthread_mutexattr_init)
 __gthrw(pthread_mutexattr_settype)
 __gthrw(pthread_mutexattr_destroy)
 
+#ifndef __cplusplus
+__gthrw(pthread_rwlock_rdlock)
+__gthrw(pthread_rwlock_tryrdlock)
+__gthrw(pthread_rwlock_wrlock)
+__gthrw(pthread_rwlock_trywrlock)
+__gthrw(pthread_rwlock_unlock)
+#endif
 
 #if defined(_LIBOBJC) || defined(_LIBOBJC_WEAK)
 /* Objective-C.  */
@@ -885,6 +898,53 @@ __gthread_cond_destroy (__gthread_cond_t* __cond)
   return __gthrw_(pthread_cond_destroy) (__cond);
 }
 
+#ifndef __cplusplus
+static inline int
+__gthread_rwlock_rdlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_rdlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_tryrdlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_tryrdlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_wrlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_wrlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_trywrlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_trywrlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_unlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_unlock) (__rwlock);
+  else
+    return 0;
+}
+#endif
+
 #endif /* _LIBOBJC */
 
 #endif /* ! GCC_GTHR_POSIX_H */
diff --git a/libgfortran/io/async.c b/libgfortran/io/async.c
index 81d1d8175aa..a97272ce5e4 100644
--- a/libgfortran/io/async.c
+++ b/libgfortran/io/async.c
@@ -42,6 +42,10 @@ DEBUG_LINE (__thread const char *aio_prefix = MPREFIX);
 
 DEBUG_LINE (__gthread_mutex_t debug_queue_lock = __GTHREAD_MUTEX_INIT;)
 DEBUG_LINE (aio_lock_debug *aio_debug_head = NULL;)
+#ifdef __GTHREAD_RWLOCK_INIT
+DEBUG_LINE (aio_rwlock_debug *aio_rwlock_debug_head = NULL;)
+DEBUG_LINE (__gthread_rwlock_t debug_queue_rwlock = __GTHREAD_RWLOCK_INIT;)
+#endif
 
 /* Current unit for asynchronous I/O.  Needed for error reporting.  */
 
diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h
index ad226c8e856..0872084a651 100644
--- a/libgfortran/io/async.h
+++ b/libgfortran/io/async.h
@@ -207,9 +207,132 @@
     INTERN_LOCK (&debug_queue_lock);					\
     MUTEX_DEBUG_ADD (mutex);						\
     INTERN_UNLOCK (&debug_queue_lock);					\
-    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, \
+		 mutex); \
   } while (0)
 
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RWLOCK_DEBUG_ADD(rwlock) do {		\
+    aio_rwlock_debug *n;				\
+    n = xmalloc (sizeof (aio_rwlock_debug));	\
+    n->prev = TAIL_RWLOCK_DEBUG_QUEUE;			\
+    if (n->prev)				\
+      n->prev->next = n;			\
+    n->next = NULL;				\
+    n->line = __LINE__;				\
+    n->func = __FUNCTION__;			\
+    n->rw = rwlock;				\
+    if (!aio_rwlock_debug_head) {			\
+      aio_rwlock_debug_head = n;			\
+    }						\
+  } while (0)
+
+#define CHECK_RDLOCK(rwlock, status) do {					\
+    aio_rwlock_debug *curr;						\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    if (__gthread_rwlock_tryrdlock (rwlock)) {				\
+      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
+	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
+      } else								\
+	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
+    }									\
+    else {								\
+      __gthread_rwlock_unlock (rwlock);					\
+      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
+    }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+  }while (0)
+
+#define CHECK_WRLOCK(rwlock, status) do {					\
+    aio_rwlock_debug *curr;						\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    if (__gthread_rwlock_trywrlock (rwlock)) {				\
+      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
+	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
+      } else								\
+	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
+    }									\
+    else {								\
+      __gthread_rwlock_unlock (rwlock);					\
+      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
+    }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+  }while (0)
+
+#define TAIL_RWLOCK_DEBUG_QUEUE ({			\
+      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
+      while (curr && curr->next) {		\
+	curr = curr->next;			\
+      }						\
+      curr;					\
+    })
+
+#define IN_RWLOCK_DEBUG_QUEUE(rwlock) ({		\
+      __label__ end;				\
+      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
+      while (curr) {				\
+	if (curr->rw == rwlock) {			\
+	  goto end;				\
+	}					\
+	curr = curr->next;			\
+      }						\
+    end:;					\
+      curr;					\
+    })
+
+#define RDLOCK(rwlock) do {						\
+    char status[200];							\
+    CHECK_RDLOCK (rwlock, status);						\
+    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
+		 DEBUG_RED "RDLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
+    INTERN_RDLOCK (rwlock);							\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    RWLOCK_DEBUG_ADD (rwlock);						\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
+  } while (0)
+
+#define WRLOCK(rwlock) do {						\
+    char status[200];							\
+    CHECK_WRLOCK (rwlock, status);						\
+    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
+		 DEBUG_RED "WRLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
+    INTERN_WRLOCK (rwlock);							\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    RWLOCK_DEBUG_ADD (rwlock);						\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
+  } while (0)
+
+#define RWUNLOCK(rwlock) do {						\
+    aio_rwlock_debug *curr;						\
+    DEBUG_PRINTF ("%s%-75s %20s():%-5d %18p\n", aio_prefix, DEBUG_GREEN "RWUNLOCK: " DEBUG_NORM #rwlock, \
+		 __FUNCTION__, __LINE__, (void *) rwlock);		\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    curr = IN_RWLOCK_DEBUG_QUEUE (rwlock);					\
+    if (curr)								\
+      {									\
+	if (curr->prev)							\
+	  curr->prev->next = curr->next;				\
+	if (curr->next) {						\
+	  curr->next->prev = curr->prev;				\
+	  if (curr == aio_rwlock_debug_head)					\
+	    aio_rwlock_debug_head = curr->next;				\
+	} else {							\
+	  if (curr == aio_rwlock_debug_head)					\
+	    aio_rwlock_debug_head = NULL;					\
+	}								\
+	free (curr);							\
+      }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    INTERN_RWUNLOCK (rwlock);						\
+  } while (0)
+
+#define RD_TO_WRLOCK(rwlock) \
+  RWUNLOCK (rwlock);\
+  WRLOCK (rwlock);
+#endif
+
 #define DEBUG_LINE(...) __VA_ARGS__
 
 #else
@@ -221,12 +344,31 @@
 #define LOCK(mutex) INTERN_LOCK (mutex)
 #define UNLOCK(mutex) INTERN_UNLOCK (mutex)
 #define TRYLOCK(mutex) (__gthread_mutex_trylock (mutex))
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RDLOCK(rwlock) INTERN_RDLOCK (rwlock)
+#define WRLOCK(rwlock) INTERN_WRLOCK (rwlock)
+#define RWUNLOCK(rwlock) INTERN_RWUNLOCK (rwlock)
+#define RD_TO_WRLOCK(rwlock) \
+  RWUNLOCK (rwlock);\
+  WRLOCK (rwlock);
+#endif
+#endif
+
+#ifndef __GTHREAD_RWLOCK_INIT
+#define RDLOCK(rwlock) LOCK (rwlock)
+#define WRLOCK(rwlock) LOCK (rwlock)
+#define RWUNLOCK(rwlock) UNLOCK (rwlock)
+#define RD_TO_WRLOCK(rwlock) {}
 #endif
 
 #define INTERN_LOCK(mutex) T_ERROR (__gthread_mutex_lock, mutex);
 
 #define INTERN_UNLOCK(mutex) T_ERROR (__gthread_mutex_unlock, mutex);
 
+#define INTERN_RDLOCK(rwlock) T_ERROR (__gthread_rwlock_rdlock, rwlock);
+#define INTERN_WRLOCK(rwlock) T_ERROR (__gthread_rwlock_wrlock, rwlock);
+#define INTERN_RWUNLOCK(rwlock) T_ERROR (__gthread_rwlock_unlock, rwlock);
+
 #if ASYNC_IO
 
 /* au->lock has to be held when calling this macro.  */
@@ -288,8 +430,18 @@ DEBUG_LINE (typedef struct aio_lock_debug{
   struct aio_lock_debug *prev;
 } aio_lock_debug;)
 
+DEBUG_LINE (typedef struct aio_rwlock_debug{
+  __gthread_rwlock_t *rw;
+  int line;
+  const char *func;
+  struct aio_rwlock_debug *next;
+  struct aio_rwlock_debug *prev;
+} aio_rwlock_debug;)
+
 DEBUG_LINE (extern aio_lock_debug *aio_debug_head;)
 DEBUG_LINE (extern __gthread_mutex_t debug_queue_lock;)
+DEBUG_LINE (extern aio_rwlock_debug *aio_rwlock_debug_head;)
+DEBUG_LINE (extern __gthread_rwlock_t debug_queue_rwlock;)
 
 /* Thread - local storage of the current unit we are looking at. Needed for
    error reporting.  */
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index ecdf1dd3f05..15daa0995b1 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -690,7 +690,7 @@ typedef struct gfc_unit
      from the UNIT_ROOT tree, but doesn't free it and the
      last of the waiting threads will do that.
      This must be either atomically increased/decreased, or
-     always guarded by UNIT_LOCK.  */
+     always guarded by UNIT_RWLOCK.  */
   int waiting;
   /* Flag set by close_unit if the unit as been closed.
      Must be manipulated under unit's lock.  */
@@ -769,8 +769,13 @@ internal_proto(default_recl);
 extern gfc_unit *unit_root;
 internal_proto(unit_root);
 
-extern __gthread_mutex_t unit_lock;
-internal_proto(unit_lock);
+#ifdef __GTHREAD_RWLOCK_INIT
+extern __gthread_rwlock_t unit_rwlock;
+internal_proto(unit_rwlock);
+#else
+extern __gthread_mutex_t unit_rwlock;
+internal_proto(unit_rwlock);
+#endif
 
 extern int close_unit (gfc_unit *);
 internal_proto(close_unit);
@@ -1015,9 +1020,9 @@ dec_waiting_unlocked (gfc_unit *u)
 #ifdef HAVE_ATOMIC_FETCH_ADD
   (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
 #else
-  __gthread_mutex_lock (&unit_lock);
+  WRLOCK (&unit_rwlock);
   u->waiting--;
-  __gthread_mutex_unlock (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 #endif
 }
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 8bb5d1101ca..b01f45b80f6 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -4539,9 +4539,9 @@ st_read_done_worker (st_parameter_dt *dtp, bool unlock)
    if (free_newunit)
      {
        /* Avoid inverse lock issues by placing after unlock_unit.  */
-       LOCK (&unit_lock);
+       WRLOCK (&unit_rwlock);
        newunit_free (dtp->common.unit);
-       UNLOCK (&unit_lock);
+       RWUNLOCK (&unit_rwlock);
      }
 }
 
@@ -4636,9 +4636,9 @@ st_write_done_worker (st_parameter_dt *dtp, bool unlock)
    if (free_newunit)
      {
        /* Avoid inverse lock issues by placing after unlock_unit.  */
-       LOCK (&unit_lock);
+       WRLOCK (&unit_rwlock);
        newunit_free (dtp->common.unit);
-       UNLOCK (&unit_lock);
+       RWUNLOCK (&unit_rwlock);
      }
 }
 
diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c
index 82664dc5f98..c5b920b0866 100644
--- a/libgfortran/io/unit.c
+++ b/libgfortran/io/unit.c
@@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 
 
 /* IO locking rules:
-   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
+   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
+   Using an rwlock improves efficiency by allowing us to separate readers
+   and writers of both UNIT_ROOT and UNIT_CACHE.
    Concurrent use of different units should be supported, so
    each unit has its own lock, LOCK.
    Open should be atomic with its reopening of units and list_read.c
    in several places needs find_unit another unit while holding stdin
-   unit's lock, so it must be possible to acquire UNIT_LOCK while holding
+   unit's lock, so it must be possible to acquire UNIT_RWLOCK while holding
    some unit's lock.  Therefore to avoid deadlocks, it is forbidden
-   to acquire unit's private locks while holding UNIT_LOCK, except
+   to acquire unit's private locks while holding UNIT_RWLOCK, except
    for freshly created units (where no other thread can get at their
    address yet) or when using just trylock rather than lock operation.
    In addition to unit's private lock each unit has a WAITERS counter
    and CLOSED flag.  WAITERS counter must be either only
    atomically incremented/decremented in all places (if atomic builtins
-   are supported), or protected by UNIT_LOCK in all places (otherwise).
+   are supported), or protected by UNIT_RWLOCK in all places (otherwise).
    CLOSED flag must be always protected by unit's LOCK.
-   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_LOCK held,
+   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_RWLOCK held,
    WAITERS must be incremented to avoid concurrent close from freeing
-   the unit between unlocking UNIT_LOCK and acquiring unit's LOCK.
-   Unit freeing is always done under UNIT_LOCK.  If close_unit sees any
+   the unit between unlocking UNIT_RWLOCK and acquiring unit's LOCK.
+   Unit freeing is always done under UNIT_RWLOCK.  If close_unit sees any
    WAITERS, it doesn't free the unit but instead sets the CLOSED flag
    and the thread that decrements WAITERS to zero while CLOSED flag is
-   set is responsible for freeing it (while holding UNIT_LOCK).
+   set is responsible for freeing it (while holding UNIT_RWLOCK).
    flush_all_units operation is iterating over the unit tree with
-   increasing UNIT_NUMBER while holding UNIT_LOCK and attempting to
+   increasing UNIT_NUMBER while holding UNIT_RWLOCK and attempting to
    flush each unit (and therefore needs the unit's LOCK held as well).
    To avoid deadlocks, it just trylocks the LOCK and if unsuccessful,
-   remembers the current unit's UNIT_NUMBER, unlocks UNIT_LOCK, acquires
-   unit's LOCK and after flushing reacquires UNIT_LOCK and restarts with
+   remembers the current unit's UNIT_NUMBER, unlocks UNIT_RWLOCK, acquires
+   unit's LOCK and after flushing reacquires UNIT_RWLOCK and restarts with
    the smallest UNIT_NUMBER above the last one flushed.
 
    If find_unit/find_or_create_unit/find_file/get_unit routines return
@@ -101,10 +103,14 @@ gfc_offset max_offset;
 gfc_offset default_recl;
 
 gfc_unit *unit_root;
+#ifdef __GTHREAD_RWLOCK_INIT
+__gthread_rwlock_t unit_rwlock = __GTHREAD_RWLOCK_INIT;
+#else
 #ifdef __GTHREAD_MUTEX_INIT
-__gthread_mutex_t unit_lock = __GTHREAD_MUTEX_INIT;
+__gthread_mutex_t unit_rwlock = __GTHREAD_MUTEX_INIT;
 #else
-__gthread_mutex_t unit_lock;
+__gthread_mutex_t unit_rwlock;
+#endif
 #endif
 
 /* We use these filenames for error reporting.  */
@@ -317,6 +323,27 @@ delete_unit (gfc_unit *old)
   unit_root = delete_treap (old, unit_root);
 }
 
+/* get_gfc_unit_from_root()-- Given an integer, return a pointer
+  to the unit structure. Returns NULL if the unit does not exist,
+   otherwise returns a locked unit. */
+
+static inline gfc_unit * get_gfc_unit_from_unit_root (int n)
+{
+    gfc_unit *p;
+    int c = 0;
+    p = unit_root;
+  while (p != NULL)
+    {
+      c = compare (n, p->unit_number);
+      if (c < 0)
+	p = p->left;
+      if (c > 0)
+	p = p->right;
+      if (c == 0)
+	break;
+    }
+    return p;
+}
 
 /* get_gfc_unit()-- Given an integer, return a pointer to the unit
    structure.  Returns NULL if the unit does not exist,
@@ -329,7 +356,7 @@ get_gfc_unit (int n, int do_create)
   int c, created = 0;
 
   NOTE ("Unit n=%d, do_create = %d", n, do_create);
-  LOCK (&unit_lock);
+  RDLOCK (&unit_rwlock);
 
 retry:
   for (c = 0; c < CACHE_SIZE; c++)
@@ -339,18 +366,25 @@ retry:
 	goto found;
       }
 
-  p = unit_root;
-  while (p != NULL)
-    {
-      c = compare (n, p->unit_number);
-      if (c < 0)
-	p = p->left;
-      if (c > 0)
-	p = p->right;
-      if (c == 0)
-	break;
-    }
-
+  p = get_gfc_unit_from_unit_root (n);
+
+  /* We did not find a unit in the cache nor in the unit list,
+    create a new (locked) unit and insert into the unit list and
+    cache. Manipulating either or both the unit list and the unit
+    cache requires to hold a write-lock [for obvious reasons]:
+    By separating the read/write lock, we will greatly reduce
+    the contention on the read part, while the write part is
+    unlikely once the unit hits the cache. */
+  RD_TO_WRLOCK (&unit_rwlock);
+
+  /* In the case of high concurrency, when multiple threads want
+    to find or create the same unit, the unit number may not
+    exist in cache nor in the unit list during read phase, then
+    threads will acquire the write-lock to insert the same unit
+    number to unit list. To avoid duplicate insert, we need to
+    find unit list once again to ensure that the unit number
+    not exist. */
+  p = get_gfc_unit_from_unit_root (n);
   if (p == NULL && do_create)
     {
       p = insert_unit (n);
@@ -368,8 +402,8 @@ retry:
   if (created)
     {
       /* Newly created units have their lock held already
-	 from insert_unit.  Just unlock UNIT_LOCK and return.  */
-      UNLOCK (&unit_lock);
+	 from insert_unit.  Just unlock UNIT_RWLOCK and return.  */
+      RWUNLOCK (&unit_rwlock);
       return p;
     }
 
@@ -380,7 +414,7 @@ found:
       if (! TRYLOCK (&p->lock))
 	{
 	  /* assert (p->closed == 0); */
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
 	  return p;
 	}
 
@@ -388,14 +422,14 @@ found:
     }
 
 
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 
   if (p != NULL && (p->child_dtio == 0))
     {
       LOCK (&p->lock);
       if (p->closed)
 	{
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&p->lock);
 	  if (predec_waiting_locked (p) == 0)
 	    destroy_unit_mutex (p);
@@ -593,8 +627,8 @@ init_units (void)
 #endif
 #endif
 
-#ifndef __GTHREAD_MUTEX_INIT
-  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_lock);
+#if (!defined(__GTHREAD_RWLOCK_INIT) && !defined(__GTHREAD_MUTEX_INIT))
+  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_rwlock);
 #endif
 
   if (sizeof (max_offset) == 8)
@@ -731,7 +765,7 @@ close_unit_1 (gfc_unit *u, int locked)
 
   u->closed = 1;
   if (!locked)
-    LOCK (&unit_lock);
+    WRLOCK (&unit_rwlock);
 
   for (i = 0; i < CACHE_SIZE; i++)
     if (unit_cache[i] == u)
@@ -758,7 +792,7 @@ close_unit_1 (gfc_unit *u, int locked)
     destroy_unit_mutex (u);
 
   if (!locked)
-    UNLOCK (&unit_lock);
+    RWUNLOCK (&unit_rwlock);
 
   return rc;
 }
@@ -795,10 +829,10 @@ close_unit (gfc_unit *u)
 void
 close_units (void)
 {
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   while (unit_root != NULL)
     close_unit_1 (unit_root, 1);
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 
   free (newunits);
 
@@ -905,7 +939,7 @@ finish_last_advance_record (gfc_unit *u)
 int
 newunit_alloc (void)
 {
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   if (!newunits)
     {
       newunits = xcalloc (16, 1);
@@ -919,7 +953,7 @@ newunit_alloc (void)
         {
           newunits[ii] = true;
           newunit_lwi = ii + 1;
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
           return -ii + NEWUNIT_START;
         }
     }
@@ -932,12 +966,12 @@ newunit_alloc (void)
   memset (newunits + old_size, 0, old_size);
   newunits[old_size] = true;
   newunit_lwi = old_size + 1;
-    UNLOCK (&unit_lock);
+    RWUNLOCK (&unit_rwlock);
   return -old_size + NEWUNIT_START;
 }
 
 
-/* Free a previously allocated newunit= unit number.  unit_lock must
+/* Free a previously allocated newunit= unit number.  unit_rwlock must
    be held when calling.  */
 
 void
diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index ba12be08252..d53003919ab 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -1774,7 +1774,7 @@ find_file (const char *file, gfc_charlen_type file_len)
   id = id_from_path (path);
 #endif
 
-  LOCK (&unit_lock);
+  RDLOCK (&unit_rwlock);
 retry:
   u = find_file0 (unit_root, FIND_FILE0_ARGS);
   if (u != NULL)
@@ -1783,19 +1783,19 @@ retry:
       if (! __gthread_mutex_trylock (&u->lock))
 	{
 	  /* assert (u->closed == 0); */
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
 	  goto done;
 	}
 
       inc_waiting_locked (u);
     }
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
   if (u != NULL)
     {
       LOCK (&u->lock);
       if (u->closed)
 	{
-	  LOCK (&unit_lock);
+	  RDLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  if (predec_waiting_locked (u) == 0)
 	    free (u);
@@ -1839,13 +1839,13 @@ flush_all_units (void)
   gfc_unit *u;
   int min_unit = 0;
 
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   do
     {
       u = flush_all_units_1 (unit_root, min_unit);
       if (u != NULL)
 	inc_waiting_locked (u);
-      UNLOCK (&unit_lock);
+      RWUNLOCK (&unit_rwlock);
       if (u == NULL)
 	return;
 
@@ -1856,13 +1856,13 @@ flush_all_units (void)
       if (u->closed == 0)
 	{
 	  sflush (u->s);
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  (void) predec_waiting_locked (u);
 	}
       else
 	{
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  if (predec_waiting_locked (u) == 0)
 	    free (u);
diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_1.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
new file mode 100644
index 00000000000..f90ecbeb00f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
@@ -0,0 +1,33 @@
+! { dg-do run }
+! Multiple threads call open/write/read/close in concurrency with different unit number,
+! threads can acquire read lock concurrently, to find unit from cache or unit list very frequently,
+! if not found, threads will acquire the write lock exclusively to insert unit to cache and unit list.
+! This test case is used to stress both the read and write lock when access unit cache and list.
+program main
+  use omp_lib
+  implicit none
+  integer:: unit_number, v1, v2, i
+  character(11) :: file_name
+  character(3) :: async = "no"
+  !$omp parallel private (unit_number, v1, v2, file_name, async, i)
+    do i = 0, 100
+      unit_number = 10 + omp_get_thread_num ()
+      write (file_name, "(I3, A)") unit_number, "_tst.dat"
+      file_name = adjustl(file_name)
+      open (unit_number, file=file_name, asynchronous="yes")
+      ! call inquire with file parameter to test find_file in unix.c
+      inquire (file=file_name, asynchronous=async)
+      if (async /= "YES") stop 1
+      write (unit_number, *, asynchronous="yes") unit_number
+      write (unit_number, *, asynchronous="yes") unit_number + 1
+      close(unit_number)
+
+      open (unit_number, file = file_name, asynchronous="yes")
+      read (unit_number, *, asynchronous="yes") v1
+      read (unit_number, *, asynchronous="yes") v2
+      wait (unit_number)
+      if ((v1 /= unit_number) .or. (v2 /= unit_number + 1)) stop 2
+      close(unit_number, status="delete")
+    end do
+  !$omp end parallel
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_2.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_2.f90
new file mode 100644
index 00000000000..08c80d14cfb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_2.f90
@@ -0,0 +1,22 @@
+! { dg-do run }
+! Insert a unit into cache at the beginning, then start multiple
+! threads to access the same unit concurrency, unit will be found in unit cache during the read lock phase.
+! This test case is used to test the read lock when access unit cache and list.
+program main
+  use omp_lib
+  implicit none
+  integer:: thread_id, total_threads, i, j
+  total_threads = omp_get_max_threads ()
+  open (10, file='tst.dat', asynchronous="yes")
+  !$omp parallel private (thread_id, i, j)
+    do i = 1, 100
+      thread_id = omp_get_thread_num ()
+      do j = 1, 100
+        write (10, *, asynchronous="yes") thread_id, i
+      end do
+    end do
+  !$omp end parallel
+  ! call inquire with file parameter to test find_file in unix.c
+  call flush ()
+  close (10, status="delete")
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_3.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_3.f90
new file mode 100644
index 00000000000..1906fcd7a0b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_3.f90
@@ -0,0 +1,18 @@
+! { dg-do run }
+! Find or create the same unit number in concurrency,
+! at beginning, threads cannot find the unit in cache or unit list,
+! then threads will acquire the write lock to insert unit.
+! This test case is used to ensure that no duplicate unit number will be
+! inserted into cache nor unit list when same unit was accessed in concurrency.
+program main
+  use omp_lib
+  implicit none
+  integer:: i
+  !$omp parallel private (i)
+    do i = 1, 100
+      open (10, file='tst.dat', asynchronous="yes")
+      ! Delete the unit number from cache and unit list to stress write lock.
+      close (10, status="delete")
+    end do
+  !$omp end parallel
+end program
-- 
2.39.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-08-18  3:06       ` Zhu, Lipeng
@ 2023-09-14  8:33         ` Zhu, Lipeng
  2023-10-23  1:21           ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-09-14  8:33 UTC (permalink / raw)
  To: Thomas Koenig, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo,
	Wangyang, Jakub Jelinek

> Hi Thomas,
> 
> >
> > Hi Lipeng,
> >
> > > May I know any comment or concern on this patch, thanks for your
> > > time
> > > 😄
> >
> > Thanks for your patience in getting this reviewed.
> >
> > A few remarks / questions.
> >
> > Which strategy is used in this implementation, read-preferring or
> > write- preferring?  And if read-preferring is used, is there a danger
> > of deadlock if people do unreasonable things?
> > Maybe you could explain that, also in a comment in the code.
> >
> 
> Yes, the implementation use the read-preferring strategy, and comments in
> code.
> When adding the test cases, I didn’t meet the situation which may cause the
> deadlock.
> Maybe you can give more guidance about that.
> 
> > Can you add some sort of torture test case(s) which does a lot of
> > opening/closing/reading/writing, possibly with asynchronous I/O and/or
> > pthreads, to catch possible problems?  If there is a system dependency
> > or some race condition, chances are that regression testers will catch this.
> >
> 
> Sure, as your comments, in the patch V6, I added 3 test cases with OpenMP to
> test different cases in concurrency respectively:
> 1. find and create unit very frequently to stress read lock and write lock.
> 2. only access the unit which exist in cache to stress read lock.
> 3. access the same unit in concurrency.
> For the third test case, it also help to find a bug:  When unit can't be found in
> cache nor unit list in read phase, then threads will try to acquire write lock to
> insert the same unit, this will cause duplicate key error.
> To fix this bug, I get the unit from unit list once again before insert in write lock.
> More details you can refer the patch v6.
> 

Could you help to review this update? I really appreciate your assistance.

> > With this, the libgfortran parts are OK, unless somebody else has more
> > comments, so give this a couple of days.  I cannot approve the libgcc
> > parts, that would be somebody else (Jakub?)
> >
> > Best regards
> >
> > 	Thomas
> >
> 
> Best Regards,
> Lipeng Zhu

Best Regards,
Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-09-14  8:33         ` Zhu, Lipeng
@ 2023-10-23  1:21           ` Zhu, Lipeng
  2023-10-23  5:52             ` Thomas Koenig
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-10-23  1:21 UTC (permalink / raw)
  To: Thomas Koenig, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo,
	Wangyang, Jakub Jelinek

> 
> > Hi Thomas,
> >
> > >
> > > Hi Lipeng,
> > >
> > > > May I know any comment or concern on this patch, thanks for your
> > > > time
> > > > 😄
> > >
> > > Thanks for your patience in getting this reviewed.
> > >
> > > A few remarks / questions.
> > >
> > > Which strategy is used in this implementation, read-preferring or
> > > write- preferring?  And if read-preferring is used, is there a
> > > danger of deadlock if people do unreasonable things?
> > > Maybe you could explain that, also in a comment in the code.
> > >
> >
> > Yes, the implementation use the read-preferring strategy, and comments
> > in code.
> > When adding the test cases, I didn’t meet the situation which may
> > cause the deadlock.
> > Maybe you can give more guidance about that.
> >
> > > Can you add some sort of torture test case(s) which does a lot of
> > > opening/closing/reading/writing, possibly with asynchronous I/O
> > > and/or pthreads, to catch possible problems?  If there is a system
> > > dependency or some race condition, chances are that regression testers
> will catch this.
> > >
> >
> > Sure, as your comments, in the patch V6, I added 3 test cases with
> > OpenMP to test different cases in concurrency respectively:
> > 1. find and create unit very frequently to stress read lock and write lock.
> > 2. only access the unit which exist in cache to stress read lock.
> > 3. access the same unit in concurrency.
> > For the third test case, it also help to find a bug:  When unit can't
> > be found in cache nor unit list in read phase, then threads will try
> > to acquire write lock to insert the same unit, this will cause duplicate key
> error.
> > To fix this bug, I get the unit from unit list once again before insert in write
> lock.
> > More details you can refer the patch v6.
> >
> 
> Could you help to review this update? I really appreciate your assistance.
> 

Hi Thomas, Bernhard,

Could you help to review this update?  Any concern will be appreciated.

Regards,
Lipeng Zhu
> > > With this, the libgfortran parts are OK, unless somebody else has
> > > more comments, so give this a couple of days.  I cannot approve the
> > > libgcc parts, that would be somebody else (Jakub?)
> > >
> > > Best regards
> > >
> > > 	Thomas
> > >
> >
> > Best Regards,
> > Lipeng Zhu
> 
> Best Regards,
> Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-10-23  1:21           ` Zhu, Lipeng
@ 2023-10-23  5:52             ` Thomas Koenig
  2023-10-23 23:59               ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Koenig @ 2023-10-23  5:52 UTC (permalink / raw)
  To: Zhu, Lipeng, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo,
	Wangyang, Jakub Jelinek

Hi Lipeng,

>>> Sure, as your comments, in the patch V6, I added 3 test cases with
>>> OpenMP to test different cases in concurrency respectively:
>>> 1. find and create unit very frequently to stress read lock and write lock.
>>> 2. only access the unit which exist in cache to stress read lock.
>>> 3. access the same unit in concurrency.
>>> For the third test case, it also help to find a bug:  When unit can't
>>> be found in cache nor unit list in read phase, then threads will try
>>> to acquire write lock to insert the same unit, this will cause duplicate key
>> error.
>>> To fix this bug, I get the unit from unit list once again before insert in write
>> lock.
>>> More details you can refer the patch v6.
>>>
>>
>> Could you help to review this update? I really appreciate your assistance.
>>

> Could you help to review this update?  Any concern will be appreciated.

Fortran parts are OK (I think I wrote that already), we need somebody
for the non-Fortran parts.

Jakub, could you maybe take a look?

Best regards

	Thomas


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-10-23  5:52             ` Thomas Koenig
@ 2023-10-23 23:59               ` Zhu, Lipeng
  2023-11-01 10:14                 ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-10-23 23:59 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Koenig, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo, Wangyang

> 
> Hi Lipeng,
> 
> >>> Sure, as your comments, in the patch V6, I added 3 test cases with
> >>> OpenMP to test different cases in concurrency respectively:
> >>> 1. find and create unit very frequently to stress read lock and write lock.
> >>> 2. only access the unit which exist in cache to stress read lock.
> >>> 3. access the same unit in concurrency.
> >>> For the third test case, it also help to find a bug:  When unit
> >>> can't be found in cache nor unit list in read phase, then threads
> >>> will try to acquire write lock to insert the same unit, this will
> >>> cause duplicate key
> >> error.
> >>> To fix this bug, I get the unit from unit list once again before
> >>> insert in write
> >> lock.
> >>> More details you can refer the patch v6.
> >>>
> >>
> >> Could you help to review this update? I really appreciate your assistance.
> >>
> 
> > Could you help to review this update?  Any concern will be appreciated.
> 
> Fortran parts are OK (I think I wrote that already), we need somebody for the
> non-Fortran parts.
> 
Hi Thomas,

Thanks for your response. Very appreciate for your patience and help.

> Jakub, could you maybe take a look?
> 
> Best regards
> 
> 	Thomas

Hi Jakub,

Can you help to take a look at the change for libgcc part that added 
several rwlock macros in libgcc/gthr-posix.h?

Best Regards,
Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-10-23 23:59               ` Zhu, Lipeng
@ 2023-11-01 10:14                 ` Zhu, Lipeng
  2023-11-02  9:58                   ` Bernhard Reutner-Fischer
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-11-01 10:14 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Koenig, Bernhard Reutner-Fischer
  Cc: fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo, Wangyang

> >
> > Hi Lipeng,
> >
> > >>> Sure, as your comments, in the patch V6, I added 3 test cases with
> > >>> OpenMP to test different cases in concurrency respectively:
> > >>> 1. find and create unit very frequently to stress read lock and write lock.
> > >>> 2. only access the unit which exist in cache to stress read lock.
> > >>> 3. access the same unit in concurrency.
> > >>> For the third test case, it also help to find a bug:  When unit
> > >>> can't be found in cache nor unit list in read phase, then threads
> > >>> will try to acquire write lock to insert the same unit, this will
> > >>> cause duplicate key
> > >> error.
> > >>> To fix this bug, I get the unit from unit list once again before
> > >>> insert in write
> > >> lock.
> > >>> More details you can refer the patch v6.
> > >>>
> > >>
> > >> Could you help to review this update? I really appreciate your assistance.
> > >>
> >
> > > Could you help to review this update?  Any concern will be appreciated.
> >
> > Fortran parts are OK (I think I wrote that already), we need somebody
> > for the non-Fortran parts.
> >
> Hi Thomas,
> 
> Thanks for your response. Very appreciate for your patience and help.
> 
> > Jakub, could you maybe take a look?
> >
> > Best regards
> >
> > 	Thomas
> 
> Hi Jakub,
> 
> Can you help to take a look at the change for libgcc part that added several
> rwlock macros in libgcc/gthr-posix.h?
> 

Hi Jakub,

Could you help to review this, any comment will be greatly appreciated.

> Best Regards,
> Lipeng Zhu


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-11-01 10:14                 ` Zhu, Lipeng
@ 2023-11-02  9:58                   ` Bernhard Reutner-Fischer
  2023-11-23  9:36                     ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Bernhard Reutner-Fischer @ 2023-11-02  9:58 UTC (permalink / raw)
  To: Zhu, Lipeng
  Cc: rep.dot.nop, Jakub Jelinek, Thomas Koenig, fortran, gcc-patches,
	Lu, Hongjiu, Li, Tianyou, Deng, Pan, Guo, Wangyang, ian

[CCing Ian as libgcc maintainer]

On Wed, 1 Nov 2023 10:14:37 +0000
"Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:

> > >
> > > Hi Lipeng,
> > >  
> > > >>> Sure, as your comments, in the patch V6, I added 3 test cases with
> > > >>> OpenMP to test different cases in concurrency respectively:
> > > >>> 1. find and create unit very frequently to stress read lock and write lock.
> > > >>> 2. only access the unit which exist in cache to stress read lock.
> > > >>> 3. access the same unit in concurrency.
> > > >>> For the third test case, it also help to find a bug:  When unit
> > > >>> can't be found in cache nor unit list in read phase, then threads
> > > >>> will try to acquire write lock to insert the same unit, this will
> > > >>> cause duplicate key  
> > > >> error.  
> > > >>> To fix this bug, I get the unit from unit list once again before
> > > >>> insert in write  
> > > >> lock.  
> > > >>> More details you can refer the patch v6.
> > > >>>  
> > > >>
> > > >> Could you help to review this update? I really appreciate your assistance.
> > > >>  
> > >  
> > > > Could you help to review this update?  Any concern will be appreciated.  
> > >
> > > Fortran parts are OK (I think I wrote that already), we need somebody
> > > for the non-Fortran parts.
> > >  
> > Hi Thomas,
> > 
> > Thanks for your response. Very appreciate for your patience and help.
> >   
> > > Jakub, could you maybe take a look?
> > >
> > > Best regards
> > >
> > > 	Thomas  
> > 
> > Hi Jakub,
> > 
> > Can you help to take a look at the change for libgcc part that added several
> > rwlock macros in libgcc/gthr-posix.h?
> >   
> 
> Hi Jakub,
> 
> Could you help to review this, any comment will be greatly appreciated.

Latest version is at
https://inbox.sourceware.org/gcc-patches/20230818031818.2161842-1-lipeng.zhu@intel.com/

> 
> > Best Regards,
> > Lipeng Zhu  
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-11-02  9:58                   ` Bernhard Reutner-Fischer
@ 2023-11-23  9:36                     ` Zhu, Lipeng
  2023-12-07  5:18                       ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-11-23  9:36 UTC (permalink / raw)
  To: ian, Jakub Jelinek, Bernhard Reutner-Fischer
  Cc: Thomas Koenig, fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou,
	Deng, Pan, Guo, Wangyang

> [CCing Ian as libgcc maintainer]
> 
> On Wed, 1 Nov 2023 10:14:37 +0000
> "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
> 
> > > >
> > > > Hi Lipeng,
> > > >
> > > > >>> Sure, as your comments, in the patch V6, I added 3 test cases
> > > > >>> with OpenMP to test different cases in concurrency respectively:
> > > > >>> 1. find and create unit very frequently to stress read lock and write
> lock.
> > > > >>> 2. only access the unit which exist in cache to stress read lock.
> > > > >>> 3. access the same unit in concurrency.
> > > > >>> For the third test case, it also help to find a bug:  When
> > > > >>> unit can't be found in cache nor unit list in read phase, then
> > > > >>> threads will try to acquire write lock to insert the same
> > > > >>> unit, this will cause duplicate key
> > > > >> error.
> > > > >>> To fix this bug, I get the unit from unit list once again
> > > > >>> before insert in write
> > > > >> lock.
> > > > >>> More details you can refer the patch v6.
> > > > >>>
> > > > >>
> > > > >> Could you help to review this update? I really appreciate your
> assistance.
> > > > >>
> > > >
> > > > > Could you help to review this update?  Any concern will be
> appreciated.
> > > >
> > > > Fortran parts are OK (I think I wrote that already), we need
> > > > somebody for the non-Fortran parts.
> > > >
> > > Hi Thomas,
> > >
> > > Thanks for your response. Very appreciate for your patience and help.
> > >
> > > > Jakub, could you maybe take a look?
> > > >
> > > > Best regards
> > > >
> > > > 	Thomas
> > >
> > > Hi Jakub,
> > >
> > > Can you help to take a look at the change for libgcc part that added
> > > several rwlock macros in libgcc/gthr-posix.h?
> > >
> >
> > Hi Jakub,
> >
> > Could you help to review this, any comment will be greatly appreciated.
> 
> Latest version is at
> https://inbox.sourceware.org/gcc-patches/20230818031818.2161842-1-
> lipeng.zhu@intel.com/
> 
Thanks Bernhard.

Hi Ian, 
Could you help to review the changes for libgcc part?  
Very looking forward to your help.

> >
> > > Best Regards,
> > > Lipeng Zhu
> >


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-11-23  9:36                     ` Zhu, Lipeng
@ 2023-12-07  5:18                       ` Zhu, Lipeng
  0 siblings, 0 replies; 38+ messages in thread
From: Zhu, Lipeng @ 2023-12-07  5:18 UTC (permalink / raw)
  To: Jakub Jelinek, ian, Bernhard Reutner-Fischer
  Cc: Thomas Koenig, fortran, gcc-patches, Lu, Hongjiu, Li, Tianyou,
	Deng, Pan, Guo, Wangyang

> > [CCing Ian as libgcc maintainer]
> >
> > On Wed, 1 Nov 2023 10:14:37 +0000
> > "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
> >
> > > > >
> > > > > Hi Lipeng,
> > > > >
> > > > > >>> Sure, as your comments, in the patch V6, I added 3 test
> > > > > >>> cases with OpenMP to test different cases in concurrency
> respectively:
> > > > > >>> 1. find and create unit very frequently to stress read lock
> > > > > >>> and write
> > lock.
> > > > > >>> 2. only access the unit which exist in cache to stress read lock.
> > > > > >>> 3. access the same unit in concurrency.
> > > > > >>> For the third test case, it also help to find a bug:  When
> > > > > >>> unit can't be found in cache nor unit list in read phase,
> > > > > >>> then threads will try to acquire write lock to insert the
> > > > > >>> same unit, this will cause duplicate key
> > > > > >> error.
> > > > > >>> To fix this bug, I get the unit from unit list once again
> > > > > >>> before insert in write
> > > > > >> lock.
> > > > > >>> More details you can refer the patch v6.
> > > > > >>>
> > > > > >>
> > > > > >> Could you help to review this update? I really appreciate
> > > > > >> your
> > assistance.
> > > > > >>
> > > > >
> > > > > > Could you help to review this update?  Any concern will be
> > appreciated.
> > > > >
> > > > > Fortran parts are OK (I think I wrote that already), we need
> > > > > somebody for the non-Fortran parts.
> > > > >
> > > > Hi Thomas,
> > > >
> > > > Thanks for your response. Very appreciate for your patience and help.
> > > >
> > > > > Jakub, could you maybe take a look?
> > > > >
> > > > > Best regards
> > > > >
> > > > > 	Thomas
> > > >
> > > > Hi Jakub,
> > > >
> > > > Can you help to take a look at the change for libgcc part that
> > > > added several rwlock macros in libgcc/gthr-posix.h?
> > > >
> > >
> > > Hi Jakub,
> > >
> > > Could you help to review this, any comment will be greatly appreciated.
> >
> > Latest version is at
> > https://inbox.sourceware.org/gcc-patches/20230818031818.2161842-1-
> > lipeng.zhu@intel.com/
> >
> Thanks Bernhard.
> 
> Hi Ian,
> Could you help to review the changes for libgcc part?
> Very looking forward to your help.
> 
> > >
> > > > Best Regards,
> > > > Lipeng Zhu
> > >

Hi Jakub, 

Could you help to review this patch for the changes in libgcc/gthr-posix.h.
Just as Thomas commented: "Fortran parts are OK".  We need your 
comments for the non-fortran parts.  Very appreciated for your help.
Latest version is at https://inbox.sourceware.org/gcc-patches/20230818031818.2161842-1-lipeng.zhu@intel.com/

Lipeng Zhu,
Best Regards.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6] libgfortran: Replace mutex with rwlock
  2023-08-18  3:18       ` [PATCH v6] " Zhu, Lipeng
@ 2023-12-08 10:19         ` Jakub Jelinek
  2023-12-09 15:13           ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Jelinek @ 2023-12-08 10:19 UTC (permalink / raw)
  To: Zhu, Lipeng
  Cc: tkoenig, rep.dot.nop, fortran, gcc-patches, hongjiu.lu,
	tianyou.li, pan.deng, wangyang.guo, Zhu

On Fri, Aug 18, 2023 at 11:18:19AM +0800, Zhu, Lipeng wrote:
> From: Lipeng Zhu <lipeng.zhu@intel.com>
> 
> This patch try to introduce the rwlock and split the read/write to
> unit_root tree and unit_cache with rwlock instead of the mutex to
> increase CPU efficiency. In the get_gfc_unit function, the percentage
> to step into the insert_unit function is around 30%, in most instances,
> we can get the unit in the phase of reading the unit_cache or unit_root
> tree. So split the read/write phase by rwlock would be an approach to
> make it more parallel.
> 
> BTW, the IPC metrics can gain around 9x in our test
> server with 220 cores. The benchmark we used is
> https://github.com/rwesson/NEAT
> 
> libgcc/ChangeLog:
> 
>     * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro
>     (__gthrw): New function
>     (__gthread_rwlock_rdlock): New function
>     (__gthread_rwlock_tryrdlock): New function
>     (__gthread_rwlock_wrlock): New function
>     (__gthread_rwlock_trywrlock): New function
>     (__gthread_rwlock_unlock): New function
> 
> libgfortran/ChangeLog:
> 
>     * io/async.c (DEBUG_LINE): New
>     * io/async.h (RWLOCK_DEBUG_ADD): New macro
>     (CHECK_RDLOCK): New macro
>     (CHECK_WRLOCK): New macro
>     (TAIL_RWLOCK_DEBUG_QUEUE): New macro
>     (IN_RWLOCK_DEBUG_QUEUE): New macro
>     (RDLOCK): New macro
>     (WRLOCK): New macro
>     (RWUNLOCK): New macro
>     (RD_TO_WRLOCK): New macro
>     (INTERN_RDLOCK): New macro
>     (INTERN_WRLOCK): New macro
>     (INTERN_RWUNLOCK): New macro
>     * io/io.h (internal_proto): Define unit_rwlock
>     * io/transfer.c (st_read_done_worker): Relace unit_lock with unit_rwlock
>     (st_write_done_worker): Relace unit_lock with unit_rwlock
>     * io/unit.c (get_gfc_unit): Relace unit_lock with unit_rwlock
>     (if): Relace unit_lock with unit_rwlock
>     (close_unit_1): Relace unit_lock with unit_rwlock
>     (close_units): Relace unit_lock with unit_rwlock
>     (newunit_alloc): Relace unit_lock with unit_rwlock
>     * io/unix.c (flush_all_units): Relace unit_lock with unit_rwlock

The changeLog entries are all incorrect.
1) they should be indented by a tab, not 4 spaces, and should end with
   a dot
2) when several consecutive descriptions have the same text, especially
   when it is long, it should use Likewise. for the 2nd and following
3) (internal_proto) is certainly not what you've changed, from what I can
   see in io.h you've done:
	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
	a comment.
	(unit_lock): Remove including associated internal_proto.
	(unit_rwlock): New declarations including associated internal_proto.
	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
	unit_lock.
   (if) is certainly not what you've changed either, always find what
   function or macro the change was in, or if you remove something, state
   it, if you add something, state it.
4) all the
   Replace unit_lock with unit_rwlock. descriptions only partially match
   reality, you've also changed the operations on those variables.

> --- a/libgfortran/io/async.h
> +++ b/libgfortran/io/async.h
> @@ -207,9 +207,132 @@
>      INTERN_LOCK (&debug_queue_lock);					\
>      MUTEX_DEBUG_ADD (mutex);						\
>      INTERN_UNLOCK (&debug_queue_lock);					\
> -    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \
> +    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, \
> +		 mutex); \

Why are you changing this at all?

> +#define RD_TO_WRLOCK(rwlock) \
> +  RWUNLOCK (rwlock);\

At least a space before \ (or better tab

> +#define RD_TO_WRLOCK(rwlock) \
> +  RWUNLOCK (rwlock);\

Likewise.

> +  WRLOCK (rwlock);
> +#endif
> +#endif
> +
> +#ifndef __GTHREAD_RWLOCK_INIT
> +#define RDLOCK(rwlock) LOCK (rwlock)
> +#define WRLOCK(rwlock) LOCK (rwlock)
> +#define RWUNLOCK(rwlock) UNLOCK (rwlock)
> +#define RD_TO_WRLOCK(rwlock) {}

do {} while (0)
instead of {}
?

>  #endif
>  
>  #define INTERN_LOCK(mutex) T_ERROR (__gthread_mutex_lock, mutex);
>  
>  #define INTERN_UNLOCK(mutex) T_ERROR (__gthread_mutex_unlock, mutex);
>  
> +#define INTERN_RDLOCK(rwlock) T_ERROR (__gthread_rwlock_rdlock, rwlock);
> +#define INTERN_WRLOCK(rwlock) T_ERROR (__gthread_rwlock_wrlock, rwlock);
> +#define INTERN_RWUNLOCK(rwlock) T_ERROR (__gthread_rwlock_unlock, rwlock);

Admittedly preexisting issue, but I wonder why the ; at the end, all the
uses of RDLOCK etc. I've seen were followed by ;

> --- a/libgfortran/io/unit.c
> +++ b/libgfortran/io/unit.c
> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
>  
>  
>  /* IO locking rules:
> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
> +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.

master rw lock
?

> +/* get_gfc_unit_from_root()-- Given an integer, return a pointer
> +  to the unit structure. Returns NULL if the unit does not exist,
> +   otherwise returns a locked unit. */

Formatting, to is indented differently from otherwise, both should
be 3 spaces, and there should be 2 spaces after ., rather than just
one (in both spots).

> +
> +static inline gfc_unit * get_gfc_unit_from_unit_root (int n)

Formatting, there shouldn't be space after *, but also function
name should be at the beginning of line, so in this case it should
be
static inline gfc_unit *
get_gfc_unit_from_unit_root (int n)

> +{
> +    gfc_unit *p;
> +    int c = 0;
> +    p = unit_root;

This should be indented by 2 spaces rather than 4.

> +  while (p != NULL)
> +    {
> +      c = compare (n, p->unit_number);
> +      if (c < 0)
> +	p = p->left;
> +      if (c > 0)
> +	p = p->right;
> +      if (c == 0)
> +	break;
> +    }
> +    return p;

And the above return p; line as well.

	Jakub


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v6] libgfortran: Replace mutex with rwlock
  2023-12-08 10:19         ` Jakub Jelinek
@ 2023-12-09 15:13           ` Zhu, Lipeng
  2023-12-09 15:39             ` [PATCH v7] " Lipeng Zhu
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-12-09 15:13 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: tkoenig, rep.dot.nop, fortran, gcc-patches, Lu, Hongjiu, Li,
	Tianyou, Deng, Pan, Guo, Wangyang

On 2023/12/8 18:19, Jakub Jelinek wrote:
> On Fri, Aug 18, 2023 at 11:18:19AM +0800, Zhu, Lipeng wrote:
> > From: Lipeng Zhu <lipeng.zhu@intel.com>
> >
> > This patch try to introduce the rwlock and split the read/write to
> > unit_root tree and unit_cache with rwlock instead of the mutex to
> > increase CPU efficiency. In the get_gfc_unit function, the percentage
> > to step into the insert_unit function is around 30%, in most
> > instances, we can get the unit in the phase of reading the unit_cache
> > or unit_root tree. So split the read/write phase by rwlock would be an
> > approach to make it more parallel.
> >
> > BTW, the IPC metrics can gain around 9x in our test server with 220
> > cores. The benchmark we used is https://github.com/rwesson/NEAT
> >
> > libgcc/ChangeLog:
> >
> >     * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro
> >     (__gthrw): New function
> >     (__gthread_rwlock_rdlock): New function
> >     (__gthread_rwlock_tryrdlock): New function
> >     (__gthread_rwlock_wrlock): New function
> >     (__gthread_rwlock_trywrlock): New function
> >     (__gthread_rwlock_unlock): New function
> >
> > libgfortran/ChangeLog:
> >
> >     * io/async.c (DEBUG_LINE): New
> >     * io/async.h (RWLOCK_DEBUG_ADD): New macro
> >     (CHECK_RDLOCK): New macro
> >     (CHECK_WRLOCK): New macro
> >     (TAIL_RWLOCK_DEBUG_QUEUE): New macro
> >     (IN_RWLOCK_DEBUG_QUEUE): New macro
> >     (RDLOCK): New macro
> >     (WRLOCK): New macro
> >     (RWUNLOCK): New macro
> >     (RD_TO_WRLOCK): New macro
> >     (INTERN_RDLOCK): New macro
> >     (INTERN_WRLOCK): New macro
> >     (INTERN_RWUNLOCK): New macro
> >     * io/io.h (internal_proto): Define unit_rwlock
> >     * io/transfer.c (st_read_done_worker): Relace unit_lock with unit_rwlock
> >     (st_write_done_worker): Relace unit_lock with unit_rwlock
> >     * io/unit.c (get_gfc_unit): Relace unit_lock with unit_rwlock
> >     (if): Relace unit_lock with unit_rwlock
> >     (close_unit_1): Relace unit_lock with unit_rwlock
> >     (close_units): Relace unit_lock with unit_rwlock
> >     (newunit_alloc): Relace unit_lock with unit_rwlock
> >     * io/unix.c (flush_all_units): Relace unit_lock with unit_rwlock
> 
> The changeLog entries are all incorrect.
> 1) they should be indented by a tab, not 4 spaces, and should end with
>    a dot
> 2) when several consecutive descriptions have the same text, especially
>    when it is long, it should use Likewise. for the 2nd and following
> 3) (internal_proto) is certainly not what you've changed, from what I can
>    see in io.h you've done:
> 	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
> 	a comment.
> 	(unit_lock): Remove including associated internal_proto.
> 	(unit_rwlock): New declarations including associated internal_proto.
> 	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
> 	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
> 	unit_lock.
>    (if) is certainly not what you've changed either, always find what
>    function or macro the change was in, or if you remove something, state
>    it, if you add something, state it.
> 4) all the
>    Replace unit_lock with unit_rwlock. descriptions only partially match
>    reality, you've also changed the operations on those variables.
> 

Hi Jakub, 
Thanks for your help, very appreciated! I just updated the patch according to
your comments. A new version [PATCH V7] is sent out for your review which 
update the change log and formatting code according to your following 
comments.

Lipeng Zhu

> > --- a/libgfortran/io/async.h
> > +++ b/libgfortran/io/async.h
> > @@ -207,9 +207,132 @@
> >      INTERN_LOCK (&debug_queue_lock);
> 	\
> >      MUTEX_DEBUG_ADD (mutex);
> 	\
> >      INTERN_UNLOCK (&debug_queue_lock);
> 	\
> > -    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-
> 30s %78p\n", aio_prefix, #mutex, mutex); \
> > +    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-
> 30s %78p\n", aio_prefix, #mutex, \
> > +		 mutex); \
> 
> Why are you changing this at all?
> 
> > +#define RD_TO_WRLOCK(rwlock) \
> > +  RWUNLOCK (rwlock);\
> 
> At least a space before \ (or better tab
> 
> > +#define RD_TO_WRLOCK(rwlock) \
> > +  RWUNLOCK (rwlock);\
> 
> Likewise.
> 
> > +  WRLOCK (rwlock);
> > +#endif
> > +#endif
> > +
> > +#ifndef __GTHREAD_RWLOCK_INIT
> > +#define RDLOCK(rwlock) LOCK (rwlock)
> > +#define WRLOCK(rwlock) LOCK (rwlock)
> > +#define RWUNLOCK(rwlock) UNLOCK (rwlock) #define
> RD_TO_WRLOCK(rwlock)
> > +{}
> 
> do {} while (0)
> instead of {}
> ?
> 
> >  #endif
> >
> >  #define INTERN_LOCK(mutex) T_ERROR (__gthread_mutex_lock, mutex);
> >
> >  #define INTERN_UNLOCK(mutex) T_ERROR (__gthread_mutex_unlock,
> mutex);
> >
> > +#define INTERN_RDLOCK(rwlock) T_ERROR (__gthread_rwlock_rdlock,
> > +rwlock); #define INTERN_WRLOCK(rwlock) T_ERROR
> > +(__gthread_rwlock_wrlock, rwlock); #define INTERN_RWUNLOCK(rwlock)
> > +T_ERROR (__gthread_rwlock_unlock, rwlock);
> 
> Admittedly preexisting issue, but I wonder why the ; at the end, all the uses of
> RDLOCK etc. I've seen were followed by ;
> 
> > --- a/libgfortran/io/unit.c
> > +++ b/libgfortran/io/unit.c
> > @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME
> > respectively.  If not, see
> >
> >
> >  /* IO locking rules:
> > -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
> > +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and
> UNIT_CACHE.
> 
> master rw lock
> ?
> 
> > +/* get_gfc_unit_from_root()-- Given an integer, return a pointer
> > +  to the unit structure. Returns NULL if the unit does not exist,
> > +   otherwise returns a locked unit. */
> 
> Formatting, to is indented differently from otherwise, both should be 3 spaces,
> and there should be 2 spaces after ., rather than just one (in both spots).
> 
> > +
> > +static inline gfc_unit * get_gfc_unit_from_unit_root (int n)
> 
> Formatting, there shouldn't be space after *, but also function name should be
> at the beginning of line, so in this case it should be static inline gfc_unit *
> get_gfc_unit_from_unit_root (int n)
> 
> > +{
> > +    gfc_unit *p;
> > +    int c = 0;
> > +    p = unit_root;
> 
> This should be indented by 2 spaces rather than 4.
> 
> > +  while (p != NULL)
> > +    {
> > +      c = compare (n, p->unit_number);
> > +      if (c < 0)
> > +	p = p->left;
> > +      if (c > 0)
> > +	p = p->right;
> > +      if (c == 0)
> > +	break;
> > +    }
> > +    return p;
> 
> And the above return p; line as well.
> 
> 	Jakub



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-09 15:39             ` [PATCH v7] " Lipeng Zhu
@ 2023-12-09 15:23               ` Jakub Jelinek
  2023-12-10  3:25                 ` Zhu, Lipeng
  2023-12-14 15:50               ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 38+ messages in thread
From: Jakub Jelinek @ 2023-12-09 15:23 UTC (permalink / raw)
  To: Lipeng Zhu
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo

On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
> This patch try to introduce the rwlock and split the read/write to
> unit_root tree and unit_cache with rwlock instead of the mutex to
> increase CPU efficiency. In the get_gfc_unit function, the percentage
> to step into the insert_unit function is around 30%, in most instances,
> we can get the unit in the phase of reading the unit_cache or unit_root
> tree. So split the read/write phase by rwlock would be an approach to
> make it more parallel.
> 
> BTW, the IPC metrics can gain around 9x in our test
> server with 220 cores. The benchmark we used is
> https://github.com/rwesson/NEAT
> 
> libgcc/ChangeLog:
> 
> 	* gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
> 	(__gthrw): New function.
> 	(__gthread_rwlock_rdlock): New function.
> 	(__gthread_rwlock_tryrdlock): New function.
> 	(__gthread_rwlock_wrlock): New function.
> 	(__gthread_rwlock_trywrlock): New function.
> 	(__gthread_rwlock_unlock): New function.
> 
> libgfortran/ChangeLog:
> 
> 	* io/async.c (DEBUG_LINE): New macro.
> 	* io/async.h (RWLOCK_DEBUG_ADD): New macro.
> 	(CHECK_RDLOCK): New macro.
> 	(CHECK_WRLOCK): New macro.
> 	(TAIL_RWLOCK_DEBUG_QUEUE): New macro.
> 	(IN_RWLOCK_DEBUG_QUEUE): New macro.
> 	(RDLOCK): New macro.
> 	(WRLOCK): New macro.
> 	(RWUNLOCK): New macro.
> 	(RD_TO_WRLOCK): New macro.
> 	(INTERN_RDLOCK): New macro.
> 	(INTERN_WRLOCK): New macro.
> 	(INTERN_RWUNLOCK): New macro.
> 	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
> 	a comment.
> 	(unit_lock): Remove including associated internal_proto.
> 	(unit_rwlock): New declarations including associated internal_proto.
> 	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
> 	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
> 	unit_lock.
> 	* io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
> 	unit_rwlock instead of LOCK and UNLOCK on unit_lock.
> 	(st_write_done_worker): Likewise.
> 	* io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
> 	comment. Use unit_rwlock variable instead of unit_lock variable.
> 	(get_gfc_unit_from_unit_root): New function.
> 	(get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
> 	instead of LOCK and UNLOCK on unit_lock.
> 	(close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
> 	LOCK and UNLOCK on unit_lock.
> 	(close_units): Likewise.
> 	(newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
> 	unit_lock.
> 	* io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
> 	instead of LOCK and UNLOCK on unit_lock.
> 	(flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> 	of LOCK and UNLOCK on unit_lock.

Ok for trunk, thanks.

	Jakub


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-09 15:13           ` Zhu, Lipeng
@ 2023-12-09 15:39             ` Lipeng Zhu
  2023-12-09 15:23               ` Jakub Jelinek
  2023-12-14 15:50               ` Richard Earnshaw (lists)
  0 siblings, 2 replies; 38+ messages in thread
From: Lipeng Zhu @ 2023-12-09 15:39 UTC (permalink / raw)
  To: jakub
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo, Lipeng Zhu

This patch try to introduce the rwlock and split the read/write to
unit_root tree and unit_cache with rwlock instead of the mutex to
increase CPU efficiency. In the get_gfc_unit function, the percentage
to step into the insert_unit function is around 30%, in most instances,
we can get the unit in the phase of reading the unit_cache or unit_root
tree. So split the read/write phase by rwlock would be an approach to
make it more parallel.

BTW, the IPC metrics can gain around 9x in our test
server with 220 cores. The benchmark we used is
https://github.com/rwesson/NEAT

libgcc/ChangeLog:

	* gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
	(__gthrw): New function.
	(__gthread_rwlock_rdlock): New function.
	(__gthread_rwlock_tryrdlock): New function.
	(__gthread_rwlock_wrlock): New function.
	(__gthread_rwlock_trywrlock): New function.
	(__gthread_rwlock_unlock): New function.

libgfortran/ChangeLog:

	* io/async.c (DEBUG_LINE): New macro.
	* io/async.h (RWLOCK_DEBUG_ADD): New macro.
	(CHECK_RDLOCK): New macro.
	(CHECK_WRLOCK): New macro.
	(TAIL_RWLOCK_DEBUG_QUEUE): New macro.
	(IN_RWLOCK_DEBUG_QUEUE): New macro.
	(RDLOCK): New macro.
	(WRLOCK): New macro.
	(RWUNLOCK): New macro.
	(RD_TO_WRLOCK): New macro.
	(INTERN_RDLOCK): New macro.
	(INTERN_WRLOCK): New macro.
	(INTERN_RWUNLOCK): New macro.
	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
	a comment.
	(unit_lock): Remove including associated internal_proto.
	(unit_rwlock): New declarations including associated internal_proto.
	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
	unit_lock.
	* io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
	unit_rwlock instead of LOCK and UNLOCK on unit_lock.
	(st_write_done_worker): Likewise.
	* io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
	comment. Use unit_rwlock variable instead of unit_lock variable.
	(get_gfc_unit_from_unit_root): New function.
	(get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
	instead of LOCK and UNLOCK on unit_lock.
	(close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
	LOCK and UNLOCK on unit_lock.
	(close_units): Likewise.
	(newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
	unit_lock.
	* io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
	instead of LOCK and UNLOCK on unit_lock.
	(flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
	of LOCK and UNLOCK on unit_lock.

---
v1 -> v2:
Limit the pthread_rwlock usage in libgcc only when __cplusplus isn't defined.

v2 -> v3:
Rebase the patch with trunk branch.

v3 -> v4:
Update the comments.

v4 -> v5:
Fix typos and code formatter.

v5 -> v6:
Add unit tests.

v6 -> v7:
Update ChangeLog and code formatter.

Reviewed-by: Hongjiu Lu <hongjiu.lu@intel.com>
Reviewed-by: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>
Reviewed-by: Thomas Koenig <tkoenig@netcologne.de>
Reviewed-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
---
 libgcc/gthr-posix.h                           |  60 +++++++
 libgfortran/io/async.c                        |   4 +
 libgfortran/io/async.h                        | 151 ++++++++++++++++++
 libgfortran/io/io.h                           |  15 +-
 libgfortran/io/transfer.c                     |   8 +-
 libgfortran/io/unit.c                         | 117 +++++++++-----
 libgfortran/io/unix.c                         |  16 +-
 .../testsuite/libgomp.fortran/rwlock_1.f90    |  33 ++++
 .../testsuite/libgomp.fortran/rwlock_2.f90    |  22 +++
 .../testsuite/libgomp.fortran/rwlock_3.f90    |  18 +++
 10 files changed, 386 insertions(+), 58 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_1.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_2.f90
 create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_3.f90

diff --git a/libgcc/gthr-posix.h b/libgcc/gthr-posix.h
index aebcfdd9f4c..73283082997 100644
--- a/libgcc/gthr-posix.h
+++ b/libgcc/gthr-posix.h
@@ -48,6 +48,9 @@ typedef pthread_t __gthread_t;
 typedef pthread_key_t __gthread_key_t;
 typedef pthread_once_t __gthread_once_t;
 typedef pthread_mutex_t __gthread_mutex_t;
+#ifndef __cplusplus
+typedef pthread_rwlock_t __gthread_rwlock_t;
+#endif
 typedef pthread_mutex_t __gthread_recursive_mutex_t;
 typedef pthread_cond_t __gthread_cond_t;
 typedef struct timespec __gthread_time_t;
@@ -58,6 +61,9 @@ typedef struct timespec __gthread_time_t;
 
 #define __GTHREAD_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER
 #define __GTHREAD_MUTEX_INIT_FUNCTION __gthread_mutex_init_function
+#ifndef __cplusplus
+#define __GTHREAD_RWLOCK_INIT PTHREAD_RWLOCK_INITIALIZER
+#endif
 #define __GTHREAD_ONCE_INIT PTHREAD_ONCE_INIT
 #if defined(PTHREAD_RECURSIVE_MUTEX_INITIALIZER)
 #define __GTHREAD_RECURSIVE_MUTEX_INIT PTHREAD_RECURSIVE_MUTEX_INITIALIZER
@@ -135,6 +141,13 @@ __gthrw(pthread_mutexattr_init)
 __gthrw(pthread_mutexattr_settype)
 __gthrw(pthread_mutexattr_destroy)
 
+#ifndef __cplusplus
+__gthrw(pthread_rwlock_rdlock)
+__gthrw(pthread_rwlock_tryrdlock)
+__gthrw(pthread_rwlock_wrlock)
+__gthrw(pthread_rwlock_trywrlock)
+__gthrw(pthread_rwlock_unlock)
+#endif
 
 #if defined(_LIBOBJC) || defined(_LIBOBJC_WEAK)
 /* Objective-C.  */
@@ -885,6 +898,53 @@ __gthread_cond_destroy (__gthread_cond_t* __cond)
   return __gthrw_(pthread_cond_destroy) (__cond);
 }
 
+#ifndef __cplusplus
+static inline int
+__gthread_rwlock_rdlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_rdlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_tryrdlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_tryrdlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_wrlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_wrlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_trywrlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_trywrlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_unlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_unlock) (__rwlock);
+  else
+    return 0;
+}
+#endif
+
 #endif /* _LIBOBJC */
 
 #endif /* ! GCC_GTHR_POSIX_H */
diff --git a/libgfortran/io/async.c b/libgfortran/io/async.c
index 8fa1f0d4ce0..91bf397105d 100644
--- a/libgfortran/io/async.c
+++ b/libgfortran/io/async.c
@@ -42,6 +42,10 @@ DEBUG_LINE (__thread const char *aio_prefix = MPREFIX);
 
 DEBUG_LINE (__gthread_mutex_t debug_queue_lock = __GTHREAD_MUTEX_INIT;)
 DEBUG_LINE (aio_lock_debug *aio_debug_head = NULL;)
+#ifdef __GTHREAD_RWLOCK_INIT
+DEBUG_LINE (aio_rwlock_debug *aio_rwlock_debug_head = NULL;)
+DEBUG_LINE (__gthread_rwlock_t debug_queue_rwlock = __GTHREAD_RWLOCK_INIT;)
+#endif
 
 /* Current unit for asynchronous I/O.  Needed for error reporting.  */
 
diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h
index ad226c8e856..f112f6870bb 100644
--- a/libgfortran/io/async.h
+++ b/libgfortran/io/async.h
@@ -210,6 +210,128 @@
     DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \
   } while (0)
 
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RWLOCK_DEBUG_ADD(rwlock) do {		\
+    aio_rwlock_debug *n;				\
+    n = xmalloc (sizeof (aio_rwlock_debug));	\
+    n->prev = TAIL_RWLOCK_DEBUG_QUEUE;			\
+    if (n->prev)				\
+      n->prev->next = n;			\
+    n->next = NULL;				\
+    n->line = __LINE__;				\
+    n->func = __FUNCTION__;			\
+    n->rw = rwlock;				\
+    if (!aio_rwlock_debug_head) {			\
+      aio_rwlock_debug_head = n;			\
+    }						\
+  } while (0)
+
+#define CHECK_RDLOCK(rwlock, status) do {					\
+    aio_rwlock_debug *curr;						\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    if (__gthread_rwlock_tryrdlock (rwlock)) {				\
+      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
+	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
+      } else								\
+	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
+    }									\
+    else {								\
+      __gthread_rwlock_unlock (rwlock);					\
+      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
+    }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+  }while (0)
+
+#define CHECK_WRLOCK(rwlock, status) do {					\
+    aio_rwlock_debug *curr;						\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    if (__gthread_rwlock_trywrlock (rwlock)) {				\
+      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
+	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
+      } else								\
+	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
+    }									\
+    else {								\
+      __gthread_rwlock_unlock (rwlock);					\
+      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
+    }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+  }while (0)
+
+#define TAIL_RWLOCK_DEBUG_QUEUE ({			\
+      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
+      while (curr && curr->next) {		\
+	curr = curr->next;			\
+      }						\
+      curr;					\
+    })
+
+#define IN_RWLOCK_DEBUG_QUEUE(rwlock) ({		\
+      __label__ end;				\
+      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
+      while (curr) {				\
+	if (curr->rw == rwlock) {			\
+	  goto end;				\
+	}					\
+	curr = curr->next;			\
+      }						\
+    end:;					\
+      curr;					\
+    })
+
+#define RDLOCK(rwlock) do {						\
+    char status[200];							\
+    CHECK_RDLOCK (rwlock, status);						\
+    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
+		 DEBUG_RED "RDLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
+    INTERN_RDLOCK (rwlock);							\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    RWLOCK_DEBUG_ADD (rwlock);						\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
+  } while (0)
+
+#define WRLOCK(rwlock) do {						\
+    char status[200];							\
+    CHECK_WRLOCK (rwlock, status);						\
+    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
+		 DEBUG_RED "WRLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
+    INTERN_WRLOCK (rwlock);							\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    RWLOCK_DEBUG_ADD (rwlock);						\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
+  } while (0)
+
+#define RWUNLOCK(rwlock) do {						\
+    aio_rwlock_debug *curr;						\
+    DEBUG_PRINTF ("%s%-75s %20s():%-5d %18p\n", aio_prefix, DEBUG_GREEN "RWUNLOCK: " DEBUG_NORM #rwlock, \
+		 __FUNCTION__, __LINE__, (void *) rwlock);		\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    curr = IN_RWLOCK_DEBUG_QUEUE (rwlock);					\
+    if (curr)								\
+      {									\
+	if (curr->prev)							\
+	  curr->prev->next = curr->next;				\
+	if (curr->next) {						\
+	  curr->next->prev = curr->prev;				\
+	  if (curr == aio_rwlock_debug_head)					\
+	    aio_rwlock_debug_head = curr->next;				\
+	} else {							\
+	  if (curr == aio_rwlock_debug_head)					\
+	    aio_rwlock_debug_head = NULL;					\
+	}								\
+	free (curr);							\
+      }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    INTERN_RWUNLOCK (rwlock);						\
+  } while (0)
+
+#define RD_TO_WRLOCK(rwlock)	\
+  RWUNLOCK (rwlock);	\
+  WRLOCK (rwlock);
+#endif
+
 #define DEBUG_LINE(...) __VA_ARGS__
 
 #else
@@ -221,12 +343,31 @@
 #define LOCK(mutex) INTERN_LOCK (mutex)
 #define UNLOCK(mutex) INTERN_UNLOCK (mutex)
 #define TRYLOCK(mutex) (__gthread_mutex_trylock (mutex))
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RDLOCK(rwlock) INTERN_RDLOCK (rwlock)
+#define WRLOCK(rwlock) INTERN_WRLOCK (rwlock)
+#define RWUNLOCK(rwlock) INTERN_RWUNLOCK (rwlock)
+#define RD_TO_WRLOCK(rwlock)	\
+  RWUNLOCK (rwlock);	\
+  WRLOCK (rwlock);
+#endif
+#endif
+
+#ifndef __GTHREAD_RWLOCK_INIT
+#define RDLOCK(rwlock) LOCK (rwlock)
+#define WRLOCK(rwlock) LOCK (rwlock)
+#define RWUNLOCK(rwlock) UNLOCK (rwlock)
+#define RD_TO_WRLOCK(rwlock) do {} while (0)
 #endif
 
 #define INTERN_LOCK(mutex) T_ERROR (__gthread_mutex_lock, mutex);
 
 #define INTERN_UNLOCK(mutex) T_ERROR (__gthread_mutex_unlock, mutex);
 
+#define INTERN_RDLOCK(rwlock) T_ERROR (__gthread_rwlock_rdlock, rwlock)
+#define INTERN_WRLOCK(rwlock) T_ERROR (__gthread_rwlock_wrlock, rwlock)
+#define INTERN_RWUNLOCK(rwlock) T_ERROR (__gthread_rwlock_unlock, rwlock)
+
 #if ASYNC_IO
 
 /* au->lock has to be held when calling this macro.  */
@@ -288,8 +429,18 @@ DEBUG_LINE (typedef struct aio_lock_debug{
   struct aio_lock_debug *prev;
 } aio_lock_debug;)
 
+DEBUG_LINE (typedef struct aio_rwlock_debug{
+  __gthread_rwlock_t *rw;
+  int line;
+  const char *func;
+  struct aio_rwlock_debug *next;
+  struct aio_rwlock_debug *prev;
+} aio_rwlock_debug;)
+
 DEBUG_LINE (extern aio_lock_debug *aio_debug_head;)
 DEBUG_LINE (extern __gthread_mutex_t debug_queue_lock;)
+DEBUG_LINE (extern aio_rwlock_debug *aio_rwlock_debug_head;)
+DEBUG_LINE (extern __gthread_rwlock_t debug_queue_rwlock;)
 
 /* Thread - local storage of the current unit we are looking at. Needed for
    error reporting.  */
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index ecdf1dd3f05..15daa0995b1 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -690,7 +690,7 @@ typedef struct gfc_unit
      from the UNIT_ROOT tree, but doesn't free it and the
      last of the waiting threads will do that.
      This must be either atomically increased/decreased, or
-     always guarded by UNIT_LOCK.  */
+     always guarded by UNIT_RWLOCK.  */
   int waiting;
   /* Flag set by close_unit if the unit as been closed.
      Must be manipulated under unit's lock.  */
@@ -769,8 +769,13 @@ internal_proto(default_recl);
 extern gfc_unit *unit_root;
 internal_proto(unit_root);
 
-extern __gthread_mutex_t unit_lock;
-internal_proto(unit_lock);
+#ifdef __GTHREAD_RWLOCK_INIT
+extern __gthread_rwlock_t unit_rwlock;
+internal_proto(unit_rwlock);
+#else
+extern __gthread_mutex_t unit_rwlock;
+internal_proto(unit_rwlock);
+#endif
 
 extern int close_unit (gfc_unit *);
 internal_proto(close_unit);
@@ -1015,9 +1020,9 @@ dec_waiting_unlocked (gfc_unit *u)
 #ifdef HAVE_ATOMIC_FETCH_ADD
   (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
 #else
-  __gthread_mutex_lock (&unit_lock);
+  WRLOCK (&unit_rwlock);
   u->waiting--;
-  __gthread_mutex_unlock (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 #endif
 }
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 500db90c828..00d516adcb0 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -4538,9 +4538,9 @@ st_read_done_worker (st_parameter_dt *dtp, bool unlock)
    if (free_newunit)
      {
        /* Avoid inverse lock issues by placing after unlock_unit.  */
-       LOCK (&unit_lock);
+       WRLOCK (&unit_rwlock);
        newunit_free (dtp->common.unit);
-       UNLOCK (&unit_lock);
+       RWUNLOCK (&unit_rwlock);
      }
 }
 
@@ -4634,9 +4634,9 @@ st_write_done_worker (st_parameter_dt *dtp, bool unlock)
    if (free_newunit)
      {
        /* Avoid inverse lock issues by placing after unlock_unit.  */
-       LOCK (&unit_lock);
+       WRLOCK (&unit_rwlock);
        newunit_free (dtp->common.unit);
-       UNLOCK (&unit_lock);
+       RWUNLOCK (&unit_rwlock);
      }
 }
 
diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c
index 36d025949c2..0c8c35e464e 100644
--- a/libgfortran/io/unit.c
+++ b/libgfortran/io/unit.c
@@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 
 
 /* IO locking rules:
-   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
+   UNIT_RWLOCK is a master rw lock, protecting UNIT_ROOT tree and UNIT_CACHE.
+   Using an rwlock improves efficiency by allowing us to separate readers
+   and writers of both UNIT_ROOT and UNIT_CACHE.
    Concurrent use of different units should be supported, so
    each unit has its own lock, LOCK.
    Open should be atomic with its reopening of units and list_read.c
    in several places needs find_unit another unit while holding stdin
-   unit's lock, so it must be possible to acquire UNIT_LOCK while holding
+   unit's lock, so it must be possible to acquire UNIT_RWLOCK while holding
    some unit's lock.  Therefore to avoid deadlocks, it is forbidden
-   to acquire unit's private locks while holding UNIT_LOCK, except
+   to acquire unit's private locks while holding UNIT_RWLOCK, except
    for freshly created units (where no other thread can get at their
    address yet) or when using just trylock rather than lock operation.
    In addition to unit's private lock each unit has a WAITERS counter
    and CLOSED flag.  WAITERS counter must be either only
    atomically incremented/decremented in all places (if atomic builtins
-   are supported), or protected by UNIT_LOCK in all places (otherwise).
+   are supported), or protected by UNIT_RWLOCK in all places (otherwise).
    CLOSED flag must be always protected by unit's LOCK.
-   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_LOCK held,
+   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_RWLOCK held,
    WAITERS must be incremented to avoid concurrent close from freeing
-   the unit between unlocking UNIT_LOCK and acquiring unit's LOCK.
-   Unit freeing is always done under UNIT_LOCK.  If close_unit sees any
+   the unit between unlocking UNIT_RWLOCK and acquiring unit's LOCK.
+   Unit freeing is always done under UNIT_RWLOCK.  If close_unit sees any
    WAITERS, it doesn't free the unit but instead sets the CLOSED flag
    and the thread that decrements WAITERS to zero while CLOSED flag is
-   set is responsible for freeing it (while holding UNIT_LOCK).
+   set is responsible for freeing it (while holding UNIT_RWLOCK).
    flush_all_units operation is iterating over the unit tree with
-   increasing UNIT_NUMBER while holding UNIT_LOCK and attempting to
+   increasing UNIT_NUMBER while holding UNIT_RWLOCK and attempting to
    flush each unit (and therefore needs the unit's LOCK held as well).
    To avoid deadlocks, it just trylocks the LOCK and if unsuccessful,
-   remembers the current unit's UNIT_NUMBER, unlocks UNIT_LOCK, acquires
-   unit's LOCK and after flushing reacquires UNIT_LOCK and restarts with
+   remembers the current unit's UNIT_NUMBER, unlocks UNIT_RWLOCK, acquires
+   unit's LOCK and after flushing reacquires UNIT_RWLOCK and restarts with
    the smallest UNIT_NUMBER above the last one flushed.
 
    If find_unit/find_or_create_unit/find_file/get_unit routines return
@@ -101,10 +103,14 @@ gfc_offset max_offset;
 gfc_offset default_recl;
 
 gfc_unit *unit_root;
+#ifdef __GTHREAD_RWLOCK_INIT
+__gthread_rwlock_t unit_rwlock = __GTHREAD_RWLOCK_INIT;
+#else
 #ifdef __GTHREAD_MUTEX_INIT
-__gthread_mutex_t unit_lock = __GTHREAD_MUTEX_INIT;
+__gthread_mutex_t unit_rwlock = __GTHREAD_MUTEX_INIT;
 #else
-__gthread_mutex_t unit_lock;
+__gthread_mutex_t unit_rwlock;
+#endif
 #endif
 
 /* We use these filenames for error reporting.  */
@@ -317,6 +323,28 @@ delete_unit (gfc_unit *old)
   unit_root = delete_treap (old, unit_root);
 }
 
+/* get_gfc_unit_from_root()-- Given an integer, return a pointer
+   to the unit structure. Returns NULL if the unit does not exist,
+   otherwise returns a locked unit. */
+
+static inline gfc_unit *
+get_gfc_unit_from_unit_root (int n)
+{
+  gfc_unit *p;
+  int c = 0;
+  p = unit_root;
+  while (p != NULL)
+    {
+      c = compare (n, p->unit_number);
+      if (c < 0)
+        p = p->left;
+      if (c > 0)
+        p = p->right;
+      if (c == 0)
+        break;
+    }
+  return p;
+}
 
 /* get_gfc_unit()-- Given an integer, return a pointer to the unit
    structure.  Returns NULL if the unit does not exist,
@@ -329,7 +357,7 @@ get_gfc_unit (int n, int do_create)
   int c, created = 0;
 
   NOTE ("Unit n=%d, do_create = %d", n, do_create);
-  LOCK (&unit_lock);
+  RDLOCK (&unit_rwlock);
 
 retry:
   for (c = 0; c < CACHE_SIZE; c++)
@@ -339,18 +367,25 @@ retry:
 	goto found;
       }
 
-  p = unit_root;
-  while (p != NULL)
-    {
-      c = compare (n, p->unit_number);
-      if (c < 0)
-	p = p->left;
-      if (c > 0)
-	p = p->right;
-      if (c == 0)
-	break;
-    }
-
+  p = get_gfc_unit_from_unit_root (n);
+
+  /* We did not find a unit in the cache nor in the unit list,
+    create a new (locked) unit and insert into the unit list and
+    cache. Manipulating either or both the unit list and the unit
+    cache requires to hold a write-lock [for obvious reasons]:
+    By separating the read/write lock, we will greatly reduce
+    the contention on the read part, while the write part is
+    unlikely once the unit hits the cache. */
+  RD_TO_WRLOCK (&unit_rwlock);
+
+  /* In the case of high concurrency, when multiple threads want
+    to find or create the same unit, the unit number may not
+    exist in cache nor in the unit list during read phase, then
+    threads will acquire the write-lock to insert the same unit
+    number to unit list. To avoid duplicate insert, we need to
+    find unit list once again to ensure that the unit number
+    not exist. */
+  p = get_gfc_unit_from_unit_root (n);
   if (p == NULL && do_create)
     {
       p = insert_unit (n);
@@ -368,8 +403,8 @@ retry:
   if (created)
     {
       /* Newly created units have their lock held already
-	 from insert_unit.  Just unlock UNIT_LOCK and return.  */
-      UNLOCK (&unit_lock);
+	 from insert_unit. Just unlock UNIT_RWLOCK and return. */
+      RWUNLOCK (&unit_rwlock);
       return p;
     }
 
@@ -380,7 +415,7 @@ found:
       if (! TRYLOCK (&p->lock))
 	{
 	  /* assert (p->closed == 0); */
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
 	  return p;
 	}
 
@@ -388,14 +423,14 @@ found:
     }
 
 
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 
   if (p != NULL && (p->child_dtio == 0))
     {
       LOCK (&p->lock);
       if (p->closed)
 	{
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&p->lock);
 	  if (predec_waiting_locked (p) == 0)
 	    destroy_unit_mutex (p);
@@ -594,8 +629,8 @@ init_units (void)
 #endif
 #endif
 
-#ifndef __GTHREAD_MUTEX_INIT
-  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_lock);
+#if (!defined(__GTHREAD_RWLOCK_INIT) && !defined(__GTHREAD_MUTEX_INIT))
+  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_rwlock);
 #endif
 
   if (sizeof (max_offset) == 8)
@@ -732,7 +767,7 @@ close_unit_1 (gfc_unit *u, int locked)
 
   u->closed = 1;
   if (!locked)
-    LOCK (&unit_lock);
+    WRLOCK (&unit_rwlock);
 
   for (i = 0; i < CACHE_SIZE; i++)
     if (unit_cache[i] == u)
@@ -759,7 +794,7 @@ close_unit_1 (gfc_unit *u, int locked)
     destroy_unit_mutex (u);
 
   if (!locked)
-    UNLOCK (&unit_lock);
+    RWUNLOCK (&unit_rwlock);
 
   return rc;
 }
@@ -796,10 +831,10 @@ close_unit (gfc_unit *u)
 void
 close_units (void)
 {
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   while (unit_root != NULL)
     close_unit_1 (unit_root, 1);
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 
   free (newunits);
 
@@ -906,7 +941,7 @@ finish_last_advance_record (gfc_unit *u)
 int
 newunit_alloc (void)
 {
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   if (!newunits)
     {
       newunits = xcalloc (16, 1);
@@ -920,7 +955,7 @@ newunit_alloc (void)
         {
           newunits[ii] = true;
           newunit_lwi = ii + 1;
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
           return -ii + NEWUNIT_START;
         }
     }
@@ -933,12 +968,12 @@ newunit_alloc (void)
   memset (newunits + old_size, 0, old_size);
   newunits[old_size] = true;
   newunit_lwi = old_size + 1;
-    UNLOCK (&unit_lock);
+    RWUNLOCK (&unit_rwlock);
   return -old_size + NEWUNIT_START;
 }
 
 
-/* Free a previously allocated newunit= unit number.  unit_lock must
+/* Free a previously allocated newunit= unit number.  unit_rwlock must
    be held when calling.  */
 
 void
diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index d466df979df..dcae051744d 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -1773,7 +1773,7 @@ find_file (const char *file, gfc_charlen_type file_len)
   id = id_from_path (path);
 #endif
 
-  LOCK (&unit_lock);
+  RDLOCK (&unit_rwlock);
 retry:
   u = find_file0 (unit_root, FIND_FILE0_ARGS);
   if (u != NULL)
@@ -1782,19 +1782,19 @@ retry:
       if (! __gthread_mutex_trylock (&u->lock))
 	{
 	  /* assert (u->closed == 0); */
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
 	  goto done;
 	}
 
       inc_waiting_locked (u);
     }
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
   if (u != NULL)
     {
       LOCK (&u->lock);
       if (u->closed)
 	{
-	  LOCK (&unit_lock);
+	  RDLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  if (predec_waiting_locked (u) == 0)
 	    free (u);
@@ -1838,13 +1838,13 @@ flush_all_units (void)
   gfc_unit *u;
   int min_unit = 0;
 
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   do
     {
       u = flush_all_units_1 (unit_root, min_unit);
       if (u != NULL)
 	inc_waiting_locked (u);
-      UNLOCK (&unit_lock);
+      RWUNLOCK (&unit_rwlock);
       if (u == NULL)
 	return;
 
@@ -1855,13 +1855,13 @@ flush_all_units (void)
       if (u->closed == 0)
 	{
 	  sflush (u->s);
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  (void) predec_waiting_locked (u);
 	}
       else
 	{
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  if (predec_waiting_locked (u) == 0)
 	    free (u);
diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_1.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
new file mode 100644
index 00000000000..f90ecbeb00f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
@@ -0,0 +1,33 @@
+! { dg-do run }
+! Multiple threads call open/write/read/close in concurrency with different unit number,
+! threads can acquire read lock concurrently, to find unit from cache or unit list very frequently,
+! if not found, threads will acquire the write lock exclusively to insert unit to cache and unit list.
+! This test case is used to stress both the read and write lock when access unit cache and list.
+program main
+  use omp_lib
+  implicit none
+  integer:: unit_number, v1, v2, i
+  character(11) :: file_name
+  character(3) :: async = "no"
+  !$omp parallel private (unit_number, v1, v2, file_name, async, i)
+    do i = 0, 100
+      unit_number = 10 + omp_get_thread_num ()
+      write (file_name, "(I3, A)") unit_number, "_tst.dat"
+      file_name = adjustl(file_name)
+      open (unit_number, file=file_name, asynchronous="yes")
+      ! call inquire with file parameter to test find_file in unix.c
+      inquire (file=file_name, asynchronous=async)
+      if (async /= "YES") stop 1
+      write (unit_number, *, asynchronous="yes") unit_number
+      write (unit_number, *, asynchronous="yes") unit_number + 1
+      close(unit_number)
+
+      open (unit_number, file = file_name, asynchronous="yes")
+      read (unit_number, *, asynchronous="yes") v1
+      read (unit_number, *, asynchronous="yes") v2
+      wait (unit_number)
+      if ((v1 /= unit_number) .or. (v2 /= unit_number + 1)) stop 2
+      close(unit_number, status="delete")
+    end do
+  !$omp end parallel
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_2.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_2.f90
new file mode 100644
index 00000000000..08c80d14cfb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_2.f90
@@ -0,0 +1,22 @@
+! { dg-do run }
+! Insert a unit into cache at the beginning, then start multiple
+! threads to access the same unit concurrency, unit will be found in unit cache during the read lock phase.
+! This test case is used to test the read lock when access unit cache and list.
+program main
+  use omp_lib
+  implicit none
+  integer:: thread_id, total_threads, i, j
+  total_threads = omp_get_max_threads ()
+  open (10, file='tst.dat', asynchronous="yes")
+  !$omp parallel private (thread_id, i, j)
+    do i = 1, 100
+      thread_id = omp_get_thread_num ()
+      do j = 1, 100
+        write (10, *, asynchronous="yes") thread_id, i
+      end do
+    end do
+  !$omp end parallel
+  ! call inquire with file parameter to test find_file in unix.c
+  call flush ()
+  close (10, status="delete")
+end program
diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_3.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_3.f90
new file mode 100644
index 00000000000..1906fcd7a0b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_3.f90
@@ -0,0 +1,18 @@
+! { dg-do run }
+! Find or create the same unit number in concurrency,
+! at beginning, threads cannot find the unit in cache or unit list,
+! then threads will acquire the write lock to insert unit.
+! This test case is used to ensure that no duplicate unit number will be
+! inserted into cache nor unit list when same unit was accessed in concurrency.
+program main
+  use omp_lib
+  implicit none
+  integer:: i
+  !$omp parallel private (i)
+    do i = 1, 100
+      open (10, file='tst.dat', asynchronous="yes")
+      ! Delete the unit number from cache and unit list to stress write lock.
+      close (10, status="delete")
+    end do
+  !$omp end parallel
+end program
-- 
2.39.3


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-09 15:23               ` Jakub Jelinek
@ 2023-12-10  3:25                 ` Zhu, Lipeng
  2023-12-11 17:45                   ` H.J. Lu
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-12-10  3:25 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: fortran, gcc-patches, Lu, Hongjiu, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang

On 2023/12/9 23:23, Jakub Jelinek wrote:
> On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
> > This patch try to introduce the rwlock and split the read/write to
> > unit_root tree and unit_cache with rwlock instead of the mutex to
> > increase CPU efficiency. In the get_gfc_unit function, the percentage
> > to step into the insert_unit function is around 30%, in most
> > instances, we can get the unit in the phase of reading the unit_cache
> > or unit_root tree. So split the read/write phase by rwlock would be an
> > approach to make it more parallel.
> >
> > BTW, the IPC metrics can gain around 9x in our test server with 220
> > cores. The benchmark we used is https://github.com/rwesson/NEAT
> >
> > libgcc/ChangeLog:
> >
> > 	* gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
> > 	(__gthrw): New function.
> > 	(__gthread_rwlock_rdlock): New function.
> > 	(__gthread_rwlock_tryrdlock): New function.
> > 	(__gthread_rwlock_wrlock): New function.
> > 	(__gthread_rwlock_trywrlock): New function.
> > 	(__gthread_rwlock_unlock): New function.
> >
> > libgfortran/ChangeLog:
> >
> > 	* io/async.c (DEBUG_LINE): New macro.
> > 	* io/async.h (RWLOCK_DEBUG_ADD): New macro.
> > 	(CHECK_RDLOCK): New macro.
> > 	(CHECK_WRLOCK): New macro.
> > 	(TAIL_RWLOCK_DEBUG_QUEUE): New macro.
> > 	(IN_RWLOCK_DEBUG_QUEUE): New macro.
> > 	(RDLOCK): New macro.
> > 	(WRLOCK): New macro.
> > 	(RWUNLOCK): New macro.
> > 	(RD_TO_WRLOCK): New macro.
> > 	(INTERN_RDLOCK): New macro.
> > 	(INTERN_WRLOCK): New macro.
> > 	(INTERN_RWUNLOCK): New macro.
> > 	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
> > 	a comment.
> > 	(unit_lock): Remove including associated internal_proto.
> > 	(unit_rwlock): New declarations including associated internal_proto.
> > 	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
> > 	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
> > 	unit_lock.
> > 	* io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK
> on
> > 	unit_rwlock instead of LOCK and UNLOCK on unit_lock.
> > 	(st_write_done_worker): Likewise.
> > 	* io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
> > 	comment. Use unit_rwlock variable instead of unit_lock variable.
> > 	(get_gfc_unit_from_unit_root): New function.
> > 	(get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
> > 	instead of LOCK and UNLOCK on unit_lock.
> > 	(close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> of
> > 	LOCK and UNLOCK on unit_lock.
> > 	(close_units): Likewise.
> > 	(newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
> > 	unit_lock.
> > 	* io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
> > 	instead of LOCK and UNLOCK on unit_lock.
> > 	(flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> > 	of LOCK and UNLOCK on unit_lock.
> 
> Ok for trunk, thanks.
> 
> 	Jakub

Thanks! Looking forward to landing to trunk.

Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-10  3:25                 ` Zhu, Lipeng
@ 2023-12-11 17:45                   ` H.J. Lu
  2023-12-12  2:05                     ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: H.J. Lu @ 2023-12-11 17:45 UTC (permalink / raw)
  To: Zhu, Lipeng
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang

On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> wrote:
>
> On 2023/12/9 23:23, Jakub Jelinek wrote:
> > On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
> > > This patch try to introduce the rwlock and split the read/write to
> > > unit_root tree and unit_cache with rwlock instead of the mutex to
> > > increase CPU efficiency. In the get_gfc_unit function, the percentage
> > > to step into the insert_unit function is around 30%, in most
> > > instances, we can get the unit in the phase of reading the unit_cache
> > > or unit_root tree. So split the read/write phase by rwlock would be an
> > > approach to make it more parallel.
> > >
> > > BTW, the IPC metrics can gain around 9x in our test server with 220
> > > cores. The benchmark we used is https://github.com/rwesson/NEAT
> > >
> > > libgcc/ChangeLog:
> > >
> > >     * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
> > >     (__gthrw): New function.
> > >     (__gthread_rwlock_rdlock): New function.
> > >     (__gthread_rwlock_tryrdlock): New function.
> > >     (__gthread_rwlock_wrlock): New function.
> > >     (__gthread_rwlock_trywrlock): New function.
> > >     (__gthread_rwlock_unlock): New function.
> > >
> > > libgfortran/ChangeLog:
> > >
> > >     * io/async.c (DEBUG_LINE): New macro.
> > >     * io/async.h (RWLOCK_DEBUG_ADD): New macro.
> > >     (CHECK_RDLOCK): New macro.
> > >     (CHECK_WRLOCK): New macro.
> > >     (TAIL_RWLOCK_DEBUG_QUEUE): New macro.
> > >     (IN_RWLOCK_DEBUG_QUEUE): New macro.
> > >     (RDLOCK): New macro.
> > >     (WRLOCK): New macro.
> > >     (RWUNLOCK): New macro.
> > >     (RD_TO_WRLOCK): New macro.
> > >     (INTERN_RDLOCK): New macro.
> > >     (INTERN_WRLOCK): New macro.
> > >     (INTERN_RWUNLOCK): New macro.
> > >     * io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
> > >     a comment.
> > >     (unit_lock): Remove including associated internal_proto.
> > >     (unit_rwlock): New declarations including associated internal_proto.
> > >     (dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
> > >     instead of __gthread_mutex_lock and __gthread_mutex_unlock on
> > >     unit_lock.
> > >     * io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK
> > on
> > >     unit_rwlock instead of LOCK and UNLOCK on unit_lock.
> > >     (st_write_done_worker): Likewise.
> > >     * io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
> > >     comment. Use unit_rwlock variable instead of unit_lock variable.
> > >     (get_gfc_unit_from_unit_root): New function.
> > >     (get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
> > >     instead of LOCK and UNLOCK on unit_lock.
> > >     (close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> > of
> > >     LOCK and UNLOCK on unit_lock.
> > >     (close_units): Likewise.
> > >     (newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
> > >     unit_lock.
> > >     * io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
> > >     instead of LOCK and UNLOCK on unit_lock.
> > >     (flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> > >     of LOCK and UNLOCK on unit_lock.
> >
> > Ok for trunk, thanks.
> >
> >       Jakub
>
> Thanks! Looking forward to landing to trunk.
>
> Lipeng Zhu

Pushed for you.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-11 17:45                   ` H.J. Lu
@ 2023-12-12  2:05                     ` Zhu, Lipeng
  2023-12-13 20:52                       ` Thomas Schwinge
  0 siblings, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-12-12  2:05 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang

On 2023/12/12 1:45, H.J. Lu wrote:
> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> wrote:
> >
> > On 2023/12/9 23:23, Jakub Jelinek wrote:
> > > On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
> > > > This patch try to introduce the rwlock and split the read/write to
> > > > unit_root tree and unit_cache with rwlock instead of the mutex to
> > > > increase CPU efficiency. In the get_gfc_unit function, the
> > > > percentage to step into the insert_unit function is around 30%, in
> > > > most instances, we can get the unit in the phase of reading the
> > > > unit_cache or unit_root tree. So split the read/write phase by
> > > > rwlock would be an approach to make it more parallel.
> > > >
> > > > BTW, the IPC metrics can gain around 9x in our test server with
> > > > 220 cores. The benchmark we used is
> > > > https://github.com/rwesson/NEAT
> > > >
> > > > libgcc/ChangeLog:
> > > >
> > > >     * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
> > > >     (__gthrw): New function.
> > > >     (__gthread_rwlock_rdlock): New function.
> > > >     (__gthread_rwlock_tryrdlock): New function.
> > > >     (__gthread_rwlock_wrlock): New function.
> > > >     (__gthread_rwlock_trywrlock): New function.
> > > >     (__gthread_rwlock_unlock): New function.
> > > >
> > > > libgfortran/ChangeLog:
> > > >
> > > >     * io/async.c (DEBUG_LINE): New macro.
> > > >     * io/async.h (RWLOCK_DEBUG_ADD): New macro.
> > > >     (CHECK_RDLOCK): New macro.
> > > >     (CHECK_WRLOCK): New macro.
> > > >     (TAIL_RWLOCK_DEBUG_QUEUE): New macro.
> > > >     (IN_RWLOCK_DEBUG_QUEUE): New macro.
> > > >     (RDLOCK): New macro.
> > > >     (WRLOCK): New macro.
> > > >     (RWUNLOCK): New macro.
> > > >     (RD_TO_WRLOCK): New macro.
> > > >     (INTERN_RDLOCK): New macro.
> > > >     (INTERN_WRLOCK): New macro.
> > > >     (INTERN_RWUNLOCK): New macro.
> > > >     * io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
> > > >     a comment.
> > > >     (unit_lock): Remove including associated internal_proto.
> > > >     (unit_rwlock): New declarations including associated internal_proto.
> > > >     (dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on
> unit_rwlock
> > > >     instead of __gthread_mutex_lock and __gthread_mutex_unlock on
> > > >     unit_lock.
> > > >     * io/transfer.c (st_read_done_worker): Use WRLOCK and
> RWUNLOCK
> > > on
> > > >     unit_rwlock instead of LOCK and UNLOCK on unit_lock.
> > > >     (st_write_done_worker): Likewise.
> > > >     * io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
> > > >     comment. Use unit_rwlock variable instead of unit_lock variable.
> > > >     (get_gfc_unit_from_unit_root): New function.
> > > >     (get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on
> unit_rwlock
> > > >     instead of LOCK and UNLOCK on unit_lock.
> > > >     (close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> > > of
> > > >     LOCK and UNLOCK on unit_lock.
> > > >     (close_units): Likewise.
> > > >     (newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK
> on
> > > >     unit_lock.
> > > >     * io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
> > > >     instead of LOCK and UNLOCK on unit_lock.
> > > >     (flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock
> instead
> > > >     of LOCK and UNLOCK on unit_lock.
> > >
> > > Ok for trunk, thanks.
> > >
> > >       Jakub
> >
> > Thanks! Looking forward to landing to trunk.
> >
> > Lipeng Zhu
> 
> Pushed for you.
> 
> Thanks.
> 
> --
> H.J.

Thanks for everyone's patience and help, really appreciate that!

Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-12  2:05                     ` Zhu, Lipeng
@ 2023-12-13 20:52                       ` Thomas Schwinge
  2023-12-14  2:28                         ` Zhu, Lipeng
  2023-12-21 11:42                         ` Thomas Schwinge
  0 siblings, 2 replies; 38+ messages in thread
From: Thomas Schwinge @ 2023-12-13 20:52 UTC (permalink / raw)
  To: Zhu, Lipeng, H.J. Lu
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang

Hi Lipeng!

On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
> On 2023/12/12 1:45, H.J. Lu wrote:
>> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> wrote:
>> > On 2023/12/9 23:23, Jakub Jelinek wrote:
>> > > On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
>> > > > This patch try to introduce the rwlock and split the read/write to
>> > > > unit_root tree and unit_cache with rwlock instead of the mutex to
>> > > > increase CPU efficiency. In the get_gfc_unit function, the
>> > > > percentage to step into the insert_unit function is around 30%, in
>> > > > most instances, we can get the unit in the phase of reading the
>> > > > unit_cache or unit_root tree. So split the read/write phase by
>> > > > rwlock would be an approach to make it more parallel.
>> > > >
>> > > > BTW, the IPC metrics can gain around 9x in our test server with
>> > > > 220 cores. The benchmark we used is
>> > > > https://github.com/rwesson/NEAT

>> > > Ok for trunk, thanks.

>> > Thanks! Looking forward to landing to trunk.

>> Pushed for you.

> Thanks for everyone's patience and help, really appreciate that!

Congratulations on your first contribution to GCC (as far as I can tell)!
:-)


I've just filed <https://gcc.gnu.org/PR113005>
"'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution test timeouts".
Would you be able to look into that?


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-13 20:52                       ` Thomas Schwinge
@ 2023-12-14  2:28                         ` Zhu, Lipeng
  2023-12-14 12:29                           ` Thomas Schwinge
  2023-12-21 11:42                         ` Thomas Schwinge
  1 sibling, 1 reply; 38+ messages in thread
From: Zhu, Lipeng @ 2023-12-14  2:28 UTC (permalink / raw)
  To: Thomas Schwinge, H.J. Lu
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang


On 2023/12/14 4:52, Thomas Schwinge wrote:
> Hi Lipeng!
> 
> On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
> > On 2023/12/12 1:45, H.J. Lu wrote:
> >> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com>
> wrote:
> >> > On 2023/12/9 23:23, Jakub Jelinek wrote:
> >> > > On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
> >> > > > This patch try to introduce the rwlock and split the read/write
> >> > > > to unit_root tree and unit_cache with rwlock instead of the
> >> > > > mutex to increase CPU efficiency. In the get_gfc_unit function,
> >> > > > the percentage to step into the insert_unit function is around
> >> > > > 30%, in most instances, we can get the unit in the phase of
> >> > > > reading the unit_cache or unit_root tree. So split the
> >> > > > read/write phase by rwlock would be an approach to make it more
> parallel.
> >> > > >
> >> > > > BTW, the IPC metrics can gain around 9x in our test server with
> >> > > > 220 cores. The benchmark we used is
> >> > > > https://github.com/rwesson/NEAT
> 
> >> > > Ok for trunk, thanks.
> 
> >> > Thanks! Looking forward to landing to trunk.
> 
> >> Pushed for you.
> 
> > Thanks for everyone's patience and help, really appreciate that!
> 
> Congratulations on your first contribution to GCC (as far as I can tell)!
> :-)
> 
> 
> I've just filed <https://gcc.gnu.org/PR113005>
> "'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution
> test timeouts".
> Would you be able to look into that?
> 
> 
> Grüße
>  Thomas
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
> 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer:
> Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München;
> Registergericht München, HRB 106955

Hi Thomas,

Sure, I will look into that. 

BTW, I didn’t have the PowerPC in hands, do you mind granting the access of your
test environment to me to help reproduce the issue?

Lipeng Zhu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-14  2:28                         ` Zhu, Lipeng
@ 2023-12-14 12:29                           ` Thomas Schwinge
  2023-12-14 12:39                             ` Jakub Jelinek
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Schwinge @ 2023-12-14 12:29 UTC (permalink / raw)
  To: Zhu, Lipeng, H.J. Lu
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang

Hi Lipeng!

On 2023-12-14T02:28:22+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
> On 2023/12/14 4:52, Thomas Schwinge wrote:
>> On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
>> > On 2023/12/12 1:45, H.J. Lu wrote:
>> >> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com>
>> wrote:
>> >> > On 2023/12/9 23:23, Jakub Jelinek wrote:
>> >> > > On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
>> >> > > > This patch try to introduce the rwlock and split the read/write
>> >> > > > to unit_root tree and unit_cache with rwlock instead of the
>> >> > > > mutex to increase CPU efficiency. In the get_gfc_unit function,
>> >> > > > the percentage to step into the insert_unit function is around
>> >> > > > 30%, in most instances, we can get the unit in the phase of
>> >> > > > reading the unit_cache or unit_root tree. So split the
>> >> > > > read/write phase by rwlock would be an approach to make it more
>> parallel.
>> >> > > >
>> >> > > > BTW, the IPC metrics can gain around 9x in our test server with
>> >> > > > 220 cores. The benchmark we used is
>> >> > > > https://github.com/rwesson/NEAT

>> I've just filed <https://gcc.gnu.org/PR113005>
>> "'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution
>> test timeouts".
>> Would you be able to look into that?

> Sure, I will look into that.
>
> BTW, I didn’t have the PowerPC in hands, do you mind granting the access of your
> test environment to me to help reproduce the issue?

That's unfortunately not possible: it's behind company VPN, restricted
access.  :-/ I'll later try to have at least a quick look where it's
hanging, or what it's doing.


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-14 12:29                           ` Thomas Schwinge
@ 2023-12-14 12:39                             ` Jakub Jelinek
  2023-12-15  5:43                               ` Zhu, Lipeng
  0 siblings, 1 reply; 38+ messages in thread
From: Jakub Jelinek @ 2023-12-14 12:39 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: Zhu, Lipeng, H.J. Lu, fortran, gcc-patches, Deng, Pan,
	rep.dot.nop, Li, Tianyou, tkoenig, Guo, Wangyang

On Thu, Dec 14, 2023 at 01:29:01PM +0100, Thomas Schwinge wrote:
> > Sure, I will look into that.
> >
> > BTW, I didn’t have the PowerPC in hands, do you mind granting the access of your
> > test environment to me to help reproduce the issue?
> 
> That's unfortunately not possible: it's behind company VPN, restricted
> access.  :-/ I'll later try to have at least a quick look where it's
> hanging, or what it's doing.

There is always https://gcc.gnu.org/wiki/CompileFarm
There are e.g. 192, 160, 128, 80, 64 thread power machines.
Whether any of them can actually reproduce it, I haven't tried.
But shouldn't be that time consuming to reproduce, for this kind of thing
one can just build
.../configure --enable-languages=c,c++,fortran --disable-bootstrap --disable-libsanitizer
make -jN
compiler and just
cd power*/libgomp
make check RUNTESTFLAGS=fortran.exp=rwlock*.f90
repeatedly to see if it gets stuck.

	Jakub


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-09 15:39             ` [PATCH v7] " Lipeng Zhu
  2023-12-09 15:23               ` Jakub Jelinek
@ 2023-12-14 15:50               ` Richard Earnshaw (lists)
  2023-12-15 11:31                 ` Lipeng Zhu
  1 sibling, 1 reply; 38+ messages in thread
From: Richard Earnshaw (lists) @ 2023-12-14 15:50 UTC (permalink / raw)
  To: Lipeng Zhu, jakub
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo

On 09/12/2023 15:39, Lipeng Zhu wrote:
> This patch try to introduce the rwlock and split the read/write to
> unit_root tree and unit_cache with rwlock instead of the mutex to
> increase CPU efficiency. In the get_gfc_unit function, the percentage
> to step into the insert_unit function is around 30%, in most instances,
> we can get the unit in the phase of reading the unit_cache or unit_root
> tree. So split the read/write phase by rwlock would be an approach to
> make it more parallel.
> 
> BTW, the IPC metrics can gain around 9x in our test
> server with 220 cores. The benchmark we used is
> https://github.com/rwesson/NEAT
> 
> libgcc/ChangeLog:
> 
> 	* gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
> 	(__gthrw): New function.
> 	(__gthread_rwlock_rdlock): New function.
> 	(__gthread_rwlock_tryrdlock): New function.
> 	(__gthread_rwlock_wrlock): New function.
> 	(__gthread_rwlock_trywrlock): New function.
> 	(__gthread_rwlock_unlock): New function.
> 
> libgfortran/ChangeLog:
> 
> 	* io/async.c (DEBUG_LINE): New macro.
> 	* io/async.h (RWLOCK_DEBUG_ADD): New macro.
> 	(CHECK_RDLOCK): New macro.
> 	(CHECK_WRLOCK): New macro.
> 	(TAIL_RWLOCK_DEBUG_QUEUE): New macro.
> 	(IN_RWLOCK_DEBUG_QUEUE): New macro.
> 	(RDLOCK): New macro.
> 	(WRLOCK): New macro.
> 	(RWUNLOCK): New macro.
> 	(RD_TO_WRLOCK): New macro.
> 	(INTERN_RDLOCK): New macro.
> 	(INTERN_WRLOCK): New macro.
> 	(INTERN_RWUNLOCK): New macro.
> 	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
> 	a comment.
> 	(unit_lock): Remove including associated internal_proto.
> 	(unit_rwlock): New declarations including associated internal_proto.
> 	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
> 	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
> 	unit_lock.
> 	* io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
> 	unit_rwlock instead of LOCK and UNLOCK on unit_lock.
> 	(st_write_done_worker): Likewise.
> 	* io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
> 	comment. Use unit_rwlock variable instead of unit_lock variable.
> 	(get_gfc_unit_from_unit_root): New function.
> 	(get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
> 	instead of LOCK and UNLOCK on unit_lock.
> 	(close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
> 	LOCK and UNLOCK on unit_lock.
> 	(close_units): Likewise.
> 	(newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
> 	unit_lock.
> 	* io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
> 	instead of LOCK and UNLOCK on unit_lock.
> 	(flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
> 	of LOCK and UNLOCK on unit_lock.
> 

It looks like this has broken builds on arm-none-eabi when using newlib:

In file included from /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran
/runtime/error.c:27:
/work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h: In function 
‘dec_waiting_unlocked’:
/work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1023:3: error
: implicit declaration of function ‘WRLOCK’ [-Wimplicit-function-declaration]
 1023 |   WRLOCK (&unit_rwlock);
      |   ^~~~~~
/work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1025:3: error
: implicit declaration of function ‘RWUNLOCK’ [-Wimplicit-function-declaration]
 1025 |   RWUNLOCK (&unit_rwlock);
      |   ^~~~~~~~


R.

> ---
> v1 -> v2:
> Limit the pthread_rwlock usage in libgcc only when __cplusplus isn't defined.
> 
> v2 -> v3:
> Rebase the patch with trunk branch.
> 
> v3 -> v4:
> Update the comments.
> 
> v4 -> v5:
> Fix typos and code formatter.
> 
> v5 -> v6:
> Add unit tests.
> 
> v6 -> v7:
> Update ChangeLog and code formatter.
> 
> Reviewed-by: Hongjiu Lu <hongjiu.lu@intel.com>
> Reviewed-by: Bernhard Reutner-Fischer <rep.dot.nop@gmail.com>
> Reviewed-by: Thomas Koenig <tkoenig@netcologne.de>
> Reviewed-by: Jakub Jelinek <jakub@redhat.com>
> Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
> ---
>  libgcc/gthr-posix.h                           |  60 +++++++
>  libgfortran/io/async.c                        |   4 +
>  libgfortran/io/async.h                        | 151 ++++++++++++++++++
>  libgfortran/io/io.h                           |  15 +-
>  libgfortran/io/transfer.c                     |   8 +-
>  libgfortran/io/unit.c                         | 117 +++++++++-----
>  libgfortran/io/unix.c                         |  16 +-
>  .../testsuite/libgomp.fortran/rwlock_1.f90    |  33 ++++
>  .../testsuite/libgomp.fortran/rwlock_2.f90    |  22 +++
>  .../testsuite/libgomp.fortran/rwlock_3.f90    |  18 +++
>  10 files changed, 386 insertions(+), 58 deletions(-)
>  create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_1.f90
>  create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_2.f90
>  create mode 100644 libgomp/testsuite/libgomp.fortran/rwlock_3.f90
> 
> diff --git a/libgcc/gthr-posix.h b/libgcc/gthr-posix.h
> index aebcfdd9f4c..73283082997 100644
> --- a/libgcc/gthr-posix.h
> +++ b/libgcc/gthr-posix.h
> @@ -48,6 +48,9 @@ typedef pthread_t __gthread_t;
>  typedef pthread_key_t __gthread_key_t;
>  typedef pthread_once_t __gthread_once_t;
>  typedef pthread_mutex_t __gthread_mutex_t;
> +#ifndef __cplusplus
> +typedef pthread_rwlock_t __gthread_rwlock_t;
> +#endif
>  typedef pthread_mutex_t __gthread_recursive_mutex_t;
>  typedef pthread_cond_t __gthread_cond_t;
>  typedef struct timespec __gthread_time_t;
> @@ -58,6 +61,9 @@ typedef struct timespec __gthread_time_t;
>  
>  #define __GTHREAD_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER
>  #define __GTHREAD_MUTEX_INIT_FUNCTION __gthread_mutex_init_function
> +#ifndef __cplusplus
> +#define __GTHREAD_RWLOCK_INIT PTHREAD_RWLOCK_INITIALIZER
> +#endif
>  #define __GTHREAD_ONCE_INIT PTHREAD_ONCE_INIT
>  #if defined(PTHREAD_RECURSIVE_MUTEX_INITIALIZER)
>  #define __GTHREAD_RECURSIVE_MUTEX_INIT PTHREAD_RECURSIVE_MUTEX_INITIALIZER
> @@ -135,6 +141,13 @@ __gthrw(pthread_mutexattr_init)
>  __gthrw(pthread_mutexattr_settype)
>  __gthrw(pthread_mutexattr_destroy)
>  
> +#ifndef __cplusplus
> +__gthrw(pthread_rwlock_rdlock)
> +__gthrw(pthread_rwlock_tryrdlock)
> +__gthrw(pthread_rwlock_wrlock)
> +__gthrw(pthread_rwlock_trywrlock)
> +__gthrw(pthread_rwlock_unlock)
> +#endif
>  
>  #if defined(_LIBOBJC) || defined(_LIBOBJC_WEAK)
>  /* Objective-C.  */
> @@ -885,6 +898,53 @@ __gthread_cond_destroy (__gthread_cond_t* __cond)
>    return __gthrw_(pthread_cond_destroy) (__cond);
>  }
>  
> +#ifndef __cplusplus
> +static inline int
> +__gthread_rwlock_rdlock (__gthread_rwlock_t *__rwlock)
> +{
> +  if (__gthread_active_p ())
> +    return __gthrw_(pthread_rwlock_rdlock) (__rwlock);
> +  else
> +    return 0;
> +}
> +
> +static inline int
> +__gthread_rwlock_tryrdlock (__gthread_rwlock_t *__rwlock)
> +{
> +  if (__gthread_active_p ())
> +    return __gthrw_(pthread_rwlock_tryrdlock) (__rwlock);
> +  else
> +    return 0;
> +}
> +
> +static inline int
> +__gthread_rwlock_wrlock (__gthread_rwlock_t *__rwlock)
> +{
> +  if (__gthread_active_p ())
> +    return __gthrw_(pthread_rwlock_wrlock) (__rwlock);
> +  else
> +    return 0;
> +}
> +
> +static inline int
> +__gthread_rwlock_trywrlock (__gthread_rwlock_t *__rwlock)
> +{
> +  if (__gthread_active_p ())
> +    return __gthrw_(pthread_rwlock_trywrlock) (__rwlock);
> +  else
> +    return 0;
> +}
> +
> +static inline int
> +__gthread_rwlock_unlock (__gthread_rwlock_t *__rwlock)
> +{
> +  if (__gthread_active_p ())
> +    return __gthrw_(pthread_rwlock_unlock) (__rwlock);
> +  else
> +    return 0;
> +}
> +#endif
> +
>  #endif /* _LIBOBJC */
>  
>  #endif /* ! GCC_GTHR_POSIX_H */
> diff --git a/libgfortran/io/async.c b/libgfortran/io/async.c
> index 8fa1f0d4ce0..91bf397105d 100644
> --- a/libgfortran/io/async.c
> +++ b/libgfortran/io/async.c
> @@ -42,6 +42,10 @@ DEBUG_LINE (__thread const char *aio_prefix = MPREFIX);
>  
>  DEBUG_LINE (__gthread_mutex_t debug_queue_lock = __GTHREAD_MUTEX_INIT;)
>  DEBUG_LINE (aio_lock_debug *aio_debug_head = NULL;)
> +#ifdef __GTHREAD_RWLOCK_INIT
> +DEBUG_LINE (aio_rwlock_debug *aio_rwlock_debug_head = NULL;)
> +DEBUG_LINE (__gthread_rwlock_t debug_queue_rwlock = __GTHREAD_RWLOCK_INIT;)
> +#endif
>  
>  /* Current unit for asynchronous I/O.  Needed for error reporting.  */
>  
> diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h
> index ad226c8e856..f112f6870bb 100644
> --- a/libgfortran/io/async.h
> +++ b/libgfortran/io/async.h
> @@ -210,6 +210,128 @@
>      DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \
>    } while (0)
>  
> +#ifdef __GTHREAD_RWLOCK_INIT
> +#define RWLOCK_DEBUG_ADD(rwlock) do {		\
> +    aio_rwlock_debug *n;				\
> +    n = xmalloc (sizeof (aio_rwlock_debug));	\
> +    n->prev = TAIL_RWLOCK_DEBUG_QUEUE;			\
> +    if (n->prev)				\
> +      n->prev->next = n;			\
> +    n->next = NULL;				\
> +    n->line = __LINE__;				\
> +    n->func = __FUNCTION__;			\
> +    n->rw = rwlock;				\
> +    if (!aio_rwlock_debug_head) {			\
> +      aio_rwlock_debug_head = n;			\
> +    }						\
> +  } while (0)
> +
> +#define CHECK_RDLOCK(rwlock, status) do {					\
> +    aio_rwlock_debug *curr;						\
> +    INTERN_WRLOCK (&debug_queue_rwlock);					\
> +    if (__gthread_rwlock_tryrdlock (rwlock)) {				\
> +      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
> +	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
> +      } else								\
> +	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
> +    }									\
> +    else {								\
> +      __gthread_rwlock_unlock (rwlock);					\
> +      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
> +    }									\
> +    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
> +  }while (0)
> +
> +#define CHECK_WRLOCK(rwlock, status) do {					\
> +    aio_rwlock_debug *curr;						\
> +    INTERN_WRLOCK (&debug_queue_rwlock);					\
> +    if (__gthread_rwlock_trywrlock (rwlock)) {				\
> +      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
> +	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
> +      } else								\
> +	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
> +    }									\
> +    else {								\
> +      __gthread_rwlock_unlock (rwlock);					\
> +      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
> +    }									\
> +    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
> +  }while (0)
> +
> +#define TAIL_RWLOCK_DEBUG_QUEUE ({			\
> +      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
> +      while (curr && curr->next) {		\
> +	curr = curr->next;			\
> +      }						\
> +      curr;					\
> +    })
> +
> +#define IN_RWLOCK_DEBUG_QUEUE(rwlock) ({		\
> +      __label__ end;				\
> +      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
> +      while (curr) {				\
> +	if (curr->rw == rwlock) {			\
> +	  goto end;				\
> +	}					\
> +	curr = curr->next;			\
> +      }						\
> +    end:;					\
> +      curr;					\
> +    })
> +
> +#define RDLOCK(rwlock) do {						\
> +    char status[200];							\
> +    CHECK_RDLOCK (rwlock, status);						\
> +    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
> +		 DEBUG_RED "RDLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
> +    INTERN_RDLOCK (rwlock);							\
> +    INTERN_WRLOCK (&debug_queue_rwlock);					\
> +    RWLOCK_DEBUG_ADD (rwlock);						\
> +    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
> +    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
> +  } while (0)
> +
> +#define WRLOCK(rwlock) do {						\
> +    char status[200];							\
> +    CHECK_WRLOCK (rwlock, status);						\
> +    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
> +		 DEBUG_RED "WRLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
> +    INTERN_WRLOCK (rwlock);							\
> +    INTERN_WRLOCK (&debug_queue_rwlock);					\
> +    RWLOCK_DEBUG_ADD (rwlock);						\
> +    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
> +    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
> +  } while (0)
> +
> +#define RWUNLOCK(rwlock) do {						\
> +    aio_rwlock_debug *curr;						\
> +    DEBUG_PRINTF ("%s%-75s %20s():%-5d %18p\n", aio_prefix, DEBUG_GREEN "RWUNLOCK: " DEBUG_NORM #rwlock, \
> +		 __FUNCTION__, __LINE__, (void *) rwlock);		\
> +    INTERN_WRLOCK (&debug_queue_rwlock);					\
> +    curr = IN_RWLOCK_DEBUG_QUEUE (rwlock);					\
> +    if (curr)								\
> +      {									\
> +	if (curr->prev)							\
> +	  curr->prev->next = curr->next;				\
> +	if (curr->next) {						\
> +	  curr->next->prev = curr->prev;				\
> +	  if (curr == aio_rwlock_debug_head)					\
> +	    aio_rwlock_debug_head = curr->next;				\
> +	} else {							\
> +	  if (curr == aio_rwlock_debug_head)					\
> +	    aio_rwlock_debug_head = NULL;					\
> +	}								\
> +	free (curr);							\
> +      }									\
> +    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
> +    INTERN_RWUNLOCK (rwlock);						\
> +  } while (0)
> +
> +#define RD_TO_WRLOCK(rwlock)	\
> +  RWUNLOCK (rwlock);	\
> +  WRLOCK (rwlock);
> +#endif
> +
>  #define DEBUG_LINE(...) __VA_ARGS__
>  
>  #else
> @@ -221,12 +343,31 @@
>  #define LOCK(mutex) INTERN_LOCK (mutex)
>  #define UNLOCK(mutex) INTERN_UNLOCK (mutex)
>  #define TRYLOCK(mutex) (__gthread_mutex_trylock (mutex))
> +#ifdef __GTHREAD_RWLOCK_INIT
> +#define RDLOCK(rwlock) INTERN_RDLOCK (rwlock)
> +#define WRLOCK(rwlock) INTERN_WRLOCK (rwlock)
> +#define RWUNLOCK(rwlock) INTERN_RWUNLOCK (rwlock)
> +#define RD_TO_WRLOCK(rwlock)	\
> +  RWUNLOCK (rwlock);	\
> +  WRLOCK (rwlock);
> +#endif
> +#endif
> +
> +#ifndef __GTHREAD_RWLOCK_INIT
> +#define RDLOCK(rwlock) LOCK (rwlock)
> +#define WRLOCK(rwlock) LOCK (rwlock)
> +#define RWUNLOCK(rwlock) UNLOCK (rwlock)
> +#define RD_TO_WRLOCK(rwlock) do {} while (0)
>  #endif
>  
>  #define INTERN_LOCK(mutex) T_ERROR (__gthread_mutex_lock, mutex);
>  
>  #define INTERN_UNLOCK(mutex) T_ERROR (__gthread_mutex_unlock, mutex);
>  
> +#define INTERN_RDLOCK(rwlock) T_ERROR (__gthread_rwlock_rdlock, rwlock)
> +#define INTERN_WRLOCK(rwlock) T_ERROR (__gthread_rwlock_wrlock, rwlock)
> +#define INTERN_RWUNLOCK(rwlock) T_ERROR (__gthread_rwlock_unlock, rwlock)
> +
>  #if ASYNC_IO
>  
>  /* au->lock has to be held when calling this macro.  */
> @@ -288,8 +429,18 @@ DEBUG_LINE (typedef struct aio_lock_debug{
>    struct aio_lock_debug *prev;
>  } aio_lock_debug;)
>  
> +DEBUG_LINE (typedef struct aio_rwlock_debug{
> +  __gthread_rwlock_t *rw;
> +  int line;
> +  const char *func;
> +  struct aio_rwlock_debug *next;
> +  struct aio_rwlock_debug *prev;
> +} aio_rwlock_debug;)
> +
>  DEBUG_LINE (extern aio_lock_debug *aio_debug_head;)
>  DEBUG_LINE (extern __gthread_mutex_t debug_queue_lock;)
> +DEBUG_LINE (extern aio_rwlock_debug *aio_rwlock_debug_head;)
> +DEBUG_LINE (extern __gthread_rwlock_t debug_queue_rwlock;)
>  
>  /* Thread - local storage of the current unit we are looking at. Needed for
>     error reporting.  */
> diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
> index ecdf1dd3f05..15daa0995b1 100644
> --- a/libgfortran/io/io.h
> +++ b/libgfortran/io/io.h
> @@ -690,7 +690,7 @@ typedef struct gfc_unit
>       from the UNIT_ROOT tree, but doesn't free it and the
>       last of the waiting threads will do that.
>       This must be either atomically increased/decreased, or
> -     always guarded by UNIT_LOCK.  */
> +     always guarded by UNIT_RWLOCK.  */
>    int waiting;
>    /* Flag set by close_unit if the unit as been closed.
>       Must be manipulated under unit's lock.  */
> @@ -769,8 +769,13 @@ internal_proto(default_recl);
>  extern gfc_unit *unit_root;
>  internal_proto(unit_root);
>  
> -extern __gthread_mutex_t unit_lock;
> -internal_proto(unit_lock);
> +#ifdef __GTHREAD_RWLOCK_INIT
> +extern __gthread_rwlock_t unit_rwlock;
> +internal_proto(unit_rwlock);
> +#else
> +extern __gthread_mutex_t unit_rwlock;
> +internal_proto(unit_rwlock);
> +#endif
>  
>  extern int close_unit (gfc_unit *);
>  internal_proto(close_unit);
> @@ -1015,9 +1020,9 @@ dec_waiting_unlocked (gfc_unit *u)
>  #ifdef HAVE_ATOMIC_FETCH_ADD
>    (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
>  #else
> -  __gthread_mutex_lock (&unit_lock);
> +  WRLOCK (&unit_rwlock);
>    u->waiting--;
> -  __gthread_mutex_unlock (&unit_lock);
> +  RWUNLOCK (&unit_rwlock);
>  #endif
>  }
>  
> diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
> index 500db90c828..00d516adcb0 100644
> --- a/libgfortran/io/transfer.c
> +++ b/libgfortran/io/transfer.c
> @@ -4538,9 +4538,9 @@ st_read_done_worker (st_parameter_dt *dtp, bool unlock)
>     if (free_newunit)
>       {
>         /* Avoid inverse lock issues by placing after unlock_unit.  */
> -       LOCK (&unit_lock);
> +       WRLOCK (&unit_rwlock);
>         newunit_free (dtp->common.unit);
> -       UNLOCK (&unit_lock);
> +       RWUNLOCK (&unit_rwlock);
>       }
>  }
>  
> @@ -4634,9 +4634,9 @@ st_write_done_worker (st_parameter_dt *dtp, bool unlock)
>     if (free_newunit)
>       {
>         /* Avoid inverse lock issues by placing after unlock_unit.  */
> -       LOCK (&unit_lock);
> +       WRLOCK (&unit_rwlock);
>         newunit_free (dtp->common.unit);
> -       UNLOCK (&unit_lock);
> +       RWUNLOCK (&unit_rwlock);
>       }
>  }
>  
> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c
> index 36d025949c2..0c8c35e464e 100644
> --- a/libgfortran/io/unit.c
> +++ b/libgfortran/io/unit.c
> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
>  
>  
>  /* IO locking rules:
> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
> +   UNIT_RWLOCK is a master rw lock, protecting UNIT_ROOT tree and UNIT_CACHE.
> +   Using an rwlock improves efficiency by allowing us to separate readers
> +   and writers of both UNIT_ROOT and UNIT_CACHE.
>     Concurrent use of different units should be supported, so
>     each unit has its own lock, LOCK.
>     Open should be atomic with its reopening of units and list_read.c
>     in several places needs find_unit another unit while holding stdin
> -   unit's lock, so it must be possible to acquire UNIT_LOCK while holding
> +   unit's lock, so it must be possible to acquire UNIT_RWLOCK while holding
>     some unit's lock.  Therefore to avoid deadlocks, it is forbidden
> -   to acquire unit's private locks while holding UNIT_LOCK, except
> +   to acquire unit's private locks while holding UNIT_RWLOCK, except
>     for freshly created units (where no other thread can get at their
>     address yet) or when using just trylock rather than lock operation.
>     In addition to unit's private lock each unit has a WAITERS counter
>     and CLOSED flag.  WAITERS counter must be either only
>     atomically incremented/decremented in all places (if atomic builtins
> -   are supported), or protected by UNIT_LOCK in all places (otherwise).
> +   are supported), or protected by UNIT_RWLOCK in all places (otherwise).
>     CLOSED flag must be always protected by unit's LOCK.
> -   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_LOCK held,
> +   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_RWLOCK held,
>     WAITERS must be incremented to avoid concurrent close from freeing
> -   the unit between unlocking UNIT_LOCK and acquiring unit's LOCK.
> -   Unit freeing is always done under UNIT_LOCK.  If close_unit sees any
> +   the unit between unlocking UNIT_RWLOCK and acquiring unit's LOCK.
> +   Unit freeing is always done under UNIT_RWLOCK.  If close_unit sees any
>     WAITERS, it doesn't free the unit but instead sets the CLOSED flag
>     and the thread that decrements WAITERS to zero while CLOSED flag is
> -   set is responsible for freeing it (while holding UNIT_LOCK).
> +   set is responsible for freeing it (while holding UNIT_RWLOCK).
>     flush_all_units operation is iterating over the unit tree with
> -   increasing UNIT_NUMBER while holding UNIT_LOCK and attempting to
> +   increasing UNIT_NUMBER while holding UNIT_RWLOCK and attempting to
>     flush each unit (and therefore needs the unit's LOCK held as well).
>     To avoid deadlocks, it just trylocks the LOCK and if unsuccessful,
> -   remembers the current unit's UNIT_NUMBER, unlocks UNIT_LOCK, acquires
> -   unit's LOCK and after flushing reacquires UNIT_LOCK and restarts with
> +   remembers the current unit's UNIT_NUMBER, unlocks UNIT_RWLOCK, acquires
> +   unit's LOCK and after flushing reacquires UNIT_RWLOCK and restarts with
>     the smallest UNIT_NUMBER above the last one flushed.
>  
>     If find_unit/find_or_create_unit/find_file/get_unit routines return
> @@ -101,10 +103,14 @@ gfc_offset max_offset;
>  gfc_offset default_recl;
>  
>  gfc_unit *unit_root;
> +#ifdef __GTHREAD_RWLOCK_INIT
> +__gthread_rwlock_t unit_rwlock = __GTHREAD_RWLOCK_INIT;
> +#else
>  #ifdef __GTHREAD_MUTEX_INIT
> -__gthread_mutex_t unit_lock = __GTHREAD_MUTEX_INIT;
> +__gthread_mutex_t unit_rwlock = __GTHREAD_MUTEX_INIT;
>  #else
> -__gthread_mutex_t unit_lock;
> +__gthread_mutex_t unit_rwlock;
> +#endif
>  #endif
>  
>  /* We use these filenames for error reporting.  */
> @@ -317,6 +323,28 @@ delete_unit (gfc_unit *old)
>    unit_root = delete_treap (old, unit_root);
>  }
>  
> +/* get_gfc_unit_from_root()-- Given an integer, return a pointer
> +   to the unit structure. Returns NULL if the unit does not exist,
> +   otherwise returns a locked unit. */
> +
> +static inline gfc_unit *
> +get_gfc_unit_from_unit_root (int n)
> +{
> +  gfc_unit *p;
> +  int c = 0;
> +  p = unit_root;
> +  while (p != NULL)
> +    {
> +      c = compare (n, p->unit_number);
> +      if (c < 0)
> +        p = p->left;
> +      if (c > 0)
> +        p = p->right;
> +      if (c == 0)
> +        break;
> +    }
> +  return p;
> +}
>  
>  /* get_gfc_unit()-- Given an integer, return a pointer to the unit
>     structure.  Returns NULL if the unit does not exist,
> @@ -329,7 +357,7 @@ get_gfc_unit (int n, int do_create)
>    int c, created = 0;
>  
>    NOTE ("Unit n=%d, do_create = %d", n, do_create);
> -  LOCK (&unit_lock);
> +  RDLOCK (&unit_rwlock);
>  
>  retry:
>    for (c = 0; c < CACHE_SIZE; c++)
> @@ -339,18 +367,25 @@ retry:
>  	goto found;
>        }
>  
> -  p = unit_root;
> -  while (p != NULL)
> -    {
> -      c = compare (n, p->unit_number);
> -      if (c < 0)
> -	p = p->left;
> -      if (c > 0)
> -	p = p->right;
> -      if (c == 0)
> -	break;
> -    }
> -
> +  p = get_gfc_unit_from_unit_root (n);
> +
> +  /* We did not find a unit in the cache nor in the unit list,
> +    create a new (locked) unit and insert into the unit list and
> +    cache. Manipulating either or both the unit list and the unit
> +    cache requires to hold a write-lock [for obvious reasons]:
> +    By separating the read/write lock, we will greatly reduce
> +    the contention on the read part, while the write part is
> +    unlikely once the unit hits the cache. */
> +  RD_TO_WRLOCK (&unit_rwlock);
> +
> +  /* In the case of high concurrency, when multiple threads want
> +    to find or create the same unit, the unit number may not
> +    exist in cache nor in the unit list during read phase, then
> +    threads will acquire the write-lock to insert the same unit
> +    number to unit list. To avoid duplicate insert, we need to
> +    find unit list once again to ensure that the unit number
> +    not exist. */
> +  p = get_gfc_unit_from_unit_root (n);
>    if (p == NULL && do_create)
>      {
>        p = insert_unit (n);
> @@ -368,8 +403,8 @@ retry:
>    if (created)
>      {
>        /* Newly created units have their lock held already
> -	 from insert_unit.  Just unlock UNIT_LOCK and return.  */
> -      UNLOCK (&unit_lock);
> +	 from insert_unit. Just unlock UNIT_RWLOCK and return. */
> +      RWUNLOCK (&unit_rwlock);
>        return p;
>      }
>  
> @@ -380,7 +415,7 @@ found:
>        if (! TRYLOCK (&p->lock))
>  	{
>  	  /* assert (p->closed == 0); */
> -	  UNLOCK (&unit_lock);
> +	  RWUNLOCK (&unit_rwlock);
>  	  return p;
>  	}
>  
> @@ -388,14 +423,14 @@ found:
>      }
>  
>  
> -  UNLOCK (&unit_lock);
> +  RWUNLOCK (&unit_rwlock);
>  
>    if (p != NULL && (p->child_dtio == 0))
>      {
>        LOCK (&p->lock);
>        if (p->closed)
>  	{
> -	  LOCK (&unit_lock);
> +	  WRLOCK (&unit_rwlock);
>  	  UNLOCK (&p->lock);
>  	  if (predec_waiting_locked (p) == 0)
>  	    destroy_unit_mutex (p);
> @@ -594,8 +629,8 @@ init_units (void)
>  #endif
>  #endif
>  
> -#ifndef __GTHREAD_MUTEX_INIT
> -  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_lock);
> +#if (!defined(__GTHREAD_RWLOCK_INIT) && !defined(__GTHREAD_MUTEX_INIT))
> +  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_rwlock);
>  #endif
>  
>    if (sizeof (max_offset) == 8)
> @@ -732,7 +767,7 @@ close_unit_1 (gfc_unit *u, int locked)
>  
>    u->closed = 1;
>    if (!locked)
> -    LOCK (&unit_lock);
> +    WRLOCK (&unit_rwlock);
>  
>    for (i = 0; i < CACHE_SIZE; i++)
>      if (unit_cache[i] == u)
> @@ -759,7 +794,7 @@ close_unit_1 (gfc_unit *u, int locked)
>      destroy_unit_mutex (u);
>  
>    if (!locked)
> -    UNLOCK (&unit_lock);
> +    RWUNLOCK (&unit_rwlock);
>  
>    return rc;
>  }
> @@ -796,10 +831,10 @@ close_unit (gfc_unit *u)
>  void
>  close_units (void)
>  {
> -  LOCK (&unit_lock);
> +  WRLOCK (&unit_rwlock);
>    while (unit_root != NULL)
>      close_unit_1 (unit_root, 1);
> -  UNLOCK (&unit_lock);
> +  RWUNLOCK (&unit_rwlock);
>  
>    free (newunits);
>  
> @@ -906,7 +941,7 @@ finish_last_advance_record (gfc_unit *u)
>  int
>  newunit_alloc (void)
>  {
> -  LOCK (&unit_lock);
> +  WRLOCK (&unit_rwlock);
>    if (!newunits)
>      {
>        newunits = xcalloc (16, 1);
> @@ -920,7 +955,7 @@ newunit_alloc (void)
>          {
>            newunits[ii] = true;
>            newunit_lwi = ii + 1;
> -	  UNLOCK (&unit_lock);
> +	  RWUNLOCK (&unit_rwlock);
>            return -ii + NEWUNIT_START;
>          }
>      }
> @@ -933,12 +968,12 @@ newunit_alloc (void)
>    memset (newunits + old_size, 0, old_size);
>    newunits[old_size] = true;
>    newunit_lwi = old_size + 1;
> -    UNLOCK (&unit_lock);
> +    RWUNLOCK (&unit_rwlock);
>    return -old_size + NEWUNIT_START;
>  }
>  
>  
> -/* Free a previously allocated newunit= unit number.  unit_lock must
> +/* Free a previously allocated newunit= unit number.  unit_rwlock must
>     be held when calling.  */
>  
>  void
> diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
> index d466df979df..dcae051744d 100644
> --- a/libgfortran/io/unix.c
> +++ b/libgfortran/io/unix.c
> @@ -1773,7 +1773,7 @@ find_file (const char *file, gfc_charlen_type file_len)
>    id = id_from_path (path);
>  #endif
>  
> -  LOCK (&unit_lock);
> +  RDLOCK (&unit_rwlock);
>  retry:
>    u = find_file0 (unit_root, FIND_FILE0_ARGS);
>    if (u != NULL)
> @@ -1782,19 +1782,19 @@ retry:
>        if (! __gthread_mutex_trylock (&u->lock))
>  	{
>  	  /* assert (u->closed == 0); */
> -	  UNLOCK (&unit_lock);
> +	  RWUNLOCK (&unit_rwlock);
>  	  goto done;
>  	}
>  
>        inc_waiting_locked (u);
>      }
> -  UNLOCK (&unit_lock);
> +  RWUNLOCK (&unit_rwlock);
>    if (u != NULL)
>      {
>        LOCK (&u->lock);
>        if (u->closed)
>  	{
> -	  LOCK (&unit_lock);
> +	  RDLOCK (&unit_rwlock);
>  	  UNLOCK (&u->lock);
>  	  if (predec_waiting_locked (u) == 0)
>  	    free (u);
> @@ -1838,13 +1838,13 @@ flush_all_units (void)
>    gfc_unit *u;
>    int min_unit = 0;
>  
> -  LOCK (&unit_lock);
> +  WRLOCK (&unit_rwlock);
>    do
>      {
>        u = flush_all_units_1 (unit_root, min_unit);
>        if (u != NULL)
>  	inc_waiting_locked (u);
> -      UNLOCK (&unit_lock);
> +      RWUNLOCK (&unit_rwlock);
>        if (u == NULL)
>  	return;
>  
> @@ -1855,13 +1855,13 @@ flush_all_units (void)
>        if (u->closed == 0)
>  	{
>  	  sflush (u->s);
> -	  LOCK (&unit_lock);
> +	  WRLOCK (&unit_rwlock);
>  	  UNLOCK (&u->lock);
>  	  (void) predec_waiting_locked (u);
>  	}
>        else
>  	{
> -	  LOCK (&unit_lock);
> +	  WRLOCK (&unit_rwlock);
>  	  UNLOCK (&u->lock);
>  	  if (predec_waiting_locked (u) == 0)
>  	    free (u);
> diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_1.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
> new file mode 100644
> index 00000000000..f90ecbeb00f
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
> @@ -0,0 +1,33 @@
> +! { dg-do run }
> +! Multiple threads call open/write/read/close in concurrency with different unit number,
> +! threads can acquire read lock concurrently, to find unit from cache or unit list very frequently,
> +! if not found, threads will acquire the write lock exclusively to insert unit to cache and unit list.
> +! This test case is used to stress both the read and write lock when access unit cache and list.
> +program main
> +  use omp_lib
> +  implicit none
> +  integer:: unit_number, v1, v2, i
> +  character(11) :: file_name
> +  character(3) :: async = "no"
> +  !$omp parallel private (unit_number, v1, v2, file_name, async, i)
> +    do i = 0, 100
> +      unit_number = 10 + omp_get_thread_num ()
> +      write (file_name, "(I3, A)") unit_number, "_tst.dat"
> +      file_name = adjustl(file_name)
> +      open (unit_number, file=file_name, asynchronous="yes")
> +      ! call inquire with file parameter to test find_file in unix.c
> +      inquire (file=file_name, asynchronous=async)
> +      if (async /= "YES") stop 1
> +      write (unit_number, *, asynchronous="yes") unit_number
> +      write (unit_number, *, asynchronous="yes") unit_number + 1
> +      close(unit_number)
> +
> +      open (unit_number, file = file_name, asynchronous="yes")
> +      read (unit_number, *, asynchronous="yes") v1
> +      read (unit_number, *, asynchronous="yes") v2
> +      wait (unit_number)
> +      if ((v1 /= unit_number) .or. (v2 /= unit_number + 1)) stop 2
> +      close(unit_number, status="delete")
> +    end do
> +  !$omp end parallel
> +end program
> diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_2.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_2.f90
> new file mode 100644
> index 00000000000..08c80d14cfb
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/rwlock_2.f90
> @@ -0,0 +1,22 @@
> +! { dg-do run }
> +! Insert a unit into cache at the beginning, then start multiple
> +! threads to access the same unit concurrency, unit will be found in unit cache during the read lock phase.
> +! This test case is used to test the read lock when access unit cache and list.
> +program main
> +  use omp_lib
> +  implicit none
> +  integer:: thread_id, total_threads, i, j
> +  total_threads = omp_get_max_threads ()
> +  open (10, file='tst.dat', asynchronous="yes")
> +  !$omp parallel private (thread_id, i, j)
> +    do i = 1, 100
> +      thread_id = omp_get_thread_num ()
> +      do j = 1, 100
> +        write (10, *, asynchronous="yes") thread_id, i
> +      end do
> +    end do
> +  !$omp end parallel
> +  ! call inquire with file parameter to test find_file in unix.c
> +  call flush ()
> +  close (10, status="delete")
> +end program
> diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_3.f90 b/libgomp/testsuite/libgomp.fortran/rwlock_3.f90
> new file mode 100644
> index 00000000000..1906fcd7a0b
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.fortran/rwlock_3.f90
> @@ -0,0 +1,18 @@
> +! { dg-do run }
> +! Find or create the same unit number in concurrency,
> +! at beginning, threads cannot find the unit in cache or unit list,
> +! then threads will acquire the write lock to insert unit.
> +! This test case is used to ensure that no duplicate unit number will be
> +! inserted into cache nor unit list when same unit was accessed in concurrency.
> +program main
> +  use omp_lib
> +  implicit none
> +  integer:: i
> +  !$omp parallel private (i)
> +    do i = 1, 100
> +      open (10, file='tst.dat', asynchronous="yes")
> +      ! Delete the unit number from cache and unit list to stress write lock.
> +      close (10, status="delete")
> +    end do
> +  !$omp end parallel
> +end program


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-14 12:39                             ` Jakub Jelinek
@ 2023-12-15  5:43                               ` Zhu, Lipeng
  0 siblings, 0 replies; 38+ messages in thread
From: Zhu, Lipeng @ 2023-12-15  5:43 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Schwinge
  Cc: H.J. Lu, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang


On 2023/12/14 20:39, Jakub Jelinek wrote:
> On Thu, Dec 14, 2023 at 01:29:01PM +0100, Thomas Schwinge wrote:
>>> Sure, I will look into that.
>>>
>>> BTW, I didn’t have the PowerPC in hands, do you mind granting the access of your
>>> test environment to me to help reproduce the issue?
>>
>> That's unfortunately not possible: it's behind company VPN, restricted
>> access.  :-/ I'll later try to have at least a quick look where it's
>> hanging, or what it's doing.
> 
> There is always https://gcc.gnu.org/wiki/CompileFarm
> There are e.g. 192, 160, 128, 80, 64 thread power machines.

Thanks Jakub,

I submit a request to join the CompileFarm project. Could you help to approve it?

Lipeng Zhu

> Whether any of them can actually reproduce it, I haven't tried.
> But shouldn't be that time consuming to reproduce, for this kind of thing
> one can just build
> .../configure --enable-languages=c,c++,fortran --disable-bootstrap --disable-libsanitizer
> make -jN
> compiler and just
> cd power*/libgomp
> make check RUNTESTFLAGS=fortran.exp=rwlock*.f90
> repeatedly to see if it gets stuck.
> 
> 	Jakub
> 
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-14 15:50               ` Richard Earnshaw (lists)
@ 2023-12-15 11:31                 ` Lipeng Zhu
  2023-12-15 19:23                   ` Richard Earnshaw
  0 siblings, 1 reply; 38+ messages in thread
From: Lipeng Zhu @ 2023-12-15 11:31 UTC (permalink / raw)
  To: Richard Earnshaw (lists), jakub
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo



On 2023/12/14 23:50, Richard Earnshaw (lists) wrote:
> On 09/12/2023 15:39, Lipeng Zhu wrote:
>> This patch try to introduce the rwlock and split the read/write to
>> unit_root tree and unit_cache with rwlock instead of the mutex to
>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>> to step into the insert_unit function is around 30%, in most instances,
>> we can get the unit in the phase of reading the unit_cache or unit_root
>> tree. So split the read/write phase by rwlock would be an approach to
>> make it more parallel.
>>
>> BTW, the IPC metrics can gain around 9x in our test
>> server with 220 cores. The benchmark we used is
>> https://github.com/rwesson/NEAT
>>
>> libgcc/ChangeLog:
>>
>> 	* gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
>> 	(__gthrw): New function.
>> 	(__gthread_rwlock_rdlock): New function.
>> 	(__gthread_rwlock_tryrdlock): New function.
>> 	(__gthread_rwlock_wrlock): New function.
>> 	(__gthread_rwlock_trywrlock): New function.
>> 	(__gthread_rwlock_unlock): New function.
>>
>> libgfortran/ChangeLog:
>>
>> 	* io/async.c (DEBUG_LINE): New macro.
>> 	* io/async.h (RWLOCK_DEBUG_ADD): New macro.
>> 	(CHECK_RDLOCK): New macro.
>> 	(CHECK_WRLOCK): New macro.
>> 	(TAIL_RWLOCK_DEBUG_QUEUE): New macro.
>> 	(IN_RWLOCK_DEBUG_QUEUE): New macro.
>> 	(RDLOCK): New macro.
>> 	(WRLOCK): New macro.
>> 	(RWUNLOCK): New macro.
>> 	(RD_TO_WRLOCK): New macro.
>> 	(INTERN_RDLOCK): New macro.
>> 	(INTERN_WRLOCK): New macro.
>> 	(INTERN_RWUNLOCK): New macro.
>> 	* io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
>> 	a comment.
>> 	(unit_lock): Remove including associated internal_proto.
>> 	(unit_rwlock): New declarations including associated internal_proto.
>> 	(dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
>> 	instead of __gthread_mutex_lock and __gthread_mutex_unlock on
>> 	unit_lock.
>> 	* io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
>> 	unit_rwlock instead of LOCK and UNLOCK on unit_lock.
>> 	(st_write_done_worker): Likewise.
>> 	* io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
>> 	comment. Use unit_rwlock variable instead of unit_lock variable.
>> 	(get_gfc_unit_from_unit_root): New function.
>> 	(get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
>> 	instead of LOCK and UNLOCK on unit_lock.
>> 	(close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
>> 	LOCK and UNLOCK on unit_lock.
>> 	(close_units): Likewise.
>> 	(newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
>> 	unit_lock.
>> 	* io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
>> 	instead of LOCK and UNLOCK on unit_lock.
>> 	(flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
>> 	of LOCK and UNLOCK on unit_lock.
>>
> 
> It looks like this has broken builds on arm-none-eabi when using newlib:
> 
> In file included from /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran
> /runtime/error.c:27:
> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h: In function
> ‘dec_waiting_unlocked’:
> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1023:3: error
> : implicit declaration of function ‘WRLOCK’ [-Wimplicit-function-declaration]
>   1023 |   WRLOCK (&unit_rwlock);
>        |   ^~~~~~
> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1025:3: error
> : implicit declaration of function ‘RWUNLOCK’ [-Wimplicit-function-declaration]
>   1025 |   RWUNLOCK (&unit_rwlock);
>        |   ^~~~~~~~
> 
> 
> R.

Hi Richard,

The root cause is that the macro WRLOCK and RWUNLOCK are not defined in 
io.h. The reason of x86 platform not failed is that 
HAVE_ATOMIC_FETCH_ADD is defined then caused above macros were never 
been used. Code logic show as below:
#ifdef HAVE_ATOMIC_FETCH_ADD
   (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
#else
   WRLOCK (&unit_rwlock);
   u->waiting--;
   RWUNLOCK (&unit_rwlock);
#endif

I just draft a patch try to fix this bug, because I didn't have arm 
platform, would you help to validate if it was fixed on arm platform?

diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index 15daa0995b1..c7f0f7d7d9e 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -1020,9 +1020,15 @@ dec_waiting_unlocked (gfc_unit *u)
  #ifdef HAVE_ATOMIC_FETCH_ADD
    (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
  #else
-  WRLOCK (&unit_rwlock);
+#ifdef __GTHREAD_RWLOCK_INIT
+  __gthread_rwlock_wrlock (&unit_rwlock);
+  u->waiting--;
+  __gthread_rwlock_unlock (&unit_rwlock);
+#else
+  __gthread_mutex_lock (&unit_rwlock);
    u->waiting--;
-  RWUNLOCK (&unit_rwlock);
+  __gthread_mutex_unlock (&unit_rwlock);
+#endif
  #endif
  }


Lipeng Zhu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-15 11:31                 ` Lipeng Zhu
@ 2023-12-15 19:23                   ` Richard Earnshaw
  2024-01-02 11:57                     ` Vaseeharan Vinayagamoorthy
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Earnshaw @ 2023-12-15 19:23 UTC (permalink / raw)
  To: Lipeng Zhu, Richard Earnshaw (lists), jakub
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo



On 15/12/2023 11:31, Lipeng Zhu wrote:
> 
> 
> On 2023/12/14 23:50, Richard Earnshaw (lists) wrote:
>> On 09/12/2023 15:39, Lipeng Zhu wrote:
>>> This patch try to introduce the rwlock and split the read/write to
>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>>> to step into the insert_unit function is around 30%, in most instances,
>>> we can get the unit in the phase of reading the unit_cache or unit_root
>>> tree. So split the read/write phase by rwlock would be an approach to
>>> make it more parallel.
>>>
>>> BTW, the IPC metrics can gain around 9x in our test
>>> server with 220 cores. The benchmark we used is
>>> https://github.com/rwesson/NEAT
>>>
>>> libgcc/ChangeLog:
>>>
>>>     * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
>>>     (__gthrw): New function.
>>>     (__gthread_rwlock_rdlock): New function.
>>>     (__gthread_rwlock_tryrdlock): New function.
>>>     (__gthread_rwlock_wrlock): New function.
>>>     (__gthread_rwlock_trywrlock): New function.
>>>     (__gthread_rwlock_unlock): New function.
>>>
>>> libgfortran/ChangeLog:
>>>
>>>     * io/async.c (DEBUG_LINE): New macro.
>>>     * io/async.h (RWLOCK_DEBUG_ADD): New macro.
>>>     (CHECK_RDLOCK): New macro.
>>>     (CHECK_WRLOCK): New macro.
>>>     (TAIL_RWLOCK_DEBUG_QUEUE): New macro.
>>>     (IN_RWLOCK_DEBUG_QUEUE): New macro.
>>>     (RDLOCK): New macro.
>>>     (WRLOCK): New macro.
>>>     (RWUNLOCK): New macro.
>>>     (RD_TO_WRLOCK): New macro.
>>>     (INTERN_RDLOCK): New macro.
>>>     (INTERN_WRLOCK): New macro.
>>>     (INTERN_RWUNLOCK): New macro.
>>>     * io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
>>>     a comment.
>>>     (unit_lock): Remove including associated internal_proto.
>>>     (unit_rwlock): New declarations including associated internal_proto.
>>>     (dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
>>>     instead of __gthread_mutex_lock and __gthread_mutex_unlock on
>>>     unit_lock.
>>>     * io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
>>>     unit_rwlock instead of LOCK and UNLOCK on unit_lock.
>>>     (st_write_done_worker): Likewise.
>>>     * io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
>>>     comment. Use unit_rwlock variable instead of unit_lock variable.
>>>     (get_gfc_unit_from_unit_root): New function.
>>>     (get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
>>>     instead of LOCK and UNLOCK on unit_lock.
>>>     (close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
>>>     LOCK and UNLOCK on unit_lock.
>>>     (close_units): Likewise.
>>>     (newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
>>>     unit_lock.
>>>     * io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
>>>     instead of LOCK and UNLOCK on unit_lock.
>>>     (flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
>>>     of LOCK and UNLOCK on unit_lock.
>>>
>>
>> It looks like this has broken builds on arm-none-eabi when using newlib:
>>
>> In file included from 
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran
>> /runtime/error.c:27:
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h: In 
>> function
>> ‘dec_waiting_unlocked’:
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1023:3: error
>> : implicit declaration of function ‘WRLOCK’ 
>> [-Wimplicit-function-declaration]
>>   1023 |   WRLOCK (&unit_rwlock);
>>        |   ^~~~~~
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1025:3: error
>> : implicit declaration of function ‘RWUNLOCK’ 
>> [-Wimplicit-function-declaration]
>>   1025 |   RWUNLOCK (&unit_rwlock);
>>        |   ^~~~~~~~
>>
>>
>> R.
> 
> Hi Richard,
> 
> The root cause is that the macro WRLOCK and RWUNLOCK are not defined in 
> io.h. The reason of x86 platform not failed is that 
> HAVE_ATOMIC_FETCH_ADD is defined then caused above macros were never 
> been used. Code logic show as below:
> #ifdef HAVE_ATOMIC_FETCH_ADD
>    (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
> #else
>    WRLOCK (&unit_rwlock);
>    u->waiting--;
>    RWUNLOCK (&unit_rwlock);
> #endif
> 
> I just draft a patch try to fix this bug, because I didn't have arm 
> platform, would you help to validate if it was fixed on arm platform?
> 
> diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
> index 15daa0995b1..c7f0f7d7d9e 100644
> --- a/libgfortran/io/io.h
> +++ b/libgfortran/io/io.h
> @@ -1020,9 +1020,15 @@ dec_waiting_unlocked (gfc_unit *u)
>   #ifdef HAVE_ATOMIC_FETCH_ADD
>     (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
>   #else
> -  WRLOCK (&unit_rwlock);
> +#ifdef __GTHREAD_RWLOCK_INIT
> +  __gthread_rwlock_wrlock (&unit_rwlock);
> +  u->waiting--;
> +  __gthread_rwlock_unlock (&unit_rwlock);
> +#else
> +  __gthread_mutex_lock (&unit_rwlock);
>     u->waiting--;
> -  RWUNLOCK (&unit_rwlock);
> +  __gthread_mutex_unlock (&unit_rwlock);
> +#endif
>   #endif
>   }
> 
> 
> Lipeng Zhu

Hi Lipeng,

Thanks for the quick reply.  I can confirm that with the above change 
the bootstrap failure is fixed.  However, this shouldn't be considered a 
formal review; libgfortran is not really my area.

I'll be away now until January 2nd.

Richard.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-13 20:52                       ` Thomas Schwinge
  2023-12-14  2:28                         ` Zhu, Lipeng
@ 2023-12-21 11:42                         ` Thomas Schwinge
  2023-12-22  6:48                           ` Lipeng Zhu
  2024-01-03  9:14                           ` Lipeng Zhu
  1 sibling, 2 replies; 38+ messages in thread
From: Thomas Schwinge @ 2023-12-21 11:42 UTC (permalink / raw)
  To: Zhu, Lipeng, H.J. Lu
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang, Tobias Burnus

Hi!

On 2023-12-13T21:52:29+0100, I wrote:
> On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
>> On 2023/12/12 1:45, H.J. Lu wrote:
>>> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> wrote:
>>> > On 2023/12/9 23:23, Jakub Jelinek wrote:
>>> > > On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
>>> > > > This patch try to introduce the rwlock and split the read/write to
>>> > > > unit_root tree and unit_cache with rwlock instead of the mutex to
>>> > > > increase CPU efficiency. In the get_gfc_unit function, the
>>> > > > percentage to step into the insert_unit function is around 30%, in
>>> > > > most instances, we can get the unit in the phase of reading the
>>> > > > unit_cache or unit_root tree. So split the read/write phase by
>>> > > > rwlock would be an approach to make it more parallel.
>>> > > >
>>> > > > BTW, the IPC metrics can gain around 9x in our test server with
>>> > > > 220 cores. The benchmark we used is
>>> > > > https://github.com/rwesson/NEAT
>
>>> > > Ok for trunk, thanks.
>
>>> > Thanks! Looking forward to landing to trunk.
>
>>> Pushed for you.

> I've just filed <https://gcc.gnu.org/PR113005>
> "'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution test timeouts".
> Would you be able to look into that?

See my update in there.


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-21 11:42                         ` Thomas Schwinge
@ 2023-12-22  6:48                           ` Lipeng Zhu
  2024-01-03  9:14                           ` Lipeng Zhu
  1 sibling, 0 replies; 38+ messages in thread
From: Lipeng Zhu @ 2023-12-22  6:48 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang, Tobias Burnus, H.J. Lu

Hi Thomas,

On 2023/12/21 19:42, Thomas Schwinge wrote:
> Hi!
> 
> On 2023-12-13T21:52:29+0100, I wrote:
>> On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
>>> On 2023/12/12 1:45, H.J. Lu wrote:
>>>> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> wrote:
>>>>> On 2023/12/9 23:23, Jakub Jelinek wrote:
>>>>>> On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
>>>>>>> This patch try to introduce the rwlock and split the read/write to
>>>>>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>>>>>> increase CPU efficiency. In the get_gfc_unit function, the
>>>>>>> percentage to step into the insert_unit function is around 30%, in
>>>>>>> most instances, we can get the unit in the phase of reading the
>>>>>>> unit_cache or unit_root tree. So split the read/write phase by
>>>>>>> rwlock would be an approach to make it more parallel.
>>>>>>>
>>>>>>> BTW, the IPC metrics can gain around 9x in our test server with
>>>>>>> 220 cores. The benchmark we used is
>>>>>>> https://github.com/rwesson/NEAT
>>
>>>>>> Ok for trunk, thanks.
>>
>>>>> Thanks! Looking forward to landing to trunk.
>>
>>>> Pushed for you.
> 
>> I've just filed <https://gcc.gnu.org/PR113005>
>> "'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution test timeouts".
>> Would you be able to look into that?
> 
> See my update in there.
> 
> 
> Grüße
>   Thomas
> -------------- > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 
201, 80634 München; Gesellschaft mit beschränkter Haftung; 
Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: 
München; Registergericht München, HRB 106955
> 

Since I don't have gcc bugzilla account. Reply in this thread:
Limit themselves to some lower 'OMP_NUM_THREADS' should be an option or 
increase the execution timeout?

But I can't reproduce the execution timeout failure on both powerpc9 and 
powerpc10 arch machine. And I also tried to decrease the CPU frequency 
from 2.6G to 800M, these test cases still can run successfully.

 > so only a little bit of an improvement of the new "rwlock" 
libgfortran vs. old "mutex" GCC 10 one, curiously.  (But supposedly that 
depends on the hardware or other factors?)

The rwlock can increase the IPC a lot, maybe the wall time you listed is 
not obvious.

$ grep ^cpu < /proc/cpuinfo | uniq -c

192 cpu             : POWER10 (architected), altivec supported

Native configuration is powerpc64le-unknown-linux-gnu

Schedule of variations:
     unix

PASS: libgomp.fortran/rwlock_1.f90   -O0  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O0  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O1  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O1  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O2  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O2  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
PASS: libgomp.fortran/rwlock_1.f90   -O3 -g  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -O3 -g  execution test
PASS: libgomp.fortran/rwlock_1.f90   -Os  (test for excess errors)
PASS: libgomp.fortran/rwlock_1.f90   -Os  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O0  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O0  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O1  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O1  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O2  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O2  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
PASS: libgomp.fortran/rwlock_2.f90   -O3 -g  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -O3 -g  execution test
PASS: libgomp.fortran/rwlock_2.f90   -Os  (test for excess errors)
PASS: libgomp.fortran/rwlock_2.f90   -Os  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O0  (test for excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O0  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O1  (test for excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O1  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O2  (test for excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O2  execution test
PASS: libgomp.fortran/rwlock_3.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
PASS: libgomp.fortran/rwlock_3.f90   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  execution test

Lipeng Zhu


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-15 19:23                   ` Richard Earnshaw
@ 2024-01-02 11:57                     ` Vaseeharan Vinayagamoorthy
  2024-01-03  1:02                       ` Lipeng Zhu
  0 siblings, 1 reply; 38+ messages in thread
From: Vaseeharan Vinayagamoorthy @ 2024-01-02 11:57 UTC (permalink / raw)
  To: Lipeng Zhu, jakub
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo, Richard Earnshaw,
	Richard Earnshaw

[-- Attachment #1: Type: text/plain, Size: 6423 bytes --]

Hi Lipeng,

It looks like your draft patch to fix the builds for arm-none-eabi target is not merged yet, because our arm-none-eabi builds are still broken. Are you waiting for additional information, or would you be able to fix this issue?

Kind regards,
Vasee
________________________________
From: Richard Earnshaw <Richard.Earnshaw@foss.arm.com>
Sent: 15 December 2023 19:23
To: Lipeng Zhu <lipeng.zhu@intel.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; jakub@redhat.com <jakub@redhat.com>
Cc: fortran@gcc.gnu.org <fortran@gcc.gnu.org>; gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; hongjiu.lu@intel.com <hongjiu.lu@intel.com>; pan.deng@intel.com <pan.deng@intel.com>; rep.dot.nop@gmail.com <rep.dot.nop@gmail.com>; tianyou.li@intel.com <tianyou.li@intel.com>; tkoenig@netcologne.de <tkoenig@netcologne.de>; wangyang.guo@intel.com <wangyang.guo@intel.com>
Subject: Re: [PATCH v7] libgfortran: Replace mutex with rwlock



On 15/12/2023 11:31, Lipeng Zhu wrote:
>
>
> On 2023/12/14 23:50, Richard Earnshaw (lists) wrote:
>> On 09/12/2023 15:39, Lipeng Zhu wrote:
>>> This patch try to introduce the rwlock and split the read/write to
>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>>> to step into the insert_unit function is around 30%, in most instances,
>>> we can get the unit in the phase of reading the unit_cache or unit_root
>>> tree. So split the read/write phase by rwlock would be an approach to
>>> make it more parallel.
>>>
>>> BTW, the IPC metrics can gain around 9x in our test
>>> server with 220 cores. The benchmark we used is
>>> https://github.com/rwesson/NEAT
>>>
>>> libgcc/ChangeLog:
>>>
>>>     * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
>>>     (__gthrw): New function.
>>>     (__gthread_rwlock_rdlock): New function.
>>>     (__gthread_rwlock_tryrdlock): New function.
>>>     (__gthread_rwlock_wrlock): New function.
>>>     (__gthread_rwlock_trywrlock): New function.
>>>     (__gthread_rwlock_unlock): New function.
>>>
>>> libgfortran/ChangeLog:
>>>
>>>     * io/async.c (DEBUG_LINE): New macro.
>>>     * io/async.h (RWLOCK_DEBUG_ADD): New macro.
>>>     (CHECK_RDLOCK): New macro.
>>>     (CHECK_WRLOCK): New macro.
>>>     (TAIL_RWLOCK_DEBUG_QUEUE): New macro.
>>>     (IN_RWLOCK_DEBUG_QUEUE): New macro.
>>>     (RDLOCK): New macro.
>>>     (WRLOCK): New macro.
>>>     (RWUNLOCK): New macro.
>>>     (RD_TO_WRLOCK): New macro.
>>>     (INTERN_RDLOCK): New macro.
>>>     (INTERN_WRLOCK): New macro.
>>>     (INTERN_RWUNLOCK): New macro.
>>>     * io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
>>>     a comment.
>>>     (unit_lock): Remove including associated internal_proto.
>>>     (unit_rwlock): New declarations including associated internal_proto.
>>>     (dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
>>>     instead of __gthread_mutex_lock and __gthread_mutex_unlock on
>>>     unit_lock.
>>>     * io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
>>>     unit_rwlock instead of LOCK and UNLOCK on unit_lock.
>>>     (st_write_done_worker): Likewise.
>>>     * io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
>>>     comment. Use unit_rwlock variable instead of unit_lock variable.
>>>     (get_gfc_unit_from_unit_root): New function.
>>>     (get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
>>>     instead of LOCK and UNLOCK on unit_lock.
>>>     (close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
>>>     LOCK and UNLOCK on unit_lock.
>>>     (close_units): Likewise.
>>>     (newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
>>>     unit_lock.
>>>     * io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
>>>     instead of LOCK and UNLOCK on unit_lock.
>>>     (flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
>>>     of LOCK and UNLOCK on unit_lock.
>>>
>>
>> It looks like this has broken builds on arm-none-eabi when using newlib:
>>
>> In file included from
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran
>> /runtime/error.c:27:
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h: In
>> function
>> ‘dec_waiting_unlocked’:
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1023:3: error
>> : implicit declaration of function ‘WRLOCK’
>> [-Wimplicit-function-declaration]
>>   1023 |   WRLOCK (&unit_rwlock);
>>        |   ^~~~~~
>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1025:3: error
>> : implicit declaration of function ‘RWUNLOCK’
>> [-Wimplicit-function-declaration]
>>   1025 |   RWUNLOCK (&unit_rwlock);
>>        |   ^~~~~~~~
>>
>>
>> R.
>
> Hi Richard,
>
> The root cause is that the macro WRLOCK and RWUNLOCK are not defined in
> io.h. The reason of x86 platform not failed is that
> HAVE_ATOMIC_FETCH_ADD is defined then caused above macros were never
> been used. Code logic show as below:
> #ifdef HAVE_ATOMIC_FETCH_ADD
>    (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
> #else
>    WRLOCK (&unit_rwlock);
>    u->waiting--;
>    RWUNLOCK (&unit_rwlock);
> #endif
>
> I just draft a patch try to fix this bug, because I didn't have arm
> platform, would you help to validate if it was fixed on arm platform?
>
> diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
> index 15daa0995b1..c7f0f7d7d9e 100644
> --- a/libgfortran/io/io.h
> +++ b/libgfortran/io/io.h
> @@ -1020,9 +1020,15 @@ dec_waiting_unlocked (gfc_unit *u)
>   #ifdef HAVE_ATOMIC_FETCH_ADD
>     (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
>   #else
> -  WRLOCK (&unit_rwlock);
> +#ifdef __GTHREAD_RWLOCK_INIT
> +  __gthread_rwlock_wrlock (&unit_rwlock);
> +  u->waiting--;
> +  __gthread_rwlock_unlock (&unit_rwlock);
> +#else
> +  __gthread_mutex_lock (&unit_rwlock);
>     u->waiting--;
> -  RWUNLOCK (&unit_rwlock);
> +  __gthread_mutex_unlock (&unit_rwlock);
> +#endif
>   #endif
>   }
>
>
> Lipeng Zhu

Hi Lipeng,

Thanks for the quick reply.  I can confirm that with the above change
the bootstrap failure is fixed.  However, this shouldn't be considered a
formal review; libgfortran is not really my area.

I'll be away now until January 2nd.

Richard.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2024-01-02 11:57                     ` Vaseeharan Vinayagamoorthy
@ 2024-01-03  1:02                       ` Lipeng Zhu
  0 siblings, 0 replies; 38+ messages in thread
From: Lipeng Zhu @ 2024-01-03  1:02 UTC (permalink / raw)
  To: Vaseeharan Vinayagamoorthy, jakub
  Cc: fortran, gcc-patches, hongjiu.lu, pan.deng, rep.dot.nop,
	tianyou.li, tkoenig, wangyang.guo, Richard Earnshaw,
	Richard Earnshaw



On 2024/1/2 11:57, Vaseeharan Vinayagamoorthy wrote:
> Hi Lipeng,
> 
> It looks like your draft patch to fix the builds for arm-none-eabi target is not merged yet, because our arm-none-eabi builds are still broken. Are you waiting for additional information, or would you be able to fix this issue?
> 
> Kind regards,
> Vasee

Hi Vasee,

Actually I already sent a patch 
https://inbox.sourceware.org/gcc-patches/20231222023605.3894839-1-lipeng.zhu@intel.com/ 
to fix the build failure issue, now it is waiting for community to review.

Lipeng Zhu
> ________________________________
> From: Richard Earnshaw <Richard.Earnshaw@foss.arm.com>
> Sent: 15 December 2023 19:23
> To: Lipeng Zhu <lipeng.zhu@intel.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; jakub@redhat.com <jakub@redhat.com>
> Cc: fortran@gcc.gnu.org <fortran@gcc.gnu.org>; gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; hongjiu.lu@intel.com <hongjiu.lu@intel.com>; pan.deng@intel.com <pan.deng@intel.com>; rep.dot.nop@gmail.com <rep.dot.nop@gmail.com>; tianyou.li@intel.com <tianyou.li@intel.com>; tkoenig@netcologne.de <tkoenig@netcologne.de>; wangyang.guo@intel.com <wangyang.guo@intel.com>
> Subject: Re: [PATCH v7] libgfortran: Replace mutex with rwlock
> 
> 
> 
> On 15/12/2023 11:31, Lipeng Zhu wrote:
>>
>>
>> On 2023/12/14 23:50, Richard Earnshaw (lists) wrote:
>>> On 09/12/2023 15:39, Lipeng Zhu wrote:
>>>> This patch try to introduce the rwlock and split the read/write to
>>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>>> increase CPU efficiency. In the get_gfc_unit function, the percentage
>>>> to step into the insert_unit function is around 30%, in most instances,
>>>> we can get the unit in the phase of reading the unit_cache or unit_root
>>>> tree. So split the read/write phase by rwlock would be an approach to
>>>> make it more parallel.
>>>>
>>>> BTW, the IPC metrics can gain around 9x in our test
>>>> server with 220 cores. The benchmark we used is
>>>> https://github.com/rwesson/NEAT
>>>>
>>>> libgcc/ChangeLog:
>>>>
>>>>      * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro.
>>>>      (__gthrw): New function.
>>>>      (__gthread_rwlock_rdlock): New function.
>>>>      (__gthread_rwlock_tryrdlock): New function.
>>>>      (__gthread_rwlock_wrlock): New function.
>>>>      (__gthread_rwlock_trywrlock): New function.
>>>>      (__gthread_rwlock_unlock): New function.
>>>>
>>>> libgfortran/ChangeLog:
>>>>
>>>>      * io/async.c (DEBUG_LINE): New macro.
>>>>      * io/async.h (RWLOCK_DEBUG_ADD): New macro.
>>>>      (CHECK_RDLOCK): New macro.
>>>>      (CHECK_WRLOCK): New macro.
>>>>      (TAIL_RWLOCK_DEBUG_QUEUE): New macro.
>>>>      (IN_RWLOCK_DEBUG_QUEUE): New macro.
>>>>      (RDLOCK): New macro.
>>>>      (WRLOCK): New macro.
>>>>      (RWUNLOCK): New macro.
>>>>      (RD_TO_WRLOCK): New macro.
>>>>      (INTERN_RDLOCK): New macro.
>>>>      (INTERN_WRLOCK): New macro.
>>>>      (INTERN_RWUNLOCK): New macro.
>>>>      * io/io.h (struct gfc_unit): Change UNIT_LOCK to UNIT_RWLOCK in
>>>>      a comment.
>>>>      (unit_lock): Remove including associated internal_proto.
>>>>      (unit_rwlock): New declarations including associated internal_proto.
>>>>      (dec_waiting_unlocked): Use WRLOCK and RWUNLOCK on unit_rwlock
>>>>      instead of __gthread_mutex_lock and __gthread_mutex_unlock on
>>>>      unit_lock.
>>>>      * io/transfer.c (st_read_done_worker): Use WRLOCK and RWUNLOCK on
>>>>      unit_rwlock instead of LOCK and UNLOCK on unit_lock.
>>>>      (st_write_done_worker): Likewise.
>>>>      * io/unit.c: Change UNIT_LOCK to UNIT_RWLOCK in 'IO locking rules'
>>>>      comment. Use unit_rwlock variable instead of unit_lock variable.
>>>>      (get_gfc_unit_from_unit_root): New function.
>>>>      (get_gfc_unit): Use RDLOCK, WRLOCK and RWUNLOCK on unit_rwlock
>>>>      instead of LOCK and UNLOCK on unit_lock.
>>>>      (close_unit_1): Use WRLOCK and RWUNLOCK on unit_rwlock instead of
>>>>      LOCK and UNLOCK on unit_lock.
>>>>      (close_units): Likewise.
>>>>      (newunit_alloc): Use RWUNLOCK on unit_rwlock instead of UNLOCK on
>>>>      unit_lock.
>>>>      * io/unix.c (find_file): Use RDLOCK and RWUNLOCK on unit_rwlock
>>>>      instead of LOCK and UNLOCK on unit_lock.
>>>>      (flush_all_units): Use WRLOCK and RWUNLOCK on unit_rwlock instead
>>>>      of LOCK and UNLOCK on unit_lock.
>>>>
>>>
>>> It looks like this has broken builds on arm-none-eabi when using newlib:
>>>
>>> In file included from
>>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran
>>> /runtime/error.c:27:
>>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h: In
>>> function
>>> ‘dec_waiting_unlocked’:
>>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1023:3: error
>>> : implicit declaration of function ‘WRLOCK’
>>> [-Wimplicit-function-declaration]
>>>    1023 |   WRLOCK (&unit_rwlock);
>>>         |   ^~~~~~
>>> /work/rearnsha/gnusrc/nightly/gcc-cross/master/libgfortran/io/io.h:1025:3: error
>>> : implicit declaration of function ‘RWUNLOCK’
>>> [-Wimplicit-function-declaration]
>>>    1025 |   RWUNLOCK (&unit_rwlock);
>>>         |   ^~~~~~~~
>>>
>>>
>>> R.
>>
>> Hi Richard,
>>
>> The root cause is that the macro WRLOCK and RWUNLOCK are not defined in
>> io.h. The reason of x86 platform not failed is that
>> HAVE_ATOMIC_FETCH_ADD is defined then caused above macros were never
>> been used. Code logic show as below:
>> #ifdef HAVE_ATOMIC_FETCH_ADD
>>     (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
>> #else
>>     WRLOCK (&unit_rwlock);
>>     u->waiting--;
>>     RWUNLOCK (&unit_rwlock);
>> #endif
>>
>> I just draft a patch try to fix this bug, because I didn't have arm
>> platform, would you help to validate if it was fixed on arm platform?
>>
>> diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
>> index 15daa0995b1..c7f0f7d7d9e 100644
>> --- a/libgfortran/io/io.h
>> +++ b/libgfortran/io/io.h
>> @@ -1020,9 +1020,15 @@ dec_waiting_unlocked (gfc_unit *u)
>>    #ifdef HAVE_ATOMIC_FETCH_ADD
>>      (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
>>    #else
>> -  WRLOCK (&unit_rwlock);
>> +#ifdef __GTHREAD_RWLOCK_INIT
>> +  __gthread_rwlock_wrlock (&unit_rwlock);
>> +  u->waiting--;
>> +  __gthread_rwlock_unlock (&unit_rwlock);
>> +#else
>> +  __gthread_mutex_lock (&unit_rwlock);
>>      u->waiting--;
>> -  RWUNLOCK (&unit_rwlock);
>> +  __gthread_mutex_unlock (&unit_rwlock);
>> +#endif
>>    #endif
>>    }
>>
>>
>> Lipeng Zhu
> 
> Hi Lipeng,
> 
> Thanks for the quick reply.  I can confirm that with the above change
> the bootstrap failure is fixed.  However, this shouldn't be considered a
> formal review; libgfortran is not really my area.
> 
> I'll be away now until January 2nd.
> 
> Richard.
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: RE: [PATCH v7] libgfortran: Replace mutex with rwlock
  2023-12-21 11:42                         ` Thomas Schwinge
  2023-12-22  6:48                           ` Lipeng Zhu
@ 2024-01-03  9:14                           ` Lipeng Zhu
  2024-01-17 13:25                             ` Lipeng Zhu
  1 sibling, 1 reply; 38+ messages in thread
From: Lipeng Zhu @ 2024-01-03  9:14 UTC (permalink / raw)
  To: Thomas Schwinge, H.J. Lu
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang, Tobias Burnus



On 2023/12/21 19:42, Thomas Schwinge wrote:
> Hi!
> 
> On 2023-12-13T21:52:29+0100, I wrote:
>> On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
>>> On 2023/12/12 1:45, H.J. Lu wrote:
>>>> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> wrote:
>>>>> On 2023/12/9 23:23, Jakub Jelinek wrote:
>>>>>> On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
>>>>>>> This patch try to introduce the rwlock and split the read/write to
>>>>>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>>>>>> increase CPU efficiency. In the get_gfc_unit function, the
>>>>>>> percentage to step into the insert_unit function is around 30%, in
>>>>>>> most instances, we can get the unit in the phase of reading the
>>>>>>> unit_cache or unit_root tree. So split the read/write phase by
>>>>>>> rwlock would be an approach to make it more parallel.
>>>>>>>
>>>>>>> BTW, the IPC metrics can gain around 9x in our test server with
>>>>>>> 220 cores. The benchmark we used is
>>>>>>> https://github.com/rwesson/NEAT
>>
>>>>>> Ok for trunk, thanks.
>>
>>>>> Thanks! Looking forward to landing to trunk.
>>
>>>> Pushed for you.
> 
>> I've just filed <https://gcc.gnu.org/PR113005>
>> "'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' execution test timeouts".
>> Would you be able to look into that?
> 
> See my update in there.
> 
> 
> Grüße
>   Thomas
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
> 

Updated in https://gcc.gnu.org/PR113005. Could you help to verify if the 
draft patch would fix the execution test timeout issue on your side?

diff --git a/libgomp/testsuite/libgomp.fortran/rwlock_1.f90 
b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
index f90ecbeb00f..1c271ae031d 100644
--- a/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
+++ b/libgomp/testsuite/libgomp.fortran/rwlock_1.f90
@@ -7,13 +7,12 @@ program main
    use omp_lib
    implicit none
    integer:: unit_number, v1, v2, i
-  character(11) :: file_name
+  character(16) :: file_name
    character(3) :: async = "no"
    !$omp parallel private (unit_number, v1, v2, file_name, async, i)
      do i = 0, 100
        unit_number = 10 + omp_get_thread_num ()
-      write (file_name, "(I3, A)") unit_number, "_tst.dat"
-      file_name = adjustl(file_name)
+      write (file_name, "(I5.5, A)") unit_number, "_tst.dat"
        open (unit_number, file=file_name, asynchronous="yes")
        ! call inquire with file parameter to test find_file in unix.c
        inquire (file=file_name, asynchronous=async)

Lipeng Zhu


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v7] libgfortran: Replace mutex with rwlock
  2024-01-03  9:14                           ` Lipeng Zhu
@ 2024-01-17 13:25                             ` Lipeng Zhu
  0 siblings, 0 replies; 38+ messages in thread
From: Lipeng Zhu @ 2024-01-17 13:25 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: Jakub Jelinek, fortran, gcc-patches, Deng, Pan, rep.dot.nop, Li,
	Tianyou, tkoenig, Guo, Wangyang, Tobias Burnus, H.J. Lu



On 1/3/2024 5:14 PM, Lipeng Zhu wrote:
> 
> 
> On 2023/12/21 19:42, Thomas Schwinge wrote:
>> Hi!
>>
>> On 2023-12-13T21:52:29+0100, I wrote:
>>> On 2023-12-12T02:05:26+0000, "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:
>>>> On 2023/12/12 1:45, H.J. Lu wrote:
>>>>> On Sat, Dec 9, 2023 at 7:25 PM Zhu, Lipeng <lipeng.zhu@intel.com> 
>>>>> wrote:
>>>>>> On 2023/12/9 23:23, Jakub Jelinek wrote:
>>>>>>> On Sat, Dec 09, 2023 at 10:39:45AM -0500, Lipeng Zhu wrote:
>>>>>>>> This patch try to introduce the rwlock and split the read/write to
>>>>>>>> unit_root tree and unit_cache with rwlock instead of the mutex to
>>>>>>>> increase CPU efficiency. In the get_gfc_unit function, the
>>>>>>>> percentage to step into the insert_unit function is around 30%, in
>>>>>>>> most instances, we can get the unit in the phase of reading the
>>>>>>>> unit_cache or unit_root tree. So split the read/write phase by
>>>>>>>> rwlock would be an approach to make it more parallel.
>>>>>>>>
>>>>>>>> BTW, the IPC metrics can gain around 9x in our test server with
>>>>>>>> 220 cores. The benchmark we used is
>>>>>>>> https://github.com/rwesson/NEAT
>>>
>>>>>>> Ok for trunk, thanks.
>>>
>>>>>> Thanks! Looking forward to landing to trunk.
>>>
>>>>> Pushed for you.
>>
>>> I've just filed <https://gcc.gnu.org/PR113005>
>>> "'libgomp.fortran/rwlock_1.f90', 'libgomp.fortran/rwlock_3.f90' 
>>> execution test timeouts".
>>> Would you be able to look into that?
>>
>> See my update in there.
>>
>>
>> Grüße
>>   Thomas
>> -----------------
>> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 
>> 201, 80634 München; Gesellschaft mit beschränkter Haftung; 
>> Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: 
>> München; Registergericht München, HRB 106955
>>
> 
> Updated in https://gcc.gnu.org/PR113005. Could you help to verify if the 
> draft patch would fix the execution test timeout issue on your side?
> 

Hi Thomas,

Any feedback from your side?

Regards,
Lipeng Zhu




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
@ 2023-05-25 12:40 Zhu, Lipeng
  0 siblings, 0 replies; 38+ messages in thread
From: Zhu, Lipeng @ 2023-05-25 12:40 UTC (permalink / raw)
  To: Thomas Koenig
  Cc: fortran, gcc-patches, rep.dot.nop, hongjiu.lu, tianyou.li,
	pan.deng, jakub, wangyang.guo



On 1/1/1970 8:00 AM, Thomas Koenig wrote:
> Hi Lipeng,
> 
>> May I know any comment or concern on this patch, thanks for your time :)
>>
> 
> Thanks for your patience in getting this reviewed.
> 
> A few remarks / questions.
> 
> Which strategy is used in this implementation, read-preferring or write-preferring?  And if read-
> preferring is used, is there a danger of deadlock if people do unreasonable things?
> Maybe you could explain that, also in a comment in the code >
> Can you add some sort of torture test case(s) which does a lot of opening/closing/reading/writing,
> possibly with asynchronous I/O and/or pthreads, to catch possible problems?  If there is a system
> dependency or some race condition, chances are that regression testers will catch this.

Hi Thomas,

Thanks for your time for the review.
Sure, I will add test case according to your suggestions and update the 
comment based on the implementation of "read-preferring" strategy.

Thanks,
Lipeng Zhu
> 
> With this, the libgfortran parts are OK, unless somebody else has more comments, so give this a couple
> of days.  I cannot approve the libgcc parts, that would be somebody else (Jakub?)
> 
> Best regards
> 
> 	Thomas
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4] libgfortran: Replace mutex with rwlock
  2023-05-08  9:44 ` Lipeng Zhu
@ 2023-05-08 10:28   ` Bernhard Reutner-Fischer
  0 siblings, 0 replies; 38+ messages in thread
From: Bernhard Reutner-Fischer @ 2023-05-08 10:28 UTC (permalink / raw)
  To: Lipeng Zhu
  Cc: fortran, gcc-patches, hongjiu.lu, tianyou.li, pan.deng, wangyang.guo

On Mon,  8 May 2023 17:44:43 +0800
Lipeng Zhu <lipeng.zhu@intel.com> wrote:

> This patch try to introduce the rwlock and split the read/write to
> unit_root tree and unit_cache with rwlock instead of the mutex to
> increase CPU efficiency. In the get_gfc_unit function, the percentage
> to step into the insert_unit function is around 30%, in most instances,
> we can get the unit in the phase of reading the unit_cache or unit_root
> tree. So split the read/write phase by rwlock would be an approach to
> make it more parallel.
> 
> BTW, the IPC metrics can gain around 9x in our test
> server with 220 cores. The benchmark we used is
> https://github.com/rwesson/NEAT

See commentary typos below.
You did not state if you regression tested the patch?
Other than that it LGTM but i cannot approve it.

> diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h
> index ad226c8e856..0033cc74252 100644
> --- a/libgfortran/io/async.h
> +++ b/libgfortran/io/async.h
> @@ -210,6 +210,128 @@
>      DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \
>    } while (0)
>  
> +#ifdef __GTHREAD_RWLOCK_INIT
> +#define RWLOCK_DEBUG_ADD(rwlock) do {		\
> +    aio_rwlock_debug *n;				\
> +    n = xmalloc (sizeof(aio_rwlock_debug));	\

Missing space before the open brace: sizeof (

> diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c
> index 82664dc5f98..62f1db21d34 100644
> --- a/libgfortran/io/unit.c
> +++ b/libgfortran/io/unit.c
> @@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
>  
>  
>  /* IO locking rules:
> -   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
> +   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
> +   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
> +   and UNIT_CACHE to increase CPU efficiency.

s/spilt/split. Maybe:

Using an rwlock improves efficiency by allowing us to separate readers
and writers of both UNIT_ROOT and UNIT_CACHE.

> @@ -350,6 +356,17 @@ retry:
>        if (c == 0)
>  	break;
>      }
> +  /* We did not find a unit in the cache nor in the unit list, create a new
> +    (locked) unit and insert into the unit list and cache.
> +    Manipulating either or both the unit list and the unit cache requires to
> +    hold a write-lock [for obvious reasons]:
> +    1. By separating the read/write lock, it will greatly reduce the contention
> +       at the read part, while write part is not always necessary or most
> +       unlikely once the unit hit in cache.

+    By separating the read/write lock, we will greatly reduce the contention
+    on the read part, while the write part is unlikely once the unit hits
+    the cache.

> +    2. We try to balance the implementation complexity and the performance
> +       gains that fit into current cases we observed by just using a
> +       pthread_rwlock. */

Let's drop 2.
thanks,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v4] libgfortran: Replace mutex with rwlock
       [not found] <20230424214534.77117b73 () nbbrfq>
@ 2023-05-08  9:44 ` Lipeng Zhu
  2023-05-08 10:28   ` Bernhard Reutner-Fischer
  0 siblings, 1 reply; 38+ messages in thread
From: Lipeng Zhu @ 2023-05-08  9:44 UTC (permalink / raw)
  To: rep.dot.nop, fortran, gcc-patches
  Cc: hongjiu.lu, tianyou.li, pan.deng, wangyang.guo, Lipeng Zhu

This patch try to introduce the rwlock and split the read/write to
unit_root tree and unit_cache with rwlock instead of the mutex to
increase CPU efficiency. In the get_gfc_unit function, the percentage
to step into the insert_unit function is around 30%, in most instances,
we can get the unit in the phase of reading the unit_cache or unit_root
tree. So split the read/write phase by rwlock would be an approach to
make it more parallel.

BTW, the IPC metrics can gain around 9x in our test
server with 220 cores. The benchmark we used is
https://github.com/rwesson/NEAT

libgcc/ChangeLog:

    * gthr-posix.h (__GTHREAD_RWLOCK_INIT): New macro
    (__gthrw): New function
    (__gthread_rwlock_rdlock): New function
    (__gthread_rwlock_tryrdlock): New function
    (__gthread_rwlock_wrlock): New function
    (__gthread_rwlock_trywrlock): New function
    (__gthread_rwlock_unlock): New function

libgfortran/ChangeLog:

    * io/async.c (DEBUG_LINE): New
    * io/async.h (RWLOCK_DEBUG_ADD): New macro
    (CHECK_RDLOCK): New macro
    (CHECK_WRLOCK): New macro
    (TAIL_RWLOCK_DEBUG_QUEUE): New macro
    (IN_RWLOCK_DEBUG_QUEUE): New macro
    (RDLOCK): New macro
    (WRLOCK): New macro
    (RWUNLOCK): New macro
    (RD_TO_WRLOCK): New macro
    (INTERN_RDLOCK): New macro
    (INTERN_WRLOCK): New macro
    (INTERN_RWUNLOCK): New macro
    * io/io.h (internal_proto): Define unit_rwlock
    * io/transfer.c (st_read_done_worker): Relace unit_lock with unit_rwlock
    (st_write_done_worker): Relace unit_lock with unit_rwlock
    * io/unit.c (get_gfc_unit): Relace unit_lock with unit_rwlock
    (if): Relace unit_lock with unit_rwlock
    (close_unit_1): Relace unit_lock with unit_rwlock
    (close_units): Relace unit_lock with unit_rwlock
    (newunit_alloc): Relace unit_lock with unit_rwlock
    * io/unix.c (flush_all_units): Relace unit_lock with unit_rwlock

---
v1 -> v2:
Limit the pthread_rwlock usage in libgcc only when __cplusplus isn't defined.

v2 -> v3:
Rebase the patch with trunk branch.

v3 -> v4:
Update the comments.

Reviewed-by: Hongjiu Lu <hongjiu.lu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
---
 libgcc/gthr-posix.h       |  60 +++++++++++++++
 libgfortran/io/async.c    |   4 +
 libgfortran/io/async.h    | 151 ++++++++++++++++++++++++++++++++++++++
 libgfortran/io/io.h       |  15 ++--
 libgfortran/io/transfer.c |   8 +-
 libgfortran/io/unit.c     |  75 +++++++++++--------
 libgfortran/io/unix.c     |  16 ++--
 7 files changed, 283 insertions(+), 46 deletions(-)

diff --git a/libgcc/gthr-posix.h b/libgcc/gthr-posix.h
index aebcfdd9f4c..73283082997 100644
--- a/libgcc/gthr-posix.h
+++ b/libgcc/gthr-posix.h
@@ -48,6 +48,9 @@ typedef pthread_t __gthread_t;
 typedef pthread_key_t __gthread_key_t;
 typedef pthread_once_t __gthread_once_t;
 typedef pthread_mutex_t __gthread_mutex_t;
+#ifndef __cplusplus
+typedef pthread_rwlock_t __gthread_rwlock_t;
+#endif
 typedef pthread_mutex_t __gthread_recursive_mutex_t;
 typedef pthread_cond_t __gthread_cond_t;
 typedef struct timespec __gthread_time_t;
@@ -58,6 +61,9 @@ typedef struct timespec __gthread_time_t;
 
 #define __GTHREAD_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER
 #define __GTHREAD_MUTEX_INIT_FUNCTION __gthread_mutex_init_function
+#ifndef __cplusplus
+#define __GTHREAD_RWLOCK_INIT PTHREAD_RWLOCK_INITIALIZER
+#endif
 #define __GTHREAD_ONCE_INIT PTHREAD_ONCE_INIT
 #if defined(PTHREAD_RECURSIVE_MUTEX_INITIALIZER)
 #define __GTHREAD_RECURSIVE_MUTEX_INIT PTHREAD_RECURSIVE_MUTEX_INITIALIZER
@@ -135,6 +141,13 @@ __gthrw(pthread_mutexattr_init)
 __gthrw(pthread_mutexattr_settype)
 __gthrw(pthread_mutexattr_destroy)
 
+#ifndef __cplusplus
+__gthrw(pthread_rwlock_rdlock)
+__gthrw(pthread_rwlock_tryrdlock)
+__gthrw(pthread_rwlock_wrlock)
+__gthrw(pthread_rwlock_trywrlock)
+__gthrw(pthread_rwlock_unlock)
+#endif
 
 #if defined(_LIBOBJC) || defined(_LIBOBJC_WEAK)
 /* Objective-C.  */
@@ -885,6 +898,53 @@ __gthread_cond_destroy (__gthread_cond_t* __cond)
   return __gthrw_(pthread_cond_destroy) (__cond);
 }
 
+#ifndef __cplusplus
+static inline int
+__gthread_rwlock_rdlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_rdlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_tryrdlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_tryrdlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_wrlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_wrlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_trywrlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_trywrlock) (__rwlock);
+  else
+    return 0;
+}
+
+static inline int
+__gthread_rwlock_unlock (__gthread_rwlock_t *__rwlock)
+{
+  if (__gthread_active_p ())
+    return __gthrw_(pthread_rwlock_unlock) (__rwlock);
+  else
+    return 0;
+}
+#endif
+
 #endif /* _LIBOBJC */
 
 #endif /* ! GCC_GTHR_POSIX_H */
diff --git a/libgfortran/io/async.c b/libgfortran/io/async.c
index 81d1d8175aa..a97272ce5e4 100644
--- a/libgfortran/io/async.c
+++ b/libgfortran/io/async.c
@@ -42,6 +42,10 @@ DEBUG_LINE (__thread const char *aio_prefix = MPREFIX);
 
 DEBUG_LINE (__gthread_mutex_t debug_queue_lock = __GTHREAD_MUTEX_INIT;)
 DEBUG_LINE (aio_lock_debug *aio_debug_head = NULL;)
+#ifdef __GTHREAD_RWLOCK_INIT
+DEBUG_LINE (aio_rwlock_debug *aio_rwlock_debug_head = NULL;)
+DEBUG_LINE (__gthread_rwlock_t debug_queue_rwlock = __GTHREAD_RWLOCK_INIT;)
+#endif
 
 /* Current unit for asynchronous I/O.  Needed for error reporting.  */
 
diff --git a/libgfortran/io/async.h b/libgfortran/io/async.h
index ad226c8e856..0033cc74252 100644
--- a/libgfortran/io/async.h
+++ b/libgfortran/io/async.h
@@ -210,6 +210,128 @@
     DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #mutex, mutex); \
   } while (0)
 
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RWLOCK_DEBUG_ADD(rwlock) do {		\
+    aio_rwlock_debug *n;				\
+    n = xmalloc (sizeof(aio_rwlock_debug));	\
+    n->prev = TAIL_RWLOCK_DEBUG_QUEUE;			\
+    if (n->prev)				\
+      n->prev->next = n;			\
+    n->next = NULL;				\
+    n->line = __LINE__;				\
+    n->func = __FUNCTION__;			\
+    n->rw = rwlock;				\
+    if (!aio_rwlock_debug_head) {			\
+      aio_rwlock_debug_head = n;			\
+    }						\
+  } while (0)
+
+#define CHECK_RDLOCK(rwlock, status) do {					\
+    aio_rwlock_debug *curr;						\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    if (__gthread_rwlock_tryrdlock (rwlock)) {				\
+      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
+	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
+      } else								\
+	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
+    }									\
+    else {								\
+      __gthread_rwlock_unlock (rwlock);					\
+      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
+    }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+  }while (0)
+
+#define CHECK_WRLOCK(rwlock, status) do {					\
+    aio_rwlock_debug *curr;						\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    if (__gthread_rwlock_trywrlock (rwlock)) {				\
+      if ((curr = IN_RWLOCK_DEBUG_QUEUE (rwlock))) {				\
+	sprintf (status, DEBUG_RED "%s():%d" DEBUG_NORM, curr->func, curr->line); \
+      } else								\
+	sprintf (status, DEBUG_RED "unknown" DEBUG_NORM);			\
+    }									\
+    else {								\
+      __gthread_rwlock_unlock (rwlock);					\
+      sprintf (status, DEBUG_GREEN "rwunlocked" DEBUG_NORM);			\
+    }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+  }while (0)
+
+#define TAIL_RWLOCK_DEBUG_QUEUE ({			\
+      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
+      while (curr && curr->next) {		\
+	curr = curr->next;			\
+      }						\
+      curr;					\
+    })
+
+#define IN_RWLOCK_DEBUG_QUEUE(rwlock) ({		\
+      __label__ end;				\
+      aio_rwlock_debug *curr = aio_rwlock_debug_head;	\
+      while (curr) {				\
+	if (curr->rw == rwlock) {			\
+	  goto end;				\
+	}					\
+	curr = curr->next;			\
+      }						\
+    end:;					\
+      curr;					\
+    })
+
+#define RDLOCK(rwlock) do {						\
+    char status[200];							\
+    CHECK_RDLOCK (rwlock, status);						\
+    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
+		 DEBUG_RED "RDLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
+    INTERN_RDLOCK (rwlock);							\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    RWLOCK_DEBUG_ADD (rwlock);						\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
+  } while (0)
+
+#define WRLOCK(rwlock) do {						\
+    char status[200];							\
+    CHECK_WRLOCK (rwlock, status);						\
+    DEBUG_PRINTF ("%s%-42s prev: %-35s %20s():%-5d %18p\n", aio_prefix,	\
+		 DEBUG_RED "WRLOCK: " DEBUG_NORM #rwlock, status, __FUNCTION__, __LINE__, (void *) rwlock); \
+    INTERN_WRLOCK (rwlock);							\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    RWLOCK_DEBUG_ADD (rwlock);						\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    DEBUG_PRINTF ("%s" DEBUG_RED "ACQ:" DEBUG_NORM " %-30s %78p\n", aio_prefix, #rwlock, rwlock); \
+  } while (0)
+
+#define RWUNLOCK(rwlock) do {						\
+    aio_rwlock_debug *curr;						\
+    DEBUG_PRINTF ("%s%-75s %20s():%-5d %18p\n", aio_prefix, DEBUG_GREEN "RWUNLOCK: " DEBUG_NORM #rwlock, \
+		 __FUNCTION__, __LINE__, (void *) rwlock);		\
+    INTERN_WRLOCK (&debug_queue_rwlock);					\
+    curr = IN_RWLOCK_DEBUG_QUEUE (rwlock);					\
+    if (curr)								\
+      {									\
+	if (curr->prev)							\
+	  curr->prev->next = curr->next;				\
+	if (curr->next) {						\
+	  curr->next->prev = curr->prev;				\
+	  if (curr == aio_rwlock_debug_head)					\
+	    aio_rwlock_debug_head = curr->next;				\
+	} else {							\
+	  if (curr == aio_rwlock_debug_head)					\
+	    aio_rwlock_debug_head = NULL;					\
+	}								\
+	free (curr);							\
+      }									\
+    INTERN_RWUNLOCK (&debug_queue_rwlock);					\
+    INTERN_RWUNLOCK (rwlock);						\
+  } while (0)
+
+#define RD_TO_WRLOCK(rwlock) \
+  RWUNLOCK (rwlock);\
+  WRLOCK (rwlock);
+#endif
+
 #define DEBUG_LINE(...) __VA_ARGS__
 
 #else
@@ -221,12 +343,31 @@
 #define LOCK(mutex) INTERN_LOCK (mutex)
 #define UNLOCK(mutex) INTERN_UNLOCK (mutex)
 #define TRYLOCK(mutex) (__gthread_mutex_trylock (mutex))
+#ifdef __GTHREAD_RWLOCK_INIT
+#define RDLOCK(rwlock) INTERN_RDLOCK (rwlock)
+#define WRLOCK(rwlock) INTERN_WRLOCK (rwlock)
+#define RWUNLOCK(rwlock) INTERN_RWUNLOCK (rwlock)
+#define RD_TO_WRLOCK(rwlock) \
+  RWUNLOCK (rwlock);\
+  WRLOCK (rwlock);
+#endif
+#endif
+
+#ifndef __GTHREAD_RWLOCK_INIT
+#define RDLOCK(rwlock) LOCK (rwlock)
+#define WRLOCK(rwlock) LOCK (rwlock)
+#define RWUNLOCK(rwlock) UNLOCK (rwlock)
+#define RD_TO_WRLOCK(rwlock) {}
 #endif
 
 #define INTERN_LOCK(mutex) T_ERROR (__gthread_mutex_lock, mutex);
 
 #define INTERN_UNLOCK(mutex) T_ERROR (__gthread_mutex_unlock, mutex);
 
+#define INTERN_RDLOCK(rwlock) T_ERROR (__gthread_rwlock_rdlock, rwlock);
+#define INTERN_WRLOCK(rwlock) T_ERROR (__gthread_rwlock_wrlock, rwlock);
+#define INTERN_RWUNLOCK(rwlock) T_ERROR (__gthread_rwlock_unlock, rwlock);
+
 #if ASYNC_IO
 
 /* au->lock has to be held when calling this macro.  */
@@ -288,8 +429,18 @@ DEBUG_LINE (typedef struct aio_lock_debug{
   struct aio_lock_debug *prev;
 } aio_lock_debug;)
 
+DEBUG_LINE (typedef struct aio_rwlock_debug{
+  __gthread_rwlock_t *rw;
+  int line;
+  const char *func;
+  struct aio_rwlock_debug *next;
+  struct aio_rwlock_debug *prev;
+} aio_rwlock_debug;)
+
 DEBUG_LINE (extern aio_lock_debug *aio_debug_head;)
 DEBUG_LINE (extern __gthread_mutex_t debug_queue_lock;)
+DEBUG_LINE (extern aio_rwlock_debug *aio_rwlock_debug_head;)
+DEBUG_LINE (extern __gthread_rwlock_t debug_queue_rwlock;)
 
 /* Thread - local storage of the current unit we are looking at. Needed for
    error reporting.  */
diff --git a/libgfortran/io/io.h b/libgfortran/io/io.h
index ecdf1dd3f05..15daa0995b1 100644
--- a/libgfortran/io/io.h
+++ b/libgfortran/io/io.h
@@ -690,7 +690,7 @@ typedef struct gfc_unit
      from the UNIT_ROOT tree, but doesn't free it and the
      last of the waiting threads will do that.
      This must be either atomically increased/decreased, or
-     always guarded by UNIT_LOCK.  */
+     always guarded by UNIT_RWLOCK.  */
   int waiting;
   /* Flag set by close_unit if the unit as been closed.
      Must be manipulated under unit's lock.  */
@@ -769,8 +769,13 @@ internal_proto(default_recl);
 extern gfc_unit *unit_root;
 internal_proto(unit_root);
 
-extern __gthread_mutex_t unit_lock;
-internal_proto(unit_lock);
+#ifdef __GTHREAD_RWLOCK_INIT
+extern __gthread_rwlock_t unit_rwlock;
+internal_proto(unit_rwlock);
+#else
+extern __gthread_mutex_t unit_rwlock;
+internal_proto(unit_rwlock);
+#endif
 
 extern int close_unit (gfc_unit *);
 internal_proto(close_unit);
@@ -1015,9 +1020,9 @@ dec_waiting_unlocked (gfc_unit *u)
 #ifdef HAVE_ATOMIC_FETCH_ADD
   (void) __atomic_fetch_add (&u->waiting, -1, __ATOMIC_RELAXED);
 #else
-  __gthread_mutex_lock (&unit_lock);
+  WRLOCK (&unit_rwlock);
   u->waiting--;
-  __gthread_mutex_unlock (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 #endif
 }
 
diff --git a/libgfortran/io/transfer.c b/libgfortran/io/transfer.c
index 8bb5d1101ca..b01f45b80f6 100644
--- a/libgfortran/io/transfer.c
+++ b/libgfortran/io/transfer.c
@@ -4539,9 +4539,9 @@ st_read_done_worker (st_parameter_dt *dtp, bool unlock)
    if (free_newunit)
      {
        /* Avoid inverse lock issues by placing after unlock_unit.  */
-       LOCK (&unit_lock);
+       WRLOCK (&unit_rwlock);
        newunit_free (dtp->common.unit);
-       UNLOCK (&unit_lock);
+       RWUNLOCK (&unit_rwlock);
      }
 }
 
@@ -4636,9 +4636,9 @@ st_write_done_worker (st_parameter_dt *dtp, bool unlock)
    if (free_newunit)
      {
        /* Avoid inverse lock issues by placing after unlock_unit.  */
-       LOCK (&unit_lock);
+       WRLOCK (&unit_rwlock);
        newunit_free (dtp->common.unit);
-       UNLOCK (&unit_lock);
+       RWUNLOCK (&unit_rwlock);
      }
 }
 
diff --git a/libgfortran/io/unit.c b/libgfortran/io/unit.c
index 82664dc5f98..62f1db21d34 100644
--- a/libgfortran/io/unit.c
+++ b/libgfortran/io/unit.c
@@ -33,34 +33,36 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 
 
 /* IO locking rules:
-   UNIT_LOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
+   UNIT_RWLOCK is a master lock, protecting UNIT_ROOT tree and UNIT_CACHE.
+   And use the rwlock to spilt read and write phase to UNIT_ROOT tree
+   and UNIT_CACHE to increase CPU efficiency.
    Concurrent use of different units should be supported, so
    each unit has its own lock, LOCK.
    Open should be atomic with its reopening of units and list_read.c
    in several places needs find_unit another unit while holding stdin
-   unit's lock, so it must be possible to acquire UNIT_LOCK while holding
+   unit's lock, so it must be possible to acquire UNIT_RWLOCK while holding
    some unit's lock.  Therefore to avoid deadlocks, it is forbidden
-   to acquire unit's private locks while holding UNIT_LOCK, except
+   to acquire unit's private locks while holding UNIT_RWLOCK, except
    for freshly created units (where no other thread can get at their
    address yet) or when using just trylock rather than lock operation.
    In addition to unit's private lock each unit has a WAITERS counter
    and CLOSED flag.  WAITERS counter must be either only
    atomically incremented/decremented in all places (if atomic builtins
-   are supported), or protected by UNIT_LOCK in all places (otherwise).
+   are supported), or protected by UNIT_RWLOCK in all places (otherwise).
    CLOSED flag must be always protected by unit's LOCK.
-   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_LOCK held,
+   After finding a unit in UNIT_CACHE or UNIT_ROOT with UNIT_RWLOCK held,
    WAITERS must be incremented to avoid concurrent close from freeing
-   the unit between unlocking UNIT_LOCK and acquiring unit's LOCK.
-   Unit freeing is always done under UNIT_LOCK.  If close_unit sees any
+   the unit between unlocking UNIT_RWLOCK and acquiring unit's LOCK.
+   Unit freeing is always done under UNIT_RWLOCK.  If close_unit sees any
    WAITERS, it doesn't free the unit but instead sets the CLOSED flag
    and the thread that decrements WAITERS to zero while CLOSED flag is
-   set is responsible for freeing it (while holding UNIT_LOCK).
+   set is responsible for freeing it (while holding UNIT_RWLOCK).
    flush_all_units operation is iterating over the unit tree with
-   increasing UNIT_NUMBER while holding UNIT_LOCK and attempting to
+   increasing UNIT_NUMBER while holding UNIT_RWLOCK and attempting to
    flush each unit (and therefore needs the unit's LOCK held as well).
    To avoid deadlocks, it just trylocks the LOCK and if unsuccessful,
-   remembers the current unit's UNIT_NUMBER, unlocks UNIT_LOCK, acquires
-   unit's LOCK and after flushing reacquires UNIT_LOCK and restarts with
+   remembers the current unit's UNIT_NUMBER, unlocks UNIT_RWLOCK, acquires
+   unit's LOCK and after flushing reacquires UNIT_RWLOCK and restarts with
    the smallest UNIT_NUMBER above the last one flushed.
 
    If find_unit/find_or_create_unit/find_file/get_unit routines return
@@ -101,10 +103,14 @@ gfc_offset max_offset;
 gfc_offset default_recl;
 
 gfc_unit *unit_root;
+#ifdef __GTHREAD_RWLOCK_INIT
+__gthread_rwlock_t unit_rwlock = __GTHREAD_RWLOCK_INIT;
+#else
 #ifdef __GTHREAD_MUTEX_INIT
-__gthread_mutex_t unit_lock = __GTHREAD_MUTEX_INIT;
+__gthread_mutex_t unit_rwlock = __GTHREAD_MUTEX_INIT;
 #else
-__gthread_mutex_t unit_lock;
+__gthread_mutex_t unit_rwlock;
+#endif
 #endif
 
 /* We use these filenames for error reporting.  */
@@ -329,7 +335,7 @@ get_gfc_unit (int n, int do_create)
   int c, created = 0;
 
   NOTE ("Unit n=%d, do_create = %d", n, do_create);
-  LOCK (&unit_lock);
+  RDLOCK (&unit_rwlock);
 
 retry:
   for (c = 0; c < CACHE_SIZE; c++)
@@ -350,6 +356,17 @@ retry:
       if (c == 0)
 	break;
     }
+  /* We did not find a unit in the cache nor in the unit list, create a new
+    (locked) unit and insert into the unit list and cache.
+    Manipulating either or both the unit list and the unit cache requires to
+    hold a write-lock [for obvious reasons]:
+    1. By separating the read/write lock, it will greatly reduce the contention
+       at the read part, while write part is not always necessary or most
+       unlikely once the unit hit in cache.
+    2. We try to balance the implementation complexity and the performance
+       gains that fit into current cases we observed by just using a
+       pthread_rwlock. */
+  RD_TO_WRLOCK (&unit_rwlock);
 
   if (p == NULL && do_create)
     {
@@ -368,8 +385,8 @@ retry:
   if (created)
     {
       /* Newly created units have their lock held already
-	 from insert_unit.  Just unlock UNIT_LOCK and return.  */
-      UNLOCK (&unit_lock);
+	 from insert_unit.  Just unlock UNIT_RWLOCK and return.  */
+      RWUNLOCK (&unit_rwlock);
       return p;
     }
 
@@ -380,7 +397,7 @@ found:
       if (! TRYLOCK (&p->lock))
 	{
 	  /* assert (p->closed == 0); */
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
 	  return p;
 	}
 
@@ -388,14 +405,14 @@ found:
     }
 
 
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 
   if (p != NULL && (p->child_dtio == 0))
     {
       LOCK (&p->lock);
       if (p->closed)
 	{
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&p->lock);
 	  if (predec_waiting_locked (p) == 0)
 	    destroy_unit_mutex (p);
@@ -593,8 +610,8 @@ init_units (void)
 #endif
 #endif
 
-#ifndef __GTHREAD_MUTEX_INIT
-  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_lock);
+#if (!defined(__GTHREAD_RWLOCK_INIT) && !defined(__GTHREAD_MUTEX_INIT))
+  __GTHREAD_MUTEX_INIT_FUNCTION (&unit_rwlock);
 #endif
 
   if (sizeof (max_offset) == 8)
@@ -731,7 +748,7 @@ close_unit_1 (gfc_unit *u, int locked)
 
   u->closed = 1;
   if (!locked)
-    LOCK (&unit_lock);
+    WRLOCK (&unit_rwlock);
 
   for (i = 0; i < CACHE_SIZE; i++)
     if (unit_cache[i] == u)
@@ -758,7 +775,7 @@ close_unit_1 (gfc_unit *u, int locked)
     destroy_unit_mutex (u);
 
   if (!locked)
-    UNLOCK (&unit_lock);
+    RWUNLOCK (&unit_rwlock);
 
   return rc;
 }
@@ -795,10 +812,10 @@ close_unit (gfc_unit *u)
 void
 close_units (void)
 {
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   while (unit_root != NULL)
     close_unit_1 (unit_root, 1);
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
 
   free (newunits);
 
@@ -905,7 +922,7 @@ finish_last_advance_record (gfc_unit *u)
 int
 newunit_alloc (void)
 {
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   if (!newunits)
     {
       newunits = xcalloc (16, 1);
@@ -919,7 +936,7 @@ newunit_alloc (void)
         {
           newunits[ii] = true;
           newunit_lwi = ii + 1;
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
           return -ii + NEWUNIT_START;
         }
     }
@@ -932,12 +949,12 @@ newunit_alloc (void)
   memset (newunits + old_size, 0, old_size);
   newunits[old_size] = true;
   newunit_lwi = old_size + 1;
-    UNLOCK (&unit_lock);
+    RWUNLOCK (&unit_rwlock);
   return -old_size + NEWUNIT_START;
 }
 
 
-/* Free a previously allocated newunit= unit number.  unit_lock must
+/* Free a previously allocated newunit= unit number.  unit_rwlock must
    be held when calling.  */
 
 void
diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index ba12be08252..d53003919ab 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -1774,7 +1774,7 @@ find_file (const char *file, gfc_charlen_type file_len)
   id = id_from_path (path);
 #endif
 
-  LOCK (&unit_lock);
+  RDLOCK (&unit_rwlock);
 retry:
   u = find_file0 (unit_root, FIND_FILE0_ARGS);
   if (u != NULL)
@@ -1783,19 +1783,19 @@ retry:
       if (! __gthread_mutex_trylock (&u->lock))
 	{
 	  /* assert (u->closed == 0); */
-	  UNLOCK (&unit_lock);
+	  RWUNLOCK (&unit_rwlock);
 	  goto done;
 	}
 
       inc_waiting_locked (u);
     }
-  UNLOCK (&unit_lock);
+  RWUNLOCK (&unit_rwlock);
   if (u != NULL)
     {
       LOCK (&u->lock);
       if (u->closed)
 	{
-	  LOCK (&unit_lock);
+	  RDLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  if (predec_waiting_locked (u) == 0)
 	    free (u);
@@ -1839,13 +1839,13 @@ flush_all_units (void)
   gfc_unit *u;
   int min_unit = 0;
 
-  LOCK (&unit_lock);
+  WRLOCK (&unit_rwlock);
   do
     {
       u = flush_all_units_1 (unit_root, min_unit);
       if (u != NULL)
 	inc_waiting_locked (u);
-      UNLOCK (&unit_lock);
+      RWUNLOCK (&unit_rwlock);
       if (u == NULL)
 	return;
 
@@ -1856,13 +1856,13 @@ flush_all_units (void)
       if (u->closed == 0)
 	{
 	  sflush (u->s);
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  (void) predec_waiting_locked (u);
 	}
       else
 	{
-	  LOCK (&unit_lock);
+	  WRLOCK (&unit_rwlock);
 	  UNLOCK (&u->lock);
 	  if (predec_waiting_locked (u) == 0)
 	    free (u);
-- 
2.39.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-01-17 13:25 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-09  2:32 [PATCH v4] libgfortran: Replace mutex with rwlock Zhu, Lipeng
2023-05-16  7:08 ` Zhu, Lipeng
2023-05-23  2:53   ` Zhu, Lipeng
2023-05-24 19:18     ` Thomas Koenig
2023-08-18  3:06       ` Zhu, Lipeng
2023-09-14  8:33         ` Zhu, Lipeng
2023-10-23  1:21           ` Zhu, Lipeng
2023-10-23  5:52             ` Thomas Koenig
2023-10-23 23:59               ` Zhu, Lipeng
2023-11-01 10:14                 ` Zhu, Lipeng
2023-11-02  9:58                   ` Bernhard Reutner-Fischer
2023-11-23  9:36                     ` Zhu, Lipeng
2023-12-07  5:18                       ` Zhu, Lipeng
2023-08-18  3:18       ` [PATCH v6] " Zhu, Lipeng
2023-12-08 10:19         ` Jakub Jelinek
2023-12-09 15:13           ` Zhu, Lipeng
2023-12-09 15:39             ` [PATCH v7] " Lipeng Zhu
2023-12-09 15:23               ` Jakub Jelinek
2023-12-10  3:25                 ` Zhu, Lipeng
2023-12-11 17:45                   ` H.J. Lu
2023-12-12  2:05                     ` Zhu, Lipeng
2023-12-13 20:52                       ` Thomas Schwinge
2023-12-14  2:28                         ` Zhu, Lipeng
2023-12-14 12:29                           ` Thomas Schwinge
2023-12-14 12:39                             ` Jakub Jelinek
2023-12-15  5:43                               ` Zhu, Lipeng
2023-12-21 11:42                         ` Thomas Schwinge
2023-12-22  6:48                           ` Lipeng Zhu
2024-01-03  9:14                           ` Lipeng Zhu
2024-01-17 13:25                             ` Lipeng Zhu
2023-12-14 15:50               ` Richard Earnshaw (lists)
2023-12-15 11:31                 ` Lipeng Zhu
2023-12-15 19:23                   ` Richard Earnshaw
2024-01-02 11:57                     ` Vaseeharan Vinayagamoorthy
2024-01-03  1:02                       ` Lipeng Zhu
  -- strict thread matches above, loose matches on Subject: below --
2023-05-25 12:40 [PATCH v4] " Zhu, Lipeng
     [not found] <20230424214534.77117b73 () nbbrfq>
2023-05-08  9:44 ` Lipeng Zhu
2023-05-08 10:28   ` Bernhard Reutner-Fischer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).