public inbox for libc-alpha@sourceware.org
* Biased locking prototype with RSEQ (for libio)
From: Mathieu Desnoyers @ 2023-09-26 18:57 UTC
  To: Florian Weimer; +Cc: libc-alpha, carlos

Hi Florian,

After our discussions at the GNU Tools Cauldron, I went ahead and
implemented a biased locking prototype within librseq. I only
implemented the rseq_cmpeqv1_storev2 helper for x86-64, but
doing it for other architectures should be straightforward.

The basic idea here is that a single thread is designated as the
"biased fast" thread. As long as only this thread accesses the
lock, it can acquire and release the lock with loads and stores
(no barriers, no atomics, no acquire/release). As soon as other
threads need to touch this lock, they go through a state-machine
which ensures that at least one membarrier RSEQ-fence is issued
before attempting to acquire the lock with a CAS. The fast thread
grabs the lock within a RSEQ critical section to check whether
multi-threaded synchronization is required.

I noticed that in order to benefit from the performance improvements
provided by biased locking on my AMD Ryzen 7 PRO 6850U, I need to
ensure that the lock/unlock fast-paths are inlined, because calls
add significant overhead.

The code is available here:

https://github.com/compudj/librseq/tree/biased-lock

In particular:

https://github.com/compudj/librseq/blob/biased-lock/tests/rseq_biased_lock.c
https://github.com/compudj/librseq/blob/biased-lock/tests/rseq_biased_lock.h
https://github.com/compudj/librseq/blob/biased-lock/tests/test_rseq_biased_lock.c

Some benchmark results (with a single-threaded workload):

This test takes the lock, does a load, test, conditional branch, and two stores
before releasing the lock.

My libc version is 2.36-9+deb12u1.

time ./test_rseq_biased_lock -t 1 -r 10000000000

The baseline (no locking) is 0.5 ns/loop, which is already removed from each
result in the table below.

                                         ns/loop
biased locking (fast thread, inline):       0.5
biased locking (slow thread, inline):       1.3
biased locking (fast thread, calls):        2.4
biased locking (slow thread, calls):        2.4
pthread mutex:                              5.2

Please note that this is a purely single-threaded workload. I suspect that
moving from lock; cmpxchg to loads and stores may have additional benefits
when the workload involves more interactions between cores at the cache
coherency protocol level.

Feedback is welcome!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: Biased locking prototype with RSEQ (for libio)
From: Mathieu Desnoyers @ 2023-09-27 19:14 UTC
  To: Wilco Dijkstra; +Cc: 'GNU C Library'

On 9/27/23 18:38, Wilco Dijkstra wrote:
> Hi Mathieu,
> 
>> The basic idea here is that a single thread is designated as the
>> "biased fast" thread. As long as only this thread accesses the
>> lock, it can acquire and release the lock with loads and stores
>> (no barriers, no atomics, no acquire/release). As soon as other
>> threads need to touch this lock, they go through a state-machine
>> which ensures that at least one membarrier RSEQ-fence is issued
>> before attempting to acquire the lock with a CAS. The fast thread
>> grabs the lock within a RSEQ critical section to check whether
>> multi-threaded synchronization is required.
> 
> This sounds very interesting. In most cases a file handle is used by one
> thread (and even if multiple threads write to stdout, there are typically
> several writes from the same thread), so biasing it to a thread seems fine.
> Can this handle recursive locking like in sysdeps/nptl/stdio-lock.h?

This should not be an issue, see:

commit 247d07509f618b8e39043e3cbcb963bea4869128
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Sep 27 14:46:28 2023 -0400

     rseq biased lock: use thread pointer as owner field

^ adds owner field.

commit fd13bfaee0a6ab8e486ea95f0a7b03582e5762dd
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Sep 27 15:00:12 2023 -0400

     Implement biased rseq lock as recursive lock

^ recursive.

> The complex locking scheme used is what makes it hard to optimize...
> 
>> The baseline (no locking) is 0.5 ns/loop, which is already removed from each
>> result in the table below.
>>
>>                                          ns/loop
>> biased locking (fast thread, inline):       0.5
>> biased locking (slow thread, inline):       1.3
>> biased locking (fast thread, calls):        2.4
>> biased locking (slow thread, calls):        2.4
>> pthread mutex:                              5.2
> 
> If these results are similar with multiple threads then that would mean
> a significant speedup!

Actually those tests already ran with multiple threads from the 
perspective of libc's SINGLE_THREAD_P special-cases: the main() thread 
(waiting on pthread_join) and one created by pthread_create.

The approach used in this biased locking does not depend on the number of 
threads running within the process: it adapts to the threads actually 
using the lock.

> Just a quick check shows that getchar_unlocked()
> is ~9 times faster than getchar() if there are multiple threads (with one
> thread they are almost identical), so ~90% of the time is due to locking
> overheads...

Thanks for the feedback!

Mathieu

> 
> Cheers,
> Wilco

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Biased locking prototype with RSEQ (for libio)
From: Wilco Dijkstra @ 2023-09-27 17:38 UTC
  To: mathieu.desnoyers; +Cc: 'GNU C Library'

Hi Mathieu,

> The basic idea here is that a single thread is designated as the
> "biased fast" thread. As long as only this thread accesses the
> lock, it can acquire and release the lock with loads and stores
> (no barriers, no atomics, no acquire/release). As soon as other
> threads need to touch this lock, they go through a state-machine
> which ensures that at least one membarrier RSEQ-fence is issued
> before attempting to acquire the lock with a CAS. The fast thread
> grabs the lock within a RSEQ critical section to check whether
> multi-threaded synchronization is required.

This sounds very interesting. In most cases a file handle is used by one
thread (and even if multiple threads write to stdout, there are typically
several writes from the same thread), so biasing it to a thread seems fine.
Can this handle recursive locking like in sysdeps/nptl/stdio-lock.h?
The complex locking scheme used is what makes it hard to optimize...

> The baseline (no locking) is 0.5 ns/loop, which is already removed from each
> result in the table below.
>
>                                         ns/loop
> biased locking (fast thread, inline):       0.5
> biased locking (slow thread, inline):       1.3
> biased locking (fast thread, calls):        2.4
> biased locking (slow thread, calls):        2.4
> pthread mutex:                              5.2

If these results are similar with multiple threads then that would mean
a significant speedup! Just a quick check shows that getchar_unlocked()
is ~9 times faster than getchar() if there are multiple threads (with one
thread they are almost identical), so ~90% of the time is due to locking
overheads...

Cheers,
Wilco
