public inbox for libc-alpha@sourceware.org
* Priority Inversion and Unlimited Spin of pthread_rwlock_t
@ 2024-03-12  3:19 Peng Zheng
  2024-03-12  3:48 ` Peng Zheng
  2024-03-12 14:39 ` Florian Weimer
  0 siblings, 2 replies; 5+ messages in thread
From: Peng Zheng @ 2024-03-12  3:19 UTC (permalink / raw)
  To: libc-alpha


Hi,

I found that there are several unbounded spin loops in the current 
pthread_rwlock implementation.
As a result, it suffers from the same user-space spinlock problem 
described in this LWN article ([0]):

 > One thread might be spinning on a lock while the holder has been 
preempted and isn't running at all. In such cases, the lock will not be 
released soon, and the spinning just wastes CPU time. In the worst case, 
the thread that is spinning may be the one that is keeping the lock 
holder from running, meaning that the spinning thread is actively 
preventing the lock it needs from being released. In such situations, 
the code should simply stop spinning and go to sleep until the lock is 
released.

I just encountered one such issue in an embedded Linux system: there 
were several readers of priority SCHED_RR, and one writer of priority 
SCHED_OTHER.

It was found that two high-priority readers were spinning (consuming 
100% CPU) in the loop near the end of `__pthread_rwlock_rdlock_full`:

   for (;;)
     {
       while (((wpf = atomic_load_relaxed (&rwlock->__data.__wrphase_futex))
               | PTHREAD_RWLOCK_FUTEX_USED) == (1 | PTHREAD_RWLOCK_FUTEX_USED))
         { /* omitted */ }
       if (ready)
         /* See below.  */
         break;
       if ((atomic_load_acquire (&rwlock->__data.__readers)
            & PTHREAD_RWLOCK_WRPHASE) == 0)
         ready = true;
     }
   return 0;

And the SCHED_OTHER writer was just about to set `__wrphase_futex` in 
`__pthread_rwlock_wrlock_full` (just one ARM instruction away), but it 
was never able to do so because the two readers consumed nearly all 
available CPU time:

   while ((r & PTHREAD_RWLOCK_WRPHASE) == 0
          && (r >> PTHREAD_RWLOCK_READER_SHIFT) == 0)
     {
       if (atomic_compare_exchange_weak_acquire (&rwlock->__data.__readers,
                                                 &r, r | PTHREAD_RWLOCK_WRPHASE))
         {
           /* The writer was stuck HERE!  */
           atomic_store_relaxed (&rwlock->__data.__wrphase_futex, 1);

           goto done;
         }
       /* TODO Back-off.  */
     }

In ARM assembly:

   mov r3, #1        ; the writer is stuck HERE!
   str r3, [r12, #8] ; r12 holds the address of rwlock->__data, and 8 is
                     ; the offset of __wrphase_futex in __data


Unbounded user-space spinning is too dangerous to use. How about 
limiting the total number of spins before suspending on a futex? Or 
using rseq, as mentioned in the LWN article?

Any ideas?

[0] https://lwn.net/Articles/944895/

Regards,

-- 
Peng Zheng

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Priority Inversion and Unlimited Spin of pthread_rwlock_t
  2024-03-12  3:19 Priority Inversion and Unlimited Spin of pthread_rwlock_t Peng Zheng
@ 2024-03-12  3:48 ` Peng Zheng
  2024-03-14  7:32   ` Peng Zheng
  2024-03-12 14:39 ` Florian Weimer
  1 sibling, 1 reply; 5+ messages in thread
From: Peng Zheng @ 2024-03-12  3:48 UTC (permalink / raw)
  To: libc-alpha

On 2024/3/12 11:19, Peng Zheng wrote:
> Hi,
> 
> I found that there are several unlimited spins in the current 
> pthread_rwlock's implementation.
> Therefore, it suffers from the same issue of user-space spinlocks as 
> mentioned in this LWN article ([0]):

I forgot to mention that this issue is glibc-specific.
I just checked musl's source code: all spinning in its pthread_rwlock_t 
implementation is limited to 100 iterations. For example:

int __pthread_rwlock_timedwrlock(pthread_rwlock_t *restrict rw, const struct timespec *restrict at)
{
	int r, t;
	
	r = pthread_rwlock_trywrlock(rw);
	if (r != EBUSY) return r;
	
	int spins = 100;
	while (spins-- && rw->_rw_lock && !rw->_rw_waiters) a_spin();

	while ((r=__pthread_rwlock_trywrlock(rw))==EBUSY) {
		if (!(r=rw->_rw_lock)) continue;
		t = r | 0x80000000;
		a_inc(&rw->_rw_waiters);
		a_cas(&rw->_rw_lock, r, t);
		r = __timedwait(&rw->_rw_lock, t, CLOCK_REALTIME, at, rw->_rw_shared^128);
		a_dec(&rw->_rw_waiters);
		if (r && r != EINTR) return r;
	}
	return r;
}

Regards,

-- 
Peng Zheng


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Priority Inversion and Unlimited Spin of pthread_rwlock_t
  2024-03-12  3:19 Priority Inversion and Unlimited Spin of pthread_rwlock_t Peng Zheng
  2024-03-12  3:48 ` Peng Zheng
@ 2024-03-12 14:39 ` Florian Weimer
  2024-03-13  1:48   ` Peng Zheng
  1 sibling, 1 reply; 5+ messages in thread
From: Florian Weimer @ 2024-03-12 14:39 UTC (permalink / raw)
  To: Peng Zheng; +Cc: libc-alpha

* Peng Zheng:

> And the SCHED_OTHER writer was just about to enable the `__wrphase_futex` in
> `__pthread_rwlock_wrlock_full` (just one ARM instruction away) 
> but never able to do that (the two readers ate nearly all available CPUs):
>
>   while ((r & PTHREAD_RWLOCK_WRPHASE) == 0
>      && (r >> PTHREAD_RWLOCK_READER_SHIFT) == 0)
>     {
>       if (atomic_compare_exchange_weak_acquire (&rwlock->__data.__readers,
>                         &r, r | PTHREAD_RWLOCK_WRPHASE))
>     {
>       atomic_store_relaxed (&rwlock->__data.__wrphase_futex, 1);  /* writer was stuck HERE! */
>
>       goto done;
>     }
>       /* TODO Back-off.  */
>     }

Is this about filling in the TODO?

> Unlimited user space spin is too dangerous to be used, how about
> limiting the total number of spins before suspending using futex? Or
> using rseq as mentioned in the LWN artible?

> [0] https://lwn.net/Articles/944895/

The LWN article does not talk about priority inversion.  With rseq,
userspace can detect preemption, but that may never happen with certain
realtime scheduling priorities.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Priority Inversion and Unlimited Spin of pthread_rwlock_t
  2024-03-12 14:39 ` Florian Weimer
@ 2024-03-13  1:48   ` Peng Zheng
  0 siblings, 0 replies; 5+ messages in thread
From: Peng Zheng @ 2024-03-13  1:48 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-alpha

On 2024/3/12 22:39, Florian Weimer wrote:
> * Peng Zheng:
> 
>> And the SCHED_OTHER writer was just about to enable the `__wrphase_futex` in
>> `__pthread_rwlock_wrlock_full` (just one ARM instruction away)
>> but never able to do that (the two readers ate nearly all available CPUs):
>>
>>    while ((r & PTHREAD_RWLOCK_WRPHASE) == 0
>>       && (r >> PTHREAD_RWLOCK_READER_SHIFT) == 0)
>>      {
>>        if (atomic_compare_exchange_weak_acquire (&rwlock->__data.__readers,
>>                          &r, r | PTHREAD_RWLOCK_WRPHASE))
>>      {
>>        atomic_store_relaxed (&rwlock->__data.__wrphase_futex, 1);  /* writer was stuck HERE! */
>>
>>        goto done;
>>      }
>>        /* TODO Back-off.  */
>>      }
> 
> Is this about filling in the TODO?

No, I forgot to remove inline comments from the original source.

It is about a low-priority (SCHED_OTHER) writer that was about to 
acquire the lock but was preempted by high-priority readers, and thus 
was unable to set `__wrphase_futex` to 1 (see the IT IS PREEMPTED HERE 
comment):

   while ((r & PTHREAD_RWLOCK_WRPHASE) == 0
          && (r >> PTHREAD_RWLOCK_READER_SHIFT) == 0)
     {
       if (atomic_compare_exchange_weak_acquire (&rwlock->__data.__readers,
                                                 &r, r | PTHREAD_RWLOCK_WRPHASE))
         {
           /* IT IS PREEMPTED HERE */
           atomic_store_relaxed (&rwlock->__data.__wrphase_futex, 1);
           goto done;
         }
     }


And the two high-priority readers were stuck in the loop near the end 
of `__pthread_rwlock_rdlock_full`, eating all available CPU:

   for (;;)
     {
       while (((wpf = atomic_load_relaxed (&rwlock->__data.__wrphase_futex))
               | PTHREAD_RWLOCK_FUTEX_USED) == (1 | PTHREAD_RWLOCK_FUTEX_USED))
         { /* omitted */ }
       if (ready)
         break;
       if ((atomic_load_acquire (&rwlock->__data.__readers)
            & PTHREAD_RWLOCK_WRPHASE) == 0)
         ready = true;
     }
   return 0;

Note that PTHREAD_RWLOCK_WRPHASE was already set by the preempted 
writer. That means `ready` is always false.

Note also that `__wrphase_futex` had not yet been set by the preempted 
writer. That means these readers cannot wait on the futex to stop spinning.

This illustrates one of several possible unbounded spins; I have 
encountered two or three of them. If you are interested, I can provide 
the corresponding postmortem debug sessions.

-- 
Peng Zheng


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Priority Inversion and Unlimited Spin of pthread_rwlock_t
  2024-03-12  3:48 ` Peng Zheng
@ 2024-03-14  7:32   ` Peng Zheng
  0 siblings, 0 replies; 5+ messages in thread
From: Peng Zheng @ 2024-03-14  7:32 UTC (permalink / raw)
  To: libc-alpha; +Cc: triegel

On 2024/3/12 11:48, Peng Zheng wrote:
> On 2024/3/12 11:19, Peng Zheng wrote:
>> Hi,
>>
>> I found that there are several unlimited spins in the current 
>> pthread_rwlock's implementation.
>> Therefore, it suffers from the same issue of user-space spinlocks as 
>> mentioned in this LWN article ([0]):
> 
> I forget to mention this issue is glibc-specific.
> Just checked musl's source code, all spin in its pthread_rwlock_t is 
> limited to 100. For example:

It seems that this issue was introduced by commit 
cc25c8b4c1196a8c29e9a45b1e096b99a87b7f8c, whose design is quite 
complicated.

It is highly non-trivial to add bounded spinning to the current 
implementation.

This issue has extensive impact because OpenSSL uses wrlock as an 
ordinary mutex by default:

https://github.com/openssl/openssl/blob/3cb0755323281267211fbe951b94a2552e99d32a/crypto/threads_pthread.c#L622

And it impacts nearly all subsystems of OpenSSL.

https://github.com/openssl/openssl/blob/b372b1f76450acdfed1e2301a39810146e28b02c/crypto/ex_data.c#L50C22-L50C34

I can safely say that nearly any Linux system running real-time tasks 
that use OpenSSL may be affected by this issue (e.g., real-time video 
streaming over a TLS connection).



> 
> int __pthread_rwlock_timedwrlock(pthread_rwlock_t *restrict rw, const 
> struct timespec *restrict at)
> {
>      int r, t;
> 
>      r = pthread_rwlock_trywrlock(rw);
>      if (r != EBUSY) return r;
> 
>      int spins = 100;
>      while (spins-- && rw->_rw_lock && !rw->_rw_waiters) a_spin();
> 
>      while ((r=__pthread_rwlock_trywrlock(rw))==EBUSY) {
>          if (!(r=rw->_rw_lock)) continue;
>          t = r | 0x80000000;
>          a_inc(&rw->_rw_waiters);
>          a_cas(&rw->_rw_lock, r, t);
>          r = __timedwait(&rw->_rw_lock, t, CLOCK_REALTIME, at, 
> rw->_rw_shared^128);
>          a_dec(&rw->_rw_waiters);
>          if (r && r != EINTR) return r;
>      }
>      return r;
> }
> 
> Regards,
> 

-- 
Peng Zheng


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-03-14  7:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-12  3:19 Priority Inversion and Unlimited Spin of pthread_rwlock_t Peng Zheng
2024-03-12  3:48 ` Peng Zheng
2024-03-14  7:32   ` Peng Zheng
2024-03-12 14:39 ` Florian Weimer
2024-03-13  1:48   ` Peng Zheng
