public inbox for glibc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug nptl/26570] New: One of the thread is getting hanged when two threads are trying yo join each other.
@ 2020-09-03  8:30 agarwalshashank12 at gmail dot com
  2020-09-03 13:08 ` [Bug nptl/26570] " adhemerval.zanella at linaro dot org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: agarwalshashank12 at gmail dot com @ 2020-09-03  8:30 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26570

            Bug ID: 26570
           Summary: One of the thread is getting hanged when two threads
                    are trying yo join each other.
           Product: glibc
           Version: 2.31
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: nptl
          Assignee: unassigned at sourceware dot org
          Reporter: agarwalshashank12 at gmail dot com
                CC: drepper.fsp at gmail dot com
  Target Milestone: ---

Created attachment 12814
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12814&action=edit
Test Case describing the scenario

Hello,

Description -

I have obeserved a behaviour in pthread_join(3) where 2 threads are trying to
join each other and one of the thread is getting hanged indefinitely as it
stuck in the futex(2) system call for the other thread to wake.

Test Case Description -

1. Create multiple threads using pthread_create in the joinable state.
2. thread1 joins thread2 using pthread_join.
3. thread2 joins thread1 using pthread_join.

EXPECTED RESULT :
1. Thread1 should join thread2.
2. when thread2 tries to join thread1 it detects a deadlock.

Actual Result :
Thread 2 is getting hanged.

Glibc Version -
2.31, 2.28

Sample Code -
pthread_join_sample.c attached

Enviorenment -
$uname -a
Linux a1sb2-010 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux

Root Caues -

When any thread performs the pthread_join(3) on any thread, Glibc stores the ID
of the former thread(performing pthread_join(3)) into the member joinid of the
structure pthread of the thread on which pthread_join(3) is performed.

Hence in the expected scenario pthread_join(3) code flow works as follows -

1. thread t1 performing pthread_join(3) on thread t2, GLibc performs below
checks before invoking futex wait-

Check 1 -
If the joinid of t1 contains the id of the t2 then it returns the EDEADLK if
not then moves to check 2,

Check 2

Glibc checks wheather anyother thread has already performed the pthread_join on
t2 or not. This is achieved with the help of joinid member of the thread t2. If
this member contains the NULL value it means no other thread has performed the
pthread_join on t2 and thread t1 updates its ID into the joinid member of
structure pthread of t2 in nptl/pthread_join_common.c and perform futex wait.
If the joinid member of thread t2 already contain some non null value it means
some other thread has already performed pthread_join(3) on thread t2 and
library returns EINVAL to the thread t1.

2. When thread t2 performs pthread_join(3) on t1, Glibc fails in check 2 i.e
joinid of t2 contains the id of the t1 and it returns the EDEADLK.

As per analysis In Glibcv2.31, thread performing pthread_join(3) on any thread
is not able to update the ID of the former thread into the member joinid of the
other thread, Hence both the threads pass check 1 and check 2 and gone into the
futex wait and one thread gets hanged. In glibc 2.31 below code semantics is
used to update the ID of the thread -

atomic_compare_exchange_weak_acquire (&pd->joinid,&self,NULL)
This macro internally calls the builtin function __atomic_compare_exchange_n
((mem), (expected), (desired), 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) of gcc.
This built-in function implements an atomic compare and exchange operation.
This compares the contents of *ptr with the contents of *expected. If equal,
the operation is a read-modify-write operation that writes desired into *ptr.
If they are not equal, the operation is a read and the current contents of *ptr
are written into *expected.

The expected value passed to this builtin function is &self i.e the ID of the
thread calling pthread_join(3). So as per our analysis joinid of the thread on
which pthread_join(3) is performed will never be equal to the ID of the thread
calling pthread_join(3) as its initial value will by NULL and this member
joinid is always updated from pthread_join(3) hence in builtin function
__atomic_compare_exchange_n the comaprision will always gets failed and the
joinid will never be updated. Both the threads will go into the futex wait
operation.

Fix -

In order to fix the issue, expected value passed to the function
__atomic_compare_exchange_n should be null and the desired value should be the
ID of the thread performing pthread_join(3) as below -

atomic_compare_exchange_weak_acquire (&pd->joinid,&null_ptr,self)

Regards
Shashank Agarwal
HCL Technologies Limited
shashank_agarwal@hcl.com

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug nptl/26570] One of the thread is getting hanged when two threads are trying yo join each other.
  2020-09-03  8:30 [Bug nptl/26570] New: One of the thread is getting hanged when two threads are trying yo join each other agarwalshashank12 at gmail dot com
@ 2020-09-03 13:08 ` adhemerval.zanella at linaro dot org
  2020-09-04  5:23 ` agarwalshashank12 at gmail dot com
  2020-09-04  8:01 ` schwab@linux-m68k.org
  2 siblings, 0 replies; 4+ messages in thread
From: adhemerval.zanella at linaro dot org @ 2020-09-03 13:08 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26570

Adhemerval Zanella <adhemerval.zanella at linaro dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |adhemerval.zanella at linaro dot o
                   |                            |rg

--- Comment #1 from Adhemerval Zanella <adhemerval.zanella at linaro dot org> ---
(In reply to Shashank Agarwal from comment #0)
> Created attachment 12814 [details]
> Test Case describing the scenario
> 
> Hello,
> 
> Description -
> 
> I have obeserved a behaviour in pthread_join(3) where 2 threads are trying
> to join each other and one of the thread is getting hanged indefinitely as
> it stuck in the futex(2) system call for the other thread to wake.
> 
> Test Case Description -
> 
> 1. Create multiple threads using pthread_create in the joinable state.
> 2. thread1 joins thread2 using pthread_join.
> 3. thread2 joins thread1 using pthread_join.
> 
> EXPECTED RESULT :
> 1. Thread1 should join thread2.
> 2. when thread2 tries to join thread1 it detects a deadlock.

The deadlock failure *may* be reportable by a pthread_join implementation, but
it is not portable solution and relying on such behaviour is a fragile
approach.  The glibc tries only to catch some possible scenarios which are the
most straighforward, but there multiple other ones that current implementation
does not handle.

A fully reliable deadlock implementation will either require some extra kernel
support or a costly userspace implementation to maintain a user list per futex
word (which I am not fully sure it is feasible).

> 
> Actual Result :
> Thread 2 is getting hanged.
> 
> Glibc Version -
> 2.31, 2.28
> 
> Sample Code -
> pthread_join_sample.c attached
> 
> Enviorenment -
> $uname -a
> Linux a1sb2-010 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux
> 
> Root Caues -
> 
> When any thread performs the pthread_join(3) on any thread, Glibc stores the
> ID of the former thread(performing pthread_join(3)) into the member joinid
> of the structure pthread of the thread on which pthread_join(3) is performed.
> 
> Hence in the expected scenario pthread_join(3) code flow works as follows -
> 
> 1. thread t1 performing pthread_join(3) on thread t2, GLibc performs below
> checks before invoking futex wait-
> 
> Check 1 -
> If the joinid of t1 contains the id of the t2 then it returns the EDEADLK if
> not then moves to check 2,
> 
> Check 2
> 
> Glibc checks wheather anyother thread has already performed the pthread_join
> on t2 or not. This is achieved with the help of joinid member of the thread
> t2. If this member contains the NULL value it means no other thread has
> performed the pthread_join on t2 and thread t1 updates its ID into the
> joinid member of structure pthread of t2 in nptl/pthread_join_common.c and
> perform futex wait. If the joinid member of thread t2 already contain some
> non null value it means some other thread has already performed
> pthread_join(3) on thread t2 and library returns EINVAL to the thread t1.
> 
> 2. When thread t2 performs pthread_join(3) on t1, Glibc fails in check 2 i.e
> joinid of t2 contains the id of the t1 and it returns the EDEADLK.
> 
> As per analysis In Glibcv2.31, thread performing pthread_join(3) on any
> thread is not able to update the ID of the former thread into the member
> joinid of the other thread, Hence both the threads pass check 1 and check 2
> and gone into the futex wait and one thread gets hanged. In glibc 2.31 below
> code semantics is used to update the ID of the thread -
> 
> atomic_compare_exchange_weak_acquire (&pd->joinid,&self,NULL)
> This macro internally calls the builtin function __atomic_compare_exchange_n
> ((mem), (expected), (desired), 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) of
> gcc. This built-in function implements an atomic compare and exchange
> operation. This compares the contents of *ptr with the contents of
> *expected. If equal, the operation is a read-modify-write operation that
> writes desired into *ptr.
> If they are not equal, the operation is a read and the current contents of
> *ptr are written into *expected.
> 
> The expected value passed to this builtin function is &self i.e the ID of
> the thread calling pthread_join(3). So as per our analysis joinid of the
> thread on which pthread_join(3) is performed will never be equal to the ID
> of the thread calling pthread_join(3) as its initial value will by NULL and
> this member joinid is always updated from pthread_join(3) hence in builtin
> function __atomic_compare_exchange_n the comaprision will always gets failed
> and the joinid will never be updated. Both the threads will go into the
> futex wait operation.
> 
> Fix -
> 
> In order to fix the issue, expected value passed to the function
> __atomic_compare_exchange_n should be null and the desired value should be
> the ID of the thread performing pthread_join(3) as below -
> 
> atomic_compare_exchange_weak_acquire (&pd->joinid,&null_ptr,self)

This is wrong, the 'joinid' here is used to synchronize with the pthread_detach
to indicate a waiter.  The pthread_create initializes the field to 'NULL' for
default case or a self-reference to its own 'struct pthread' for the case of a
detached attribute. 

This CAS checks if pthread_join is being called for a detached thread, where
'joinid' refers to the 'struct pthread' of the calling thread.  This is also
UB, but glibc implementation handles this scenarios as an invalid call to
caller.

I don't see a easy way to report deadlock in this scenarios in a *reliable* way
without some extra kernel support.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug nptl/26570] One of the thread is getting hanged when two threads are trying yo join each other.
  2020-09-03  8:30 [Bug nptl/26570] New: One of the thread is getting hanged when two threads are trying yo join each other agarwalshashank12 at gmail dot com
  2020-09-03 13:08 ` [Bug nptl/26570] " adhemerval.zanella at linaro dot org
@ 2020-09-04  5:23 ` agarwalshashank12 at gmail dot com
  2020-09-04  8:01 ` schwab@linux-m68k.org
  2 siblings, 0 replies; 4+ messages in thread
From: agarwalshashank12 at gmail dot com @ 2020-09-04  5:23 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26570

--- Comment #2 from Shashank Agarwal <agarwalshashank12 at gmail dot com> ---
(In reply to Adhemerval Zanella from comment #1)
> (In reply to Shashank Agarwal from comment #0)
> > Created attachment 12814 [details]
> > Test Case describing the scenario
> > 
> > Hello,
> > 
> > Description -
> > 
> > I have obeserved a behaviour in pthread_join(3) where 2 threads are trying
> > to join each other and one of the thread is getting hanged indefinitely as
> > it stuck in the futex(2) system call for the other thread to wake.
> > 
> > Test Case Description -
> > 
> > 1. Create multiple threads using pthread_create in the joinable state.
> > 2. thread1 joins thread2 using pthread_join.
> > 3. thread2 joins thread1 using pthread_join.
> > 
> > EXPECTED RESULT :
> > 1. Thread1 should join thread2.
> > 2. when thread2 tries to join thread1 it detects a deadlock.
> 
> The deadlock failure *may* be reportable by a pthread_join implementation,
> but it is not portable solution and relying on such behaviour is a fragile
> approach.  The glibc tries only to catch some possible scenarios which are
> the most straighforward, but there multiple other ones that current
> implementation does not handle.
> 
> A fully reliable deadlock implementation will either require some extra
> kernel support or a costly userspace implementation to maintain a user list
> per futex word (which I am not fully sure it is feasible).
> 
> > 
> > Actual Result :
> > Thread 2 is getting hanged.
> > 
> > Glibc Version -
> > 2.31, 2.28
> > 
> > Sample Code -
> > pthread_join_sample.c attached
> > 
> > Enviorenment -
> > $uname -a
> > Linux a1sb2-010 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
> > x86_64 x86_64 x86_64 GNU/Linux
> > 
> > Root Caues -
> > 
> > When any thread performs the pthread_join(3) on any thread, Glibc stores the
> > ID of the former thread(performing pthread_join(3)) into the member joinid
> > of the structure pthread of the thread on which pthread_join(3) is performed.
> > 
> > Hence in the expected scenario pthread_join(3) code flow works as follows -
> > 
> > 1. thread t1 performing pthread_join(3) on thread t2, GLibc performs below
> > checks before invoking futex wait-
> > 
> > Check 1 -
> > If the joinid of t1 contains the id of the t2 then it returns the EDEADLK if
> > not then moves to check 2,
> > 
> > Check 2
> > 
> > Glibc checks wheather anyother thread has already performed the pthread_join
> > on t2 or not. This is achieved with the help of joinid member of the thread
> > t2. If this member contains the NULL value it means no other thread has
> > performed the pthread_join on t2 and thread t1 updates its ID into the
> > joinid member of structure pthread of t2 in nptl/pthread_join_common.c and
> > perform futex wait. If the joinid member of thread t2 already contain some
> > non null value it means some other thread has already performed
> > pthread_join(3) on thread t2 and library returns EINVAL to the thread t1.
> > 
> > 2. When thread t2 performs pthread_join(3) on t1, Glibc fails in check 2 i.e
> > joinid of t2 contains the id of the t1 and it returns the EDEADLK.
> > 
> > As per analysis In Glibcv2.31, thread performing pthread_join(3) on any
> > thread is not able to update the ID of the former thread into the member
> > joinid of the other thread, Hence both the threads pass check 1 and check 2
> > and gone into the futex wait and one thread gets hanged. In glibc 2.31 below
> > code semantics is used to update the ID of the thread -
> > 
> > atomic_compare_exchange_weak_acquire (&pd->joinid,&self,NULL)
> > This macro internally calls the builtin function __atomic_compare_exchange_n
> > ((mem), (expected), (desired), 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) of
> > gcc. This built-in function implements an atomic compare and exchange
> > operation. This compares the contents of *ptr with the contents of
> > *expected. If equal, the operation is a read-modify-write operation that
> > writes desired into *ptr.
> > If they are not equal, the operation is a read and the current contents of
> > *ptr are written into *expected.
> > 
> > The expected value passed to this builtin function is &self i.e the ID of
> > the thread calling pthread_join(3). So as per our analysis joinid of the
> > thread on which pthread_join(3) is performed will never be equal to the ID
> > of the thread calling pthread_join(3) as its initial value will by NULL and
> > this member joinid is always updated from pthread_join(3) hence in builtin
> > function __atomic_compare_exchange_n the comaprision will always gets failed
> > and the joinid will never be updated. Both the threads will go into the
> > futex wait operation.
> > 
> > Fix -
> > 
> > In order to fix the issue, expected value passed to the function
> > __atomic_compare_exchange_n should be null and the desired value should be
> > the ID of the thread performing pthread_join(3) as below -
> > 
> > atomic_compare_exchange_weak_acquire (&pd->joinid,&null_ptr,self)
> 
> This is wrong, the 'joinid' here is used to synchronize with the
> pthread_detach to indicate a waiter.  The pthread_create initializes the
> field to 'NULL' for default case or a self-reference to its own 'struct
> pthread' for the case of a detached attribute. 
> 
> This CAS checks if pthread_join is being called for a detached thread, where
> 'joinid' refers to the 'struct pthread' of the calling thread.  This is also
> UB, but glibc implementation handles this scenarios as an invalid call to
> caller.
> 
> I don't see a easy way to report deadlock in this scenarios in a *reliable*
> way without some extra kernel support.

Hello,

I understand your concern. But as per my understanding in Glibc v2.21 joinid
was updated with the ID of the thread calling pthread_join(3) that is why I
thought in v2.31 this is the issue in CAS when updating the joinid -

else if (__builtin_expect (atomic_compare_and_exchange_bool_acq (&pd->joinid,
                                                                   self,
                                                                   NULL), 0))
    /* There is already somebody waiting for the thread.  */
    result = EINVAL;

Apart from this issue, I have also observed the another issue in Glibc v2.31 in
pthread_join(3). As per the POSIX implementation of pthread_join(3) EINVAL
should be return to the thread when pthread_join(3) is called for the thread on
which another thread is already waiting to join. In Glibc v2.31 one of the
thread is getting hanged and going for the futex wait indefinately for the same
reason as mention in comment#1.

Regards
Shashank Agarwal
HCL Technologies Limited
shashank_agarwal@hcl.com

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Bug nptl/26570] One of the thread is getting hanged when two threads are trying yo join each other.
  2020-09-03  8:30 [Bug nptl/26570] New: One of the thread is getting hanged when two threads are trying yo join each other agarwalshashank12 at gmail dot com
  2020-09-03 13:08 ` [Bug nptl/26570] " adhemerval.zanella at linaro dot org
  2020-09-04  5:23 ` agarwalshashank12 at gmail dot com
@ 2020-09-04  8:01 ` schwab@linux-m68k.org
  2 siblings, 0 replies; 4+ messages in thread
From: schwab@linux-m68k.org @ 2020-09-04  8:01 UTC (permalink / raw)
  To: glibc-bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=26570

--- Comment #3 from Andreas Schwab <schwab@linux-m68k.org> ---
This is only a recommendation, not a requirement.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-09-04  8:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03  8:30 [Bug nptl/26570] New: One of the thread is getting hanged when two threads are trying yo join each other agarwalshashank12 at gmail dot com
2020-09-03 13:08 ` [Bug nptl/26570] " adhemerval.zanella at linaro dot org
2020-09-04  5:23 ` agarwalshashank12 at gmail dot com
2020-09-04  8:01 ` schwab@linux-m68k.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).