From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 31951 invoked by alias); 5 Nov 2008 09:57:15 -0000
Received: (qmail 25885 invoked by uid 48); 5 Nov 2008 09:55:57 -0000
Date: Wed, 05 Nov 2008 09:57:00 -0000
Message-ID: <20081105095557.25884.qmail@sourceware.org>
From: "tom dot honermann at oracle dot com"
To: glibc-bugs@sources.redhat.com
In-Reply-To: <20070704013541.4737.nmiell@comcast.net>
References: <20070704013541.4737.nmiell@comcast.net>
Reply-To: sourceware-bugzilla@sourceware.org
Subject: [Bug libc/4737] fork is not async-signal-safe
X-Bugzilla-Reason: CC
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help:
Sender: glibc-bugs-owner@sourceware.org
X-SW-Source: 2008-11/txt/msg00013.txt.bz2

------- Additional Comments From tom dot honermann at oracle dot com  2008-11-05 09:55 -------

Oracle/PeopleSoft is also running into this bug. Oracle engineers (though not myself) are currently assigned to resolving this issue. I am hoping to facilitate discussion regarding what an acceptable solution to this problem should look like.

I've been studying the glibc-2.3.4 source code (old, I know, but this is the version that we will ultimately have to create patches for). I suspect (but have not verified) that the underlying issue is still present in the latest CVS source code, as implied by the fact that this bug report is still open. My priority is to get this corrected for Linux/x86_64.
The stack trace for the hang I've been seeing looks like:

    #0  0x00000034cc0d9128 in __lll_mutex_lock_wait () from /lib64/libc.so.6
    #1  0x00000034cc07262c in _L_lock_57 () from /lib64/libc.so.6
    #2  0x00000034cc06bfa3 in ptmalloc_lock_all () from /lib64/libc.so.6
    #3  0x00000034cc09461a in fork () from /lib64/libc.so.6
    #9  <signal handler called>
    #10 0x00000034cc030015 in raise () from /lib64/libc.so.6
    #11 0x00000034cc031980 in abort () from /lib64/libc.so.6
    #12 0x00000034cc0674db in __libc_message () from /lib64/libc.so.6
    #13 0x00000034cc06e8a0 in _int_free () from /lib64/libc.so.6
    #14 0x00000034cc071fbc in free () from /lib64/libc.so.6

The root cause of the signal in this case was heap corruption: glibc detected the corruption and aborted the process. The invoked signal handler is simply trying to fork/exec a program to gather the diagnostics we need to help us find the source of the heap corruption.

The Linux/x86_64 glibc build currently uses "normal" mutexes for locking the heap arenas (see 'ptmalloc_init' in malloc/arena.c). These mutexes are initialized by calling 'mutex_init' in 'ptmalloc_init', and such "normal" mutexes deadlock if the thread that owns the mutex attempts to re-acquire it.

The simplest solution, it seems to me, is to convert these to recursive mutexes. A recursive mutex would allow a thread that already holds one of the arena mutexes to handle a signal, call fork from within that signal handler, enter 'ptmalloc_lock_all', and still acquire all of the arena mutexes. The thread would then proceed while the data structures of the previously locked arena are in an inconsistent state (since the original heap operation was interrupted by the signal), but this should be acceptable: heap functions are not async-signal-safe and therefore may not be (reliably) called from within a signal handler anyway.
Since the relevant thread in both the parent and child processes is still executing within the context of a signal handler, the arena data structures may not be touched by either thread.

The downsides to this approach are performance overhead and the potential for defects to go unnoticed during development, since unintentional attempts to recursively lock a mutex would no longer lead to deadlocks. This approach would also require changes to 'ptmalloc_unlock_all2' (which currently re-initializes the arena mutexes in the child process rather than unlocking them), since a return from the signal handler in the child process will attempt to unlock the previously held arena mutex. If the mutex has been re-initialized, that unlock call could result in undesirable behavior.

Eagerly awaiting comments and criticisms...

-- 
http://sourceware.org/bugzilla/show_bug.cgi?id=4737

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.