From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 31951 invoked by alias); 5 Nov 2008 09:57:15 -0000
Received: (qmail 25885 invoked by uid 48); 5 Nov 2008 09:55:57 -0000
Date: Wed, 05 Nov 2008 09:57:00 -0000
Message-ID: <20081105095557.25884.qmail@sourceware.org>
From: "tom dot honermann at oracle dot com"
To: glibc-bugs@sources.redhat.com
In-Reply-To: <20070704013541.4737.nmiell@comcast.net>
References: <20070704013541.4737.nmiell@comcast.net>
Reply-To: sourceware-bugzilla@sourceware.org
Subject: [Bug libc/4737] fork is not async-signal-safe
X-Bugzilla-Reason: CC
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Post:
List-Help:
Sender: glibc-bugs-owner@sourceware.org
X-SW-Source: 2008-11/txt/msg00013.txt.bz2

------- Additional Comments From tom dot honermann at oracle dot com  2008-11-05 09:55 -------

Oracle/PeopleSoft is also running into this bug. Oracle engineers (though not myself) are currently assigned to resolving this issue. I am hoping to facilitate discussion regarding what an acceptable solution to this problem should look like.

I've been studying the glibc-2.3.4 source code (old, I know, but this is the version that we will ultimately have to create patches for). I suspect (but have not verified) that the underlying issue is still present in the latest CVS source code, as implied by the fact that this bug report is still open. My priority is to get this corrected for Linux/x86_64.
The stack trace for the hang I've been seeing looks like:

    #0  0x00000034cc0d9128 in __lll_mutex_lock_wait () from /lib64/libc.so.6
    #1  0x00000034cc07262c in _L_lock_57 () from /lib64/libc.so.6
    #2  0x00000034cc06bfa3 in ptmalloc_lock_all () from /lib64/libc.so.6
    #3  0x00000034cc09461a in fork () from /lib64/libc.so.6
    #9  <signal handler called>
    #10 0x00000034cc030015 in raise () from /lib64/libc.so.6
    #11 0x00000034cc031980 in abort () from /lib64/libc.so.6
    #12 0x00000034cc0674db in __libc_message () from /lib64/libc.so.6
    #13 0x00000034cc06e8a0 in _int_free () from /lib64/libc.so.6
    #14 0x00000034cc071fbc in free () from /lib64/libc.so.6

The root cause of the signal in this case was heap corruption: glibc detected the corruption and aborted the process. The invoked signal handler is simply trying to fork/exec a program to gather the diagnostics we need to help us find the source of the heap corruption.

The Linux/x86_64 glibc build currently uses "normal" mutexes for locking the heap arenas (see 'ptmalloc_init' in malloc/arena.c). These mutexes are initialized by calling 'mutex_init' in 'ptmalloc_init', and such "normal" mutexes deadlock if the thread that owns the mutex attempts to re-acquire it.

The simplest solution, it seems to me, is to convert these to recursive mutexes. A recursive mutex would allow a thread that already holds one of the arena mutexes to handle a signal, call fork from within that signal handler, enter 'ptmalloc_lock_all', and still acquire all of the arena mutexes. The thread would then proceed while the data structures of the previously locked arena are in an inconsistent state (since the original heap operation was interrupted by the signal), but this should be acceptable: heap functions are not async-signal-safe and therefore may not be (reliably) called from within a signal handler anyway.
Since the relevant thread in both the parent and child processes is still executing within the context of a signal handler, the arena data structures may not be touched by either thread.

The downsides to this approach are performance overhead and the potential for defects to go unnoticed during development, since unintentional attempts to recursively lock a mutex would no longer lead to deadlocks. This approach would also require changes to 'ptmalloc_unlock_all2' (which currently re-initializes the arena mutexes in the child process rather than unlocking them), since a return from the signal handler in the child process will attempt to unlock the previously held arena mutex. If the mutex has been re-initialized, that unlock call could result in undesirable behavior.

Eagerly awaiting comments and criticisms...

-- 
http://sourceware.org/bugzilla/show_bug.cgi?id=4737

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.