From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <glibc-bugs-return-12524-listarch-glibc-bugs=sources.redhat.com@sourceware.org>
Received: (qmail 13380 invoked by alias); 14 Apr 2011 06:33:34 -0000
Received: (qmail 13371 invoked by uid 22791); 14 Apr 2011 06:33:33 -0000
X-SWARE-Spam-Status: No, hits=-2.7 required=5.0	tests=ALL_TRUSTED,AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO sourceware.org) (127.0.0.1)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Thu, 14 Apr 2011 06:33:27 +0000
From: "dhatch at ilm dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sources.redhat.com
Subject: [Bug nptl/12674] New: sem_post/sem_wait race causing sem_post to return EINVAL
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Keywords:
X-Bugzilla-Severity: critical
X-Bugzilla-Who: dhatch at ilm dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: drepper.fsp at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID: <bug-12674-131@http.sourceware.org/bugzilla/>
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Date: Thu, 14 Apr 2011 06:33:00 -0000
Mailing-List: contact glibc-bugs-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <glibc-bugs.sourceware.org>
List-Subscribe: <mailto:glibc-bugs-subscribe@sourceware.org>
List-Post: <mailto:glibc-bugs@sourceware.org>
List-Help: <mailto:glibc-bugs-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: glibc-bugs-owner@sourceware.org
X-SW-Source: 2011-04/txt/msg00053.txt.bz2

http://sourceware.org/bugzilla/show_bug.cgi?id=12674

           Summary: sem_post/sem_wait race causing sem_post to return
                    EINVAL
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: critical
          Priority: P2
         Component: nptl
        AssignedTo: drepper.fsp@gmail.com
        ReportedBy: dhatch@ilm.com


Created attachment 5671
  --> http://sourceware.org/bugzilla/attachment.cgi?id=5671
the test program, to be run in gdb as described

There appears to be a race in the implementation of sem_post/sem_wait on AMD64
(nptl/sysdeps/unix/sysv/linux/x86_64/sem_post.S in the source code)
which sometimes causes sem_post to access freed memory
and to fail with EINVAL.

In a nutshell, if the thread inside sem_post happens to be descheduled
right after it increments sem->value
but before it looks at sem->nwaiters,
another thread can sail through a sem_wait without blocking
and destroy the semaphore,
so that when the sem_post thread resumes and looks at sem->nwaiters,
it is looking at already-freed (and possibly unmapped) memory.

The bug was originally filed as gentoo bug 93366
( http://bugs.gentoo.org/show_bug.cgi?id=93366 ).

It's extremely hard to reproduce,
and I don't have a simple program that can demonstrate the problem reliably
by just running it (for less than a million years).
But it can be reproduced consistently 
either by hacking up the sem_post source code
and adding a sleep() at a crucial point,
or by carefully stopping and resuming the threads
in a debugger with thread-specific breakpoints.
I'll include instructions for doing the latter using gdb >=7.1.

We're observing the problem on an AMD64 machine
running RHEL5.3 Linux,
with glibc-2.5-34.el5_3.1
and gcc-4.1.2-44.el5.
I know that is ancient,
but I also downloaded the most current glibc source code today
and compiled the sem_post.S and sem_wait.S from it,
and I can still reproduce the problem using those.


Here are the instructions for reproducing the problem
using gdb 7.1 or 7.2 on the attached program
(gdb 7.0.1 and earlier fail with a supposed syntax error
on the "b *(sem_post+18) thread 3").


% gcc -Wall -g semtest.c -lpthread -o semtest
% gdb ./semtest

    # per http://sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html ...
    # Enable the async interface.
    set target-async 1
    # If using the CLI, pagination breaks non-stop.
    set pagination off
    # Finally, turn it on!
    set non-stop on 

    b waiter
    b poster
    r
        # thread 2 stops in waiter
        # thread 3 stops in poster

    t 2
    b sem_wait thread 2 
    c
        # thread 2 (waiter) stops at the beginning of sem_wait(varsem)

    disas sem_post 
        # look for the "cmpq $0x0,0x8(%rdi)" and put a breakpoint there.
        # in older versions it's sem_post+4;
        # in newer versions it's sem_post+18.
    t 3
    b *(sem_post+18) thread 3
        # (or sem_post+4, or whatever offset the disassembly showed)
    c
        # thread 3 (poster) stops at the breakpoint inside sem_post,
        # after incrementing varsem->value
        # (4-byte value 0 bytes into the object)
        # but before looking at varsem->nwaiters
        # (8-byte value 8 bytes into the object)

    t 2
    b free thread 2
    c
        # thread 2 (waiter) sails through the sem_wait without blocking,
        # calls sem_destroy(varsem),
        # trashes the memory,
        # and stops at the beginning of free

    t 3
    c
        # thread 3 (poster) resumes in the middle of sem_post,
        # looks at varsem->nwaiters and sees it's nonzero (trash)
        # so it makes the FUTEX_WAKE syscall which returns EINVAL,
        # the program exits with error message
        # "sem_post() in poster: Invalid argument"


I hope I am not overinflating this bug's severity by calling
it "critical" ("major" would feel more appropriate to me,
but there seems to be no "major" option, only "normal" and "critical").
Although failure is rare,
we are about to be forced to implement our own semaphores
rather than using the posix semaphores because of this bug,
so it does seem rather severe.
