From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 6257 invoked by alias); 16 Nov 2016 13:40:26 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Received: (qmail 5771 invoked by uid 89); 16 Nov 2016 13:40:25 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=1.3 required=5.0 tests=BAYES_50,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM,SPF_PASS autolearn=no version=3.3.2 spammy=UD:The, commitdiff, elision, glibcgit X-HELO: mail-ua0-f175.google.com X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to:content-transfer-encoding; bh=Y0xCTOOFUIJGBSmgdicxiZ1709kdOYzlG2wyhjpsLlE=; b=dzrwZ+9Fdx/uCUU5yDWfUoLcW/62Fo7MpFxnfO2E7CTmW6cd20ylMZ8MVwkha87N/q W+IMFLsn0Eyewp7+11CRJQdZG243t0LutW2Ox1m0hYl+nRrWBEjm84tYD/dzli2kdx+E e7n1sIbjZIsPjjG7mTSFMP8D4RfDqVEukBmVIlX222zkACu+5XT2nqHXNsvvDH58oT1c i0q2vBioG7VUoG2NSxM/s+UJKn1M5SRIhCaaZeOl2I9DEuzyUEmWvXYHxscJk6SoGxLE cyoLZbSwegXEXXNBn5I4lPCwh6N+QlbU4FQUqofqVEX6OESbPjIXGezuqkSVrLtdHQCn PcbQ== X-Gm-Message-State: ABUngvcD3lJ+QscA0aaC12h3Vfd15p3LGxDmPoAyBWDCFn8CdR3THE4K7f9dd4o502sRtn3D X-Received: by 10.159.40.101 with SMTP id c92mr1685038uac.111.1479303613661; Wed, 16 Nov 2016 05:40:13 -0800 (PST) Subject: Re: BZ 20822 :powerpc: race condition in __lll_unlock_elision To: Rajalakshmi Srinivasaraghavan , libc-alpha@sourceware.org References: Cc: aaron Sawdey , Ulrich Weigand , Steve Munroe , carlos@redhat.com, adconrad@ubuntu.com, wschmidt@linux.vnet.ibm.com From: Adhemerval Zanella Message-ID: <113d3633-4425-4a56-492e-6aab3b75de2d@linaro.org> Date: Wed, 16 Nov 2016 13:40:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit X-SW-Source: 2016-11/txt/msg00546.txt.bz2 On 15/11/2016 14:12, Rajalakshmi Srinivasaraghavan wrote: > > The initial problem reported was memory corruption in MongoDB only on Ubuntu 16.04 ppc64le(not on Ubuntu 15). > The problem was not easily recreatable and debugging with many tools and creative ideas, the problem is narrowed down to lock elision in glibc. > Ubuntu 16.04 has glibc 2.23 with lock elision enabled by default which made the testcase fail only on 16.04. > > As stated in BZ 20822, short description is > > "The update of *adapt_count after the release of the lock causes a race condition. Thread A unlocks, thread B continues and returns, destroying mutex on stack, > then gets into another function, thread A writes to *adapt_count and corrupts stack.The window is very narrow and requires that the > machine be in smt mode, so likely the two threads have to be on the same core in order to hit this" > > Looking back the file changes in sysdeps/unix/sysv/linux/powerpc/elision-unlock.c, suspecting the commit > https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fadd2ad9cc36115440d50b0eae9299e65988917d which is introduced to improve > the performance. > > Thinking of the following fixes. > > 1) Update of adapt_count before the unlock. But I have still not identified if its going to affect the performance. > > 2) In elision-lock.c/elision-trylock.c update adapt_count right after the lock acquire at the end of the function. > In either case we have to be careful not to introduce another race condition. > > Any suggestions? > The fadd2ad9c commit indeed should have raised us a flags, since it tries to update a variable ('adapt_count' or pthread_mutex_t.__data.__elision) in a transaction, where it should be semantically handled as atomic access, and later in a non-atomic way. This is wrong and I think we should either revert the patch or move the adapt count update to 'inside' the transaction (your first option). However, I am not sure if this is the culprit of the issue. Based on bug report, it states the faulty scenario is: "[...] Thread A unlocks, thread B continues and returns, destroying mutex on stack, then gets into another function, thread A writes to *adapt_count and corrupts stack. [...]" It seems *after* thread B destroy the mutex, it was not clear that the program just calls 'pthread_mutex_destroy' or if the thread deallocates the mutex on the stack (by returning or something else). First option should not trigger an invalid access, but second is definitively an program error. Either way, I am also not sure if this current behaviour would lead to a correctness issue, but rather a performance one: the race 'adapt_count' usage would lead to TLE being avoided, not really invalid memory access.