From: "m.k.edwards at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/48126] arm_output_sync_loop: misplaced memory barrier, missing clrex / dummy strex
Date: Tue, 24 May 2011 17:05:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48126

--- Comment #4 from Michael K. Edwards 2011-05-24 16:38:41 UTC ---
OK, that's a clear explanation of why the DMB is necessary in the case where both the compare and the store succeed (neither branch is taken; at a higher semantic level, a lock is acquired, if that's what the atomic is being used for). For future reference, I would appreciate having those ARM ARM quotations, along with some indication of how load scheduling interacts with a branch past a memory barrier.

Suppose that the next instruction after label "2" is a load. On some ARMv7 implementations, I presume that this load can get issued speculatively as early as label "1", due to the "bne 2f" branch shadow, which skips the trailing dmb. I gather that the intention is that, if this branch is not taken (and thus we execute through the trailing dmb), the fetch results from the branch shadow should be discarded, and the load re-issued (with, in a multi-core device, the appropriate ordering guarantee with respect to the strex).

If this interpretation is more or less right, and the shipping silicon behaves as intended, then the branch past the dmb may be harmless -- although I might argue that it wastes memory bandwidth in what is usually the common case (compare-and-swap succeeds) in exchange for a slight reduction in latency in the other case.

Yet that's still not quite the documented semantics of the GCC compare-and-swap primitive, which is supposed to be totally ordered whether or not the swap succeeds. When I write a lock-free algorithm, I may well rely on the guarantee that the value read in the next line of code was actually fetched after the value fetched by the ldrex. In fact, I have code that does rely on this guarantee; if thread A sees that thread B has altered the atomic location, then it expects to be able to see all data that thread B wrote before issuing its compare-and-swap.

Here's the problem case:

thread A:                   thread B:

dmb
                            store Y
                            dmb
                            ldrex X
                            cmp
                            bne (doesn't branch)
                            strex X
ldrex X
cmp
bne (branches)
load Y (speculated above)

Is there something I'm not seeing that prevents this?
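
For concreteness, the problem case can be written at the C level roughly as below. This is only a minimal sketch under stated assumptions, not code from the bug report: X and Y are the locations from the diagram, while thread_a, thread_b, run_handoff, the payload value 42, the use of pthreads, and the 0 -> 0 compare-and-swap used to poll X are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical layout: X is the atomic location, Y the data it publishes. */
static int X;   /* 0 = not yet published, 1 = published by thread B */
static int Y;   /* plain data that thread B writes before its CAS */

/* Thread B: store Y, then compare-and-swap X from 0 to 1 (swap succeeds). */
static void *thread_b(void *arg)
{
    (void)arg;
    Y = 42;     /* arbitrary payload value */
    int old = __sync_val_compare_and_swap(&X, 0, 1);
    assert(old == 0);
    (void)old;
    return NULL;
}

/* Thread A: compare-and-swap on X until the compare fails, then load Y. */
static void *thread_a(void *arg)
{
    int *seen = arg;
    /* While X is still 0 the compare succeeds and 0 is stored back,
     * leaving X unchanged. Once B has altered X, the compare fails;
     * this is the "bne (branches)" path in the diagram. */
    while (__sync_val_compare_and_swap(&X, 0, 0) == 0)
        ;
    /* The load that must not be satisfied from before the CAS. */
    *seen = Y;
    return NULL;
}

/* Runs both threads once and returns the value of Y that thread A saw. */
int run_handoff(void)
{
    pthread_t a, b;
    int seen = -1;
    int rc;

    X = 0;
    Y = 0;
    rc = pthread_create(&a, NULL, thread_a, &seen);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, thread_b, NULL);
    assert(rc == 0);
    rc = pthread_join(a, NULL);
    assert(rc == 0);
    rc = pthread_join(b, NULL);
    assert(rc == 0);
    (void)rc;
    return seen;
}
```

If the compare-and-swap is a full barrier on the failure path as well, as the GCC documentation describes, run_handoff() should return 42. A return of 0 would correspond to the speculated load of Y in the diagram. A single run passing proves nothing about the ordering, of course; the sketch only pins down which guarantee is being relied on.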