From: Oleh Derevenko
Date: Wed, 3 Nov 2021 19:26:52 +0200
Subject: Re: [PATCH] x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]
To: Arjan van de Ven
Cc: "H.J. Lu", libc-alpha@sourceware.org

Arjan,

> What the patch does is check non-atomically first whether the actual
> atomic operation has a chance of working. If it has a chance, the
> actual normal atomic operation is done as before. But if the
> non-atomic read already tells you the cmpxchg has no chance to
> succeed, it errors out early.

The idea of atomic functions is that they are intended to work
correctly with any type of memory. In your case, the speculative read
on cached device memory may be satisfied from the cache alone and
never fetch the update from the device, making the reading thread
"see" the change later than it otherwise could. If you want a
RAM-specific version of compare-and-exchange, give it a distinct,
specific name.
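For reference, the pattern under discussion is roughly the following.
This is a minimal sketch only, not the actual glibc change; the
function name is made up, and a relaxed atomic load stands in for the
"plain read" (on x86 it compiles to the same ordinary MOV, and it
keeps the example well-defined C):

#include <stdatomic.h>

/* Fail fast if the cmpxchg cannot possibly succeed, so contended
   callers do not take the cacheline exclusive just to fail.  */
static inline int
cas_acq_with_early_check (atomic_int *mem, int expected, int desired)
{
  /* Read-only check: if the value already differs from `expected`,
     the locked cmpxchg below is guaranteed to fail.  */
  if (atomic_load_explicit (mem, memory_order_relaxed) != expected)
    return 0;

  /* The value still matches: perform the normal atomic operation,
     exactly as before the optimization.  */
  return atomic_compare_exchange_strong_explicit
    (mem, &expected, desired,
     memory_order_acquire, memory_order_acquire);
}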
On Wed, Nov 3, 2021 at 7:00 PM Arjan van de Ven wrote:
>
> On 11/3/2021 8:50 AM, Oleh Derevenko wrote:
> > Hi, H.J. Lu
> >
> > You may not perform plain reads on values you want to be atomic.
> > This results in undefined behavior.
>
> So the way the patch works is that it does not DEPEND on that read
> being atomic. What the patch does is check non-atomically first
> whether the actual atomic operation has a chance of working. If it
> has a chance, the actual normal atomic operation is done as before.
> But if the non-atomic read already tells you the cmpxchg has no
> chance to succeed, it errors out early.
>
> The big gain is in the contended lock case (note the _acq suffix!).
> Say there are four threads spinning on a lock. Before this patch,
> those four CPU cores would take turns bouncing the cacheline around
> very aggressively, which degrades the whole system and, worse, also
> makes the core that will eventually unlock the lock wait for the
> cacheline.
>
> With the patch, "it is locked already" is noticed before the
> cacheline is taken exclusive, so all four spinning cores hold the
> same cacheline in a shared state -- no pingponging. The core that is
> going to unlock the lock can then acquire the cacheline exclusively
> without having to fight those four cores for it.
>
> > For example, the compiler IS NOT obliged to perform the read with a
> > single CPU instruction. Of course it will not, but it is allowed to
> > read the value in two halves and compare them separately. Or it may
> > reuse a cached value from previous evaluations.
> >
> > And that is only the compiler-level issue. Similar issues arise at
> > the CPU level with memory coherency, caching and instruction
> > reordering.
>
> The CPU in this case won't; the x86 memory model does not allow that
> (and this is in the x86 implementation code).
>
> > Or, if the value crossed a cache line boundary, the plain read
> > might return a half-updated value, with the part from one cache
> > line being new and the other part being old.
>
> (I can't say in polite company what cmpxchg across cache lines does.)

--
Oleh Derevenko
-- Skype with underscore
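To make the cacheline argument in the quoted thread concrete: the
classic "test-and-test-and-set" spinlock uses the same idea. The
sketch below is illustrative only (it is not the glibc code, and the
names are invented). While the lock is held, every waiter spins in the
read-only inner loop, so the line stays shared in all of their caches;
the exclusive acquisition is attempted only once the lock is observed
free.

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void
spin_lock (spinlock_t *l)
{
  for (;;)
    {
      /* "Test": read-only spin; the cacheline stays shared, so the
         waiters do not steal it from the eventual unlocker.  */
      while (atomic_load_explicit (&l->locked, memory_order_relaxed) != 0)
        /* spin */;

      /* "Test-and-set": only now take the line exclusive and try to
         grab the lock for real.  */
      if (atomic_exchange_explicit (&l->locked, 1,
                                    memory_order_acquire) == 0)
        return;
    }
}

static void
spin_unlock (spinlock_t *l)
{
  atomic_store_explicit (&l->locked, 0, memory_order_release);
}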