From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.
To: libc-alpha@sourceware.org
From: Stefan Liebler
Date: Tue, 21 Mar 2017 15:43:00 -0000
Message-Id: <9c3fc2b3-57b6-b160-3f97-5ce3be05f4c0@linux.vnet.ibm.com>
References: <1481905917-15654-1-git-send-email-stli@linux.vnet.ibm.com>
 <5857CF10.1060100@arm.com>
 <628f6311-239c-5eea-572c-c2acae6fcbee@linux.vnet.ibm.com>
 <1487017743.16322.80.camel@redhat.com>
 <60a34645-17e4-6693-1343-03c55b0c47ad@linux.vnet.ibm.com>
 <1487437038.20203.68.camel@redhat.com>
 <25ad863b-6f20-bfb7-95e6-3b04a2b3eee8@linux.vnet.ibm.com>
 <1487598702.20203.138.camel@redhat.com>

On 03/14/2017 04:55 PM, Stefan Liebler wrote:
> On 02/20/2017 02:51 PM, Torvald Riegel wrote:
>> On Mon, 2017-02-20 at 13:15 +0100, Stefan Liebler wrote:
>>>>>> A while ago we tried hard to remove all code that would fail
>>>>>> silently when a macro had a typo in the name, for example.  We
>>>>>> still have a few of them left (e.g., in the atomics), but I think
>>>>>> this here doesn't warrant an exception.
>>>>>>
>>>>> Okay. You're right.
>>>>>
>>>>> In a comment on patch 2/2, you mentioned a header where an
>>>>> architecture shall define those parameters.
>>>>> Thus I've added the file nptl/pthread_spin_parameters.h.
>>>>
>>>> I think the stub version should be in sysdeps/, but the wiki is
>>>> missing this information (default ENOTSUP location):
>>>> https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location
>>>>
>>> In case we need a new header, shall we move it to sysdeps/nptl/ ?
>>
>> I would guess that this would be the right place for a stub / generic
>> variant of such a header, but I'm not quite sure.
>>
>>>> I'm still a bit undecided about how to deal with other archs'
>>>> pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
>>>> and then include the generic pthread_spin_lock.c.  Maybe it's best
>>>> if we just remove them in this patch of yours.
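Such a per-arch wrapper file was typically nothing more than an override
of the tuning macro followed by an include of the generic implementation.
A minimal sketch, with a placeholder architecture directory and an
illustrative read count (the concrete value differed per architecture):

  /* sysdeps/<arch>/nptl/pthread_spin_lock.c -- pre-patch wrapper, sketch
     only.  Defining the macro here silences the "machine-dependent file
     should define SPIN_LOCK_READS_BETWEEN_CMPXCHG" warning mentioned
     below.  */
  #define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
  #include <nptl/pthread_spin_lock.c>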
>>> I think the archs currently using the generic implementation have just
>>> copied the default value to get rid of the warning "machine-dependent
>>> file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG".  But this is only
>>> an assumption.
>>> In general, removing those "wrapper" pthread_spin_lock.c files and using
>>> information from a header like the proposed pthread_spin_parameters.h or
>>> atomic-machine.h is a good approach.
>>
>> Okay. So let's do that.
>>
> Done. The wrapper pthread_spin_lock.c files in sysdeps subfolders are
> removed and the option SPIN_LOCK_READS_BETWEEN_CMPXCHG is not available
> anymore.
> Instead there is only a loop of plain reads until we observe a
> not-acquired lock.
> See the comment in nptl/pthread_spin_lock.c.
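To make that concrete, a minimal sketch of such a lock loop using the
C11-like atomic macros; this only illustrates the approach (the 0 = free /
1 = acquired lock values and the exact shape of the loop are assumptions
here), it is not a copy of the attached patch:

  #include <atomic.h>      /* glibc-internal C11-like atomic macros.  */
  #include "pthreadP.h"

  int
  pthread_spin_lock (pthread_spinlock_t *lock)
  {
    int val = 0;

    /* First attempt to grab the lock; the discussion below is about
       whether this should be an exchange or a CAS.  */
    if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
      return 0;

    do
      {
        /* The lock is contended: spin on plain (relaxed MO) loads until
           the lock is observed to be free, instead of hammering the
           cache line with atomic read-modify-write operations.  */
        do
          val = atomic_load_relaxed (lock);
        while (val != 0);
      }
    /* The lock looks free; try to actually acquire it.  */
    while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));

    return 0;
  }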
>
>>>> I agree that there are architecture-specific properties of atomic
>>>> operations that have a significant effect on performance.  For s390,
>>>> you list the following points:
>>>> * atomic exchange (I suppose all atomic read-modify-write ops?) is
>>>>   implemented through CAS.
>>> atomic exchange is implemented by a CAS loop due to the lack of an
>>> exchange instruction.  For exchanging to "0", s390 can use the
>>> load-and-and instruction instead of a CAS loop (see my atomic-machine.h
>>> patch; for gcc, we plan to emit such a load-and-and instruction for the
>>> builtin __atomic_exchange_n in the future).  This also saves one
>>> register compared to the CAS loop.
>>> Exchanging to "0" is used, e.g., by the lll_unlock macro or in
>>> malloc_consolidate.
>>
>> I guess this may then be yet another property atomic-machine.h could
>> expose: whether an atomic fetch-and-and exists and is not implemented
>> through a CAS loop.  Something like lll_unlock would then do an exchange
>> or fetch-and-and, and if that fails, a CAS loop.
>>
>>> Starting with z196 zarch CPUs, the following instructions are used by
>>> the appropriate __atomic_fetch_xyz builtins instead of a CAS loop:
>>> load-and-add, load-and-and, load-and-exclusive-or, load-and-or.
>>> As information:
>>> I will update my atomic-machine.h patch to use at least some of the C11
>>> atomic builtins, or all of them, depending on the gcc version.
>>
>> Thanks.
>>
>>>> These properties are specific to the architecture, but they are not
>>>> specific to the synchronization code (e.g., to spinlocks).  Thus, I
>>>> think such settings should be in the atomics headers (i.e., in
>>>> atomic-machine.h).
>>>> This should probably be a separate patch.  I would propose the
>>>> following macros (both are bool):
>>>> /* Nonzero iff all atomic read-modify-write operations (e.g., atomic
>>>>    exchange) are implemented using a CAS loop.  */
>>>> #define ATOMIC_RMW_USES_CAS 1
>>> Is one macro enough to describe all read-modify-write operations in
>>> include/atomic.h?
>>> Please also see my comment above.
>>
>> Yes, it may not be enough (e.g., you say that s390 may have fetch_and
>> but not an exchange, but other archs may just have an exchange but no
>> custom fetch_and).
>> So maybe we should add ATOMIC_EXCHANGE_USES_CAS and
>> ATOMIC_FETCH_AND_USES_CAS for now, and add further ones on demand?
>>
>>>> /* Nonzero iff CAS always assumes it will store, and thus has cache
>>>>    performance effects similar to a store (e.g., it always puts the
>>>>    respective cacheline into an exclusive state, even if the
>>>>    comparison against the expected value fails).  */
>>>> ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1
>>>>
>>>> I'm not sure whether there are architectures for which the second
>>>> macro would not be true.  It would be good to investigate this; maybe
>>>> we don't need to add this at all.
>>>>
>>> We plan that in the future gcc will emit e.g. a load-and-test
>>> instruction in front of the CAS instruction if the old value is a
>>> constant zero.
>>
>> That can be useful, but what I outlined would be about a more generic
>> case.  If you are going to solve this for your arch in gcc, you might
>> not need to address this in this patch.
>>
>>>> For the spin lock specifically, we have the following possibilities:
>>>>
>>>> 1) If we assume that trylock will most likely succeed in practice:
>>>> * We just do an exchange.  The properties above don't matter.
>>>>
>>>> 2) If we want to bias towards cases where trylock succeeds, but don't
>>>>    rule out contention:
>>>> * If exchange is not a CAS loop, and exchange is faster than CAS, do
>>>>   an exchange.
>>>> * If exchange is a CAS loop, use a weak CAS and not an exchange so we
>>>>   bail out after the first failed attempt to change the state.
>>>>
>>>> 3) If we expect contention to be likely:
>>>> * If CAS always brings the cache line into an exclusive state, then
>>>>   load first.  Then do 2).
>>>>
>>>> Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
>>>> buys us anything over 2)?
>>>>
>>> Yes. Sounds good.
>>
>> I read this as stating that you agree that 1) doesn't need to be
>> considered.  But we still need to choose between 2) and 3) :)
>>
>> I'm not quite sure what to favor, because some of the code out there
>> that uses trylock just uses it to prevent deadlock (e.g., when trying to
>> acquire multiple locks at once), whereas other code uses spin_trylock to
>> implement their own back-off and bounded spinning.
>> I lean towards assuming that lock acquisitions, including the former use
>> case for trylock, are supposed to succeed in the common case.  That is,
>> I would pick 2), but I have no data to back this up.
>>
>> Either way, whatever we choose, we should polish and document the cases
>> 1-3 above and the reasoning behind our choice for one of them, and add
>> it as a comment to the spinlock code.
>>
> Okay. I've attached an updated patch.  It is now using case 2).
> This choice applies to pthread_spin_trylock.c and the first attempt to
> acquire the lock in pthread_spin_lock.c.
> Therefore I've introduced ATOMIC_EXCHANGE_USES_CAS for all architectures
> in the atomic-machine.h files.  There is a check in include/atomic.h
> which ensures that it is defined to either 0 or 1.  Can you please review
> the setting of 0 or 1?
>
> Bye Stefan

Ping
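For readers of the archive, what case 2) amounts to for pthread_spin_trylock
is roughly the following.  This is only a sketch of the approach described
above (the EBUSY return value and the 0/1 lock states are assumptions here),
not a copy of the attached patch:

  #include <atomic.h>      /* glibc-internal C11-like atomic macros.  */
  #include <errno.h>       /* EBUSY.  */
  #include "pthreadP.h"

  int
  pthread_spin_trylock (pthread_spinlock_t *lock)
  {
  #if ! ATOMIC_EXCHANGE_USES_CAS
    /* Exchange is a real exchange instruction: a single attempt is cheap
       and either acquires the lock or observes that it is taken.  */
    return atomic_exchange_acquire (lock, 1) == 0 ? 0 : EBUSY;
  #else
    /* Exchange is built from a CAS loop: use a weak CAS directly so we
       bail out after the first failed attempt instead of looping.  */
    int val = 0;
    return atomic_compare_exchange_weak_acquire (lock, &val, 1) ? 0 : EBUSY;
  #endif
  }

A guard along these lines in include/atomic.h (exact wording assumed) makes
a missing or misspelled definition a compile error instead of a silent
fallback:

  #if !defined (ATOMIC_EXCHANGE_USES_CAS) \
      || (ATOMIC_EXCHANGE_USES_CAS != 0 && ATOMIC_EXCHANGE_USES_CAS != 1)
  # error ATOMIC_EXCHANGE_USES_CAS must be defined to 0 or 1
  #endif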