From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.
To: libc-alpha@sourceware.org
From: Stefan Liebler
Date: Tue, 21 Mar 2017 15:43:00 -0000
Message-Id: <9c3fc2b3-57b6-b160-3f97-5ce3be05f4c0@linux.vnet.ibm.com>
References: <1481905917-15654-1-git-send-email-stli@linux.vnet.ibm.com>
 <5857CF10.1060100@arm.com>
 <628f6311-239c-5eea-572c-c2acae6fcbee@linux.vnet.ibm.com>
 <1487017743.16322.80.camel@redhat.com>
 <60a34645-17e4-6693-1343-03c55b0c47ad@linux.vnet.ibm.com>
 <1487437038.20203.68.camel@redhat.com>
 <25ad863b-6f20-bfb7-95e6-3b04a2b3eee8@linux.vnet.ibm.com>
 <1487598702.20203.138.camel@redhat.com>

On 03/14/2017 04:55 PM, Stefan Liebler wrote:
> On 02/20/2017 02:51 PM, Torvald Riegel wrote:
>> On Mon, 2017-02-20 at 13:15 +0100, Stefan Liebler wrote:
>>>>>> A while ago we tried hard to remove all code that would fail
>>>>>> silently when a macro had a typo in the name, for example.  We
>>>>>> still have a few of them left (e.g., in the atomics), but I think
>>>>>> this here doesn't warrant an exception.
>>>>>>
>>>>> Okay. You're right.
>>>>>
>>>>> In a comment on patch 2/2, you mentioned a header where an
>>>>> architecture shall define those parameters.
>>>>> Thus I've added the file nptl/pthread_spin_parameters.h.
>>>>
>>>> I think the stub version should be in sysdeps/, but the wiki is
>>>> missing this information (default ENOTSUP location):
>>>> https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location
>>>>
>>> In case we need a new header, shall we move it to sysdeps/nptl/ ?
>>
>> I would guess that this would be the right place for a stub / generic
>> variant of such a header, but I'm not quite sure.
>>
>>>> I'm still a bit undecided about how to deal with other archs'
>>>> pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
>>>> and then include the generic pthread_spin_lock.c.  Maybe it's best
>>>> if we just remove them in this patch of yours.
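Such a per-arch wrapper file was typically nothing more than an override
of the tuning macro followed by an include of the generic implementation.
A minimal sketch, with a placeholder architecture directory and an
illustrative read count (the concrete value differed per architecture):

  /* sysdeps/<arch>/nptl/pthread_spin_lock.c -- pre-patch wrapper, sketch
     only.  Defining the macro here silences the "machine-dependent file
     should define SPIN_LOCK_READS_BETWEEN_CMPXCHG" warning mentioned
     below.  */
  #define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
  #include <nptl/pthread_spin_lock.c>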
>>> I think the archs currently using the generic implementation have just
>>> copied the default value to get rid of the warning "machine-dependent
>>> file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG".  But this is only
>>> an assumption.
>>> In general, removing those "wrapper" pthread_spin_lock.c files and using
>>> information from a header like the proposed pthread_spin_parameters.h or
>>> atomic-machine.h is a good approach.
>>
>> Okay. So let's do that.
>>
> Done. The wrapper pthread_spin_lock.c files in sysdeps subfolders are
> removed and the option SPIN_LOCK_READS_BETWEEN_CMPXCHG is not available
> anymore.
> Instead there is only a loop of plain reads until we observe a
> not-acquired lock.
> See the comment in nptl/pthread_spin_lock.c.
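To make that concrete, a minimal sketch of such a lock loop using the
C11-like atomic macros; this only illustrates the approach (the 0 = free /
1 = acquired lock values and the exact shape of the loop are assumptions
here), it is not a copy of the attached patch:

  #include <atomic.h>      /* glibc-internal C11-like atomic macros.  */
  #include "pthreadP.h"

  int
  pthread_spin_lock (pthread_spinlock_t *lock)
  {
    int val = 0;

    /* First attempt to grab the lock; the discussion below is about
       whether this should be an exchange or a CAS.  */
    if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
      return 0;

    do
      {
        /* The lock is contended: spin on plain (relaxed MO) loads until
           the lock is observed to be free, instead of hammering the
           cache line with atomic read-modify-write operations.  */
        do
          val = atomic_load_relaxed (lock);
        while (val != 0);
      }
    /* The lock looks free; try to actually acquire it.  */
    while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));

    return 0;
  }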
>
>>>> I agree that there are architecture-specific properties of atomic
>>>> operations that have a significant effect on performance.  For s390,
>>>> you list the following points:
>>>> * atomic exchange (I suppose all atomic read-modify-write ops?) is
>>>>   implemented through CAS.
>>> atomic exchange is implemented by a CAS loop due to the lack of an
>>> exchange instruction.  For exchanging to "0", s390 can use the
>>> load-and-and instruction instead of a CAS loop (see my atomic-machine.h
>>> patch; for gcc, we plan to emit such a load-and-and instruction for the
>>> builtin __atomic_exchange_n in the future).  This also saves one
>>> register compared to the CAS loop.
>>> Exchanging to "0" is used, e.g., by the lll_unlock macro or in
>>> malloc_consolidate.
>>
>> I guess this may then be yet another property atomic-machine.h could
>> expose: whether an atomic fetch-and-and exists and is not implemented
>> through a CAS loop.  Something like lll_unlock would then do an exchange
>> or fetch-and-and, and if that fails, a CAS loop.
>>
>>> Starting with z196 zarch CPUs, the following instructions are used by
>>> the appropriate __atomic_fetch_xyz builtins instead of a CAS loop:
>>> load-and-add, load-and-and, load-and-exclusive-or, load-and-or.
>>> As information:
>>> I will update my atomic-machine.h patch to use at least some of the C11
>>> atomic builtins, or all of them, depending on the gcc version.
>>
>> Thanks.
>>
>>>> These properties are specific to the architecture, but they are not
>>>> specific to the synchronization code (e.g., to spinlocks).  Thus, I
>>>> think such settings should be in the atomics headers (i.e., in
>>>> atomic-machine.h).
>>>> This should probably be a separate patch.  I would propose the
>>>> following macros (both are bool):
>>>> /* Nonzero iff all atomic read-modify-write operations (e.g., atomic
>>>>    exchange) are implemented using a CAS loop.  */
>>>> #define ATOMIC_RMW_USES_CAS 1
>>> Is one macro enough to describe all read-modify-write operations in
>>> include/atomic.h?
>>> Please also see my comment above.
>>
>> Yes, it may not be enough (e.g., you say that s390 may have fetch_and
>> but not an exchange, but other archs may just have an exchange but no
>> custom fetch_and).
>> So maybe we should add ATOMIC_EXCHANGE_USES_CAS and
>> ATOMIC_FETCH_AND_USES_CAS for now, and add further ones on demand?
>>
>>>> /* Nonzero iff CAS always assumes it will store, and thus has cache
>>>>    performance effects similar to a store (e.g., it always puts the
>>>>    respective cacheline into an exclusive state, even if the
>>>>    comparison against the expected value fails).  */
>>>> ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1
>>>>
>>>> I'm not sure whether there are architectures for which the second
>>>> macro would not be true.  It would be good to investigate this; maybe
>>>> we don't need to add this at all.
>>>>
>>> We plan that in the future gcc will emit e.g. a load-and-test
>>> instruction in front of the CAS instruction if the old value is a
>>> constant zero.
>>
>> That can be useful, but what I outlined would be about a more generic
>> case.  If you are going to solve this for your arch in gcc, you might
>> not need to address this in this patch.
>>
>>>> For the spin lock specifically, we have the following possibilities:
>>>>
>>>> 1) If we assume that trylock will most likely succeed in practice:
>>>> * We just do an exchange.  The properties above don't matter.
>>>>
>>>> 2) If we want to bias towards cases where trylock succeeds, but don't
>>>>    rule out contention:
>>>> * If exchange is not a CAS loop, and exchange is faster than CAS, do
>>>>   an exchange.
>>>> * If exchange is a CAS loop, use a weak CAS and not an exchange so we
>>>>   bail out after the first failed attempt to change the state.
>>>>
>>>> 3) If we expect contention to be likely:
>>>> * If CAS always brings the cache line into an exclusive state, then
>>>>   load first.  Then do 2).
>>>>
>>>> Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
>>>> buys us anything over 2)?
>>>>
>>> Yes. Sounds good.
>>
>> I read this as stating that you agree that 1) doesn't need to be
>> considered.  But we still need to choose between 2) and 3) :)
>>
>> I'm not quite sure what to favor, because some of the code out there
>> that uses trylock just uses it to prevent deadlock (e.g., when trying to
>> acquire multiple locks at once), whereas other code uses spin_trylock to
>> implement their own back-off and bounded spinning.
>> I lean towards assuming that lock acquisitions, including the former use
>> case for trylock, are supposed to succeed in the common case.  That is,
>> I would pick 2), but I have no data to back this up.
>>
>> Either way, whatever we choose, we should polish and document the cases
>> 1-3 above and the reasoning behind our choice for one of them, and add
>> it as a comment to the spinlock code.
>>
> Okay. I've attached an updated patch.  It is now using case 2).
> This choice applies to pthread_spin_trylock.c and the first attempt to
> acquire the lock in pthread_spin_lock.c.
> Therefore I've introduced ATOMIC_EXCHANGE_USES_CAS for all architectures
> in the atomic-machine.h files.  There is a check in include/atomic.h
> which ensures that it is defined to either 0 or 1.  Can you please review
> the setting of 0 or 1?
>
> Bye Stefan

Ping
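For readers of the archive, what case 2) amounts to for pthread_spin_trylock
is roughly the following.  This is only a sketch of the approach described
above (the EBUSY return value and the 0/1 lock states are assumptions here),
not a copy of the attached patch:

  #include <atomic.h>      /* glibc-internal C11-like atomic macros.  */
  #include <errno.h>       /* EBUSY.  */
  #include "pthreadP.h"

  int
  pthread_spin_trylock (pthread_spinlock_t *lock)
  {
  #if ! ATOMIC_EXCHANGE_USES_CAS
    /* Exchange is a real exchange instruction: a single attempt is cheap
       and either acquires the lock or observes that it is taken.  */
    return atomic_exchange_acquire (lock, 1) == 0 ? 0 : EBUSY;
  #else
    /* Exchange is built from a CAS loop: use a weak CAS directly so we
       bail out after the first failed attempt instead of looping.  */
    int val = 0;
    return atomic_compare_exchange_weak_acquire (lock, &val, 1) ? 0 : EBUSY;
  #endif
  }

A guard along these lines in include/atomic.h (exact wording assumed) makes
a missing or misspelled definition a compile error instead of a silent
fallback:

  #if !defined (ATOMIC_EXCHANGE_USES_CAS) \
      || (ATOMIC_EXCHANGE_USES_CAS != 0 && ATOMIC_EXCHANGE_USES_CAS != 1)
  # error ATOMIC_EXCHANGE_USES_CAS must be defined to 0 or 1
  #endif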