From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <pc@us.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id 045703858D35
 for <libc-alpha@sourceware.org>; Thu, 11 Nov 2021 00:30:29 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 045703858D35
Received: from pps.filterd (m0098416.ppops.net [127.0.0.1])
 by mx0b-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 1AANCq3c004749; 
 Thu, 11 Nov 2021 00:30:25 GMT
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0b-001b2d01.pphosted.com with ESMTP id 3c8qjch661-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 11 Nov 2021 00:30:25 +0000
Received: from m0098416.ppops.net (m0098416.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 1AB0E8cI011480;
 Thu, 11 Nov 2021 00:30:25 GMT
Received: from ppma03wdc.us.ibm.com (ba.79.3fa9.ip4.static.sl-reverse.com
 [169.63.121.186])
 by mx0b-001b2d01.pphosted.com with ESMTP id 3c8qjch65n-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 11 Nov 2021 00:30:25 +0000
Received: from pps.filterd (ppma03wdc.us.ibm.com [127.0.0.1])
 by ppma03wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 1AB0D2Dq013412;
 Thu, 11 Nov 2021 00:30:24 GMT
Received: from b01cxnp23034.gho.pok.ibm.com (b01cxnp23034.gho.pok.ibm.com
 [9.57.198.29]) by ppma03wdc.us.ibm.com with ESMTP id 3c5hbcc015-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 11 Nov 2021 00:30:24 +0000
Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com
 [9.57.199.107])
 by b01cxnp23034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 1AB0UN9R43188638
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Thu, 11 Nov 2021 00:30:23 GMT
Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id C42A7124052;
 Thu, 11 Nov 2021 00:30:23 +0000 (GMT)
Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 3C896124062;
 Thu, 11 Nov 2021 00:30:23 +0000 (GMT)
Received: from li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com (unknown
 [9.65.79.57]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTPS;
 Thu, 11 Nov 2021 00:30:23 +0000 (GMT)
Date: Wed, 10 Nov 2021 18:30:21 -0600
From: "Paul A. Clarke" <pc@us.ibm.com>
To: "H.J. Lu" <hjl.tools@gmail.com>
Cc: Paul E Murphy <murphyp@linux.ibm.com>,
 GNU C Library <libc-alpha@sourceware.org>,
 Florian Weimer <fweimer@redhat.com>,
 Andreas Schwab <schwab@linux-m68k.org>,
 Arjan van de Ven <arjan@linux.intel.com>
Subject: Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
Message-ID: <20211111003021.GH4930@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>
References: <20211110001614.2087610-1-hjl.tools@gmail.com>
 <d12b76f2-a810-d58d-4b7c-844a7b0a689b@linux.ibm.com>
 <20211110200722.GF4930@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>
 <CAMe9rOrCeYN7SG9ob+n1jFJNvWcYqdt+Vc5tAXnfuZDo7AASgA@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAMe9rOrCeYN7SG9ob+n1jFJNvWcYqdt+Vc5tAXnfuZDo7AASgA@mail.gmail.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: S2O4EXqIq9gqz2jJH6ibwU0pDBHVWyeV
X-Proofpoint-ORIG-GUID: oYGoOKOqv5drsvKURlnp17a8lY9omNpS
X-Proofpoint-UnRewURL: 1 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.790,Hydra:6.0.425,FMLib:17.0.607.475
 definitions=2021-11-10_14,2021-11-08_02,2020-04-07_01
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 spamscore=0
 lowpriorityscore=0 phishscore=0 mlxlogscore=999 impostorscore=0
 suspectscore=0 priorityscore=1501 bulkscore=0 mlxscore=0 clxscore=1015
 adultscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx
 scancount=1 engine=8.12.0-2110150000 definitions=main-2111100116
X-Spam-Status: No, score=-4.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 11 Nov 2021 00:30:32 -0000

On Wed, Nov 10, 2021 at 01:33:26PM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf 
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > Are you able to share benchmarks of this change? I am curious what effects
> > > this might have on other platforms.
> >
> > I'd like to see the expected performance results, too.
> >
> > For me, the results are not uniformly positive (Power10).
> > From bench-pthread-locks:
> >
> >                          bench   bench-patched
> > mutex-empty              4.73371 4.54792   3.9%
> > mutex-filler             18.5395 18.3419   1.1%
> > mutex_trylock-empty      10.46   2.46364  76.4%
> > mutex_trylock-filler     16.2188 16.1758   0.3%
> > rwlock_read-empty        16.5118 16.4681   0.3%
> > rwlock_read-filler       20.68   20.4416   1.2%
> > rwlock_tryread-empty     2.06572 2.17284  -5.2%
> > rwlock_tryread-filler    16.082  16.1215  -0.2%
> > rwlock_write-empty       31.3723 31.259    0.4%
> > rwlock_write-filler      41.6492 69.313  -66.4%
> > rwlock_trywrite-empty    2.20584 2.32178  -5.3%
> > rwlock_trywrite-filler   15.7044 15.9088  -1.3%
> > spin_lock-empty          16.7964 16.7731   0.1%
> > spin_lock-filler         20.6118 20.4175   0.9%
> > spin_trylock-empty       8.99989 8.98879   0.1%
> > spin_trylock-filler      16.4732 15.9957   2.9%
> > sem_wait-empty           15.805  15.7391   0.4%
> > sem_wait-filler          19.2346 19.5098  -1.4%
> > sem_trywait-empty        2.06405 2.03782   1.3%
> > sem_trywait-filler       15.921  15.8408   0.5%
> > condvar-empty            1385.84 1387.29  -0.1%
> > condvar-filler           1419.82 1424.01  -0.3%
> > consumer_producer-empty  2550.01 2395.29   6.1%
> > consumer_producer-filler 2709.4  2558.28   5.6%
> 
> Small regressions on uncontended locks are expected due to extra
> check.   What do you get with my current branch
> 
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl

                         bench   bench-hjl
mutex-empty              4.73371 4.65279   1.7%
mutex-filler             18.5395 18.3971   0.8%
mutex_trylock-empty      10.46   10.1671   2.8%
mutex_trylock-filler     16.2188 16.7105  -3.0%
rwlock_read-empty        16.5118 16.4697   0.3%
rwlock_read-filler       20.68   20.0416   3.1%
rwlock_tryread-empty     2.06572 2.038     1.3%
rwlock_tryread-filler    16.082  15.7182   2.3%
rwlock_write-empty       31.3723 31.1147   0.8%
rwlock_write-filler      41.6492 69.8115 -67.6%
rwlock_trywrite-empty    2.20584 2.32175  -5.3%
rwlock_trywrite-filler   15.7044 15.86    -1.0%
spin_lock-empty          16.7964 16.4342   2.2%
spin_lock-filler         20.6118 20.3916   1.1%
spin_trylock-empty       8.99989 8.98884   0.1%
spin_trylock-filler      16.4732 16.1979   1.7%
sem_wait-empty           15.805  15.7558   0.3%
sem_wait-filler          19.2346 19.2554  -0.1%
sem_trywait-empty        2.06405 2.03789   1.3%
sem_trywait-filler       15.921  15.7884   0.8%
condvar-empty            1385.84 1341.96   3.2%
condvar-filler           1419.82 1343.06   5.4%
consumer_producer-empty  2550.01 2446.33   4.1%
consumer_producer-filler 2709.4  2659.59   1.8%

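...still one very bad outlier (rwlock_write-filler, -67.6%), and a few
of concern (e.g. rwlock_trywrite-empty at -5.3% and mutex_trylock-filler
at -3.0%).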

> BTW, how did you compare the 2 results?  I tried compare_bench.py
> and got
> 
> Traceback (most recent call last):
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
> line 196, in <module>
>     main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
> line 165, in main
>     bench1 = bench.parse_bench(bench1, schema)
>   File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py",
> line 137, in parse_bench
>     bench = json.load(benchfile)
>   File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
>     return loads(fp.read(),
>   File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
>     raise JSONDecodeError("Extra data", s, end)
> json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)

I did it the old-fashioned way, in a spreadsheet.  :-)
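
For reference, the percentage column in the tables is consistent with
(base - patched) / base, so a positive number means the patched build
took less time.  A minimal Python sketch of that spreadsheet formula
(the function name is only illustrative, and the two rows are copied
from the bench-hjl table above):

```python
def improvement_pct(base: float, patched: float) -> float:
    """Percent change relative to base; positive means patched is lower (faster)."""
    return (base - patched) / base * 100.0


# Two rows copied from the bench / bench-hjl table.
rows = {
    "mutex-empty": (4.73371, 4.65279),
    "rwlock_write-filler": (41.6492, 69.8115),
}

for name, (base, patched) in rows.items():
    pct = improvement_pct(base, patched)
    print(f"{name:<24} {base:<8g}{patched:<8g}{pct:6.1f}%")
```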

I see the same errors you see with compare_bench.py.

Upon further investigation, compare_bench.py expects input in the form
produced by "make bench"; the output from running the benchtest directly
is not in that form.  With the respective outputs from
"make BENCHSET=bench-pthread bench", it runs:
--
$ ./benchtests/scripts/compare_bench.py --threshold 2 --stats mean A.out B.out
[snip]
+++ thread_create(stack=1024,guard=2)[mean]: (2.15%) from 372674 to 364660
+++ thread_create(stack=2048,guard=1)[mean]: (4.88%) from 377835 to 359396
+++ thread_create(stack=2048,guard=2)[mean]: (3.58%) from 377306 to 363798
+++ pthread_locks(mutex-empty)[mean]: (4.27%) from 4.85936 to 4.65185
--- pthread_locks(mutex_trylock-filler)[mean]: (3.09%) from 16.0579 to 16.5533
--- pthread_locks(rwlock_write-filler)[mean]: (56.90%) from 44.4255 to 69.7047
--- pthread_locks(rwlock_trywrite-empty)[mean]: (6.73%) from 2.17594 to 2.32244
+++ pthread_locks(spin_lock-empty)[mean]: (2.17%) from 16.8086 to 16.4436
--- pthread_locks(spin_trylock-filler)[mean]: (2.34%) from 16.1119 to 16.4896
+++ pthread_locks(consumer_producer-empty)[mean]: (2.94%) from 2531.95 to 2457.48
--
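
As an aside, the "Extra data" traceback is what Python's json module
reports when it is handed a fragment rather than a complete document.
The sketch below assumes the direct benchtest output starts with an
indented key such as "pthread_locks" (I have not checked that against
the actual output; the fragment and its value are made up for
illustration):

```python
import json

# Hypothetical fragment: an indented key/value pair with no enclosing object.
fragment = '  "pthread_locks": {"mutex-empty": {"mean": 4.7}}'

try:
    json.loads(fragment)
    error = None
except json.JSONDecodeError as exc:
    # The parser accepts the leading string as a whole JSON value and
    # then complains about everything after it.
    error = str(exc)
print(error)

# Wrapping the fragment in braces makes it parse as JSON.  This only
# illustrates the parser behaviour; it does not produce the layout
# that compare_bench.py validates against its schema.
parsed = json.loads("{" + fragment + "}")
print(sorted(parsed))
```

If that assumption holds, it would also explain why the error points
at column 18: that is where the quoted key ends.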

PC