From: "H.J. Lu"
To: libc-alpha@sourceware.org
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
    Paul A. Clarke, Noah Goldstein
Subject: [PATCH v6 0/4] Optimize CAS [BZ #28537]
Date: Thu, 11 Nov 2021 08:24:24 -0800
Message-Id: <20211111162428.2286605-1-hjl.tools@gmail.com>

Changes in v6:

1. Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in the
   spinlock loop if the compare may fail.
2. Remove the low-level lock changes.
3. Don't change CAS usages in __pthread_mutex_lock_full.
4. Avoid an extra load with CAS in __pthread_mutex_clocklock_common.
5. Reduce CAS in malloc spinlocks.

Changes in v5:

1. Put back __glibc_unlikely in __lll_trylock and lll_cond_trylock.
2. Remove an atomic load in a CAS usage which has already been optimized.
3. Add an empty statement with a semicolon to a goto label for older
   compiler versions.
4. Simplify the CAS optimization.

CAS instructions are expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare and swap grabs the cache line in exclusive state and causes
excessive cache line bouncing.
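To make the pattern concrete, here is a minimal sketch of the "load before
CAS" spin loop, written with C11 <stdatomic.h> rather than glibc's internal
atomic macros; the names spin_lock, spin_unlock and lock_word are made up
for illustration and do not appear in the patches:

/* Illustrative test-and-test-and-set spinlock: waiters spin on an atomic
   load, so the cache line can stay in shared state; CAS, which needs the
   line in exclusive state, is attempted only when the load suggests it
   can succeed.  */

#include <stdatomic.h>

static atomic_int lock_word;   /* 0 = unlocked, 1 = locked.  */

static void
spin_lock (void)
{
  for (;;)
    {
      /* Read-only wait: no cache line bouncing while the lock is held.  */
      while (atomic_load_explicit (&lock_word, memory_order_relaxed) != 0)
        ;  /* Optionally a pause/backoff hint here.  */

      /* The compare should succeed now, so issue the CAS.  */
      int expected = 0;
      if (atomic_compare_exchange_weak_explicit (&lock_word, &expected, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
        return;
    }
}

static void
spin_unlock (void)
{
  atomic_store_explicit (&lock_word, 0, memory_order_release);
}

While the lock is held the waiter only reads, so the owner is not disturbed;
the write-intent CAS is issued only once it is likely to succeed.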
Optimize CAS in low-level locks and pthread_mutex_lock.c:

1. Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in the
   spinlock loop if the compare may fail, to reduce cache line bouncing
   on contended locks.
2. Replace boolean CAS with value CAS to avoid the extra load; a minimal
   sketch of this difference follows at the end of this message.
3. Change malloc spinlocks to do an atomic load and check if the compare
   may fail.  Skip CAS and spin if the compare may fail, to reduce cache
   line bouncing on contended locks.

With all CAS optimizations applied, on a machine with 112 cores, the
benchmark results (before, after, % improvement) are:

mutex-empty                  17.4575    17.3908     0.38%
mutex-filler                 48.4768    46.4925     4.1%
mutex_trylock-empty          19.2726    19.2737    -0.0057%
mutex_trylock-filler         54.0893    54.105     -0.029%
rwlock_read-empty            39.7572    39.8933    -0.34%
rwlock_read-filler           75.109     74.0818     1.4%
rwlock_tryread-empty          5.28944    5.28938    0.0011%
rwlock_tryread-filler        39.6297    39.734     -0.26%
rwlock_write-empty           60.6644    60.6151     0.081%
rwlock_write-filler          92.92      90.0722     3.1%
rwlock_trywrite-empty         7.24741    6.59308    9%
rwlock_trywrite-filler       42.7404    41.6767     2.5%
spin_lock-empty              19.1078    19.1079    -0.00052%
spin_lock-filler             51.0646    51.6041    -1.1%
spin_trylock-empty           16.4707    16.4811    -0.063%
spin_trylock-filler          50.5355    50.4012     0.27%
sem_wait-empty               42.1991    42.1683     0.073%
sem_wait-filler              74.6699    74.7883    -0.16%
sem_trywait-empty             5.27062    5.2702     0.008%
sem_trywait-filler           40.1541    40.1684    -0.036%
condvar-empty              5488.91    5165.95       5.9%
condvar-filler             1442.43    1474.21      -2.2%
consumer_producer-empty   16508.2    16705.3       -1.2%
consumer_producer-filler  16781.1    16942.3       -0.96%

H.J. Lu (4):
  Add LLL_MUTEX_READ_LOCK [BZ #28537]
  Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  Reduce CAS in malloc spinlocks
  Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]

 malloc/arena.c                 |  5 +++++
 malloc/malloc.c                | 10 ++++++++++
 nptl/pthread_mutex_lock.c      | 17 ++++++++++++-----
 nptl/pthread_mutex_timedlock.c | 10 +++++-----
 4 files changed, 32 insertions(+), 10 deletions(-)

--
2.33.1
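For illustration only, a minimal sketch of the boolean-CAS-versus-value-CAS
difference mentioned in item 2 of the optimization list above, again in C11
atomics with made-up names (try_lock_bool, try_lock_val, futex, self); in
glibc terms this roughly corresponds to atomic_compare_and_exchange_bool_acq
versus atomic_compare_and_exchange_val_acq:

#include <stdatomic.h>

/* Boolean-style CAS: the result only says whether the exchange succeeded,
   so on failure a second atomic load is needed to observe the current
   value.  (C11 does update EXPECTED, but it is deliberately ignored here
   to mimic a success/failure-only CAS.)  */
static int
try_lock_bool (atomic_int *futex, int self)
{
  int expected = 0;
  if (atomic_compare_exchange_strong (futex, &expected, self))
    return 0;                   /* Lock acquired.  */
  return atomic_load (futex);   /* Extra load just to get the value.  */
}

/* Value-style CAS: the failed CAS already reports the observed value
   (in EXPECTED), so no second load is required.  */
static int
try_lock_val (atomic_int *futex, int self)
{
  int expected = 0;
  if (atomic_compare_exchange_strong (futex, &expected, self))
    return 0;                   /* Lock acquired.  */
  return expected;              /* Observed value from the CAS itself.  */
}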