From: liqingqing
Subject: pthread_cond performance discussion
To: libc-alpha@sourceware.org
CC: Hushiyuan, Liusirui
Date: Mon, 16 Mar 2020 15:30:27 +0800

The new condvar implementation provides stronger ordering guarantees. To maintain the waiters' ordering without expanding the size of struct pthread_cond_t, it uses a few bits to maintain a state machine with two waiter groups, G1 and G2. This algorithm is very clever. However, when I tested MySQL performance, I found that the new condvar implementation hurts performance on machines with many cores.
The scenario: on my arm server (a 4P processor environment, 256 cores in total), 200 terminals read and write the database. I found that I get better performance with the old algorithm. Here is the performance I tested:

  old algorithm    new algorithm
  755449.3         668712.05

I suspect there is too much cache false sharing in my environment. Does anyone have the same problem? And is there room for optimization in the new algorithm?

The test steps are:

[root@client]# ./runBenchmark.sh props.mysql_4p_arm
[root@client]# cat props.mysql_4p_arm
db=mysql
driver=com.mysql.cj.jdbc.Driver
#conn=jdbc:mysql://222.222.222.11:3306/tpccpart
#conn=jdbc:mysql://222.222.222.132:3306/tpcc1000
#conn=jdbc:mysql://222.222.222.145:3306/tpcc
conn=jdbc:mysql://222.222.222.212:3306/tpcc
user=root
password=123456
warehouses=1000
loadWorkers=30
terminals=200
//To run specified transactions per terminal- runMins must equal zero
runTxnsPerTerminal=0
//To run for specified minutes- runTxnsPerTerminal must equal zero
runMins=5
//Number of total transactions per minute
limitTxnsPerMin=1000000000
//Set to true to run in 4.x compatible mode. Set to false to use the
//entire configured database evenly.
terminalWarehouseFixed=true
//The following five values must add up to 100
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4
// Directory name to create for collecting detailed result data.
// Comment this out to suppress.
//resultDirectory=my_result_%tY-%tm-%td_%tH%tM%tS
//osCollectorScript=./misc/os_collector_linux.py
//osCollectorInterval=1
//osCollectorSSHAddr=user@dbhost
//osCollectorDevices=net_eth0 blk_sda

Below is the struct of pthread_cond_t:

/* Common definition of pthread_cond_t.  */
/* Consumer and producer may be in the same cache line.  */
struct __pthread_cond_s
{
  __extension__ union
  {
    __extension__ unsigned long long int __wseq;  /* LSB is index of current G2.  */
    struct
    {
      unsigned int __low;   /* The waiters' sequence number, G2.  */
      unsigned int __high;  /* The waiters' sequence number, G1.  */
    } __wseq32;
  };
  __extension__ union
  {
    __extension__ unsigned long long int __g1_start;  /* LSB is index of current G2.  */
    struct
    {
      unsigned int __low;
      unsigned int __high;
    } __g1_start32;
  };
  unsigned int __g_refs[2] __LOCK_ALIGNMENT;  /* LSB is true if waiters should run
                                                 futex_wake when they remove the
                                                 last reference.  */
  unsigned int __g_size[2];
  unsigned int __g1_orig_size;  /* Initial size of G1.  */
  unsigned int __wrefs;         /* Bit 2 is true if waiters should run futex_wake
                                   when they remove the last reference.
                                   pthread_cond_destroy uses this as futex word.
                                   Bit 1 is the clock ID (0 == CLOCK_REALTIME,
                                   1 == CLOCK_MONOTONIC).
                                   Bit 0 is true iff this is a process-shared
                                   condvar.  */
  unsigned int __g_signals[2];  /* LSB is true iff this group has been completely
                                   signaled (i.e., it is closed).  */
};