Date: Tue, 1 Aug 2023 20:45:23 +0800
From: Changbin Du
To: Changbin Du
CC: Ning Jia, Li Yu, Wang Nan, Hui Wang
Subject: Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
Message-ID: <20230801124523.xd73w26qip3kh3ta@M910t>
In-Reply-To: <20230731125535.wpgchdsjegx2yg4h@M910t>

On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> The result (p-core, no ht, no turbo, performance mode):
>
>                                   O2              O3              PGO
> cycles                            2,581,832,749   8,638,401,568   9,394,200,585
>                                   (1.07s)         (3.49s)         (3.80s)
> instructions                      12,609,600,094  11,827,675,782  12,036,010,638
> branches                          2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                     0.00%           7.94%           8.84%
> cache-misses                      3,012,613       3,055,722       3,076,316
> L1-icache-load-misses             11,416,391      12,112,703      11,896,077
> icache_tag.stalls                 1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit              6,856           21,756          22,600
> itlb_misses.walk_completed        14,430          4,454           15,084
> baclears.any                      131,573         140,355         131,644
> int_misc.clear_resteer_cycles     2,545,915       586,578,125     679,021,993
> machine_clears.count              22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles  6,985,838       12,929,675      8,405,493
> frontend_retired.any_dsb_miss     28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                     207,881,687     216,734,007     212,003,064
>
>
> Above data shows:
>   o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
>   o O3/PGO reduced instructions by 6.2% and 4.5%. I think this is attributable
>     to aggressive inlining.
>   o O3/PGO introduced very bad branch prediction. I will explain it later.
>   o Code built with O3 has many more iTLB misses but far fewer sTLB misses.
>     This is beyond my expectation.
>   o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>     I don't know why. (subcategory MC is not measured yet)

The MCs are caused by memory ordering conflicts and are attributable to the
kernel RCU lock in the I/O path, when ext4 tries to update its journal.

>   o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>   o The idq.mite_uops of O3/PGO increased 4%, while idq.dsb_uops increased 2x.
>     The DSB hits well, so frontend fetching and decoding is not a problem for
>     O3/PGO.
>   o Other events are mainly affected by branch misprediction.
>

-- 
Cheers,
Changbin Du
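
For anyone who wants to re-derive the percentages in the quoted summary, here is a minimal sketch that recomputes a few ratios from the counters in the table. The counter values are copied from the table; the function and variable names are made up for illustration and are not part of any perf tooling. With these numbers the cycle ratios come out near 3.3x (O3) and 3.6x (PGO) of O2, which is consistent with reading "2.3x/2.6x" as the extra cost on top of O2.

```python
# Counter values copied from the quoted perf table, in (O2, O3, PGO) order.
# All names below are illustrative only.
BUILDS = ("O2", "O3", "PGO")

COUNTERS = {
    "cycles": (2_581_832_749, 8_638_401_568, 9_394_200_585),
    "instructions": (12_609_600_094, 11_827_675_782, 12_036_010_638),
    "machine_clears.count": (22_235, 39_671, 37_307),
    "int_misc.clear_resteer_cycles": (2_545_915, 586_578_125, 679_021_993),
}


def ratio_to_o2(name):
    """Return each build's counter value divided by the O2 value."""
    base = COUNTERS[name][0]
    return tuple(value / base for value in COUNTERS[name])


def ipc():
    """Instructions per cycle for each build."""
    return tuple(
        insns / cycles
        for insns, cycles in zip(COUNTERS["instructions"], COUNTERS["cycles"])
    )


def resteer_share():
    """Fraction of all cycles counted by int_misc.clear_resteer_cycles."""
    return tuple(
        resteer / cycles
        for resteer, cycles in zip(
            COUNTERS["int_misc.clear_resteer_cycles"], COUNTERS["cycles"]
        )
    )


if __name__ == "__main__":
    for name in ("cycles", "instructions", "machine_clears.count"):
        row = ", ".join(
            f"{build}={ratio:.2f}x" for build, ratio in zip(BUILDS, ratio_to_o2(name))
        )
        print(f"{name:<24} {row}")
    print("IPC".ljust(24), ", ".join(f"{b}={v:.2f}" for b, v in zip(BUILDS, ipc())))
    print(
        "resteer share".ljust(24),
        ", ".join(f"{b}={v:.1%}" for b, v in zip(BUILDS, resteer_share())),
    )
```

The resteer share is only a rough indicator of how much of the run is spent recovering from clears and mispredictions; it does not separate machine clears from branch mispredictions.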