From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=HFN+=DS=huawei.com=changbin.du@sourceware.org>
Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187])
	by sourceware.org (Postfix) with ESMTPS id 548813858C66;
	Tue,  1 Aug 2023 12:45:50 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 548813858C66
Authentication-Results: sourceware.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=huawei.com
Received: from kwepemi500013.china.huawei.com (unknown [172.30.72.56])
	by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4RFZWJ6j3XztQS4;
	Tue,  1 Aug 2023 20:42:20 +0800 (CST)
Received: from M910t (10.110.54.157) by kwepemi500013.china.huawei.com
 (7.221.188.120) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.27; Tue, 1 Aug
 2023 20:45:40 +0800
Date: Tue, 1 Aug 2023 20:45:23 +0800
From: Changbin Du <changbin.du@huawei.com>
To: Changbin Du <changbin.du@huawei.com>
CC: <gcc@gcc.gnu.org>, <gcc-bugs@gcc.gnu.org>, Ning Jia <ning.jia@huawei.com>,
	Li Yu <marvin.tms@huawei.com>, Wang Nan <wangnan0@huawei.com>, Hui Wang
	<hw.huiwang@huawei.com>
Subject: Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance
 drop relative to O2
Message-ID: <20230801124523.xd73w26qip3kh3ta@M910t>
References: <20230731125535.wpgchdsjegx2yg4h@M910t>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <20230731125535.wpgchdsjegx2yg4h@M910t>
X-Originating-IP: [10.110.54.157]
X-ClientProxiedBy: dggems701-chm.china.huawei.com (10.3.19.178) To
 kwepemi500013.china.huawei.com (7.221.188.120)
X-CFilter-Loop: Reflected
X-Spam-Status: No, score=-3.5 required=5.0 tests=BAYES_00,KAM_DMARC_STATUS,KAM_NUMSUBJECT,RCVD_IN_MSPIKE_H5,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-bugs.sourceware.org>

On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> The result (p-core, no ht, no turbo, performance mode):
> 
>                                   O2              O3              PGO
> cycles                            2,581,832,749   8,638,401,568   9,394,200,585
>                                   (1.07s)         (3.49s)         (3.80s)
> instructions                      12,609,600,094  11,827,675,782  12,036,010,638
> branches                          2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                     0.00%           7.94%           8.84%
> cache-misses                      3,012,613       3,055,722       3,076,316
> L1-icache-load-misses             11,416,391      12,112,703      11,896,077
> icache_tag.stalls                 1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit              6,856           21,756          22,600
> itlb_misses.walk_completed        14,430          4,454           15,084
> baclears.any                      131,573         140,355         131,644
> int_misc.clear_resteer_cycles     2,545,915       586,578,125     679,021,993
> machine_clears.count              22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles  6,985,838       12,929,675      8,405,493
> frontend_retired.any_dsb_miss     28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                     207,881,687     216,734,007     212,003,064
> 
> 
> The data above shows:
>   o O3/PGO take about 3.3x/3.6x the cycles of O2 (*2.3x/2.6x* more),
>     respectively.
>   o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to
>     aggressive inlining.
>   o O3/PGO introduced very bad branch prediction. I will explain it later.
>   o Code built with O3 has more iTLB misses that hit the STLB but far fewer
>     completed page walks (STLB misses). I did not expect this.
>   o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>     I don't know why. (The MC subcategories have not been measured yet.)
The machine clears (MCs) are caused by memory-ordering conflicts and can be
attributed to the kernel RCU lock in the I/O path, when ext4 tries to update
its journal.

>   o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>   o The idq.mite_uops of O3/PGO increased by only 4%/2%, while idq.dsb_uops
>     increased by more than 2x. The DSB hits well, so frontend fetching and
>     decoding are not a problem for O3/PGO.
>   o The other events are mainly affected by the branch mispredictions.
> 
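For reference, the ratios in the quoted list can be re-derived from the counter
table with a short script. This is only a sketch: the values are copied by hand
from the table above, and the names `counters` and `relative_to_o2` are made up
for illustration, not part of any tool.

```python
# Counter values copied from the quoted perf table, as (O2, O3, PGO).
counters = {
    "cycles": (2_581_832_749, 8_638_401_568, 9_394_200_585),
    "instructions": (12_609_600_094, 11_827_675_782, 12_036_010_638),
    "machine_clears.count": (22_235, 39_671, 37_307),
    "idq.dsb_uops": (11_149_445_952, 26_438_051_062, 28_622_657_650),
    "idq.mite_uops": (207_881_687, 216_734_007, 212_003_064),
}


def relative_to_o2(name):
    """Return the (O3/O2, PGO/O2) ratios for one counter (hypothetical helper)."""
    o2, o3, pgo = counters[name]
    return o3 / o2, pgo / o2


if __name__ == "__main__":
    # Print every counter as a multiple of its O2 value.
    for name in counters:
        o3, pgo = relative_to_o2(name)
        print(f"{name:22s} O3 = {o3:.2f}x O2   PGO = {pgo:.2f}x O2")
```

A ratio above 1.0 means the counter grew relative to O2; for example, the
instruction ratios fall below 1.0, which matches the reduction noted in the list.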

-- 
Cheers,
Changbin Du