Date: Tue, 1 Aug 2023 20:21:14 +0800
From: Changbin Du
To: Richard Biener
CC: Changbin Du, Jan Hubicka, Ning Jia, Li Yu, Wang Nan, Hui Wang
Subject: Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
Message-ID: <20230801122114.4tyrwi7kemukya73@M910t>
References: <20230731125535.wpgchdsjegx2yg4h@M910t>

On Mon, Jul 31, 2023 at 03:53:26PM +0200, Richard Biener wrote:
[snip]
> > The main difference in the compilation output about code around the
> > mispredicted branch is:
> >   o In O2: a predicated instruction (cmov here) is selected to eliminate the above
> >     branch. cmov is truly better than a branch here.
> >   o In O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
> >     is selected. But this branch is obviously *unpredictable* and the compiler
> >     doesn't know it. This is why O3/PGO are so bad for this program.
> >
> > Gcc doesn't support __builtin_unpredictable(), which has been introduced by llvm.
> > Then I tried to see if __builtin_expect_with_probability(e,x, 0.5) can serve the
> > same purpose. The result is negative.
>
> But does it appear to be predictable with your profiling data?
>

I profiled the branch-misses event on a Kaby Lake machine. 99% of the
mispredictions are attributed to the encode_file() function.

$ sudo perf record -e branch-instructions:pp,branch-misses:pp -c 1000 -- taskset -c 0 ./huffman.O3 test.data

Samples: 197K of event 'branch-misses:pp', Event count (approx.): 197618000
Overhead  Command     Shared Object     Symbol
  99.58%  huffman.O3  huffman.O3        [.] encode_file
   0.12%  huffman.O3  [kernel.vmlinux]  [k] __x86_indirect_thunk_array
   0.11%  huffman.O3  libc-2.31.so      [.] _IO_getc
   0.01%  huffman.O3  [kernel.vmlinux]  [k] common_file_perm

Then annotate the encode_file() function:

Samples: 197K of event 'branch-misses:pp', 1000 Hz, Event count (approx.): 197618000
encode_file  /work/myWork/linux/pgo/huffman.O3 [Percent: local period]
Percent│      ↑ je      38
       │      bitout():
       │      current_byte <<= 1;
       │ 70:   add     %edi,%edi
       │      if (b == '1') current_byte |= 1;
 48.70 │    ┌──cmp     $0x31,%dl
 47.11 │    ├──jne     7a
       │    │  or      $0x1,%edi
       │    │nbits++;
       │ 7a:└─→inc     %eax
       │      if (b == '1') current_byte |= 1;
       │      mov     %edi,current_byte
       │      nbits++;
       │      mov     %eax,nbits
       │      if (nbits == 8) {
  1.16 │      cmp     $0x8,%eax
  3.03 │    ↓ je      a0
       │      encode_file():
       │      for (s=codes[ch]; *s; s++) bitout (outfile, *s);
       │      movzbl  0x1(%r13),%edx
       │      inc     %r13
       │      test    %dl,%dl
       │    ↑ jne     70
       │    ↑ jmp     38
       │      nop

-- 
Cheers,
Changbin Du
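
For illustration, the branch that the annotation blames can be written in two shapes. This is a minimal sketch, not the actual huffman source: only the statements visible in the annotation (current_byte <<= 1; if (b == '1') current_byte |= 1; nbits++; if (nbits == 8)) come from the program. The struct bitwriter, the in-memory output buffer standing in for outfile, the flush behaviour, and the names bitout_branchy, bitout_branchless and encode_bits are all assumptions made for the example. The second variant expresses the bit as arithmetic, which typically gives the compiler no data-dependent jump to emit for that statement, although code generation is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit-writer state. The original program keeps current_byte
 * and nbits as globals (per the annotation); the buffer is an assumption
 * that stands in for the output file. */
struct bitwriter {
    unsigned char out[16];
    size_t len;
    int current_byte;
    int nbits;
};

/* Assumed flush step: store the finished byte and reset the state. */
static void flush_if_full(struct bitwriter *w)
{
    if (w->nbits == 8) {
        assert(w->len < sizeof w->out);
        w->out[w->len++] = (unsigned char)w->current_byte;
        w->current_byte = 0;
        w->nbits = 0;
    }
}

/* Shape shown in the annotation: a branch that depends on the data bit. */
static void bitout_branchy(struct bitwriter *w, char b)
{
    w->current_byte <<= 1;
    if (b == '1')
        w->current_byte |= 1;
    w->nbits++;
    flush_if_full(w);
}

/* Same result, with the comparison folded into the arithmetic.
 * (b == '1') evaluates to 0 or 1 in C. */
static void bitout_branchless(struct bitwriter *w, char b)
{
    w->current_byte = (w->current_byte << 1) | (b == '1');
    w->nbits++;
    flush_if_full(w);
}

/* Mirrors the loop in encode_file(): emit every character of a code string. */
static void encode_bits(struct bitwriter *w, const char *s,
                        void (*emit)(struct bitwriter *, char))
{
    for (; *s; s++)
        emit(w, *s);
}
```

Calling encode_bits() on the same code string with either function fills the buffer with identical bytes; for example, the string "10110001" yields the single byte 0xb1 in both cases. Whether the second form is actually faster for this program would have to be measured.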