From: Richard Biener
Date: Mon, 31 Jul 2023 15:53:26 +0200
Subject: Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
To: Changbin Du, Jan Hubicka
Cc: gcc@gcc.gnu.org, gcc-bugs@gcc.gnu.org, Ning Jia, Li Yu, Wang Nan, Hui Wang

On Mon, Jul 31, 2023 at 2:57 PM Changbin Du via Gcc wrote:
>
> Hello, folks.
> This is to discuss GCC's heuristic strategy about Predicated Instructions and
> Branches, and probably something that needs to be improved.
>
> [The story]
> Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
> This program is nothing special, just some random code I found on the internet. You
> can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.
>
> Build it with O2/O3/PGO (my GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
>
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
>
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
>
> The result (p-core, no HT, no turbo, performance mode):
>
>                                              O2               O3              PGO
> cycles                            2,581,832,749    8,638,401,568    9,394,200,585
>                                         (1.07s)          (3.49s)          (3.80s)
> instructions                     12,609,600,094   11,827,675,782   12,036,010,638
> branches                          2,303,416,221    2,671,184,833    2,723,414,574
> branch-misses                             0.00%            7.94%            8.84%
> cache-misses                          3,012,613        3,055,722        3,076,316
> L1-icache-load-misses                11,416,391       12,112,703       11,896,077
> icache_tag.stalls                     1,553,521        1,364,092        1,896,066
> itlb_misses.stlb_hit                      6,856           21,756           22,600
> itlb_misses.walk_completed               14,430            4,454           15,084
> baclears.any                            131,573          140,355          131,644
> int_misc.clear_resteer_cycles         2,545,915      586,578,125      679,021,993
> machine_clears.count                     22,235           39,671           37,307
> dsb2mite_switches.penalty_cycles      6,985,838       12,929,675        8,405,493
> frontend_retired.any_dsb_miss        28,785,677       28,161,724       28,093,319
> idq.dsb_cycles_any                1,986,038,896    5,683,820,258    5,971,969,906
> idq.dsb_uops                     11,149,445,952   26,438,051,062   28,622,657,650
> idq.mite_uops                       207,881,687      216,734,007      212,003,064
>
>
> The above data shows:
>  o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
>  o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to
>    aggressive inlining.
>  o O3/PGO introduced very bad branch prediction. I will explain this later.
>  o Code built with O3 has high iTLB misses but much lower STLB misses. This is
>    beyond my expectation.
>  o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>    I don't know why. (The machine-clear subcategories are not measured yet.)
>  o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>  o The idq.mite_uops of O3/PGO increased 4%, while idq.dsb_uops increased 2x.
>    The DSB hits well, so frontend fetching and decoding are not a problem for
>    O3/PGO.
>  o The other events are mainly affected by the bad branch misprediction.
>
> Additionally, here is the TMA level 2 analysis: the main changes in the pipeline
> slots are in the Bad Speculation and Frontend Bound categories. I doubt the
> accuracy of tma_fetch_bandwidth given the frontend_retired.any_dsb_miss and
> idq.mite_uops data above.
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman test.data':
>
>  % tma_branch_mispredicts  % tma_core_bound  % tma_heavy_operations  % tma_light_operations  % tma_memory_bound  % tma_fetch_bandwidth  % tma_fetch_latency  % tma_machine_clears
>                       0.0               0.8                    11.4                    76.8                 2.0                    8.3                  0.8                   0.0
>
>        1.073381357 seconds time elapsed
>
>        0.945233000 seconds user
>        0.095719000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
>
>  % tma_branch_mispredicts  % tma_core_bound  % tma_heavy_operations  % tma_light_operations  % tma_memory_bound  % tma_fetch_bandwidth  % tma_fetch_latency  % tma_machine_clears
>                      38.2               6.6                     3.5                    21.7                 0.9                   20.9                  7.5                   0.8
>
>        3.501875873 seconds time elapsed
>
>        3.378572000 seconds user
>        0.084163000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':
>
>  % tma_branch_mispredicts  % tma_core_bound  % tma_heavy_operations  % tma_light_operations  % tma_memory_bound  % tma_fetch_bandwidth  % tma_fetch_latency  % tma_machine_clears
>                      40.3               6.3                     3.6                    19.4                 1.2                   17.8                 10.7                   0.8
>
>        3.803413059 seconds time elapsed
>
>        3.686474000 seconds user
>        0.079707000 seconds sys
>
>
> I also tried the same program with O2/O3 on arm64, and O3 led to a *30%*
> performance drop relative to O2.
>
>
> [Predicated Ins vs Branches]
> Then I analyzed the Bad Speculation problem. 99% of the mispredictions in O3/PGO
> are caused by the branch below.
>
> @@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
>
>    /* put a one on the end of this byte if b is '1' */
>
> -  if (b == '1') current_byte |= 1;
> +  //if (b == '1') current_byte |= 1;
>
>    /* one more bit */
>
> If I comment it out as in the patch above, then O3/PGO get a 16% and 12%
> performance improvement over O2 on x86.
>
>                           O2              O3             PGO
> cycles         2,497,674,824   2,104,993,224   2,199,753,593
> instructions  10,457,508,646   9,723,056,131  10,457,216,225
> branches       2,303,029,380   2,250,522,323   2,302,994,942
> branch-misses          0.00%           0.01%           0.01%
>
> The main difference in the compiled code around the mispredicted branch is:
>  o In O2: a predicated instruction (cmov here) is selected to eliminate the
>    branch above. cmov is truly better than a branch here.
>  o In O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
>    is selected. But this branch is obviously *unpredictable* and the compiler
>    doesn't know it. This is why O3/PGO are so bad for this program.
>
> GCC doesn't support __builtin_unpredictable(), which was introduced by LLVM.
> Then I tried to see if __builtin_expect_with_probability(e, x, 0.5) can serve
> the same purpose. The result is negative.

But does it appear to be predictable with your profiling data?

> I think we can conclude that there must be something to improve in GCC's
> heuristic strategy about Predicated Instructions and branches, at least for
> O3 and PGO.
>
> And can we add __builtin_unpredictable() support to GCC? It is usually hard
> for the compiler to detect unpredictable branches.
>
> --
> Cheers,
> Changbin Du
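
For concreteness, here is a minimal, self-contained sketch of the alternatives
discussed in the message above. Only the `current_byte` variable and the
`b == '1'` test come from the diff; the function names and surrounding code are
illustrative assumptions, not code from huffman.c or from the thread.

    /* Sketch only: current_byte and the b == '1' test mirror the diff above;
       everything else is an illustrative assumption. */

    static unsigned char current_byte;

    /* (a) Manual branchless form: the comparison already yields 0 or 1, so it
       can be ORed in directly and no conditional branch (and no cmov-vs-branch
       heuristic) is involved, regardless of optimization level. */
    static void put_bit_branchless(char b)
    {
        current_byte |= (unsigned char)(b == '1');
    }

    /* (b) GCC 9+: annotate the condition as a 50/50 coin flip. As reported in
       the thread, this did not change code generation in this case. */
    static void put_bit_probability(char b)
    {
        if (__builtin_expect_with_probability(b == '1', 1, 0.5))
            current_byte |= 1;
    }

    /* (c) Clang-only today: mark the condition as unpredictable; this is the
       builtin the thread asks about adding to GCC. */
    #ifdef __clang__
    static void put_bit_unpredictable(char b)
    {
        if (__builtin_unpredictable(b == '1'))
            current_byte |= 1;
    }
    #endif

With random input data the condition is effectively a coin flip, which is why
the branchless form in (a) sidesteps the misprediction problem that the thread
measures for the O3/PGO builds.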