From: Richard Biener
Date: Mon, 31 Jul 2023 15:53:26 +0200
Subject: Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
To: Changbin Du, Jan Hubicka
Cc: gcc@gcc.gnu.org, gcc-bugs@gcc.gnu.org, Ning Jia, Li Yu, Wang Nan, Hui Wang

On Mon, Jul 31, 2023 at 2:57 PM Changbin Du via Gcc wrote:
>
> Hello, folks.
> This is to discuss GCC's heuristic strategy about Predicated Instructions and
> Branches, and probably something that needs to be improved.
>
> [The story]
> Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
> This program is nothing special, just some random code I found on the internet. You
> can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.
>
> Build it with O2/O3/PGO (my GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
>
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
>
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
>
> The result (p-core, no HT, no turbo, performance mode):
>
>                                              O2               O3              PGO
> cycles                            2,581,832,749    8,638,401,568    9,394,200,585
>                                         (1.07s)          (3.49s)          (3.80s)
> instructions                     12,609,600,094   11,827,675,782   12,036,010,638
> branches                          2,303,416,221    2,671,184,833    2,723,414,574
> branch-misses                             0.00%            7.94%            8.84%
> cache-misses                          3,012,613        3,055,722        3,076,316
> L1-icache-load-misses                11,416,391       12,112,703       11,896,077
> icache_tag.stalls                     1,553,521        1,364,092        1,896,066
> itlb_misses.stlb_hit                      6,856           21,756           22,600
> itlb_misses.walk_completed               14,430            4,454           15,084
> baclears.any                            131,573          140,355          131,644
> int_misc.clear_resteer_cycles         2,545,915      586,578,125      679,021,993
> machine_clears.count                     22,235           39,671           37,307
> dsb2mite_switches.penalty_cycles      6,985,838       12,929,675        8,405,493
> frontend_retired.any_dsb_miss        28,785,677       28,161,724       28,093,319
> idq.dsb_cycles_any                1,986,038,896    5,683,820,258    5,971,969,906
> idq.dsb_uops                     11,149,445,952   26,438,051,062   28,622,657,650
> idq.mite_uops                       207,881,687      216,734,007      212,003,064
>
>
> The above data shows:
>  o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
>  o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to
>    aggressive inlining.
>  o O3/PGO introduced very bad branch prediction. I will explain this later.
>  o Code built with O3 has high iTLB misses but much lower STLB misses. This is
>    beyond my expectation.
>  o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>    I don't know why. (The machine-clear subcategories are not measured yet.)
>  o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>  o The idq.mite_uops of O3/PGO increased 4%, while idq.dsb_uops increased 2x.
>    The DSB hits well, so frontend fetching and decoding are not a problem for
>    O3/PGO.
>  o The other events are mainly affected by the bad branch misprediction.
>
> Additionally, here is the TMA level 2 analysis: the main changes in the pipeline
> slots are in the Bad Speculation and Frontend Bound categories. I doubt the
> accuracy of tma_fetch_bandwidth given the frontend_retired.any_dsb_miss and
> idq.mite_uops data above.
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman test.data':
>
>  % tma_branch_mispredicts  % tma_core_bound  % tma_heavy_operations  % tma_light_operations  % tma_memory_bound  % tma_fetch_bandwidth  % tma_fetch_latency  % tma_machine_clears
>                       0.0               0.8                    11.4                    76.8                 2.0                    8.3                  0.8                   0.0
>
>        1.073381357 seconds time elapsed
>
>        0.945233000 seconds user
>        0.095719000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
>
>  % tma_branch_mispredicts  % tma_core_bound  % tma_heavy_operations  % tma_light_operations  % tma_memory_bound  % tma_fetch_bandwidth  % tma_fetch_latency  % tma_machine_clears
>                      38.2               6.6                     3.5                    21.7                 0.9                   20.9                  7.5                   0.8
>
>        3.501875873 seconds time elapsed
>
>        3.378572000 seconds user
>        0.084163000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':
>
>  % tma_branch_mispredicts  % tma_core_bound  % tma_heavy_operations  % tma_light_operations  % tma_memory_bound  % tma_fetch_bandwidth  % tma_fetch_latency  % tma_machine_clears
>                      40.3               6.3                     3.6                    19.4                 1.2                   17.8                 10.7                   0.8
>
>        3.803413059 seconds time elapsed
>
>        3.686474000 seconds user
>        0.079707000 seconds sys
>
>
> I also tried the same program with O2/O3 on arm64, and O3 led to a *30%*
> performance drop relative to O2.
>
>
> [Predicated Ins vs Branches]
> Then I analyzed the Bad Speculation problem. 99% of the mispredictions in O3/PGO
> are caused by the branch below.
>
> @@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
>
>    /* put a one on the end of this byte if b is '1' */
>
> -  if (b == '1') current_byte |= 1;
> +  //if (b == '1') current_byte |= 1;
>
>    /* one more bit */
>
> If I comment it out as in the patch above, then O3/PGO get a 16% and 12%
> performance improvement over O2 on x86.
>
>                           O2              O3             PGO
> cycles         2,497,674,824   2,104,993,224   2,199,753,593
> instructions  10,457,508,646   9,723,056,131  10,457,216,225
> branches       2,303,029,380   2,250,522,323   2,302,994,942
> branch-misses          0.00%           0.01%           0.01%
>
> The main difference in the compiled code around the mispredicted branch is:
>  o In O2: a predicated instruction (cmov here) is selected to eliminate the
>    branch above. cmov is truly better than a branch here.
>  o In O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
>    is selected. But this branch is obviously *unpredictable* and the compiler
>    doesn't know it. This is why O3/PGO are so bad for this program.
>
> GCC doesn't support __builtin_unpredictable(), which was introduced by LLVM.
> Then I tried to see if __builtin_expect_with_probability(e, x, 0.5) can serve
> the same purpose. The result is negative.

But does it appear to be predictable with your profiling data?

> I think we can conclude that there must be something to improve in GCC's
> heuristic strategy about Predicated Instructions and branches, at least for
> O3 and PGO.
>
> And can we add __builtin_unpredictable() support to GCC? It is usually hard
> for the compiler to detect unpredictable branches.
>
> --
> Cheers,
> Changbin Du
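
For concreteness, here is a minimal, self-contained sketch of the alternatives
discussed in the message above. Only the `current_byte` variable and the
`b == '1'` test come from the diff; the function names and surrounding code are
illustrative assumptions, not code from huffman.c or from the thread.

    /* Sketch only: current_byte and the b == '1' test mirror the diff above;
       everything else is an illustrative assumption. */

    static unsigned char current_byte;

    /* (a) Manual branchless form: the comparison already yields 0 or 1, so it
       can be ORed in directly and no conditional branch (and no cmov-vs-branch
       heuristic) is involved, regardless of optimization level. */
    static void put_bit_branchless(char b)
    {
        current_byte |= (unsigned char)(b == '1');
    }

    /* (b) GCC 9+: annotate the condition as a 50/50 coin flip. As reported in
       the thread, this did not change code generation in this case. */
    static void put_bit_probability(char b)
    {
        if (__builtin_expect_with_probability(b == '1', 1, 0.5))
            current_byte |= 1;
    }

    /* (c) Clang-only today: mark the condition as unpredictable; this is the
       builtin the thread asks about adding to GCC. */
    #ifdef __clang__
    static void put_bit_unpredictable(char b)
    {
        if (__builtin_unpredictable(b == '1'))
            current_byte |= 1;
    }
    #endif

With random input data the condition is effectively a coin flip, which is why
the branchless form in (a) sidesteps the misprediction problem that the thread
measures for the O3/PGO builds.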