* [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
From: Changbin Du @ 2023-07-31 12:55 UTC
  To: gcc, gcc-bugs; +Cc: Ning Jia, Li Yu, Wang Nan, Hui Wang, Changbin Du

Hello, folks.
This is to discuss GCC's heuristic strategy for choosing between predicated
instructions and branches, and something that probably needs to be improved.

[The story]
Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
The program is nothing special, just random code I found on the internet. You
can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (my GCC is 13.1):
$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c

$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c

Run them on my 12900H laptop:
$ head -c 50M /dev/urandom > test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no ht, no turbo, performance mode):

                                O2                      O3              PGO
cycles                          2,581,832,749   8,638,401,568   9,394,200,585
                                (1.07s)         (3.49s)         (3.80s)
instructions                    12,609,600,094  11,827,675,782  12,036,010,638
branches                        2,303,416,221   2,671,184,833   2,723,414,574
branch-misses                   0.00%           7.94%           8.84%
cache-misses                    3,012,613       3,055,722       3,076,316
L1-icache-load-misses           11,416,391      12,112,703      11,896,077
icache_tag.stalls               1,553,521       1,364,092       1,896,066
itlb_misses.stlb_hit            6,856           21,756          22,600
itlb_misses.walk_completed      14,430          4,454           15,084
baclears.any                    131,573         140,355         131,644
int_misc.clear_resteer_cycles   2,545,915       586,578,125     679,021,993
machine_clears.count            22,235          39,671          37,307
dsb2mite_switches.penalty_cycles 6,985,838      12,929,675      8,405,493
frontend_retired.any_dsb_miss   28,785,677      28,161,724      28,093,319
idq.dsb_cycles_any              1,986,038,896   5,683,820,258   5,971,969,906
idq.dsb_uops                    11,149,445,952  26,438,051,062  28,622,657,650
idq.mite_uops                   207,881,687     216,734,007     212,003,064


The above data shows:
  o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
  o O3/PGO reduced the instruction count by 6.2% and 4.5%. I attribute this to
    aggressive inlining.
  o O3/PGO introduced very bad branch prediction. I will explain this later.
  o Code built with O3 has more iTLB misses but far fewer sTLB misses. This is
    beyond my expectation.
  o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
    I don't know why. (The machine-clear subcategories are not measured yet.)
  o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
  o idq.mite_uops increased by only ~4% for O3/PGO, while idq.dsb_uops
    increased ~2x. The DSB hits well, so frontend fetching and decoding is not
    the problem for O3/PGO.
  o The other events are mainly affected by the bad branch misprediction.

Additionally, here is the TMA level 2 analysis: the main changes in the
pipeline slots are in the Bad Speculation and Frontend Bound categories. I
doubt the accuracy of tma_fetch_bandwidth, given the
frontend_retired.any_dsb_miss and idq.mite_uops data above.

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

 %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
                       0.0                  0.8                     11.4                     76.8                  2.0                     8.3                  0.8                    0.0

       1.073381357 seconds time elapsed

       0.945233000 seconds user
       0.095719000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

 %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
                      38.2                  6.6                      3.5                     21.7                  0.9                    20.9                  7.5                    0.8

       3.501875873 seconds time elapsed

       3.378572000 seconds user
       0.084163000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':

 %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
                      40.3                  6.3                      3.6                     19.4                  1.2                    17.8                 10.7                    0.8

       3.803413059 seconds time elapsed

       3.686474000 seconds user
       0.079707000 seconds sys


I also tried the same program with O2/O3 on arm64, where O3 led to a *30%*
performance drop relative to O2.


[Predicated Ins vs Branches]
Then I analyzed the Bad Speculation problem. 99% of the mispredictions in
O3/PGO are caused by the branch below.
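
For reference, a branch like this can be located with standard perf
branch-miss sampling, e.g.:

$ perf record -e branch-misses -- taskset -c 0 ./huffman.O3 test.data
$ perf report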

@@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {

        /* put a one on the end of this byte if b is '1' */

-       if (b == '1') current_byte |= 1;
+       //if (b == '1') current_byte |= 1;

        /* one more bit */

If I comment it out, as in the above patch, then O3/PGO get a 16% and 12%
performance improvement over O2 on x86.

                        O2              O3              PGO
cycles                  2,497,674,824   2,104,993,224   2,199,753,593
instructions            10,457,508,646  9,723,056,131   10,457,216,225
branches                2,303,029,380   2,250,522,323   2,302,994,942
branch-misses           0.00%           0.01%           0.01%

The main difference in the compiled code around the mispredicted branch is:
  o In O2: a predicated instruction (cmov here) is selected to eliminate the
    branch. cmov is truly better than a branch here.
  o In O3/PGO: bitout() is inlined into encode_file(), and a branch
    instruction is selected. But this branch is obviously *unpredictable*,
    and the compiler doesn't know it. This is why O3/PGO are so bad for this
    program (see the sketch below).
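
For illustration, here is the branchless form that O2's cmov effectively
computes, written at the source level (a sketch; current_byte and b are the
identifiers from huffman.c):

        /* branchy form: the compiler may emit a conditional jump,
           which mispredicts badly on random input */
        if (b == '1')
                current_byte |= 1;

        /* branchless form: (b == '1') evaluates to 0 or 1, so the OR
           is unconditional and no branch is needed */
        current_byte |= (b == '1');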

GCC doesn't support __builtin_unpredictable(), which was introduced by LLVM.
I tried to see whether __builtin_expect_with_probability(e, x, 0.5) can serve
the same purpose, but the result is negative.
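
For reference, here is roughly how such an annotation would look (a sketch:
__builtin_unpredictable() exists only in Clang, and
__builtin_expect_with_probability() is available since GCC 9; as said above,
the latter did not help here):

#if defined(__clang__)
        /* Clang: mark the condition as unpredictable so the compiler
           prefers a predicated instruction over a branch. */
        if (__builtin_unpredictable(b == '1'))
                current_byte |= 1;
#else
        /* GCC: a 50% probability hint is the closest available knob,
           but it does not have the same effect in this experiment. */
        if (__builtin_expect_with_probability(b == '1', 1, 0.5))
                current_byte |= 1;
#endif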

I think we can conclude that there is something to improve in GCC's heuristic
strategy for choosing between predicated instructions and branches, at least
for O3 and PGO.

And could we add __builtin_unpredictable() support to GCC? It's usually hard
for the compiler to detect unpredictable branches.

--
Cheers,
Changbin Du
