Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

From: Changbin Du <changbin.du@huawei.com>
To: Changbin Du <changbin.du@huawei.com>
Cc: <gcc@gcc.gnu.org>, <gcc-bugs@gcc.gnu.org>,
	Ning Jia <ning.jia@huawei.com>, Li Yu <marvin.tms@huawei.com>,
	Wang Nan <wangnan0@huawei.com>, Hui Wang <hw.huiwang@huawei.com>
Subject: Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
Date: Fri, 11 Aug 2023 21:37:45 +0800	[thread overview]
Message-ID: <20230811133745.23os2wmy76x4jsg7@M910t> (raw)
In-Reply-To: <20230731125535.wpgchdsjegx2yg4h@M910t>

On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> Hello, folks.
> This is to discuss Gcc's heuristic strategy about Predicated Instructions and
> Branches. And probably something needs to be improved.
> 
> [The story]
> Weeks ago, I built a huffman encoding program with O2, O3, and PGO respectively.
> This program is nothing special, just a random code I found on the internet. You
> can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.
> 
> Build it with O2/O3/PGO (My GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
> 
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
> 
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat  -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat  -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat  -r3 --table -- taskset -c 0 ./huffman.pgo test.data
> 
> The result (p-core, no ht, no turbo, performance mode):
> 
>                                 O2                      O3              PGO
> cycles                          2,581,832,749   8,638,401,568   9,394,200,585
>                                 (1.07s)         (3.49s)         (3.80s)
> instructions                    12,609,600,094  11,827,675,782  12,036,010,638
> branches                        2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                   0.00%           7.94%           8.84%
> cache-misses                    3,012,613       3,055,722       3,076,316
> L1-icache-load-misses           11,416,391      12,112,703      11,896,077
> icache_tag.stalls               1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit            6,856           21,756          22,600
> itlb_misses.walk_completed      14,430          4,454           15,084
> baclears.any                    131,573         140,355         131,644
> int_misc.clear_resteer_cycles   2,545,915       586,578,125     679,021,993
> machine_clears.count            22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles 6,985,838      12,929,675      8,405,493
> frontend_retired.any_dsb_miss   28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any              1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                    11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                   207,881,687     216,734,007     212,003,064
> 
> 
> Above data shows:
>   o O3/PGO lead to *2.3x/2.6x* performance drop than O2 respectively.
>   o O3/PGO reduced instructions by 6.2% and 4.5%. I think this attributes to
>     aggressive inline.
>   o O3/PGO introduced very bad branch prediction. I will explain it later.
>   o Code built with O3 has high iTLB miss but much lower sTLB miss. This is beyond
>     my expectation.
>   o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
>     I don't know why. (subcategory MC is not measured yet)
>   o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>   o The idq.mite_uops of O3/PGO increased 4%, while idq.dsb_uops increased 2x.
>     DSB hit well. So frontend fetching and decoding is not a problem for O3/PGO.
>   o Other events are mainly affected by bad branch misprediction.
> 
> Additionally, here is the TMA level 2 analysis: The main changes in the pipeline
> slots are of Bad Speculation and Frontend Bound categories. I doubt the accuracy
> of tma_fetch_bandwidth according to above frontend_retired.any_dsb_miss and
> idq.mite_uops data.
> 
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> test.data.huf is 1.00% of test.data
> 
>  Performance counter stats for 'taskset -c 0 ./huffman test.data':
> 
>  %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
>                        0.0                  0.8                     11.4                     76.8                  2.0                     8.3                  0.8                    0.0
> 
>        1.073381357 seconds time elapsed
> 
>        0.945233000 seconds user
>        0.095719000 seconds sys
> 
> 
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> test.data.huf is 1.00% of test.data
> 
>  Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
> 
>  %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
>                       38.2                  6.6                      3.5                     21.7                  0.9                    20.9                  7.5                    0.8
> 
>        3.501875873 seconds time elapsed
> 
>        3.378572000 seconds user
>        0.084163000 seconds sys
> 
> 
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
> test.data.huf is 1.00% of test.data
> 
>  Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':
> 
>  %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
>                       40.3                  6.3                      3.6                     19.4                  1.2                    17.8                 10.7                    0.8
> 
>        3.803413059 seconds time elapsed
> 
>        3.686474000 seconds user
>        0.079707000 seconds sys
> 
> 
> I also tried the same program with O2/O3 on arm64. And O3 lead to *30%* performance
> drop than O2.
> 
> 
> [Predicated Ins vs Branches]
> Then I analyzed the Bad Speculation problem. 99% of the miss-prediction in O3/PGO
> is caused by the below branch.
> 
> @@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
> 
>         /* put a one on the end of this byte if b is '1' */
> 
> -       if (b == '1') current_byte |= 1;
> +       //if (b == '1') current_byte |= 1;
> 
>         /* one more bit */
> 
> If I comment it out as above patch, then O3/PGO can get 16% and 12% performance
> improvement compared to O2 on x86.
> 
>                         O2              O3              PGO
> cycles                  2,497,674,824   2,104,993,224   2,199,753,593
> instructions            10,457,508,646  9,723,056,131   10,457,216,225
> branches                2,303,029,380   2,250,522,323   2,302,994,942
> branch-misses           0.00%           0.01%           0.01%
> 
> The main difference in the compilation output about code around the miss-prediction
> branch is:
>   o In O2: predicated instruction (cmov here) is selected to eliminate above
>     branch. cmov is true better than branch here.
>   o In O3/PGO: bitout() is inlined into encode_file(), and branch instruction
>     is selected. But this branch is obviously *unpredictable* and the compiler
>     doesn't know it. This why O3/PGO are are so bad for this program.
>

After some debugging, I found the if-conversion pass failed at below
'!REG_P (dest)' check with O3.

static bool
check_cond_move_block (basic_block bb,
		       hash_map<rtx, rtx> *vals,
		       vec<rtx> *regs,
		       rtx cond)
{
      ...
      dest = SET_DEST (set);
      src = SET_SRC (set);
      if (!REG_P (dest) // REG_P() return 0 here
	  || (HARD_REGISTER_P (dest)
	      && targetm.small_register_classes_for_mode_p (GET_MODE (dest)))) {
	      printf("check_cond_move_block: failed at line %d\n", __LINE__); //
	      return false;
	}


-- 
Cheers,
Changbin Du

     prev parent reply	other threads:[~2023-08-11 13:38 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-31 12:55 Changbin Du
2023-07-31 13:53 ` Richard Biener
2023-08-01  8:44   ` Jan Hubicka
2023-08-01 12:31     ` Changbin Du
2023-08-01 12:21   ` Changbin Du
2023-08-01 12:45 ` Changbin Du
2023-08-11 13:37 ` Changbin Du [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230811133745.23os2wmy76x4jsg7@M910t \
    --to=changbin.du@huawei.com \
    --cc=gcc-bugs@gcc.gnu.org \
    --cc=gcc@gcc.gnu.org \
    --cc=hw.huiwang@huawei.com \
    --cc=marvin.tms@huawei.com \
    --cc=ning.jia@huawei.com \
    --cc=wangnan0@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).