* [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
@ 2023-07-31 12:55 Changbin Du
2023-07-31 13:53 ` Richard Biener
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Changbin Du @ 2023-07-31 12:55 UTC (permalink / raw)
To: gcc, gcc-bugs; +Cc: Ning Jia, Li Yu, Wang Nan, Hui Wang, Changbin Du
Hello, folks.
This is to discuss GCC's heuristics for choosing between predicated instructions
and branches. Probably something there needs to be improved.
[The story]
Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
This program is nothing special, just some random code I found on the internet.
You can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.
Build it with O2/O3/PGO (My GCC is 13.1):
$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c
$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
Run them on my 12900H laptop:
$ head -c 50M /dev/urandom > test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
The result (p-core, no ht, no turbo, performance mode):
O2 O3 PGO
cycles 2,581,832,749 8,638,401,568 9,394,200,585
(1.07s) (3.49s) (3.80s)
instructions 12,609,600,094 11,827,675,782 12,036,010,638
branches 2,303,416,221 2,671,184,833 2,723,414,574
branch-misses 0.00% 7.94% 8.84%
cache-misses 3,012,613 3,055,722 3,076,316
L1-icache-load-misses 11,416,391 12,112,703 11,896,077
icache_tag.stalls 1,553,521 1,364,092 1,896,066
itlb_misses.stlb_hit 6,856 21,756 22,600
itlb_misses.walk_completed 14,430 4,454 15,084
baclears.any 131,573 140,355 131,644
int_misc.clear_resteer_cycles 2,545,915 586,578,125 679,021,993
machine_clears.count 22,235 39,671 37,307
dsb2mite_switches.penalty_cycles 6,985,838 12,929,675 8,405,493
frontend_retired.any_dsb_miss 28,785,677 28,161,724 28,093,319
idq.dsb_cycles_any 1,986,038,896 5,683,820,258 5,971,969,906
idq.dsb_uops 11,149,445,952 26,438,051,062 28,622,657,650
idq.mite_uops 207,881,687 216,734,007 212,003,064
The above data shows:
o O3/PGO cause a *2.3x/2.6x* performance drop compared to O2, respectively.
o O3/PGO reduce the instruction count by 6.2% and 4.5%. I attribute this to
aggressive inlining.
o O3/PGO introduce very bad branch prediction. I will explain this later.
o Code built with O3 has more iTLB misses but far fewer sTLB misses. This is
beyond my expectation.
o O3/PGO introduce 78% and 68% more machine clears. This is interesting and
I don't know why. (The MC subcategories are not measured yet.)
o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
o The idq.mite_uops of O3/PGO increased 4%, while idq.dsb_uops increased 2x.
The DSB hit rate is good, so frontend fetch and decode are not a problem for
O3/PGO.
o The other events are mainly affected by the bad branch misprediction.
Additionally, here is the TMA level 2 analysis: the main changes in the pipeline
slots are in the Bad Speculation and Frontend Bound categories. Given the
frontend_retired.any_dsb_miss and idq.mite_uops data above, I doubt the accuracy
of tma_fetch_bandwidth.
$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data
Performance counter stats for 'taskset -c 0 ./huffman test.data':
% tma_branch_mispredicts % tma_core_bound % tma_heavy_operations % tma_light_operations % tma_memory_bound % tma_fetch_bandwidth % tma_fetch_latency % tma_machine_clears
0.0 0.8 11.4 76.8 2.0 8.3 0.8 0.0
1.073381357 seconds time elapsed
0.945233000 seconds user
0.095719000 seconds sys
$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data
Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
% tma_branch_mispredicts % tma_core_bound % tma_heavy_operations % tma_light_operations % tma_memory_bound % tma_fetch_bandwidth % tma_fetch_latency % tma_machine_clears
38.2 6.6 3.5 21.7 0.9 20.9 7.5 0.8
3.501875873 seconds time elapsed
3.378572000 seconds user
0.084163000 seconds sys
$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.00% of test.data
Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':
% tma_branch_mispredicts % tma_core_bound % tma_heavy_operations % tma_light_operations % tma_memory_bound % tma_fetch_bandwidth % tma_fetch_latency % tma_machine_clears
40.3 6.3 3.6 19.4 1.2 17.8 10.7 0.8
3.803413059 seconds time elapsed
3.686474000 seconds user
0.079707000 seconds sys
I also tried the same program with O2/O3 on arm64, where O3 causes a *30%*
performance drop compared to O2.
[Predicated Ins vs Branches]
Then I analyzed the Bad Speculation problem. 99% of the mispredictions in O3/PGO
are caused by the branch below.
@@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {
/* put a one on the end of this byte if b is '1' */
- if (b == '1') current_byte |= 1;
+ //if (b == '1') current_byte |= 1;
/* one more bit */
If I comment it out as in the patch above, then O3/PGO gain 16% and 12%
performance over O2 on x86.
O2 O3 PGO
cycles 2,497,674,824 2,104,993,224 2,199,753,593
instructions 10,457,508,646 9,723,056,131 10,457,216,225
branches 2,303,029,380 2,250,522,323 2,302,994,942
branch-misses 0.00% 0.01% 0.01%
The main difference in the compiled output around the mispredicted branch is:
o In O2: a predicated instruction (cmov here) is selected to eliminate the
branch. cmov is genuinely better than a branch here.
o In O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
is selected. But this branch is obviously *unpredictable* and the compiler
doesn't know it. This is why O3/PGO are so bad for this program.
GCC doesn't support __builtin_unpredictable(), which LLVM introduced. I tried
whether __builtin_expect_with_probability(e, x, 0.5) can serve the same
purpose, but the result is negative.
I think we can conclude that there must be something to improve in GCC's
heuristics for predicated instructions versus branches, at least for O3 and
PGO.
And can we add __builtin_unpredictable() support to GCC? It is usually hard
for the compiler to detect unpredictable branches on its own.
--
Cheers,
Changbin Du
* Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
2023-07-31 12:55 [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2 Changbin Du
@ 2023-07-31 13:53 ` Richard Biener
2023-08-01 8:44 ` Jan Hubicka
2023-08-01 12:21 ` Changbin Du
2023-08-01 12:45 ` Changbin Du
2023-08-11 13:37 ` Changbin Du
2 siblings, 2 replies; 7+ messages in thread
From: Richard Biener @ 2023-07-31 13:53 UTC (permalink / raw)
To: Changbin Du, Jan Hubicka
Cc: gcc, gcc-bugs, Ning Jia, Li Yu, Wang Nan, Hui Wang
On Mon, Jul 31, 2023 at 2:57 PM Changbin Du via Gcc <gcc@gcc.gnu.org> wrote:
> [snip]
>
> Gcc doesn't support __builtin_unpredictable() which has been introduced by llvm.
> Then I tried to see if __builtin_expect_with_probability(e,x, 0.5) can serve the
> same purpose. The result is negative.
But does it appear to be predictable with your profiling data?
* Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
2023-07-31 13:53 ` Richard Biener
@ 2023-08-01 8:44 ` Jan Hubicka
2023-08-01 12:31 ` Changbin Du
2023-08-01 12:21 ` Changbin Du
1 sibling, 1 reply; 7+ messages in thread
From: Jan Hubicka @ 2023-08-01 8:44 UTC (permalink / raw)
To: Richard Biener
Cc: Changbin Du, gcc, gcc-bugs, Ning Jia, Li Yu, Wang Nan, Hui Wang
> > [snip]
> > Gcc doesn't support __builtin_unpredictable() which has been introduced by llvm.
> > Then I tried to see if __builtin_expect_with_probability(e,x, 0.5) can serve the
> > same purpose. The result is negative.
>
> But does it appear to be predictable with your profiling data?
Also, one thing is that __builtin_expect and
__builtin_expect_with_probability only affect the static branch
prediction algorithm, so with profile feedback they are ignored on every
branch executed at least once during the train run.
Setting probability 0.5 is really not the same as a hint that the
branch will be mispredicted, since modern CPUs handle regularly
behaving branches well (such as a branch firing on every even iteration of a
loop).
So I think having the builtin is not a bad idea. I was wondering whether it
makes sense to represent it within the profile_probability type, and I am
not convinced, since an "unpredictable probability" sounds conceptually
odd and we would need to keep the flag intact over all the probability
updates we do. For things like loop exits we recompute probabilities
from frequencies after unrolling/vectorization and other transformations, and
we would need to invent a new API to propagate the flag from the previous
probability (which is not even part of the computation right now).
So I guess the challenge is how to pass this info down through the
optimization pipeline, since we would need to annotate gimple
conds/switches and carry it through to the RTL level. On gimple we have flags
and at the RTL level notes, so there is space for it, but we would need to
maintain the info through CFG changes.
Auto-FDO may be an interesting way to detect such branches.
Honza
* Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
2023-07-31 13:53 ` Richard Biener
2023-08-01 8:44 ` Jan Hubicka
@ 2023-08-01 12:21 ` Changbin Du
1 sibling, 0 replies; 7+ messages in thread
From: Changbin Du @ 2023-08-01 12:21 UTC (permalink / raw)
To: Richard Biener
Cc: Changbin Du, Jan Hubicka, gcc, gcc-bugs, Ning Jia, Li Yu,
Wang Nan, Hui Wang
On Mon, Jul 31, 2023 at 03:53:26PM +0200, Richard Biener wrote:
[snip]
> > Gcc doesn't support __builtin_unpredictable() which has been introduced by llvm.
> > Then I tried to see if __builtin_expect_with_probability(e,x, 0.5) can serve the
> > same purpose. The result is negative.
>
> But does it appear to be predictable with your profiling data?
>
I profiled the branch-misses event on a Kaby Lake machine. 99% of the
mispredictions are attributed to the encode_file() function.
$ sudo perf record -e branch-instructions:pp,branch-misses:pp -c 1000 -- taskset -c 0 ./huffman.O3 test.data
Samples: 197K of event 'branch-misses:pp', Event count (approx.): 197618000
Overhead Command Shared Object Symbol
99.58% huffman.O3 huffman.O3 [.] encode_file
0.12% huffman.O3 [kernel.vmlinux] [k] __x86_indirect_thunk_array
0.11% huffman.O3 libc-2.31.so [.] _IO_getc
0.01% huffman.O3 [kernel.vmlinux] [k] common_file_perm
Then I annotated the encode_file() function:
Samples: 197K of event 'branch-misses:pp', 1000 Hz, Event count (approx.): 197618000
encode_file /work/myWork/linux/pgo/huffman.O3 [Percent: local period]
Percent│ ↑ je 38
│ bitout():
│ current_byte <<= 1;
│ 70: add %edi,%edi
│ if (b == '1') current_byte |= 1;
48.70 │ ┌──cmp $0x31,%dl
47.11 │ ├──jne 7a
│ │ or $0x1,%edi
│ │nbits++;
│ 7a:└─→inc %eax
│ if (b == '1') current_byte |= 1;
│ mov %edi,current_byte
│ nbits++;
│ mov %eax,nbits
│ if (nbits == 8) {
1.16 │ cmp $0x8,%eax
3.03 │ ↓ je a0
│ encode_file():
│ for (s=codes[ch]; *s; s++) bitout (outfile, *s);
│ movzbl 0x1(%r13),%edx
│ inc %r13
│ test %dl,%dl
│ ↑ jne 70
│ ↑ jmp 38
│ nop
--
Cheers,
Changbin Du
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
2023-08-01 8:44 ` Jan Hubicka
@ 2023-08-01 12:31 ` Changbin Du
0 siblings, 0 replies; 7+ messages in thread
From: Changbin Du @ 2023-08-01 12:31 UTC (permalink / raw)
To: Jan Hubicka
Cc: Richard Biener, Changbin Du, gcc, gcc-bugs, Ning Jia, Li Yu,
Wang Nan, Hui Wang
On Tue, Aug 01, 2023 at 10:44:02AM +0200, Jan Hubicka wrote:
> [snip]
>
> Setting probability 0.5 is really not the same as a hint that the
> branch will be mispredicted, since modern CPUs handle regularly
> behaving branches well (such as a branch firing on every even iteration of a loop).
>
Yeah, setting probability 0.5 was just an experimental attempt. I don't know
how the heuristic works internally.
> [snip]
>
> Auto-FDO may be an interesting way to detect such branches.
>
So I suppose PGO could do that too. But in my test a branch instruction is
selected just as with O3, and the data shows that cmov works better than a
branch here.
--
Cheers,
Changbin Du
* Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
2023-07-31 12:55 [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2 Changbin Du
2023-07-31 13:53 ` Richard Biener
@ 2023-08-01 12:45 ` Changbin Du
2023-08-11 13:37 ` Changbin Du
2 siblings, 0 replies; 7+ messages in thread
From: Changbin Du @ 2023-08-01 12:45 UTC (permalink / raw)
To: Changbin Du; +Cc: gcc, gcc-bugs, Ning Jia, Li Yu, Wang Nan, Hui Wang
On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> [snip]
>
> o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
> I don't know why. (subcategory MC is not measured yet)
The MCs are caused by memory ordering conflicts and are attributable to the
kernel RCU lock in the I/O path, when ext4 tries to update its journal.
--
Cheers,
Changbin Du
* Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
2023-07-31 12:55 [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2 Changbin Du
2023-07-31 13:53 ` Richard Biener
2023-08-01 12:45 ` Changbin Du
@ 2023-08-11 13:37 ` Changbin Du
2 siblings, 0 replies; 7+ messages in thread
From: Changbin Du @ 2023-08-11 13:37 UTC (permalink / raw)
To: Changbin Du; +Cc: gcc, gcc-bugs, Ning Jia, Li Yu, Wang Nan, Hui Wang
On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> [snip]
>
> The main difference in the compiled output around the mispredicted branch is:
> o In O2: a predicated instruction (cmov here) is selected to eliminate the
> branch. cmov is genuinely better than a branch here.
> o In O3/PGO: bitout() is inlined into encode_file(), and a branch instruction
> is selected. But this branch is obviously *unpredictable* and the compiler
> doesn't know it. This is why O3/PGO are so bad for this program.
>
After some debugging, I found that with O3 the if-conversion pass fails at the
'!REG_P (dest)' check below.
static bool
check_cond_move_block (basic_block bb,
                       hash_map<rtx, rtx> *vals,
                       vec<rtx> *regs,
                       rtx cond)
{
  ...
      dest = SET_DEST (set);
      src = SET_SRC (set);
      if (!REG_P (dest)  /* REG_P (dest) returns 0 here */
          || (HARD_REGISTER_P (dest)
              && targetm.small_register_classes_for_mode_p (GET_MODE (dest))))
        {
          printf ("check_cond_move_block: failed at line %d\n", __LINE__);  /* my debug aid */
          return false;
        }
--
Cheers,
Changbin Du