* [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
From: Changbin Du @ 2023-07-31 12:55 UTC
  To: gcc, gcc-bugs; +Cc: Ning Jia, Li Yu, Wang Nan, Hui Wang, Changbin Du

Hello, folks.
This is to discuss GCC's heuristic strategy for choosing between predicated
instructions and branches, and something that probably needs to be improved.

[The story]
Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
The program is nothing special, just random code I found on the internet. You
can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (my GCC is 13.1):
$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c

$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c

Run them on my 12900H laptop:
$ head -c 50M /dev/urandom > test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no ht, no turbo, performance mode):

                                O2                      O3              PGO
cycles                          2,581,832,749   8,638,401,568   9,394,200,585
                                (1.07s)         (3.49s)         (3.80s)
instructions                    12,609,600,094  11,827,675,782  12,036,010,638
branches                        2,303,416,221   2,671,184,833   2,723,414,574
branch-misses                   0.00%           7.94%           8.84%
cache-misses                    3,012,613       3,055,722       3,076,316
L1-icache-load-misses           11,416,391      12,112,703      11,896,077
icache_tag.stalls               1,553,521       1,364,092       1,896,066
itlb_misses.stlb_hit            6,856           21,756          22,600
itlb_misses.walk_completed      14,430          4,454           15,084
baclears.any                    131,573         140,355         131,644
int_misc.clear_resteer_cycles   2,545,915       586,578,125     679,021,993
machine_clears.count            22,235          39,671          37,307
dsb2mite_switches.penalty_cycles 6,985,838      12,929,675      8,405,493
frontend_retired.any_dsb_miss   28,785,677      28,161,724      28,093,319
idq.dsb_cycles_any              1,986,038,896   5,683,820,258   5,971,969,906
idq.dsb_uops                    11,149,445,952  26,438,051,062  28,622,657,650
idq.mite_uops                   207,881,687     216,734,007     212,003,064


The above data shows:
  o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
  o O3/PGO reduced the instruction count by 6.2% and 4.5%. I attribute this to
    aggressive inlining.
  o O3/PGO introduced very bad branch prediction. I will explain this later.
  o Code built with O3 has more iTLB misses but far fewer sTLB misses. This is
    beyond my expectation.
  o O3/PGO introduced 78% and 68% more machine clears. This is interesting and
    I don't know why. (The machine-clear subcategories are not measured yet.)
  o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
  o idq.mite_uops increased by only ~4% for O3/PGO, while idq.dsb_uops
    increased ~2x. The DSB hits well, so frontend fetching and decoding is not
    the problem for O3/PGO.
  o The other events are mainly affected by the bad branch misprediction.

Additionally, here is the TMA level 2 analysis: the main changes in the
pipeline slots are in the Bad Speculation and Frontend Bound categories. I
doubt the accuracy of tma_fetch_bandwidth, given the
frontend_retired.any_dsb_miss and idq.mite_uops data above.

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

 %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
                       0.0                  0.8                     11.4                     76.8                  2.0                     8.3                  0.8                    0.0

       1.073381357 seconds time elapsed

       0.945233000 seconds user
       0.095719000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

 %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
                      38.2                  6.6                      3.5                     21.7                  0.9                    20.9                  7.5                    0.8

       3.501875873 seconds time elapsed

       3.378572000 seconds user
       0.084163000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.pgo test.data':

 %  tma_branch_mispredicts    %  tma_core_bound %  tma_heavy_operations %  tma_light_operations  %  tma_memory_bound %  tma_fetch_bandwidth %  tma_fetch_latency %  tma_machine_clears
                      40.3                  6.3                      3.6                     19.4                  1.2                    17.8                 10.7                    0.8

       3.803413059 seconds time elapsed

       3.686474000 seconds user
       0.079707000 seconds sys


I also tried the same program with O2/O3 on arm64, where O3 led to a *30%*
performance drop relative to O2.


[Predicated Ins vs Branches]
Then I analyzed the Bad Speculation problem. 99% of the mispredictions in
O3/PGO are caused by the branch below.
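
For reference, a branch like this can be located with standard perf
branch-miss sampling, e.g.:

$ perf record -e branch-misses -- taskset -c 0 ./huffman.O3 test.data
$ perf report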

@@ -264,7 +264,7 @@ void bitout (FILE *f, char b) {

        /* put a one on the end of this byte if b is '1' */

-       if (b == '1') current_byte |= 1;
+       //if (b == '1') current_byte |= 1;

        /* one more bit */

If I comment it out, as in the above patch, then O3/PGO get a 16% and 12%
performance improvement over O2 on x86.

                        O2              O3              PGO
cycles                  2,497,674,824   2,104,993,224   2,199,753,593
instructions            10,457,508,646  9,723,056,131   10,457,216,225
branches                2,303,029,380   2,250,522,323   2,302,994,942
branch-misses           0.00%           0.01%           0.01%

The main difference in the compiled code around the mispredicted branch is:
  o In O2: a predicated instruction (cmov here) is selected to eliminate the
    branch. cmov is truly better than a branch here.
  o In O3/PGO: bitout() is inlined into encode_file(), and a branch
    instruction is selected. But this branch is obviously *unpredictable*,
    and the compiler doesn't know it. This is why O3/PGO are so bad for this
    program (see the sketch below).
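
For illustration, here is the branchless form that O2's cmov effectively
computes, written at the source level (a sketch; current_byte and b are the
identifiers from huffman.c):

        /* branchy form: the compiler may emit a conditional jump,
           which mispredicts badly on random input */
        if (b == '1')
                current_byte |= 1;

        /* branchless form: (b == '1') evaluates to 0 or 1, so the OR
           is unconditional and no branch is needed */
        current_byte |= (b == '1');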

GCC doesn't support __builtin_unpredictable(), which was introduced by LLVM.
I tried to see whether __builtin_expect_with_probability(e, x, 0.5) can serve
the same purpose, but the result is negative.
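
For reference, here is roughly how such an annotation would look (a sketch:
__builtin_unpredictable() exists only in Clang, and
__builtin_expect_with_probability() is available since GCC 9; as said above,
the latter did not help here):

#if defined(__clang__)
        /* Clang: mark the condition as unpredictable so the compiler
           prefers a predicated instruction over a branch. */
        if (__builtin_unpredictable(b == '1'))
                current_byte |= 1;
#else
        /* GCC: a 50% probability hint is the closest available knob,
           but it does not have the same effect in this experiment. */
        if (__builtin_expect_with_probability(b == '1', 1, 0.5))
                current_byte |= 1;
#endif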

I think we can conclude that there is something to improve in GCC's heuristic
strategy for choosing between predicated instructions and branches, at least
for O3 and PGO.

And could we add __builtin_unpredictable() support to GCC? It's usually hard
for the compiler to detect unpredictable branches.

--
Cheers,
Changbin Du
