* the elimination of if blocks in GCC during if-conversion and vectorization @ 2023-10-12 12:16 Hanke Zhang 2023-10-17 9:23 ` Richard Biener 0 siblings, 1 reply; 10+ messages in thread From: Hanke Zhang @ 2023-10-12 12:16 UTC (permalink / raw) To: gcc Hi, I'm recently working on vectorization of GCC. I'm stuck in a small problem and would like to ask for advice. For example, for the following code: int main() { int size = 1000; int *foo = malloc(sizeof(int) * size); int c1 = rand(), t1 = rand(); for (int i = 0; i < size; i++) { if (foo[i] & c1) { foo[i] = t1; } } // prevents the loop above from being optimized for (int i = 0; i < size; i++) { printf("%d", foo[i]); } } First of all, the if statement block in the loop will be converted to a MASK_STORE through if-conversion optimization. But after tree-vector, it will still become a branched form. The part of the final disassembly structure probably looks like below(Using IDA to do this), and you can see that there is still such a branch 'if ( !_ZF )' in it, which will lead to low efficiency. do { while ( 1 ) { __asm { vpand ymm0, ymm2, ymmword ptr [rax] vpcmpeqd ymm0, ymm0, ymm1 vpcmpeqd ymm0, ymm0, ymm1 vptest ymm0, ymm0 } if ( !_ZF ) break; _RAX += 8; if ( _RAX == v9 ) goto LABEL_5; } __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } _RAX += 8; } while ( _RAX != v9 ); Why can't we just replace the vptest and if statement with some other instructions like vpblendvb so that it can be faster? Or is there a good way to do that? Thanks Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-12 12:16 the elimination of if blocks in GCC during if-conversion and vectorization Hanke Zhang @ 2023-10-17 9:23 ` Richard Biener 2023-10-17 11:54 ` Hanke Zhang 0 siblings, 1 reply; 10+ messages in thread From: Richard Biener @ 2023-10-17 9:23 UTC (permalink / raw) To: Hanke Zhang; +Cc: gcc On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > problem and would like to ask for advice. > > For example, for the following code: > > int main() { > int size = 1000; > int *foo = malloc(sizeof(int) * size); > int c1 = rand(), t1 = rand(); > > for (int i = 0; i < size; i++) { > if (foo[i] & c1) { > foo[i] = t1; > } > } > > // prevents the loop above from being optimized > for (int i = 0; i < size; i++) { > printf("%d", foo[i]); > } > } > > First of all, the if statement block in the loop will be converted to > a MASK_STORE through if-conversion optimization. But after > tree-vector, it will still become a branched form. The part of the > final disassembly structure probably looks like below(Using IDA to do > this), and you can see that there is still such a branch 'if ( !_ZF )' > in it, which will lead to low efficiency. > > do > { > while ( 1 ) > { > __asm > { > vpand ymm0, ymm2, ymmword ptr [rax] > vpcmpeqd ymm0, ymm0, ymm1 > vpcmpeqd ymm0, ymm0, ymm1 > vptest ymm0, ymm0 > } > if ( !_ZF ) > break; > _RAX += 8; > if ( _RAX == v9 ) > goto LABEL_5; > } > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > _RAX += 8; > } > while ( _RAX != v9 ); > > Why can't we just replace the vptest and if statement with some other > instructions like vpblendvb so that it can be faster? Or is there a > good way to do that? The branch is added by optimize_mask_stores after vectorization because fully masked (disabled) masked stores can incur a quite heavy penalty on some architectures when fault assists (read-only pages, but also COW pages) are ran into. All the microcode handling needs to possibly be carried out multiple times, for each such access to the same page. That can cause a 1000x slowdown when you hit this case. Thus every masked store is replaced by if (mask != 0) masked_store (); and this is an optimization (which itself has a small cost). Richard. > > Thanks > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-17 9:23 ` Richard Biener @ 2023-10-17 11:54 ` Hanke Zhang 2023-10-17 11:57 ` Richard Biener 0 siblings, 1 reply; 10+ messages in thread From: Hanke Zhang @ 2023-10-17 11:54 UTC (permalink / raw) To: Richard Biener; +Cc: gcc Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > problem and would like to ask for advice. > > > > For example, for the following code: > > > > int main() { > > int size = 1000; > > int *foo = malloc(sizeof(int) * size); > > int c1 = rand(), t1 = rand(); > > > > for (int i = 0; i < size; i++) { > > if (foo[i] & c1) { > > foo[i] = t1; > > } > > } > > > > // prevents the loop above from being optimized > > for (int i = 0; i < size; i++) { > > printf("%d", foo[i]); > > } > > } > > > > First of all, the if statement block in the loop will be converted to > > a MASK_STORE through if-conversion optimization. But after > > tree-vector, it will still become a branched form. The part of the > > final disassembly structure probably looks like below(Using IDA to do > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > in it, which will lead to low efficiency. > > > > do > > { > > while ( 1 ) > > { > > __asm > > { > > vpand ymm0, ymm2, ymmword ptr [rax] > > vpcmpeqd ymm0, ymm0, ymm1 > > vpcmpeqd ymm0, ymm0, ymm1 > > vptest ymm0, ymm0 > > } > > if ( !_ZF ) > > break; > > _RAX += 8; > > if ( _RAX == v9 ) > > goto LABEL_5; > > } > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > _RAX += 8; > > } > > while ( _RAX != v9 ); > > > > Why can't we just replace the vptest and if statement with some other > > instructions like vpblendvb so that it can be faster? Or is there a > > good way to do that? > > The branch is added by optimize_mask_stores after vectorization because > fully masked (disabled) masked stores can incur a quite heavy penalty on > some architectures when fault assists (read-only pages, but also COW pages) > are ran into. All the microcode handling needs to possibly be carried out > multiple times, for each such access to the same page. That can cause > a 1000x slowdown when you hit this case. Thus every masked store > is replaced by > > if (mask != 0) > masked_store (); > > and this is an optimization (which itself has a small cost). > > Richard. Yeah, I know that and I have seen the code of optimize_mask_store(). And the main problem here is that when multiple MASK_STORE appear in the same loop, many branches will appear, resulting in a decrease in overall efficiency. And my original idea is that why can't we replace MASK_STORE with more effective SIMD instructions because icc can do much better in this case. Then I give it up, because the ability to analyze vectorization of gcc is not as good as icc and my ability does not support me modifying this part of the code. Thanks very much for your reply. > > > > > Thanks > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-17 11:54 ` Hanke Zhang @ 2023-10-17 11:57 ` Richard Biener 2023-10-17 12:39 ` Hanke Zhang 0 siblings, 1 reply; 10+ messages in thread From: Richard Biener @ 2023-10-17 11:57 UTC (permalink / raw) To: Hanke Zhang; +Cc: gcc On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > problem and would like to ask for advice. > > > > > > For example, for the following code: > > > > > > int main() { > > > int size = 1000; > > > int *foo = malloc(sizeof(int) * size); > > > int c1 = rand(), t1 = rand(); > > > > > > for (int i = 0; i < size; i++) { > > > if (foo[i] & c1) { > > > foo[i] = t1; > > > } > > > } > > > > > > // prevents the loop above from being optimized > > > for (int i = 0; i < size; i++) { > > > printf("%d", foo[i]); > > > } > > > } > > > > > > First of all, the if statement block in the loop will be converted to > > > a MASK_STORE through if-conversion optimization. But after > > > tree-vector, it will still become a branched form. The part of the > > > final disassembly structure probably looks like below(Using IDA to do > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > in it, which will lead to low efficiency. > > > > > > do > > > { > > > while ( 1 ) > > > { > > > __asm > > > { > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > vpcmpeqd ymm0, ymm0, ymm1 > > > vpcmpeqd ymm0, ymm0, ymm1 > > > vptest ymm0, ymm0 > > > } > > > if ( !_ZF ) > > > break; > > > _RAX += 8; > > > if ( _RAX == v9 ) > > > goto LABEL_5; > > > } > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > _RAX += 8; > > > } > > > while ( _RAX != v9 ); > > > > > > Why can't we just replace the vptest and if statement with some other > > > instructions like vpblendvb so that it can be faster? Or is there a > > > good way to do that? > > > > The branch is added by optimize_mask_stores after vectorization because > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > some architectures when fault assists (read-only pages, but also COW pages) > > are ran into. All the microcode handling needs to possibly be carried out > > multiple times, for each such access to the same page. That can cause > > a 1000x slowdown when you hit this case. Thus every masked store > > is replaced by > > > > if (mask != 0) > > masked_store (); > > > > and this is an optimization (which itself has a small cost). > > > > Richard. > > Yeah, I know that and I have seen the code of optimize_mask_store(). > And the main problem here is that when multiple MASK_STORE appear in > the same loop, many branches will appear, resulting in a decrease in > overall efficiency. > > And my original idea is that why can't we replace MASK_STORE with more > effective SIMD instructions because icc can do much better in this > case. ICC probably doesn't care for the case where foo[] isn't writable. In fact for the case at hand we see it comes from malloc() which we can assume to return writable memory I guess. That means if-conversion can treat the unconditional read as a way to also allow to speculate the write (with -fallow-store-data-races). Note there's currently no pointer analysis that tracks writability. > Then I give it up, because the ability to analyze vectorization > of gcc is not as good as icc and my ability does not support me > modifying this part of the code. > > Thanks very much for your reply. You're welcome. Richard. > > > > > > > > Thanks > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-17 11:57 ` Richard Biener @ 2023-10-17 12:39 ` Hanke Zhang 2023-10-19 11:57 ` Richard Biener 0 siblings, 1 reply; 10+ messages in thread From: Hanke Zhang @ 2023-10-17 12:39 UTC (permalink / raw) To: Richard Biener; +Cc: gcc Hi Richard I get it, thank you again. And I got another problem, so I'd like ask it by the way. Can the left shift of the induction variable in a loop be optimized as a constant? Like the code below: int ans = 0; int width = rand() % 16; for (int j = 0; j < width; j++) ans += 1 << (j + width) into: int width = rand() % 16; ans = (1 << (2 * width) - (1 << width)); I came across a more complex version of that and found that gcc doesn't seem to handle it, so wanted to write a pass myself to optimize it. I got two questions here. Does GCC have such optimizations? If I want to do my own optimization, where should I put it? Put it behind the pass_iv_optimize? Thanks Hanke Zhang Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道: > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > problem and would like to ask for advice. > > > > > > > > For example, for the following code: > > > > > > > > int main() { > > > > int size = 1000; > > > > int *foo = malloc(sizeof(int) * size); > > > > int c1 = rand(), t1 = rand(); > > > > > > > > for (int i = 0; i < size; i++) { > > > > if (foo[i] & c1) { > > > > foo[i] = t1; > > > > } > > > > } > > > > > > > > // prevents the loop above from being optimized > > > > for (int i = 0; i < size; i++) { > > > > printf("%d", foo[i]); > > > > } > > > > } > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > a MASK_STORE through if-conversion optimization. But after > > > > tree-vector, it will still become a branched form. The part of the > > > > final disassembly structure probably looks like below(Using IDA to do > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > in it, which will lead to low efficiency. > > > > > > > > do > > > > { > > > > while ( 1 ) > > > > { > > > > __asm > > > > { > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > vptest ymm0, ymm0 > > > > } > > > > if ( !_ZF ) > > > > break; > > > > _RAX += 8; > > > > if ( _RAX == v9 ) > > > > goto LABEL_5; > > > > } > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > _RAX += 8; > > > > } > > > > while ( _RAX != v9 ); > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > good way to do that? > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > some architectures when fault assists (read-only pages, but also COW pages) > > > are ran into. All the microcode handling needs to possibly be carried out > > > multiple times, for each such access to the same page. That can cause > > > a 1000x slowdown when you hit this case. Thus every masked store > > > is replaced by > > > > > > if (mask != 0) > > > masked_store (); > > > > > > and this is an optimization (which itself has a small cost). > > > > > > Richard. > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > And the main problem here is that when multiple MASK_STORE appear in > > the same loop, many branches will appear, resulting in a decrease in > > overall efficiency. > > > > And my original idea is that why can't we replace MASK_STORE with more > > effective SIMD instructions because icc can do much better in this > > case. > > ICC probably doesn't care for the case where foo[] isn't writable. In > fact for the case at hand we see it comes from malloc() which we > can assume to return writable memory I guess. That means if-conversion > can treat the unconditional read as a way to also allow to speculate > the write (with -fallow-store-data-races). > > Note there's currently no pointer analysis that tracks writability. > > > Then I give it up, because the ability to analyze vectorization > > of gcc is not as good as icc and my ability does not support me > > modifying this part of the code. > > > > Thanks very much for your reply. > > You're welcome. > > Richard. > > > > > > > > > > > > Thanks > > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-17 12:39 ` Hanke Zhang @ 2023-10-19 11:57 ` Richard Biener 2023-10-23 10:50 ` Hanke Zhang 0 siblings, 1 reply; 10+ messages in thread From: Richard Biener @ 2023-10-19 11:57 UTC (permalink / raw) To: Hanke Zhang; +Cc: gcc On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > Hi Richard > I get it, thank you again. > > And I got another problem, so I'd like ask it by the way. Can the left > shift of the induction variable in a loop be optimized as a constant? > Like the code below: > > int ans = 0; > int width = rand() % 16; > for (int j = 0; j < width; j++) > ans += 1 << (j + width) > > into: > > int width = rand() % 16; > ans = (1 << (2 * width) - (1 << width)); > > I came across a more complex version of that and found that gcc > doesn't seem to handle it, so wanted to write a pass myself to > optimize it. > > I got two questions here. Does GCC have such optimizations? If I want > to do my own optimization, where should I put it? Put it behind the > pass_iv_optimize? GCC has the final value replacement pass (pass_scev_cprop) doing these kind of transforms. Since 'ans' does not have an affine evolution this case would need to be pattern matched (there are some existing pattern matchings in the pass). > Thanks > Hanke Zhang > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道: > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > > problem and would like to ask for advice. > > > > > > > > > > For example, for the following code: > > > > > > > > > > int main() { > > > > > int size = 1000; > > > > > int *foo = malloc(sizeof(int) * size); > > > > > int c1 = rand(), t1 = rand(); > > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > if (foo[i] & c1) { > > > > > foo[i] = t1; > > > > > } > > > > > } > > > > > > > > > > // prevents the loop above from being optimized > > > > > for (int i = 0; i < size; i++) { > > > > > printf("%d", foo[i]); > > > > > } > > > > > } > > > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > > a MASK_STORE through if-conversion optimization. But after > > > > > tree-vector, it will still become a branched form. The part of the > > > > > final disassembly structure probably looks like below(Using IDA to do > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > > in it, which will lead to low efficiency. > > > > > > > > > > do > > > > > { > > > > > while ( 1 ) > > > > > { > > > > > __asm > > > > > { > > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > vptest ymm0, ymm0 > > > > > } > > > > > if ( !_ZF ) > > > > > break; > > > > > _RAX += 8; > > > > > if ( _RAX == v9 ) > > > > > goto LABEL_5; > > > > > } > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > > _RAX += 8; > > > > > } > > > > > while ( _RAX != v9 ); > > > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > > good way to do that? > > > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > > some architectures when fault assists (read-only pages, but also COW pages) > > > > are ran into. All the microcode handling needs to possibly be carried out > > > > multiple times, for each such access to the same page. That can cause > > > > a 1000x slowdown when you hit this case. Thus every masked store > > > > is replaced by > > > > > > > > if (mask != 0) > > > > masked_store (); > > > > > > > > and this is an optimization (which itself has a small cost). > > > > > > > > Richard. > > > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > > And the main problem here is that when multiple MASK_STORE appear in > > > the same loop, many branches will appear, resulting in a decrease in > > > overall efficiency. > > > > > > And my original idea is that why can't we replace MASK_STORE with more > > > effective SIMD instructions because icc can do much better in this > > > case. > > > > ICC probably doesn't care for the case where foo[] isn't writable. In > > fact for the case at hand we see it comes from malloc() which we > > can assume to return writable memory I guess. That means if-conversion > > can treat the unconditional read as a way to also allow to speculate > > the write (with -fallow-store-data-races). > > > > Note there's currently no pointer analysis that tracks writability. > > > > > Then I give it up, because the ability to analyze vectorization > > > of gcc is not as good as icc and my ability does not support me > > > modifying this part of the code. > > > > > > Thanks very much for your reply. > > > > You're welcome. > > > > Richard. > > > > > > > > > > > > > > > > Thanks > > > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-19 11:57 ` Richard Biener @ 2023-10-23 10:50 ` Hanke Zhang 2023-10-23 12:29 ` Richard Biener 0 siblings, 1 reply; 10+ messages in thread From: Hanke Zhang @ 2023-10-23 10:50 UTC (permalink / raw) To: Richard Biener; +Cc: gcc Hi Richard: Thanks for your advice. But when I try a simpler example like the one below before looking at the code, GCC still does nothing. int main() { int width; scanf("%d", &width); int sum = 0; for (int i = 0; i < width; i++) sum += i; printf("%d\n", sum); } I tried O3 and LTO, but still the same. So I'd like to ask why, or am I doing something wrong? Thanks Hanke Zhang Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道: > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > Hi Richard > > I get it, thank you again. > > > > And I got another problem, so I'd like ask it by the way. Can the left > > shift of the induction variable in a loop be optimized as a constant? > > Like the code below: > > > > int ans = 0; > > int width = rand() % 16; > > for (int j = 0; j < width; j++) > > ans += 1 << (j + width) > > > > into: > > > > int width = rand() % 16; > > ans = (1 << (2 * width) - (1 << width)); > > > > I came across a more complex version of that and found that gcc > > doesn't seem to handle it, so wanted to write a pass myself to > > optimize it. > > > > I got two questions here. Does GCC have such optimizations? If I want > > to do my own optimization, where should I put it? Put it behind the > > pass_iv_optimize? > > GCC has the final value replacement pass (pass_scev_cprop) doing these > kind of transforms. Since 'ans' does not have an affine evolution this > case would need to be pattern matched (there are some existing pattern > matchings in the pass). > > > Thanks > > Hanke Zhang > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道: > > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > > > problem and would like to ask for advice. > > > > > > > > > > > > For example, for the following code: > > > > > > > > > > > > int main() { > > > > > > int size = 1000; > > > > > > int *foo = malloc(sizeof(int) * size); > > > > > > int c1 = rand(), t1 = rand(); > > > > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > if (foo[i] & c1) { > > > > > > foo[i] = t1; > > > > > > } > > > > > > } > > > > > > > > > > > > // prevents the loop above from being optimized > > > > > > for (int i = 0; i < size; i++) { > > > > > > printf("%d", foo[i]); > > > > > > } > > > > > > } > > > > > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > > > a MASK_STORE through if-conversion optimization. But after > > > > > > tree-vector, it will still become a branched form. The part of the > > > > > > final disassembly structure probably looks like below(Using IDA to do > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > > > in it, which will lead to low efficiency. > > > > > > > > > > > > do > > > > > > { > > > > > > while ( 1 ) > > > > > > { > > > > > > __asm > > > > > > { > > > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > vptest ymm0, ymm0 > > > > > > } > > > > > > if ( !_ZF ) > > > > > > break; > > > > > > _RAX += 8; > > > > > > if ( _RAX == v9 ) > > > > > > goto LABEL_5; > > > > > > } > > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > > > _RAX += 8; > > > > > > } > > > > > > while ( _RAX != v9 ); > > > > > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > > > good way to do that? > > > > > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > > > some architectures when fault assists (read-only pages, but also COW pages) > > > > > are ran into. All the microcode handling needs to possibly be carried out > > > > > multiple times, for each such access to the same page. That can cause > > > > > a 1000x slowdown when you hit this case. Thus every masked store > > > > > is replaced by > > > > > > > > > > if (mask != 0) > > > > > masked_store (); > > > > > > > > > > and this is an optimization (which itself has a small cost). > > > > > > > > > > Richard. > > > > > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > > > And the main problem here is that when multiple MASK_STORE appear in > > > > the same loop, many branches will appear, resulting in a decrease in > > > > overall efficiency. > > > > > > > > And my original idea is that why can't we replace MASK_STORE with more > > > > effective SIMD instructions because icc can do much better in this > > > > case. > > > > > > ICC probably doesn't care for the case where foo[] isn't writable. In > > > fact for the case at hand we see it comes from malloc() which we > > > can assume to return writable memory I guess. That means if-conversion > > > can treat the unconditional read as a way to also allow to speculate > > > the write (with -fallow-store-data-races). > > > > > > Note there's currently no pointer analysis that tracks writability. > > > > > > > Then I give it up, because the ability to analyze vectorization > > > > of gcc is not as good as icc and my ability does not support me > > > > modifying this part of the code. > > > > > > > > Thanks very much for your reply. > > > > > > You're welcome. > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-23 10:50 ` Hanke Zhang @ 2023-10-23 12:29 ` Richard Biener 2023-10-23 12:56 ` Hanke Zhang 0 siblings, 1 reply; 10+ messages in thread From: Richard Biener @ 2023-10-23 12:29 UTC (permalink / raw) To: Hanke Zhang; +Cc: gcc On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > Hi Richard: > > Thanks for your advice. But when I try a simpler example like the one > below before looking at the code, GCC still does nothing. > > int main() { > int width; > scanf("%d", &width); > int sum = 0; > for (int i = 0; i < width; i++) sum += i; > printf("%d\n", sum); > } > > I tried O3 and LTO, but still the same. So I'd like to ask why, or am > I doing something wrong? -fdump-tree-sccp-details-scev reveals (set_scalar_evolution instantiated_below = 5 (scalar = sum_9) (scalar_evolution = {0, +, {1, +, 1}_1}_1)) ) (chrec_apply (varying_loop = 1) (chrec = {0, +, {1, +, 1}_1}_1) (x = (unsigned int) width.0_12 + 4294967295) (res = scev_not_known)) so we don't know how to apply a variable number of iterations to the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not know how to compute the final value of the reduction. For a constant, say width == 100 we do: (set_scalar_evolution instantiated_below = 2 (scalar = sum_6) (scalar_evolution = {0, +, {1, +, 1}_1}_1)) ) (chrec_apply (varying_loop = 1) (chrec = {0, +, {1, +, 1}_1}_1) (x = 99) (res = 4950)) Richard. > > Thanks > Hanke Zhang > > Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道: > > > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > Hi Richard > > > I get it, thank you again. > > > > > > And I got another problem, so I'd like ask it by the way. Can the left > > > shift of the induction variable in a loop be optimized as a constant? > > > Like the code below: > > > > > > int ans = 0; > > > int width = rand() % 16; > > > for (int j = 0; j < width; j++) > > > ans += 1 << (j + width) > > > > > > into: > > > > > > int width = rand() % 16; > > > ans = (1 << (2 * width) - (1 << width)); > > > > > > I came across a more complex version of that and found that gcc > > > doesn't seem to handle it, so wanted to write a pass myself to > > > optimize it. > > > > > > I got two questions here. Does GCC have such optimizations? If I want > > > to do my own optimization, where should I put it? Put it behind the > > > pass_iv_optimize? > > > > GCC has the final value replacement pass (pass_scev_cprop) doing these > > kind of transforms. Since 'ans' does not have an affine evolution this > > case would need to be pattern matched (there are some existing pattern > > matchings in the pass). > > > > > Thanks > > > Hanke Zhang > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道: > > > > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > > > > problem and would like to ask for advice. > > > > > > > > > > > > > > For example, for the following code: > > > > > > > > > > > > > > int main() { > > > > > > > int size = 1000; > > > > > > > int *foo = malloc(sizeof(int) * size); > > > > > > > int c1 = rand(), t1 = rand(); > > > > > > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > > if (foo[i] & c1) { > > > > > > > foo[i] = t1; > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > // prevents the loop above from being optimized > > > > > > > for (int i = 0; i < size; i++) { > > > > > > > printf("%d", foo[i]); > > > > > > > } > > > > > > > } > > > > > > > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > > > > a MASK_STORE through if-conversion optimization. But after > > > > > > > tree-vector, it will still become a branched form. The part of the > > > > > > > final disassembly structure probably looks like below(Using IDA to do > > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > > > > in it, which will lead to low efficiency. > > > > > > > > > > > > > > do > > > > > > > { > > > > > > > while ( 1 ) > > > > > > > { > > > > > > > __asm > > > > > > > { > > > > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > > vptest ymm0, ymm0 > > > > > > > } > > > > > > > if ( !_ZF ) > > > > > > > break; > > > > > > > _RAX += 8; > > > > > > > if ( _RAX == v9 ) > > > > > > > goto LABEL_5; > > > > > > > } > > > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > > > > _RAX += 8; > > > > > > > } > > > > > > > while ( _RAX != v9 ); > > > > > > > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > > > > good way to do that? > > > > > > > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > > > > some architectures when fault assists (read-only pages, but also COW pages) > > > > > > are ran into. All the microcode handling needs to possibly be carried out > > > > > > multiple times, for each such access to the same page. That can cause > > > > > > a 1000x slowdown when you hit this case. Thus every masked store > > > > > > is replaced by > > > > > > > > > > > > if (mask != 0) > > > > > > masked_store (); > > > > > > > > > > > > and this is an optimization (which itself has a small cost). > > > > > > > > > > > > Richard. > > > > > > > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > > > > And the main problem here is that when multiple MASK_STORE appear in > > > > > the same loop, many branches will appear, resulting in a decrease in > > > > > overall efficiency. > > > > > > > > > > And my original idea is that why can't we replace MASK_STORE with more > > > > > effective SIMD instructions because icc can do much better in this > > > > > case. > > > > > > > > ICC probably doesn't care for the case where foo[] isn't writable. In > > > > fact for the case at hand we see it comes from malloc() which we > > > > can assume to return writable memory I guess. That means if-conversion > > > > can treat the unconditional read as a way to also allow to speculate > > > > the write (with -fallow-store-data-races). > > > > > > > > Note there's currently no pointer analysis that tracks writability. > > > > > > > > > Then I give it up, because the ability to analyze vectorization > > > > > of gcc is not as good as icc and my ability does not support me > > > > > modifying this part of the code. > > > > > > > > > > Thanks very much for your reply. > > > > > > > > You're welcome. > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-23 12:29 ` Richard Biener @ 2023-10-23 12:56 ` Hanke Zhang 2023-10-26 9:18 ` Richard Biener 0 siblings, 1 reply; 10+ messages in thread From: Hanke Zhang @ 2023-10-23 12:56 UTC (permalink / raw) To: Richard Biener; +Cc: gcc Richard Biener <richard.guenther@gmail.com> 于2023年10月23日周一 20:32写道: > > On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > Hi Richard: > > > > Thanks for your advice. But when I try a simpler example like the one > > below before looking at the code, GCC still does nothing. > > > > int main() { > > int width; > > scanf("%d", &width); > > int sum = 0; > > for (int i = 0; i < width; i++) sum += i; > > printf("%d\n", sum); > > } > > > > I tried O3 and LTO, but still the same. So I'd like to ask why, or am > > I doing something wrong? > > -fdump-tree-sccp-details-scev reveals > > (set_scalar_evolution > instantiated_below = 5 > (scalar = sum_9) > (scalar_evolution = {0, +, {1, +, 1}_1}_1)) > ) > (chrec_apply > (varying_loop = 1) > (chrec = {0, +, {1, +, 1}_1}_1) > (x = (unsigned int) width.0_12 + 4294967295) > (res = scev_not_known)) > > so we don't know how to apply a variable number of iterations to > the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not > know how to compute the final value of the reduction. > > For a constant, say width == 100 we do: > > (set_scalar_evolution > instantiated_below = 2 > (scalar = sum_6) > (scalar_evolution = {0, +, {1, +, 1}_1}_1)) > ) > (chrec_apply > (varying_loop = 1) > (chrec = {0, +, {1, +, 1}_1}_1) > (x = 99) > (res = 4950)) Yeah, I also found this result in previous experiments. But what confuses me is that if the 'width' can't be inferred to INTEGER_CST, there's nothing we can do, right? Because in my case, the variables corresponding to 'width' are almost all undetermined values, such as 'width = rand()'. That said, I can hardly get any optimizations in my cases, right? Thanks Hanke Zhang > > Richard. > > > > > Thanks > > Hanke Zhang > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道: > > > > > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > > > Hi Richard > > > > I get it, thank you again. > > > > > > > > And I got another problem, so I'd like ask it by the way. Can the left > > > > shift of the induction variable in a loop be optimized as a constant? > > > > Like the code below: > > > > > > > > int ans = 0; > > > > int width = rand() % 16; > > > > for (int j = 0; j < width; j++) > > > > ans += 1 << (j + width) > > > > > > > > into: > > > > > > > > int width = rand() % 16; > > > > ans = (1 << (2 * width) - (1 << width)); > > > > > > > > I came across a more complex version of that and found that gcc > > > > doesn't seem to handle it, so wanted to write a pass myself to > > > > optimize it. > > > > > > > > I got two questions here. Does GCC have such optimizations? If I want > > > > to do my own optimization, where should I put it? Put it behind the > > > > pass_iv_optimize? > > > > > > GCC has the final value replacement pass (pass_scev_cprop) doing these > > > kind of transforms. Since 'ans' does not have an affine evolution this > > > case would need to be pattern matched (there are some existing pattern > > > matchings in the pass). > > > > > > > Thanks > > > > Hanke Zhang > > > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道: > > > > > > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > > > > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > > > > > problem and would like to ask for advice. > > > > > > > > > > > > > > > > For example, for the following code: > > > > > > > > > > > > > > > > int main() { > > > > > > > > int size = 1000; > > > > > > > > int *foo = malloc(sizeof(int) * size); > > > > > > > > int c1 = rand(), t1 = rand(); > > > > > > > > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > > > if (foo[i] & c1) { > > > > > > > > foo[i] = t1; > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > // prevents the loop above from being optimized > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > > > printf("%d", foo[i]); > > > > > > > > } > > > > > > > > } > > > > > > > > > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > > > > > a MASK_STORE through if-conversion optimization. But after > > > > > > > > tree-vector, it will still become a branched form. The part of the > > > > > > > > final disassembly structure probably looks like below(Using IDA to do > > > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > > > > > in it, which will lead to low efficiency. > > > > > > > > > > > > > > > > do > > > > > > > > { > > > > > > > > while ( 1 ) > > > > > > > > { > > > > > > > > __asm > > > > > > > > { > > > > > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > > > vptest ymm0, ymm0 > > > > > > > > } > > > > > > > > if ( !_ZF ) > > > > > > > > break; > > > > > > > > _RAX += 8; > > > > > > > > if ( _RAX == v9 ) > > > > > > > > goto LABEL_5; > > > > > > > > } > > > > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > > > > > _RAX += 8; > > > > > > > > } > > > > > > > > while ( _RAX != v9 ); > > > > > > > > > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > > > > > good way to do that? > > > > > > > > > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > > > > > some architectures when fault assists (read-only pages, but also COW pages) > > > > > > > are ran into. All the microcode handling needs to possibly be carried out > > > > > > > multiple times, for each such access to the same page. That can cause > > > > > > > a 1000x slowdown when you hit this case. Thus every masked store > > > > > > > is replaced by > > > > > > > > > > > > > > if (mask != 0) > > > > > > > masked_store (); > > > > > > > > > > > > > > and this is an optimization (which itself has a small cost). > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > > > > > And the main problem here is that when multiple MASK_STORE appear in > > > > > > the same loop, many branches will appear, resulting in a decrease in > > > > > > overall efficiency. > > > > > > > > > > > > And my original idea is that why can't we replace MASK_STORE with more > > > > > > effective SIMD instructions because icc can do much better in this > > > > > > case. > > > > > > > > > > ICC probably doesn't care for the case where foo[] isn't writable. In > > > > > fact for the case at hand we see it comes from malloc() which we > > > > > can assume to return writable memory I guess. That means if-conversion > > > > > can treat the unconditional read as a way to also allow to speculate > > > > > the write (with -fallow-store-data-races). > > > > > > > > > > Note there's currently no pointer analysis that tracks writability. > > > > > > > > > > > Then I give it up, because the ability to analyze vectorization > > > > > > of gcc is not as good as icc and my ability does not support me > > > > > > modifying this part of the code. > > > > > > > > > > > > Thanks very much for your reply. > > > > > > > > > > You're welcome. > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the elimination of if blocks in GCC during if-conversion and vectorization 2023-10-23 12:56 ` Hanke Zhang @ 2023-10-26 9:18 ` Richard Biener 0 siblings, 0 replies; 10+ messages in thread From: Richard Biener @ 2023-10-26 9:18 UTC (permalink / raw) To: Hanke Zhang; +Cc: gcc On Mon, Oct 23, 2023 at 2:56 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > Richard Biener <richard.guenther@gmail.com> 于2023年10月23日周一 20:32写道: > > > > On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > Hi Richard: > > > > > > Thanks for your advice. But when I try a simpler example like the one > > > below before looking at the code, GCC still does nothing. > > > > > > int main() { > > > int width; > > > scanf("%d", &width); > > > int sum = 0; > > > for (int i = 0; i < width; i++) sum += i; > > > printf("%d\n", sum); > > > } > > > > > > I tried O3 and LTO, but still the same. So I'd like to ask why, or am > > > I doing something wrong? > > > > -fdump-tree-sccp-details-scev reveals > > > > (set_scalar_evolution > > instantiated_below = 5 > > (scalar = sum_9) > > (scalar_evolution = {0, +, {1, +, 1}_1}_1)) > > ) > > (chrec_apply > > (varying_loop = 1) > > (chrec = {0, +, {1, +, 1}_1}_1) > > (x = (unsigned int) width.0_12 + 4294967295) > > (res = scev_not_known)) > > > > so we don't know how to apply a variable number of iterations to > > the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not > > know how to compute the final value of the reduction. > > > > For a constant, say width == 100 we do: > > > > (set_scalar_evolution > > instantiated_below = 2 > > (scalar = sum_6) > > (scalar_evolution = {0, +, {1, +, 1}_1}_1)) > > ) > > (chrec_apply > > (varying_loop = 1) > > (chrec = {0, +, {1, +, 1}_1}_1) > > (x = 99) > > (res = 4950)) > > Yeah, I also found this result in previous experiments. But what > confuses me is that if the 'width' can't be inferred to INTEGER_CST, > there's nothing we can do, right? > > Because in my case, the variables corresponding to 'width' are almost > all undetermined values, such as 'width = rand()'. That said, I can > hardly get any optimizations in my cases, right? I think the result would be (width * (width - 1)) / 2, chrec_apply is where "pattern matching" of known series is done. Richard. > > Thanks > Hanke Zhang > > > > > > Richard. > > > > > > > > Thanks > > > Hanke Zhang > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道: > > > > > > > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > > > > > Hi Richard > > > > > I get it, thank you again. > > > > > > > > > > And I got another problem, so I'd like ask it by the way. Can the left > > > > > shift of the induction variable in a loop be optimized as a constant? > > > > > Like the code below: > > > > > > > > > > int ans = 0; > > > > > int width = rand() % 16; > > > > > for (int j = 0; j < width; j++) > > > > > ans += 1 << (j + width) > > > > > > > > > > into: > > > > > > > > > > int width = rand() % 16; > > > > > ans = (1 << (2 * width) - (1 << width)); > > > > > > > > > > I came across a more complex version of that and found that gcc > > > > > doesn't seem to handle it, so wanted to write a pass myself to > > > > > optimize it. > > > > > > > > > > I got two questions here. Does GCC have such optimizations? If I want > > > > > to do my own optimization, where should I put it? Put it behind the > > > > > pass_iv_optimize? > > > > > > > > GCC has the final value replacement pass (pass_scev_cprop) doing these > > > > kind of transforms. Since 'ans' does not have an affine evolution this > > > > case would need to be pattern matched (there are some existing pattern > > > > matchings in the pass). > > > > > > > > > Thanks > > > > > Hanke Zhang > > > > > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道: > > > > > > > > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote: > > > > > > > > > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道: > > > > > > > > > > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote: > > > > > > > > > > > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small > > > > > > > > > problem and would like to ask for advice. > > > > > > > > > > > > > > > > > > For example, for the following code: > > > > > > > > > > > > > > > > > > int main() { > > > > > > > > > int size = 1000; > > > > > > > > > int *foo = malloc(sizeof(int) * size); > > > > > > > > > int c1 = rand(), t1 = rand(); > > > > > > > > > > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > > > > if (foo[i] & c1) { > > > > > > > > > foo[i] = t1; > > > > > > > > > } > > > > > > > > > } > > > > > > > > > > > > > > > > > > // prevents the loop above from being optimized > > > > > > > > > for (int i = 0; i < size; i++) { > > > > > > > > > printf("%d", foo[i]); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > > > > > > > > > > First of all, the if statement block in the loop will be converted to > > > > > > > > > a MASK_STORE through if-conversion optimization. But after > > > > > > > > > tree-vector, it will still become a branched form. The part of the > > > > > > > > > final disassembly structure probably looks like below(Using IDA to do > > > > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )' > > > > > > > > > in it, which will lead to low efficiency. > > > > > > > > > > > > > > > > > > do > > > > > > > > > { > > > > > > > > > while ( 1 ) > > > > > > > > > { > > > > > > > > > __asm > > > > > > > > > { > > > > > > > > > vpand ymm0, ymm2, ymmword ptr [rax] > > > > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > > > > vpcmpeqd ymm0, ymm0, ymm1 > > > > > > > > > vptest ymm0, ymm0 > > > > > > > > > } > > > > > > > > > if ( !_ZF ) > > > > > > > > > break; > > > > > > > > > _RAX += 8; > > > > > > > > > if ( _RAX == v9 ) > > > > > > > > > goto LABEL_5; > > > > > > > > > } > > > > > > > > > __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 } > > > > > > > > > _RAX += 8; > > > > > > > > > } > > > > > > > > > while ( _RAX != v9 ); > > > > > > > > > > > > > > > > > > Why can't we just replace the vptest and if statement with some other > > > > > > > > > instructions like vpblendvb so that it can be faster? Or is there a > > > > > > > > > good way to do that? > > > > > > > > > > > > > > > > The branch is added by optimize_mask_stores after vectorization because > > > > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on > > > > > > > > some architectures when fault assists (read-only pages, but also COW pages) > > > > > > > > are ran into. All the microcode handling needs to possibly be carried out > > > > > > > > multiple times, for each such access to the same page. That can cause > > > > > > > > a 1000x slowdown when you hit this case. Thus every masked store > > > > > > > > is replaced by > > > > > > > > > > > > > > > > if (mask != 0) > > > > > > > > masked_store (); > > > > > > > > > > > > > > > > and this is an optimization (which itself has a small cost). > > > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > Yeah, I know that and I have seen the code of optimize_mask_store(). > > > > > > > And the main problem here is that when multiple MASK_STORE appear in > > > > > > > the same loop, many branches will appear, resulting in a decrease in > > > > > > > overall efficiency. > > > > > > > > > > > > > > And my original idea is that why can't we replace MASK_STORE with more > > > > > > > effective SIMD instructions because icc can do much better in this > > > > > > > case. > > > > > > > > > > > > ICC probably doesn't care for the case where foo[] isn't writable. In > > > > > > fact for the case at hand we see it comes from malloc() which we > > > > > > can assume to return writable memory I guess. That means if-conversion > > > > > > can treat the unconditional read as a way to also allow to speculate > > > > > > the write (with -fallow-store-data-races). > > > > > > > > > > > > Note there's currently no pointer analysis that tracks writability. > > > > > > > > > > > > > Then I give it up, because the ability to analyze vectorization > > > > > > > of gcc is not as good as icc and my ability does not support me > > > > > > > modifying this part of the code. > > > > > > > > > > > > > > Thanks very much for your reply. > > > > > > > > > > > > You're welcome. > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > Hanke Zhang ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2023-10-26 9:19 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-10-12 12:16 the elimination of if blocks in GCC during if-conversion and vectorization Hanke Zhang 2023-10-17 9:23 ` Richard Biener 2023-10-17 11:54 ` Hanke Zhang 2023-10-17 11:57 ` Richard Biener 2023-10-17 12:39 ` Hanke Zhang 2023-10-19 11:57 ` Richard Biener 2023-10-23 10:50 ` Hanke Zhang 2023-10-23 12:29 ` Richard Biener 2023-10-23 12:56 ` Hanke Zhang 2023-10-26 9:18 ` Richard Biener
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).