the elimination of if blocks in GCC during if-conversion and vectorization

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* the elimination of if blocks in GCC during if-conversion and vectorization
@ 2023-10-12 12:16 Hanke Zhang
  2023-10-17  9:23 ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Hanke Zhang @ 2023-10-12 12:16 UTC (permalink / raw)
  To: gcc

Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
problem and would like to ask for advice.

For example, for the following code:

int main() {
  int size = 1000;
  int *foo = malloc(sizeof(int) * size);
  int c1 = rand(), t1 = rand();

  for (int i = 0; i < size; i++) {
    if (foo[i] & c1) {
      foo[i] = t1;
    }
  }

  // prevents the loop above from being optimized
  for (int i = 0; i < size; i++) {
    printf("%d", foo[i]);
  }
}

First of all, the if statement block in the loop will be converted to
a MASK_STORE through if-conversion optimization. But after
tree-vector, it will still become a branched form. The part of the
final disassembly structure probably looks like below(Using IDA to do
this), and you can see that there is still such a branch 'if ( !_ZF )'
in it, which will lead to low efficiency.

do
  {
    while ( 1 )
    {
      __asm
      {
        vpand   ymm0, ymm2, ymmword ptr [rax]
        vpcmpeqd ymm0, ymm0, ymm1
        vpcmpeqd ymm0, ymm0, ymm1
        vptest  ymm0, ymm0
      }
      if ( !_ZF )
        break;
      _RAX += 8;
      if ( _RAX == v9 )
        goto LABEL_5;
    }
    __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
    _RAX += 8;
  }
  while ( _RAX != v9 );

Why can't we just replace the vptest and if statement with some other
instructions like vpblendvb so that it can be faster? Or is there a
good way to do that?

Thanks
Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-12 12:16 the elimination of if blocks in GCC during if-conversion and vectorization Hanke Zhang
@ 2023-10-17  9:23 ` Richard Biener
  2023-10-17 11:54   ` Hanke Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-10-17  9:23 UTC (permalink / raw)
  To: Hanke Zhang; +Cc: gcc

On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> problem and would like to ask for advice.
>
> For example, for the following code:
>
> int main() {
>   int size = 1000;
>   int *foo = malloc(sizeof(int) * size);
>   int c1 = rand(), t1 = rand();
>
>   for (int i = 0; i < size; i++) {
>     if (foo[i] & c1) {
>       foo[i] = t1;
>     }
>   }
>
>   // prevents the loop above from being optimized
>   for (int i = 0; i < size; i++) {
>     printf("%d", foo[i]);
>   }
> }
>
> First of all, the if statement block in the loop will be converted to
> a MASK_STORE through if-conversion optimization. But after
> tree-vector, it will still become a branched form. The part of the
> final disassembly structure probably looks like below(Using IDA to do
> this), and you can see that there is still such a branch 'if ( !_ZF )'
> in it, which will lead to low efficiency.
>
> do
>   {
>     while ( 1 )
>     {
>       __asm
>       {
>         vpand   ymm0, ymm2, ymmword ptr [rax]
>         vpcmpeqd ymm0, ymm0, ymm1
>         vpcmpeqd ymm0, ymm0, ymm1
>         vptest  ymm0, ymm0
>       }
>       if ( !_ZF )
>         break;
>       _RAX += 8;
>       if ( _RAX == v9 )
>         goto LABEL_5;
>     }
>     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
>     _RAX += 8;
>   }
>   while ( _RAX != v9 );
>
> Why can't we just replace the vptest and if statement with some other
> instructions like vpblendvb so that it can be faster? Or is there a
> good way to do that?

The branch is added by optimize_mask_stores after vectorization because
fully masked (disabled) masked stores can incur a quite heavy penalty on
some architectures when fault assists (read-only pages, but also COW pages)
are ran into.  All the microcode handling needs to possibly be carried out
multiple times, for each such access to the same page.  That can cause
a 1000x slowdown when you hit this case.  Thus every masked store
is replaced by

 if (mask != 0)
   masked_store ();

and this is an optimization (which itself has a small cost).

Richard.

>
> Thanks
> Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-17  9:23 ` Richard Biener
@ 2023-10-17 11:54   ` Hanke Zhang
  2023-10-17 11:57     ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Hanke Zhang @ 2023-10-17 11:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
>
> On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > problem and would like to ask for advice.
> >
> > For example, for the following code:
> >
> > int main() {
> >   int size = 1000;
> >   int *foo = malloc(sizeof(int) * size);
> >   int c1 = rand(), t1 = rand();
> >
> >   for (int i = 0; i < size; i++) {
> >     if (foo[i] & c1) {
> >       foo[i] = t1;
> >     }
> >   }
> >
> >   // prevents the loop above from being optimized
> >   for (int i = 0; i < size; i++) {
> >     printf("%d", foo[i]);
> >   }
> > }
> >
> > First of all, the if statement block in the loop will be converted to
> > a MASK_STORE through if-conversion optimization. But after
> > tree-vector, it will still become a branched form. The part of the
> > final disassembly structure probably looks like below(Using IDA to do
> > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > in it, which will lead to low efficiency.
> >
> > do
> >   {
> >     while ( 1 )
> >     {
> >       __asm
> >       {
> >         vpand   ymm0, ymm2, ymmword ptr [rax]
> >         vpcmpeqd ymm0, ymm0, ymm1
> >         vpcmpeqd ymm0, ymm0, ymm1
> >         vptest  ymm0, ymm0
> >       }
> >       if ( !_ZF )
> >         break;
> >       _RAX += 8;
> >       if ( _RAX == v9 )
> >         goto LABEL_5;
> >     }
> >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> >     _RAX += 8;
> >   }
> >   while ( _RAX != v9 );
> >
> > Why can't we just replace the vptest and if statement with some other
> > instructions like vpblendvb so that it can be faster? Or is there a
> > good way to do that?
>
> The branch is added by optimize_mask_stores after vectorization because
> fully masked (disabled) masked stores can incur a quite heavy penalty on
> some architectures when fault assists (read-only pages, but also COW pages)
> are ran into.  All the microcode handling needs to possibly be carried out
> multiple times, for each such access to the same page.  That can cause
> a 1000x slowdown when you hit this case.  Thus every masked store
> is replaced by
>
>  if (mask != 0)
>    masked_store ();
>
> and this is an optimization (which itself has a small cost).
>
> Richard.

Yeah, I know that and I have seen the code of optimize_mask_store().
And the main problem here is that when multiple MASK_STORE appear in
the same loop, many branches will appear, resulting in a decrease in
overall efficiency.

And my original idea is that why can't we replace MASK_STORE with more
effective SIMD instructions because icc can do much better in this
case. Then I give it up, because the ability to analyze vectorization
of gcc is not as good as icc and my ability does not support me
modifying this part of the code.

Thanks very much for your reply.

>
> >
> > Thanks
> > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-17 11:54   ` Hanke Zhang
@ 2023-10-17 11:57     ` Richard Biener
  2023-10-17 12:39       ` Hanke Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-10-17 11:57 UTC (permalink / raw)
  To: Hanke Zhang; +Cc: gcc

On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> >
> > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > problem and would like to ask for advice.
> > >
> > > For example, for the following code:
> > >
> > > int main() {
> > >   int size = 1000;
> > >   int *foo = malloc(sizeof(int) * size);
> > >   int c1 = rand(), t1 = rand();
> > >
> > >   for (int i = 0; i < size; i++) {
> > >     if (foo[i] & c1) {
> > >       foo[i] = t1;
> > >     }
> > >   }
> > >
> > >   // prevents the loop above from being optimized
> > >   for (int i = 0; i < size; i++) {
> > >     printf("%d", foo[i]);
> > >   }
> > > }
> > >
> > > First of all, the if statement block in the loop will be converted to
> > > a MASK_STORE through if-conversion optimization. But after
> > > tree-vector, it will still become a branched form. The part of the
> > > final disassembly structure probably looks like below(Using IDA to do
> > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > in it, which will lead to low efficiency.
> > >
> > > do
> > >   {
> > >     while ( 1 )
> > >     {
> > >       __asm
> > >       {
> > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > >         vpcmpeqd ymm0, ymm0, ymm1
> > >         vpcmpeqd ymm0, ymm0, ymm1
> > >         vptest  ymm0, ymm0
> > >       }
> > >       if ( !_ZF )
> > >         break;
> > >       _RAX += 8;
> > >       if ( _RAX == v9 )
> > >         goto LABEL_5;
> > >     }
> > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > >     _RAX += 8;
> > >   }
> > >   while ( _RAX != v9 );
> > >
> > > Why can't we just replace the vptest and if statement with some other
> > > instructions like vpblendvb so that it can be faster? Or is there a
> > > good way to do that?
> >
> > The branch is added by optimize_mask_stores after vectorization because
> > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > some architectures when fault assists (read-only pages, but also COW pages)
> > are ran into.  All the microcode handling needs to possibly be carried out
> > multiple times, for each such access to the same page.  That can cause
> > a 1000x slowdown when you hit this case.  Thus every masked store
> > is replaced by
> >
> >  if (mask != 0)
> >    masked_store ();
> >
> > and this is an optimization (which itself has a small cost).
> >
> > Richard.
>
> Yeah, I know that and I have seen the code of optimize_mask_store().
> And the main problem here is that when multiple MASK_STORE appear in
> the same loop, many branches will appear, resulting in a decrease in
> overall efficiency.
>
> And my original idea is that why can't we replace MASK_STORE with more
> effective SIMD instructions because icc can do much better in this
> case.

ICC probably doesn't care for the case where foo[] isn't writable.  In
fact for the case at hand we see it comes from malloc() which we
can assume to return writable memory I guess.  That means if-conversion
can treat the unconditional read as a way to also allow to speculate
the write (with -fallow-store-data-races).

Note there's currently no pointer analysis that tracks writability.

> Then I give it up, because the ability to analyze vectorization
> of gcc is not as good as icc and my ability does not support me
> modifying this part of the code.
>
> Thanks very much for your reply.

You're welcome.

Richard.

> >
> > >
> > > Thanks
> > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-17 11:57     ` Richard Biener
@ 2023-10-17 12:39       ` Hanke Zhang
  2023-10-19 11:57         ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Hanke Zhang @ 2023-10-17 12:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

Hi Richard
I get it, thank you again.

And I got another problem, so I'd like ask it by the way. Can the left
shift of the induction variable in a loop be optimized as a constant?
Like the code below:

int ans = 0;
int width = rand() % 16;
for (int j = 0; j < width; j++)
  ans += 1 << (j + width)

into:

int width = rand() % 16;
ans = (1 << (2 * width) - (1 << width));

I came across a more complex version of that and found that gcc
doesn't seem to handle it, so wanted to write a pass myself to
optimize it.

I got two questions here. Does GCC have such optimizations? If I want
to do my own optimization, where should I put it? Put it behind the
pass_iv_optimize?

Thanks
Hanke Zhang

Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道：
>
> On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> >
> > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> > >
> > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > problem and would like to ask for advice.
> > > >
> > > > For example, for the following code:
> > > >
> > > > int main() {
> > > >   int size = 1000;
> > > >   int *foo = malloc(sizeof(int) * size);
> > > >   int c1 = rand(), t1 = rand();
> > > >
> > > >   for (int i = 0; i < size; i++) {
> > > >     if (foo[i] & c1) {
> > > >       foo[i] = t1;
> > > >     }
> > > >   }
> > > >
> > > >   // prevents the loop above from being optimized
> > > >   for (int i = 0; i < size; i++) {
> > > >     printf("%d", foo[i]);
> > > >   }
> > > > }
> > > >
> > > > First of all, the if statement block in the loop will be converted to
> > > > a MASK_STORE through if-conversion optimization. But after
> > > > tree-vector, it will still become a branched form. The part of the
> > > > final disassembly structure probably looks like below(Using IDA to do
> > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > in it, which will lead to low efficiency.
> > > >
> > > > do
> > > >   {
> > > >     while ( 1 )
> > > >     {
> > > >       __asm
> > > >       {
> > > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > >         vptest  ymm0, ymm0
> > > >       }
> > > >       if ( !_ZF )
> > > >         break;
> > > >       _RAX += 8;
> > > >       if ( _RAX == v9 )
> > > >         goto LABEL_5;
> > > >     }
> > > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > >     _RAX += 8;
> > > >   }
> > > >   while ( _RAX != v9 );
> > > >
> > > > Why can't we just replace the vptest and if statement with some other
> > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > good way to do that?
> > >
> > > The branch is added by optimize_mask_stores after vectorization because
> > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > some architectures when fault assists (read-only pages, but also COW pages)
> > > are ran into.  All the microcode handling needs to possibly be carried out
> > > multiple times, for each such access to the same page.  That can cause
> > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > is replaced by
> > >
> > >  if (mask != 0)
> > >    masked_store ();
> > >
> > > and this is an optimization (which itself has a small cost).
> > >
> > > Richard.
> >
> > Yeah, I know that and I have seen the code of optimize_mask_store().
> > And the main problem here is that when multiple MASK_STORE appear in
> > the same loop, many branches will appear, resulting in a decrease in
> > overall efficiency.
> >
> > And my original idea is that why can't we replace MASK_STORE with more
> > effective SIMD instructions because icc can do much better in this
> > case.
>
> ICC probably doesn't care for the case where foo[] isn't writable.  In
> fact for the case at hand we see it comes from malloc() which we
> can assume to return writable memory I guess.  That means if-conversion
> can treat the unconditional read as a way to also allow to speculate
> the write (with -fallow-store-data-races).
>
> Note there's currently no pointer analysis that tracks writability.
>
> > Then I give it up, because the ability to analyze vectorization
> > of gcc is not as good as icc and my ability does not support me
> > modifying this part of the code.
> >
> > Thanks very much for your reply.
>
> You're welcome.
>
> Richard.
>
> > >
> > > >
> > > > Thanks
> > > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-17 12:39       ` Hanke Zhang
@ 2023-10-19 11:57         ` Richard Biener
  2023-10-23 10:50           ` Hanke Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-10-19 11:57 UTC (permalink / raw)
  To: Hanke Zhang; +Cc: gcc

On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> Hi Richard
> I get it, thank you again.
>
> And I got another problem, so I'd like ask it by the way. Can the left
> shift of the induction variable in a loop be optimized as a constant?
> Like the code below:
>
> int ans = 0;
> int width = rand() % 16;
> for (int j = 0; j < width; j++)
>   ans += 1 << (j + width)
>
> into:
>
> int width = rand() % 16;
> ans = (1 << (2 * width) - (1 << width));
>
> I came across a more complex version of that and found that gcc
> doesn't seem to handle it, so wanted to write a pass myself to
> optimize it.
>
> I got two questions here. Does GCC have such optimizations? If I want
> to do my own optimization, where should I put it? Put it behind the
> pass_iv_optimize?

GCC has the final value replacement pass (pass_scev_cprop) doing these
kind of transforms.  Since 'ans' does not have an affine evolution this
case would need to be pattern matched (there are some existing pattern
matchings in the pass).

> Thanks
> Hanke Zhang
>
> Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道：
> >
> > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > >
> > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> > > >
> > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > > >
> > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > > problem and would like to ask for advice.
> > > > >
> > > > > For example, for the following code:
> > > > >
> > > > > int main() {
> > > > >   int size = 1000;
> > > > >   int *foo = malloc(sizeof(int) * size);
> > > > >   int c1 = rand(), t1 = rand();
> > > > >
> > > > >   for (int i = 0; i < size; i++) {
> > > > >     if (foo[i] & c1) {
> > > > >       foo[i] = t1;
> > > > >     }
> > > > >   }
> > > > >
> > > > >   // prevents the loop above from being optimized
> > > > >   for (int i = 0; i < size; i++) {
> > > > >     printf("%d", foo[i]);
> > > > >   }
> > > > > }
> > > > >
> > > > > First of all, the if statement block in the loop will be converted to
> > > > > a MASK_STORE through if-conversion optimization. But after
> > > > > tree-vector, it will still become a branched form. The part of the
> > > > > final disassembly structure probably looks like below(Using IDA to do
> > > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > > in it, which will lead to low efficiency.
> > > > >
> > > > > do
> > > > >   {
> > > > >     while ( 1 )
> > > > >     {
> > > > >       __asm
> > > > >       {
> > > > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > >         vptest  ymm0, ymm0
> > > > >       }
> > > > >       if ( !_ZF )
> > > > >         break;
> > > > >       _RAX += 8;
> > > > >       if ( _RAX == v9 )
> > > > >         goto LABEL_5;
> > > > >     }
> > > > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > >     _RAX += 8;
> > > > >   }
> > > > >   while ( _RAX != v9 );
> > > > >
> > > > > Why can't we just replace the vptest and if statement with some other
> > > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > > good way to do that?
> > > >
> > > > The branch is added by optimize_mask_stores after vectorization because
> > > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > > some architectures when fault assists (read-only pages, but also COW pages)
> > > > are ran into.  All the microcode handling needs to possibly be carried out
> > > > multiple times, for each such access to the same page.  That can cause
> > > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > > is replaced by
> > > >
> > > >  if (mask != 0)
> > > >    masked_store ();
> > > >
> > > > and this is an optimization (which itself has a small cost).
> > > >
> > > > Richard.
> > >
> > > Yeah, I know that and I have seen the code of optimize_mask_store().
> > > And the main problem here is that when multiple MASK_STORE appear in
> > > the same loop, many branches will appear, resulting in a decrease in
> > > overall efficiency.
> > >
> > > And my original idea is that why can't we replace MASK_STORE with more
> > > effective SIMD instructions because icc can do much better in this
> > > case.
> >
> > ICC probably doesn't care for the case where foo[] isn't writable.  In
> > fact for the case at hand we see it comes from malloc() which we
> > can assume to return writable memory I guess.  That means if-conversion
> > can treat the unconditional read as a way to also allow to speculate
> > the write (with -fallow-store-data-races).
> >
> > Note there's currently no pointer analysis that tracks writability.
> >
> > > Then I give it up, because the ability to analyze vectorization
> > > of gcc is not as good as icc and my ability does not support me
> > > modifying this part of the code.
> > >
> > > Thanks very much for your reply.
> >
> > You're welcome.
> >
> > Richard.
> >
> > > >
> > > > >
> > > > > Thanks
> > > > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-19 11:57         ` Richard Biener
@ 2023-10-23 10:50           ` Hanke Zhang
  2023-10-23 12:29             ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Hanke Zhang @ 2023-10-23 10:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

Hi Richard:

Thanks for your advice. But when I try a simpler example like the one
below before looking at the code, GCC still does nothing.

int main() {
    int width;
    scanf("%d", &width);
    int sum = 0;
    for (int i = 0; i < width; i++) sum += i;
    printf("%d\n", sum);
}

I tried O3 and LTO, but still the same. So I'd like to ask why, or am
I doing something wrong?

Thanks
Hanke Zhang

Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道：
>
> On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> >
> > Hi Richard
> > I get it, thank you again.
> >
> > And I got another problem, so I'd like ask it by the way. Can the left
> > shift of the induction variable in a loop be optimized as a constant?
> > Like the code below:
> >
> > int ans = 0;
> > int width = rand() % 16;
> > for (int j = 0; j < width; j++)
> >   ans += 1 << (j + width)
> >
> > into:
> >
> > int width = rand() % 16;
> > ans = (1 << (2 * width) - (1 << width));
> >
> > I came across a more complex version of that and found that gcc
> > doesn't seem to handle it, so wanted to write a pass myself to
> > optimize it.
> >
> > I got two questions here. Does GCC have such optimizations? If I want
> > to do my own optimization, where should I put it? Put it behind the
> > pass_iv_optimize?
>
> GCC has the final value replacement pass (pass_scev_cprop) doing these
> kind of transforms.  Since 'ans' does not have an affine evolution this
> case would need to be pattern matched (there are some existing pattern
> matchings in the pass).
>
> > Thanks
> > Hanke Zhang
> >
> > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道：
> > >
> > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > > >
> > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> > > > >
> > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > > > >
> > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > > > problem and would like to ask for advice.
> > > > > >
> > > > > > For example, for the following code:
> > > > > >
> > > > > > int main() {
> > > > > >   int size = 1000;
> > > > > >   int *foo = malloc(sizeof(int) * size);
> > > > > >   int c1 = rand(), t1 = rand();
> > > > > >
> > > > > >   for (int i = 0; i < size; i++) {
> > > > > >     if (foo[i] & c1) {
> > > > > >       foo[i] = t1;
> > > > > >     }
> > > > > >   }
> > > > > >
> > > > > >   // prevents the loop above from being optimized
> > > > > >   for (int i = 0; i < size; i++) {
> > > > > >     printf("%d", foo[i]);
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > First of all, the if statement block in the loop will be converted to
> > > > > > a MASK_STORE through if-conversion optimization. But after
> > > > > > tree-vector, it will still become a branched form. The part of the
> > > > > > final disassembly structure probably looks like below(Using IDA to do
> > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > > > in it, which will lead to low efficiency.
> > > > > >
> > > > > > do
> > > > > >   {
> > > > > >     while ( 1 )
> > > > > >     {
> > > > > >       __asm
> > > > > >       {
> > > > > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > >         vptest  ymm0, ymm0
> > > > > >       }
> > > > > >       if ( !_ZF )
> > > > > >         break;
> > > > > >       _RAX += 8;
> > > > > >       if ( _RAX == v9 )
> > > > > >         goto LABEL_5;
> > > > > >     }
> > > > > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > > >     _RAX += 8;
> > > > > >   }
> > > > > >   while ( _RAX != v9 );
> > > > > >
> > > > > > Why can't we just replace the vptest and if statement with some other
> > > > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > > > good way to do that?
> > > > >
> > > > > The branch is added by optimize_mask_stores after vectorization because
> > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > > > some architectures when fault assists (read-only pages, but also COW pages)
> > > > > are ran into.  All the microcode handling needs to possibly be carried out
> > > > > multiple times, for each such access to the same page.  That can cause
> > > > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > > > is replaced by
> > > > >
> > > > >  if (mask != 0)
> > > > >    masked_store ();
> > > > >
> > > > > and this is an optimization (which itself has a small cost).
> > > > >
> > > > > Richard.
> > > >
> > > > Yeah, I know that and I have seen the code of optimize_mask_store().
> > > > And the main problem here is that when multiple MASK_STORE appear in
> > > > the same loop, many branches will appear, resulting in a decrease in
> > > > overall efficiency.
> > > >
> > > > And my original idea is that why can't we replace MASK_STORE with more
> > > > effective SIMD instructions because icc can do much better in this
> > > > case.
> > >
> > > ICC probably doesn't care for the case where foo[] isn't writable.  In
> > > fact for the case at hand we see it comes from malloc() which we
> > > can assume to return writable memory I guess.  That means if-conversion
> > > can treat the unconditional read as a way to also allow to speculate
> > > the write (with -fallow-store-data-races).
> > >
> > > Note there's currently no pointer analysis that tracks writability.
> > >
> > > > Then I give it up, because the ability to analyze vectorization
> > > > of gcc is not as good as icc and my ability does not support me
> > > > modifying this part of the code.
> > > >
> > > > Thanks very much for your reply.
> > >
> > > You're welcome.
> > >
> > > Richard.
> > >
> > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-23 10:50           ` Hanke Zhang
@ 2023-10-23 12:29             ` Richard Biener
  2023-10-23 12:56               ` Hanke Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-10-23 12:29 UTC (permalink / raw)
  To: Hanke Zhang; +Cc: gcc

On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> Hi Richard:
>
> Thanks for your advice. But when I try a simpler example like the one
> below before looking at the code, GCC still does nothing.
>
> int main() {
>     int width;
>     scanf("%d", &width);
>     int sum = 0;
>     for (int i = 0; i < width; i++) sum += i;
>     printf("%d\n", sum);
> }
>
> I tried O3 and LTO, but still the same. So I'd like to ask why, or am
> I doing something wrong?

-fdump-tree-sccp-details-scev reveals

(set_scalar_evolution
  instantiated_below = 5
  (scalar = sum_9)
  (scalar_evolution = {0, +, {1, +, 1}_1}_1))
)
(chrec_apply
  (varying_loop = 1)
  (chrec = {0, +, {1, +, 1}_1}_1)
  (x = (unsigned int) width.0_12 + 4294967295)
  (res = scev_not_known))

so we don't know how to apply a variable number of iterations to
the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not
know how to compute the final value of the reduction.

For a constant, say width == 100 we do:

(set_scalar_evolution
  instantiated_below = 2
  (scalar = sum_6)
  (scalar_evolution = {0, +, {1, +, 1}_1}_1))
)
(chrec_apply
  (varying_loop = 1)
  (chrec = {0, +, {1, +, 1}_1}_1)
  (x = 99)
  (res = 4950))

Richard.

>
> Thanks
> Hanke Zhang
>
> Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道：
> >
> > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > >
> > > Hi Richard
> > > I get it, thank you again.
> > >
> > > And I got another problem, so I'd like ask it by the way. Can the left
> > > shift of the induction variable in a loop be optimized as a constant?
> > > Like the code below:
> > >
> > > int ans = 0;
> > > int width = rand() % 16;
> > > for (int j = 0; j < width; j++)
> > >   ans += 1 << (j + width)
> > >
> > > into:
> > >
> > > int width = rand() % 16;
> > > ans = (1 << (2 * width) - (1 << width));
> > >
> > > I came across a more complex version of that and found that gcc
> > > doesn't seem to handle it, so wanted to write a pass myself to
> > > optimize it.
> > >
> > > I got two questions here. Does GCC have such optimizations? If I want
> > > to do my own optimization, where should I put it? Put it behind the
> > > pass_iv_optimize?
> >
> > GCC has the final value replacement pass (pass_scev_cprop) doing these
> > kind of transforms.  Since 'ans' does not have an affine evolution this
> > case would need to be pattern matched (there are some existing pattern
> > matchings in the pass).
> >
> > > Thanks
> > > Hanke Zhang
> > >
> > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道：
> > > >
> > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > > > >
> > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> > > > > >
> > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > > > > problem and would like to ask for advice.
> > > > > > >
> > > > > > > For example, for the following code:
> > > > > > >
> > > > > > > int main() {
> > > > > > >   int size = 1000;
> > > > > > >   int *foo = malloc(sizeof(int) * size);
> > > > > > >   int c1 = rand(), t1 = rand();
> > > > > > >
> > > > > > >   for (int i = 0; i < size; i++) {
> > > > > > >     if (foo[i] & c1) {
> > > > > > >       foo[i] = t1;
> > > > > > >     }
> > > > > > >   }
> > > > > > >
> > > > > > >   // prevents the loop above from being optimized
> > > > > > >   for (int i = 0; i < size; i++) {
> > > > > > >     printf("%d", foo[i]);
> > > > > > >   }
> > > > > > > }
> > > > > > >
> > > > > > > First of all, the if statement block in the loop will be converted to
> > > > > > > a MASK_STORE through if-conversion optimization. But after
> > > > > > > tree-vector, it will still become a branched form. The part of the
> > > > > > > final disassembly structure probably looks like below(Using IDA to do
> > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > > > > in it, which will lead to low efficiency.
> > > > > > >
> > > > > > > do
> > > > > > >   {
> > > > > > >     while ( 1 )
> > > > > > >     {
> > > > > > >       __asm
> > > > > > >       {
> > > > > > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > > >         vptest  ymm0, ymm0
> > > > > > >       }
> > > > > > >       if ( !_ZF )
> > > > > > >         break;
> > > > > > >       _RAX += 8;
> > > > > > >       if ( _RAX == v9 )
> > > > > > >         goto LABEL_5;
> > > > > > >     }
> > > > > > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > > > >     _RAX += 8;
> > > > > > >   }
> > > > > > >   while ( _RAX != v9 );
> > > > > > >
> > > > > > > Why can't we just replace the vptest and if statement with some other
> > > > > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > > > > good way to do that?
> > > > > >
> > > > > > The branch is added by optimize_mask_stores after vectorization because
> > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > > > > some architectures when fault assists (read-only pages, but also COW pages)
> > > > > > are ran into.  All the microcode handling needs to possibly be carried out
> > > > > > multiple times, for each such access to the same page.  That can cause
> > > > > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > > > > is replaced by
> > > > > >
> > > > > >  if (mask != 0)
> > > > > >    masked_store ();
> > > > > >
> > > > > > and this is an optimization (which itself has a small cost).
> > > > > >
> > > > > > Richard.
> > > > >
> > > > > Yeah, I know that and I have seen the code of optimize_mask_store().
> > > > > And the main problem here is that when multiple MASK_STORE appear in
> > > > > the same loop, many branches will appear, resulting in a decrease in
> > > > > overall efficiency.
> > > > >
> > > > > And my original idea is that why can't we replace MASK_STORE with more
> > > > > effective SIMD instructions because icc can do much better in this
> > > > > case.
> > > >
> > > > ICC probably doesn't care for the case where foo[] isn't writable.  In
> > > > fact for the case at hand we see it comes from malloc() which we
> > > > can assume to return writable memory I guess.  That means if-conversion
> > > > can treat the unconditional read as a way to also allow to speculate
> > > > the write (with -fallow-store-data-races).
> > > >
> > > > Note there's currently no pointer analysis that tracks writability.
> > > >
> > > > > Then I give it up, because the ability to analyze vectorization
> > > > > of gcc is not as good as icc and my ability does not support me
> > > > > modifying this part of the code.
> > > > >
> > > > > Thanks very much for your reply.
> > > >
> > > > You're welcome.
> > > >
> > > > Richard.
> > > >
> > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-23 12:29             ` Richard Biener
@ 2023-10-23 12:56               ` Hanke Zhang
  2023-10-26  9:18                 ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Hanke Zhang @ 2023-10-23 12:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc

Richard Biener <richard.guenther@gmail.com> 于2023年10月23日周一 20:32写道：
>
> On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> >
> > Hi Richard:
> >
> > Thanks for your advice. But when I try a simpler example like the one
> > below before looking at the code, GCC still does nothing.
> >
> > int main() {
> >     int width;
> >     scanf("%d", &width);
> >     int sum = 0;
> >     for (int i = 0; i < width; i++) sum += i;
> >     printf("%d\n", sum);
> > }
> >
> > I tried O3 and LTO, but still the same. So I'd like to ask why, or am
> > I doing something wrong?
>
> -fdump-tree-sccp-details-scev reveals
>
> (set_scalar_evolution
>   instantiated_below = 5
>   (scalar = sum_9)
>   (scalar_evolution = {0, +, {1, +, 1}_1}_1))
> )
> (chrec_apply
>   (varying_loop = 1)
>   (chrec = {0, +, {1, +, 1}_1}_1)
>   (x = (unsigned int) width.0_12 + 4294967295)
>   (res = scev_not_known))
>
> so we don't know how to apply a variable number of iterations to
> the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not
> know how to compute the final value of the reduction.
>
> For a constant, say width == 100 we do:
>
> (set_scalar_evolution
>   instantiated_below = 2
>   (scalar = sum_6)
>   (scalar_evolution = {0, +, {1, +, 1}_1}_1))
> )
> (chrec_apply
>   (varying_loop = 1)
>   (chrec = {0, +, {1, +, 1}_1}_1)
>   (x = 99)
>   (res = 4950))

Yeah, I also found this result in previous experiments. But what
confuses me is that if the 'width' can't be inferred to INTEGER_CST,
there's nothing we can do, right?

Because in my case, the variables corresponding to 'width' are almost
all undetermined values, such as 'width = rand()'. That said, I can
hardly get any optimizations in my cases, right?

Thanks
Hanke Zhang


>
> Richard.
>
> >
> > Thanks
> > Hanke Zhang
> >
> > Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道：
> > >
> > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > > >
> > > > Hi Richard
> > > > I get it, thank you again.
> > > >
> > > > And I got another problem, so I'd like ask it by the way. Can the left
> > > > shift of the induction variable in a loop be optimized as a constant?
> > > > Like the code below:
> > > >
> > > > int ans = 0;
> > > > int width = rand() % 16;
> > > > for (int j = 0; j < width; j++)
> > > >   ans += 1 << (j + width)
> > > >
> > > > into:
> > > >
> > > > int width = rand() % 16;
> > > > ans = (1 << (2 * width) - (1 << width));
> > > >
> > > > I came across a more complex version of that and found that gcc
> > > > doesn't seem to handle it, so wanted to write a pass myself to
> > > > optimize it.
> > > >
> > > > I got two questions here. Does GCC have such optimizations? If I want
> > > > to do my own optimization, where should I put it? Put it behind the
> > > > pass_iv_optimize?
> > >
> > > GCC has the final value replacement pass (pass_scev_cprop) doing these
> > > kind of transforms.  Since 'ans' does not have an affine evolution this
> > > case would need to be pattern matched (there are some existing pattern
> > > matchings in the pass).
> > >
> > > > Thanks
> > > > Hanke Zhang
> > > >
> > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道：
> > > > >
> > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > > > > >
> > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> > > > > > >
> > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > > > > > problem and would like to ask for advice.
> > > > > > > >
> > > > > > > > For example, for the following code:
> > > > > > > >
> > > > > > > > int main() {
> > > > > > > >   int size = 1000;
> > > > > > > >   int *foo = malloc(sizeof(int) * size);
> > > > > > > >   int c1 = rand(), t1 = rand();
> > > > > > > >
> > > > > > > >   for (int i = 0; i < size; i++) {
> > > > > > > >     if (foo[i] & c1) {
> > > > > > > >       foo[i] = t1;
> > > > > > > >     }
> > > > > > > >   }
> > > > > > > >
> > > > > > > >   // prevents the loop above from being optimized
> > > > > > > >   for (int i = 0; i < size; i++) {
> > > > > > > >     printf("%d", foo[i]);
> > > > > > > >   }
> > > > > > > > }
> > > > > > > >
> > > > > > > > First of all, the if statement block in the loop will be converted to
> > > > > > > > a MASK_STORE through if-conversion optimization. But after
> > > > > > > > tree-vector, it will still become a branched form. The part of the
> > > > > > > > final disassembly structure probably looks like below(Using IDA to do
> > > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > > > > > in it, which will lead to low efficiency.
> > > > > > > >
> > > > > > > > do
> > > > > > > >   {
> > > > > > > >     while ( 1 )
> > > > > > > >     {
> > > > > > > >       __asm
> > > > > > > >       {
> > > > > > > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > > > >         vptest  ymm0, ymm0
> > > > > > > >       }
> > > > > > > >       if ( !_ZF )
> > > > > > > >         break;
> > > > > > > >       _RAX += 8;
> > > > > > > >       if ( _RAX == v9 )
> > > > > > > >         goto LABEL_5;
> > > > > > > >     }
> > > > > > > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > > > > >     _RAX += 8;
> > > > > > > >   }
> > > > > > > >   while ( _RAX != v9 );
> > > > > > > >
> > > > > > > > Why can't we just replace the vptest and if statement with some other
> > > > > > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > > > > > good way to do that?
> > > > > > >
> > > > > > > The branch is added by optimize_mask_stores after vectorization because
> > > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > > > > > some architectures when fault assists (read-only pages, but also COW pages)
> > > > > > > are ran into.  All the microcode handling needs to possibly be carried out
> > > > > > > multiple times, for each such access to the same page.  That can cause
> > > > > > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > > > > > is replaced by
> > > > > > >
> > > > > > >  if (mask != 0)
> > > > > > >    masked_store ();
> > > > > > >
> > > > > > > and this is an optimization (which itself has a small cost).
> > > > > > >
> > > > > > > Richard.
> > > > > >
> > > > > > Yeah, I know that and I have seen the code of optimize_mask_store().
> > > > > > And the main problem here is that when multiple MASK_STORE appear in
> > > > > > the same loop, many branches will appear, resulting in a decrease in
> > > > > > overall efficiency.
> > > > > >
> > > > > > And my original idea is that why can't we replace MASK_STORE with more
> > > > > > effective SIMD instructions because icc can do much better in this
> > > > > > case.
> > > > >
> > > > > ICC probably doesn't care for the case where foo[] isn't writable.  In
> > > > > fact for the case at hand we see it comes from malloc() which we
> > > > > can assume to return writable memory I guess.  That means if-conversion
> > > > > can treat the unconditional read as a way to also allow to speculate
> > > > > the write (with -fallow-store-data-races).
> > > > >
> > > > > Note there's currently no pointer analysis that tracks writability.
> > > > >
> > > > > > Then I give it up, because the ability to analyze vectorization
> > > > > > of gcc is not as good as icc and my ability does not support me
> > > > > > modifying this part of the code.
> > > > > >
> > > > > > Thanks very much for your reply.
> > > > >
> > > > > You're welcome.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: the elimination of if blocks in GCC during if-conversion and vectorization
  2023-10-23 12:56               ` Hanke Zhang
@ 2023-10-26  9:18                 ` Richard Biener
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Biener @ 2023-10-26  9:18 UTC (permalink / raw)
  To: Hanke Zhang; +Cc: gcc

On Mon, Oct 23, 2023 at 2:56 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> 于2023年10月23日周一 20:32写道：
> >
> > On Mon, Oct 23, 2023 at 12:50 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > >
> > > Hi Richard:
> > >
> > > Thanks for your advice. But when I try a simpler example like the one
> > > below before looking at the code, GCC still does nothing.
> > >
> > > int main() {
> > >     int width;
> > >     scanf("%d", &width);
> > >     int sum = 0;
> > >     for (int i = 0; i < width; i++) sum += i;
> > >     printf("%d\n", sum);
> > > }
> > >
> > > I tried O3 and LTO, but still the same. So I'd like to ask why, or am
> > > I doing something wrong?
> >
> > -fdump-tree-sccp-details-scev reveals
> >
> > (set_scalar_evolution
> >   instantiated_below = 5
> >   (scalar = sum_9)
> >   (scalar_evolution = {0, +, {1, +, 1}_1}_1))
> > )
> > (chrec_apply
> >   (varying_loop = 1)
> >   (chrec = {0, +, {1, +, 1}_1}_1)
> >   (x = (unsigned int) width.0_12 + 4294967295)
> >   (res = scev_not_known))
> >
> > so we don't know how to apply a variable number of iterations to
> > the affine expression {0, +, {1, +, 1}_1}_1, that is, we do not
> > know how to compute the final value of the reduction.
> >
> > For a constant, say width == 100 we do:
> >
> > (set_scalar_evolution
> >   instantiated_below = 2
> >   (scalar = sum_6)
> >   (scalar_evolution = {0, +, {1, +, 1}_1}_1))
> > )
> > (chrec_apply
> >   (varying_loop = 1)
> >   (chrec = {0, +, {1, +, 1}_1}_1)
> >   (x = 99)
> >   (res = 4950))
>
> Yeah, I also found this result in previous experiments. But what
> confuses me is that if the 'width' can't be inferred to INTEGER_CST,
> there's nothing we can do, right?
>
> Because in my case, the variables corresponding to 'width' are almost
> all undetermined values, such as 'width = rand()'. That said, I can
> hardly get any optimizations in my cases, right?

I think the result would be (width * (width - 1)) / 2, chrec_apply is where
"pattern matching" of known series is done.

Richard.

>
> Thanks
> Hanke Zhang
>
>
> >
> > Richard.
> >
> > >
> > > Thanks
> > > Hanke Zhang
> > >
> > > Richard Biener <richard.guenther@gmail.com> 于2023年10月19日周四 20:00写道：
> > > >
> > > > On Tue, Oct 17, 2023 at 2:39 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > > > >
> > > > > Hi Richard
> > > > > I get it, thank you again.
> > > > >
> > > > > And I got another problem, so I'd like ask it by the way. Can the left
> > > > > shift of the induction variable in a loop be optimized as a constant?
> > > > > Like the code below:
> > > > >
> > > > > int ans = 0;
> > > > > int width = rand() % 16;
> > > > > for (int j = 0; j < width; j++)
> > > > >   ans += 1 << (j + width)
> > > > >
> > > > > into:
> > > > >
> > > > > int width = rand() % 16;
> > > > > ans = (1 << (2 * width) - (1 << width));
> > > > >
> > > > > I came across a more complex version of that and found that gcc
> > > > > doesn't seem to handle it, so wanted to write a pass myself to
> > > > > optimize it.
> > > > >
> > > > > I got two questions here. Does GCC have such optimizations? If I want
> > > > > to do my own optimization, where should I put it? Put it behind the
> > > > > pass_iv_optimize?
> > > >
> > > > GCC has the final value replacement pass (pass_scev_cprop) doing these
> > > > kind of transforms.  Since 'ans' does not have an affine evolution this
> > > > case would need to be pattern matched (there are some existing pattern
> > > > matchings in the pass).
> > > >
> > > > > Thanks
> > > > > Hanke Zhang
> > > > >
> > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 20:00写道：
> > > > > >
> > > > > > On Tue, Oct 17, 2023 at 1:54 PM Hanke Zhang <hkzhang455@gmail.com> wrote:
> > > > > > >
> > > > > > > Richard Biener <richard.guenther@gmail.com> 于2023年10月17日周二 17:26写道：
> > > > > > > >
> > > > > > > > On Thu, Oct 12, 2023 at 2:18 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
> > > > > > > > >
> > > > > > > > > Hi, I'm recently working on vectorization of GCC. I'm stuck in a small
> > > > > > > > > problem and would like to ask for advice.
> > > > > > > > >
> > > > > > > > > For example, for the following code:
> > > > > > > > >
> > > > > > > > > int main() {
> > > > > > > > >   int size = 1000;
> > > > > > > > >   int *foo = malloc(sizeof(int) * size);
> > > > > > > > >   int c1 = rand(), t1 = rand();
> > > > > > > > >
> > > > > > > > >   for (int i = 0; i < size; i++) {
> > > > > > > > >     if (foo[i] & c1) {
> > > > > > > > >       foo[i] = t1;
> > > > > > > > >     }
> > > > > > > > >   }
> > > > > > > > >
> > > > > > > > >   // prevents the loop above from being optimized
> > > > > > > > >   for (int i = 0; i < size; i++) {
> > > > > > > > >     printf("%d", foo[i]);
> > > > > > > > >   }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > First of all, the if statement block in the loop will be converted to
> > > > > > > > > a MASK_STORE through if-conversion optimization. But after
> > > > > > > > > tree-vector, it will still become a branched form. The part of the
> > > > > > > > > final disassembly structure probably looks like below(Using IDA to do
> > > > > > > > > this), and you can see that there is still such a branch 'if ( !_ZF )'
> > > > > > > > > in it, which will lead to low efficiency.
> > > > > > > > >
> > > > > > > > > do
> > > > > > > > >   {
> > > > > > > > >     while ( 1 )
> > > > > > > > >     {
> > > > > > > > >       __asm
> > > > > > > > >       {
> > > > > > > > >         vpand   ymm0, ymm2, ymmword ptr [rax]
> > > > > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > > > > >         vpcmpeqd ymm0, ymm0, ymm1
> > > > > > > > >         vptest  ymm0, ymm0
> > > > > > > > >       }
> > > > > > > > >       if ( !_ZF )
> > > > > > > > >         break;
> > > > > > > > >       _RAX += 8;
> > > > > > > > >       if ( _RAX == v9 )
> > > > > > > > >         goto LABEL_5;
> > > > > > > > >     }
> > > > > > > > >     __asm { vpmaskmovd ymmword ptr [rax], ymm0, ymm3 }
> > > > > > > > >     _RAX += 8;
> > > > > > > > >   }
> > > > > > > > >   while ( _RAX != v9 );
> > > > > > > > >
> > > > > > > > > Why can't we just replace the vptest and if statement with some other
> > > > > > > > > instructions like vpblendvb so that it can be faster? Or is there a
> > > > > > > > > good way to do that?
> > > > > > > >
> > > > > > > > The branch is added by optimize_mask_stores after vectorization because
> > > > > > > > fully masked (disabled) masked stores can incur a quite heavy penalty on
> > > > > > > > some architectures when fault assists (read-only pages, but also COW pages)
> > > > > > > > are ran into.  All the microcode handling needs to possibly be carried out
> > > > > > > > multiple times, for each such access to the same page.  That can cause
> > > > > > > > a 1000x slowdown when you hit this case.  Thus every masked store
> > > > > > > > is replaced by
> > > > > > > >
> > > > > > > >  if (mask != 0)
> > > > > > > >    masked_store ();
> > > > > > > >
> > > > > > > > and this is an optimization (which itself has a small cost).
> > > > > > > >
> > > > > > > > Richard.
> > > > > > >
> > > > > > > Yeah, I know that and I have seen the code of optimize_mask_store().
> > > > > > > And the main problem here is that when multiple MASK_STORE appear in
> > > > > > > the same loop, many branches will appear, resulting in a decrease in
> > > > > > > overall efficiency.
> > > > > > >
> > > > > > > And my original idea is that why can't we replace MASK_STORE with more
> > > > > > > effective SIMD instructions because icc can do much better in this
> > > > > > > case.
> > > > > >
> > > > > > ICC probably doesn't care for the case where foo[] isn't writable.  In
> > > > > > fact for the case at hand we see it comes from malloc() which we
> > > > > > can assume to return writable memory I guess.  That means if-conversion
> > > > > > can treat the unconditional read as a way to also allow to speculate
> > > > > > the write (with -fallow-store-data-races).
> > > > > >
> > > > > > Note there's currently no pointer analysis that tracks writability.
> > > > > >
> > > > > > > Then I give it up, because the ability to analyze vectorization
> > > > > > > of gcc is not as good as icc and my ability does not support me
> > > > > > > modifying this part of the code.
> > > > > > >
> > > > > > > Thanks very much for your reply.
> > > > > >
> > > > > > You're welcome.
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Hanke Zhang

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-10-26  9:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-12 12:16 the elimination of if blocks in GCC during if-conversion and vectorization Hanke Zhang
2023-10-17  9:23 ` Richard Biener
2023-10-17 11:54   ` Hanke Zhang
2023-10-17 11:57     ` Richard Biener
2023-10-17 12:39       ` Hanke Zhang
2023-10-19 11:57         ` Richard Biener
2023-10-23 10:50           ` Hanke Zhang
2023-10-23 12:29             ` Richard Biener
2023-10-23 12:56               ` Hanke Zhang
2023-10-26  9:18                 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).