public inbox for gcc@gcc.gnu.org
* SLP vectorizer on non-loop?
@ 2011-11-01 10:42 Bingfeng Mei
  2011-11-01 11:13 ` Ira Rosen
  0 siblings, 1 reply; 4+ messages in thread
From: Bingfeng Mei @ 2011-11-01 10:42 UTC (permalink / raw)
  To: gcc

Hello,
I have an example with two very similar loops. The cunrolli pass completely unrolls one loop
but not the other, based on slightly different cost estimates. The loop that is not unrolled
gets SLP-vectorized and is then unrolled by the "cunroll" pass, whereas the loop that was
unrolled early cannot be vectorized because it is no longer a loop. In the end, there is a
big performance difference between the two loops.

My question is why SLP vectorization has to be performed on loops (it is a sub-pass under
pass_tree_loop). Conceptually, can't it be done on any basic block? Our port is still
stuck at 4.5, but I checked 4.7 and it seems to be the same. I also checked the functions in
tree-vect-slp.c. They use loop_vinfo structures a lot, but in some places the code checks
whether loop_vinfo exists and otherwise uses an alternative. I tried adding an extra SLP
pass after pass_tree_loop, but it didn't work. I wonder how easy it would be to make SLP
work for non-loop code.

Thanks,
Bingfeng Mei

Broadcom UK

void foo (int *__restrict__ temp_hist_buffer, 
          int * __restrict__ p_hist_buff, 
          int *__restrict__ p_input)
{
  int i;
  for(i=0;i<4;i++)
     temp_hist_buffer[i]=p_hist_buff[i];

  for(i=0;i<4;i++)
     temp_hist_buffer[i+4]=p_input[i];

}



* Re: SLP vectorizer on non-loop?
  2011-11-01 10:42 SLP vectorizer on non-loop? Bingfeng Mei
@ 2011-11-01 11:13 ` Ira Rosen
  2011-11-01 11:25   ` Bingfeng Mei
  0 siblings, 1 reply; 4+ messages in thread
From: Ira Rosen @ 2011-11-01 11:13 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc



gcc-owner@gcc.gnu.org wrote on 01/11/2011 12:41:32 PM:

> Hello,
> I have an example with two very similar loops. The cunrolli pass
> completely unrolls one loop but not the other, based on slightly
> different cost estimates. The loop that is not unrolled gets
> SLP-vectorized and is then unrolled by the "cunroll" pass, whereas
> the loop that was unrolled early cannot be vectorized because it is
> no longer a loop. In the end, there is a big performance difference
> between the two loops.
>

Here is what I see with the current trunk on x86_64 with -O3 (with the two
loops split into different functions):

The first loop, the one that doesn't get unrolled by cunrolli, gets loop
vectorized with -fno-vect-cost-model. With the cost model enabled, loop
vectorization fails because the number of iterations is not sufficient (the
vectorizer tries to apply loop peeling in order to align the accesses). The
loop is later unrolled by cunroll, and the resulting basic block gets
vectorized by SLP.
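
To make the peeling argument concrete, here is a minimal sketch of the idea in plain C. It is not GCC's implementation: the function `copy_peeled` and its return value are hypothetical, a group of four scalar copies stands in for one vector operation, and a 4-byte `int` with 16-byte vectors is assumed. With only four iterations, any peeled scalar iteration leaves fewer than four elements for the vector body.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration only: copy n ints, peeling scalar iterations until the
   destination is 16-byte aligned, then copying in groups of four.
   Returns how many elements the four-wide body handled.  */
int copy_peeled (int *dst, const int *src, int n)
{
  int i = 0;
  int vec_elems = 0;

  /* Prologue: peeled scalar iterations until dst + i is aligned.  */
  while (i < n && ((uintptr_t) (dst + i) % 16) != 0)
    {
      dst[i] = src[i];
      i++;
    }

  /* Main body: each group of four stands in for one vector copy.  */
  for (; i + 4 <= n; i += 4)
    {
      dst[i] = src[i];
      dst[i + 1] = src[i + 1];
      dst[i + 2] = src[i + 2];
      dst[i + 3] = src[i + 3];
      vec_elems += 4;
    }

  /* Epilogue: leftover scalar iterations.  */
  for (; i < n; i++)
    dst[i] = src[i];

  return vec_elems;
}

/* Returns 1 after checking both the misaligned and the aligned case.  */
int check_peeled (void)
{
  int buf[12] __attribute__ ((aligned (16)));
  int src[4] = { 1, 2, 3, 4 };
  int i;

  assert (sizeof (int) == 4);

  /* buf + 1 is 4 bytes past an aligned address: three iterations are
     peeled and nothing is left for the four-wide body.  */
  assert (copy_peeled (buf + 1, src, 4) == 0);
  for (i = 0; i < 4; i++)
    assert (buf[1 + i] == src[i]);

  /* buf + 4 is aligned: no peeling, one four-wide group.  */
  assert (copy_peeled (buf + 4, src, 4) == 4);
  for (i = 0; i < 4; i++)
    assert (buf[4 + i] == src[i]);

  return 1;
}
```

In the misaligned case the whole copy runs as scalar code, which is roughly why a cost model could judge loop vectorization unprofitable for a trip count of four.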

The second loop, unrolled by cunrolli, also gets vectorized by SLP.
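
For readers following along, the split described above might look roughly like the sketch below. The names `copy_hist`, `copy_input` and `copy_input_unrolled` are hypothetical (the thread does not show the split source), and the hand-unrolled variant only approximates what complete unrolling leaves behind: four independent stores in a single basic block, which is the shape basic-block SLP looks for.

```c
#include <assert.h>

/* Hypothetical split of foo() into one function per loop.  */
void copy_hist (int *__restrict__ temp_hist_buffer,
                int *__restrict__ p_hist_buff)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist_buffer[i] = p_hist_buff[i];
}

void copy_input (int *__restrict__ temp_hist_buffer,
                 int *__restrict__ p_input)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist_buffer[i + 4] = p_input[i];
}

/* Hand-written approximation of copy_input after complete unrolling:
   straight-line code, no loop left.  */
void copy_input_unrolled (int *__restrict__ temp_hist_buffer,
                          int *__restrict__ p_input)
{
  temp_hist_buffer[4] = p_input[0];
  temp_hist_buffer[5] = p_input[1];
  temp_hist_buffer[6] = p_input[2];
  temp_hist_buffer[7] = p_input[3];
}

/* Returns 1 if the loop and unrolled variants fill the buffer alike.  */
int check_split (void)
{
  int hist[4] = { 1, 2, 3, 4 };
  int input[4] = { 5, 6, 7, 8 };
  int a[8] = { 0 };
  int b[8] = { 0 };
  int i;

  copy_hist (a, hist);
  copy_input (a, input);
  copy_hist (b, hist);
  copy_input_unrolled (b, input);

  for (i = 0; i < 8; i++)
    {
      assert (a[i] == i + 1);
      assert (b[i] == i + 1);
    }
  return 1;
}
```

Compiling each variant with -O3 -fdump-tree-optimized and comparing the dumps is one way to check whether both end up as a single vector load and store on a given target.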

The *.optimized dumps look similar:


<bb 2>:
  vect_var_.14_48 = MEM[(int *)p_hist_buff_9(D)];
  MEM[(int *)temp_hist_buffer_5(D)] = vect_var_.14_48;
  return;


<bb 2>:
  vect_var_.7_57 = MEM[(int *)p_input_10(D)];
  MEM[(int *)temp_hist_buffer_6(D) + 16B] = vect_var_.7_57;
  return;
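
Read as C, each dump is one 16-byte vector load followed by one 16-byte vector store; the `+ 16B` in the second dump is the byte offset of temp_hist_buffer[4], assuming a 4-byte int. A rough equivalent using the GCC vector extension is sketched below. The type name `v4si` and the function names are illustrative and not taken from the thread, and memcpy stands in for the unaligned vector accesses.

```c
#include <assert.h>
#include <string.h>

/* Four ints in one 16-byte vector (GCC vector extension).  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Counterpart of the first dump: load four ints, store four ints.  */
void copy_hist_vec (int *temp_hist_buffer, const int *p_hist_buff)
{
  v4si vect_var;
  memcpy (&vect_var, p_hist_buff, sizeof vect_var);
  memcpy (temp_hist_buffer, &vect_var, sizeof vect_var);
}

/* Counterpart of the second dump: the store lands 16 bytes into
   temp_hist_buffer, i.e. at element 4 when int is 4 bytes wide.  */
void copy_input_vec (int *temp_hist_buffer, const int *p_input)
{
  v4si vect_var;
  memcpy (&vect_var, p_input, sizeof vect_var);
  memcpy ((char *) temp_hist_buffer + 16, &vect_var, sizeof vect_var);
}

/* Returns 1 if the two vector copies fill all eight elements.  */
int check_vec (void)
{
  int hist[4] = { 1, 2, 3, 4 };
  int input[4] = { 5, 6, 7, 8 };
  int temp[8] = { 0 };
  int i;

  assert (sizeof (int) == 4);

  copy_hist_vec (temp, hist);
  copy_input_vec (temp, input);

  for (i = 0; i < 8; i++)
    assert (temp[i] == i + 1);
  return 1;
}
```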


> My question is why SLP vectorization has to be performed on loops
> (it is a sub-pass under pass_tree_loop). Conceptually, can't it be
> done on any basic block? Our port is still stuck at 4.5, but I
> checked 4.7 and it seems to be the same. I also checked the
> functions in tree-vect-slp.c. They use loop_vinfo structures a lot,
> but in some places the code checks whether loop_vinfo exists and
> otherwise uses an alternative. I tried adding an extra SLP pass
> after pass_tree_loop, but it didn't work. I wonder how easy it would
> be to make SLP work for non-loop code.

SLP vectorization works both on loops (in vectorize pass) and on basic
blocks (in slp-vectorize pass).

Ira
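
The two cases can be pictured with a small sketch. Both functions below are hypothetical and only show the source shapes involved: a loop whose body contains a group of four isomorphic statements (a candidate for SLP inside the loop vectorizer) and the same group as straight-line code (a candidate for basic-block SLP). Whether a particular GCC version and target actually vectorizes either one is not something this example guarantees.

```c
#include <assert.h>

/* SLP inside a loop: each iteration has four statements of the same
   form on adjacent elements.  */
void scale_groups (int *__restrict__ out, const int *__restrict__ in, int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      out[4 * i] = in[4 * i] * 2;
      out[4 * i + 1] = in[4 * i + 1] * 2;
      out[4 * i + 2] = in[4 * i + 2] * 2;
      out[4 * i + 3] = in[4 * i + 3] * 2;
    }
}

/* Basic-block SLP: the same group of statements with no loop.  */
void scale_block (int *__restrict__ out, const int *__restrict__ in)
{
  out[0] = in[0] * 2;
  out[1] = in[1] * 2;
  out[2] = in[2] * 2;
  out[3] = in[3] * 2;
}

/* Returns 1 if both shapes compute the same doubled values.  */
int check_scale (void)
{
  int in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
  int out_loop[8] = { 0 };
  int out_block[4] = { 0 };
  int i;

  scale_groups (out_loop, in, 2);
  scale_block (out_block, in);

  for (i = 0; i < 8; i++)
    assert (out_loop[i] == 2 * in[i]);
  for (i = 0; i < 4; i++)
    assert (out_block[i] == 2 * in[i]);
  return 1;
}
```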


* RE: SLP vectorizer on non-loop?
  2011-11-01 11:13 ` Ira Rosen
@ 2011-11-01 11:25   ` Bingfeng Mei
  2011-11-01 11:36     ` Ira Rosen
  0 siblings, 1 reply; 4+ messages in thread
From: Bingfeng Mei @ 2011-11-01 11:25 UTC (permalink / raw)
  To: Ira Rosen; +Cc: gcc

Ira,
Thank you very much for the quick answer. I will check 4.7 on x86-64
to see how it differs from our port. Were there significant changes
between 4.5 and 4.7 regarding SLP?

Cheers,
Bingfeng


* RE: SLP vectorizer on non-loop?
  2011-11-01 11:25   ` Bingfeng Mei
@ 2011-11-01 11:36     ` Ira Rosen
  0 siblings, 0 replies; 4+ messages in thread
From: Ira Rosen @ 2011-11-01 11:36 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc



"Bingfeng Mei" <bmei@broadcom.com> wrote on 01/11/2011 01:25:14 PM:

> Ira,
> Thank you very much for the quick answer. I will check 4.7 on x86-64
> to see how it differs from our port. Were there significant changes
> between 4.5 and 4.7 regarding SLP?

Yes, I think so. 4.5 can't SLP data accesses with unknown alignment, which
is what you have here.

Ira
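
The alignment is unknown in foo because it only receives pointers, so the compiler cannot tell whether they point to 16-byte-aligned storage. A minimal sketch of a variant where the alignment is visible, assuming the buffers can be file-scope arrays with an explicit aligned attribute, is shown below. The array and function names are hypothetical, and the thread does not confirm that 4.5 vectorizes this form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical file-scope buffers with alignment stated explicitly.  */
int hist_src[4] __attribute__ ((aligned (16))) = { 1, 2, 3, 4 };
int input_src[4] __attribute__ ((aligned (16))) = { 5, 6, 7, 8 };
int temp_hist[8] __attribute__ ((aligned (16)));

/* Same two copies as foo(), but on objects whose alignment the
   compiler can see at compile time.  */
void foo_aligned (void)
{
  int i;
  for (i = 0; i < 4; i++)
    temp_hist[i] = hist_src[i];

  for (i = 0; i < 4; i++)
    temp_hist[i + 4] = input_src[i];
}

/* Returns 1 if the buffers are aligned and the copy is correct.  */
int check_aligned (void)
{
  int i;

  assert ((uintptr_t) hist_src % 16 == 0);
  assert ((uintptr_t) input_src % 16 == 0);
  assert ((uintptr_t) temp_hist % 16 == 0);

  foo_aligned ();
  for (i = 0; i < 8; i++)
    assert (temp_hist[i] == i + 1);
  return 1;
}
```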
