Duplicating loops and virtual phis

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Duplicating loops and virtual phis
@ 2017-05-12 20:42 Steve Ellcey
  2017-05-13  6:18 ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Steve Ellcey @ 2017-05-12 20:42 UTC (permalink / raw)
  To: gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 2148 bytes --]

(Short version of this email, is there a way to recalculate/rebuild virtual
phi nodes after modifying the CFG.)

I have a question about duplicating loops and virtual phi nodes.
I am trying to implement the following optimization as a pass:

Transform:

   for (i = 0; i < n; i++) {
	A[i] = A[i] + B[i];
	C[i] = C[i-1] + D[i];
   }

Into:

   if (noalias between A&B, A&C, A&D)
	for (i = 0; i < 100; i++)
		A[i] = A[i] + B[i];
	for (i = 0; i < 100; i++)
		C[i] = C[i-1] + D[i];
   else
	for (i = 0; i < 100; i++) {
		A[i] = A[i] + B[i];
		C[i] = C[i-1] + D[i];
	}

Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
vectorized so it gives up and does not vectorize the loop.  If we split
up the loop into two loops then the vector add with A[i] could be vectorized
even if the one with C[i] could not.

Currently I can introduce the first 'if' that checks for aliasing by
using loop_version() and that seems to work OK.  (My actual compare
for aliasing is actually just an approximation for now.)

Where I am running into problems is with splitting up the single loop
under the noalias if condition into two sequential loops (which I then
intend to 'thin out' by removing one or the other set of instructions.
I am using slpeel_tree_duplicate_loop_to_edge_cfg() for that loop duplication
and while I get the CFG I want, the pass ends with verify_ssa failing due
to bad virtual/MEM PHI nodes.  Perhaps there is a different function that
I should use duplicate the loop.

a.c: In function â€˜fooâ€™:
a.c:2:5: error: PHI node with wrong VUSE on edge from BB 13
 int foo(int *a, int *b, int *c, int *d, int n)
     ^~~
.MEM_40 = PHI <.MEM_15(D)(13), .MEM_34(9)>
expected .MEM_58
a.c:2:5: internal compiler error: verify_ssa failed

I have tried to fix up the PHI node by hand using SET_PHI_ARG_DEF but
have not had any luck.  I was wondering if there was any kind of
'update all the phi nodes' function or just a 'update the virtual phi
nodes' function.  The non-virtual PHI nodes seem to be OK, it is just
the virtual ones that seem wrong after I duplicate the loop into two
consecutive loops.

Steve Ellcey
sellcey@cavium.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-12 20:42 Duplicating loops and virtual phis Steve Ellcey
@ 2017-05-13  6:18 ` Richard Biener
  2017-05-15 16:56   ` Steve Ellcey
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2017-05-13  6:18 UTC (permalink / raw)
  To: sellcey, Steve Ellcey, gcc

On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>(Short version of this email, is there a way to recalculate/rebuild
>virtual
>phi nodes after modifying the CFG.)
>
>I have a question about duplicating loops and virtual phi nodes.
>I am trying to implement the following optimization as a pass:
>
>Transform:
>
>   for (i = 0; i < n; i++) {
>	A[i] = A[i] + B[i];
>	C[i] = C[i-1] + D[i];
>   }
>
>Into:
>
>   if (noalias between A&B, A&C, A&D)
>	for (i = 0; i < 100; i++)
>		A[i] = A[i] + B[i];
>	for (i = 0; i < 100; i++)
>		C[i] = C[i-1] + D[i];
>   else
>	for (i = 0; i < 100; i++) {
>		A[i] = A[i] + B[i];
>		C[i] = C[i-1] + D[i];
>	}
>
>Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
>vectorized so it gives up and does not vectorize the loop.  If we split
>up the loop into two loops then the vector add with A[i] could be
>vectorized
>even if the one with C[i] could not.

Loop distribution does this transform but it doesn't know about versioning for unknown dependences.

>Currently I can introduce the first 'if' that checks for aliasing by
>using loop_version() and that seems to work OK.  (My actual compare
>for aliasing is actually just an approximation for now.)
>
>Where I am running into problems is with splitting up the single loop
>under the noalias if condition into two sequential loops (which I then
>intend to 'thin out' by removing one or the other set of instructions.
>I am using slpeel_tree_duplicate_loop_to_edge_cfg() for that loop
>duplication
>and while I get the CFG I want, the pass ends with verify_ssa failing
>due
>to bad virtual/MEM PHI nodes.  Perhaps there is a different function
>that
>I should use duplicate the loop.

>a.c: In function ���foo���:
>a.c:2:5: error: PHI node with wrong VUSE on edge from BB 13
> int foo(int *a, int *b, int *c, int *d, int n)
>     ^~~
>.MEM_40 = PHI <.MEM_15(D)(13), .MEM_34(9)>
>expected .MEM_58
>a.c:2:5: internal compiler error: verify_ssa failed
>
>I have tried to fix up the PHI node by hand using SET_PHI_ARG_DEF but
>have not had any luck.  I was wondering if there was any kind of
>'update all the phi nodes' function or just a 'update the virtual phi
>nodes' function.  The non-virtual PHI nodes seem to be OK, it is just
>the virtual ones that seem wrong after I duplicate the loop into two
>consecutive loops.
>
>Steve Ellcey
>sellcey@cavium.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-13  6:18 ` Richard Biener
@ 2017-05-15 16:56   ` Steve Ellcey
  2017-05-15 18:32     ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Steve Ellcey @ 2017-05-15 16:56 UTC (permalink / raw)
  To: Richard Biener, gcc

On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.c
> om> wrote:
> > 
> > (Short version of this email, is there a way to recalculate/rebuild
> > virtual
> > phi nodes after modifying the CFG.)
> > 
> > I have a question about duplicating loops and virtual phi nodes.
> > I am trying to implement the following optimization as a pass:
> > 
> > Transform:
> > 
> > Â  for (i = 0; i < n; i++) {
> > 	A[i] = A[i] + B[i];
> > 	C[i] = C[i-1] + D[i];
> > Â  }
> > 
> > Into:
> > 
> > Â  if (noalias between A&B, A&C, A&D)
> > 	for (i = 0; i < 100; i++)
> > 		A[i] = A[i] + B[i];
> > 	for (i = 0; i < 100; i++)
> > 		C[i] = C[i-1] + D[i];
> > Â  else
> > 	for (i = 0; i < 100; i++) {
> > 		A[i] = A[i] + B[i];
> > 		C[i] = C[i-1] + D[i];
> > 	}
> > 
> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot be
> > vectorized so it gives up and does not vectorize the loop.Â Â If we split
> > up the loop into two loops then the vector add with A[i] could be
> > vectorized
> > even if the one with C[i] could not.
> Loop distribution does this transform but it doesn't know about
> versioning for unknown dependences.
> 

Yes, I looked at loop distribution. Â But it only works with global
arrays and not with pointer arguments where it doesn't know the size of
the array being pointed at. Â I would like to be able to have it work
with pointer arguments. Â If I call a function with 2 or
more integer pointers, and I have a loop that accesses them with
offsets between 0 and N where N is loop invariant then I should have
enough information (at runtime) to determine if there are overlapping
memory accesses through the pointers and determine whether or not I can
distribute the loop.

The loop splitting code seemed like a better template since it already
knows how to split a loop based on a runtime determined condition. That
part seems to be working for me, it is when I try to
distribute/duplicate one of those loops (under the unaliased condition)
that I am running into the problem with virtual PHIs.

Steve Ellcey
sellcey@cavium.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-15 16:56   ` Steve Ellcey
@ 2017-05-15 18:32     ` Richard Biener
  2017-05-17  9:41       ` Bin.Cheng
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2017-05-15 18:32 UTC (permalink / raw)
  To: sellcey, Steve Ellcey, gcc

On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.c
>> om> wrote:
>> > 
>> > (Short version of this email, is there a way to recalculate/rebuild
>> > virtual
>> > phi nodes after modifying the CFG.)
>> > 
>> > I have a question about duplicating loops and virtual phi nodes.
>> > I am trying to implement the following optimization as a pass:
>> > 
>> > Transform:
>> > 
>> >   for (i = 0; i < n; i++) {
>> > 	A[i] = A[i] + B[i];
>> > 	C[i] = C[i-1] + D[i];
>> >   }
>> > 
>> > Into:
>> > 
>> >   if (noalias between A&B, A&C, A&D)
>> > 	for (i = 0; i < 100; i++)
>> > 		A[i] = A[i] + B[i];
>> > 	for (i = 0; i < 100; i++)
>> > 		C[i] = C[i-1] + D[i];
>> >   else
>> > 	for (i = 0; i < 100; i++) {
>> > 		A[i] = A[i] + B[i];
>> > 		C[i] = C[i-1] + D[i];
>> > 	}
>> > 
>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>be
>> > vectorized so it gives up and does not vectorize the loop.  If we
>split
>> > up the loop into two loops then the vector add with A[i] could be
>> > vectorized
>> > even if the one with C[i] could not.
>> Loop distribution does this transform but it doesn't know about
>> versioning for unknown dependences.
>> 
>
>Yes, I looked at loop distribution.  But it only works with global
>arrays and not with pointer arguments where it doesn't know the size of
>the array being pointed at.  I would like to be able to have it work
>with pointer arguments.  If I call a function with 2 or
>more integer pointers, and I have a loop that accesses them with
>offsets between 0 and N where N is loop invariant then I should have
>enough information (at runtime) to determine if there are overlapping
>memory accesses through the pointers and determine whether or not I can
>distribute the loop.

Not sure where you got that from. Loop distribution works with our data reference / dependence analysis.  The cost model might be more restricted but that can be fixed.

>The loop splitting code seemed like a better template since it already
>knows how to split a loop based on a runtime determined condition. That
>part seems to be working for me, it is when I try to
>distribute/duplicate one of those loops (under the unaliased condition)
>that I am running into the problem with virtual PHIs.

There's mark_virtual*for_renaming (sp?).

But as said you are performing loop distribution so please enhance the existing pass rather than writing a new one.

Richard.

>Steve Ellcey
>sellcey@cavium.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-15 18:32     ` Richard Biener
@ 2017-05-17  9:41       ` Bin.Cheng
  2017-05-17 15:58         ` Steve Ellcey
  2017-05-22  0:42         ` Kugan Vivekanandarajah
  0 siblings, 2 replies; 8+ messages in thread
From: Bin.Cheng @ 2017-05-17  9:41 UTC (permalink / raw)
  To: Richard Biener; +Cc: Steve Ellcey, gcc

On Mon, May 15, 2017 at 7:32 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.c
>>> om> wrote:
>>> >
>>> > (Short version of this email, is there a way to recalculate/rebuild
>>> > virtual
>>> > phi nodes after modifying the CFG.)
>>> >
>>> > I have a question about duplicating loops and virtual phi nodes.
>>> > I am trying to implement the following optimization as a pass:
>>> >
>>> > Transform:
>>> >
>>> >   for (i = 0; i < n; i++) {
>>> >    A[i] = A[i] + B[i];
>>> >    C[i] = C[i-1] + D[i];
>>> >   }
>>> >
>>> > Into:
>>> >
>>> >   if (noalias between A&B, A&C, A&D)
>>> >    for (i = 0; i < 100; i++)
>>> >            A[i] = A[i] + B[i];
>>> >    for (i = 0; i < 100; i++)
>>> >            C[i] = C[i-1] + D[i];
>>> >   else
>>> >    for (i = 0; i < 100; i++) {
>>> >            A[i] = A[i] + B[i];
>>> >            C[i] = C[i-1] + D[i];
>>> >    }
>>> >
>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>be
>>> > vectorized so it gives up and does not vectorize the loop.  If we
>>split
>>> > up the loop into two loops then the vector add with A[i] could be
>>> > vectorized
>>> > even if the one with C[i] could not.
>>> Loop distribution does this transform but it doesn't know about
>>> versioning for unknown dependences.
>>>
>>
>>Yes, I looked at loop distribution.  But it only works with global
>>arrays and not with pointer arguments where it doesn't know the size of
>>the array being pointed at.  I would like to be able to have it work
>>with pointer arguments.  If I call a function with 2 or
>>more integer pointers, and I have a loop that accesses them with
>>offsets between 0 and N where N is loop invariant then I should have
>>enough information (at runtime) to determine if there are overlapping
>>memory accesses through the pointers and determine whether or not I can
>>distribute the loop.
>
> Not sure where you got that from. Loop distribution works with our data reference / dependence analysis.  The cost model might be more restricted but that can be fixed.
>
>>The loop splitting code seemed like a better template since it already
>>knows how to split a loop based on a runtime determined condition. That
>>part seems to be working for me, it is when I try to
>>distribute/duplicate one of those loops (under the unaliased condition)
>>that I am running into the problem with virtual PHIs.
>
> There's mark_virtual*for_renaming (sp?).
>
> But as said you are performing loop distribution so please enhance the existing pass rather than writing a new one.
I happen to be working on loop distribution now (If guess correctly,
to get hmmer fixed).  So far my idea is to fuse the finest distributed
loop in two passes, in the first pass, we merge all SCCs due to "true"
data dependence; in the second one we identify all SCCs and breaks
them on dependent edges due to possible alias.  Breaking SCCs with
minimal edge set can be modeled as Feedback arc set problem which is
NP-hard. Fortunately the problem is small in our case and there are
approximation algorithms.  OTOH, we should also improve loop
distribution/fusion to maximize parallelism / minimize
synchronization, as well as maximize data locality, but I think this
is not needed to get hmmer vectorized.

Thanks,
bin
>
> Richard.
>
>>Steve Ellcey
>>sellcey@cavium.com
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-17  9:41       ` Bin.Cheng
@ 2017-05-17 15:58         ` Steve Ellcey
  2017-05-22  0:42         ` Kugan Vivekanandarajah
  1 sibling, 0 replies; 8+ messages in thread
From: Steve Ellcey @ 2017-05-17 15:58 UTC (permalink / raw)
  To: Bin.Cheng, Richard Biener; +Cc: gcc

On Wed, 2017-05-17 at 10:41 +0100, Bin.Cheng wrote:

> I happen to be working on loop distribution now (If guess correctly,
> to get hmmer fixed).Â Â So far my idea is to fuse the finest
> distributed
> loop in two passes, in the first pass, we merge all SCCs due to
> "true"
> data dependence; in the second one we identify all SCCs and breaks
> them on dependent edges due to possible alias.Â Â Breaking SCCs with
> minimal edge set can be modeled as Feedback arc set problem which is
> NP-hard. Fortunately the problem is small in our case and there are
> approximation algorithms.Â Â OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

Vectorizing hmmer is what I am interested in so I am glad to hear you
are looking into that. Â You are obviously more knowledgable about the
GCC loop infrastructure then I am so I look forward to what you come up
with.

Steve Ellcey
sellcey@cavium.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-17  9:41       ` Bin.Cheng
  2017-05-17 15:58         ` Steve Ellcey
@ 2017-05-22  0:42         ` Kugan Vivekanandarajah
  2017-05-22  8:31           ` Bin.Cheng
  1 sibling, 1 reply; 8+ messages in thread
From: Kugan Vivekanandarajah @ 2017-05-22  0:42 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Richard Biener, Steve Ellcey, gcc

Hi Bin and Steve,

On 17 May 2017 at 19:41, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.c
>>>> om> wrote:
>>>> >
>>>> > (Short version of this email, is there a way to recalculate/rebuild
>>>> > virtual
>>>> > phi nodes after modifying the CFG.)
>>>> >
>>>> > I have a question about duplicating loops and virtual phi nodes.
>>>> > I am trying to implement the following optimization as a pass:
>>>> >
>>>> > Transform:
>>>> >
>>>> >   for (i = 0; i < n; i++) {
>>>> >    A[i] = A[i] + B[i];
>>>> >    C[i] = C[i-1] + D[i];
>>>> >   }
>>>> >
>>>> > Into:
>>>> >
>>>> >   if (noalias between A&B, A&C, A&D)
>>>> >    for (i = 0; i < 100; i++)
>>>> >            A[i] = A[i] + B[i];
>>>> >    for (i = 0; i < 100; i++)
>>>> >            C[i] = C[i-1] + D[i];
>>>> >   else
>>>> >    for (i = 0; i < 100; i++) {
>>>> >            A[i] = A[i] + B[i];
>>>> >            C[i] = C[i-1] + D[i];
>>>> >    }
>>>> >
>>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>>be
>>>> > vectorized so it gives up and does not vectorize the loop.  If we
>>>split
>>>> > up the loop into two loops then the vector add with A[i] could be
>>>> > vectorized
>>>> > even if the one with C[i] could not.
>>>> Loop distribution does this transform but it doesn't know about
>>>> versioning for unknown dependences.
>>>>
>>>
>>>Yes, I looked at loop distribution.  But it only works with global
>>>arrays and not with pointer arguments where it doesn't know the size of
>>>the array being pointed at.  I would like to be able to have it work
>>>with pointer arguments.  If I call a function with 2 or
>>>more integer pointers, and I have a loop that accesses them with
>>>offsets between 0 and N where N is loop invariant then I should have
>>>enough information (at runtime) to determine if there are overlapping
>>>memory accesses through the pointers and determine whether or not I can
>>>distribute the loop.
>>
>> Not sure where you got that from. Loop distribution works with our data reference / dependence analysis.  The cost model might be more restricted but that can be fixed.
>>
>>>The loop splitting code seemed like a better template since it already
>>>knows how to split a loop based on a runtime determined condition. That
>>>part seems to be working for me, it is when I try to
>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>that I am running into the problem with virtual PHIs.
>>
>> There's mark_virtual*for_renaming (sp?).
>>
>> But as said you are performing loop distribution so please enhance the existing pass rather than writing a new one.
> I happen to be working on loop distribution now (If guess correctly,
> to get hmmer fixed).  So far my idea is to fuse the finest distributed
> loop in two passes, in the first pass, we merge all SCCs due to "true"
> data dependence; in the second one we identify all SCCs and breaks
> them on dependent edges due to possible alias.  Breaking SCCs with
> minimal edge set can be modeled as Feedback arc set problem which is
> NP-hard. Fortunately the problem is small in our case and there are
> approximation algorithms.  OTOH, we should also improve loop
> distribution/fusion to maximize parallelism / minimize
> synchronization, as well as maximize data locality, but I think this
> is not needed to get hmmer vectorized.

I am also looking into vectoring homer loop. Glad to know you are also
looking at this problem and looking forward to seeing the patches.

I have some experimental patches where I added the data reference that
needs runtime checking to a list

static int
pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
              vec<data_reference_p> drs1,
-             vec<data_reference_p> drs2)
+             vec<data_reference_p> drs2,
+             vec<ddr_p> &ddrs,
+             bool runtime_alias_check)

Then I am vesioning the main loop based on the condition generated
from the runtime check. I have borrowed the logic from vectorizer
(like pruning the list and generating the condition). I have neither
verified nor benchmarked it enough yet.

As I understand, we also should  have some form of cost model where we
should be able too see the data access patterns and decide if the
distributed loops can be vectorized?

Cost model in  similar_memory_accesses also need to be relaxd based on
the ability to vectorize distributed loops.

Thanks,
Kugan


> Thanks,
> bin
>>
>> Richard.
>>
>>>Steve Ellcey
>>>sellcey@cavium.com
>>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Duplicating loops and virtual phis
  2017-05-22  0:42         ` Kugan Vivekanandarajah
@ 2017-05-22  8:31           ` Bin.Cheng
  0 siblings, 0 replies; 8+ messages in thread
From: Bin.Cheng @ 2017-05-22  8:31 UTC (permalink / raw)
  To: Kugan Vivekanandarajah; +Cc: Richard Biener, Steve Ellcey, gcc

On Mon, May 22, 2017 at 1:42 AM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Hi Bin and Steve,
>
> On 17 May 2017 at 19:41, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Mon, May 15, 2017 at 7:32 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On May 15, 2017 6:56:53 PM GMT+02:00, Steve Ellcey <sellcey@cavium.com> wrote:
>>>>On Sat, 2017-05-13 at 08:18 +0200, Richard Biener wrote:
>>>>> On May 12, 2017 10:42:34 PM GMT+02:00, Steve Ellcey <sellcey@cavium.c
>>>>> om> wrote:
>>>>> >
>>>>> > (Short version of this email, is there a way to recalculate/rebuild
>>>>> > virtual
>>>>> > phi nodes after modifying the CFG.)
>>>>> >
>>>>> > I have a question about duplicating loops and virtual phi nodes.
>>>>> > I am trying to implement the following optimization as a pass:
>>>>> >
>>>>> > Transform:
>>>>> >
>>>>> >   for (i = 0; i < n; i++) {
>>>>> >    A[i] = A[i] + B[i];
>>>>> >    C[i] = C[i-1] + D[i];
>>>>> >   }
>>>>> >
>>>>> > Into:
>>>>> >
>>>>> >   if (noalias between A&B, A&C, A&D)
>>>>> >    for (i = 0; i < 100; i++)
>>>>> >            A[i] = A[i] + B[i];
>>>>> >    for (i = 0; i < 100; i++)
>>>>> >            C[i] = C[i-1] + D[i];
>>>>> >   else
>>>>> >    for (i = 0; i < 100; i++) {
>>>>> >            A[i] = A[i] + B[i];
>>>>> >            C[i] = C[i-1] + D[i];
>>>>> >    }
>>>>> >
>>>>> > Right now the vectorizer sees that 'C[i] = C[i-1] + D[i];' cannot
>>>>be
>>>>> > vectorized so it gives up and does not vectorize the loop.  If we
>>>>split
>>>>> > up the loop into two loops then the vector add with A[i] could be
>>>>> > vectorized
>>>>> > even if the one with C[i] could not.
>>>>> Loop distribution does this transform but it doesn't know about
>>>>> versioning for unknown dependences.
>>>>>
>>>>
>>>>Yes, I looked at loop distribution.  But it only works with global
>>>>arrays and not with pointer arguments where it doesn't know the size of
>>>>the array being pointed at.  I would like to be able to have it work
>>>>with pointer arguments.  If I call a function with 2 or
>>>>more integer pointers, and I have a loop that accesses them with
>>>>offsets between 0 and N where N is loop invariant then I should have
>>>>enough information (at runtime) to determine if there are overlapping
>>>>memory accesses through the pointers and determine whether or not I can
>>>>distribute the loop.
>>>
>>> Not sure where you got that from. Loop distribution works with our data reference / dependence analysis.  The cost model might be more restricted but that can be fixed.
>>>
>>>>The loop splitting code seemed like a better template since it already
>>>>knows how to split a loop based on a runtime determined condition. That
>>>>part seems to be working for me, it is when I try to
>>>>distribute/duplicate one of those loops (under the unaliased condition)
>>>>that I am running into the problem with virtual PHIs.
>>>
>>> There's mark_virtual*for_renaming (sp?).
>>>
>>> But as said you are performing loop distribution so please enhance the existing pass rather than writing a new one.
>> I happen to be working on loop distribution now (If guess correctly,
>> to get hmmer fixed).  So far my idea is to fuse the finest distributed
>> loop in two passes, in the first pass, we merge all SCCs due to "true"
>> data dependence; in the second one we identify all SCCs and breaks
>> them on dependent edges due to possible alias.  Breaking SCCs with
>> minimal edge set can be modeled as Feedback arc set problem which is
>> NP-hard. Fortunately the problem is small in our case and there are
>> approximation algorithms.  OTOH, we should also improve loop
>> distribution/fusion to maximize parallelism / minimize
>> synchronization, as well as maximize data locality, but I think this
>> is not needed to get hmmer vectorized.
>
> I am also looking into vectoring homer loop. Glad to know you are also
> looking at this problem and looking forward to seeing the patches.
>
> I have some experimental patches where I added the data reference that
> needs runtime checking to a list
>
> static int
> pg_add_dependence_edges (struct graph *rdg, vec<loop_p> loops, int dir,
>               vec<data_reference_p> drs1,
> -             vec<data_reference_p> drs2)
> +             vec<data_reference_p> drs2,
> +             vec<ddr_p> &ddrs,
> +             bool runtime_alias_check)
>
> Then I am vesioning the main loop based on the condition generated
> from the runtime check. I have borrowed the logic from vectorizer
> (like pruning the list and generating the condition). I have neither
Yes, I am doing experiments factor runtime alias check out from
vectorizer as general interfaces.  Also I need to fix negative step
bug in current runtime alias check.

> verified nor benchmarked it enough yet.
>
> As I understand, we also should  have some form of cost model where we
> should be able too see the data access patterns and decide if the
> distributed loops can be vectorized?
As the start, the idea is to version loop as gcc does in ifcvt, we can
revert the versioning later if distributed loop can't be vectorized
(or optimized in other significant way).  In case of hmmer, it's also
important to eliminate number of dead stores under condition of
no-alias.

>
> Cost model in  similar_memory_accesses also need to be relaxd based on
> the ability to vectorize distributed loops.
Yes, better cost model for fusion is needed in general.

Thanks,
bin
>
> Thanks,
> Kugan
>
>
>> Thanks,
>> bin
>>>
>>> Richard.
>>>
>>>>Steve Ellcey
>>>>sellcey@cavium.com
>>>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-05-22  8:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-12 20:42 Duplicating loops and virtual phis Steve Ellcey
2017-05-13  6:18 ` Richard Biener
2017-05-15 16:56   ` Steve Ellcey
2017-05-15 18:32     ` Richard Biener
2017-05-17  9:41       ` Bin.Cheng
2017-05-17 15:58         ` Steve Ellcey
2017-05-22  0:42         ` Kugan Vivekanandarajah
2017-05-22  8:31           ` Bin.Cheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).