[Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/45021]  New: Redundant prefetches for the vectorized loop
@ 2010-07-21 17:46 changpeng dot fang at amd dot com
  2010-07-21 18:27 ` [Bug tree-optimization/45021] " changpeng dot fang at amd dot com
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: changpeng dot fang at amd dot com @ 2010-07-21 17:46 UTC (permalink / raw)
  To: gcc-bugs

For the following test case, prefetches will be inserted for both the load and
store of a[i] if the loop is vectorized:

float a[1024], b[1024];
void foo(int beta)
{
  int i;
  for(i=0; i<1024; i++)
     a[i] = a[i] + beta * b[i];
}

With gcc -O3 -fprefetch-loop-arrays -march=amdfam10 -S, part of the generated
assembly is:
        movaps  (%rcx), %xmm0
        addl    $4, %edi
        prefetcht0      (%rdx)
        prefetcht0      240(%rcx)
        prefetchw       (%rdx)
        leaq    64(%rax), %rsi
        mulps   %xmm1, %xmm0


If the loop is not vectorized, only a single prefetch is generated for a[i]
(plus one for b[i]):
        addl    $16, %eax
        salq    $2, %rcx
        mulss   %xmm1, %xmm0
        prefetcht0      a+92(%rcx)
        prefetcht0      b+92(%rcx)
        movl    %esi, %ecx


-- 
           Summary: Redundant prefetches for the vectorized loop
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: changpeng dot fang at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45021



* [Bug tree-optimization/45021] Redundant prefetches for the vectorized loop
  2010-07-21 17:46 [Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop changpeng dot fang at amd dot com
@ 2010-07-21 18:27 ` changpeng dot fang at amd dot com
  2010-07-24 20:32 ` [Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too) pinskia at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: changpeng dot fang at amd dot com @ 2010-07-21 18:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from changpeng dot fang at amd dot com  2010-07-21 18:26 -------
The direct cause is that the prefetching pass cannot tell that the vectorized
load and store (of a[i]) use the same base address:
*vect_pa.6_24
*vect_pa.19_37
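
The effect can be modelled with a small sketch. This is not GCC's code: the
struct mem_ref type and the count_groups function are hypothetical, and the
model assumes that references are grouped purely by the name of their base
pointer and their step. Under that assumption, a load and a store that reach
the same array through two differently named pointers end up in two groups,
and each group gets its own prefetch.

```c
#include <string.h>

/* Hypothetical, simplified view of a memory reference: a symbolic base
   pointer plus the number of bytes it advances per iteration.  */
struct mem_ref {
  const char *base;   /* name of the base pointer */
  int step;           /* bytes advanced per loop iteration */
  int is_write;       /* nonzero for a store */
};

/* Count the groups formed when references are grouped by (base name, step).
   Two references belong to the same group only if their base names match
   textually, which is the assumption this sketch is meant to illustrate.  */
int count_groups(const struct mem_ref *refs, int n)
{
  int groups = 0;
  for (int i = 0; i < n; i++) {
    int seen = 0;
    for (int j = 0; j < i; j++)
      if (strcmp(refs[i].base, refs[j].base) == 0
          && refs[i].step == refs[j].step)
        seen = 1;
    if (!seen)
      groups++;
  }
  return groups;
}
```

With the scalar references (a, b, a) this yields two groups; with the
vectorized ones (vect_pa.6_24, vect_pb, vect_pa.19_37) it yields three, which
matches the extra prefetch seen in the assembly.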



* [Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)
  2010-07-21 17:46 [Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop changpeng dot fang at amd dot com
  2010-07-21 18:27 ` [Bug tree-optimization/45021] " changpeng dot fang at amd dot com
@ 2010-07-24 20:32 ` pinskia at gcc dot gnu dot org
  2010-07-24 20:42 ` rakdver at kam dot mff dot cuni dot cz
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2010-07-24 20:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from pinskia at gcc dot gnu dot org  2010-07-24 20:32 -------
(In reply to comment #1)
> The direct reason is that prefetching could not differentiate the base
> addresses of the vectorized load and store (of a[i]):
> *vect_pa.6_24
> *vect_pa.19_37

Here is a testcase which shows the same issue without the vectorizer (compile
-O2 -fprefetch-loop-arrays -march=amdfam10 -fno-tree-ccp -fno-tree-vrp
-fno-tree-dominator-opts):
float *f();
float aa[1024];
float bb[1024];
void foo(int beta)
{
  int i;
  float *a = aa, *a1 = aa, *b = bb;
  for(i=0; i<1024; i++)
    {
      *a = *a1 + beta * *b;
      a++; a1++; b++;
    }
}


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2010-07-24 20:32:24
               date|                            |
            Summary|Redundant prefetches for the|Redundant prefetches for
                   |vectorized loop             |some loops (vectorizer
                   |                            |produced ones too)



* [Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)
  2010-07-21 17:46 [Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop changpeng dot fang at amd dot com
  2010-07-21 18:27 ` [Bug tree-optimization/45021] " changpeng dot fang at amd dot com
  2010-07-24 20:32 ` [Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too) pinskia at gcc dot gnu dot org
@ 2010-07-24 20:42 ` rakdver at kam dot mff dot cuni dot cz
  2010-07-28 18:23 ` changpeng dot fang at amd dot com
  2010-07-28 18:28 ` changpeng dot fang at amd dot com
  4 siblings, 0 replies; 6+ messages in thread
From: rakdver at kam dot mff dot cuni dot cz @ 2010-07-24 20:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rakdver at kam dot mff dot cuni dot cz  2010-07-24 20:41 -------
Subject: Re:  Redundant prefetches for some
        loops (vectorizer produced ones too)

> > The direct reason is that prefetching could not differentiate the base
> > addresses of the vectorized load and store (of a[i]):
> > *vect_pa.6_24
> > *vect_pa.19_37
> 
> Here is a testcase which shows the same issue without the vectorizer (compile
> -O2 -fprefetch-loop-arrays -march=amdfam10 -fno-tree-ccp -fno-tree-vrp
> -fno-tree-dominator-opts):
> float *f();
> float aa[1024];
> float bb[1024];
> void foo(int beta)
> {
>   int i;
>   float *a = aa, *a1 = aa, *b = bb;
>   for(i=0; i<1024; i++)
> {
>      *a = *a1 + beta * *b;
> a++; a1++; b++;
> }
> }

I am not sure that this issue should be addressed in the prefetching pass; as
this example shows, we already have three other passes that deal with it
ordinarily. Perhaps adjusting the vectorizer code generation or scheduling
copy propagation after the vectorizer would be better.
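
As a rough illustration of what copy propagation would buy here, the sketch
below follows chains of plain copies back to their root definition. It is a
toy model, not GCC's implementation: struct copy, resolve and same_root are
hypothetical names, and it only covers straight-line copies of the initial
values, not the PHI nodes inside the loop.

```c
#include <string.h>

/* One recorded copy "dst = src".  Hypothetical type for this sketch.  */
struct copy {
  const char *dst;
  const char *src;
};

/* Follow "dst = src" copies starting at NAME until reaching a name that is
   not the destination of any recorded copy.  The hop bound guards against
   accidental cycles in the table.  */
const char *resolve(const struct copy *copies, int n, const char *name)
{
  for (int hops = 0; hops <= n; hops++) {
    int found = 0;
    for (int i = 0; i < n; i++)
      if (strcmp(copies[i].dst, name) == 0) {
        name = copies[i].src;
        found = 1;
        break;
      }
    if (!found)
      break;
  }
  return name;
}

/* Two names denote the same value if their copy chains end at the same root.  */
int same_root(const struct copy *copies, int n, const char *x, const char *y)
{
  return strcmp(resolve(copies, n, x), resolve(copies, n, y)) == 0;
}
```

Once two pointers are known to start from the same root and advance by the
same step, a later pass can treat the load and the store as one reference.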



* [Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)
  2010-07-21 17:46 [Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop changpeng dot fang at amd dot com
                   ` (2 preceding siblings ...)
  2010-07-24 20:42 ` rakdver at kam dot mff dot cuni dot cz
@ 2010-07-28 18:23 ` changpeng dot fang at amd dot com
  2010-07-28 18:28 ` changpeng dot fang at amd dot com
  4 siblings, 0 replies; 6+ messages in thread
From: changpeng dot fang at amd dot com @ 2010-07-28 18:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from changpeng dot fang at amd dot com  2010-07-28 18:22 -------
Andrew's example is exactly what the prefetch pass sees for the test case in
the bug description. Unfortunately, the prefetch pass cannot recognize that
vect_pa.6_24 and vect_pa.20_38 are exactly the same address:

<bb 2>:
  pretmp.2_18 = (float) beta_4(D);
  vect_pa.9_22 = (vector(4) float *) &a;
  vect_pa.6_23 = vect_pa.9_22;
  vect_cst_.12_27 = {pretmp.2_18, pretmp.2_18, pretmp.2_18, pretmp.2_18};
  vect_pb.16_29 = (vector(4) float *) &b;
  vect_pb.13_30 = vect_pb.16_29;
  vect_pa.23_36 = (vector(4) float *) &a;
  vect_pa.20_37 = vect_pa.23_36;

<bb 3>:
  # vect_pa.6_24 = PHI <vect_pa.6_25(4), vect_pa.6_23(2)>
  # vect_pb.13_31 = PHI <vect_pb.13_32(4), vect_pb.13_30(2)>
  # vect_pa.20_38 = PHI <vect_pa.20_39(4), vect_pa.20_37(2)>
  # ivtmp.24_40 = PHI <ivtmp.24_41(4), 0(2)>
  vect_var_.10_26 = *vect_pa.6_24;
  vect_var_.11_28 = vect_cst_.12_27;
  vect_var_.17_33 = *vect_pb.13_31;
  vect_var_.18_34 = vect_var_.11_28 * vect_var_.17_33;
  vect_var_.19_35 = vect_var_.10_26 + vect_var_.18_34;
  *vect_pa.20_38 = vect_var_.19_35;
  vect_pa.6_25 = vect_pa.6_24 + 16;
  vect_pb.13_32 = vect_pb.13_31 + 16;
  vect_pa.20_39 = vect_pa.20_38 + 16;
  ivtmp.24_41 = ivtmp.24_40 + 1;
  if (ivtmp.24_41 < 256)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  goto <bb 3>;
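
For readers less familiar with GIMPLE, the dump above corresponds roughly to
the following C. This is only a scalar rendering under assumptions: the name
foo_two_pointers and the inner lane loop are illustrative, with each outer
iteration standing in for one 16-byte vector operation. pa_load and pa_store
play the roles of vect_pa.6_24 and vect_pa.20_38, two separate induction
variables that always hold the same address.

```c
#define N 1024

float a[N], b[N];

/* Scalar rendering of the vectorized loop: N / 4 = 256 iterations, each
   handling 4 floats (16 bytes), mirroring the "+ 16" pointer increments
   and the "< 256" exit test in the dump.  */
void foo_two_pointers(int beta)
{
  float fbeta = (float) beta;               /* pretmp.2_18 */
  float *pa_load = a;                       /* vect_pa.6_24 */
  float *pb = b;                            /* vect_pb.13_31 */
  float *pa_store = a;                      /* vect_pa.20_38 */

  for (int iv = 0; iv < N / 4; iv++) {
    for (int lane = 0; lane < 4; lane++)
      pa_store[lane] = pa_load[lane] + fbeta * pb[lane];
    pa_load += 4;
    pb += 4;
    pa_store += 4;
  }
}
```

Nothing in the loop body ties pa_load to pa_store, so a pass that looks only
at the loop sees three independent pointer streams rather than two.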



* [Bug tree-optimization/45021] Redundant prefetches for some loops (vectorizer produced ones too)
  2010-07-21 17:46 [Bug tree-optimization/45021] New: Redundant prefetches for the vectorized loop changpeng dot fang at amd dot com
                   ` (3 preceding siblings ...)
  2010-07-28 18:23 ` changpeng dot fang at amd dot com
@ 2010-07-28 18:28 ` changpeng dot fang at amd dot com
  4 siblings, 0 replies; 6+ messages in thread
From: changpeng dot fang at amd dot com @ 2010-07-28 18:28 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from changpeng dot fang at amd dot com  2010-07-28 18:28 -------
Things get a little more complicated if we change the code to:

a[i] = a[i+1] + beta * b[i];

The prefetch pass wants to group a[i] and a[i+1], i.e. they have
the same base address and differ by an offset of 4 bytes.
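
A minimal sketch of why grouping matters in this case, under simplifying
assumptions: CACHE_LINE is assumed to be 64 bytes, prefetches_needed is a
hypothetical helper, and alignment is ignored (a real pass would also have to
consider that two nearby addresses can straddle a line boundary). Given the
byte offsets of the references in one group, it counts how many prefetches
are needed if one prefetch also covers any reference less than a cache line
ahead of it.

```c
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* qsort comparator for long values, ascending.  */
static int cmp_long(const void *x, const void *y)
{
  long l = *(const long *) x, r = *(const long *) y;
  return (l > r) - (l < r);
}

/* DELTAS holds the byte offsets, relative to a shared base address, of the
   references in one group.  Sorts DELTAS in place and returns the number
   of prefetches needed when each prefetch covers every reference less than
   CACHE_LINE bytes past the one it was issued for.  */
int prefetches_needed(long *deltas, int n)
{
  if (n == 0)
    return 0;
  qsort(deltas, n, sizeof *deltas, cmp_long);

  int count = 1;
  long covered_from = deltas[0];
  for (int i = 1; i < n; i++)
    if (deltas[i] - covered_from >= CACHE_LINE) {
      count++;
      covered_from = deltas[i];
    }
  return count;
}
```

For a[i] and a[i+1] the offsets are 0 and 4, so a single prefetch is enough
in this model, provided both references are recognized as sharing one base.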

