public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/60086] New: suboptimal asm generated for a loop (store/load false aliasing)
@ 2014-02-05 22:41 marcin.krotkiewski at gmail dot com
  2014-02-06  8:28 ` [Bug rtl-optimization/60086] " jakub at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: marcin.krotkiewski at gmail dot com @ 2014-02-05 22:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086

            Bug ID: 60086
           Summary: suboptimal asm generated for a loop (store/load false
                    aliasing)
           Product: gcc
           Version: 4.7.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com

Created attachment 32060
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32060&action=edit
source code that compiles

Hello,

I am seeing suboptimal performance of the following loop compiled with
gcc 4.7.3 (but also with 4.4.7; on Ubuntu; full test code is attached):

    for(i=0; i<NSIZE; i++){
      a[i] += b[i];
      c[i] += d[i];
    }

The arrays are dynamically allocated and aligned to a page boundary,
declared with __restrict__ and __attribute__((aligned(32))). I am
running on an Intel i7-2620M (Sandy Bridge).

The problem is IMHO related to '4k aliasing'. It happens in the most
common case of a/b/c/d starting at a page boundary (e.g., the natural
result of malloc). To demonstrate, here is the assembly generated with
'gcc -mtune=native -mavx -O3':

.L8:
        vmovapd (%rdx,%rdi), %ymm0        #1 load b
        addq    $1, %r8                #2
        vaddpd  (%rcx,%rdi), %ymm0, %ymm0    #3 load a and add
        vmovapd %ymm0, (%rdx,%rdi)        #4 store a
        vmovapd (%rax,%rdi), %ymm0        #5 load d
        vaddpd  (%rsi,%rdi), %ymm0, %ymm0    #6 load c and add
        vmovapd %ymm0, (%rax,%rdi)        #7 store c
        addq    $32, %rdi            #8
        cmpq    %r8, %r12            #9
        ja      .L8                #10

The 4k aliasing problem is caused by lines 4 and 5 (writing result to
array a and reading data from either c or d). From my tests this seems
to be the default behavior for both AVX and SSE2 instruction sets, and
for both vectorized and non-vectorized cases.
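To make the trigger concrete, here is a small stand-alone sketch (my own illustration, not part of the attached test code) of why page-aligned allocations collide: all arrays start at the same offset within their page, so a store to a[i] and a nearby load from c[i] or d[i] agree in the low 12 address bits, which is what the memory-disambiguation hardware compares first.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

/* Offset of a pointer within its 4 KiB page: the only address bits the
   store/load disambiguation hardware compares at first. */
uintptr_t page_offset(const void *p)
{
    return (uintptr_t)p & 0xfff;
}

/* Returns 1 when two page-aligned allocations of the given size land at
   the same in-page offset, i.e. when every a[i]/c[i] pair collides;
   returns -1 on allocation failure. */
int offsets_collide(size_t bytes)
{
    void *a = NULL, *c = NULL;
    if (posix_memalign(&a, 4096, bytes) != 0)
        return -1;
    if (posix_memalign(&c, 4096, bytes) != 0) {
        free(a);
        return -1;
    }
    int same = page_offset(a) == page_offset(c);
    free(a);
    free(c);
    return same;
}
```

With both arrays page-aligned the in-page offsets are always equal (both 0), so for every i the store to a[i] is shortly followed by a load whose low 12 address bits match it.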

It is easy to fix the problem by placing the two writes together, at the
end of the iteration, e.g.:

.L8:
        vmovapd (%rdx,%rdi), %ymm1        #1
        addq    $1, %r8                #2
        vaddpd  (%rcx,%rdi), %ymm1, %ymm1    #3
        vmovapd (%rax,%rdi), %ymm0        #4
        vaddpd  (%rsi,%rdi), %ymm0, %ymm0    #5
        vmovapd %ymm1, (%rdx,%rdi)        #6
        vmovapd %ymm0, (%rax,%rdi)        #7
        addq    $32, %rdi            #8
        cmpq    %r8, %r12            #9
        ja      .L8                #10

In this case the writes happen after all the loads. The above code is
(almost) what ICC generates for this case. For problem sizes small
enough to fit in the L1 cache, the speedup is roughly 50%.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: jakub at gcc dot gnu.org @ 2014-02-06  8:28 UTC
  To: gcc-bugs

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-02-06
                 CC|                            |abel at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC right now only handles __restrict on function parameters, so in this case
the aliasing info isn't known.  While the loop is versioned for aliasing at
runtime, that information is only available inside the vectorizer, so e.g. the
scheduler can hardly know it.  Pointers to overaligned memory are something
you should generally avoid; __builtin_assume_aligned is what can be used to
tell the compiler about the alignment instead, and overaligned types often
actually hurt generated code rather than improving it.  And the way you are
calling posix_memalign is IMHO a strict aliasing violation.

Perhaps GCC could handle posix_memalign specially as a builtin if declared with
the right prototype (and optionally some new attribute) and derive both the
aliasing and alignment info from it.  Taking the address of the pointer in the
call isn't really an escape site of any kind; all the call does is return two
values instead of one.  So it could be folded into passing the address of some
temporary to the call, then loading from the temporary and using some special
pass-through builtin that would tell GCC that the pointer is really malloc-like
(not aliasing anything else) and also use __builtin_assume_aligned.  The GNU
memalign is far better than posix_memalign from this POV.
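In the same spirit, a present-day source-level alternative (my own sketch, not something proposed in this report) is C11 aligned_alloc, which, like GNU memalign, returns the pointer directly instead of storing it through an output argument, so no address-taking escape is involved:

```c
#include <stdlib.h>

/* Allocate n doubles at a 4 KiB boundary.  aligned_alloc requires the
   size to be a multiple of the alignment, so round it up. */
double *alloc_doubles(size_t n)
{
    size_t bytes = (n * sizeof(double) + 4095) & ~(size_t)4095;
    return aligned_alloc(4096, bytes);
}
```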

Anyway, if I rewrite your testcase as:
#include <stdlib.h>
#include <stdio.h>

__attribute__((noinline)) void
foo (double *__restrict__ a, double *__restrict__ b, double *__restrict__ c,
double *__restrict__ d, unsigned long NSIZE)
{
  unsigned long i, j;
  a = __builtin_assume_aligned (a, 32);
  b = __builtin_assume_aligned (b, 32);
  c = __builtin_assume_aligned (c, 32);
  d = __builtin_assume_aligned (d, 32);
  // initialize memory
  for(i=0; i<NSIZE; i++){
    a[i] = 0;
    b[i] = 0;
    c[i] = 0;
    d[i] = 0;
  }

  // outer loop - repeat short tests
  for(j=0; j<10000; j++){

    // inner loop - do the work
    for(i=0; i<NSIZE; i++){
      a[i] += b[i];
      c[i] += d[i];
    }

    // dummy - prevent loop interchange
    if(a[NSIZE/2]<0) printf("%lf\n", a[NSIZE/2]);
  }
}

int main(int argc, char*argv[])
{
  unsigned long NSIZE = atol(argv[1]);
  void *a, *b, *c, *d;

  // allocate starting from page boundary
  posix_memalign(&a, 4096, sizeof(double)*(NSIZE));
  posix_memalign(&b, 4096, sizeof(double)*(NSIZE));
  posix_memalign(&c, 4096, sizeof(double)*(NSIZE));
  posix_memalign(&d, 4096, sizeof(double)*(NSIZE));
  foo ((double *) a, (double *) b, (double *) c, (double *) d, NSIZE);
  return 0;
}

we don't do versioning for alias and also (as before) assume sufficient
alignment, but still the scheduler doesn't reorder the loads vs. the store
unless given -O3 -mavx -fschedule-insns.  The reason the second scheduler
doesn't reorder those is that the RA allocates the same register.  With -O3
-mavx -fselective-scheduling2 the stores are also reordered, but we end up
with this weird sequence:
.L9:
        movq    -136(%rbp), %rdx
        vmovapd (%r9,%rax), %ymm0
        addq    $1, %rdi
        vmovapd (%r10,%rax), %ymm8
        vaddpd  (%rdx,%rax), %ymm0, %ymm0
        movq    -144(%rbp), %rdx
        vaddpd  (%rdx,%rax), %ymm8, %ymm9
        vmovapd %ymm0, (%r9,%rax)
        vmovapd %ymm8, %ymm0
        vmovapd %ymm9, %ymm0
        vmovapd %ymm0, (%r10,%rax)
        addq    $32, %rax
        cmpq    %rdi, -152(%rbp)
        ja      .L9
Why the vmovapd %ymm8, %ymm0 is there is a mystery, and the vmovapd %ymm9,
%ymm0 could very well be merged with the store into vmovapd %ymm9, (%r10,%rax).


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: marcin.krotkiewski at gmail dot com @ 2014-02-06  9:34 UTC
  To: gcc-bugs

--- Comment #2 from Marcin Krotkiewski <marcin.krotkiewski at gmail dot com> ---
Jakub, thank you for your comments.

> GCC right now only handles __restrict on function parameters, so in this
> case the aliasing info isn't known.  While the loop is versioned for
> aliasing at runtime, the info about that is only known during the
> vectorizer, therefore e.g. scheduler can hardly know it. 

Does this mean that __restrict is not necessary in order to have a vectorized
code path? I see that if I compile your modified test.c, the loop is vectorized
regardless of whether I use __restrict or not (runtime versioning). On the
other hand, using __restrict causes gcc to invoke memset for the
initialization, while leaving it out results in two paths with a loop.

On the interesting side: your test.c indeed works if compiled with the
additional -fschedule-insns flag. However, if I now remove the __restrict
keyword from the function arguments, I still see a vectorized path, but the
flag has no effect and the instructions are again not reordered.

> The pointers to
> overaligned memory is something you should generally avoid,
> __builtin_assume_aligned is what can be used to tell the compiler about the
> alignment instead, overaligned types often actually hurt generated code
> instead of improving it.  

Thanks. Could you suggest the preferred way to use it in a portable manner,
e.g. so that it also suits icc, which has an __assume_aligned builtin?
Should I wrap it in a macro?

> And the way you are calling posix_memalign is IMHO
> a strict aliasing violation.

Could be; gcc does not show a warning with -Wall. Thanks for pointing it out.
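For reference, the usual way to avoid such a violation (my own sketch, not code from the attachment) is to give posix_memalign a genuine void * object and convert the value afterwards, instead of casting a double ** to void **:

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

double *alloc_aligned_doubles(size_t n)
{
    void *tmp = NULL;   /* a real void * object, so no type punning */
    if (posix_memalign(&tmp, 4096, n * sizeof(double)) != 0)
        return NULL;
    return (double *)tmp;   /* value conversion is well defined */
}
```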


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: mpolacek at gcc dot gnu.org @ 2014-02-06 10:10 UTC
  To: gcc-bugs

Marek Polacek <mpolacek at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mpolacek at gcc dot gnu.org

--- Comment #3 from Marek Polacek <mpolacek at gcc dot gnu.org> ---
(In reply to Marcin Krotkiewski from comment #2)
> Thanks. Could you suggest what is the preferred way to use it in a portable
> manner? e.g. make it suitable for icc, which has a __assume_aligned builtin?
> Should I wrap it in a macro?

I think you could wrap it in a macro that is defined based on:
#ifdef __INTEL_COMPILER
# define ...
#elif defined __GNUC__
# define ...
#endif
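Spelled out, one possible shape of such a macro (a sketch only; ASSUME_ALIGNED and scale_inplace are made-up names, and it assumes icc's statement-style __assume_aligned and GCC's __builtin_assume_aligned are the two cases to cover):

```c
#if defined(__INTEL_COMPILER)
/* icc: statement-style hint */
# define ASSUME_ALIGNED(p, a) __assume_aligned((p), (a))
#elif defined(__GNUC__)
/* gcc/clang: the builtin returns the annotated pointer, so feed it back */
# define ASSUME_ALIGNED(p, a) ((p) = __builtin_assume_aligned((p), (a)))
#else
/* unknown compiler: no-op */
# define ASSUME_ALIGNED(p, a) ((void)0)
#endif

/* Example use: promise 32-byte alignment before a vectorizable loop. */
void scale_inplace(double *a, unsigned long n)
{
    ASSUME_ALIGNED(a, 32);
    for (unsigned long i = 0; i < n; i++)
        a[i] *= 2.0;
}
```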


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: rguenth at gcc dot gnu.org @ 2014-02-06 10:22 UTC
  To: gcc-bugs

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
As for posix_memalign, the issue is not so much alias analysis (we could
handle it, but we don't have a builtin right now) as alignment analysis,
which doesn't implement alignment tracking of pointers stored in memory.  We
could "lower"

  posix_memalign (&ptr, align, size);

to

  posix_memalign (&ptr, align, size);
  ptr = __builtin_assume_aligned (ptr, align);

and hope for FRE to fix things up enough to make that useful.
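Until such a lowering exists in the compiler, its effect can be approximated by hand in the source (a GCC-specific sketch; alloc_page_aligned is a made-up helper name):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

double *alloc_page_aligned(size_t n)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, 4096, n * sizeof(double)) != 0)
        return NULL;
    /* the manual counterpart of the proposed lowering: tell GCC what
       posix_memalign already guaranteed */
    return __builtin_assume_aligned(ptr, 4096);
}
```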

I have opened PR60092.


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: abel at gcc dot gnu.org @ 2014-02-07  8:52 UTC
  To: gcc-bugs

--- Comment #5 from Andrey Belevantsev <abel at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> ...
> doesn't reorder those is that RA allocates the same register.  With -O3
> -mavx -fselective-scheduling2 the stores are also changed, but we end up
> with a weird:
> .L9:
>         movq    -136(%rbp), %rdx
>         vmovapd (%r9,%rax), %ymm0
>         addq    $1, %rdi
>         vmovapd (%r10,%rax), %ymm8
>         vaddpd  (%rdx,%rax), %ymm0, %ymm0
>         movq    -144(%rbp), %rdx
>         vaddpd  (%rdx,%rax), %ymm8, %ymm9
>         vmovapd %ymm0, (%r9,%rax)
>         vmovapd %ymm8, %ymm0
>         vmovapd %ymm9, %ymm0
>         vmovapd %ymm0, (%r10,%rax)
>         addq    $32, %rax
>         cmpq    %rdi, -152(%rbp)
>         ja      .L9
> Why there is the vmovapd %ymm8, %ymm0 is a mystery, and vmovapd %ymm9, %ymm0
> could be very well merged with the store into vmovapd %ymm9, (%r10,%rax).

That's because we do a renaming and a substitution.  We have (in the middle of
scheduling, just scheduled insn 78):

  262: dx:DI=[bp:DI-0x88]
   72: xmm0:V4DF=[r9:DI+ax:DI]
   78: {di:DI=di:DI+0x1;clobber flags:CC;}   <--- we are here
   73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
   74: [r9:DI+ax:DI]=xmm0:V4DF
   75: xmm0:V4DF=[r10:DI+ax:DI]
  263: dx:DI=[bp:DI-0x90]
   76: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
   77: [r10:DI+ax:DI]=xmm0:V4DF

Now we want to schedule insn 75 but xmm0 is busy in 74 and 73, so we rename it
to xmm8 and have:

  262: dx:DI=[bp:DI-0x88]
   72: xmm0:V4DF=[r9:DI+ax:DI]
   78: {di:DI=di:DI+0x1;clobber flags:CC;}
  459: xmm8:V4DF=[r10:DI+ax:DI]              <--- we are here
   73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
   74: [r9:DI+ax:DI]=xmm0:V4DF
  461: xmm0:V4DF=xmm8:V4DF                   <--- copy after renaming 
  263: dx:DI=[bp:DI-0x90]
   76: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
   77: [r10:DI+ax:DI]=xmm0:V4DF

Then after scheduling insns 73 and 263 we have

  262: dx:DI=[bp:DI-0x88]
   72: xmm0:V4DF=[r9:DI+ax:DI]
   78: {di:DI=di:DI+0x1;clobber flags:CC;}
  459: xmm8:V4DF=[r10:DI+ax:DI]
   73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
  263: dx:DI=[bp:DI-0x90]                   <--- we are here
   74: [r9:DI+ax:DI]=xmm0:V4DF
  461: xmm0:V4DF=xmm8:V4DF
   76: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
   77: [r10:DI+ax:DI]=xmm0:V4DF

and now we want to schedule insn 76.  We substitute its rhs through a copy 461
but then xmm0 is again busy so we rename the target register to xmm9 and get

  262: dx:DI=[bp:DI-0x88]
   72: xmm0:V4DF=[r9:DI+ax:DI]
   78: {di:DI=di:DI+0x1;clobber flags:CC;}
  459: xmm8:V4DF=[r10:DI+ax:DI]
   73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
  263: dx:DI=[bp:DI-0x90]
  464: xmm9:V4DF=xmm8:V4DF+[dx:DI+ax:DI]    <--- new renamed insn
   74: [r9:DI+ax:DI]=xmm0:V4DF
  461: xmm0:V4DF=xmm8:V4DF
  466: xmm0:V4DF=xmm9:V4DF                  <--- copy after renaming
   77: [r10:DI+ax:DI]=xmm0:V4DF


At this point insn 461 is dead, but we do not notice that, and noticing it
doesn't look easy.  I think there was some suggestion in the original research
about killing the dead insn copies left after renaming, but I don't remember
it offhand.


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: abel at gcc dot gnu.org @ 2014-02-07  8:53 UTC
  To: gcc-bugs

--- Comment #6 from Andrey Belevantsev <abel at gcc dot gnu.org> ---
Forgot to mention that we end up scheduling this block in 21 cycles, while the
regular scheduling needs 24.  Not that it's so important, though.


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: amonakov at gcc dot gnu.org @ 2014-02-07 14:33 UTC
  To: gcc-bugs

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> alignment, but still the scheduler doesn't reorder the loads vs. the store,
> unless -O3 -mavx -fschedule-insns.  The reason why the second scheduler
> doesn't reorder those is that RA allocates the same register

I think you usually want -fschedule-insns (pre-regalloc scheduling) or
-frename-registers rather than -fselective-scheduling2 when the goal is to
work around RA conservativeness.  Unfortunately, stack accesses in the loop
prevent sched2 from using the additional freedom supplied by regrename for AVX
code in this case (when tuning is enabled).  The stack accesses seem to be a
trunk regression, judging by the good code supplied in the opening comment.

(-O3 -mavx -fschedule-insns or -frename-registers, same modulo ymm* names; the
%rbp-based accesses in the loop are pretty bad, but otherwise it's scheduled as
desired)
.L9:    
        movq    -136(%rbp), %rdx
        vmovapd (%r9,%rax), %ymm1
        addq    $1, %rdi
        vmovapd (%r10,%rax), %ymm0
        vaddpd  (%rdx,%rax), %ymm1, %ymm1
        movq    -144(%rbp), %rdx
        vaddpd  (%rdx,%rax), %ymm0, %ymm0
        vmovapd %ymm1, (%r9,%rax)
        vmovapd %ymm0, (%r10,%rax)
        addq    $32, %rax
        cmpq    %rdi, -152(%rbp)
        ja      .L9

(-O3 -fschedule-insns or -frename-registers, same modulo xmm* names, scheduled
as desired)
.L7:
        movapd  (%r9,%rax), %xmm0
        addq    $1, %rdi
        movapd  (%r10,%rax), %xmm2
        addpd   (%r11,%rax), %xmm0
        addpd   (%rcx,%rax), %xmm2
        movaps  %xmm0, (%r9,%rax)
        movaps  %xmm2, (%r10,%rax)
        addq    $16, %rax
        cmpq    %rdi, %r8
        ja      .L7

(-mavx -O3 -mtune=corei7-avx -frename-registers, stack-based references prevent
good scheduling)
.L9:
        movq    -136(%rbp), %rdx
        addq    $1, %rdi
        vmovapd (%r9,%rax), %ymm0
        vmovapd (%r10,%rax), %ymm3
        vaddpd  (%rdx,%rax), %ymm0, %ymm2
        movq    -144(%rbp), %rdx
        vmovapd %ymm2, (%r9,%rax)
        vaddpd  (%rdx,%rax), %ymm3, %ymm4
        vmovapd %ymm4, (%r10,%rax)
        addq    $32, %rax
        cmpq    %rdi, -152(%rbp)
        ja      .L9

(-mavx -O3 -mtune=corei7-avx -fschedule-insns -fno-ivopts, no spilling in the
loop, scheduled as desired)
.L9:    
        addq    $32, %rcx
        addq    $32, %r10
        vmovapd (%rdx), %ymm1
        addq    $32, %rsi
        vmovapd (%rdi), %ymm0
        addq    $32, %r11
        addq    $1, %rax
        addq    $32, %rdx
        vaddpd  -32(%rcx), %ymm1, %ymm1
        addq    $32, %rdi
        vaddpd  -32(%r10), %ymm0, %ymm0
        vmovapd %ymm1, -32(%rsi)
        vmovapd %ymm0, -32(%r11)
        cmpq    %rax, -184(%rbp)
        ja      .L9


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: marcin.krotkiewski at gmail dot com @ 2014-02-07 16:43 UTC
  To: gcc-bugs

--- Comment #8 from Marcin Krotkiewski <marcin.krotkiewski at gmail dot com> ---
(In reply to Andrey Belevantsev from comment #5)
> At this point insn 461 is dead but we do not notice, and it doesn't look
> easy.  I think there was some suggestion in the original research for
> killing dead insn copies left after renaming but I don't remember offhand.

Following Alexander's suggestion, I compiled the test code with -mavx -O3
-fselective-scheduling2 -frename-registers. This seems to get rid of the dead
instructions and yields the desired scheduling:

.L5:
        vmovapd (%rbx,%rdi), %ymm0
        addq    $1, %rsi
        vmovapd (%r12,%rdi), %ymm3
        vaddpd  0(%r13,%rdi), %ymm0, %ymm2
        vaddpd  (%r14,%rdi), %ymm3, %ymm4
        vmovapd %ymm2, (%rbx,%rdi)
        vmovapd %ymm4, (%r12,%rdi)
        addq    $32, %rdi
        cmpq    %rsi, %rdx
        ja      .L5

Alexander, I should maybe clarify that the 'good' code was prepared by hand by
modifying the 'bad' asm I got from gcc 4.7. The asm generated by gcc 4.4 was
the same, if that is what you were referring to.

I am a bit confused now. It seems that all is fine and the desired asm can be
generated, so there is no real bug. But why is the original code compiled with
-O3 -mavx bad then? Is -fschedule-insns not enabled at -O2?


* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: amonakov at gcc dot gnu.org @ 2014-02-07 17:21 UTC
  To: gcc-bugs

--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
By "good code" I was referring to the fact that your 4.7 asm does not contain
stack (%rbp) references in the vectorized loop.

Historically, the first scheduling pass (-fschedule-insns) was problematic for
32-bit x86 because it increased register pressure to the point that the
register allocator couldn't handle it.  The situation got much better with the
introduction of -fsched-pressure and was tuned a bit further recently.  I
don't know the latest status; I think compile time cost is the reason it's not
enabled at -O2.

For regrename, the manpage notes there are debug info issues on some targets;
I don't know if there's anything else besides that.

