public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/60086] New: suboptimal asm generated for a loop (store/load false aliasing)
@ 2014-02-05 22:41 marcin.krotkiewski at gmail dot com
2014-02-06 8:28 ` [Bug rtl-optimization/60086] " jakub at gcc dot gnu.org
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: marcin.krotkiewski at gmail dot com @ 2014-02-05 22:41 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Bug ID: 60086
Summary: suboptimal asm generated for a loop (store/load false
aliasing)
Product: gcc
Version: 4.7.3
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: marcin.krotkiewski at gmail dot com
Created attachment 32060
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32060&action=edit
source code that compiles
Hello,
I am seeing suboptimal performance of the following loop compiled with
gcc 4.7.3 (but also 4.4.7, Ubuntu, full test code attached):
    for(i=0; i<NSIZE; i++){
        a[i] += b[i];
        c[i] += d[i];
    }
Arrays are dynamically allocated and aligned to page boundary, declared
with __restrict__ and __attribute__((aligned(32))). I am running on
Intel i7-2620M (Sandy Bridge).
The problem is IMHO related to '4k aliasing'. It happens for the most
common case of a/b/c/d starting at page boundary (e.g., natural result
of malloc). To demonstrate, here is the assembly generated with 'gcc
-mtune=native -mavx -O3':
.L8:
vmovapd (%rdx,%rdi), %ymm0 #1 load b
addq $1, %r8 #2
vaddpd (%rcx,%rdi), %ymm0, %ymm0 #3 load a and add
vmovapd %ymm0, (%rdx,%rdi) #4 store a
vmovapd (%rax,%rdi), %ymm0 #5 load d
vaddpd (%rsi,%rdi), %ymm0, %ymm0 #6 load c and add
vmovapd %ymm0, (%rax,%rdi) #7 store c
addq $32, %rdi #8
cmpq %r8, %r12 #9
ja .L8 #10
The 4k aliasing problem is caused by lines 4 and 5 (writing result to
array a and reading data from either c or d). From my tests this seems
to be the default behavior for both AVX and SSE2 instruction sets, and
for both vectorized and non-vectorized cases.
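A minimal sketch of the condition involved, assuming the usual description of 4k aliasing: a load may be treated as dependent on an earlier in-flight store when both addresses share the same low 12 bits. The helper name same_page_offset is hypothetical, and the check ignores access widths.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if two addresses have identical low 12 bits,
   i.e. the same offset within a 4 KiB page. */
static int same_page_offset(uintptr_t store_addr, uintptr_t load_addr)
{
    return ((store_addr ^ load_addr) & 0xFFFu) == 0;
}
```

With a, b, c and d all starting at a page boundary, the store to a[i] and the following load of d[i] use the same offset within their pages, so the condition holds in every iteration.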
It is easy to fix the problem by placing the two writes together, at the
end of the iteration, e.g.:
.L8:
vmovapd (%rdx,%rdi), %ymm1 #1
addq $1, %r8 #2
vaddpd (%rcx,%rdi), %ymm1, %ymm1 #3
vmovapd (%rax,%rdi), %ymm0 #4
vaddpd (%rsi,%rdi), %ymm0, %ymm0 #5
vmovapd %ymm1, (%rdx,%rdi) #6
vmovapd %ymm0, (%rax,%rdi) #7
addq $32, %rdi #8
cmpq %r8, %r12 #9
ja .L8 #10
In this case the writes happen after all the loads. The above code is
(almost) what ICC generates for this case. For problem sizes small
enough to fit in L1 the speedup is roughly 50%.
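The same ordering can be spelled out at source level. This is only a sketch: the function name add_pairs is hypothetical, and the compiler remains free to schedule the loads and stores differently, so the generated asm still has to be inspected.

```c
#include <stddef.h>

/* Hypothetical helper: compute both sums into temporaries first,
   then issue the two stores together at the end of the iteration. */
void add_pairs(double *__restrict__ a, const double *__restrict__ b,
               double *__restrict__ c, const double *__restrict__ d,
               size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double ta = a[i] + b[i];   /* loads of a and b */
        double tc = c[i] + d[i];   /* loads of c and d */
        a[i] = ta;                 /* stores grouped after all loads */
        c[i] = tc;
    }
}
```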
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: jakub at gcc dot gnu.org @ 2014-02-06 8:28 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-02-06
                 CC|                            |abel at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
     Ever confirmed|0                           |1
--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC right now only handles __restrict on function parameters, so in this case
the aliasing info isn't known. While the loop is versioned for aliasing at
runtime, the info about that is only known during the vectorizer, therefore
e.g. the scheduler can hardly know it. Pointers to overaligned memory are
something you should generally avoid; __builtin_assume_aligned is what can be
used to tell the compiler about the alignment instead, since overaligned types
often actually hurt generated code instead of improving it. And the way you are
calling posix_memalign is IMHO a strict aliasing violation.
Perhaps GCC could handle posix_memalign specially as a builtin if it is declared
with the right prototype (and optionally some new attribute) and derive both the
aliasing and the alignment info from it. Taking the address of the pointer in
the call isn't really an escape site of any kind; all the call does is return
two values instead of just one. So it could be folded into passing the address
of some temporary to the call instead, then loading from the temporary and
using some special pass-thru builtin that would tell GCC that the pointer is
really malloc-like (not aliasing anything else), and also using
__builtin_assume_aligned. The GNU memalign is far better than posix_memalign
from this POV.
Anyway, if I rewrite your testcase as:
#include <stdlib.h>
#include <stdio.h>

__attribute__((noinline)) void
foo (double *__restrict__ a, double *__restrict__ b, double *__restrict__ c,
     double *__restrict__ d, unsigned long NSIZE)
{
  unsigned long i, j;
  a = __builtin_assume_aligned (a, 32);
  b = __builtin_assume_aligned (b, 32);
  c = __builtin_assume_aligned (c, 32);
  d = __builtin_assume_aligned (d, 32);
  // initialize memory
  for(i=0; i<NSIZE; i++){
    a[i] = 0;
    b[i] = 0;
    c[i] = 0;
    d[i] = 0;
  }
  // outer loop - repeat short tests
  for(j=0; j<10000; j++){
    // inner loop - do the work
    for(i=0; i<NSIZE; i++){
      a[i] += b[i];
      c[i] += d[i];
    }
    // dummy - prevent loop interchange
    if(a[NSIZE/2]<0) printf("%lf\n", a[NSIZE/2]);
  }
}

int main(int argc, char*argv[])
{
  unsigned long NSIZE = atol(argv[1]);
  void *a, *b, *c, *d;
  // allocate starting from page boundary
  posix_memalign(&a, 4096, sizeof(double)*(NSIZE));
  posix_memalign(&b, 4096, sizeof(double)*(NSIZE));
  posix_memalign(&c, 4096, sizeof(double)*(NSIZE));
  posix_memalign(&d, 4096, sizeof(double)*(NSIZE));
  foo ((double *) a, (double *) b, (double *) c, (double *) d, NSIZE);
  return 0;
}
we don't do versioning for alias and also (as before) assume sufficient
alignment, but the scheduler still doesn't reorder the loads vs. the store
unless -O3 -mavx -fschedule-insns is used. The reason why the second scheduler doesn't
reorder those is that RA allocates the same register. With -O3 -mavx
-fselective-scheduling2 the stores are also changed, but we end up with a
weird:
.L9:
movq -136(%rbp), %rdx
vmovapd (%r9,%rax), %ymm0
addq $1, %rdi
vmovapd (%r10,%rax), %ymm8
vaddpd (%rdx,%rax), %ymm0, %ymm0
movq -144(%rbp), %rdx
vaddpd (%rdx,%rax), %ymm8, %ymm9
vmovapd %ymm0, (%r9,%rax)
vmovapd %ymm8, %ymm0
vmovapd %ymm9, %ymm0
vmovapd %ymm0, (%r10,%rax)
addq $32, %rax
cmpq %rdi, -152(%rbp)
ja .L9
Why there is the vmovapd %ymm8, %ymm0 is a mystery, and vmovapd %ymm9, %ymm0
could be very well merged with the store into vmovapd %ymm9, (%r10,%rax).
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: marcin.krotkiewski at gmail dot com @ 2014-02-06 9:34 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
--- Comment #2 from Marcin Krotkiewski <marcin.krotkiewski at gmail dot com> ---
Jakub, thank you for your comments.
> GCC right now only handles __restrict on function parameters, so in this
> case the aliasing info isn't known. While the loop is versioned for
> aliasing at runtime, the info about that is only known during the
> vectorizer, therefore e.g. scheduler can hardly know it.
Does it mean that __restrict is not necessary in order to have a vectorized
code path? I see that if I compile your modified test.c, the loop is vectorized
regardless of whether I use __restrict, or not (runtime versioning). On the
other hand, using __restrict causes gcc to invoke memset for initialization,
while leaving it out results in two paths with a loop.
Interestingly, your test.c does indeed work if compiled with the additional
-fschedule-insns flag. However, if I now remove the __restrict keyword from the
function arguments, I do see a vectorized path, but the flag has no effect and
the instructions are again not reordered.
> The pointers to
> overaligned memory is something you should generally avoid,
> __builtin_assume_aligned is what can be used to tell the compiler about the
> alignment instead, overaligned types often actually hurt generated code
> instead of improving it.
Thanks. Could you suggest what is the preferred way to use it in a portable
manner? e.g. make it suitable for icc, which has a __assume_aligned builtin?
Should I wrap it in a macro?
> And the way you are calling posix_memalign is IMHO
> a strict aliasing violation.
Could be, gcc does not show a warning with -Wall. Thanks for pointing it out.
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: mpolacek at gcc dot gnu.org @ 2014-02-06 10:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Marek Polacek <mpolacek at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mpolacek at gcc dot gnu.org
--- Comment #3 from Marek Polacek <mpolacek at gcc dot gnu.org> ---
(In reply to Marcin Krotkiewski from comment #2)
> Thanks. Could you suggest what is the preferred way to use it in a portable
> manner? e.g. make it suitable for icc, which has a __assume_aligned builtin?
> Should I wrap it in a macro?
I think you could wrap it in a macro that is defined based on:
#ifdef __INTEL_COMPILER
# define ...
#elif defined __GNUC__
# define ...
#endif
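One way the macro could be filled in is sketched below. The macro name ASSUME_ALIGNED and the helper sum_aligned are hypothetical. The sketch assumes that icc's __assume_aligned is used as a statement, while GCC's __builtin_assume_aligned returns the pointer, which therefore has to be assigned back. The Intel branch is tested first because icc typically defines __GNUC__ as well.

```c
#include <stddef.h>

#if defined(__INTEL_COMPILER)
/* icc: statement-style hint, nothing is returned */
#  define ASSUME_ALIGNED(p, n) __assume_aligned((p), (n))
#elif defined(__GNUC__)
/* gcc: the builtin returns the pointer carrying the alignment info */
#  define ASSUME_ALIGNED(p, n) ((p) = __builtin_assume_aligned((p), (n)))
#else
/* unknown compiler: no hint */
#  define ASSUME_ALIGNED(p, n) ((void)0)
#endif

/* Hypothetical usage: sum n doubles from a 32-byte aligned buffer. */
static double sum_aligned(const double *p, size_t n)
{
    ASSUME_ALIGNED(p, 32);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}
```

The hint is a promise to the compiler: passing a pointer that is not actually aligned as stated is undefined behavior.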
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: rguenth at gcc dot gnu.org @ 2014-02-06 10:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
As for posix_memalign, the issue is not so much that of alias analysis (we could
handle it but we don't have a builtin right now) but that of alignment analysis
which doesn't implement alignment tracking of pointers stored in memory. We
could "lower"
posix_memalign (&ptr, align, size);
to
posix_memalign (&ptr, align, size);
ptr = __builtin_assume_aligned (ptr, align);
and hope for FRE to fix things up enough to make that useful.
I have opened PR60092.
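Until something like this is done in the compiler, the same lowering can be written by hand in a small wrapper. A sketch, assuming GCC or a compatible compiler; the function name alloc_aligned_4k is hypothetical, and how much the optimizers make of the hints depends on the GCC version.

```c
#include <stdlib.h>

/* Hypothetical wrapper: returns page-aligned memory through a return value
   instead of an out-parameter, and states the alignment explicitly. */
__attribute__((malloc))
static void *alloc_aligned_4k(size_t size)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, 4096, size) != 0)
        return NULL;                            /* allocation failed */
    return __builtin_assume_aligned(ptr, 4096); /* alignment is now known */
}
```

A caller would then write double *a = alloc_aligned_4k(n * sizeof(double)); and release the memory with free(a).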
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: abel at gcc dot gnu.org @ 2014-02-07 8:52 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
--- Comment #5 from Andrey Belevantsev <abel at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> ...
> doesn't reorder those is that RA allocates the same register. With -O3
> -mavx -fselective-scheduling2 the stores are also changed, but we end up
> with a weird:
> .L9:
> movq -136(%rbp), %rdx
> vmovapd (%r9,%rax), %ymm0
> addq $1, %rdi
> vmovapd (%r10,%rax), %ymm8
> vaddpd (%rdx,%rax), %ymm0, %ymm0
> movq -144(%rbp), %rdx
> vaddpd (%rdx,%rax), %ymm8, %ymm9
> vmovapd %ymm0, (%r9,%rax)
> vmovapd %ymm8, %ymm0
> vmovapd %ymm9, %ymm0
> vmovapd %ymm0, (%r10,%rax)
> addq $32, %rax
> cmpq %rdi, -152(%rbp)
> ja .L9
> Why there is the vmovapd %ymm8, %ymm0 is a mystery, and vmovapd %ymm9, %ymm0
> could be very well merged with the store into vmovapd %ymm9, (%r10,%rax).
That's because we do a renaming and a substitution. We have (in the middle of
scheduling, just scheduled insn 78):
262: dx:DI=[bp:DI-0x88]
72: xmm0:V4DF=[r9:DI+ax:DI]
78: {di:DI=di:DI+0x1;clobber flags:CC;} <--- we are here
73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
74: [r9:DI+ax:DI]=xmm0:V4DF
75: xmm0:V4DF=[r10:DI+ax:DI]
263: dx:DI=[bp:DI-0x90]
76: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
77: [r10:DI+ax:DI]=xmm0:V4DF
Now we want to schedule insn 75 but xmm0 is busy in 74 and 73, so we rename it
to xmm8 and have:
262: dx:DI=[bp:DI-0x88]
72: xmm0:V4DF=[r9:DI+ax:DI]
78: {di:DI=di:DI+0x1;clobber flags:CC;}
459: xmm8:V4DF=[r10:DI+ax:DI] <--- we are here
73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
74: [r9:DI+ax:DI]=xmm0:V4DF
461: xmm0:V4DF=xmm8:V4DF <--- copy after renaming
263: dx:DI=[bp:DI-0x90]
76: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
77: [r10:DI+ax:DI]=xmm0:V4DF
Then after scheduling insns 73 and 263 we have
262: dx:DI=[bp:DI-0x88]
72: xmm0:V4DF=[r9:DI+ax:DI]
78: {di:DI=di:DI+0x1;clobber flags:CC;}
459: xmm8:V4DF=[r10:DI+ax:DI]
73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
263: dx:DI=[bp:DI-0x90] <--- we are here
74: [r9:DI+ax:DI]=xmm0:V4DF
461: xmm0:V4DF=xmm8:V4DF
76: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
77: [r10:DI+ax:DI]=xmm0:V4DF
and now we want to schedule insn 76. We substitute its rhs through a copy 461
but then xmm0 is again busy so we rename the target register to xmm9 and get
262: dx:DI=[bp:DI-0x88]
72: xmm0:V4DF=[r9:DI+ax:DI]
78: {di:DI=di:DI+0x1;clobber flags:CC;}
459: xmm8:V4DF=[r10:DI+ax:DI]
73: xmm0:V4DF=xmm0:V4DF+[dx:DI+ax:DI]
263: dx:DI=[bp:DI-0x90]
464: xmm9:V4DF=xmm8:V4DF+[dx:DI+ax:DI] <--- new renamed insn
74: [r9:DI+ax:DI]=xmm0:V4DF
461: xmm0:V4DF=xmm8:V4DF
466: xmm0:V4DF=xmm9:V4DF <--- copy after renaming
77: [r10:DI+ax:DI]=xmm0:V4DF
At this point insn 461 is dead but we do not notice, and it doesn't look easy.
I think there was some suggestion in the original research for killing dead
insn copies left after renaming but I don't remember offhand.
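Why insn 461 is dead can be checked with a toy model. This is not how the selective scheduler is implemented; it is a sketch that tracks only the vector registers of the final insn list above, and the names struct insn, final_block and is_dead_def are hypothetical.

```c
#include <stddef.h>

/* Toy insn: dst is the vector register written (-1 for a store),
   src[] holds the vector registers read (-1 for unused slots). */
struct insn {
    int id;
    int dst;
    int src[2];
};

/* Vector-register view of the final insn list. */
static const struct insn final_block[] = {
    {  72,  0, { -1, -1 } },  /* xmm0 = [mem]        */
    { 459,  8, { -1, -1 } },  /* xmm8 = [mem]        */
    {  73,  0, {  0, -1 } },  /* xmm0 = xmm0 + [mem] */
    { 464,  9, {  8, -1 } },  /* xmm9 = xmm8 + [mem] */
    {  74, -1, {  0, -1 } },  /* [mem] = xmm0        */
    { 461,  0, {  8, -1 } },  /* xmm0 = xmm8         */
    { 466,  0, {  9, -1 } },  /* xmm0 = xmm9         */
    {  77, -1, {  0, -1 } },  /* [mem] = xmm0        */
};

/* Returns 1 if the register written by insn i is overwritten before
   any later insn in the block reads it. */
static int is_dead_def(const struct insn *block, size_t n, size_t i)
{
    int reg = block[i].dst;
    if (reg < 0)
        return 0;                 /* stores are never treated as dead */
    for (size_t j = i + 1; j < n; j++) {
        if (block[j].src[0] == reg || block[j].src[1] == reg)
            return 0;             /* value is read: live */
        if (block[j].dst == reg)
            return 1;             /* overwritten first: dead */
    }
    return 0;                     /* reaches block end: assume live-out */
}
```

In this model is_dead_def(final_block, 8, 5) reports insn 461 as dead, because insn 466 overwrites xmm0 before anything reads it, while insn 466 itself stays live because the store in insn 77 reads xmm0.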
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: abel at gcc dot gnu.org @ 2014-02-07 8:53 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
--- Comment #6 from Andrey Belevantsev <abel at gcc dot gnu.org> ---
Forgot to mention that we end up scheduling this block in 21 cycles while the
regular scheduling needs 24. Not that it's so important though.
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: amonakov at gcc dot gnu.org @ 2014-02-07 14:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
Alexander Monakov <amonakov at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #1)
> alignment, but still the scheduler doesn't reorder the loads vs. the store,
> unless -O3 -mavx -fschedule-insns. The reason why the second scheduler
> doesn't reorder those is that RA allocates the same register
I think you usually want -fschedule-insns (pre-regalloc scheduling) or
-frename-registers rather than -fselective-scheduling2 when the goal is to
work around RA conservativeness. Unfortunately, stack accesses in the loop
prevent sched2 from using the additional freedom supplied by regrename for AVX
code in this case (when tuning is enabled). The stack accesses seem to be a
trunk regression judging by good code supplied in the opening comment.
(-O3 -mavx -fschedule-insns or -frename-registers, same modulo ymm* names,
%rbp-based accesses in the loop are pretty bad, but otherwise it's scheduled as
desired)
.L9:
movq -136(%rbp), %rdx
vmovapd (%r9,%rax), %ymm1
addq $1, %rdi
vmovapd (%r10,%rax), %ymm0
vaddpd (%rdx,%rax), %ymm1, %ymm1
movq -144(%rbp), %rdx
vaddpd (%rdx,%rax), %ymm0, %ymm0
vmovapd %ymm1, (%r9,%rax)
vmovapd %ymm0, (%r10,%rax)
addq $32, %rax
cmpq %rdi, -152(%rbp)
ja .L9
(-O3 -fschedule-insns or -frename-registers, same modulo xmm* names, scheduled
as desired)
.L7:
movapd (%r9,%rax), %xmm0
addq $1, %rdi
movapd (%r10,%rax), %xmm2
addpd (%r11,%rax), %xmm0
addpd (%rcx,%rax), %xmm2
movaps %xmm0, (%r9,%rax)
movaps %xmm2, (%r10,%rax)
addq $16, %rax
cmpq %rdi, %r8
ja .L7
(-mavx -O3 -mtune=corei7-avx -frename-registers, stack-based references prevent
good scheduling)
.L9:
movq -136(%rbp), %rdx
addq $1, %rdi
vmovapd (%r9,%rax), %ymm0
vmovapd (%r10,%rax), %ymm3
vaddpd (%rdx,%rax), %ymm0, %ymm2
movq -144(%rbp), %rdx
vmovapd %ymm2, (%r9,%rax)
vaddpd (%rdx,%rax), %ymm3, %ymm4
vmovapd %ymm4, (%r10,%rax)
addq $32, %rax
cmpq %rdi, -152(%rbp)
ja .L9
(-mavx -O3 -mtune=corei7-avx -fschedule-insns -fno-ivopts, no spilling in the
loop, scheduled as desired)
.L9:
addq $32, %rcx
addq $32, %r10
vmovapd (%rdx), %ymm1
addq $32, %rsi
vmovapd (%rdi), %ymm0
addq $32, %r11
addq $1, %rax
addq $32, %rdx
vaddpd -32(%rcx), %ymm1, %ymm1
addq $32, %rdi
vaddpd -32(%r10), %ymm0, %ymm0
vmovapd %ymm1, -32(%rsi)
vmovapd %ymm0, -32(%r11)
cmpq %rax, -184(%rbp)
ja .L9
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: marcin.krotkiewski at gmail dot com @ 2014-02-07 16:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
--- Comment #8 from Marcin Krotkiewski <marcin.krotkiewski at gmail dot com> ---
(In reply to Andrey Belevantsev from comment #5)
> At this point insn 461 is dead but we do not notice, and it doesn't look
> easy. I think there was some suggestion in the original research for
> killing dead insn copies left after renaming but I don't remember offhand.
Following Alexander's suggestion, I compiled the test code with -mavx -O3
-fselective-scheduling2 -frename-registers. This seems to get rid of the dead
instructions and yields the desired scheduling:
.L5:
vmovapd (%rbx,%rdi), %ymm0
addq $1, %rsi
vmovapd (%r12,%rdi), %ymm3
vaddpd 0(%r13,%rdi), %ymm0, %ymm2
vaddpd (%r14,%rdi), %ymm3, %ymm4
vmovapd %ymm2, (%rbx,%rdi)
vmovapd %ymm4, (%r12,%rdi)
addq $32, %rdi
cmpq %rsi, %rdx
ja .L5
Alexander, I should maybe clarify that the 'good' code was prepared by hand,
by modifying the 'bad' asm I got from gcc 4.7. The asm generated by gcc 4.4 was
the same, if that is what you were referring to.
I am a bit confused now. It seems that all is fine and the desired asm can be
generated, so there is no real bug. But why is the original code compiled with
-O3 -mavx bad then? Is -fschedule-insns not enabled at -O2?
* [Bug rtl-optimization/60086] suboptimal asm generated for a loop (store/load false aliasing)
From: amonakov at gcc dot gnu.org @ 2014-02-07 17:21 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60086
--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
By "good code" I was referring to the fact that your 4.7 asm does not contain
stack (%rbp) references in the vectorized loop.
Historically, the first scheduling pass (-fschedule-insns) was problematic for
32-bit x86 because it increased register pressure to a point that the register
allocator couldn't handle. The situation got much better with the introduction of
-fsched-pressure, and was tuned a bit further recently. I don't know the
latest status; I think compile time cost is the reason it's not enabled at -O2.
For regrename, the manpage notes there are debug info issues on some targets; I
don't know if there's anything else beside that.