public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop
From: ubizjak at gmail dot com @ 2007-11-07 9:05 UTC (permalink / raw)
To: gcc-bugs
Following testcase exposes optimization problem with current SVN gcc:
--cut here--
extern const int srcshift;
void good (const int *srcdata, int *dstdata)
{
int i;
for (i = 0; i < 256; i++)
dstdata[i] = srcdata[i] << srcshift;
}
void bad (const int *srcdata, int *dstdata)
{
int i;
for (i = 0; i < 256; i++)
{
dstdata[i] |= srcdata[i] << srcshift;
}
}
--cut here--
Using -O3 -msse2, the loop in the above testcase gets vectorized, and the
produced code differs substantially between the good and bad functions:
good:
...
.L8:
xorl %eax, %eax
movd srcshift, %xmm1
.p2align 4,,7
.p2align 3
.L4:
movdqu (%ebx,%eax), %xmm0
pslld %xmm1, %xmm0
movdqa %xmm0, (%esi,%eax)
addl $16, %eax
cmpl $1024, %eax
jne .L4
...
bad:
...
.L21:
movl %esi, %eax (2)
movl %ebx, %edx
leal 1024(%esi), %ecx
.p2align 4,,7
.p2align 3
.L17:
movdqu (%edx), %xmm0
movd srcshift, %xmm1 (1)
pslld %xmm1, %xmm0
movdqu (%eax), %xmm1 (3)
por %xmm1, %xmm0
movdqa %xmm0, (%eax)
addl $16, %eax (4)
addl $16, %edx
cmpl %ecx, %eax
jne .L17
popl %ebx
popl %esi
popl %ebp
ret
In addition to the memory load in the loop (1), several other problems can be
identified: There is no need to move registers (2), because the loop is followed
by function exit. For some reason, an additional IV is used (4) and the same address
is accessed with unaligned access (3) as well as aligned access.
Expected code for the "bad" case would be something like the "good" case with
an additional aligned load of dstdata and a por instruction:
.L8:
xorl %eax, %eax
movd srcshift, %xmm1
.p2align 4,,7
.p2align 3
.L4:
movdqu (%ebx,%eax), %xmm0
movdqa (%esi,%eax), %xmm2
pslld %xmm1, %xmm0
por %xmm2, %xmm0
movdqa %xmm0, (%esi,%eax)
addl $16, %eax
cmpl $1024, %eax
jne .L4
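The loop shape requested above can also be sketched at source level with SSE2 intrinsics. This is only an illustration, not code from the report: the function name `bad_expected` and the `shift` parameter (standing in for the global `srcshift`) are assumptions, unaligned loads and stores are used so the sketch is safe for any pointers, and a scalar fallback is provided for targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: dstdata[i] |= srcdata[i] << shift for 256 ints,
   with the shift count moved into a vector register once, before the loop.  */
void bad_expected (const int *srcdata, int *dstdata, int shift)
{
  assert (shift >= 0 && shift < 32);
#if defined(__SSE2__)
  /* movd: the count lives in the low 32 bits of an XMM register.  */
  const __m128i count = _mm_cvtsi32_si128 (shift);
  for (size_t i = 0; i < 256; i += 4)
    {
      __m128i s = _mm_loadu_si128 ((const __m128i *) (srcdata + i));
      __m128i d = _mm_loadu_si128 ((const __m128i *) (dstdata + i));
      s = _mm_sll_epi32 (s, count);   /* pslld */
      d = _mm_or_si128 (d, s);        /* por */
      _mm_storeu_si128 ((__m128i *) (dstdata + i), d);
    }
#else
  /* Scalar fallback for targets without SSE2.  */
  for (size_t i = 0; i < 256; i++)
    dstdata[i] |= srcdata[i] << shift;
#endif
}
```

The only load of the shift count happens before the loop; each iteration needs one load from each array, one shift, one OR and one store.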
Missing IV elimination could be attributed to tree loop optimizations, but
others are IMO RTL optimization problems, because we enter RTL generation with:
good:
<bb 3>:
MEM[base: dstdata, index: ivtmp.60] = M*(vect_p.29 + ivtmp.60){misalignment: 0} << srcshift.1;
bad:
<bb 4>:
MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} << srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};
--
Summary: Memory load is not eliminated from tight vectorized loop
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: ubizjak at gmail dot com
GCC target triplet: i686-*-*, x86_64-*-*
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011
* [Bug rtl-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: dorit at gcc dot gnu dot org @ 2007-11-07 18:06 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from dorit at gcc dot gnu dot org 2007-11-07 18:06 -------
(In reply to comment #0)
> Following testcase exposes optimization problem with current SVN gcc:
...
> the same address
> is accessed with unaligned access (3) as well as aligned access.
This is a missed-optimization in the vectorizer - we use loop-versioning to
deal with the fact that we don't yet support misaligned stores; so the
vectorized version of the loop is guarded by a runtime test that checks that
the address of the store is aligned. However, we don't use the information that
there's a load from the same address that is therefore also guaranteed to be
aligned.
We actually have this information (we detect DRs that have the same alignment
and collect them in STMT_VINFO_SAME_ALIGN_REFS), but we don't use it when we do
the versioning. We *do* use this information when instead of versioning the
loop, we peel the loop to make the store aligned. In this case we also mark the
relevant SAME_ALIGN_REFS as aligned and generate aligned accesses for them.
(By the way, the reason we decide to use loop-versioning and not loop-peeling
is because we can't determine whether the pointers overlap at compile time. So
we have to use runtime dependence testing (i.e. versioning for aliasing), and
since we currently don't support both versioning and peeling together, this
dictates that we will use runtime alignment testing instead of peeling.)
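To make the versioning scheme described above concrete, here is a rough source-level sketch. It is not the code GCC generates: the function name `bad_versioned`, the whole-array overlap test and the stand-in definition of `srcshift` are assumptions made so the example is self-contained, and the "vector" loop is modelled as four scalar operations per iteration.

```c
#include <stdint.h>

/* Stand-in definition (assumption); the testcase only declares
   `extern const int srcshift`.  */
const int srcshift = 3;

void bad_versioned (const int *srcdata, int *dstdata)
{
  uintptr_t s = (uintptr_t) srcdata;
  uintptr_t d = (uintptr_t) dstdata;
  uintptr_t bytes = 256 * sizeof (int);

  /* Runtime tests guarding the fast version: the store address is
     16-byte aligned and the two arrays do not overlap.  */
  int aligned = (d % 16) == 0;
  int disjoint = (s + bytes <= d) || (d + bytes <= s);

  if (aligned && disjoint)
    {
      /* Models the vectorized loop: four ints per iteration.  The load
         from dstdata shares the store's address, so it is aligned too.  */
      for (int i = 0; i < 256; i += 4)
        for (int j = 0; j < 4; j++)
          dstdata[i + j] |= srcdata[i + j] << srcshift;
    }
  else
    {
      /* Original scalar loop, used when either test fails.  */
      for (int i = 0; i < 256; i++)
        dstdata[i] |= srcdata[i] << srcshift;
    }
}
```

Inside the guarded version the alignment of the `dstdata` store is known, and because the `dstdata` load uses the same address, the same guarantee applies to it; that is the information the comment says is currently not used.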
Here is how it looks in the vectorizer dump file:
"
pr34011.c:14: note: === vect_analyze_dependences ===
pr34011.c:14: note: dependence distance = 0.
pr34011.c:14: note: accesses have the same alignment.
pr34011.c:14: note: dependence distance modulo vf == 0 between *D.1529_9 and *D.1529_9
pr34011.c:14: note: versioning for alias required: can't determine dependence between *D.1531_14 and *D.1529_9
pr34011.c:14: note: mark for run-time aliasing test between *D.1531_14 and *D.1529_9
...
pr34011.c:14: note: === vect_enhance_data_refs_alignment ===
pr34011.c:14: note: Unknown misalignment, is_packed = 0
pr34011.c:14: note: Alignment of access forced using versioning.
pr34011.c:14: note: Versioning for alignment will be applied.
pr34011.c:14: note: Vectorizing an unaligned access.
pr34011.c:14: note: Vectorizing an unaligned access.
"
Instead, if I add __restrict__ qualifiers to the pointer arguments, we get
this:
"
pr34011b.c:14: note: === vect_analyze_dependences ===
pr34011b.c:14: note: dependence distance = 0.
pr34011b.c:14: note: accesses have the same alignment.
pr34011b.c:14: note: dependence distance modulo vf == 0 between *D.1529_9 and *D.1529_9
...
pr34011b.c:14: note: === vect_enhance_data_refs_alignment ===
pr34011b.c:14: note: Unknown misalignment, is_packed = 0
...
pr34011b.c:14: note: Alignment of access forced using peeling.
pr34011b.c:14: note: Peeling for alignment will be applied.
pr34011b.c:14: note: Vectorizing an unaligned access.
"
i.e. we don't need to use runtime dependence testing and version the loop, so
we can use peeling to align the store along with anything that has the same
alignment as the store:
<bb 6>:
MEM[base: D.1676, index: ivtmp.142] = M*(vect_p.111 + ivtmp.142){misalignment: 0} << srcshift | MEM[base: D.1676, index: ivtmp.142];
...
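For comparison, peeling for alignment can be sketched at source level as follows. Again this is only an illustration under assumptions: the name `bad_peeled` and the stand-in definition of `srcshift` are not from the report, and the main loop models one vector operation as four scalar ones.

```c
#include <stdint.h>

/* Stand-in definition (assumption); the testcase only declares
   `extern const int srcshift`.  */
const int srcshift = 3;

void bad_peeled (const int *restrict srcdata, int *restrict dstdata)
{
  int i = 0;

  /* Prologue: peel scalar iterations until &dstdata[i] is 16-byte
     aligned.  */
  while (i < 256 && ((uintptr_t) (dstdata + i) % 16) != 0)
    {
      dstdata[i] |= srcdata[i] << srcshift;
      i++;
    }

  /* Main loop: the dstdata load and store are now both aligned;
     the srcdata load may still be misaligned.  */
  for (; i + 4 <= 256; i += 4)
    for (int j = 0; j < 4; j++)
      dstdata[i + j] |= srcdata[i + j] << srcshift;

  /* Epilogue: leftover iterations that do not fill a whole vector.  */
  for (; i < 256; i++)
    dstdata[i] |= srcdata[i] << srcshift;
}
```

No runtime alias test is needed here because of the `restrict` qualifiers, which is what allows peeling to be chosen instead of versioning.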
> Missing IV elimination could be attributed to tree loop optimizations, but
> others are IMO RTL optimization problems,
(except for the misaligned access, which the vectorizer can avoid).
> because we enter RTL generation with:
> bad:
> <bb 4>:
> MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} <<
> srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};
* [Bug rtl-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: ubizjak at gmail dot com @ 2009-09-12 19:25 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from ubizjak at gmail dot com 2009-09-12 19:25 -------
The testcase does not vectorize anymore, even in the modified form:
--cut here--
const int srcshift;
void good (int *restrict srcdata, int *restrict dstdata)
{
int i;
for (i = 0; i < 256; i++)
dstdata[i] = srcdata[i] << srcshift;
}
void bad (int *restrict srcdata, int *restrict dstdata)
{
int i;
for (i = 0; i < 256; i++)
{
dstdata[i] |= srcdata[i] << srcshift;
}
}
--cut here--
* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: rguenth at gcc dot gnu dot org @ 2009-09-12 20:02 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from rguenth at gcc dot gnu dot org 2009-09-12 20:02 -------
srcshift is not moved out of the loop because we think the store to dstdata may
alias it. I'll fix that.
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c (revision 151651)
+++ tree-ssa-alias.c (working copy)
@@ -633,6 +633,9 @@ indirect_ref_may_alias_decl_p (tree ref1
HOST_WIDE_INT offset2, HOST_WIDE_INT max_size2,
alias_set_type base2_alias_set)
{
+ if (TREE_READONLY (base2))
+ return false;
+
/* If only one reference is based on a variable, they cannot alias if
the pointer access is beyond the extent of the variable access.
(the pointer base cannot validly point to an offset less than zero
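The effect of treating the read-only `srcshift` as non-aliased can be illustrated at source level by hoisting the load by hand. This is a sketch of the equivalent transformation, not of the patch itself; the name `bad_hoisted` and the stand-in definition of `srcshift` are assumptions.

```c
/* Stand-in definition (assumption); the testcase only declares
   `extern const int srcshift`.  */
const int srcshift = 3;

void bad_hoisted (const int *srcdata, int *dstdata)
{
  /* A store through dstdata cannot validly modify an object defined
     const, so reading srcshift once is equivalent to reading it in
     every iteration.  */
  const int shift = srcshift;

  for (int i = 0; i < 256; i++)
    dstdata[i] |= srcdata[i] << shift;
}
```

With the load copied into a local, the loop body no longer references memory that the stores to `dstdata` could be thought to clobber.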
--
rguenth at gcc dot gnu dot org changed:
What                  | Removed                           | Added
----------------------------------------------------------------------------------------
AssignedTo            | unassigned at gcc dot gnu dot org | rguenth at gcc dot gnu dot org
Status                | UNCONFIRMED                       | ASSIGNED
Ever Confirmed        | 0                                 | 1
Last reconfirmed date | 0000-00-00 00:00:00               | 2009-09-12 20:02:13
* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: rguenth at gcc dot gnu dot org @ 2009-09-15 14:07 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from rguenth at gcc dot gnu dot org 2009-09-15 14:07 -------
With the alias issue fixed I get
good:
.LFB0:
.cfi_startproc
movd srcshift(%rip), %xmm1
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movdqu (%rdi,%rax), %xmm0
pslld %xmm1, %xmm0
movdqu %xmm0, (%rsi,%rax)
addq $16, %rax
cmpq $1024, %rax
jne .L2
rep
ret
bad:
.LFB1:
.cfi_startproc
movd srcshift(%rip), %xmm2
leaq 1024(%rsi), %rax
.p2align 4,,10
.p2align 3
.L6:
movdqu (%rdi), %xmm0
addq $16, %rdi
movdqu (%rsi), %xmm1
pslld %xmm2, %xmm0
por %xmm1, %xmm0
movdqu %xmm0, (%rsi)
addq $16, %rsi
cmpq %rax, %rsi
jne .L6
rep
ret
which looks good in both cases.
For the original testcase which results in a runtime alias check we get
bad:
.LFB1:
.cfi_startproc
leaq 16(%rdi), %rax
cmpq %rax, %rsi
leaq 16(%rsi), %rax
seta %dl
cmpq %rax, %rdi
seta %al
orb %al, %dl
je .L10
leaq 1024(%rsi), %rax
.p2align 4,,10
.p2align 3
.L11:
movdqu (%rdi), %xmm0
addq $16, %rdi
movd srcshift(%rip), %xmm1
pslld %xmm1, %xmm0
movdqu (%rsi), %xmm1
por %xmm1, %xmm0
movdqu %xmm0, (%rsi)
addq $16, %rsi
cmpq %rax, %rsi
jne .L11
rep
ret
.L10:
movzbl srcshift(%rip), %ecx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L13:
movl (%rdi,%rax), %edx
sall %cl, %edx
orl %edx, (%rsi,%rax)
addq $4, %rax
cmpq $1024, %rax
jne .L13
rep
ret
thus still bad. It is IRA / reload that moves the srcshift load back into
the loop for some reason.
--
rguenth at gcc dot gnu dot org changed:
What | Removed | Added
---------------------------------------------
CC   |         | vmakarov at gcc dot gnu dot org
* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: rguenth at gcc dot gnu dot org @ 2009-09-15 14:40 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from rguenth at gcc dot gnu dot org 2009-09-15 14:40 -------
Which is likely because it decides to allocate $cx for the load destination
(operand for the scalar shift) and then needs to re-load it to $xmm? for the
vector shift. The placement of the re-load inside the loop is unfortunate...
Reloads for insn # 67
Reload 0: reload_in (SI) = (reg:SI 116 [ pretmp.11 ])
SSE_REGS, RELOAD_FOR_INPUT (opnum = 2)
reload_in_reg: (reg:SI 116 [ pretmp.11 ])
reload_reg_rtx: (reg:SI 22 xmm1)
Reloads for insn # 83
Reload 0: reload_in (QI) = (subreg:QI (reg:SI 116 [ pretmp.11 ]) 0)
CREG, RELOAD_FOR_INPUT (opnum = 2)
reload_in_reg: (subreg:QI (reg:SI 116 [ pretmp.11 ]) 0)
reload_reg_rtx: (reg:QI 2 cx)
* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: rguenth at gcc dot gnu dot org @ 2009-09-16 8:51 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from rguenth at gcc dot gnu dot org 2009-09-16 08:50 -------
Subject: Bug 34011
Author: rguenth
Date: Wed Sep 16 08:50:46 2009
New Revision: 151740
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=151740
Log:
2009-09-16 Richard Guenther <rguenther@suse.de>
PR middle-end/34011
* tree-flow-inline.h (may_be_aliased): Compute readonly variables
as non-aliased.
* gcc.dg/tree-ssa/ssa-lim-7.c: New testcase.
Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-flow-inline.h
* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
From: rguenth at gcc dot gnu dot org @ 2009-09-17 9:08 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from rguenth at gcc dot gnu dot org 2009-09-17 09:08 -------
The problem is now back to the original one.
--
rguenth at gcc dot gnu dot org changed:
What       | Removed                        | Added
-------------------------------------------------------------------------------
AssignedTo | rguenth at gcc dot gnu dot org | unassigned at gcc dot gnu dot org
Status     | ASSIGNED                       | NEW
Keywords   |                                | missed-optimization, ra