[Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop

public inbox for gcc-bugs@sourceware.org

* [Bug rtl-optimization/34011]  New: Memory load is not eliminated from tight vectorized loop
@ 2007-11-07  9:05 ubizjak at gmail dot com
  2007-11-07 18:06 ` [Bug rtl-optimization/34011] " dorit at gcc dot gnu dot org
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: ubizjak at gmail dot com @ 2007-11-07  9:05 UTC (permalink / raw)
  To: gcc-bugs

The following testcase exposes an optimization problem with current SVN gcc:

--cut here--
extern const int srcshift;

void good (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    dstdata[i] = srcdata[i] << srcshift;
}


void bad (const int *srcdata, int *dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    {
      dstdata[i] |= srcdata[i] << srcshift;
    }
}
--cut here--

Using -O3 -msse2, the loop in the above testcase gets vectorized, and the
generated code differs substantially between the good and bad functions:

good:
        ...
.L8:
        xorl    %eax, %eax
        movd    srcshift, %xmm1
        .p2align 4,,7
        .p2align 3
.L4:
        movdqu  (%ebx,%eax), %xmm0
        pslld   %xmm1, %xmm0
        movdqa  %xmm0, (%esi,%eax)
        addl    $16, %eax
        cmpl    $1024, %eax
        jne     .L4
        ...

bad:
        ...
.L21:
        movl    %esi, %eax        (2)
        movl    %ebx, %edx
        leal    1024(%esi), %ecx
        .p2align 4,,7
        .p2align 3
.L17:
        movdqu  (%edx), %xmm0
        movd    srcshift, %xmm1   (1)
        pslld   %xmm1, %xmm0
        movdqu  (%eax), %xmm1     (3)
        por     %xmm1, %xmm0
        movdqa  %xmm0, (%eax)
        addl    $16, %eax         (4)
        addl    $16, %edx
        cmpl    %ecx, %eax
        jne     .L17
        popl    %ebx
        popl    %esi
        popl    %ebp
        ret

In addition to the memory load in the loop (1), several other problems can be
identified: there is no need to move registers (2), because the loop is followed
by the function exit. For some reason, an additional IV is used (4), and the
same address is accessed with an unaligned access (3) as well as an aligned
access.

The expected code for the "bad" case would be something like the "good" case
with additional movaps+por instructions:

.L8:
        xorl    %eax, %eax
        movd    srcshift, %xmm1
        .p2align 4,,7
        .p2align 3
.L4:
        movdqu  (%ebx,%eax), %xmm0
        movaps  (%esi,%eax), %xmm2
        pslld   %xmm1, %xmm0
        por     %xmm2, %xmm0
        movdqa  %xmm0, (%esi,%eax)
        addl    $16, %eax
        cmpl    $1024, %eax
        jne     .L4
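
As a source-level illustration of what hoisting the load (1) amounts to, here
is a minimal sketch that reads the shift count into a local once before the
loop. The name bad_hoisted and the definition of srcshift with the value 3 are
assumptions made only so the example is self-contained; the report declares
srcshift as extern and does not give its value.

```c
/* Hypothetical definition; the report only has "extern const int srcshift". */
const int srcshift = 3;

/* Same computation as bad() in the report, with the shift count
   copied to a local so it is read from memory once, before the loop. */
void bad_hoisted(const int *srcdata, int *dstdata)
{
    const int shift = srcshift;  /* single load, outside the loop */
    int i;

    for (i = 0; i < 256; i++)
        dstdata[i] |= srcdata[i] << shift;
}
```

This only shows the intended semantics; whether a given compiler version then
keeps the count in a register across the vectorized loop is not guaranteed.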

The missing IV elimination could be attributed to tree loop optimizations, but
the others are IMO RTL optimization problems, because we enter RTL generation
with:

good:
<bb 3>:
  MEM[base: dstdata, index: ivtmp.60] = M*(vect_p.29 + ivtmp.60){misalignment: 0} << srcshift.1;

bad:
<bb 4>:
  MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} << srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};


-- 
           Summary: Memory load is not eliminated from tight vectorized loop
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ubizjak at gmail dot com
GCC target triplet: i686-*-*, x86_64-*-*


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011



* [Bug rtl-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
@ 2007-11-07 18:06 ` dorit at gcc dot gnu dot org
  2009-09-12 19:25 ` ubizjak at gmail dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: dorit at gcc dot gnu dot org @ 2007-11-07 18:06 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #1 from dorit at gcc dot gnu dot org  2007-11-07 18:06 -------
(In reply to comment #0)
> Following testcase exposes optimization problem with current SVN gcc:
...
> the same address
> is accessed with unaligned access (3) as well as aligned access.

This is a missed optimization in the vectorizer. We use loop versioning to
deal with the fact that we don't yet support misaligned stores, so the
vectorized version of the loop is guarded by a runtime test that checks that
the address of the store is aligned. However, we don't use the information that
there's a load from the same address, which is therefore also guaranteed to be
aligned.
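
A rough source-level picture of versioning for alignment, as a sketch only:
the function name, the enum, and the 16-byte check written with uintptr_t are
illustrative assumptions, not what the vectorizer emits. Both versions compute
the same result; the point is that inside the guarded version the load of
dst[i] is at the same address as the store, so it is aligned as well.

```c
#include <stdint.h>

/* Hypothetical tags, used only to report which version of the loop ran. */
enum loop_version { VERSION_ALIGNED, VERSION_FALLBACK };

enum loop_version or_shift_versioned(const int *src, int *dst, int n, int shift)
{
    int i;

    if (((uintptr_t)dst & 15u) == 0) {
        /* Guarded version: the store to dst[i] starts 16-byte aligned.
           The load of dst[i] uses the same address, so it is aligned too. */
        for (i = 0; i < n; i++)
            dst[i] |= src[i] << shift;
        return VERSION_ALIGNED;
    }

    /* Other version: nothing is known about the alignment of dst. */
    for (i = 0; i < n; i++)
        dst[i] |= src[i] << shift;
    return VERSION_FALLBACK;
}
```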

We actually have this information (we detect DRs that have the same alignment
and collect them in STMT_VINFO_SAME_ALIGN_REFS), but we don't use it when we do
the versioning. We *do* use this information when, instead of versioning the
loop, we peel the loop to make the store aligned. In that case we also mark the
relevant SAME_ALIGN_REFS as aligned and generate aligned accesses for them.

(By the way, the reason we decide to use loop versioning and not loop peeling
is that we can't determine at compile time whether the pointers overlap. So
we have to use runtime dependence testing (i.e. versioning for aliasing), and
since we currently don't support both versioning and peeling together, this
dictates that we use runtime alignment testing instead of peeling.)

Here is what it looks like in the vectorizer dump file:

"
pr34011.c:14: note: === vect_analyze_dependences ===
pr34011.c:14: note: dependence distance  = 0.
pr34011.c:14: note: accesses have the same alignment.
pr34011.c:14: note: dependence distance modulo vf == 0 between *D.1529_9 and *D.1529_9
pr34011.c:14: note: versioning for alias required: can't determine dependence between *D.1531_14 and *D.1529_9
pr34011.c:14: note: mark for run-time aliasing test between *D.1531_14 and *D.1529_9
...
pr34011.c:14: note: === vect_enhance_data_refs_alignment ===
pr34011.c:14: note: Unknown misalignment, is_packed = 0
pr34011.c:14: note: Alignment of access forced using versioning.
pr34011.c:14: note: Versioning for alignment will be applied.
pr34011.c:14: note: Vectorizing an unaligned access.
pr34011.c:14: note: Vectorizing an unaligned access.
"

If I instead add __restrict__ qualifiers to the pointer arguments, we get
this:

"
pr34011b.c:14: note: === vect_analyze_dependences ===
pr34011b.c:14: note: dependence distance  = 0.
pr34011b.c:14: note: accesses have the same alignment.
pr34011b.c:14: note: dependence distance modulo vf == 0 between *D.1529_9 and *D.1529_9
...
pr34011b.c:14: note: === vect_enhance_data_refs_alignment ===
pr34011b.c:14: note: Unknown misalignment, is_packed = 0
...
pr34011b.c:14: note: Alignment of access forced using peeling.
pr34011b.c:14: note: Peeling for alignment will be applied.
pr34011b.c:14: note: Vectorizing an unaligned access.
"

i.e. we don't need to use runtime dependence testing and version the loop, so
we can use peeling to align the store along with anything that has the same
alignment as the store:

<bb 6>:
  MEM[base: D.1676, index: ivtmp.142] = M*(vect_p.111 + ivtmp.142){misalignment: 0} << srcshift | MEM[base: D.1676, index: ivtmp.142];
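
For comparison, peeling for alignment can be pictured at the source level
roughly as follows. This is a sketch under assumptions: the function name, the
returned peel count, and the explicit 16-byte test are illustrative and do not
come from the vectorizer.

```c
#include <stdint.h>

/* Returns the number of scalar iterations peeled off before dst + i
   reached a 16-byte boundary (hypothetical helper for illustration). */
int or_shift_peeled(const int *src, int *dst, int n, int shift)
{
    int i = 0;
    int peeled;

    /* Prologue: scalar iterations until the store address is aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i] |= src[i] << shift;
        i++;
    }
    peeled = i;

    /* Main loop: dst + i is 16-byte aligned on entry, so both the store
       to dst[i] and the load of dst[i] can be treated as aligned.
       Nothing is known about the alignment of src. */
    for (; i < n; i++)
        dst[i] |= src[i] << shift;

    return peeled;
}
```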

...
> Missing IV elimination could be attributed to tree loop optimizations, but
> others are IMO RTL optimization problems, 

(except for the misaligned access, which the vectorizer can avoid).

> because we enter RTL generation with:
> bad:
> <bb 4>:
>   MEM[index: ivtmp.127] = M*(vector int *) ivtmp.130{misalignment: 0} <<
> srcshift.3 | M*(vector int *) ivtmp.127{misalignment: 0};


* [Bug rtl-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
  2007-11-07 18:06 ` [Bug rtl-optimization/34011] " dorit at gcc dot gnu dot org
@ 2009-09-12 19:25 ` ubizjak at gmail dot com
  2009-09-12 20:02 ` [Bug tree-optimization/34011] " rguenth at gcc dot gnu dot org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: ubizjak at gmail dot com @ 2009-09-12 19:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from ubizjak at gmail dot com  2009-09-12 19:25 -------
The testcase does not vectorize anymore, even in the modified form:

--cut here--
const int srcshift;

void good (int *restrict srcdata, int *restrict dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    dstdata[i] = srcdata[i] << srcshift;
}


void bad (int *restrict srcdata, int *restrict dstdata)
{
  int i;

  for (i = 0; i < 256; i++)
    {
      dstdata[i] |= srcdata[i] << srcshift;
    }
}
--cut here--



* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
  2007-11-07 18:06 ` [Bug rtl-optimization/34011] " dorit at gcc dot gnu dot org
  2009-09-12 19:25 ` ubizjak at gmail dot com
@ 2009-09-12 20:02 ` rguenth at gcc dot gnu dot org
  2009-09-15 14:07 ` rguenth at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-09-12 20:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from rguenth at gcc dot gnu dot org  2009-09-12 20:02 -------
srcshift is not moved out of the loop because we think the store to dstdata may
alias it.  I'll fix that.

Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c    (revision 151651)
+++ tree-ssa-alias.c    (working copy)
@@ -633,6 +633,9 @@ indirect_ref_may_alias_decl_p (tree ref1
                               HOST_WIDE_INT offset2, HOST_WIDE_INT max_size2,
                               alias_set_type base2_alias_set)
 {
+  if (TREE_READONLY (base2))
+    return false;
+
   /* If only one reference is based on a variable, they cannot alias if
      the pointer access is beyond the extent of the variable access.
      (the pointer base cannot validly point to an offset less than zero


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|unassigned at gcc dot gnu   |rguenth at gcc dot gnu dot
                   |dot org                     |org
             Status|UNCONFIRMED                 |ASSIGNED
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2009-09-12 20:02:13
               date|                            |



* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
                   ` (2 preceding siblings ...)
  2009-09-12 20:02 ` [Bug tree-optimization/34011] " rguenth at gcc dot gnu dot org
@ 2009-09-15 14:07 ` rguenth at gcc dot gnu dot org
  2009-09-15 14:40 ` rguenth at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-09-15 14:07 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from rguenth at gcc dot gnu dot org  2009-09-15 14:07 -------
With the alias issue fixed I get

good:
.LFB0:
        .cfi_startproc
        movd    srcshift(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movdqu  (%rdi,%rax), %xmm0
        pslld   %xmm1, %xmm0
        movdqu  %xmm0, (%rsi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2
        rep
        ret

bad:
.LFB1:
        .cfi_startproc
        movd    srcshift(%rip), %xmm2
        leaq    1024(%rsi), %rax
        .p2align 4,,10
        .p2align 3
.L6:
        movdqu  (%rdi), %xmm0
        addq    $16, %rdi
        movdqu  (%rsi), %xmm1
        pslld   %xmm2, %xmm0
        por     %xmm1, %xmm0
        movdqu  %xmm0, (%rsi)
        addq    $16, %rsi
        cmpq    %rax, %rsi
        jne     .L6
        rep
        ret

which looks good in both cases.

For the original testcase which results in a runtime alias check we get

bad:
.LFB1:
        .cfi_startproc
        leaq    16(%rdi), %rax
        cmpq    %rax, %rsi
        leaq    16(%rsi), %rax
        seta    %dl
        cmpq    %rax, %rdi
        seta    %al
        orb     %al, %dl
        je      .L10
        leaq    1024(%rsi), %rax
        .p2align 4,,10
        .p2align 3
.L11:
        movdqu  (%rdi), %xmm0
        addq    $16, %rdi
        movd    srcshift(%rip), %xmm1
        pslld   %xmm1, %xmm0
        movdqu  (%rsi), %xmm1
        por     %xmm1, %xmm0
        movdqu  %xmm0, (%rsi)
        addq    $16, %rsi
        cmpq    %rax, %rsi
        jne     .L11
        rep
        ret
.L10:
        movzbl  srcshift(%rip), %ecx
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L13:
        movl    (%rdi,%rax), %edx
        sall    %cl, %edx
        orl     %edx, (%rsi,%rax)
        addq    $4, %rax
        cmpq    $1024, %rax
        jne     .L13
        rep
        ret

thus still bad.  It is IRA / reload that moves the srcshift load back into
the loop for some reason.
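
The guard at the top of this version of bad (the two leaq/cmpq/seta pairs
followed by orb and je) can be read as the following C predicate. This is a
hand translation offered as a sketch; the function name is made up, and the
comparison is done on uintptr_t values to mirror the unsigned compares in the
assembly.

```c
#include <stdint.h>

/* Nonzero when the vectorized loop (.L11) is entered, zero when the
   scalar loop (.L10) is used. The constant 16 is the vector size in bytes. */
int vector_path_allowed(const int *srcdata, const int *dstdata)
{
    uintptr_t s = (uintptr_t)srcdata;
    uintptr_t d = (uintptr_t)dstdata;

    /* seta %dl: dstdata is above srcdata + 16
       seta %al: srcdata is above dstdata + 16 */
    return (d > s + 16) || (s > d + 16);
}
```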


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu dot
                   |                            |org



* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
                   ` (3 preceding siblings ...)
  2009-09-15 14:07 ` rguenth at gcc dot gnu dot org
@ 2009-09-15 14:40 ` rguenth at gcc dot gnu dot org
  2009-09-16  8:51 ` rguenth at gcc dot gnu dot org
  2009-09-17  9:08 ` rguenth at gcc dot gnu dot org
  6 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-09-15 14:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from rguenth at gcc dot gnu dot org  2009-09-15 14:40 -------
Which is likely because it decides to allocate $cx for the load destination
(operand for the scalar shift) and then needs to re-load it to $xmm? for the
vector shift.  The placement of the re-load inside the loop is unfortunate...

Reloads for insn # 67
Reload 0: reload_in (SI) = (reg:SI 116 [ pretmp.11 ])
        SSE_REGS, RELOAD_FOR_INPUT (opnum = 2)
        reload_in_reg: (reg:SI 116 [ pretmp.11 ])
        reload_reg_rtx: (reg:SI 22 xmm1)

Reloads for insn # 83
Reload 0: reload_in (QI) = (subreg:QI (reg:SI 116 [ pretmp.11 ]) 0)
        CREG, RELOAD_FOR_INPUT (opnum = 2)
        reload_in_reg: (subreg:QI (reg:SI 116 [ pretmp.11 ]) 0)
        reload_reg_rtx: (reg:QI 2 cx)
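
The two register classes in the reload dump correspond to the two forms of
the shift: the scalar shift takes its count in %cl, while pslld takes it in an
XMM register. A minimal sketch with SSE2 intrinsics that moves the count into
an XMM register once, before the loop, is shown below. The function name is an
assumption, and the scalar fallback exists only so the example builds on
targets without SSE2.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hand-written equivalent of the 256-element loop in bad(). */
void or_shift_256(const int *src, int *dst, int shift)
{
    int i;
#if defined(__SSE2__)
    /* Count moved to an XMM register once, outside the loop. */
    const __m128i count = _mm_cvtsi32_si128(shift);

    for (i = 0; i < 256; i += 4) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + i));

        s = _mm_sll_epi32(s, count);              /* pslld */
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_or_si128(d, s));     /* por + store */
    }
#else
    /* Scalar fallback for targets without SSE2. */
    for (i = 0; i < 256; i++)
        dst[i] |= src[i] << shift;
#endif
}
```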



* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
                   ` (4 preceding siblings ...)
  2009-09-15 14:40 ` rguenth at gcc dot gnu dot org
@ 2009-09-16  8:51 ` rguenth at gcc dot gnu dot org
  2009-09-17  9:08 ` rguenth at gcc dot gnu dot org
  6 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-09-16  8:51 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from rguenth at gcc dot gnu dot org  2009-09-16 08:50 -------
Subject: Bug 34011

Author: rguenth
Date: Wed Sep 16 08:50:46 2009
New Revision: 151740

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=151740
Log:
2009-09-16  Richard Guenther  <rguenther@suse.de>

        PR middle-end/34011
        * tree-flow-inline.h (may_be_aliased): Compute readonly variables
        as non-aliased.

        * gcc.dg/tree-ssa/ssa-lim-7.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-7.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-flow-inline.h



* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
  2007-11-07  9:05 [Bug rtl-optimization/34011] New: Memory load is not eliminated from tight vectorized loop ubizjak at gmail dot com
                   ` (5 preceding siblings ...)
  2009-09-16  8:51 ` rguenth at gcc dot gnu dot org
@ 2009-09-17  9:08 ` rguenth at gcc dot gnu dot org
  6 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2009-09-17  9:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from rguenth at gcc dot gnu dot org  2009-09-17 09:08 -------
The problem is now back to the original one.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|rguenth at gcc dot gnu dot  |unassigned at gcc dot gnu
                   |org                         |dot org
             Status|ASSIGNED                    |NEW
           Keywords|                            |missed-optimization, ra



* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
       [not found] <bug-34011-4@http.gcc.gnu.org/bugzilla/>
  2012-01-20 10:42 ` ubizjak at gmail dot com
@ 2021-07-26 19:49 ` pinskia at gcc dot gnu.org
  1 sibling, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-07-26 19:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011

--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
good function:
.L3:
        movdqu  (%rdi,%rax), %xmm0
        pslld   %xmm1, %xmm0
        movups  %xmm0, (%rsi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L3

bad function:
.L11:
        movdqu  (%rdi,%rax), %xmm0
        movdqu  (%rsi,%rax), %xmm2
        pslld   %xmm1, %xmm0
        por     %xmm2, %xmm0
        movups  %xmm0, (%rsi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L11


Looks good to me now.


* [Bug tree-optimization/34011] Memory load is not eliminated from tight vectorized loop
       [not found] <bug-34011-4@http.gcc.gnu.org/bugzilla/>
@ 2012-01-20 10:42 ` ubizjak at gmail dot com
  2021-07-26 19:49 ` pinskia at gcc dot gnu.org
  1 sibling, 0 replies; 10+ messages in thread
From: ubizjak at gmail dot com @ 2012-01-20 10:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34011

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2009-09-12 20:02:13         |2012-01-20 20:02:13
      Known to fail|                            |4.7.0

--- Comment #8 from Uros Bizjak <ubizjak at gmail dot com> 2012-01-20 10:29:54 UTC ---
Reconfirmed with

"GCC: (GNU) 4.7.0 20120118 (experimental) [trunk revision 183277]"



