From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id F38BB3858403; Tue, 31 Aug 2021 00:15:01 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org F38BB3858403
From: "wilson at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/102139] New: -O3 miscompile due to
 slp-vectorize on strict align target
Date: Tue, 31 Aug 2021 00:15:01 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: wilson at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-102139-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 31 Aug 2021 00:15:02 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102139

            Bug ID: 102139
           Summary: -O3 miscompile due to slp-vectorize on strict align
                    target
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---

This was originally reported here.
https://github.com/riscv/riscv-gcc/issues/289

This testcase is miscompiled at -O3 for a riscv64 target, though this is not a
bug in the riscv64 port.  I think it will fail for any strict align target.

typedef unsigned short uint16_t;

void zero_two_uint16(uint16_t* ptr) {
  ptr[0] = 0;
  ptr[1] = 0;
}

void zero(uint16_t* ptr) {
  for (int i = 0; i < 16; ++i) {
    zero_two_uint16(ptr);
    ptr += 2;
  }
}

The output is
zero:
        sd      zero,0(a0)
        sd      zero,8(a0)
        sd      zero,16(a0)
        sd      zero,24(a0)
        sd      zero,32(a0)
        sd      zero,40(a0)
        sd      zero,48(a0)
        sd      zero,56(a0)
        ret
which fails due to unaligned accesses as a0 only has 2 byte alignment.
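
A driver along the following lines could be used to exercise the failing
case.  This is only a sketch and not part of the original report: the name
check_misaligned_zero and the guard-element layout are hypothetical, and
uint16_t is taken from <stdint.h> rather than the typedef above.  For the test
to be meaningful, zero() would normally live in a separate translation unit so
that the compiler cannot see the alignment of the buffer it is called with.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Testcase from the report. */
void zero_two_uint16(uint16_t *ptr) {
  ptr[0] = 0;
  ptr[1] = 0;
}

void zero(uint16_t *ptr) {
  for (int i = 0; i < 16; ++i) {
    zero_two_uint16(ptr);
    ptr += 2;
  }
}

/* Hypothetical driver: returns 0 if zero() cleared exactly the 32 elements
   it was handed, nonzero otherwise. */
int check_misaligned_zero(void) {
  /* 8-byte aligned backing store; buf[0] and buf[33] act as guards. */
  _Alignas(8) uint16_t buf[40];
  memset(buf, 0xff, sizeof buf);

  /* One element past an 8-byte boundary: 2-byte aligned only. */
  uint16_t *p = buf + 1;
  assert((uintptr_t)p % 8 == 2);

  zero(p);

  if (buf[0] != 0xffff || buf[33] != 0xffff)
    return 1;                 /* wrote outside the requested range */
  for (int i = 1; i <= 32; ++i)
    if (buf[i] != 0)
      return 2;               /* missed an element */
  return 0;
}
```

On a target that tolerates unaligned accesses this returns 0 either way; on a
strict align target the sd stores shown above would be expected to trap
instead.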

A git bisect tracked the problem down to this commit.

commit f5e18dd
Author: Kewen Lin linkw@gcc.gnu.org
Date: Tue Nov 3 02:51:47 2020 +0000

        pass: Run cleanup passes before SLP [PR96789]
        ...

I get correct code if I disable the fre4 pass, which is the fre pass inside
pre_slp_scalar_cleanup which was added by this patch.

The 169t.vectorize pass adds an address alignment check, and then emits a loop
with double-word stores if aligned, and a loop with half-word stores if
unaligned.  172t.cunroll fully unrolls both loops.  The 173t.fre4 pass deletes
a phi node before the half-word stores.  The 172t output has
  <bb 13> [local count: 12627204]:
  # ptr_3 = PHI <ptr_4(D)(2)>
  # ivtmp_15 = PHI <16(2)>
  *ptr_3 =3D 0;
and the 173t.fre4 output has
  <bb 13> [local count: 12627204]:
  *ptr_4(D) = 0;
In the 175t.slp1 pass, the block of half-word stores gets vectorized, which is
wrong.  Then later 207t.dce7 notices duplicate code and deletes the second
block of stores.
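
As a rough illustration of the code shape after 169t.vectorize and
172t.cunroll, the versioning could be modelled in C as below.  This is a
hand-written sketch, not compiler output: the function name zero_versioned is
hypothetical, and memcpy stands in for the double-word stores so that the
model itself stays valid C.

```c
#include <stdint.h>
#include <string.h>

/* Hand-written model of the versioned code; not taken from a GCC dump. */
void zero_versioned(uint16_t *ptr) {
  if (((uintptr_t)ptr & 7) == 0) {
    /* Aligned version: eight double-word stores cover the 64 bytes. */
    unsigned char *bytes = (unsigned char *)ptr;
    const uint64_t z = 0;
    for (int i = 0; i < 8; ++i)
      memcpy(bytes + 8 * i, &z, sizeof z);
  } else {
    /* Unaligned version: 32 half-word stores.  This arm is only reached
       when ptr is not 8-byte aligned, so it must stay scalar. */
    for (int i = 0; i < 32; ++i)
      ptr[i] = 0;
  }
}
```

The miscompile described here corresponds to the else arm also being turned
into double-word stores, after which both arms are identical and the
alignment check no longer protects anything.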

Comparing the full slp1 dump with fre4 disabled versus the unmodified slp1
dump, I see that the first significant difference is when computing pointer
alignment.  With fre4 disabled, I get

tmp.c:4:10: note:  recording new base alignment for vectp_ptr.8_125
  alignment:    8
  misalignment: 0
  based on:     MEM <vector(4) short unsigned int> [(uint16_t *)vectp_ptr.8_125] = { 0, 0, 0, 0 };
tmp.c:4:10: note:  recording new base alignment for ptr_3
  alignment:    2
  misalignment: 0
  based on:     *ptr_3 = 0;
tmp.c:4:10: note:   === vect_slp_analyze_instance_alignment ===
tmp.c:4:10: note:   vect_compute_data_ref_alignment:
tmp.c:4:10: note:   can't force alignment of ref: *ptr_3

It then refuses to vectorize.  With the unmodified compiler I get

tmp.c:4:10: note:  recording new base alignment for ptr_4(D)
  alignment:    8
  misalignment: 0
  based on:     MEM <vector(4) short unsigned int> [(uint16_t *)ptr_4(D)] = { 0, 0, 0, 0 };
tmp.c:4:10: note:   === vect_slp_analyze_instance_alignment ===
tmp.c:4:10: note:   vect_compute_data_ref_alignment:
tmp.c:4:10: missed:   misalign = 0 bytes of ref *ptr_4(D)

and then goes ahead and vectorizes, which is wrong.

Maybe fre4 shouldn't optimize away a phi node when the pointers have different
alignment?

I noticed that before slp1 runs, the double-word store block has
  # ALIGN = 8, MISALIGN = 0
but the half-word store block does not.  After slp1 runs, both the double-word
store and the half-word store blocks have these notes.