From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/99971] GCC generates partially vectorized and scalar code at once
Date: Fri, 09 Apr 2021 07:05:46 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99971

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|                              |missed-optimization
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
   Last reconfirmed|                              |2021-04-09
             Status|UNCONFIRMED                   |ASSIGNED

--- Comment #2 from Richard Biener ---
Confirmed. While we manage to analyze the block for the "perfect" solution,
we fail because dependence testing does not handle one piece, and this throws
away half of the vectorization.
We do see that we retain the scalar loads and computations, but doing two
vector loads, a vector add and a vector store (cost 40) still seems cheaper
than doing four scalar stores (cost 48):

0x1fdb5a0 x_2(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 y1_3(D)->a 1 times unaligned_load (misalign -1) costs 12 in body
0x1fdb5a0 _13 + _14 1 times vector_stmt costs 4 in body
0x1fdb5a0 _15 1 times unaligned_store (misalign -1) costs 12 in body
0x1fddcb0 _15 1 times scalar_store costs 12 in body
0x1fddcb0 _18 1 times scalar_store costs 12 in body
0x1fddcb0 _21 1 times scalar_store costs 12 in body
0x1fddcb0 _24 1 times scalar_store costs 12 in body
t.C:28:1: note: Cost model analysis:
  Vector inside of basic block cost: 40
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 48
t.C:28:1: note: Basic block will be vectorized using SLP

Now, fortunately, GCC 11 improves on this [a bit] and we'll produce

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        paddd   %xmm1, %xmm0
        movups  %xmm0, (%rdi)
        movd    %xmm0, %eax
        subl    (%rdx), %eax
        movl    %eax, (%rdi)
        pextrd  $1, %xmm0, %eax
        subl    4(%rdx), %eax
        movl    %eax, 4(%rdi)
        pextrd  $2, %xmm0, %eax
        subl    8(%rdx), %eax
        movl    %eax, 8(%rdi)
        pextrd  $3, %xmm0, %eax
        subl    12(%rdx), %eax
        movl    %eax, 12(%rdi)
        ret

which does not re-do the scalar loads/adds but instead uses the vector result.
Still, the same dependence issue is present:

t.C:16:11: missed: can't determine dependence between y1_3(D)->b and x_2(D)->a
t.C:16:11: note: removing SLP instance operations starting from: x_2(D)->a = _6;

The scalar code before vectorization looks like:

  [local count: 1073741824]:
  _13 = x_2(D)->a;
  _14 = y1_3(D)->a;
  _15 = _13 + _14;
  x_2(D)->a = _15;
  _16 = x_2(D)->b;
  _17 = y1_3(D)->b;    <---
  _18 = _16 + _17;
  x_2(D)->b = _18;
  _19 = x_2(D)->c;
  _20 = y1_3(D)->c;
  _21 = _19 + _20;
  x_2(D)->c = _21;
  _22 = x_2(D)->d;
  _23 = y1_3(D)->d;
  _24 = _22 + _23;
  x_2(D)->d = _24;
  _5 = y2_4(D)->a;
  _6 = _15 - _5;
  x_2(D)->a = _6;    <---
  _7 = y2_4(D)->b;
  _8 = _18 - _7;
  x_2(D)->b = _8;
  _9 = y2_4(D)->c;
  _10 = _21 - _9;
  x_2(D)->c = _10;
  _11 = y2_4(D)->d;
  _12 = _24 - _11;
  x_2(D)->d = _12;
  return;

Using

void test(A& __restrict x, A const& y1, A const& y2)
{
  x += y1;
  x -= y2;
}

produces optimal assembly even with GCC 10:

_Z4testR1ARKS_S2_:
.LFB2:
        .cfi_startproc
        movdqu  (%rsi), %xmm0
        movdqu  (%rdx), %xmm1
        movdqu  (%rdi), %xmm2
        psubd   %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        ret

Note that I think we should be able to handle the dependences even without
the __restrict annotation.