From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id B22AA3851C23; Tue, 27 Jul 2021 07:24:08 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org B22AA3851C23
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/39821] 120% slowdown with vectorizer
Date: Tue, 27 Jul 2021 07:24:08 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 4.4.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-39821-4-Jow8FLHE1e@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-39821-4@http.gcc.gnu.org/bugzilla/>
References: <bug-39821-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jul 2021 07:24:08 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
0x398f310 _2 * _4 1 times scalar_stmt costs 12 in body
...
0x392b3f0 _1 w* _3 2 times vec_promote_demote costs 8 in body
...
t4.c:4:12: note:  Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 4
  Vector epilogue cost: 108
  Scalar iteration cost: 40
  Scalar outside cost: 32
  Vector outside cost: 112
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

so clearly the widening multiplication is not costed correctly.  With SSE 4.2
we can do better:

.L4:
        movdqu  (%rcx,%rax), %xmm0
        movdqu  (%rsi,%rax), %xmm1
        addq    $16, %rax
        movdqa  %xmm0, %xmm3
        movdqa  %xmm1, %xmm4
        punpckldq       %xmm0, %xmm3
        punpckldq       %xmm1, %xmm4
        punpckhdq       %xmm0, %xmm0
        pmuldq  %xmm4, %xmm3
        punpckhdq       %xmm1, %xmm1
        pmuldq  %xmm1, %xmm0
        paddq   %xmm3, %xmm2
        paddq   %xmm0, %xmm2
        cmpq    %rdi, %rax
        jne     .L4
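
For readers following the instruction sequence, here is a rough scalar model
in C of what one trip through .L4 computes.  It is only a sketch: t4.c is not
shown in this comment, so the dot-product shape is inferred from the assembly,
and the names model_pmuldq and widen_dot_model are hypothetical.  pmuldq
(available since SSE4.1) multiplies the sign-extended 32-bit elements 0 and 2
of its operands, which is why each input is first unpacked against itself.
The final addition of the two accumulator lanes happens after the loop and is
not part of the listing above, so it is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of pmuldq: sign-extend 32-bit elements 0 and 2 of each
   operand to 64 bits and multiply them, giving two 64-bit products.  */
static void
model_pmuldq (const int32_t x[4], const int32_t y[4], int64_t out[2])
{
  out[0] = (int64_t) x[0] * y[0];
  out[1] = (int64_t) x[2] * y[2];
}

/* Hypothetical helper modelling the .L4 loop for N a multiple of 4.
   Unlike paddq, signed 64-bit overflow is not modelled as wrapping.  */
int64_t
widen_dot_model (const int32_t *a, const int32_t *b, size_t n)
{
  int64_t acc[2] = { 0, 0 };	/* plays the role of %xmm2 */

  assert (n % 4 == 0);
  for (size_t i = 0; i < n; i += 4)
    {
      /* punpckldq x, x yields { x0, x0, x1, x1 }.  */
      int32_t alo[4] = { a[i], a[i], a[i + 1], a[i + 1] };
      int32_t blo[4] = { b[i], b[i], b[i + 1], b[i + 1] };
      /* punpckhdq x, x yields { x2, x2, x3, x3 }.  */
      int32_t ahi[4] = { a[i + 2], a[i + 2], a[i + 3], a[i + 3] };
      int32_t bhi[4] = { b[i + 2], b[i + 2], b[i + 3], b[i + 3] };
      int64_t lo[2], hi[2];

      model_pmuldq (alo, blo, lo);	/* { a0*b0, a1*b1 } */
      model_pmuldq (ahi, bhi, hi);	/* { a2*b2, a3*b3 } */

      /* The two paddq instructions.  */
      acc[0] += lo[0];
      acc[1] += lo[1];
      acc[0] += hi[0];
      acc[1] += hi[1];
    }

  /* Assumed epilogue: reduce the two 64-bit lanes to one scalar.  */
  return acc[0] + acc[1];
}
```

So each iteration needs four unpacks and two pmuldq for four widened
products.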

but even there the costing is imprecise.  The vectorizer is unhelpful in
categorizing the widen mult as vec_promote_demote which then fails to
run into

        case MULT_EXPR:
        case WIDEN_MULT_EXPR:
        case MULT_HIGHPART_EXPR:
          stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
          break;

fixing that yields

0x392b3f0 _1 w* _3 2 times vector_stmt costs 136 in body

for both SSE2 and SSE4.2 and AVX2 so that's over-estimating cost then via

      /* V*DImode is emulated with 5-8 insns.  */
      else if (mode == V2DImode || mode == V4DImode)
        {
          if (TARGET_XOP && mode == V2DImode)
            return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 3);
          else
            return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
        }

with cost->mulss == 16.  I suppose it is somehow failing to realize it's
doing a widening multiply.
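
As a sanity check on the 136 figure, the arithmetic of the non-XOP branch can
be replayed in a few lines of C.  This is a sketch under assumptions, not the
backend code: struct cost_sketch and both functions are hypothetical names,
ix86_vec_cost is treated as returning its cost argument unchanged, and
sse_op = 4 is not stated in this comment but is the value that makes the
numbers agree with the dump.  Only mulss == 16 and the "2 times" count come
from the text above.

```c
#include <assert.h>

/* Hypothetical stand-in for the two cost-table fields used here.  */
struct cost_sketch
{
  int mulss;	/* 16, as stated in the comment */
  int sse_op;	/* assumed to be 4; inferred, not stated */
};

/* Non-XOP branch of the V2DImode/V4DImode case, with ix86_vec_cost
   assumed to pass the value through unchanged.  */
int
emulated_vdi_mult_cost (const struct cost_sketch *c)
{
  return c->mulss * 3 + c->sse_op * 5;
}

/* Total for COUNT copies of the statement, matching the "N times"
   prefix in the vectorizer dump line.  */
int
total_stmt_cost (const struct cost_sketch *c, int count)
{
  assert (count >= 0);
  return count * emulated_vdi_mult_cost (c);
}
```

With mulss = 16 and sse_op = 4 this gives 16 * 3 + 4 * 5 = 68 per statement
and 136 for the two copies, which matches the vector_stmt line above.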