From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 94702385BF84; Mon, 25 Oct 2021 21:44:09 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 94702385BF84
From: "peter at cordes dot ca" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/102494] Failure to optimize vector reduction
 properly especially when using OpenMP
Date: Mon, 25 Oct 2021 21:44:09 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: peter at cordes dot ca
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-102494-4-hdA7DQ6Jy2@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-102494-4@http.gcc.gnu.org/bugzilla/>
References: <bug-102494-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 25 Oct 2021 21:44:09 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca
--- Comment #10 from Peter Cordes <peter at cordes dot ca> ---
Current trunk with -fopenmp is still not good https://godbolt.org/z/b3jjhcvTa
Still doing two separate sign extensions and two stores / wider reload (store
forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
        push    rbp
        vpmovsxbw       xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw       xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
   ... then asm using byte-shifts

Including stuff like
   movdqa  xmm1, xmm0
   psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.
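
To illustrate why a single pshufd could replace the movdqa + psrldq pair, here
is a minimal scalar model in C. The type `vec128`, both function names and the
immediate 0xE5 are assumptions chosen for illustration, not anything GCC emits.
The model only shows that the low dword, the part the following add consumes,
is the same either way; the upper elements differ, which is the "high garbage"
that can be ignored.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit register as four 32-bit elements,
   d[0] being the lowest. */
typedef struct { uint32_t d[4]; } vec128;

/* Model of "psrldq x, 4": shift the whole register right by 4 bytes,
   filling the top element with zero. */
static vec128 model_psrldq4(vec128 x)
{
    vec128 r = { { x.d[1], x.d[2], x.d[3], 0 } };
    return r;
}

/* Model of "pshufd dst, src, imm8": each destination element i is the
   source element selected by bits [2*i+1 : 2*i] of the immediate.
   Because the destination is a separate register, no movdqa copy is
   needed first. */
static vec128 model_pshufd(vec128 src, unsigned imm8)
{
    vec128 r;
    for (int i = 0; i < 4; i++)
        r.d[i] = src.d[(imm8 >> (2 * i)) & 3];
    return r;
}
```

With an input of {10, 20, 30, 40}, both model_psrldq4(x) and
model_pshufd(x, 0xE5) have 20 in element 0; they differ only in the upper
elements.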

And ARM64 goes scalar.

----

Current trunk *without* -fopenmp produces decent asm
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3)
simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret
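
For reference, the sxtl / addv pair above is a direct translation of "widen
each signed byte to 16 bits, then sum across the vector". A minimal C sketch of
that reduction follows. It is not SIMDe's actual source: the name
`vaddlv_s8_ref` and the plain-array argument are assumptions for illustration,
and the pragma is guarded so that it only applies when compiling with -fopenmp.

```c
#include <stdint.h>

/* Hypothetical stand-in for simde_vaddlv_s8: widen eight signed bytes
   to 16 bits and add them up.  The sum of eight int8_t values always
   fits in int16_t (range -1024 .. 1016). */
static int16_t vaddlv_s8_ref(const int8_t a[8])
{
    int16_t r = 0;
#ifdef _OPENMP
#pragma omp simd reduction(+:r)
#endif
    for (int i = 0; i < 8; i++)
        r = (int16_t)(r + a[i]);
    return r;
}
```

Calling it on eight copies of -128 gives -1024 and on eight copies of 127 gives
1016, the two extremes of the result range.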

x86-64 gcc  -O3 -march=skylake
simde_vaddlv_s8:
        vpmovsxbw       xmm1, xmm0
        vpsrlq  xmm0, xmm0, 32
        vpmovsxbw       xmm0, xmm0
        vpaddw  xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm0, 32
        vpaddw  xmm0, xmm0, xmm1
        vpsrlq  xmm1, xmm0, 16
        vpaddw  xmm0, xmm0, xmm1
        vpextrw eax, xmm0, 0
        ret


That's pretty good, but  VMOVD eax, xmm0  would be more efficient than  VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx.
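
The difference between the two extraction instructions can be sketched with a
scalar model in C. All function names here are hypothetical; lane0 is the
16-bit element holding the final sum and lane1 is whatever happens to sit
above it. The uint16_t to int16_t conversions are implementation-defined in C
but wrap on GCC and Clang.

```c
#include <stdint.h>

/* Model of "vpextrw eax, xmm0, 0": lane 0 is zero-extended into eax,
   so lane 1 never reaches the result. */
static uint32_t model_vpextrw(uint16_t lane0, uint16_t lane1)
{
    (void)lane1;
    return lane0;
}

/* Model of "vmovd eax, xmm0": the low 32 bits are copied as-is, so
   lane 1 lands in bits 16..31 as high garbage. */
static uint32_t model_vmovd(uint16_t lane0, uint16_t lane1)
{
    return ((uint32_t)lane1 << 16) | lane0;
}

/* What matters for an int16_t return value: only the low 16 bits. */
static int16_t as_int16_return(uint32_t eax)
{
    return (int16_t)(uint16_t)eax;
}

/* Model of a scalar "movsx eax, ax": sign-extend the low 16 bits. */
static int32_t model_movsx(uint32_t eax)
{
    return (int32_t)(int16_t)(uint16_t)eax;
}
```

With lane0 = 0xFFFE (that is, -2) and lane1 = 0x1234, both models yield -2 once
narrowed by as_int16_return. Read as a 32-bit integer, the VPEXTRW result is
65534 rather than -2, so model_movsx is still needed to get a sign-extended
value.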

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.
  ...
  vmovd  eax, xmm0
  rorx   edx, eax, 16
  add    eax, edx
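
The vmovd / rorx / add tail above can be checked with a small scalar model in
C. The function names are hypothetical. The model assumes eax holds the two
remaining 16-bit partial sums, lane 0 in bits 0..15 and lane 1 in bits 16..31;
after the rotate and add, the low 16 bits hold their sum modulo 2^16, and any
carry only affects the upper half, which is ignored for an int16_t result.

```c
#include <stdint.h>

/* Model of "rorx edx, eax, 16": rotate a 32-bit value right by 16,
   which swaps its two 16-bit halves. */
static uint32_t rotr32_by16(uint32_t x)
{
    return (x >> 16) | (x << 16);
}

/* Model of the scalar tail: add the swapped copy to the original and
   keep the low 16 bits.  The final narrowing conversion is
   implementation-defined in C but wraps on GCC and Clang. */
static int16_t scalar_tail(uint32_t eax)
{
    uint32_t edx = rotr32_by16(eax);
    eax += edx;
    return (int16_t)(uint16_t)eax;
}
```

For example, with lane 0 = -100 (0xFF9C) and lane 1 = 30 (0x001E), eax is
0x001EFF9C and scalar_tail returns -70.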