From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
	id 6A7E83858D35; Wed, 18 Jan 2023 12:33:23 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6A7E83858D35
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default;
	t=1674045203; bh=hU4lq+0Tb8mtPqYrEZKI4i6hwJCxQfZaQF0/j/+9M+s=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=FYUJi/XHnyWa0Hi/GtAgJX+6TISY1Q48giI/4rFbmaUwOISTuESZekYQTa9D5x3X+
	 x13zooJhsuy73B432fr/7SdpTnb8YRwSAwf06wuEivR+OB20hJY5cEKCM10klDhofa
	 2/GWX/5sCHK4+a7Vev4HaTGwnBIEYVelH+1lnMtE=
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/108410] x264 averaging loop not optimized well for avx512
Date: Wed, 18 Jan 2023 12:33:21 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #2 from Richard Biener ---
The naive masked epilogue (--param vect-partial-vector-usage=1 and support
for whilesiult as in a prototype I have) then looks like

        leal    -1(%rdx), %eax
        cmpl    $62, %eax
        jbe     .L11
.L11:
        xorl    %ecx, %ecx
        jmp     .L4
.L4:
        movl    %ecx, %eax
        subl    %ecx, %edx
        addq    %rax, %rsi
        addq    %rax, %rdi
        addq    %r8, %rax
        cmpl    $64, %edx
        jl      .L8
        kxorq   %k1, %k1, %k1
        kxnorq  %k1, %k1, %k1
.L7:
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}
.L21:
        vzeroupper
        ret
.L8:
        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        jmp     .L7

RTL isn't good at jump threading the mess caused by my ad-hoc whileult RTL
expansion - representing this at a higher level is probably the way to go.

What you basically should get for the epilogue (also used when the main
vectorized loop isn't entered) is:

        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}

that is, a compare of a vector of { niter, niter, ... } against { 0, 1, 2, 3, ... }
producing the mask (which has a latency of 3 according to Agner), followed by
the vectorized code itself, masked.  You can probably hand-code that in assembly
if you'd be interested in the (optimal) performance outcome.

For now we probably want to have the main loop traditionally vectorized without
masking, because Intel has poor mask support and AMD has bad latency on the
mask-producing compares.  But having a masked vectorized epilogue avoids the
need for a scalar epilogue, saving code size, and avoids the need to vectorize
that multiple times (or choosing SSE vectors here).  For Zen4 the above will of
course utilize two 512-bit op halves even when one is fully masked (well, I
suppose at least that this is the case).
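
For reference, a minimal sketch of that masked epilogue written with AVX-512BW
intrinsics, assuming the loop being vectorized is the usual x264 byte
averaging dst[i] = (src1[i] + src2[i] + 1) >> 1 (which is what vpavgb
computes); the function name and parameters here are illustrative, not taken
from the bug's testcase:

        /* Sketch of the masked tail: handles the final 0 <= n < 64 bytes with
           the vpbroadcastb + vpcmpb mask followed by masked loads/avg/store,
           matching the ideal epilogue shape shown above.
           Compile with e.g. gcc -O2 -mavx512bw.  */
        #include <immintrin.h>

        void
        avg_tail_masked (unsigned char *dst, const unsigned char *src1,
                         const unsigned char *src2, int n)
        {
          /* { 0, 1, 2, ..., 63 } - the .LC0 constant in the asm above.  */
          unsigned char iota_bytes[64];
          for (int i = 0; i < 64; i++)
            iota_bytes[i] = (unsigned char) i;
          __m512i iota = _mm512_loadu_si512 (iota_bytes);

          /* k = { 0, 1, 2, ... } < { n, n, n, ... }: the first n lanes are
             active (vpbroadcastb + vpcmpb $1).  */
          __mmask64 k = _mm512_cmplt_epi8_mask (iota, _mm512_set1_epi8 ((char) n));

          /* Masked zero-filling loads, rounding byte average, masked store.  */
          __m512i a = _mm512_maskz_loadu_epi8 (k, src1);
          __m512i b = _mm512_maskz_loadu_epi8 (k, src2);
          _mm512_mask_storeu_epi8 (dst, k, _mm512_avg_epu8 (a, b));
        }

This is the "mask first, then the plain vector body under that mask" shape the
comment argues the vectorizer should emit directly for the epilogue, rather
than the kxorq/kxnorq setup produced by the ad-hoc RTL expansion.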