From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/108410] x264 averaging loop not optimized well for avx512
Date: Fri, 09 Jun 2023 12:11:56 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #5 from Richard Biener ---
Btw, for the case where we can use the same type for the mask compare as
for the IV (so we know it can represent all required values) we can elide
the saturation.  So for example

void foo (double * __restrict a, double *b, double *c, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}

can produce

        testl   %ecx, %ecx
        jle     .L5
        vmovdqa .LC0(%rip), %ymm3
        vpbroadcastd    %ecx, %ymm2
        xorl    %eax, %eax
        subl    $8, %ecx
        vpcmpud $6, %ymm3, %ymm2, %k1
        .p2align 4
        .p2align 3
.L3:
        vmovupd (%rsi,%rax), %zmm1{%k1}
        vmovupd (%rdx,%rax), %zmm0{%k1}
        movl    %ecx, %r8d
        vaddpd  %zmm1, %zmm0, %zmm2{%k1}{z}
        addl    $8, %r8d
        vmovupd %zmm2, (%rdi,%rax){%k1}
        vpbroadcastd    %ecx, %ymm2
        addq    $64, %rax
        subl    $8, %ecx
        vpcmpud $6, %ymm3, %ymm2, %k1
        cmpl    $8, %r8d
        ja      .L3
        vzeroupper
.L5:
        ret

That should work as long as the data size matches or is larger than the IV
size, which is hopefully the case for all FP testcases.

The trick is going to be making this visible to costing - I'm not sure we
get to decide whether to use masking or not when we do not also want to
decide between vector sizes (the x86 backend picks the first successful
one).  For SVE it is either masking (with SVE modes) or no masking (with
NEON modes), so there it is decided by the mode rather than as an
additional knob.

Performance-wise the above is likely still slower than not masking the
main loop and using a masked epilog, but it would actually save code size
at -Os or -O2.  Of course for code size we might want to stick to SSE/AVX
for the smaller encoding.

Note we have to watch out for all-zero masks on masked stores since those
are very slow (for a reason unknown to me); when a stmt is split into
multiple vector stmts it is not uncommon (especially for the epilog) for
one of them to end up with an all-zero mask.  For the loop case and
.MASK_STORE we emit branchy code for this, but we might want to avoid the
situation via costing (and not use a masked loop/epilog in that case).
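For contrast, the loop this PR is about averages unsigned char pixels; a
rough stand-in for that kernel (illustrative only, not copied from the
testcase) is:

#include <stdint.h>

/* Rough sketch of the kind of kernel the PR is about: uint8_t pixel
   averaging.  Here the element size (1 byte) is smaller than the IV size
   (4 bytes), so the remaining iteration count cannot be represented in
   the narrow mask-compare lanes and would still need to be saturated -
   the case the trick above does not cover.  */
void
avg (uint8_t *__restrict dst, const uint8_t *a, const uint8_t *b, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = (a[i] + b[i] + 1) >> 1;
}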
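For comparison, the "no masking plus masked epilog" shape mentioned above
would look roughly like the intrinsics sketch below (hand-written for
illustration; the function name, mask computation and structure are made
up here and are not what the vectorizer emits; needs -mavx512f):

#include <immintrin.h>

/* Unmasked full-width main loop followed by a single masked epilog
   iteration handling the remaining n % 8 elements.  */
void
foo_epilog (double *__restrict a, double *b, double *c, int n)
{
  int i = 0;
  for (; i + 8 <= n; i += 8)
    {
      __m512d x = _mm512_loadu_pd (b + i);
      __m512d y = _mm512_loadu_pd (c + i);
      _mm512_storeu_pd (a + i, _mm512_add_pd (x, y));
    }
  if (i < n)
    {
      /* Mask with the low n - i bits set (1 <= n - i <= 7).  */
      __mmask8 k = (__mmask8) ((1u << (n - i)) - 1);
      __m512d x = _mm512_maskz_loadu_pd (k, b + i);
      __m512d y = _mm512_maskz_loadu_pd (k, c + i);
      _mm512_mask_storeu_pd (a + i, k, _mm512_add_pd (x, y));
    }
}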