From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id AC1C2385841F; Tue, 11 Oct 2022 10:59:59 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org AC1C2385841F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1665485999;
	bh=QLYrzU/E/w9bnXhNwRbehLFtKOAkw6emIiQRpTXMNw4=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=JEyfm+jpATHhZa4GsdYROhThTUshJnu13ehXs/A7mILPjWTYNMrc4IHKT0JT26CHr
	 s/ZTX7uJEG3pQwRzpyiK+YzAXi38strQdWdYttFBid6uV48ejXR7xXjVLPD/0QTQWU
	 ekioVWgwu83DKzEtD0MmHWjvlqCV2O//OFcQ7YsA=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/107093] AVX512 mask operations not simplified in fully
 masked loop
Date: Tue, 11 Oct 2022 10:59:33 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-107093-4-12q7L6zD0h@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-107093-4@http.gcc.gnu.org/bugzilla/>
References: <bug-107093-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> Also I think masked epilog (--param=vect-partial-vector-usage=1) should be
> good for general cases under AVX512, especially when main loop's vector
> width is 512, and the remaining tripcount is not enough for 256-bit
> vectorization but ok for 128-bit vectorization.

Yes, for the fully masked variant I was mostly targeting -O2 with its
very-cheap (size-wise) cost model.  Since we don't vectorize the
epilogue of a vectorized epilogue (yet), going fully masked there
should indeed help.  Also, when we start to use the unroll hint, the
vectorized epilogue might get full-width iterations to handle as well.
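
To make the two loop shapes concrete, here is a minimal scalar sketch in
C.  It only emulates the lane masking; it is not what GCC emits.  The
names VF, iteration_mask, masked_add_block, add_masked_epilogue and
add_fully_masked are hypothetical, and 16 lanes assumes 32-bit elements
in a 512-bit vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 16 lanes, i.e. 32-bit elements in a 512-bit vector. */
enum { VF = 16 };

/* Bit k is set when element i + k is still inside the array of n elements. */
uint16_t iteration_mask(size_t i, size_t n)
{
    assert(i <= n);
    size_t remaining = n - i;
    return remaining >= VF ? (uint16_t)0xFFFF
                           : (uint16_t)((1u << remaining) - 1u);
}

/* Emulates one masked vector add and store: inactive lanes are untouched. */
void masked_add_block(float *dst, const float *a, const float *b,
                      uint16_t mask)
{
    for (int lane = 0; lane < VF; ++lane)
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
}

/* Shape A: unmasked main loop, then at most one masked epilogue iteration. */
void add_masked_epilogue(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (int lane = 0; lane < VF; ++lane)
            dst[i + lane] = a[i + lane] + b[i + lane];
    if (i < n)
        masked_add_block(dst + i, a + i, b + i, iteration_mask(i, n));
}

/* Shape B: fully masked loop; every iteration is masked, no epilogue. */
void add_fully_masked(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += VF)
        masked_add_block(dst + i, a + i, b + i, iteration_mask(i, n));
}
```

Both shapes write the same elements; in shape B the mask is all-ones for
every iteration except possibly the last one.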

One downside of a fully masked body is that we're using masked stores,
which usually have higher latency due to their "merge" semantics: they
need an extra memory input plus a merge operation.  I'm not sure whether
modern uArchs can optimize the all-ones mask case.  For .MASK_STORE the
vectorizer still has code that emits a mask compare against all-zeros
and only conditionally performs the .MASK_STORE.  That could be enhanced
to single out the all-ones case, at least for the .MASK_STOREs in a
fully masked main loop where the mask comes only from the iteration
(rather than from conditional execution).
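

The dispatch described above could look roughly like the following
scalar C sketch.  It is an illustration under assumptions, not the
vectorizer's code: store_block, store_kind and LANES are hypothetical
names, the 16-bit mask assumes 16 lanes, and the all-ones branch is the
suggested enhancement rather than existing behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 16 lanes, so one mask bit per lane fits in a uint16_t. */
enum { LANES = 16 };

typedef enum { STORE_SKIPPED, STORE_FULL, STORE_MASKED } store_kind;

/* Stores the active lanes of src into dst and reports which path ran. */
store_kind store_block(float *dst, const float *src, uint16_t mask)
{
    assert(dst != NULL && src != NULL);

    /* All-zeros: no lane is active, so skip the store entirely. */
    if (mask == 0)
        return STORE_SKIPPED;

    /* All-ones: every lane is written, so a plain store suffices and
       no merge with the old memory contents is needed. */
    if (mask == (uint16_t)0xFFFF) {
        memcpy(dst, src, LANES * sizeof *dst);
        return STORE_FULL;
    }

    /* Partial mask: merge semantics, inactive lanes keep their old value. */
    for (int lane = 0; lane < LANES; ++lane)
        if (mask & (1u << lane))
            dst[lane] = src[lane];
    return STORE_MASKED;
}
```

In a fully masked main loop whose mask comes only from the iteration
count, all but possibly the last iteration would take the STORE_FULL
path.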