From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/107093] AVX512 mask operations not simplified in fully masked loop
Date: Tue, 11 Oct 2022 10:59:33 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #7 from Richard Biener ---
(In reply to Hongtao.liu from comment #5)
> Also I think masked epilog (--param=vect-partial-vector-usage=1) should be
> good for general cases under AVX512, especially when main loop's vector
> width is 512, and the remain tripcount is not enough for 256-bit
> vectorization but ok for 128-bit vectorization.

Yes, for the fully masked variant I was mostly targeting -O2 with its
very-cheap (size wise) cost model.
Since we don't (yet) vectorize the epilogue of a vectorized epilogue, going
fully masked there should indeed help. Also, once we start to use the unroll
hint, the vectorized epilogue might get full-width iterations to handle as
well.

One downside of a fully masked body is that it uses masked stores, which
usually have higher latency because of their "merge" semantics: an extra
memory input plus a merge operation. I'm not sure whether modern uarchs can
optimize the all-ones mask case. For .MASK_STORE, the vectorizer still has
code that emits a compare of the mask against all-zeros and performs the
.MASK_STORE only conditionally. That could be enhanced to single out the
all-ones case, at least for the .MASK_STOREs in a fully masked main loop
where the mask comes only from the iteration (rather than from conditional
execution).
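
To make the all-ones special case concrete, here is a minimal scalar sketch in C. It is not GCC's implementation and uses no AVX512 intrinsics; it only models the control flow. The names loop_mask, store_full, store_masked, masked_add_one and struct store_stats are hypothetical, and VF = 16 assumes 16 x 32-bit lanes in a 512-bit vector. The loop mask is derived only from the remaining iteration count, so every iteration except possibly the last sees an all-ones mask and can take the plain store path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VF = 16 };                     /* assumed lanes per vector: 16 x int32 */
#define ALL_ONES ((uint16_t)0xFFFF)   /* every lane active */

/* Hypothetical counters, only here to show which store path was taken. */
struct store_stats { int full_stores; int masked_stores; int skipped_stores; };

/* Loop mask from the iteration only: lane l is active iff l < remaining. */
static uint16_t loop_mask(size_t remaining)
{
    return remaining >= VF ? ALL_ONES : (uint16_t)((1u << remaining) - 1u);
}

/* Models a plain vector store: no memory input, no merge. */
static void store_full(int32_t *dst, const int32_t *val)
{
    for (int l = 0; l < VF; l++)
        dst[l] = val[l];
}

/* Models a masked store: inactive lanes keep their old memory contents. */
static void store_masked(int32_t *dst, const int32_t *val, uint16_t mask)
{
    for (int l = 0; l < VF; l++)
        if ((mask >> l) & 1)
            dst[l] = val[l];
}

/* Fully masked loop computing dst[i] = src[i] + 1 for i in [0, n). */
static struct store_stats masked_add_one(int32_t *dst, const int32_t *src, size_t n)
{
    struct store_stats st = {0, 0, 0};
    for (size_t i = 0; i < n; i += VF) {
        uint16_t mask = loop_mask(n - i);
        int32_t val[VF];
        for (int l = 0; l < VF; l++)  /* masked load + add */
            val[l] = ((mask >> l) & 1) ? src[i + l] + 1 : 0;

        if (mask == ALL_ONES) {       /* proposed: single out all-ones */
            store_full(dst + i, val);
            st.full_stores++;
        } else if (mask != 0) {       /* existing idea: skip when all-zeros */
            store_masked(dst + i, val, mask);
            st.masked_stores++;
        } else {
            st.skipped_stores++;      /* unreachable for a pure loop mask */
        }
    }
    return st;
}
```

For n = 19 this model takes the full-store path once (lanes 0..15) and the masked path once (the remaining 3 lanes). The all-zeros branch only matters when the mask also encodes a condition from the loop body, which this sketch leaves out. Whether the extra branch pays off on real hardware is exactly the open question above.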