From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 118468 invoked by alias); 27 Oct 2015 03:07:57 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 118443 invoked by uid 48); 27 Oct 2015 03:07:52 -0000
From: "haneef503 at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug other/68109] New: GCC fails to vectorize popcount on x86_64
Date: Tue, 27 Oct 2015 03:07:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: other
X-Bugzilla-Version: 5.2.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: haneef503 at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone attachments.created
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-10/txt/msg02198.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109

            Bug ID: 68109
           Summary: GCC fails to vectorize popcount on x86_64
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: haneef503 at gmail dot com
  Target Milestone: ---

Created attachment 36595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36595&action=edit
Clang Vectorized Assembly Output

The following code is an SSCCE that GCC doesn't vectorize on x86_64:

#include <stddef.h>
#include <stdint.h>

size_t hd (const uint8_t *restrict a, const uint8_t *restrict b, size_t l)
{
  size_t r = 0, x;

  /* Hamming distance: count the bits that differ between a[x] and b[x]. */
  for (x = 0; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);

  return r;
}

On other architectures, such as power8, GCC successfully vectorizes the
loop. On x86_64, however, there is no vector version of the `popcnt`
instruction. Despite this, as shown by
[http://wm.ite.pl/articles/sse-popcount.html], it is possible to vectorize
popcount using SSE2 or SSSE3 instructions.

Further research on
[https://software.intel.com/sites/landingpage/IntrinsicsGuide/] shows that it
may be possible to achieve further performance gains on the latest
architectures by using AVX2 instructions along the same lines as in the
article, with 256-bit YMM registers in place of the 128-bit XMM registers
used there.

Since GCC often has support for as-yet-unreleased architectures, I did a bit
more research in the Intel Intrinsics Guide mentioned above and found that
the same could likely also be done using AVX-512 with the 512-bit ZMM
registers, if you guys are interested.

Anyway, I did find that clang has been doing these optimizations since
~clang3.5. I've attached the resulting [vectorized] assembly emitted by
clang3.7 for the above function, since it appears to be done relatively
thoroughly and cleanly.

In both GCC and Clang, I used the following flags:

-xc -O2 -ftree-vectorize -D_GNU_SOURCE -std=gnu11 -fverbose-asm
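
For illustration, here is a minimal sketch of what a hand-vectorized SSE2 version of the loop above could look like. It is not the code clang emits and not the exact code from the linked article. It is just the usual bit-parallel per-byte popcount, with `_mm_sad_epu8` used to sum the byte counts. The names `hd_scalar` and `hd_sse2` are made up for this example, and the scalar fallback for targets without SSE2 is my own assumption about how one would keep it portable.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: the same loop as in the report (hypothetical name). */
size_t hd_scalar(const uint8_t *a, const uint8_t *b, size_t l)
{
  size_t r = 0, x;
  for (x = 0; x < l; x++)
    r += (size_t)__builtin_popcount(a[x] ^ b[x]);
  return r;
}

/* Sketch of an SSE2 version (hypothetical name): 16 bytes per iteration. */
size_t hd_sse2(const uint8_t *a, const uint8_t *b, size_t l)
{
  size_t r = 0, x = 0;
#if defined(__SSE2__)
  const __m128i m1 = _mm_set1_epi8(0x55);
  const __m128i m2 = _mm_set1_epi8(0x33);
  const __m128i m4 = _mm_set1_epi8(0x0f);
  const __m128i zero = _mm_setzero_si128();
  __m128i acc = zero;              /* two 64-bit partial sums */
  uint64_t lanes[2];

  for (; x + 16 <= l; x += 16) {
    __m128i v = _mm_xor_si128(_mm_loadu_si128((const __m128i *)(a + x)),
                              _mm_loadu_si128((const __m128i *)(b + x)));
    /* Bit-parallel popcount within each byte. The 16-bit shifts pull bits
       across byte boundaries, but the masks discard them again. */
    v = _mm_sub_epi8(v, _mm_and_si128(_mm_srli_epi16(v, 1), m1));
    v = _mm_add_epi8(_mm_and_si128(v, m2),
                     _mm_and_si128(_mm_srli_epi16(v, 2), m2));
    v = _mm_and_si128(_mm_add_epi8(v, _mm_srli_epi16(v, 4)), m4);
    /* Each byte now holds 0..8; sum each group of 8 bytes into a 64-bit lane. */
    acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
  }
  _mm_storeu_si128((__m128i *)lanes, acc);
  r = (size_t)(lanes[0] + lanes[1]);
#endif
  /* Scalar tail (or the whole input when SSE2 is unavailable). */
  for (; x < l; x++)
    r += (size_t)__builtin_popcount(a[x] ^ b[x]);
  return r;
}
```

SSE2 is part of the x86_64 baseline, so this should build with plain -O2 and no extra -m flags. The SSSE3 variant from the article replaces the three mask-and-shift steps with a `pshufb` nibble lookup, and would need -mssse3.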