From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-500643-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 118468 invoked by alias); 27 Oct 2015 03:07:57 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 118443 invoked by uid 48); 27 Oct 2015 03:07:52 -0000
From: "haneef503 at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug other/68109] New: GCC fails to vectorize popcount on x86_64
Date: Tue, 27 Oct 2015 03:07:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: other
X-Bugzilla-Version: 5.2.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: haneef503 at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone attachments.created
Message-ID: <bug-68109-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-10/txt/msg02198.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109

            Bug ID: 68109
           Summary: GCC fails to vectorize popcount on x86_64
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: haneef503 at gmail dot com
  Target Milestone: ---

Created attachment 36595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36595&action=edit
Clang Vectorized Assembly Output

The following code is an SSCCE that GCC doesn't vectorize on x86_64:

#include <stdlib.h>
#include <stdint.h>

size_t hd (const uint8_t *restrict a, const uint8_t *restrict b, size_t l) {
  size_t r = 0, x;
  for (x = 0; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);

  return r;
}

On other architectures, such as power8, GCC successfully vectorizes the loop.
x86_64, however, has no vector version of the `popcnt` instruction. Even so, as
shown by [http://wm.ite.pl/articles/sse-popcount.html], it is possible to
vectorize popcount using SSE2 or SSSE3 instructions. Further research on
[https://software.intel.com/sites/landingpage/IntrinsicsGuide/] suggests that
further performance gains may be possible on the latest architectures by using
AVX2 instructions along the same lines as in the article, with 256-bit YMM
registers in place of the 128-bit XMM registers used there. Since GCC often
supports as-yet-unreleased architectures, I also checked the Intel Intrinsics
Guide for future architectures and found that the same could likely be done
using AVX-512 with the 512-bit ZMM registers, if you guys are interested.
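
To make the idea concrete, here is a rough sketch of how the loop above could be
vectorized by hand with SSE2 only. It uses the classic bit-parallel per-byte
popcount followed by `psadbw` to sum the byte counts. The names `hd_sse2` and
`popcount_bytes_sse2` are made up for illustration; this is not the code that
Clang emits, and it is not copied from the article. When `__SSE2__` is not
defined it simply falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Hypothetical helper: popcount of each byte of v, using only SSE2.
   The 16-bit shifts leak bits across byte boundaries, but every leaked
   bit lands in a position that the following mask clears. */
static __m128i popcount_bytes_sse2(__m128i v) {
  const __m128i m1 = _mm_set1_epi8(0x55);
  const __m128i m2 = _mm_set1_epi8(0x33);
  const __m128i m4 = _mm_set1_epi8(0x0f);
  /* 2-bit partial sums: v - ((v >> 1) & 0x55) */
  v = _mm_sub_epi8(v, _mm_and_si128(_mm_srli_epi16(v, 1), m1));
  /* 4-bit partial sums */
  v = _mm_add_epi8(_mm_and_si128(v, m2),
                   _mm_and_si128(_mm_srli_epi16(v, 2), m2));
  /* 8-bit sums: each byte now holds a count in 0..8 */
  v = _mm_and_si128(_mm_add_epi8(v, _mm_srli_epi16(v, 4)), m4);
  return v;
}
#endif

/* Hypothetical hand-vectorized equivalent of hd() from the report. */
size_t hd_sse2(const uint8_t *a, const uint8_t *b, size_t l) {
  size_t r = 0, x = 0;
#if defined(__SSE2__)
  __m128i acc = _mm_setzero_si128();
  for (; x + 16 <= l; x += 16) {
    __m128i va = _mm_loadu_si128((const __m128i *)(a + x));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + x));
    __m128i cnt = popcount_bytes_sse2(_mm_xor_si128(va, vb));
    /* psadbw against zero adds each group of 8 bytes into a 64-bit lane */
    acc = _mm_add_epi64(acc, _mm_sad_epu8(cnt, _mm_setzero_si128()));
  }
  uint64_t lanes[2];
  _mm_storeu_si128((__m128i *)lanes, acc);
  r = (size_t)(lanes[0] + lanes[1]);
#endif
  /* Scalar tail (or the whole loop when SSE2 is unavailable) */
  for (; x < l; x++)
    r += (size_t)__builtin_popcount(a[x] ^ b[x]);
  return r;
}
```

The SSSE3 variant from the article would replace the helper with a `pshufb`
nibble lookup, and an AVX2 version would presumably have the same shape with
256-bit loads, but I have not written or measured those here.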

I also found that Clang has been doing this optimization since roughly
Clang 3.5. I've attached the resulting [vectorized] assembly emitted by
Clang 3.7 for the above function, since it appears to be done relatively
thoroughly and cleanly.

In both GCC and Clang, I used the following flags:

-xc -O2 -ftree-vectorize -D_GNU_SOURCE -std=gnu11 -fverbose-asm