public inbox for gcc-bugs@sourceware.org
* [Bug other/68109] New: GCC fails to vectorize popcount on x86_64
@ 2015-10-27  3:07 haneef503 at gmail dot com
  2015-10-27  9:51 ` [Bug target/68109] " rguenth at gcc dot gnu.org
  2021-08-16  4:50 ` [Bug tree-optimization/68109] " pinskia at gcc dot gnu.org
  0 siblings, 2 replies; 3+ messages in thread
From: haneef503 at gmail dot com @ 2015-10-27  3:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109

            Bug ID: 68109
           Summary: GCC fails to vectorize popcount on x86_64
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: haneef503 at gmail dot com
  Target Milestone: ---

Created attachment 36595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36595&action=edit
Clang Vectorized Assembly Output

The following code is an SSCCE that GCC doesn't vectorize on x86_64:

#include <stdlib.h>
#include <stdint.h>

size_t hd (const uint8_t *restrict a, const uint8_t *restrict b, size_t l) {
  size_t r = 0, x;
  for (x = 0; x < l; x++)
    r += __builtin_popcount (a[x] ^ b[x]);

  return r;
}

On other architectures, such as power8, GCC successfully vectorizes the loop.
On x86_64, however, there is no vector version of the `popcnt` instruction.
Despite this, as shown by [http://wm.ite.pl/articles/sse-popcount.html], it is
possible to vectorize popcount using SSE2 or SSSE3 instructions. Further
research on [https://software.intel.com/sites/landingpage/IntrinsicsGuide/]
suggests that further performance gains may be possible on the latest
architectures by using AVX2 instructions along the same lines as in the
article, albeit with 256-bit YMM registers in place of the 128-bit XMM
registers used in the article. Since GCC often supports architectures that
have not yet been released, I did a bit more research in the Intel Intrinsics
Guide mentioned above and found that the same could likely also be done using
AVX-512 with the 512-bit ZMM registers, if you guys are interested.
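
For illustration, here is a minimal sketch of the SSSE3 approach, written with
intrinsics rather than as compiler output. It assumes the nibble lookup-table
technique (pshufb against a 16-entry table of bit counts, then psadbw to sum
the bytes). The function name hd_ssse3 is hypothetical, and this is not
necessarily the sequence clang emits or the one GCC would choose. The `target`
attribute lets it build without -mssse3, and non-x86 builds fall back to the
scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: Hamming distance using an SSSE3 nibble lookup. */
__attribute__((target("ssse3")))
size_t hd_ssse3(const uint8_t *a, const uint8_t *b, size_t l)
{
    /* lut[i] holds the number of set bits in the 4-bit value i. */
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low_mask = _mm_set1_epi8(0x0f);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;              /* two 64-bit partial sums */
    size_t x = 0;

    for (; x + 16 <= l; x += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + x));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + x));
        __m128i v = _mm_xor_si128(va, vb);
        /* Split every byte into its low and high nibble. */
        __m128i lo = _mm_and_si128(v, low_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
        /* pshufb looks up the bit count of each nibble; max 8 per byte. */
        __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                                   _mm_shuffle_epi8(lut, hi));
        /* psadbw against zero sums each group of 8 bytes into 64 bits. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(cnt, zero));
    }

    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    size_t r = (size_t)(lanes[0] + lanes[1]);

    /* Scalar tail for the remaining l % 16 bytes. */
    for (; x < l; x++)
        r += (size_t)__builtin_popcount(a[x] ^ b[x]);
    return r;
}
#else
/* Non-x86 fallback: the plain scalar loop from the report. */
size_t hd_ssse3(const uint8_t *a, const uint8_t *b, size_t l)
{
    size_t r = 0;
    for (size_t x = 0; x < l; x++)
        r += (size_t)__builtin_popcount(a[x] ^ b[x]);
    return r;
}
#endif
```

An AVX2 variant would presumably follow the same shape with the 256-bit
equivalents of these intrinsics and a 32-byte stride.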

Anyways, I did find that clang has been doing these optimizations since
~clang3.5. I've attached an output of the resulting [vectorized] assembly
emitted by clang3.7 for the above function, since it appears to be done
relatively thoroughly and cleanly.

In both GCC and Clang, I used the following flags:

-xc -O2 -ftree-vectorize -D_GNU_SOURCE  -std=gnu11 -fverbose-asm



* [Bug target/68109] GCC fails to vectorize popcount on x86_64
  2015-10-27  3:07 [Bug other/68109] New: GCC fails to vectorize popcount on x86_64 haneef503 at gmail dot com
@ 2015-10-27  9:51 ` rguenth at gcc dot gnu.org
  2021-08-16  4:50 ` [Bug tree-optimization/68109] " pinskia at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-10-27  9:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*, i?86-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-10-27
          Component|other                       |target
             Blocks|                            |53947
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The target would have to provide the necessary target
builtin/expander.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations



* [Bug tree-optimization/68109] GCC fails to vectorize popcount on x86_64
  2015-10-27  3:07 [Bug other/68109] New: GCC fails to vectorize popcount on x86_64 haneef503 at gmail dot com
  2015-10-27  9:51 ` [Bug target/68109] " rguenth at gcc dot gnu.org
@ 2021-08-16  4:50 ` pinskia at gcc dot gnu.org
  1 sibling, 0 replies; 3+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-16  4:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68109

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |tree-optimization

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Could there be generic support for popcount added?
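
As a rough sketch of what a target-independent fallback might compute, the
per-byte popcount can be written with only shifts, masks, and adds, all of
which have vector forms on most targets. The names popcount8_generic and
hd_generic are hypothetical, and this only illustrates the idea; it is not a
description of how GCC does or would implement it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bit count of one byte using shifts, masks, adds. */
static inline uint8_t popcount8_generic(uint8_t v)
{
    v = (uint8_t)(v - ((v >> 1) & 0x55));           /* 2-bit partial sums */
    v = (uint8_t)((v & 0x33) + ((v >> 2) & 0x33));  /* 4-bit partial sums */
    return (uint8_t)((v + (v >> 4)) & 0x0f);        /* final count, 0..8  */
}

/* Same loop as in the report, with the builtin replaced by the helper. */
size_t hd_generic(const uint8_t *restrict a, const uint8_t *restrict b,
                  size_t l)
{
    size_t r = 0;
    for (size_t x = 0; x < l; x++)
        r += popcount8_generic((uint8_t)(a[x] ^ b[x]));
    return r;
}
```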
