public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform
@ 2014-11-09 12:14 marcus.kool at urlfilterdb dot com
  2014-11-09 16:28 ` [Bug c/63791] " jakub at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: marcus.kool at urlfilterdb dot com @ 2014-11-09 12:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

            Bug ID: 63791
           Summary: use 32-byte version of vpbroadcastb on AVX2 platform
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcus.kool at urlfilterdb dot com

Created attachment 33926
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33926&action=edit
code with _mm256_set1_epi8, _mm256_loadu_si256, _mm256_cmpeq_epi8,
_mm256_movemask_epi8

With gcc 4.9.2 and compile options 
-std=c99 -mavx2 -mbmi -mbmi2 -O3 -fno-tree-vectorize
on an Intel Haswell CPU
the intrinsic function _mm256_set1_epi8() generates 3 instructions while it
could do better with only 2 instructions.

Generated code is either
   vmovd         reg, xmmreg
   vpbroadcastb  xmmreg, xmmreg
   vinserti128   $1, xmmreg, ymmreg, ymmreg
or
   vmovd         reg, xmmreg
   vpbroadcastb  xmmreg, xmmreg
   vperm2i128    $0, ymmreg, ymmreg, ymmreg

But it could generate faster code instead:
   vmovd         reg, xmmreg
   vpbroadcastb  xmmreg, ymmreg

Example C source is in the attachment.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug c/63791] use 32-byte version of vpbroadcastb on AVX2 platform
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
@ 2014-11-09 16:28 ` jakub at gcc dot gnu.org
  2015-05-01 13:51 ` [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms marcus.kool at urlfilterdb dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: jakub at gcc dot gnu.org @ 2014-11-09 16:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-11-09
                 CC|                            |jakub at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This got fixed on the trunk with r216541.  As it isn't a regression, I think no
plans to backport that, there have been lots of permutation/broadcast changes
and improvements on the trunk, most of them related to AVX512, but some of them
AVX2 specific too.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
  2014-11-09 16:28 ` [Bug c/63791] " jakub at gcc dot gnu.org
@ 2015-05-01 13:51 ` marcus.kool at urlfilterdb dot com
  2015-05-01 13:53 ` marcus.kool at urlfilterdb dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: marcus.kool at urlfilterdb dot com @ 2015-05-01 13:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

Marcus Kool <marcus.kool at urlfilterdb dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|use 32-byte version of      |use 32-byte version of
                   |vpbroadcastb on AVX2        |vpbroadcastb (and register
                   |platform                    |to poulate) on AVX/AVX2
                   |                            |platforms
      Known to fail|                            |4.8.4, 4.9.2, 5.1.0
           Severity|minor                       |normal

--- Comment #2 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
After the comment of Jakub I waited for the release of gcc 5.1.0 but
performance of programs that use *_set1_epi8() got 6% worse because gcc 5.1.0
now uses vpbroadcastb in the intended way but to populate the ymm register it
uses slow memory instead of a register.

This is what 5.1.0 generates:

movl            %edi, -20(%rbp)
vpbroadcastb    -20(%rbp), %ymm0

while this is optimal:
   vmovd         %edi, %xmm0
   vpbroadcastb  %xmm0, %ymm0

Also for the AVX platform (see attachment avx.c) gcc 5.1.0 also uses memory and
many instructions to populate an xmm register:
        movl    %edi, -12(%rsp)
        vpxor   %xmm1, %xmm1, %xmm1
        vmovd   -12(%rsp), %xmm0
        xorl    %eax, %eax
        vpshufb %xmm1, %xmm0, %xmm0
where
        vmovd   %edi, %xmm0
        vpbroadcastb %xmm0, %xmm0
is optimal.

To resume, gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised further,
and gcc 5.1.0 produces even slower code which means that the implementation of
*_set1_epi8() is slower/much-slower than that it can be.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
  2014-11-09 16:28 ` [Bug c/63791] " jakub at gcc dot gnu.org
  2015-05-01 13:51 ` [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms marcus.kool at urlfilterdb dot com
@ 2015-05-01 13:53 ` marcus.kool at urlfilterdb dot com
  2015-05-01 14:10 ` marcus.kool at urlfilterdb dot com
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: marcus.kool at urlfilterdb dot com @ 2015-05-01 13:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

--- Comment #3 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
Created attachment 35436
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35436&action=edit
example code to show code generation on AVX platform (avx.c)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
                   ` (2 preceding siblings ...)
  2015-05-01 13:53 ` marcus.kool at urlfilterdb dot com
@ 2015-05-01 14:10 ` marcus.kool at urlfilterdb dot com
  2015-05-01 18:43 ` hjl.tools at gmail dot com
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: marcus.kool at urlfilterdb dot com @ 2015-05-01 14:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

--- Comment #4 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
>         movl    %edi, -12(%rsp)
>         vpxor   %xmm1, %xmm1, %xmm1
>         vmovd   -12(%rsp), %xmm0
>         xorl    %eax, %eax
>         vpshufb %xmm1, %xmm0, %xmm0

The xorl instruction is part of an other statement and is not part of the code
generated by _mm_set1_epi8().


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
                   ` (3 preceding siblings ...)
  2015-05-01 14:10 ` marcus.kool at urlfilterdb dot com
@ 2015-05-01 18:43 ` hjl.tools at gmail dot com
  2015-05-01 22:40 ` marcus.kool at urlfilterdb dot com
  2015-05-01 23:07 ` pinskia at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: hjl.tools at gmail dot com @ 2015-05-01 18:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |INVALID

--- Comment #5 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Marcus Kool from comment #2)
>
> To resume, gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised
> further, and gcc 5.1.0 produces even slower code which means that the
> implementation of *_set1_epi8() is slower/much-slower than that it can be.

That is done on purpose.  Add -mtune=intel will get what you want.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
                   ` (4 preceding siblings ...)
  2015-05-01 18:43 ` hjl.tools at gmail dot com
@ 2015-05-01 22:40 ` marcus.kool at urlfilterdb dot com
  2015-05-01 23:07 ` pinskia at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: marcus.kool at urlfilterdb dot com @ 2015-05-01 22:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

Marcus Kool <marcus.kool at urlfilterdb dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |---

--- Comment #6 from Marcus Kool <marcus.kool at urlfilterdb dot com> ---
(In reply to H.J. Lu from comment #5)
> (In reply to Marcus Kool from comment #2)
> >
> > To resume, gcc 4.8.4 and gcc 4.9.2 produce code that can be optimised
> > further, and gcc 5.1.0 produces even slower code which means that the
> > implementation of *_set1_epi8() is slower/much-slower than that it can be.
> 
> That is done on purpose.  Add -mtune=intel will get what you want.

I do not understand why with the "-O3 -mavx2" flags gcc "on purpose" produces 4
instructions and the compiler flag "-mtune=intel" is mandatory to improve this. 

The -mavx2 flags says that the platform is AVX2 (i.e. Haswell and better).
I read the man page of 5.1.0 and must say that I am frustrated with the
explanation given for the flag "-mtune=intel".  I always thought that -O3 was
the way to tell the compiler "do your best to make the code run fast" and up to
gcc 4.9.2, "-mavx2" was good enough to say to generate code for AVX2.

Resuming, for gcc previous to 5.x, "-O3 -mavx2" did the trick and starting with
5.x "-O3 -mavx2 -mtune=intel" does the trick.  And the additional caveat is
that the behaviour of "-mtune=intel" can change in the future.  
What is the reasoning behind this ?


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms
  2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
                   ` (5 preceding siblings ...)
  2015-05-01 22:40 ` marcus.kool at urlfilterdb dot com
@ 2015-05-01 23:07 ` pinskia at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2015-05-01 23:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63791

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |INVALID

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The reason is because -mavx2 is not a tuning option.  It does not or should not
change the tuning of the code.  It just enables an ISA feature of the produced
code.  If you want explicit tuning for intel cores, use -mtune=intel.

The reason why -mtune=intel can change in the future is because Intel cores are
not set in stone and newer intel cores can and will have different tuning
parameters.  -mtune=intel is based on the current majority of intel cores at
the released time of GCC.  So if say 10 years down the line, some newer intel
core which has a major tuning difference to the current ones; -mtune=intel will
change for those cores rather than stay constant for the ones now (2015).


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-05-01 23:07 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-09 12:14 [Bug c/63791] New: use 32-byte version of vpbroadcastb on AVX2 platform marcus.kool at urlfilterdb dot com
2014-11-09 16:28 ` [Bug c/63791] " jakub at gcc dot gnu.org
2015-05-01 13:51 ` [Bug target/63791] use 32-byte version of vpbroadcastb (and register to poulate) on AVX/AVX2 platforms marcus.kool at urlfilterdb dot com
2015-05-01 13:53 ` marcus.kool at urlfilterdb dot com
2015-05-01 14:10 ` marcus.kool at urlfilterdb dot com
2015-05-01 18:43 ` hjl.tools at gmail dot com
2015-05-01 22:40 ` marcus.kool at urlfilterdb dot com
2015-05-01 23:07 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).