public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/103897] New: x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*
From: nekotekina at gmail dot com @ 2022-01-03 11:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103897

            Bug ID: 103897
           Summary: x86: Missing optimizations with _mm_undefined_si128
                    and PMOVSX*
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I was trying to use VPMOVSXWD and other PMOVSX* intrinsics and also to
emulate them for SSE2 targets. I noticed at least two distinct problems:
1) (V)PMOVSX* can take a memory operand, but GCC emits a separate load instruction.
2) _mm_undefined_si128() always generates an additional zeroing instruction.
Using it in combination with unpack and arithmetic shift instructions looks
like the optimal way to emulate PMOVSX on SSE2 targets.

The Godbolt example includes Clang output for comparison.

https://godbolt.org/z/KE8q9v6qG

#include <emmintrin.h>
#include <immintrin.h>

__attribute__((__target__("avx"))) void test0(__m128i* dst, __m128i* src)
{
    // Should be a single VPMOVSXWD with a memory operand, but emits 2 instructions
    // (GCC 8.5 appears to have done better)
    *dst = _mm_cvtepi16_epi32(*src);
}

void test1(__m128i* dst, __m128i* src)
{
    // Emulate VPMOVSXWD: GCC emits a zeroing instruction just for _mm_undefined_si128
    *dst = _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(), *src), 16);
}

void test2(__m128i* dst, __m128i* src)
{
    // Zeroes a register, although it could simply reuse the PSLLW result
    *dst = _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(), _mm_slli_epi16(*src, 1)), 16);
}

void test3(__m128i* dst, __m128i* src)
{
    // Similar to test1, but emulate "high" VPMOVSXWD
    *dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), *src), 16);
}

__attribute__((__target__("avx"))) void test4(__m128i* dst, __m128i* src)
{
    // Bonus (not sure what the idiomatic way to sign-extend the high part is)
    *dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), *src), 16);
}

__attribute__((__target__("avx"))) void test5(__m128i* dst, __m128i* src)
{
    // Emits two zeroing instructions
    *dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), _mm_packs_epi16(_mm_undefined_si128(), *src)), 16);
}
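
To see why the contents of the _mm_undefined_si128() operand do not matter in test1 and test3, here is a minimal self-contained check, assuming an x86-64 target where SSE2 is baseline. _mm_unpacklo_epi16(u, v) places each 16-bit element of v in the upper half of a 32-bit lane, and _mm_srai_epi32(..., 16) shifts the don't-care lower half out while replicating the sign bit, so the result is a scalar sign extension for any u. The helper names emulate_lo, emulate_hi and sign_extend_matches are hypothetical, not part of the report.

```c
#include <emmintrin.h>
#include <stdint.h>

/* SSE2 emulation of PMOVSXWD for the low four elements, as in test1. */
static __m128i emulate_lo(__m128i v)
{
    return _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(), v), 16);
}

/* The same idea for the high four elements, as in test3. */
static __m128i emulate_hi(__m128i v)
{
    return _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), v), 16);
}

/* Hypothetical helper: returns 1 if both emulations agree with a plain
   scalar sign extension of in[0..7], 0 otherwise. */
static int sign_extend_matches(const int16_t in[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    int32_t lo[4], hi[4];
    _mm_storeu_si128((__m128i *)lo, emulate_lo(v));
    _mm_storeu_si128((__m128i *)hi, emulate_hi(v));
    for (int i = 0; i < 4; i++) {
        /* The undefined lower halves were shifted out, so only in[] matters. */
        if (lo[i] != (int32_t)in[i] || hi[i] != (int32_t)in[i + 4])
            return 0;
    }
    return 1;
}
```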

* [Bug tree-optimization/103897] x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*
From: pinskia at gcc dot gnu.org @ 2022-01-03 13:44 UTC (permalink / raw)
  To: gcc-bugs

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
           Severity|normal                      |enhancement

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The zeroing is a known issue, which I suspect is going to be resolved during the
next stage 1; there are a few other bugs that explain why we zero these
registers.
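
Until the zeroing is fixed, one possible source-level workaround (an alternative idiom, not the one used in the report) is to pass the source vector as both unpack operands, so there is no undefined operand at all. Each 32-bit lane then holds two copies of the same 16-bit element, and the arithmetic shift discards the lower copy. The helper names below are hypothetical, and the exact code GCC 12 emits for them has not been checked here; the expectation is only that no zeroing instruction is needed.

```c
#include <emmintrin.h>

/* Hypothetical helper: sign-extend the low four 16-bit elements of v to
   32 bits using only SSE2. Each lane becomes (v[i] << 16) | v[i]; the low
   copy is shifted out and the sign bit of the high copy is replicated. */
static inline __m128i sext_lo_epi16_sse2(__m128i v)
{
    return _mm_srai_epi32(_mm_unpacklo_epi16(v, v), 16);
}

/* Hypothetical helper: the same for the high four 16-bit elements. */
static inline __m128i sext_hi_epi16_sse2(__m128i v)
{
    return _mm_srai_epi32(_mm_unpackhi_epi16(v, v), 16);
}
```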

* [Bug tree-optimization/103897] x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*
From: pinskia at gcc dot gnu.org @ 2022-01-03 19:20 UTC (permalink / raw)
  To: gcc-bugs

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |61810

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The _mm_undefined_* issue is recorded as PR 61810.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61810
[Bug 61810] init-regs.c papers over issues elsewhere

* [Bug target/103897] x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*
From: crazylht at gmail dot com @ 2022-01-04  2:52 UTC (permalink / raw)
  To: gcc-bugs

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
For 1), we have 
(define_insn "*sse4_1_<code>v4hiv4si2<mask_name>_1"
  [(set (match_operand:V4SI 0 "register_operand" "=Yr,*x,v")
        (any_extend:V4SI
          (match_operand:V4HI 1 "memory_operand" "m,m,m")))]

but combine reports "Failed to match this instruction:" for
(set (reg:V4SI 88)
    (sign_extend:V4SI (vec_select:V4HI (mem:V8HI (reg:DI 91) [0 *src_3(D)+0 S16 A128])
            (parallel [
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                ]))))

I suspect the optimization is unsafe, since the 16-byte load could trap, but it
would be OK for an 8-byte load.
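
Following the observation that folding is fine for an 8-byte load, one way to express that in source is to read only the low 64 bits with _mm_loadl_epi64 before calling _mm_cvtepi16_epi32. This is a sketch under the assumption that GCC can then fold the load into the memory form of (V)PMOVSXWD; whether a given GCC version actually does so is not verified here. Note that it reads fewer bytes than test0, so it will not fault where the 16-byte load would. The name test0_loadl is hypothetical.

```c
#include <immintrin.h>

/* Hypothetical variant of test0: only the low 8 bytes of *src are read,
   matching the 64-bit memory operand of PMOVSXWD. */
__attribute__((__target__("sse4.1")))
void test0_loadl(__m128i *dst, const __m128i *src)
{
    *dst = _mm_cvtepi16_epi32(_mm_loadl_epi64(src));
}
```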

* [Bug target/103897] x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*
From: crazylht at gmail dot com @ 2022-01-04  3:08 UTC (permalink / raw)
  To: gcc-bugs

--- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
For 2), this seems to need a generic optimization that transforms

(set (mem:V4SI (reg:DI 95) [0 *dst_7(D)+0 S16 A128])
    (ashiftrt:V4SI (subreg:V4SI (vec_select:V8HI (vec_concat:V16HI (const_vector:V8HI [
                            (const_int 0 [0]) repeated x8
                        ])
                    (mem:V8HI (reg:DI 96) [0 *src_3(D)+0 S16 A128]))
                (parallel [
                        (const_int 0 [0])
                        (const_int 8 [0x8])
                        (const_int 1 [0x1])
                        (const_int 9 [0x9])
                        (const_int 2 [0x2])
                        (const_int 10 [0xa])
                        (const_int 3 [0x3])
                        (const_int 11 [0xb])
                    ])) 0)
        (const_int 16 [0x10])))


to a sign_extend of the low four elements of (mem:V8HI (reg:DI 96)).
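
To make this equivalence concrete, the following scalar C sketch mirrors the RTL above step by step: vec_concat of the zero vector with src, the vec_select interleave [0 8 1 9 2 10 3 11], the V4SI subreg (assuming the little-endian lane layout of x86), and the arithmetic shift by 16. The function name model_rtl is hypothetical, and the code relies on GCC's documented behaviour for converting out-of-range values to signed types and for right-shifting negative values. The zero lanes only ever land in the low halves that the shift discards, so any value there would give the same result.

```c
#include <stdint.h>

/* Hypothetical scalar model of the RTL: out[i] =
   ashiftrt(subreg:SI(vec_select(vec_concat(zero, src), sel)), 16). */
static void model_rtl(const int16_t src[8], int32_t out[4])
{
    int16_t concat[16];                 /* vec_concat:V16HI (zero, src) */
    for (int i = 0; i < 8; i++) {
        concat[i] = 0;                  /* const_vector:V8HI of zeros */
        concat[i + 8] = src[i];
    }
    /* parallel [0 8 1 9 2 10 3 11]: interleave zero and src elements */
    static const int sel[8] = { 0, 8, 1, 9, 2, 10, 3, 11 };
    int16_t picked[8];
    for (int i = 0; i < 8; i++)
        picked[i] = concat[sel[i]];
    for (int i = 0; i < 4; i++) {
        /* subreg:V4SI, little-endian: the even element is the low half */
        uint32_t lane = (uint16_t)picked[2 * i]
                      | ((uint32_t)(uint16_t)picked[2 * i + 1] << 16);
        /* ashiftrt by 16 (GCC: modular conversion, arithmetic shift) */
        out[i] = (int32_t)lane >> 16;
    }
}
```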
