[Bug c/54174] New: Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128

public inbox for gcc-bugs@sourceware.org

* [Bug c/54174] New: Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0)
@ 2012-08-04 17:58 dag at nimrod dot no
  2012-08-05 10:39 ` [Bug target/54174] " rguenth at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: dag at nimrod dot no @ 2012-08-04 17:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174

             Bug #: 54174
           Summary: Missed optimization: Unnecessary vmovaps generated for
                    __builtin_ia32_vextractf128_ps256(v, 0)
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: dag@nimrod.no


Pasting the following test code into test.c and compiling with gcc -Wall -O
-mavx -S test.c

----
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

v4sf add(v8sf v)
{
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  return a + b;
}
----

makes gcc generate the following code:

    vmovaps    %xmm0, %xmm1
    vextractf128    $0x1, %ymm0, %xmm0
    vaddps    %xmm0, %xmm1, %xmm0

However if the statements for a and b are swapped, i.e.

  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);

then gcc is able to optimize away the vmovaps instruction:

    vextractf128    $0x1, %ymm0, %xmm1
    vaddps    %xmm1, %xmm0, %xmm0

It thus seems that optimization rules are in place to make
__builtin_ia32_vextractf128_ps256(v, 0) a no-op, but a vmovaps is nevertheless
generated (or rather, not optimized away) in most cases.
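
For readers without an AVX toolchain at hand, here is a minimal portable sketch of what the test case computes, written only with GCC's generic vector extensions. The helper names extract_lane_model and add_halves_model are hypothetical and not part of GCC. The sketch assumes lane 0 means elements 0..3 and lane 1 means elements 4..7, matching the immediate passed to the builtin in the report.

```c
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

/* Hypothetical model of __builtin_ia32_vextractf128_ps256 (v, lane):
   lane 0 selects elements 0..3, lane 1 selects elements 4..7.  */
static v4sf extract_lane_model (const v8sf *v, int lane)
{
  v4sf r = { 0.0f, 0.0f, 0.0f, 0.0f };
  for (int i = 0; i < 4; i++)
    r[i] = (*v)[4 * lane + i];
  return r;
}

/* Hypothetical wrapper mirroring add() from the report, using plain
   float arrays at the boundary so no vector is passed by value.  */
void add_halves_model (const float in[8], float out[4])
{
  v8sf v = { 0.0f };
  for (int i = 0; i < 8; i++)
    v[i] = in[i];

  v4sf a = extract_lane_model (&v, 0);  /* low half */
  v4sf b = extract_lane_model (&v, 1);  /* high half */
  v4sf sum = a + b;                     /* element-wise add */

  for (int i = 0; i < 4; i++)
    out[i] = sum[i];
}
```

Lane 0 occupies the low 128 bits of the ymm register, which are the same bits as the corresponding xmm register. That is why extracting it needs no instruction when input and output share a register.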



* [Bug target/54174] Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0)
  2012-08-04 17:58 [Bug c/54174] New: Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0) dag at nimrod dot no
@ 2012-08-05 10:39 ` rguenth at gcc dot gnu.org
  2021-08-21 19:23 ` pinskia at gcc dot gnu.org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-08-05 10:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
             Target|                            |x86_64-*-*
          Component|c                           |target

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-08-05 10:38:21 UTC ---
That's more likely a register allocator issue.



* [Bug target/54174] Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0)
  2012-08-04 17:58 [Bug c/54174] New: Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0) dag at nimrod dot no
  2012-08-05 10:39 ` [Bug target/54174] " rguenth at gcc dot gnu.org
@ 2021-08-21 19:23 ` pinskia at gcc dot gnu.org
  2021-08-23 11:28 ` crazylht at gmail dot com
  2024-05-16  1:50 ` lin1.hu at intel dot com
  3 siblings, 0 replies; 5+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-21 19:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-08-21
           Severity|normal                      |enhancement
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.
Most likely the vec_extract_lo_<mode> pattern should have an alternative that
ties the input and output to the same register.

Something like:
(define_insn "vec_extract_lo_<mode>"
  [(set (match_operand:<ssehalfvecmode> 0 "nonimmediate_operand" "=v,v,vm,v")
        (vec_select:<ssehalfvecmode>
          (match_operand:V8FI 1 "nonimmediate_operand" "0,v,v,vm")
          (parallel [(const_int 0) (const_int 1)
                     (const_int 2) (const_int 3)])))]
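
The "0" in the first alternative above is a matching constraint: operand 1 must live in the same register as operand 0. The same notation exists in GCC extended inline asm, so the idea can be tried out in plain C. This is only an illustration of the constraint syntax, not the proposed patch, and the function name tied_identity is hypothetical.

```c
/* Hypothetical helper: returns x unchanged.
   "=r" puts the output in a general register.
   "0"  forces the input into the same register as operand 0.
   The asm template is empty, so the asm itself emits no instruction:
   with input and output tied, the "operation" is a no-op.  */
int tied_identity (int x)
{
  int out;
  __asm__ ("" : "=r" (out) : "0" (x));
  return out;
}
```

With the operands tied, the value is already where the output is expected, which is the situation the suggested alternative tries to describe for the low-half extract.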


* [Bug target/54174] Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0)
  2012-08-04 17:58 [Bug c/54174] New: Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0) dag at nimrod dot no
  2012-08-05 10:39 ` [Bug target/54174] " rguenth at gcc dot gnu.org
  2021-08-21 19:23 ` pinskia at gcc dot gnu.org
@ 2021-08-23 11:28 ` crazylht at gmail dot com
  2024-05-16  1:50 ` lin1.hu at intel dot com
  3 siblings, 0 replies; 5+ messages in thread
From: crazylht at gmail dot com @ 2021-08-23 11:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174

--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #1)
> That's more likely a register allocator issue.

Yes, LRA allocates registers from back to front, which means that changing the
source code as shown below eliminates the redundant mov.

typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

v4sf add(v8sf v)
{
  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
  return a + b;
}


* [Bug target/54174] Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0)
  2012-08-04 17:58 [Bug c/54174] New: Missed optimization: Unnecessary vmovaps generated for __builtin_ia32_vextractf128_ps256(v, 0) dag at nimrod dot no
                   ` (2 preceding siblings ...)
  2021-08-23 11:28 ` crazylht at gmail dot com
@ 2024-05-16  1:50 ` lin1.hu at intel dot com
  3 siblings, 0 replies; 5+ messages in thread
From: lin1.hu at intel dot com @ 2024-05-16  1:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174

Hu Lin <lin1.hu at intel dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lin1.hu at intel dot com

--- Comment #4 from Hu Lin <lin1.hu at intel dot com> ---
I tried to modify vec_extract_lo_<mode> to:

 (define_insn "vec_extract_lo_<mode>"
   [(set (match_operand:<ssehalfvecmode> 0 "nonimmediate_operand" "=v,v,vm,v")
         (vec_select:<ssehalfvecmode>
           (match_operand:VI4F_256 1 "nonimmediate_operand" "0,v,v,vm")
           (parallel [(const_int 0) (const_int 1)
                      (const_int 2) (const_int 3)])))]

and 
 (define_insn "vec_extract_lo_<mode>"
   [(set (match_operand:<ssehalfvecmode> 0 "nonimmediate_operand" "=v,?v,?vm,?v")
         (vec_select:<ssehalfvecmode>
           (match_operand:VI4F_256 1 "nonimmediate_operand" "0,v,v,vm")
           (parallel [(const_int 0) (const_int 1)
                      (const_int 2) (const_int 3)])))]

In the 315r.reload dump:
         Considering alt=0 of insn 7:   (0) =v  (1) 0
            1 Matching alt: reject+=2
          overall=8,losers=1,rld_nregs=1
         Considering alt=1 of insn 7:   (0) ?v  (1) v
            Staticly defined alt reject+=6
          overall=0,losers=0,rld_nregs=0
      Choosing alt 1 in insn 7:  (0) ?v  (1) v {vec_extract_lo_v8sf}
I also tried the ! modifier; alt=0 is still rejected.

I even tried reducing the pattern to a single tied alternative:
 (define_insn "vec_extract_lo_<mode>"
   [(set (match_operand:<ssehalfvecmode> 0 "nonimmediate_operand" "=v")
         (vec_select:<ssehalfvecmode>
           (match_operand:VI4F_256 1 "nonimmediate_operand" "0")
           (parallel [(const_int 0) (const_int 1)
                      (const_int 2) (const_int 3)])))]

Although vec_extract_lo_v8sf then uses the same register (%xmm2) for input and
output, the compiler adds an extra insn "vmovaps %ymm0, %ymm2" after reload.

On the other hand, we tried splitting the pattern into
  [(set (match_dup 0) (match_dup 1))]
{
   operands[1] = gen_lowpart (<ssehalfvecmode>mode, operands[1]);
}
before reload. However, GCC does not perform register coalescing the way Clang does.
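
In C terms, taking the lowpart treats the low half as a view of the same bits in a narrower mode rather than as a computation. A minimal sketch of that view, using GCC's generic vector extensions and memcpy, follows. The function name low_half_view is hypothetical, and the sketch says nothing about how the register allocator handles the resulting copy.

```c
#include <string.h>

typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

/* Hypothetical helper: reinterpret the first 16 bytes of a 32-byte
   vector as a 16-byte vector, i.e. elements 0..3.  */
void low_half_view (const float in[8], float out[4])
{
  v8sf v;
  memcpy (&v, in, sizeof v);    /* load all eight floats */

  v4sf lo;
  memcpy (&lo, &v, sizeof lo);  /* same bits, narrower type */

  memcpy (out, &lo, sizeof lo); /* store the four low floats */
}
```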

