public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
@ 2012-02-22 13:21 ` xiaoyuanbo at yeah dot net
  2012-03-22 13:24 ` venkataramanan.kumar at amd dot com
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: xiaoyuanbo at yeah dot net @ 2012-02-22 13:21 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

xiaoyuanbo <xiaoyuanbo at yeah dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xiaoyuanbo at yeah dot net

--- Comment #3 from xiaoyuanbo <xiaoyuanbo at yeah dot net> 2012-02-22 12:48:13 UTC ---
i tell you


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
  2012-02-22 13:21 ` [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target xiaoyuanbo at yeah dot net
@ 2012-03-22 13:24 ` venkataramanan.kumar at amd dot com
  2012-03-22 13:26 ` venkataramanan.kumar at amd dot com
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-22 13:24 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #4 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-22 13:17:34 UTC ---
I dont have permission to confirm this bug.

Here is my analysis for the cause.

#(insn:TI 4886 4885 4888 132 (set (reg:V2DF 25 xmm4 [8797])
#        (mult:V2DF (reg:V2DF 25 xmm4 [8795])
#            (reg:V2DF 22 xmm1 [8758]))) ac.f90:499 1138 {*mulv2df3}
#     (nil))
        vmulpd  %xmm1, %xmm4, %xmm4     # 4886  *mulv2df3/2     [length = 4]

We are forcing a conversion from V2DF to V4SF mode here for unaligned moves
when TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL is set.

(-----Snip ix86_expand_vector_move_misalign-----)
            case V2DFmode:
              if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
                {
                  op0 = gen_lowpart (V4SFmode, op0);
                  op1 = gen_lowpart (V4SFmode, op1);
                  emit_insn (gen_sse_movups (op0, op1));
                  return;
                }
(-----Snip-----) 

This conversion generates RTL as shown below.

#(insn:TI 4888 4886 4890 132 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 6136 [0x17f8])) [3 MEM[(real(kind=8)[26] *)&dclroo
+ 152B]+0 S16 A64])
#        (unspec:V4SF [
#                (reg:V4SF 25 xmm4 [8797])
#            ] UNSPEC_MOVU)) ac.f90:499 1104 {*sse_movups}
#     (expr_list:REG_DEAD (reg:V4SF 25 xmm4 [8797])
#        (nil)))
        vmovups %xmm4, 6136(%rsp)       # 4888  *sse_movups/2   [length = 9]

Now GCC does not know how to come back to V2DF mode again. As Uros said, it
reloads through memory.

#(insn 4930 4929 8259 132 (set (reg:V4SF 23 xmm2)
#        (unspec:V4SF [
#                (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                        (const_int 6136 [0x17f8])) [3 MEM[(real(kind=8)[26]
*)&dclroo + 152B]+0 S16 A64])
#            ] UNSPEC_MOVU)) ac.f90:503 1104 {*sse_movups}
#     (nil))
        vmovups 6136(%rsp), %xmm2       # 4930  *sse_movups/1   [length = 9]
#(insn:TI 8259 4930 8261 132 (set (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
#                (const_int 240 [0xf0])) [12 %sfp+-11184 S16 A128])
#        (reg:V4SF 23 xmm2)) ac.f90:503 1098 {*movv4sf_internal}
#     (expr_list:REG_DEAD (reg:V4SF 23 xmm2)
#        (nil)))
        vmovaps %xmm2, 240(%rsp)        # 8259  *movv4sf_internal/3     [length
= 9]
#(insn 8261 8259 4931 132 (set (reg:V2DF 23 xmm2)
#        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
#                (const_int 240 [0xf0])) [12 %sfp+-11184 S16 A128])) ac.f90:503
1100 {*movv2df_internal}
#     (nil))
        vmovaps 240(%rsp), %xmm2        # 8261  *movv2df_internal/2     [length
= 9]
#(insn:TI 4931 8261 8260 132 (set (reg:V2DF 23 xmm2)
#        (div:V2DF (reg:V2DF 23 xmm2)
#            (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
#                    (const_int 6128 [0x17f0])) [3 MEM[(real(kind=8)[26]
*)&dclroo + 144B]+0 S16 A128]))) ac.f90:503 1144 {sse2_divv2df3}
#     (nil))
        vdivpd  6128(%rsp), %xmm2, %xmm2        # 4931  sse2_divv2df3/2 [length
= 9]


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
  2012-02-22 13:21 ` [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target xiaoyuanbo at yeah dot net
  2012-03-22 13:24 ` venkataramanan.kumar at amd dot com
@ 2012-03-22 13:26 ` venkataramanan.kumar at amd dot com
  2012-03-22 13:29 ` venkataramanan.kumar at amd dot com
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-22 13:26 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #5 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-22 13:23:39 UTC ---
Created attachment 26955
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26955
Simple patch to print vmovups at assembly generation stage


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2012-03-22 13:26 ` venkataramanan.kumar at amd dot com
@ 2012-03-22 13:29 ` venkataramanan.kumar at amd dot com
  2012-03-22 13:56 ` ubizjak at gmail dot com
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-22 13:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #6 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-22 13:26:16 UTC ---
This patch tries to change vmovupd to vmovups during assembly printing stage 
when tuning flag for bdver1 is set. 

I am yet to test this one. Please provide your suggestion.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2012-03-22 13:29 ` venkataramanan.kumar at amd dot com
@ 2012-03-22 13:56 ` ubizjak at gmail dot com
  2012-03-27 10:52 ` venkataramanan.kumar at amd dot com
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: ubizjak at gmail dot com @ 2012-03-22 13:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-03-22
     Ever Confirmed|0                           |1

--- Comment #7 from Uros Bizjak <ubizjak at gmail dot com> 2012-03-22 13:53:02 UTC ---
(In reply to comment #6)
> This patch tries to change vmovupd to vmovups during assembly printing stage 
> when tuning flag for bdver1 is set. 
> 
> I am yet to test this one. Please provide your suggestion.

Can you please attach a small testcase that exposes this problem? I will look
into it.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2012-03-22 13:56 ` ubizjak at gmail dot com
@ 2012-03-27 10:52 ` venkataramanan.kumar at amd dot com
  2012-03-27 11:27 ` venkataramanan.kumar at amd dot com
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-27 10:52 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #8 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-27 10:46:53 UTC ---
Created attachment 27013
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27013
Simplied test case form ac.f90


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2012-03-27 10:52 ` venkataramanan.kumar at amd dot com
@ 2012-03-27 11:27 ` venkataramanan.kumar at amd dot com
  2012-03-27 17:05 ` [Bug rtl-optimization/44141] " ubizjak at gmail dot com
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-27 11:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #9 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-27 10:51:06 UTC ---
(In reply to comment #8)
> Created attachment 27013 [details]
> Simplied test case form ac.f90

GCC revision : 184502
Command to reproduce:  gfortran unoptimal_move.f90 -S -march=bdver1 -Ofast -dP

Unoptimal move patterns can be found by grepping the assembly as follows:

grep  *movv4sf_internal unoptimal_move.s

#        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
        vmovaps %xmm0, (%rsp)   # 393   *movv4sf_internal/3     [length = 5]
#        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
        vmovaps %xmm0, 32(%rsp) # 396   *movv4sf_internal/3     [length = 6]
#        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
        vmovaps %xmm0, 64(%rsp) # 399   *movv4sf_internal/3     [length = 6]
#        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
        vmovaps %xmm0, 96(%rsp) # 402   *movv4sf_internal/3     [length = 6]


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2012-03-27 11:27 ` venkataramanan.kumar at amd dot com
@ 2012-03-27 17:05 ` ubizjak at gmail dot com
  2012-03-28  3:23 ` venkataramanan.kumar at amd dot com
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: ubizjak at gmail dot com @ 2012-03-27 17:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
                 CC|                            |ubizjak at gmail dot com,
                   |                            |uweigand at gcc dot
                   |                            |gnu.org, vmakarov at gcc
                   |                            |dot gnu.org
          Component|target                      |rtl-optimization

--- Comment #10 from Uros Bizjak <ubizjak at gmail dot com> 2012-03-27 17:02:18 UTC ---
(In reply to comment #9)
> (In reply to comment #8)
> > Created attachment 27013 [details]
> > Simplied test case form ac.f90
> 
> GCC revision : 184502
> Command to reproduce:  gfortran unoptimal_move.f90 -S -march=bdver1 -Ofast -dP
> 
> Unoptimal move patterns can be found by grepping the assembly as follows:
> 
> grep  *movv4sf_internal unoptimal_move.s
> 
> #        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
>         vmovaps %xmm0, (%rsp)   # 393   *movv4sf_internal/3     [length = 5]
> #        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
>         vmovaps %xmm0, 32(%rsp) # 396   *movv4sf_internal/3     [length = 6]
> #        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
>         vmovaps %xmm0, 64(%rsp) # 399   *movv4sf_internal/3     [length = 6]
> #        (reg:V4SF 21 xmm0)) test.f90:16 1100 {*movv4sf_internal}
>         vmovaps %xmm0, 96(%rsp) # 402   *movv4sf_internal/3     [length = 6]

This is register allocation / reload issue. MODES_TIEABLE_P and
CANNOT_CHANGE_MODE_CLASS are correct (so V4SF and V2DF should be
interchangeable). There are plenty of examples, where mode changes without
problems in _.194r.reload, i.e.

(insn 24 23 25 2 (set (reg:V4SF 51 xmm14 [282])
        (unspec:V4SF [
                (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 1432 [0x598])) [3 MEM[(real(kind=8)[26]
*)&dsroo + 8B]+0 S16 A64])
            ] UNSPEC_MOVU)) unoptimal_move.f90:16 1114 {*sse_movups}
     (nil))

(insn 25 24 348 2 (set (reg:V2DF 52 xmm15 [283])
        (div:V2DF (reg:V2DF 51 xmm14 [282])
            (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                    (const_int 1424 [0x590])) [3 MEM[(real(kind=8)[26]
*)&dsroo]+0 S16 A128]))) unoptimal_move.f90:16 1154 {sse2_divv2df3}
     (nil))

But when the input and output operands of the division match, following
sequence is generated:

(insn 45 44 377 2 (set (reg:V4SF 21 xmm0)
        (unspec:V4SF [
                (mem/c:V4SF (plus:DI (reg/f:DI 7 sp)
                        (const_int 1544 [0x608])) [3 MEM[(real(kind=8)[26]
*)&dsroo + 120B]+0 S16 A64])
            ] UNSPEC_MOVU)) unoptimal_move.f90:16 1114 {*sse_movups}
     (nil))

(insn 377 45 379 2 (set (mem/c:V4SF (reg/f:DI 7 sp) [10 %sfp+-2112 S16 A128])
        (reg:V4SF 21 xmm0)) unoptimal_move.f90:16 1108 {*movv4sf_internal}
     (nil))

(insn 379 377 46 2 (set (reg:V2DF 21 xmm0)
        (mem/c:V2DF (reg/f:DI 7 sp) [10 %sfp+-2112 S16 A128]))
unoptimal_move.f90:16 1110 {*movv2df_internal}
     (nil))

(insn 46 379 378 2 (set (reg:V2DF 21 xmm0)
        (div:V2DF (reg:V2DF 21 xmm0)
            (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                    (const_int 1536 [0x600])) [3 MEM[(real(kind=8)[26] *)&dsroo
+ 112B]+0 S16 A128]))) unoptimal_move.f90:16 1154 {sse2_divv2df3}
     (nil))

Both sequences have the same starting sequence, from _193r.ira:

(insn 24 23 25 2 (set (subreg:V4SF (reg:V2DF 282) 0)
        (unspec:V4SF [
                (mem/c:V4SF (plus:DI (reg/f:DI 20 frame)
                        (const_int -680 [0xfffffffffffffd58])) [3
MEM[(real(kind=8)[26] *)&dsroo + 8B]+0 S16 A64])
            ] UNSPEC_MOVU)) unoptimal_move.f90:16 1114 {*sse_movups}
     (nil))

(insn 25 24 348 2 (set (reg:V2DF 283)
        (div:V2DF (reg:V2DF 282)
            (mem/c:V2DF (plus:DI (reg/f:DI 20 frame)
                    (const_int -688 [0xfffffffffffffd50])) [3
MEM[(real(kind=8)[26] *)&dsroo]+0 S16 A128]))) unoptimal_move.f90:16 1154
{sse2_divv2df3}
     (nil))

The proposed fix would only put this issue under-the-rug.

Reconfirmed as rtl-optimization problem.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2012-03-27 17:05 ` [Bug rtl-optimization/44141] " ubizjak at gmail dot com
@ 2012-03-28  3:23 ` venkataramanan.kumar at amd dot com
  2012-03-28  7:08 ` ubizjak at gmail dot com
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-28  3:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #11 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-28 03:02:19 UTC ---
Uros, Can you please assign this bug under my name. I will see what is
hapenning at reload.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2012-03-28  3:23 ` venkataramanan.kumar at amd dot com
@ 2012-03-28  7:08 ` ubizjak at gmail dot com
  2012-03-28  8:01 ` jakub at gcc dot gnu.org
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: ubizjak at gmail dot com @ 2012-03-28  7:08 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
         AssignedTo|unassigned at gcc dot       |venkataramanan.kumar at amd
                   |gnu.org                     |dot com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2012-03-28  7:08 ` ubizjak at gmail dot com
@ 2012-03-28  8:01 ` jakub at gcc dot gnu.org
  2012-03-28 10:42 ` venkataramanan.kumar at amd dot com
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-28  8:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-28 07:47:58 UTC ---
Having a vector mode changing subreg on the LHS of an instruction is a very
common issue in the i386 backend, and unfortunately e.g. means that lots of
insns can't be combined or simplified.  I wonder if the expansion sometimes
shouldn't use a non-subregged temporary as lhs and add a move from subreg of
the temporary to the desired destination.

BTW, if with TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL ps is more desirable than pd
for movs, then perhaps it would be better to add a mode attr similar to
ssemodesuffix, which would emit pd or ps for V2DFmode depending on
TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL (see e.g. i128 mode attr).


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2012-03-28  8:01 ` jakub at gcc dot gnu.org
@ 2012-03-28 10:42 ` venkataramanan.kumar at amd dot com
  2012-03-28 10:53 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-03-28 10:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #13 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-03-28 10:32:31 UTC ---
(In reply to comment #12)
> Having a vector mode changing subreg on the LHS of an instruction is a very
> common issue in the i386 backend, and unfortunately e.g. means that lots of
> insns can't be combined or simplified.  I wonder if the expansion sometimes
> shouldn't use a non-subregged temporary as lhs and add a move from subreg of
> the temporary to the desired destination.

The expander now converts as shown below for unaligned moves with V2DF mode.

            if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
                {
                  op0 = gen_lowpart (V4SFmode, op0);
                  op1 = gen_lowpart (V4SFmode, op1);
                  emit_insn (gen_sse_movups (op0, op1));
                  return;
                }

You mean conversion is not needed here?  


> BTW, if with TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL ps is more desirable than pd
> for movs, then perhaps it would be better to add a mode attr similar to
> ssemodesuffix, which would emit pd or ps for V2DFmode depending on
> TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL (see e.g. i128 mode attr).

Yes with TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL we want to generate movups
instead of movupd.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2012-03-28 10:42 ` venkataramanan.kumar at amd dot com
@ 2012-03-28 10:53 ` jakub at gcc dot gnu.org
  2012-04-01  7:55 ` venkataramanan.kumar at amd dot com
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: jakub at gcc dot gnu.org @ 2012-03-28 10:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-28 10:40:45 UTC ---
(In reply to comment #13)
> The expander now converts as shown below for unaligned moves with V2DF mode.
> 
>             if (TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL)
>                 {
>                   op0 = gen_lowpart (V4SFmode, op0);
>                   op1 = gen_lowpart (V4SFmode, op1);
>                   emit_insn (gen_sse_movups (op0, op1));
>                   return;
>                 }
> 
> You mean conversion is not needed here?  

No, I meant it should do perhaps:
  rtx tem = gen_reg_rtx (V4SFmode);
  emit_insn (gen_sse_movups (tem, gen_lowpart (V4SFmode, op1)));
  emit_move_insn (op0, gen_lowpart (GET_MODE (op0), tem));
  return;
or similar.  Of course in this case it can be done using changes in the
patterns (note, there are lots of other insns that emit {,v}mov{u,a}pd,
which of those should be changed?), I meant this as a general comment that
a vector mode changing subreg on a lhs of an insn is highly undesirable.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2012-03-28 10:53 ` jakub at gcc dot gnu.org
@ 2012-04-01  7:55 ` venkataramanan.kumar at amd dot com
  2012-05-07 17:28 ` uweigand at gcc dot gnu.org
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: venkataramanan.kumar at amd dot com @ 2012-04-01  7:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #15 from Venkataramanan <venkataramanan.kumar at amd dot com> 2012-04-01 07:55:24 UTC ---
Hi Uros,

I had a look at reload pass.

I have an RTL sequence that look like this.

(insn 32 31 33 2 (set (subreg:V4SF (reg:V2DF 284) 0) <== psuedo reguster 
        (unspec:V4SF [
                (mem/c:V4SF (plus:DI (reg/f:DI 20 frame)
                        (const_int -568 [0xfffffffffffffdc8])) [3
MEM[(real(kind=8)[26] *)&dsroo + 120B]+0 S16 A64])
            ] UNSPEC_MOVU)) test.f90:16 1106 {*sse_movups}
     (nil))

(insn 33 32 371 2 (set (reg:V2DF 285) <= pseudo register 
        (div:V2DF (reg:V2DF 284) <== pseudo register
            (mem/c:V2DF (plus:DI (reg/f:DI 20 frame)
                    (const_int -576 [0xfffffffffffffdc0])) [3
MEM[(real(kind=8)[26] *)&dsroo + 112B]+0 S16 A128]))) test.f90:16 1146
{sse2_divv2df3}
     (nil))

Now reload examines the first RTL. 

(insn 32 31 33 2 (set (subreg:V4SF (reg:V2DF 284) 0) <== psuedo reguster 
        (unspec:V4SF [
                (mem/c:V4SF (plus:DI (reg/f:DI 20 frame)
                        (const_int -568 [0xfffffffffffffdc8])) [3
MEM[(real(kind=8)[26] *)&dsroo + 120B]+0 S16 A64])
            ] UNSPEC_MOVU)) test.f90:16 1106 {*sse_movups}
     (nil))

This did not get an Hard register and so at reload it chooses xmm0 and
generates an output reload as follows.


(insn 393 0 0 (set (reg:V2DF 284)
                                                                             
(reg:V2DF 21 xmm0)) -1

There is possiblity of reload inheritenace and we can avoid input reload for
the next RTL insn 33

(insn 33 32 371 2 (set (reg:V2DF 285)
        (div:V2DF (reg:V2DF 284)
            (mem/c:V2DF (plus:DI (reg/f:DI 20 frame)
                    (const_int -576 [0xfffffffffffffdc0])) [3
MEM[(real(kind=8)[26] *)&dsroo + 112B]+0 S16 A128]))) test.f90:16 1146
{sse2_divv2df3}
     (nil))

But It is not happening and input reload gets generated again before this RTL
as follows:

(insn 395 0 0 (set (reg:V2DF 21 xmm0)
                                                                               
 (reg:V2DF 284)

Also another outpout reload gets emitted after the insn 33 for its output
reload as  

(insn 394 0 0 (set (reg:V2DF 285)
                                                                             
(reg:V2DF 21 xmm0)) -1. But I am not sure If this computation prevented input
reload inheritence in the insn 33 .

I suspect emit_reload_insns is not preserving output reloads in insn32 for
further inheritence insn33. 

Please povide your opinion.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug rtl-optimization/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2012-04-01  7:55 ` venkataramanan.kumar at amd dot com
@ 2012-05-07 17:28 ` uweigand at gcc dot gnu.org
  2012-05-08 10:28 ` [Bug target/44141] " rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: uweigand at gcc dot gnu.org @ 2012-05-07 17:28 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #16 from Ulrich Weigand <uweigand at gcc dot gnu.org> 2012-05-07 17:17:06 UTC ---
Reload inheritance generally gives up on handling subregs of pseudos, mostly
because there is no mechanism to track invalidation of parts of pseudos.

Now, in this particular case where a subreg still refers to the whole of the
pseudo and just re-interprets it as a different vector mode, it might be
possible to enhance inheritance support to handle such cases.

But this looks like a significant chance to what is already one of the most
complex parts of reload -- and would certainly involve some risk of introducing
subtle bugs.   I'm not sure if this is worth the effort ...

In particular, in this specific case the back-end could make things a whole lot
simpler by not insisting on a particular mode.  I understand MOVUPS simply
moves 16 bytes between (unaligned) memory and registers -- there's nothing in
particular that requires this to be encoded as V4SF mode.

I'd suggest simply extending the movups patterns to handle moves between
arbitrary 16-byte (vector?) modes, all in the end resolving to the same
assembler instruction.  Then you'd be able to just encode the move in question
along the lines of
   (set (reg:V2DF) (unspec:V2DF [(mem:V2DF ...)] UNSPEC_MOVU))
which would probably generate better code not just in reload, but other parts
of the RTL middle-end as well ...

[ An even more radical change might be to encode both movups and movaps as just
plain moves, with no unspec, and have the final assembly output make the choice
between the two based on the operand's MEM_ALIGN value ... Of course, this
needs MEM_ALIGN to be always correct. ]


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2012-05-07 17:28 ` uweigand at gcc dot gnu.org
@ 2012-05-08 10:28 ` rguenth at gcc dot gnu.org
  2012-05-08 10:36 ` ubizjak at gmail dot com
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-05-08 10:28 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |target

--- Comment #17 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-05-08 10:17:18 UTC ---
Which makes this a target bug then.  Uros?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2012-05-08 10:28 ` [Bug target/44141] " rguenth at gcc dot gnu.org
@ 2012-05-08 10:36 ` ubizjak at gmail dot com
  2012-05-09 17:04 ` ubizjak at gmail dot com
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 20+ messages in thread
From: ubizjak at gmail dot com @ 2012-05-08 10:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #18 from Uros Bizjak <ubizjak at gmail dot com> 2012-05-08 10:32:33 UTC ---
(In reply to comment #17)
> Which makes this a target bug then.  Uros?

Following the explanation in comment #16, I'd say so.

Please note that we already implement the radical change, mentioned at the end
of the comment, but for TARGET_AVX only (please see *mov<mode>_internal
patterns in config/i386/sse.md), and the less radical change with
TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL in move patterns. As suggested, we will
add T_S_P_S_I_O handling in movu patterns, avoiding mode changes entirely.

Also, it looks that vector move patterns need some TLC.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2012-05-08 10:36 ` ubizjak at gmail dot com
@ 2012-05-09 17:04 ` ubizjak at gmail dot com
  2012-05-09 18:37 ` uros at gcc dot gnu.org
  2012-05-09 20:42 ` ubizjak at gmail dot com
  19 siblings, 0 replies; 20+ messages in thread
From: ubizjak at gmail dot com @ 2012-05-09 17:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|ra                          |
         AssignedTo|venkataramanan.kumar at amd |ubizjak at gmail dot com
                   |dot com                     |
   Target Milestone|---                         |4.8.0

--- Comment #19 from Uros Bizjak <ubizjak at gmail dot com> 2012-05-09 16:58:58 UTC ---
I have a patch in testing.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2012-05-09 17:04 ` ubizjak at gmail dot com
@ 2012-05-09 18:37 ` uros at gcc dot gnu.org
  2012-05-09 20:42 ` ubizjak at gmail dot com
  19 siblings, 0 replies; 20+ messages in thread
From: uros at gcc dot gnu.org @ 2012-05-09 18:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

--- Comment #20 from uros at gcc dot gnu.org 2012-05-09 18:06:52 UTC ---
Author: uros
Date: Wed May  9 18:06:47 2012
New Revision: 187347

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=187347
Log:
    PR target/44141
    * config/i386/i386.c (ix86_expand_vector_move_misalign): Do not handle
    128 bit vectors specially for TARGET_AVX.  Emit sse2_movupd and
    sse_movupd RTXes for TARGET_AVX, TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL
    or when optimizing for size.
    * config/i386/sse.md (*mov<mode>_internal): Remove
    TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling from asm output code.
    Calculate "mode" attribute according to optimize_function_for_size_p
    and TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL flag.
    (*<sse>_movu<ssemodesuffix><avxsizesuffix>): Choose asm template
    depending on the mode of the instruction.  Calculate "mode" attribute
    according to optimize_function_for_size_p, TARGET_SSE_TYPELESS_STORES
    and TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL flags.
    (*<sse2>_movdqu<avxsizesuffix>): Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/sse.md


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target
       [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2012-05-09 18:37 ` uros at gcc dot gnu.org
@ 2012-05-09 20:42 ` ubizjak at gmail dot com
  19 siblings, 0 replies; 20+ messages in thread
From: ubizjak at gmail dot com @ 2012-05-09 20:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44141

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                URL|                            |http://gcc.gnu.org/ml/gcc-p
                   |                            |atches/2012-05/msg00689.htm
                   |                            |l
         Resolution|                            |FIXED

--- Comment #21 from Uros Bizjak <ubizjak at gmail dot com> 2012-05-09 19:41:34 UTC ---
Fixed.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-05-09 19:42 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-44141-4@http.gcc.gnu.org/bugzilla/>
2012-02-22 13:21 ` [Bug target/44141] Redundant loads and stores generated for AMD bdver1 target xiaoyuanbo at yeah dot net
2012-03-22 13:24 ` venkataramanan.kumar at amd dot com
2012-03-22 13:26 ` venkataramanan.kumar at amd dot com
2012-03-22 13:29 ` venkataramanan.kumar at amd dot com
2012-03-22 13:56 ` ubizjak at gmail dot com
2012-03-27 10:52 ` venkataramanan.kumar at amd dot com
2012-03-27 11:27 ` venkataramanan.kumar at amd dot com
2012-03-27 17:05 ` [Bug rtl-optimization/44141] " ubizjak at gmail dot com
2012-03-28  3:23 ` venkataramanan.kumar at amd dot com
2012-03-28  7:08 ` ubizjak at gmail dot com
2012-03-28  8:01 ` jakub at gcc dot gnu.org
2012-03-28 10:42 ` venkataramanan.kumar at amd dot com
2012-03-28 10:53 ` jakub at gcc dot gnu.org
2012-04-01  7:55 ` venkataramanan.kumar at amd dot com
2012-05-07 17:28 ` uweigand at gcc dot gnu.org
2012-05-08 10:28 ` [Bug target/44141] " rguenth at gcc dot gnu.org
2012-05-08 10:36 ` ubizjak at gmail dot com
2012-05-09 17:04 ` ubizjak at gmail dot com
2012-05-09 18:37 ` uros at gcc dot gnu.org
2012-05-09 20:42 ` ubizjak at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).