[Bug target/47764] New: The constant load instruction should be hoisted out of loop

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/47764] New: The constant load instruction should be hoisted out of loop
@ 2011-02-16  7:35 carrot at google dot com
  2011-02-18 11:49 ` [Bug target/47764] " ibolton at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: carrot at google dot com @ 2011-02-16  7:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

           Summary: The constant load instruction should be hoisted out of
                    loop
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: carrot@google.com
            Target: arm-linux-androideabi


Created attachment 23359
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23359
testcase

The attached test case is extracted from zlib. Compile it with options
-march=armv7-a -mthumb -Os, gcc 4.6 generates:

init_block:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    movs    r3, #0
.L2:
    adds    r2, r0, r3
    adds    r3, r3, #4
    movs    r1, #0                   // A
    cmp    r3, #1144
    strh    r1, [r2, #60]    @ movhi  // B
    bne    .L2
    movs    r3, #0
.L3:
    adds    r2, r0, r3
    adds    r3, r3, #4
    movs    r1, #0                   // C
    cmp    r3, #120
    strh    r1, [r2, #2352]    @ movhi
    bne    .L3
    movs    r2, #0
.L4:
    adds    r1, r0, r2
    adds    r2, r2, #4
    movs    r3, #0                   // D
    cmp    r2, #76
    strh    r3, [r1, #2596]    @ movhi
    bne    .L4
    movs    r2, #1
    str    r3, [r0, #2760]
    strh    r2, [r0, #1084]    @ movhi
    str    r3, [r0, #2756]
    str    r3, [r0, #2764]
    str    r3, [r0, #2752]
    bx    lr

Note that instruction A in loop L2 loads constant 0 to register r1, then
instruction B stores r1 into memory. There is no other usage of r1 in the loop.
So it's better to move instruction A out of the loop.

Similarly instruction C can be moved out of loop L3. Actually it can be removed
since after instruction A the register r1 already contains 0 and no instruction
modify it later.

Similarly instruction D cam be moved out of loop L4. It can also be removed if
we exchange the register usage of r1 and r3 in loop L4.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/47764] The constant load instruction should be hoisted out of loop
  2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
@ 2011-02-18 11:49 ` ibolton at gcc dot gnu.org
  2011-02-19  7:06 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: ibolton at gcc dot gnu.org @ 2011-02-18 11:49 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

Ian Bolton <ibolton at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.02.18 11:36:27
                 CC|                            |ibolton at gcc dot gnu.org
     Ever Confirmed|0                           |1
      Known to fail|                            |4.6.0

--- Comment #1 from Ian Bolton <ibolton at gcc dot gnu.org> 2011-02-18 11:36:27 UTC ---
I have confirmed this for r170052 of trunk.

Any ideas of how this improvement could be implemented, Carrot?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/47764] The constant load instruction should be hoisted out of loop
  2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
  2011-02-18 11:49 ` [Bug target/47764] " ibolton at gcc dot gnu.org
@ 2011-02-19  7:06 ` pinskia at gcc dot gnu.org
  2011-02-21  4:10 ` carrot at google dot com
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2011-02-19  7:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2011-02-19 03:23:51 UTC ---
This is most likely a cost issue.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/47764] The constant load instruction should be hoisted out of loop
  2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
  2011-02-18 11:49 ` [Bug target/47764] " ibolton at gcc dot gnu.org
  2011-02-19  7:06 ` pinskia at gcc dot gnu.org
@ 2011-02-21  4:10 ` carrot at google dot com
  2011-08-10 22:18 ` ramana at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: carrot at google dot com @ 2011-02-21  4:10 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

--- Comment #3 from Carrot <carrot at google dot com> 2011-02-21 03:15:45 UTC ---
> Any ideas of how this improvement could be implemented, Carrot?

The root cause of this problem is that arm/thumb store instruction can't
directly store a immediate number to memory, but gcc doesn't realize this early
enough. In most part of the rtl phase, the following form is kept.

  (insn 41 38 42 3 (set (mem:HI (plus:SI (reg/f:SI 169)
                  (const_int 60 [0x3c])) [2 MEM[(struct deflate_state *)D.2085 
  _3 + 60B]+0 S2 A16])
          (const_int 0 [0])) src/trees.c:45 696 {*thumb2_movhi_insn}
       (expr_list:REG_DEAD (reg/f:SI 169)
          (nil)))

Until register allocation it finds the restriction of the store instruction and
split it into two instructions, load 0 into register and store register to
memory. But it's too late to do a loop optimization.

One possible method is to split this insn earlier than loop optimization (maybe
directly in expand pass), and let loop and cse optimizations do the rest. It
may increase register pressure in part of the program, we should rematerialize
it in such cases.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/47764] The constant load instruction should be hoisted out of loop
  2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
                   ` (2 preceding siblings ...)
  2011-02-21  4:10 ` carrot at google dot com
@ 2011-08-10 22:18 ` ramana at gcc dot gnu.org
  2013-01-24  7:25 ` [Bug rtl-optimization/47764] " ubizjak at gmail dot com
  2014-12-17  0:38 ` carrot at google dot com
  5 siblings, 0 replies; 7+ messages in thread
From: ramana at gcc dot gnu.org @ 2011-08-10 22:18 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2011-08-10 22:17:23 UTC ---
(In reply to comment #3)
> > Any ideas of how this improvement could be implemented, Carrot?
> 
> The root cause of this problem is that arm/thumb store instruction can't
> directly store a immediate number to memory, but gcc doesn't realize this early
> enough. In most part of the rtl phase, the following form is kept.
> 
>   (insn 41 38 42 3 (set (mem:HI (plus:SI (reg/f:SI 169)
>                   (const_int 60 [0x3c])) [2 MEM[(struct deflate_state *)D.2085 
>   _3 + 60B]+0 S2 A16])
>           (const_int 0 [0])) src/trees.c:45 696 {*thumb2_movhi_insn}
>        (expr_list:REG_DEAD (reg/f:SI 169)
>           (nil)))
> 
> Until register allocation it finds the restriction of the store instruction and
> split it into two instructions, load 0 into register and store register to
> memory. But it's too late to do a loop optimization.

Eh, how is splitting this early going to help with hoisting this out of a loop
? 

Ramana


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/47764] The constant load instruction should be hoisted out of loop
  2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
                   ` (3 preceding siblings ...)
  2011-08-10 22:18 ` ramana at gcc dot gnu.org
@ 2013-01-24  7:25 ` ubizjak at gmail dot com
  2014-12-17  0:38 ` carrot at google dot com
  5 siblings, 0 replies; 7+ messages in thread
From: ubizjak at gmail dot com @ 2013-01-24  7:25 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

Uros Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|arm-linux-androideabi       |
                 CC|                            |ubizjak at gmail dot com
          Component|target                      |rtl-optimization
      Known to fail|                            |4.7.0, 4.8.0

--- Comment #5 from Uros Bizjak <ubizjak at gmail dot com> 2013-01-24 07:25:04 UTC ---
This is a problem with rtl-optimization, gcse2 pass.

Following testcase also fails on x86_64, with 4.8 [1] that removes (!o,F)
alternative.

Following test, when compiled with -O3 hoists memory load out of the loop:

--cut here--
volatile double y;

void
test ()
{
  int z;

  for (z = 0; z < 1000; z++)
    y = 0.1;
}
--cut here--

_.210r.postreload:

   15: L15:
    8: NOTE_INSN_BASIC_BLOCK 3
   23: xmm0:DF=[`*.LC0']
   10: [`y']=xmm0:DF
      REG_DEAD xmm0:DF
   11: NOTE_INSN_DELETED
   12: {flags:CCZ=cmp(ax:SI-0x1,0);ax:SI=ax:SI-0x1;}
   13: pc={(flags:CCZ!=0)?L15:pc}
      REG_BR_PROB 0x26ab

_.211r.gcse2:

   26: xmm0:DF=[`*.LC0']
   15: L15:
    8: NOTE_INSN_BASIC_BLOCK 3
   10: [`y']=xmm0:DF
      REG_DEAD xmm0:DF
   11: NOTE_INSN_DELETED
   12: {flags:CCZ=cmp(ax:SI-0x1,0);ax:SI=ax:SI-0x1;}
   13: pc={(flags:CCZ!=0)?L15:pc}
      REG_BR_PROB 0x26ab

However, when constant is changed to 0.0 (so, we can load it directly to %xmm
register using xorpd insn):

--cut here--
volatile double y;

void
test ()
{
  int z;

  for (z = 0; z < 1000; z++)
    y = 0.0;
}
--cut here--

gcc -O3:

_.211r.gcse2:

   15: L15:
    8: NOTE_INSN_BASIC_BLOCK 3
   10: xmm0:DF=0.0
   23: [`y']=xmm0:DF
      REG_DEAD xmm0:DF
   11: NOTE_INSN_DELETED
   12: {flags:CCZ=cmp(ax:SI-0x1,0);ax:SI=ax:SI-0x1;}
   13: pc={(flags:CCZ!=0)?L15:pc}
      REG_BR_PROB 0x26ab

Constant load remains inside the loop. It looks that gcse2 pass cares only for
loads from memory, but I see no reason why constant load should not be
considered. It looks like an oversight to me.

The same happens with:

--cut here--
volatile long long y;

void
test ()
{
  int z;

  for (z = 0; z < 1000; z++)
    y = 0x123456789;
}
--cut here--

_.211r.gcse2:

   15: L15:
    8: NOTE_INSN_BASIC_BLOCK 3
   23: dx:DI=0x123456789
   24: [`y']=dx:DI
      REG_DEAD dx:DI
   11: NOTE_INSN_DELETED
   12: {flags:CCZ=cmp(ax:SI-0x1,0);ax:SI=ax:SI-0x1;}
   13: pc={(flags:CCZ!=0)?L15:pc}
      REG_BR_PROB 0x26ab

resulting in:

.L3:
        movabsq $4886718345, %rdx
        subl    $1, %eax
        movq    %rdx, y(%rip)
        jne     .L3

Reconfirmed as rtl-optimization (gcse2 pass) problem.

[1] 4.8.0 20130124 (experimental) [trunk revision 195417]


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug rtl-optimization/47764] The constant load instruction should be hoisted out of loop
  2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
                   ` (4 preceding siblings ...)
  2013-01-24  7:25 ` [Bug rtl-optimization/47764] " ubizjak at gmail dot com
@ 2014-12-17  0:38 ` carrot at google dot com
  5 siblings, 0 replies; 7+ messages in thread
From: carrot at google dot com @ 2014-12-17  0:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

--- Comment #6 from Carrot <carrot at google dot com> ---
Another example for ppc. 
Following code is disassembled from sha1dgst.o in openssl which is compiled by
gcc

0000000000000000 <sha1_block_data_order>:
      ...
      80:       82 5a 52 3f     addis   r26,r18,23170
      84:       78 9a 4a 7e     xor     r10,r18,r19
      88:       08 00 c4 8a     lbz     r22,8(r4)
      8c:       88 00 1f ea     ld      r16,136(r31)
      90:       0b 00 a4 8b     lbz     r29,11(r4)
      94:       02 00 c4 8b     lbz     r30,2(r4)
      98:       99 79 5a 3b     addi    r26,r26,31129
      ...

it uses two instructions to do (r18 + 23170 << 16 + 31129), this large constant
is used many times. In following command line sha1.gcc is disassembled from
sha1dgst.o.

$ grep 31129 sha1.gcc | wc
     20     140     881
$ grep 23170 sha1.gcc | wc
     20     140     886

If we load this large constant into a register, and use this register later, we
can save 18 instructions.

There are more such cases in the same functions:

$ grep 28378 sha1.gcc | wc
     20     140     875
$ grep "\-5215" sha1.gcc | wc
     20     140     867
$ grep "\-28900" sha1.gcc | wc
     20     140     915
$ grep "\-17188" sha1.gcc | wc
     20     140     916
$ grep "\-13725" sha1.gcc | wc
     20     140     915
$ grep "\-15914" sha1.gcc | wc
     20     140     914

More worse, these codes are inside a hot loop.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-12-17  0:38 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-16  7:35 [Bug target/47764] New: The constant load instruction should be hoisted out of loop carrot at google dot com
2011-02-18 11:49 ` [Bug target/47764] " ibolton at gcc dot gnu.org
2011-02-19  7:06 ` pinskia at gcc dot gnu.org
2011-02-21  4:10 ` carrot at google dot com
2011-08-10 22:18 ` ramana at gcc dot gnu.org
2013-01-24  7:25 ` [Bug rtl-optimization/47764] " ubizjak at gmail dot com
2014-12-17  0:38 ` carrot at google dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).