public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/61241] New: built-in memset makes the caller function slower than normal memset
@ 2014-05-20  1:33 ma.jiang at zte dot com.cn
  2014-05-20  1:38 ` [Bug rtl-optimization/61241] " ma.jiang at zte dot com.cn
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: ma.jiang at zte dot com.cn @ 2014-05-20  1:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241

            Bug ID: 61241
           Summary: built-in memset makes the caller function slower than
                    normal memset
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ma.jiang at zte dot com.cn

The following code, compiled with -O2,

#include <string.h>
extern int off;
void *test(char *a1, char* a2)
{
        memset(a2, 123, 123);
        return a2 + off;
}

gives this result:

        mov     ip, r1
        mov     r1, #123
        stmfd   sp!, {r3, lr}
        mov     r0, ip
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        mov     ip, r0
        ldr     r0, [r3]
        add     r0, ip, r0
        ldmfd   sp!, {r3, pc}

After adding -fno-builtin, the assembly code becomes shorter.

        stmfd   sp!, {r4, lr}
        mov     r4, r1
        mov     r1, #123
        mov     r0, r4
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        ldr     r0, [r3]
        add     r0, r4, r0
        ldmfd   sp!, {r4, pc}

One reason is that the ARM EABI requires the stack to be aligned to 8 bytes,
so a meaningless r3 is pushed. But that is not the most important reason.

When using the built-in memset, IRA knows that memset does not change the
value of r0. Choosing r0 instead of ip is then clearly more profitable,
because that choice gets rid of the redundant "mov ip, r0; mov r0, ip" pair.
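
(As background: ISO C guarantees that memset returns its first argument, which
is why the value of a2 is still available in r0 after the call. Below is a
minimal standalone illustration of that guarantee; the buffer and main function
are just for demonstration:)

#include <string.h>
#include <assert.h>

int main(void)
{
        char buf[123];
        /* memset returns its destination pointer unchanged.  */
        void *p = memset(buf, 123, sizeof buf);
        assert(p == buf);
        return 0;
}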

For this rtl sequence:

(insn 7 8 9 2 (set (reg:SI 0 r0)
        (reg/v/f:SI 115 [ a2 ])) open_test.c:5 186 {*arm_movsi_insn}
     (nil))
(insn 9 7 10 2 (set (reg:SI 2 r2)
        (reg:SI 1 r1)) open_test.c:5 186 {*arm_movsi_insn}
     (expr_list:REG_EQUAL (const_int 123 [0x7b])
        (nil)))
(call_insn 10 9 24 2 (parallel [
            (set (reg:SI 0 r0)
                (call (mem:SI (symbol_ref:SI ("memset") [flags 0x41] 
<function_decl 0xb7d72500 memset>) [0 __builtin_memset S4 A32])
                    (const_int 0 [0])))
            (use (const_int 0 [0]))
            (clobber (reg:SI 14 lr))
        ]) open_test.c:5 251 {*call_value_symbol}
     (expr_list:REG_RETURNED (reg/v/f:SI 115 [ a2 ])
        (expr_list:REG_DEAD (reg:SI 2 r2)
            (expr_list:REG_DEAD (reg:SI 1 r1)
                (expr_list:REG_UNUSED (reg:SI 0 r0)
                    (expr_list:REG_EH_REGION (const_int 0 [0])
                        (nil))))))
    (expr_list:REG_CFA_WINDOW_SAVE (set (reg:SI 0 r0)
            (reg:SI 0 r0))
        (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 2 r2))
            (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 1 r1))
                (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 0 r0))
                    (nil))))))

Assigning r0 to r115 is blocked by two pieces of code in
process_bb_node_lives (in ira-lives.c).

1:
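      /* Mark each defined value as live.  */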
      call_p = CALL_P (insn);
      for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
        if (!call_p || !DF_REF_FLAGS_IS_SET (*def_rec, DF_REF_MAY_CLOBBER))
          mark_ref_live (*def_rec);
2:
      /* Mark each used value as live.  */
      for (use_rec = DF_INSN_USES (insn); *use_rec; use_rec++)
        mark_ref_live (*use_rec);

In piece 1, "(set (reg:SI 0 r0) (reg/v/f:SI 115))" makes r0 conflict with
r115 while r115 is live. This is unnecessary, because "(set (reg:SI 0 r0)
(reg:SI 0 r0))" would not hurt any other instruction. Making r0 conflict with
all live pseudo registers loses the chance to optimize such a set instruction.
I think that, at least for a simple single set, we should not make the source
register conflict with the destination register when one of them is a hard
register and the other is not.
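
(For illustration only, and not the attached patch: one hypothetical shape
such a check could take, using the standard GCC predicates single_set, REG_P
and HARD_REGISTER_P; the actual patch may do it differently.)

      /* Rough sketch, not the actual patch: detect a simple single set that
         copies between a hard register and a pseudo, and avoid marking the
         source as conflicting with the destination in that case.  */
      rtx set = single_set (insn);
      if (set && REG_P (SET_SRC (set)) && REG_P (SET_DEST (set))
          && HARD_REGISTER_P (SET_SRC (set)) != HARD_REGISTER_P (SET_DEST (set)))
        {
          /* ... do not record a conflict for this copy ... */
        }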

In piece 2, after the call to memset, r0 becomes live and therefore conflicts
with the live r115. This code neglects the fact that r115 is the register
found by find_call_crossed_cheap_reg, and in fact r115 holds the same value
as r0.
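
(Likewise only a hypothetical sketch: the REG_RETURNED note on the call_insn
already names the pseudo known to hold the returned value, so the code that
marks the return register live could, in principle, consult it along these
lines; find_reg_note is a standard helper, the rest is an assumption.)

      /* Hypothetical sketch, not the actual patch: REG_RETURNED names the
         pseudo (r115 here) that equals the call's return value, so the
         return-value hard register need not conflict with it.  */
      rtx note = find_reg_note (insn, REG_RETURNED, NULL_RTX);
      if (note != NULL_RTX && REG_P (XEXP (note, 0)))
        {
          /* ... treat XEXP (note, 0) and the return register as
             non-conflicting ... */
        }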

As discussed above, these two pieces of code prevent IRA from making the more
profitable choice. I have built a patch to fix this problem. With the patch,
the assembly code for the built-in memset becomes shorter than for the normal
memset.

        mov     r0, r1
        mov     r1, #123
        stmfd   sp!, {r3, lr}
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        ldr     r3, [r3]
        add     r0, r0, r3
        ldmfd   sp!, {r3, pc}

I have done a bootstrap and "make check" on x86; nothing changes after the
patch. Is the patch OK for trunk?



* [Bug rtl-optimization/61241] built-in memset makes the caller function slower than normal memset
  2014-05-20  1:33 [Bug rtl-optimization/61241] New: built-in memset makes the caller function slower than normal memset ma.jiang at zte dot com.cn
@ 2014-05-20  1:38 ` ma.jiang at zte dot com.cn
  2014-05-20  1:51 ` ma.jiang at zte dot com.cn
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: ma.jiang at zte dot com.cn @ 2014-05-20  1:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241

--- Comment #1 from ma.jiang at zte dot com.cn ---
Created attachment 32822
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32822&action=edit
proposed patch



* [Bug rtl-optimization/61241] built-in memset makes the caller function slower than normal memset
  2014-05-20  1:33 [Bug rtl-optimization/61241] New: built-in memset makes the caller function slower than normal memset ma.jiang at zte dot com.cn
  2014-05-20  1:38 ` [Bug rtl-optimization/61241] " ma.jiang at zte dot com.cn
@ 2014-05-20  1:51 ` ma.jiang at zte dot com.cn
  2014-05-20  8:36 ` [Bug rtl-optimization/61241] built-in memset makes the caller function slower ktkachov at gcc dot gnu.org
  2014-05-20 14:13 ` ma.jiang at zte dot com.cn
  3 siblings, 0 replies; 5+ messages in thread
From: ma.jiang at zte dot com.cn @ 2014-05-20  1:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241

--- Comment #2 from ma.jiang at zte dot com.cn ---
Created attachment 32823
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32823&action=edit
testcase

It should be put into gcc/testsuite/gcc.target/arm.



* [Bug rtl-optimization/61241] built-in memset makes the caller function slower
  2014-05-20  1:33 [Bug rtl-optimization/61241] New: built-in memset makes the caller function slower than normal memset ma.jiang at zte dot com.cn
  2014-05-20  1:38 ` [Bug rtl-optimization/61241] " ma.jiang at zte dot com.cn
  2014-05-20  1:51 ` ma.jiang at zte dot com.cn
@ 2014-05-20  8:36 ` ktkachov at gcc dot gnu.org
  2014-05-20 14:13 ` ma.jiang at zte dot com.cn
  3 siblings, 0 replies; 5+ messages in thread
From: ktkachov at gcc dot gnu.org @ 2014-05-20  8:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Can you please send the patch to gcc-patches@gcc.gnu.org, including a ChangeLog?



* [Bug rtl-optimization/61241] built-in memset makes the caller function slower
  2014-05-20  1:33 [Bug rtl-optimization/61241] New: built-in memset makes the caller function slower than normal memset ma.jiang at zte dot com.cn
                   ` (2 preceding siblings ...)
  2014-05-20  8:36 ` [Bug rtl-optimization/61241] built-in memset makes the caller function slower ktkachov at gcc dot gnu.org
@ 2014-05-20 14:13 ` ma.jiang at zte dot com.cn
  3 siblings, 0 replies; 5+ messages in thread
From: ma.jiang at zte dot com.cn @ 2014-05-20 14:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241

--- Comment #4 from ma.jiang at zte dot com.cn ---
(In reply to ktkachov from comment #3)
> Can you please send the patch to gcc-patches@gcc.gnu.org, including a
> ChangeLog?
Done! Thanks.


