public inbox for gcc-bugs@sourceware.org
* [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal
@ 2021-04-28 15:09 vda.linux at googlemail dot com
  2021-04-28 15:39 ` [Bug target/100320] [8/9/10/11/12 Regression] " jakub at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: vda.linux at googlemail dot com @ 2021-04-28 15:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

            Bug ID: 100320
           Summary: regression: 32-bit x86 memcpy is suboptimal
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vda.linux at googlemail dot com
  Target Milestone: ---

Bug 21329 has returned.

32-bit x86 memory block moves are using "movl $LEN,%ecx; rep movsl" insns.

However, for fixed short blocks it is more efficient to just repeat a few
"movsl" insns - this allows dropping the "mov $LEN,%ecx" insn.

It's shorter and, more importantly, "rep movsl" is a slow-start microcoded insn
(it is faster than moves using general-purpose registers only on blocks
larger than 100-200 bytes) - OTOH, a bare "movsl" is not microcoded and takes
~4 cycles to execute.

Bug 21329 was closed as fixed by this commit:

CVSROOT:        /cvs/gcc
Module name:    gcc
Branch:         gcc-4_0-rhl-branch
Changes by:     jakub@gcc.gnu.org       2005-05-18 19:08:44
Modified files:
        gcc            : ChangeLog 
        gcc/config/i386: i386.c 
Log message:
        2005-05-06  Denis Vlasenko  <vda@port.imtp.ilyichevsk.odessa.ua>
        Jakub Jelinek  <jakub@redhat.com>       
        PR target/21329
        * config/i386/i386.c (ix86_expand_movmem): Don't use rep; movsb
        for -Os if (movsl;)*(movsw;)?(movsb;)? sequence is shorter.
        Don't use rep; movs{l,q} if the repetition count is really small,
        instead use a sequence of movs{l,q} instructions.

(The above is commit 95935e2db5c45bef5631f51538d1e10d8b5b7524 in
gcc.gnu.org/git/gcc.git;
it seems that code was largely replaced by:
commit 8c996513856f2769aee1730cb211050fef055fb5
Author: Jan Hubicka <jh@suse.cz>
Date:   Mon Nov 27 17:00:26 2006 +010
    expr.c (emit_block_move_via_libcall): Export.
)


With gcc version 11.0.0 20210210 (Red Hat 11.0.0-0) (GCC) I see "rep movsl"s
again:

#include <string.h>
void *f(void *d, const void *s)
{ return memcpy(d, s, 16); }

$ gcc -Os -m32 -fomit-frame-pointer -c -o z.o z.c && objdump -drw z.o
z.o:     file format elf32-i386
Disassembly of section .text:
00000000 <f>:
   0:   57                      push   %edi
   1:   b9 04 00 00 00          mov    $0x4,%ecx
   6:   56                      push   %esi
   7:   8b 44 24 0c             mov    0xc(%esp),%eax
   b:   8b 74 24 10             mov    0x10(%esp),%esi
   f:   89 c7                   mov    %eax,%edi
  11:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  13:   5e                      pop    %esi
  14:   5f                      pop    %edi
  15:   c3                      ret 

The expected code would not have "mov $0x4,%ecx" and would have "rep movsl"
replaced by "movsl;movsl;movsl;movsl".
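
A rough code-size check of that claim, assuming the instruction encodings
visible in the dump above (b9 imm32 for "mov $LEN,%ecx", f3 a5 for "rep movsl",
a5 for a bare "movsl") plus the usual 66 a5 / a4 encodings for movsw / movsb.
The function names are made up for illustration; this is not GCC code, and it
counts bytes only, not cycles:

```c
#include <assert.h>
#include <stddef.h>

/* Encoded sizes in bytes for 32-bit mode, read off the objdump output. */
enum {
    SIZE_MOV_IMM32_ECX = 5, /* b9 xx xx xx xx : mov $LEN,%ecx          */
    SIZE_REP_MOVS      = 2, /* f3 a5 / f3 a4  : rep movsl / rep movsb  */
    SIZE_MOVSL         = 1, /* a5             : movsl                  */
    SIZE_MOVSW         = 2, /* 66 a5          : movsw (0x66 prefix)    */
    SIZE_MOVSB         = 1  /* a4             : movsb                  */
};

/* The rep form in the dump occupies offsets 0x1..0x5 and 0x11..0x12. */
static_assert(SIZE_MOV_IMM32_ECX + SIZE_REP_MOVS == 7,
              "mov $LEN,%ecx + rep movs is 7 bytes");

/* Bytes of code for "mov $COUNT,%ecx; rep movs{l,b}". */
size_t rep_sequence_size(void)
{
    return SIZE_MOV_IMM32_ECX + SIZE_REP_MOVS;
}

/* Bytes of code for an unrolled "(movsl;)*(movsw;)?(movsb;)?" sequence
   copying LEN bytes. */
size_t unrolled_sequence_size(size_t len)
{
    size_t size = (len / 4) * SIZE_MOVSL;
    if (len & 2)
        size += SIZE_MOVSW;
    if (len & 1)
        size += SIZE_MOVSB;
    return size;
}

/* Nonzero if the unrolled form is strictly shorter for a LEN-byte copy. */
int unrolled_is_shorter(size_t len)
{
    return unrolled_sequence_size(len) < rep_sequence_size();
}
```

With these numbers, the 16-byte copy needs 4 bytes of movsl versus 7 bytes for
the rep form, and the unrolled form also leaves %ecx untouched. For lengths
that are a multiple of 4, the two forms break even at 28 bytes.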

The testcase from bug 21329, which uses implicit block moves via struct copies,
        https://gcc.gnu.org/bugzilla/attachment.cgi?id=8790
also demonstrates it:

$ gcc -Os -m32 -fomit-frame-pointer -c -o z1.o z1.c && objdump -drw z1.o
z1.o:     file format elf32-i386
Disassembly of section .text:
00000000 <f10>:
   0:   a1 00 00 00 00          mov    0x0,%eax 1: R_386_32     w10
   5:   a3 00 00 00 00          mov    %eax,0x0 6: R_386_32     t10
   a:   c3                      ret    
0000000b <f20>:
   b:   a1 00 00 00 00          mov    0x0,%eax c: R_386_32     w20
  10:   8b 15 04 00 00 00       mov    0x4,%edx 12: R_386_32    w20
  16:   a3 00 00 00 00          mov    %eax,0x0 17: R_386_32    t20
  1b:   89 15 04 00 00 00       mov    %edx,0x4 1d: R_386_32    t20
  21:   c3                      ret    
00000022 <f21>:
  22:   57                      push   %edi
  23:   b9 09 00 00 00          mov    $0x9,%ecx
  28:   bf 00 00 00 00          mov    $0x0,%edi        29: R_386_32    t21
  2d:   56                      push   %esi
  2e:   be 00 00 00 00          mov    $0x0,%esi        2f: R_386_32    w21
  33:   f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
  35:   5e                      pop    %esi
  36:   5f                      pop    %edi
  37:   c3                      ret    
00000038 <f22>:
  38:   57                      push   %edi
  39:   b9 0a 00 00 00          mov    $0xa,%ecx
  3e:   bf 00 00 00 00          mov    $0x0,%edi        3f: R_386_32    t22
  43:   56                      push   %esi
  44:   be 00 00 00 00          mov    $0x0,%esi        45: R_386_32    w22
  49:   f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
  4b:   5e                      pop    %esi
  4c:   5f                      pop    %edi
  4d:   c3                      ret    
0000004e <f23>:
  4e:   57                      push   %edi
  4f:   b9 0b 00 00 00          mov    $0xb,%ecx
  54:   bf 00 00 00 00          mov    $0x0,%edi        55: R_386_32    t23
  59:   56                      push   %esi
  5a:   be 00 00 00 00          mov    $0x0,%esi        5b: R_386_32    w23
  5f:   f3 a4                   rep movsb %ds:(%esi),%es:(%edi)
  61:   5e                      pop    %esi
  62:   5f                      pop    %edi
  63:   c3                      ret    
00000064 <f30>:
  64:   57                      push   %edi
  65:   b9 03 00 00 00          mov    $0x3,%ecx
  6a:   bf 00 00 00 00          mov    $0x0,%edi        6b: R_386_32    t30
  6f:   56                      push   %esi
  70:   be 00 00 00 00          mov    $0x0,%esi        71: R_386_32    w30
  75:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  77:   5e                      pop    %esi
  78:   5f                      pop    %edi
  79:   c3                      ret    
0000007a <f40>:
  7a:   57                      push   %edi
  7b:   b9 04 00 00 00          mov    $0x4,%ecx
  80:   bf 00 00 00 00          mov    $0x0,%edi        81: R_386_32    t40
  85:   56                      push   %esi
  86:   be 00 00 00 00          mov    $0x0,%esi        87: R_386_32    w40
  8b:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  8d:   5e                      pop    %esi
  8e:   5f                      pop    %edi
  8f:   c3                      ret    
00000090 <f50>:
  90:   57                      push   %edi
  91:   b9 05 00 00 00          mov    $0x5,%ecx
  96:   bf 00 00 00 00          mov    $0x0,%edi        97: R_386_32    t50
  9b:   56                      push   %esi
  9c:   be 00 00 00 00          mov    $0x0,%esi        9d: R_386_32    w50
  a1:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  a3:   5e                      pop    %esi
  a4:   5f                      pop    %edi
  a5:   c3                      ret    
000000a6 <f60>:
  a6:   57                      push   %edi
  a7:   b9 06 00 00 00          mov    $0x6,%ecx
  ac:   bf 00 00 00 00          mov    $0x0,%edi        ad: R_386_32    t60
  b1:   56                      push   %esi
  b2:   be 00 00 00 00          mov    $0x0,%esi        b3: R_386_32    w60
  b7:   f3 a5                   rep movsl %ds:(%esi),%es:(%edi)
  b9:   5e                      pop    %esi
  ba:   5f                      pop    %edi
  bb:   c3                      ret    
000000bc <f>:
...
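
For readers without the attachment at hand, the kind of source that produces
such functions can be sketched as follows. This is a hypothetical
reconstruction, NOT the contents of attachment 8790: the symbol names are taken
from the relocations in the dump (w21/t21, w40/t40), the struct names are made
up, and the struct sizes are inferred from the counts loaded into %ecx.

```c
#include <assert.h>

/* Hypothetical reconstruction, not the real attachment. */
struct s21 { char c[9]; };   /* f21: mov $0x9,%ecx; rep movsb -> 9 bytes  */
struct s40 { char c[16]; };  /* f40: mov $0x4,%ecx; rep movsl -> 16 bytes */

static_assert(sizeof(struct s21) == 9,  "9-byte block");
static_assert(sizeof(struct s40) == 16, "16-byte block");

struct s21 t21, w21;
struct s40 t40, w40;

/* A struct assignment is an implicit block move of sizeof(struct) bytes;
   no memcpy call appears in the source. */
void f21(void) { t21 = w21; }
void f40(void) { t40 = w40; }
```

Compiling a file like this with the same flags as above should show whether the
block move is expanded as a rep insn or as a sequence of movs insns.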


* [Bug target/100320] [8/9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
@ 2021-04-28 15:39 ` jakub at gcc dot gnu.org
  2021-04-28 15:43 ` vda.linux at googlemail dot com
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-04-28 15:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |target
   Target Milestone|---                         |8.5
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org
   Last reconfirmed|                            |2021-04-28
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
            Summary|regression: 32-bit x86      |[8/9/10/11/12 Regression]
                   |memcpy is suboptimal        |32-bit x86 memcpy is
                   |                            |suboptimal

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Indeed, at least with -minline-all-stringops -Os -m32 -fomit-frame-pointer,
starting with r0-68071-g95935e2db5c45bef5631f51538d1e10d8b5b7524
it was a series of movsl insns, and starting most likely with
r0-77675-g8c996513856f2769aee1730cb211050fef055fb5
(can't know for sure, as the compiler then ICEs on it for a couple of
revisions) it is back to rep movsl.


* [Bug target/100320] [8/9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
  2021-04-28 15:39 ` [Bug target/100320] [8/9/10/11/12 Regression] " jakub at gcc dot gnu.org
@ 2021-04-28 15:43 ` vda.linux at googlemail dot com
  2021-04-29  7:10 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vda.linux at googlemail dot com @ 2021-04-28 15:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

--- Comment #2 from Denis Vlasenko <vda.linux at googlemail dot com> ---
The relevant code in current git seems to be:

static void
expand_set_or_cpymem_via_rep (rtx destmem, rtx srcmem,
                           rtx destptr, rtx srcptr, rtx value, rtx orig_value,
                           rtx count,
                           machine_mode mode, bool issetmem)
{
  rtx destexp;
  rtx srcexp;
  rtx countreg;
  HOST_WIDE_INT rounded_count;

  /* If possible, it is shorter to use rep movs.
     TODO: Maybe it is better to move this logic to decide_alg.  */
  if (mode == QImode && CONST_INT_P (count) && !(INTVAL (count) & 3)
      && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB
      && (!issetmem || orig_value == const0_rtx))
    mode = SImode;

  if (destptr != XEXP (destmem, 0) || GET_MODE (destmem) != BLKmode)
    destmem = adjust_automodify_address_nv (destmem, BLKmode, destptr, 0);

  countreg = ix86_zero_extend_to_Pmode (scale_counter (count,
                                                       GET_MODE_SIZE (mode)));
  if (mode != QImode)
    {
      destexp = gen_rtx_ASHIFT (Pmode, countreg,
                                GEN_INT (exact_log2 (GET_MODE_SIZE (mode))));
      destexp = gen_rtx_PLUS (Pmode, destexp, destptr);
    }
  else
    destexp = gen_rtx_PLUS (Pmode, destptr, countreg);
  if ((!issetmem || orig_value == const0_rtx) && CONST_INT_P (count))
    {
      rounded_count
        = ROUND_DOWN (INTVAL (count), (HOST_WIDE_INT) GET_MODE_SIZE (mode));
      destmem = shallow_copy_rtx (destmem);
      set_mem_size (destmem, rounded_count);
    }
  else if (MEM_SIZE_KNOWN_P (destmem))
    clear_mem_size (destmem);

  if (issetmem)
    {
      value = force_reg (mode, gen_lowpart (mode, value));
      emit_insn (gen_rep_stos (destptr, countreg, destmem, value, destexp));
    }
  else
    {
      if (srcptr != XEXP (srcmem, 0) || GET_MODE (srcmem) != BLKmode)
        srcmem = adjust_automodify_address_nv (srcmem, BLKmode, srcptr, 0);
      if (mode != QImode)
        {
          srcexp = gen_rtx_ASHIFT (Pmode, countreg,
                                   GEN_INT (exact_log2 (GET_MODE_SIZE (mode))));
          srcexp = gen_rtx_PLUS (Pmode, srcexp, srcptr);
        }
      else
        srcexp = gen_rtx_PLUS (Pmode, srcptr, countreg);
      if (CONST_INT_P (count))
        {
          rounded_count
            = ROUND_DOWN (INTVAL (count), (HOST_WIDE_INT) GET_MODE_SIZE (mode));
          srcmem = shallow_copy_rtx (srcmem);
          set_mem_size (srcmem, rounded_count);
        }
      else
        {
          if (MEM_SIZE_KNOWN_P (srcmem))
            clear_mem_size (srcmem);
        }
      emit_insn (gen_rep_mov (destptr, destmem, srcptr, srcmem, countreg,
                              destexp, srcexp));
    }
}
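
Read as plain C, the constant-count part of the function above seems to boil
down to the following. This is a simplified model with made-up names (rep_plan,
plan_rep_copy), ignoring the RTL plumbing and the setmem case; prefer_rep_movsb
stands in for TARGET_PREFER_KNOWN_REP_MOVSB_STOSB, and the division assumes
that scale_counter divides a constant count by the element size:

```c
#include <assert.h>
#include <stdbool.h>

struct rep_plan {
    unsigned elem_size;       /* 1 for rep movsb, 4 for rep movsl */
    unsigned long rep_count;  /* value loaded into %ecx           */
    unsigned long bytes;      /* bytes covered by the rep insn    */
};

/* Model of the mode choice and count scaling for a constant byte count. */
struct rep_plan plan_rep_copy(unsigned long byte_count, unsigned elem_size,
                              bool prefer_rep_movsb)
{
    struct rep_plan p;

    assert(elem_size == 1 || elem_size == 4);

    /* Mirrors: mode == QImode && !(INTVAL (count) & 3)
                && !TARGET_PREFER_KNOWN_REP_MOVSB_STOSB  =>  mode = SImode */
    if (elem_size == 1 && (byte_count & 3) == 0 && !prefer_rep_movsb)
        elem_size = 4;

    p.elem_size = elem_size;
    p.rep_count = byte_count / elem_size;           /* scale_counter */
    p.bytes = byte_count - byte_count % elem_size;  /* ROUND_DOWN    */
    return p;
}
```

For a 16-byte copy requested as byte moves this gives elem_size 4 and rep_count
4, matching the "mov $0x4,%ecx; rep movsl" in the report; a 9-byte copy stays
at rep movsb with a count of 9. Nothing in the quoted function looks at how
small the count is, which is consistent with a rep insn being emitted even for
a count of 4.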


* [Bug target/100320] [8/9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
  2021-04-28 15:39 ` [Bug target/100320] [8/9/10/11/12 Regression] " jakub at gcc dot gnu.org
  2021-04-28 15:43 ` vda.linux at googlemail dot com
@ 2021-04-29  7:10 ` rguenth at gcc dot gnu.org
  2021-05-14  9:54 ` [Bug target/100320] [9/10/11/12 " jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-29  7:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
           Keywords|                            |missed-optimization
             Target|                            |i?86-*-*


* [Bug target/100320] [9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
                   ` (2 preceding siblings ...)
  2021-04-29  7:10 ` rguenth at gcc dot gnu.org
@ 2021-05-14  9:54 ` jakub at gcc dot gnu.org
  2021-06-01  8:20 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-05-14  9:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|8.5                         |9.4

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 8 branch is being closed.


* [Bug target/100320] [9/10/11/12 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
                   ` (3 preceding siblings ...)
  2021-05-14  9:54 ` [Bug target/100320] [9/10/11/12 " jakub at gcc dot gnu.org
@ 2021-06-01  8:20 ` rguenth at gcc dot gnu.org
  2022-05-27  9:45 ` [Bug target/100320] [10/11/12/13 " rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-06-01  8:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.4                         |9.5

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.


* [Bug target/100320] [10/11/12/13 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
                   ` (4 preceding siblings ...)
  2021-06-01  8:20 ` rguenth at gcc dot gnu.org
@ 2022-05-27  9:45 ` rguenth at gcc dot gnu.org
  2022-06-28 10:44 ` jakub at gcc dot gnu.org
  2023-07-07 10:39 ` [Bug target/100320] [11/12/13/14 " rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-05-27  9:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.5                         |10.4

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 9 branch is being closed.


* [Bug target/100320] [10/11/12/13 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
                   ` (5 preceding siblings ...)
  2022-05-27  9:45 ` [Bug target/100320] [10/11/12/13 " rguenth at gcc dot gnu.org
@ 2022-06-28 10:44 ` jakub at gcc dot gnu.org
  2023-07-07 10:39 ` [Bug target/100320] [11/12/13/14 " rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-06-28 10:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.4                        |10.5

--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 10.4 is being released, retargeting bugs to GCC 10.5.


* [Bug target/100320] [11/12/13/14 Regression] 32-bit x86 memcpy is suboptimal
  2021-04-28 15:09 [Bug c/100320] New: regression: 32-bit x86 memcpy is suboptimal vda.linux at googlemail dot com
                   ` (6 preceding siblings ...)
  2022-06-28 10:44 ` jakub at gcc dot gnu.org
@ 2023-07-07 10:39 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-07 10:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100320

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|10.5                        |11.5

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 10 branch is being closed.

