[Bug c/102125] New: (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

* [Bug c/102125] New: (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: jankowski938 at gmail dot com @ 2021-08-30  8:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

            Bug ID: 102125
           Summary: (ARM Cortex-M3 and newer) missed optimization. memcpy
                    not needed operations
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jankowski938 at gmail dot com
  Target Milestone: ---

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

Compiler options:
-Ox -mthumb -mcpu=cortex-my

where x is one of 2, 3, s and y is one of 3, 4, 7.

```
bar64:
        sub     sp, sp, #8
        mov     r2, r0
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r2, #4]      @ unaligned
        mov     r3, sp
        stmia   r3!, {r0, r1}
        ldrd    r0, [sp]
        add     sp, sp, #8
        bx      lr
```

It would be enough to emit:

```
        mov     r3, r0
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r3, #4]      @ unaligned
        bx      lr
```

The 32-bit memcpy is optimized correctly.

Full example code:

```
uint64_t foo64(const uint8_t *rData1)
{
    uint64_t buffer;
    buffer = ((uint64_t)rData1[7] << 56) | ((uint64_t)rData1[6] << 48) |
             ((uint64_t)rData1[5] << 40) | ((uint64_t)rData1[4] << 32) |
             ((uint64_t)rData1[3] << 24) | ((uint64_t)rData1[2] << 16) |
             ((uint64_t)rData1[1] << 8)  | rData1[0];
    return buffer;
}

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

uint32_t foo32(const uint8_t *rData1)
{
    uint32_t buffer;
    buffer = ((uint32_t)rData1[3] << 24) | ((uint32_t)rData1[2] << 16) |
             ((uint32_t)rData1[1] << 8)  | rData1[0];
    return buffer;
}

uint32_t bar32(const uint8_t *rData1)
{
    uint32_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}
```

compiler output:
```
foo64:
        mov     r3, r0
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r3, #4]      @ unaligned
        bx      lr
bar64:
        sub     sp, sp, #8
        mov     r2, r0
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r2, #4]      @ unaligned
        mov     r3, sp
        stmia   r3!, {r0, r1}
        ldrd    r0, [sp]
        add     sp, sp, #8
        bx      lr
foo32:
        ldr     r0, [r0]  @ unaligned
        bx      lr
bar32:
        ldr     r0, [r0]  @ unaligned
        bx      lr
```

Clang compiles without overhead:

https://godbolt.org/z/P7G7Whxqz
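
The report relies on foo64 and bar64 computing the same value, which is why identical code is expected for both. A minimal sketch of that equivalence follows, assuming a little-endian host; the function names `load_shift64`, `load_memcpy64` and `host_is_little_endian` are illustrative and do not appear in the report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise little-endian assembly, equivalent to foo64 in the report. */
uint64_t load_shift64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];   /* p[7] ends up in the top byte */
    return v;
}

/* memcpy-based load, equivalent to bar64 in the report. */
uint64_t load_memcpy64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Returns 1 when the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little-endian host, `load_shift64(p) == load_memcpy64(p)` holds for any readable 8-byte buffer, including one at an odd address. On a big-endian host the two differ, so the comparison should be guarded with `host_is_little_endian()`.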

* [Bug c/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: jankowski938 at gmail dot com @ 2021-08-30  8:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #1 from Piotr <jankowski938 at gmail dot com> ---
IMO this is quite important, as `memcpy` type punning is considered the safest approach.
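
For context, a minimal sketch of what `memcpy` type punning looks like in practice. It assumes `float` is a 32-bit IEEE 754 binary32 type; the helper names `float_bits` and `bits_to_float` are illustrative. Unlike casting a `float *` to a `uint32_t *` and dereferencing it, copying the bytes does not run into the aliasing or alignment rules of the language.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float and uint32_t have the same size on this target. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Reinterpret the object representation of a float as an integer. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse direction: rebuild a float from its bit pattern. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

With an IEEE 754 `float`, `float_bits(1.0f)` yields `0x3F800000`, and `bits_to_float(float_bits(x))` returns `x` unchanged.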

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: rguenth at gcc dot gnu.org @ 2021-08-30 11:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|c                           |target
   Last reconfirmed|                            |2021-08-30
             Target|                            |arm
           Keywords|                            |missed-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
One common source of missed optimizations is gimple_fold_builtin_memory_op
which has

      /* If we can perform the copy efficiently with first doing all loads
         and then all stores inline it that way.  Currently efficiently
         means that we can load all the memory into a single integer
         register which is what MOVE_MAX gives us.  */
      src_align = get_pointer_alignment (src);
      dest_align = get_pointer_alignment (dest);
      if (tree_fits_uhwi_p (len)
          && compare_tree_int (len, MOVE_MAX) <= 0
...
                  /* If the destination pointer is not aligned we must be able
                     to emit an unaligned store.  */
                  && (dest_align >= GET_MODE_ALIGNMENT (mode)
                      || !targetm.slow_unaligned_access (mode, dest_align)
                      || (optab_handler (movmisalign_optab, mode)
                          != CODE_FOR_nothing)))

where here likely the MOVE_MAX limit applies (it is 4).  Since we actually
do need to perform two loads the code seems to do what is intended (but
that's of course "bad" for 64bit copies on 32bit archs and likewise for
128bit copies on 64bit archs).

It's usually too late for RTL memcpy expansion to fully elide stack storage.

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: pinskia at gcc dot gnu.org @ 2021-08-30 19:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=91674

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I suspect PR 91674 is the same.

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: jankowski938 at gmail dot com @ 2021-08-30 20:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #4 from Piotr <jankowski938 at gmail dot com> ---
        mov     r3, r0
        ldr     r0, [r0]  @ unaligned
        ldr     r1, [r3, #4]      @ unaligned
        bx      lr

can be optimized even further:

        ldr     r1, [r0, #4]      @ unaligned
        ldr     r0, [r0]  @ unaligned
        bx      lr

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: rearnsha at gcc dot gnu.org @ 2021-08-31 11:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #5 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
Testcase was not quite complete.  Extending it to:

typedef unsigned long long uint64_t;
typedef unsigned long uint32_t;
typedef unsigned char uint8_t;
uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    __builtin_memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

uint32_t bar32(const uint8_t *rData1)
{
    uint32_t buffer;
    __builtin_memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

and then looking at the optimized tree output we see:


;; Function bar64 (bar64, funcdef_no=0, decl_uid=4196, cgraph_uid=1, symbol_order=0)

uint64_t bar64 (const uint8_t * rData1)
{
  uint64_t buffer;
  uint64_t _4;

  <bb 2> [local count: 1073741824]:
  __builtin_memcpy (&buffer, rData1_2(D), 8);
  _4 = buffer;
  buffer ={v} {CLOBBER};
  return _4;

}



;; Function bar32 (bar32, funcdef_no=1, decl_uid=4200, cgraph_uid=2, symbol_order=1)

uint32_t bar32 (const uint8_t * rData1)
{
  unsigned int _3;

  <bb 2> [local count: 1073741824]:
  _3 = MEM <unsigned int> [(char * {ref-all})rData1_2(D)];
  return _3;

}

So in the 32-bit case we've eliminated the memcpy at the tree level, but failed
to do that for 64-bit objects.

We probably need to add 64-bit support to the movmisalign<mode> pattern.
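
Conceptually, a misaligned 64-bit move done in core registers amounts to two 32-bit loads, one per half. The C sketch below only illustrates that decomposition and is not the machine-description pattern itself; it assumes a little-endian layout (low word at the lower address), and the names `load_u32_unaligned` and `load_u64_as_two_words` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load one possibly misaligned 32-bit word without violating C's rules. */
uint32_t load_u32_unaligned(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Build the 64-bit value from two word loads, mirroring an ldr/ldr pair.
   Assumption: little-endian, so the word at p is the low half. */
uint64_t load_u64_as_two_words(const uint8_t *p)
{
    uint32_t lo = load_u32_unaligned(p);
    uint32_t hi = load_u32_unaligned(p + 4);
    return ((uint64_t)hi << 32) | lo;
}
```

On a little-endian host the result matches a direct 8-byte `memcpy` from the same address, with no stack temporary needed to combine the halves.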

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: rearnsha at gcc dot gnu.org @ 2021-08-31 16:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #6 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> One common source of missed optimizations is gimple_fold_builtin_memory_op
> which has [...]

Yes, this is the source of the problem.  I wonder if this should be scaled by
something like MOVE_RATIO to get a more acceptable limit, especially at higher
optimization levels.
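
A simplified model of the two size limits under discussion, reduced to the length comparison alone. The real check in gimple_fold_builtin_memory_op also involves the alignment and movmisalign conditions quoted in comment #2, which this sketch leaves out. The function names are hypothetical, `move_max` and `move_ratio` stand in for the target macros, and the ratio of 2 used below is an illustrative value rather than the actual Arm setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Existing limit: the whole copy must fit in a single move of
   at most move_max bytes. */
bool folds_under_move_max(size_t len, size_t move_max)
{
    return len <= move_max;
}

/* Scaled limit suggested here: allow up to move_ratio such moves. */
bool folds_under_scaled_limit(size_t len, size_t move_max, size_t move_ratio)
{
    return len <= move_max * move_ratio;
}
```

With `move_max` of 4, as stated in comment #2, an 8-byte copy fails the first test but passes the second once `move_ratio` is at least 2, which is the case the bar64 testcase needs.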

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: cvs-commit at gcc dot gnu.org @ 2021-09-13 10:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Earnshaw <rearnsha@gcc.gnu.org>:

https://gcc.gnu.org/g:408e8b906632f215f6652b8851bba612cde07c25

commit r12-3480-g408e8b906632f215f6652b8851bba612cde07c25
Author: Richard Earnshaw <rearnsha@arm.com>
Date:   Thu Sep 9 10:56:01 2021 +0100

    rtl: directly handle MEM in gen_highpart [PR102125]

    gen_lowpart_general handles forming a lowpart of a MEM by using
    adjust_address to rework and validate a new version of the MEM.
    Do the same for gen_highpart rather than calling simplify_gen_subreg
    for this case.

    gcc/ChangeLog:

            PR target/102125
            * emit-rtl.c (gen_highpart): Use adjust_address to handle
            MEM rather than calling simplify_gen_subreg.

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: cvs-commit at gcc dot gnu.org @ 2021-09-13 10:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Earnshaw <rearnsha@gcc.gnu.org>:

https://gcc.gnu.org/g:f0cfd070b68772eaaa19a3b711fbd9e85b244240

commit r12-3481-gf0cfd070b68772eaaa19a3b711fbd9e85b244240
Author: Richard Earnshaw <rearnsha@arm.com>
Date:   Fri Sep 3 16:53:13 2021 +0100

    arm: expand handling of movmisalign for DImode [PR102125]

    DImode is currently handled only for machines with vector modes
    enabled, but this is unduly restrictive and is generally better done
    in core registers.

    gcc/ChangeLog:

            PR target/102125
            * config/arm/arm.md (movmisaligndi): New define_expand.
            * config/arm/vec-common.md (movmisalign<mode>): Iterate over VDQ
            mode.

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: cvs-commit at gcc dot gnu.org @ 2021-09-13 10:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Earnshaw <rearnsha@gcc.gnu.org>:

https://gcc.gnu.org/g:5f6a6c91d7c592cb49f7c519f289777eac09bb74

commit r12-3482-g5f6a6c91d7c592cb49f7c519f289777eac09bb74
Author: Richard Earnshaw <rearnsha@arm.com>
Date:   Fri Sep 3 17:06:15 2021 +0100

    gimple: allow more folding of memcpy [PR102125]

    The current restriction on folding memcpy to a single element of size
    MOVE_MAX is excessively cautious on most machines and limits some
    significant further optimizations.  So relax the restriction provided
    the copy size does not exceed MOVE_MAX * MOVE_RATIO and that a SET
    insn exists for moving the value into machine registers.

    Note that there were already checks in place for having misaligned
    move operations when one or more of the operands were unaligned.

    On Arm this now permits optimizing

    uint64_t bar64(const uint8_t *rData1)
    {
        uint64_t buffer;
        memcpy(&buffer, rData1, sizeof(buffer));
        return buffer;
    }

    from
            ldr     r2, [r0]        @ unaligned
            sub     sp, sp, #8
            ldr     r3, [r0, #4]    @ unaligned
            strd    r2, [sp]
            ldrd    r0, [sp]
            add     sp, sp, #8

    to
            mov     r3, r0
            ldr     r0, [r0]        @ unaligned
            ldr     r1, [r3, #4]    @ unaligned

    PR target/102125 - (ARM Cortex-M3 and newer) missed optimization. memcpy
    not needed operations

    gcc/ChangeLog:

            PR target/102125
            * gimple-fold.c (gimple_fold_builtin_memory_op): Allow folding
            memcpy if the size is not more than MOVE_MAX * MOVE_RATIO.

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: rearnsha at gcc dot gnu.org @ 2021-09-13 10:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #10 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
Fixed on master branch.

* [Bug target/102125] (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations
From: cvs-commit at gcc dot gnu.org @ 2022-03-23 14:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102125

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d9792f8d227cdd409c2b082ef0685b47ccfaa334

commit r12-7786-gd9792f8d227cdd409c2b082ef0685b47ccfaa334
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Mar 23 14:53:49 2022 +0100

    target/102125 - alternative memcpy folding improvement

    The following extends the heuristical memcpy folding path with the
    ability to use misaligned accesses on strict-alignment targets just
    like the size-based path does.  That avoids regressing the following
    testcase on arm

        uint64_t bar64(const uint8_t *rData1)
        {
            uint64_t buffer;
            memcpy(&buffer, rData1, sizeof(buffer));
            return buffer;
        }

    when r12-3482-g5f6a6c91d7c592 is reverted.

    2022-03-23  Richard Biener  <rguenther@suse.de>

            PR target/102125
            * gimple-fold.cc (gimple_fold_builtin_memory_op): Allow the
            use of movmisalign when either the source or destination
            decl is properly aligned.
