[Bug tree-optimization/97875] New: suboptimal loop vectorization

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/97875] New: suboptimal loop vectorization
@ 2020-11-17 13:11 clyon at gcc dot gnu.org
  2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-11-17 13:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

            Bug ID: 97875
           Summary: suboptimal loop vectorization
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: clyon at gcc dot gnu.org
  Target Milestone: ---

Looking at the code generated for gcc.target/arm/simd/mve-vsub_1.c:
#include <stdint.h>

void test_vsub_i32 (int32_t * dest, int32_t * a, int32_t * b) {
  int i;
  for (i=0; i<4; i++) {
    dest[i] = a[i] - b[i];
  }
}

Compiled with -mfloat-abi=hard -mfpu=auto -march=armv8.1-m.main+mve -mthumb
-O3, we get:
test_vsub_i32:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     ip, r1, #4
        adds    r3, r2, #4
        sub     ip, r0, ip
        subs    r3, r0, r3
        cmp     ip, #8
        it      hi
        cmphi   r3, #8
        bls     .L2
        orr     r3, r2, r0
        orrs    r3, r3, r1
        lsls    r3, r3, #28
        bne     .L2
        vldrw.32        q3, [r1]
        vldrw.32        q2, [r2]
        vsub.i32        q3, q3, q2
        vstrw.32        q3, [r0]
        bx      lr
.L2:
        ldr     r3, [r1]
        push    {r4}
        ldr     r4, [r2]
        subs    r3, r3, r4
        str     r3, [r0]
        ldr     r4, [r2, #4]
        ldr     r3, [r1, #4]
        subs    r3, r3, r4
        str     r3, [r0, #4]
        ldr     r4, [r2, #8]
        ldr     r3, [r1, #8]
        subs    r3, r3, r4
        str     r3, [r0, #8]
        ldr     r3, [r1, #12]
        ldr     r2, [r2, #12]
        ldr     r4, [sp], #4
        subs    r3, r3, r2
        str     r3, [r0, #12]
        bx      lr


but only the short vectorized part is necessary:
        vldrw.32        q3, [r1]
        vldrw.32        q2, [r2]
        vsub.i32        q3, q3, q2
        vstrw.32        q3, [r0]
        bx      lr

Since the loop trip count is constant (=4), why isn't this better optimized?


If I declare 'dest' as __restrict__, I get something better, but still not
perfect:
test_vsub_i32:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        orr     r3, r2, r0
        orrs    r3, r3, r1
        lsls    r3, r3, #28
        bne     .L2
        vldrw.32        q3, [r1]
        vldrw.32        q2, [r2]
        vsub.i32        q3, q3, q2
        vstrw.32        q3, [r0]
        bx      lr
.L2:
        push    {r4, r5}
        ldr     r3, [r1]
        ldr     r4, [r2]
        subs    r4, r3, r4
        str     r4, [r0]
        ldr     r3, [r1, #4]
        ldr     r4, [r2, #4]
        subs    r5, r3, r4
        str     r5, [r0, #4]
        ldrd    r4, r3, [r1, #8]
        ldrd    r5, r1, [r2, #8]
        subs    r4, r4, r5
        subs    r3, r3, r1
        strd    r4, r3, [r0, #8]
        pop     {r4, r5}
        bx      lr



Compiling for cortex-a9 and Neon:
-mfloat-abi=hard -mcpu=cortex-a9 -mfpu=neon -O3
test_vsub_i32:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     ip, r2, #4
        adds    r3, r1, #4
        sub     ip, r0, ip
        subs    r3, r0, r3
        cmp     ip, #8
        it      hi
        cmphi   r3, #8
        bls     .L2
        vld1.32 {q8}, [r1]
        vld1.32 {q9}, [r2]
        vsub.i32        q8, q8, q9
        vst1.32 {q8}, [r0]
        bx      lr
.L2:
        ldr     r3, [r1]
        push    {r4}
        ldr     r4, [r2]
        subs    r3, r3, r4
        str     r3, [r0]
        ldr     r4, [r2, #4]
        ldr     r3, [r1, #4]
        subs    r3, r3, r4
        str     r3, [r0, #4]
        ldr     r4, [r2, #8]
        ldr     r3, [r1, #8]
        subs    r3, r3, r4
        ldr     r4, [sp], #4
        str     r3, [r0, #8]
        ldr     r3, [r1, #12]
        ldr     r2, [r2, #12]
        subs    r3, r3, r2
        str     r3, [r0, #12]
        bx      lr


But in this case adding __restrict__ works well:
test_vsub_i32:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vld1.32 {q8}, [r1]
        vld1.32 {q9}, [r2]
        vsub.i32        q8, q8, q9
        vst1.32 {q8}, [r0]
        bx      lr
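
For reference, the __restrict__ variant described above can be written as in the sketch below. Only 'dest' is qualified, as in the report; the function name and body are unchanged from the original test case.

```c
#include <stdint.h>

/* Same loop as in the report, with 'dest' declared __restrict__ so the
   compiler may assume the stores do not overlap the loads from a or b.  */
void test_vsub_i32 (int32_t * __restrict__ dest, int32_t * a, int32_t * b) {
  int i;
  for (i=0; i<4; i++) {
    dest[i] = a[i] - b[i];
  }
}
```

With this qualifier the caller promises that 'dest' does not alias 'a' or 'b', which is what allows the runtime alias test to be dropped.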

* [Bug tree-optimization/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
@ 2020-11-17 15:25 ` rguenth at gcc dot gnu.org
  2020-11-17 15:41 ` [Bug target/97875] " clyon at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-11-17 15:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-11-17
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |WAITING

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
restrict is necessary because of possible aliasing (you see the runtime alias
test).  For

        orr     r3, r2, r0
        orrs    r3, r3, r1
        lsls    r3, r3, #28
        bne     .L2

it looks like this is a runtime alignment test - does the specified arch
support unaligned vector loads/stores?
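
As a rough illustration of the two runtime tests discussed here, the following C sketch models what the MVE listing in the report appears to compute before choosing between the vector path and the scalar fallback at .L2. Addresses are modelled as plain 32-bit integers, and the helper names are hypothetical; this is a reading of the assembly, not code taken from the vectorizer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models "add x, src, #4; sub x, dest, x; cmp x, #8; bls .L2":
   the vector path is allowed only if the unsigned distance
   dest - (src + 4) is greater than 8.  */
static bool alias_check_passes (uint32_t dest, uint32_t src)
{
  return (uint32_t) (dest - (src + 4u)) > 8u;
}

/* Models "orr; orrs; lsls r3, r3, #28; bne .L2": the vector path is
   allowed only if the low 4 bits of all three addresses are zero.  */
static bool alignment_check_passes (uint32_t dest, uint32_t a, uint32_t b)
{
  return (uint32_t) ((dest | a | b) << 28) == 0;
}

/* Both tests must pass for the vldrw/vsub/vstrw sequence to run.  */
static bool use_vector_path (uint32_t dest, uint32_t a, uint32_t b)
{
  return alias_check_passes (dest, a)
	 && alias_check_passes (dest, b)
	 && alignment_check_passes (dest, a, b);
}
```

In this model, __restrict__ removes the need for the first test only; the second one remains, which matches the "better, but still not perfect" MVE output in the report.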

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
  2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
@ 2020-11-17 15:41 ` clyon at gcc dot gnu.org
  2020-11-18  8:17 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-11-17 15:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #2 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Checking the Arm v8-M manual, my understanding is that this architecture does
not support unaligned vector loads/stores.

However, my understanding is that vldrw.32 can load from addresses that are
aligned to 32 bits, which is the case here since a and b are pointers to int32_t.
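
To make the distinction concrete, here is a small sketch of a pointer that is correctly aligned for int32_t but not for a full 128-bit vector. The helper is_aligned and the buffer buf are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is a multiple of 'align' bytes.  */
static bool is_aligned (const void *p, size_t align)
{
  return ((uintptr_t) p % align) == 0;
}

/* A 16-byte aligned buffer.  &buf[1] is a perfectly valid int32_t
   pointer: it is 4-byte (element) aligned but not 16-byte aligned.  */
static _Alignas (16) int32_t buf[8];
```

Passing &buf[1] as 'a' to the test function would fail a 16-byte alignment test even though every element access is naturally aligned.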

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
  2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
  2020-11-17 15:41 ` [Bug target/97875] " clyon at gcc dot gnu.org
@ 2020-11-18  8:17 ` rguenth at gcc dot gnu.org
  2020-12-09 15:06 ` clyon at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-11-18  8:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
That would then point to DR_TARGET_ALIGNMENT being wrong here.  I am not sure
whether we can guarantee to pick the "correct" instruction at RTL expansion, but
surely the vectorizer can elide the runtime alignment check and emit vector
loads / stores that are only aligned to the vector element here.

You mention vldrw.32 but I assume the same applies to vstrw.32

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2020-11-18  8:17 ` rguenth at gcc dot gnu.org
@ 2020-12-09 15:06 ` clyon at gcc dot gnu.org
  2020-12-09 16:59 ` clyon at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-12-09 15:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #4 from Christophe Lyon <clyon at gcc dot gnu.org> ---

In both cases (Neon and MVE), DR_TARGET_ALIGNMENT is 8, so the decision to emit
a useless loop tail comes from elsewhere.

And yes, MVE vldrw.32 and vstrw.32 share the same alignment properties.

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2020-12-09 15:06 ` clyon at gcc dot gnu.org
@ 2020-12-09 16:59 ` clyon at gcc dot gnu.org
  2020-12-10 14:42 ` clyon at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-12-09 16:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #5 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Interestingly, if I make arm_builtin_support_vector_misalignment() behave the
same for MVE and Neon, the generated code (with __restrict__) becomes:
test_vsub_i32:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        push    {r4, r5, r6, r7, r8, r9, r10, fp}       @ 61    [c=8 l=4]  *push_multi
        ldrd    r10, fp, [r1, #8]       @ 75    [c=8 l=4]  *thumb2_ldrd
        ldrd    r6, r7, [r2, #8]        @ 76    [c=8 l=4]  *thumb2_ldrd
        ldr     r4, [r2]        @ 14    [c=12 l=4]  *thumb2_movsi_vfp/5
        ldr     r8, [r1]        @ 9     [c=12 l=4]  *thumb2_movsi_vfp/6
        ldr     r9, [r1, #4]    @ 10    [c=12 l=4]  *thumb2_movsi_vfp/6
        ldr     r5, [r2, #4]    @ 15    [c=12 l=4]  *thumb2_movsi_vfp/5
        vmov    d6, r8, r9  @ v4si      @ 35    [c=4 l=8]  *mve_movv4si/1
        vmov    d7, r10, fp
        vmov    d4, r4, r5  @ v4si      @ 36    [c=4 l=8]  *mve_movv4si/1
        vmov    d5, r6, r7
        sub     sp, sp, #16     @ 62    [c=4 l=4]  *arm_addsi3/11
        mov     r3, sp  @ 37    [c=4 l=2]  *thumb2_movsi_vfp/0
        vsub.i32        q3, q3, q2      @ 18    [c=80 l=4]  mve_vsubqv4si
        vstrw.32        q3, [r3]        @ 34    [c=4 l=4]  *mve_movv4si/7
        ldrd    r4, r1, [sp]    @ 77    [c=8 l=4]  *thumb2_ldrd_base
        ldrd    r2, r3, [sp, #8]        @ 78    [c=8 l=4]  *thumb2_ldrd
        strd    r4, r1, [r0]    @ 79    [c=8 l=4]  *thumb2_strd_base
        strd    r2, r3, [r0, #8]        @ 80    [c=8 l=4]  *thumb2_strd
        add     sp, sp, #16     @ 66    [c=4 l=4]  *arm_addsi3/5
        @ sp needed     @ 67    [c=8 l=0]  force_register_use
        pop     {r4, r5, r6, r7, r8, r9, r10, fp}       @ 68    [c=8 l=4]  *load_multiple_with_writeback
        bx      lr      @ 69    [c=8 l=4]  *thumb2_return


The Neon version has:
        vld1.32 {q8}, [r1]      @ 8     [c=8 l=4]  *movmisalignv4si_neon_load
        vld1.32 {q9}, [r2]      @ 9     [c=8 l=4]  *movmisalignv4si_neon_load
        vsub.i32        q8, q8, q9      @ 10    [c=80 l=4]  *subv4si3_neon
        vst1.32 {q8}, [r0]      @ 11    [c=8 l=4]  *movmisalignv4si_neon_store
        bx      lr      @ 21    [c=8 l=4]  *thumb2_return

So it seems MVE needs a movmisalign pattern.
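
In GCC, movmisalign<mode> is the standard pattern name for moving a value to or from memory that is not naturally aligned for its mode. As a hedged source-level illustration, the sketch below uses GCC's vector extensions to request a 128-bit access that is only guaranteed to be element-aligned. The typedef and function names are hypothetical, and whether a given target expands this through its movmisalign patterns is an assumption, not something this thread confirms.

```c
#include <stdint.h>

/* 128-bit vector of four int32_t whose required alignment is lowered to
   that of a single element.  GCC extension; hypothetical typedef name.  */
typedef int32_t v4si_elt_aligned
  __attribute__ ((vector_size (16), aligned (4), may_alias));

/* Whole-vector subtract on pointers that are only known to be
   4-byte aligned.  */
void vsub_v4si (int32_t *dest, const int32_t *a, const int32_t *b)
{
  v4si_elt_aligned va = *(const v4si_elt_aligned *) a;
  v4si_elt_aligned vb = *(const v4si_elt_aligned *) b;
  *(v4si_elt_aligned *) dest = va - vb;
}
```

Without a way to express such an access as a single vector load or store, the compiler has to fall back to something like the scalar loads and vmov sequence shown above.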

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2020-12-09 16:59 ` clyon at gcc dot gnu.org
@ 2020-12-10 14:42 ` clyon at gcc dot gnu.org
  2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
  2021-01-12 16:52 ` clyon at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-12-10 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Christophe Lyon <clyon at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |ASSIGNED

--- Comment #6 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Indeed, enabling movmisalign for MVE helps greatly.

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2020-12-10 14:42 ` clyon at gcc dot gnu.org
@ 2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
  2021-01-12 16:52 ` clyon at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-01-12 16:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Christophe Lyon <clyon@gcc.gnu.org>:

https://gcc.gnu.org/g:25bef68902f42f414f99626cefb2d3df81de7dc8

commit r11-6616-g25bef68902f42f414f99626cefb2d3df81de7dc8
Author: Christophe Lyon <christophe.lyon@linaro.org>
Date:   Tue Jan 12 16:47:27 2021 +0000

    arm: Add movmisalign patterns for MVE (PR target/97875)

    This patch adds new movmisalign<mode>_mve_load and store patterns for
    MVE to help vectorization. They are very similar to their Neon
    counterparts, but use different iterators and instructions.

    Indeed, MVE supports fewer vector modes than Neon, so we use the
    MVE_VLD_ST iterator where Neon uses VQX.

    Since the supported modes are different from the ones valid for
    arithmetic operators, we introduce two new sets of macros:

    ARM_HAVE_NEON_<MODE>_LDST
      true if Neon has vector load/store instructions for <MODE>

    ARM_HAVE_<MODE>_LDST
      true if any vector extension has vector load/store instructions for
      <MODE>

    We move the movmisalign<mode> expander from neon.md to vec-common.md, and
    replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.

    The patch also updates the mve-vneg.c test to scan for the better code
    generation when loading and storing the vectors involved: it checks
    that no 'orr' instruction is generated to cope with misalignment at
    runtime.
    This test was chosen among the other mve tests, but any other should
    be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a good
    test because the compiler chooses to use memcpy.

    For instance we now generate:
    test_vneg_s32x4:
            vldrw.32        q3, [r1]
            vneg.s32        q3, q3
            vstrw.32        q3, [r0]
            bx      lr

    instead of:
    test_vneg_s32x4:
            orr     r3, r1, r0
            lsls    r3, r3, #28
            bne     .L15
            vldrw.32        q3, [r1]
            vneg.s32  q3, q3
            vstrw.32        q3, [r0]
            bx      lr
            .L15:
            push    {r4, r5}
            ldrd    r2, r3, [r1, #8]
            ldrd    r5, r4, [r1]
            rsbs    r2, r2, #0
            rsbs    r5, r5, #0
            rsbs    r4, r4, #0
            rsbs    r3, r3, #0
            strd    r5, r4, [r0]
            pop     {r4, r5}
            strd    r2, r3, [r0, #8]
            bx      lr

    2021-01-12  Christophe Lyon  <christophe.lyon@linaro.org>

            PR target/97875
            gcc/
            * config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
            (ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
            (ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
            (ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
            (ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
            (ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
            (ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
            (ARM_HAVE_NEON_V2DI_LDST): Likewise.
            (ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
            (ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
            (ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST): Likewise.
            (ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST): Likewise.
            (ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST): Likewise.
            (ARM_HAVE_V2DI_LDST): Likewise.
            * config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
            (*movmisalign<mode>_mve_load): New pattern.
            * config/arm/neon.md (movmisalign<mode>): Move to ...
            * config/arm/vec-common.md: ... here.

            PR target/97875
            gcc/testsuite/
            * gcc.target/arm/simd/mve-vneg.c: Update test.
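
For readers without the testsuite at hand, the loop behind test_vneg_s32x4 presumably has the shape sketched below, alongside the plain copy loop that the commit message rules out as a test. Both signatures are assumptions inferred from the assembly (one load from r1, one store to r0, no runtime alias test); the real test file may generate its functions differently, and copy_s32x4 is a hypothetical name.

```c
#include <stdint.h>

/* Assumed shape of the negation loop checked by the updated test.  */
void test_vneg_s32x4 (int32_t * __restrict__ dest, int32_t *a) {
  int i;
  for (i=0; i<4; i++) {
    dest[i] = -a[i];
  }
}

/* Plain copy loop: per the commit message the compiler chooses to use
   memcpy here, so it does not exercise the vector load/store patterns.  */
void copy_s32x4 (int32_t * __restrict__ dest, int32_t *a) {
  int i;
  for (i=0; i<4; i++) {
    dest[i] = a[i];
  }
}
```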

* [Bug target/97875] suboptimal loop vectorization
  2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
@ 2021-01-12 16:52 ` clyon at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2021-01-12 16:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875

Christophe Lyon <clyon at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Fixed on trunk
