public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/97875] New: suboptimal loop vectorization
@ 2020-11-17 13:11 clyon at gcc dot gnu.org
2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-11-17 13:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
Bug ID: 97875
Summary: suboptimal loop vectorization
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: clyon at gcc dot gnu.org
Target Milestone: ---
Looking at the code generated for gcc.target/arm/simd/mve-vsub_1.c:
#include <stdint.h>
void test_vsub_i32 (int32_t * dest, int32_t * a, int32_t * b) {
int i;
for (i=0; i<4; i++) {
dest[i] = a[i] - b[i];
}
}
Compiled with -mfloat-abi=hard -mfpu=auto -march=armv8.1-m.main+mve -mthumb
-O3, we get:
test_vsub_i32:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add ip, r1, #4
adds r3, r2, #4
sub ip, r0, ip
subs r3, r0, r3
cmp ip, #8
it hi
cmphi r3, #8
bls .L2
orr r3, r2, r0
orrs r3, r3, r1
lsls r3, r3, #28
bne .L2
vldrw.32 q3, [r1]
vldrw.32 q2, [r2]
vsub.i32 q3, q3, q2
vstrw.32 q3, [r0]
bx lr
.L2:
ldr r3, [r1]
push {r4}
ldr r4, [r2]
subs r3, r3, r4
str r3, [r0]
ldr r4, [r2, #4]
ldr r3, [r1, #4]
subs r3, r3, r4
str r3, [r0, #4]
ldr r4, [r2, #8]
ldr r3, [r1, #8]
subs r3, r3, r4
str r3, [r0, #8]
ldr r3, [r1, #12]
ldr r2, [r2, #12]
ldr r4, [sp], #4
subs r3, r3, r2
str r3, [r0, #12]
bx lr
but only the short vectorized part is necessary:
vldrw.32 q3, [r1]
vldrw.32 q2, [r2]
vsub.i32 q3, q3, q2
vstrw.32 q3, [r0]
bx lr
Since the loop trip count is constant (=4), why isn't this better optimized?
If I declare 'dest' as __restrict__, I get something better, but still not
perfect:
test_vsub_i32:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
orr r3, r2, r0
orrs r3, r3, r1
lsls r3, r3, #28
bne .L2
vldrw.32 q3, [r1]
vldrw.32 q2, [r2]
vsub.i32 q3, q3, q2
vstrw.32 q3, [r0]
bx lr
.L2:
push {r4, r5}
ldr r3, [r1]
ldr r4, [r2]
subs r4, r3, r4
str r4, [r0]
ldr r3, [r1, #4]
ldr r4, [r2, #4]
subs r5, r3, r4
str r5, [r0, #4]
ldrd r4, r3, [r1, #8]
ldrd r5, r1, [r2, #8]
subs r4, r4, r5
subs r3, r3, r1
strd r4, r3, [r0, #8]
pop {r4, r5}
bx lr
Compiling for cortex-a9 and Neon:
-mfloat-abi=hard -mcpu=cortex-a9 -mfpu=neon -O3
test_vsub_i32:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add ip, r2, #4
adds r3, r1, #4
sub ip, r0, ip
subs r3, r0, r3
cmp ip, #8
it hi
cmphi r3, #8
bls .L2
vld1.32 {q8}, [r1]
vld1.32 {q9}, [r2]
vsub.i32 q8, q8, q9
vst1.32 {q8}, [r0]
bx lr
.L2:
ldr r3, [r1]
push {r4}
ldr r4, [r2]
subs r3, r3, r4
str r3, [r0]
ldr r4, [r2, #4]
ldr r3, [r1, #4]
subs r3, r3, r4
str r3, [r0, #4]
ldr r4, [r2, #8]
ldr r3, [r1, #8]
subs r3, r3, r4
ldr r4, [sp], #4
str r3, [r0, #8]
ldr r3, [r1, #12]
ldr r2, [r2, #12]
subs r3, r3, r2
str r3, [r0, #12]
bx lr
But in this case adding __restrict__ works well:
test_vsub_i32:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
vld1.32 {q8}, [r1]
vld1.32 {q9}, [r2]
vsub.i32 q8, q8, q9
vst1.32 {q8}, [r0]
bx lr
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug tree-optimization/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
@ 2020-11-17 15:25 ` rguenth at gcc dot gnu.org
2020-11-17 15:41 ` [Bug target/97875] " clyon at gcc dot gnu.org
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-11-17 15:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2020-11-17
Ever confirmed|0 |1
Status|UNCONFIRMED |WAITING
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
restrict is necessary because of possible aliasing (you see the runtime alias
test). For
orr r3, r2, r0
orrs r3, r3, r1
lsls r3, r3, #28
bne .L2
it looks like this is a runtime alignment test - does the specified arch
support unaligned vector loads/stores?
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
@ 2020-11-17 15:41 ` clyon at gcc dot gnu.org
2020-11-18 8:17 ` rguenth at gcc dot gnu.org
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-11-17 15:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
--- Comment #2 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Checking the Arm v8-M manual, my understanding is that this architecture does
not support unaligned vector loads/stores.
However, my understanding is that vldrw.32 accepts to load from addresses
aligned on 32 bits, which is the case since a and b are pointers to int32_t.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
2020-11-17 15:41 ` [Bug target/97875] " clyon at gcc dot gnu.org
@ 2020-11-18 8:17 ` rguenth at gcc dot gnu.org
2020-12-09 15:06 ` clyon at gcc dot gnu.org
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-11-18 8:17 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |rsandifo at gcc dot gnu.org
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
That would then point to DR_TARGET_ALIGNMENT being wrong here. Now, not sure
whether we can guarantee to pick the "correct" instruction at RTL expansion but
surely the vectorizer can elide the runtime alignment check and emit
appropriately aligned (to vector element) vector loads / stores here.
You mention vldrw.32 but I assume the same applies to vstrw.32
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
` (2 preceding siblings ...)
2020-11-18 8:17 ` rguenth at gcc dot gnu.org
@ 2020-12-09 15:06 ` clyon at gcc dot gnu.org
2020-12-09 16:59 ` clyon at gcc dot gnu.org
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-12-09 15:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
--- Comment #4 from Christophe Lyon <clyon at gcc dot gnu.org> ---
In both cases (Neon and MVE), DR_TARGET_ALIGNMENT is 8, so the decision to emit
a useless loop tail comes from elsewhere.
And yes, MVE vldrw.32 and vstrw.32 share the same alignment properties.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
` (3 preceding siblings ...)
2020-12-09 15:06 ` clyon at gcc dot gnu.org
@ 2020-12-09 16:59 ` clyon at gcc dot gnu.org
2020-12-10 14:42 ` clyon at gcc dot gnu.org
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-12-09 16:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
--- Comment #5 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Interestingly, if I make arm_builtin_support_vector_misalignment() behave the
same for MVE and Neon, the generated code (with __restrict__) becomes:
test_vsub_i32:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
push {r4, r5, r6, r7, r8, r9, r10, fp} @ 61 [c=8 l=4]
*push_multi
ldrd r10, fp, [r1, #8] @ 75 [c=8 l=4] *thumb2_ldrd
ldrd r6, r7, [r2, #8] @ 76 [c=8 l=4] *thumb2_ldrd
ldr r4, [r2] @ 14 [c=12 l=4] *thumb2_movsi_vfp/5
ldr r8, [r1] @ 9 [c=12 l=4] *thumb2_movsi_vfp/6
ldr r9, [r1, #4] @ 10 [c=12 l=4] *thumb2_movsi_vfp/6
ldr r5, [r2, #4] @ 15 [c=12 l=4] *thumb2_movsi_vfp/5
vmov d6, r8, r9 @ v4si @ 35 [c=4 l=8] *mve_movv4si/1
vmov d7, r10, fp
vmov d4, r4, r5 @ v4si @ 36 [c=4 l=8] *mve_movv4si/1
vmov d5, r6, r7
sub sp, sp, #16 @ 62 [c=4 l=4] *arm_addsi3/11
mov r3, sp @ 37 [c=4 l=2] *thumb2_movsi_vfp/0
vsub.i32 q3, q3, q2 @ 18 [c=80 l=4] mve_vsubqv4si
vstrw.32 q3, [r3] @ 34 [c=4 l=4] *mve_movv4si/7
ldrd r4, r1, [sp] @ 77 [c=8 l=4] *thumb2_ldrd_base
ldrd r2, r3, [sp, #8] @ 78 [c=8 l=4] *thumb2_ldrd
strd r4, r1, [r0] @ 79 [c=8 l=4] *thumb2_strd_base
strd r2, r3, [r0, #8] @ 80 [c=8 l=4] *thumb2_strd
add sp, sp, #16 @ 66 [c=4 l=4] *arm_addsi3/5
@ sp needed @ 67 [c=8 l=0] force_register_use
pop {r4, r5, r6, r7, r8, r9, r10, fp} @ 68 [c=8 l=4]
*load_multiple_with_writeback
bx lr @ 69 [c=8 l=4] *thumb2_return
The Neon version has:
vld1.32 {q8}, [r1] @ 8 [c=8 l=4] *movmisalignv4si_neon_load
vld1.32 {q9}, [r2] @ 9 [c=8 l=4] *movmisalignv4si_neon_load
vsub.i32 q8, q8, q9 @ 10 [c=80 l=4] *subv4si3_neon
vst1.32 {q8}, [r0] @ 11 [c=8 l=4] *movmisalignv4si_neon_store
bx lr @ 21 [c=8 l=4] *thumb2_return
So it seems MVE needs movmisalign pattern.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
` (4 preceding siblings ...)
2020-12-09 16:59 ` clyon at gcc dot gnu.org
@ 2020-12-10 14:42 ` clyon at gcc dot gnu.org
2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
2021-01-12 16:52 ` clyon at gcc dot gnu.org
7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2020-12-10 14:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
Christophe Lyon <clyon at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|WAITING |ASSIGNED
--- Comment #6 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Indeed enabling movmisalign for MVE greatly helps.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
` (5 preceding siblings ...)
2020-12-10 14:42 ` clyon at gcc dot gnu.org
@ 2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
2021-01-12 16:52 ` clyon at gcc dot gnu.org
7 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-01-12 16:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Christophe Lyon <clyon@gcc.gnu.org>:
https://gcc.gnu.org/g:25bef68902f42f414f99626cefb2d3df81de7dc8
commit r11-6616-g25bef68902f42f414f99626cefb2d3df81de7dc8
Author: Christophe Lyon <christophe.lyon@linaro.org>
Date: Tue Jan 12 16:47:27 2021 +0000
arm: Add movmisalign patterns for MVE (PR target/97875)
This patch adds new movmisalign<mode>_mve_load and store patterns for
MVE to help vectorization. They are very similar to their Neon
counterparts, but use different iterators and instructions.
Indeed MVE supports less vectors modes than Neon, so we use the
MVE_VLD_ST iterator where Neon uses VQX.
Since the supported modes are different from the ones valid for
arithmetic operators, we introduce two new sets of macros:
ARM_HAVE_NEON_<MODE>_LDST
true if Neon has vector load/store instructions for <MODE>
ARM_HAVE_<MODE>_LDST
true if any vector extension has vector load/store instructions for
<MODE>
We move the movmisalign<mode> expander from neon.md to vec-commond.md, and
replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.
The patch also updates the mve-vneg.c test to scan for the better code
generation when loading and storing the vectors involved: it checks
that no 'orr' instruction is generated to cope with misalignment at
runtime.
This test was chosen among the other mve tests, but any other should
be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a good
test because the compiler chooses to use memcpy.
For instance we now generate:
test_vneg_s32x4:
vldrw.32 q3, [r1]
vneg.s32 q3, q3
vstrw.32 q3, [r0]
bx lr
instead of:
test_vneg_s32x4:
orr r3, r1, r0
lsls r3, r3, #28
bne .L15
vldrw.32 q3, [r1]
vneg.s32 q3, q3
vstrw.32 q3, [r0]
bx lr
.L15:
push {r4, r5}
ldrd r2, r3, [r1, #8]
ldrd r5, r4, [r1]
rsbs r2, r2, #0
rsbs r5, r5, #0
rsbs r4, r4, #0
rsbs r3, r3, #0
strd r5, r4, [r0]
pop {r4, r5}
strd r2, r3, [r0, #8]
bx lr
2021-01-12 Christophe Lyon <christophe.lyon@linaro.org>
PR target/97875
gcc/
* config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
(ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
(ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
(ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
(ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
(ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
(ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
(ARM_HAVE_NEON_V2DI_LDST): Likewise.
(ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
(ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
(ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST):
Likewise.
(ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST):
Likewise.
(ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST):
Likewise.
(ARM_HAVE_V2DI_LDST): Likewise.
* config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
(*movmisalign<mode>_mve_load): New pattern.
* config/arm/neon.md (movmisalign<mode>): Move to ...
* config/arm/vec-common.md: ... here.
PR target/97875
gcc/testsuite/
* gcc.target/arm/simd/mve-vneg.c: Update test.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Bug target/97875] suboptimal loop vectorization
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
` (6 preceding siblings ...)
2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
@ 2021-01-12 16:52 ` clyon at gcc dot gnu.org
7 siblings, 0 replies; 9+ messages in thread
From: clyon at gcc dot gnu.org @ 2021-01-12 16:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97875
Christophe Lyon <clyon at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #8 from Christophe Lyon <clyon at gcc dot gnu.org> ---
Fixed on trunk
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2021-01-12 16:52 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-17 13:11 [Bug tree-optimization/97875] New: suboptimal loop vectorization clyon at gcc dot gnu.org
2020-11-17 15:25 ` [Bug tree-optimization/97875] " rguenth at gcc dot gnu.org
2020-11-17 15:41 ` [Bug target/97875] " clyon at gcc dot gnu.org
2020-11-18 8:17 ` rguenth at gcc dot gnu.org
2020-12-09 15:06 ` clyon at gcc dot gnu.org
2020-12-09 16:59 ` clyon at gcc dot gnu.org
2020-12-10 14:42 ` clyon at gcc dot gnu.org
2021-01-12 16:51 ` cvs-commit at gcc dot gnu.org
2021-01-12 16:52 ` clyon at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).