public inbox for gcc-bugs@sourceware.org
* [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE
@ 2023-02-08 19:17 gbs at canishe dot com
  2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
  ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread

From: gbs at canishe dot com @ 2023-02-08 19:17 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

            Bug ID: 108724
           Summary: [11 regression] Poor codegen when summing two arrays
                    without AVX or SSE
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gbs at canishe dot com
  Target Milestone: ---

This program:

void foo(int *a, const int *__restrict b, const int *__restrict c)
{
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}

when compiled for x86 by GCC 11.1+ with -O3 -mno-avx -mno-sse, produces:

foo:
        movq    %rdx, %rax
        subq    $8, %rsp
        movl    (%rsi), %edx
        movq    %rsi, %rcx
        addl    (%rax), %edx
        movl    4(%rax), %esi
        movq    $0, (%rsp)
        movl    %edx, (%rsp)
        movq    (%rsp), %rdx
        addl    4(%rcx), %esi
        movq    %rdx, -8(%rsp)
        movl    %esi, -4(%rsp)
        movq    -8(%rsp), %rdx
        movq    %rdx, (%rdi)
        movl    8(%rax), %edx
        addl    8(%rcx), %edx
        movq    $0, -16(%rsp)
        movl    %edx, -16(%rsp)
        movq    -16(%rsp), %rdx
        movl    12(%rcx), %esi
        addl    12(%rax), %esi
        movq    %rdx, -24(%rsp)
        movl    %esi, -20(%rsp)
        movq    -24(%rsp), %rdx
        movq    %rdx, 8(%rdi)
        [snip more of the same]
        movl    48(%rcx), %edx
        movq    $0, -96(%rsp)
        addl    48(%rax), %edx
        movl    %edx, -96(%rsp)
        movq    -96(%rsp), %rdx
        movl    52(%rcx), %esi
        addl    52(%rax), %esi
        movq    %rdx, -104(%rsp)
        movl    %esi, -100(%rsp)
        movq    -104(%rsp), %rdx
        movq    %rdx, 48(%rdi)
        movl    56(%rcx), %edx
        movq    $0, -112(%rsp)
        addl    56(%rax), %edx
        movl    %edx, -112(%rsp)
        movq    -112(%rsp), %rdx
        movl    60(%rcx), %ecx
        addl    60(%rax), %ecx
        movq    %rdx, -120(%rsp)
        movl    %ecx, -116(%rsp)
        movq    -120(%rsp), %rdx
        movq    %rdx, 56(%rdi)
        addq    $8, %rsp
        ret

(Godbolt link: https://godbolt.org/z/qq9dbP8ed)

This is bizarre: the code stores intermediate results on the stack instead of
keeping them in registers or writing them directly to *a, which is bound to be
slow. (GCC 10.4, and Clang, produce more or less what I would expect, using
only the provided arrays and a register.) I haven't done any benchmarking
myself, but Jonathan Wakely's results (on list:
https://gcc.gnu.org/pipermail/gcc-help/2023-February/142181.html) seem to bear
this out.

From a bisect, this behavior appears to have been introduced by commit
33c0f246f799b7403171e97f31276a8feddd05c9 (tree-optimization/97626 - handle
SCCs properly in SLP stmt analysis) from October 2020, and persists into GCC
trunk.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
@ 2023-02-08 19:30 ` pinskia at gcc dot gnu.org
  2023-02-09  9:37 ` crazylht at gmail dot com
  ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: pinskia at gcc dot gnu.org @ 2023-02-08 19:30 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|[11 regression] Poor        |Poor codegen when summing
                    |codegen when summing two    |two arrays without AVX or
                    |arrays without AVX or SSE   |SSE
          Component|target                      |tree-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I noticed that on aarch64, with -mgeneral-regs-only, the problem of trying to
do a vectorization has been there since at least GCC 5 ...

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
@ 2023-02-09  9:37 ` crazylht at gmail dot com
  2023-02-09 13:54 ` rguenth at gcc dot gnu.org
  ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: crazylht at gmail dot com @ 2023-02-09  9:37 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
Guess it's related to vect_can_vectorize_without_simd_p:

bool
vect_can_vectorize_without_simd_p (tree_code code)
{
  switch (code)
    {
    case PLUS_EXPR:
    case MINUS_EXPR:
    case NEGATE_EXPR:
    case BIT_AND_EXPR:
    case BIT_IOR_EXPR:
    case BIT_XOR_EXPR:
    case BIT_NOT_EXPR:
      return true;

    default:
      return false;
    }
}

The vectorizer will still vectorize these operations even without SIMD
instructions.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
  2023-02-09  9:37 ` crazylht at gmail dot com
@ 2023-02-09 13:54 ` rguenth at gcc dot gnu.org
  2023-02-10 10:00 ` rguenth at gcc dot gnu.org
  ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-02-09 13:54 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
    Last reconfirmed|                            |2023-02-09

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Adding -fopt-info shows

t.c:3:21: optimized: loop vectorized using 8 byte vectors
t.c:1:6: optimized: loop with 7 iterations completely unrolled (header
execution count 63136016)

Disabling unrolling instead shows

.L2:
        leaq    (%rsi,%rax), %r8
        leaq    (%rdx,%rax), %rdi
        movl    (%r8), %ecx
        addl    (%rdi), %ecx
        movq    %r10, -8(%rsp)
        movl    %ecx, -8(%rsp)
        movq    -8(%rsp), %rcx
        movl    4(%rdi), %edi
        addl    4(%r8), %edi
        movq    %rcx, -16(%rsp)
        movl    %edi, -12(%rsp)
        movq    -16(%rsp), %rcx
        movq    %rcx, (%r9,%rax)
        addq    $8, %rax
        cmpq    $64, %rax
        jne     .L2

What happens is that vector lowering fails to perform a generic vector
addition (vector lowering is supposed to materialize that) and instead
decomposes the vector, doing scalar adds, which eventually results in us
spilling ...

The reason is that vector lowering does

/* Expand a vector operation to scalars; for integer types we can use
   special bit twiddling tricks to do the sums a word at a time, using
   function F_PARALLEL instead of F.  These tricks are done only if
   they can process at least four items, that is, only if the vector
   holds at least four items and if a word can hold four items.  */

static tree
expand_vector_addition (gimple_stmt_iterator *gsi,
                        elem_op_func f, elem_op_func f_parallel,
                        tree type, tree a, tree b, enum tree_code code)
{
  int parts_per_word = BITS_PER_WORD / vector_element_bits (type);

  if (INTEGRAL_TYPE_P (TREE_TYPE (type))
      && parts_per_word >= 4
      && nunits_for_known_piecewise_op (type) >= 4)
    return expand_vector_parallel (gsi, f_parallel, type, a, b, code);
  else
    return expand_vector_piecewise (gsi, f, type, TREE_TYPE (type),
                                    a, b, code, false);
}

so it only treats >= 4 elements as profitable to vectorize this way, but the
vectorizer doesn't seem to know that; it instead applies its own cost model
here, while vector lowering doesn't have any.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (2 preceding siblings ...)
  2023-02-09 13:54 ` rguenth at gcc dot gnu.org
@ 2023-02-10 10:00 ` rguenth at gcc dot gnu.org
  2023-02-10 10:07 ` [Bug tree-optimization/108724] [11/12/13 Regression] " rguenth at gcc dot gnu.org
  ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-02-10 10:00 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101801

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the vectorizer thinks that

foo:
.LFB0:
        .cfi_startproc
        movabsq $9223372034707292159, %rax
        movq    %rdi, %rcx
        movq    (%rdx), %r10
        movq    (%rsi), %rdi
        movq    %rax, %r8
        movq    %rax, %r9
        movq    %rax, %r11
        andq    %r10, %r8
        andq    %rdi, %r9
        addq    %r8, %r9
        movq    %rdi, %r8
        movabsq $-9223372034707292160, %rdi
        xorq    %r10, %r8
        movq    8(%rdx), %r10
        andq    %rdi, %r8
        xorq    %r9, %r8
        movq    %rax, %r9
        movq    %r8, (%rcx)
        movq    8(%rsi), %r8
        andq    %r10, %r9
        andq    %r8, %r11
        xorq    %r10, %r8
        movq    16(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 8(%rcx)
        movq    16(%rsi), %r8
        andq    %r8, %r9
        xorq    %r10, %r8
        movq    24(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 16(%rcx)
        movq    24(%rsi), %r8
        andq    %r8, %r9
        xorq    %r10, %r8
        movq    32(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 24(%rcx)
        movq    32(%rsi), %r8
        andq    %r8, %r9
        xorq    %r10, %r8
        movq    40(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 32(%rcx)
        movq    40(%rsi), %r8
        andq    %r8, %r9
        addq    %r11, %r9
        xorq    %r10, %r8
        movq    48(%rsi), %r10
        movq    %rax, %r11
        andq    %rdi, %r8
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 40(%rcx)
        movq    48(%rdx), %r8
        movq    56(%rdx), %rdx
        andq    %r8, %r9
        xorq    %r10, %r8
        addq    %r11, %r9
        andq    %rdi, %r8
        xorq    %r9, %r8
        movq    %r8, 48(%rcx)
        movq    56(%rsi), %r8
        movq    %rax, %rsi
        andq    %rdx, %rsi
        andq    %r8, %rax
        xorq    %r8, %rdx
        addq    %rsi, %rax
        andq    %rdi, %rdx
        xorq    %rdx, %rax
        movq    %rax, 56(%rcx)
        ret

will be faster than when not vectorizing. Not vectorizing produces

foo:
.LFB0:
        .cfi_startproc
        movq    %rsi, %rcx
        movl    (%rsi), %esi
        addl    (%rdx), %esi
        movl    %esi, (%rdi)
        movl    4(%rdx), %esi
        addl    4(%rcx), %esi
        movl    %esi, 4(%rdi)
        movl    8(%rdx), %esi
        addl    8(%rcx), %esi
        movl    %esi, 8(%rdi)
        movl    12(%rdx), %esi
        addl    12(%rcx), %esi
        movl    %esi, 12(%rdi)
        movl    16(%rdx), %esi
        addl    16(%rcx), %esi
        movl    %esi, 16(%rdi)
        movl    20(%rdx), %esi
        addl    20(%rcx), %esi
        movl    %esi, 20(%rdi)
        movl    24(%rdx), %esi
        addl    24(%rcx), %esi
        movl    %esi, 24(%rdi)
        movl    28(%rdx), %esi
        addl    28(%rcx), %esi
        movl    %esi, 28(%rdi)
        movl    32(%rdx), %esi
        addl    32(%rcx), %esi
        movl    %esi, 32(%rdi)
        movl    36(%rdx), %esi
        addl    36(%rcx), %esi
        movl    %esi, 36(%rdi)
        movl    40(%rdx), %esi
        addl    40(%rcx), %esi
        movl    %esi, 40(%rdi)
        movl    44(%rdx), %esi
        addl    44(%rcx), %esi
        movl    %esi, 44(%rdi)
        movl    48(%rdx), %esi
        addl    48(%rcx), %esi
        movl    %esi, 48(%rdi)
        movl    52(%rdx), %esi
        addl    52(%rcx), %esi
        movl    %esi, 52(%rdi)
        movl    56(%rdx), %esi
        movl    60(%rdx), %edx
        addl    56(%rcx), %esi
        addl    60(%rcx), %edx
        movl    %esi, 56(%rdi)
        movl    %edx, 60(%rdi)
        ret

The vectorizer produces un-lowered vector adds, which is good in case
follow-up optimizations are possible (the ops are not obfuscated), but also
bad because unrolling estimates the size in a wrong way.

Costs go

*_3 1 times scalar_load costs 12 in prologue
*_5 1 times scalar_load costs 12 in prologue
_4 + _6 1 times scalar_stmt costs 4 in prologue
_8 1 times scalar_store costs 12 in prologue

vs

*_3 1 times unaligned_load (misalign -1) costs 12 in body
*_5 1 times unaligned_load (misalign -1) costs 12 in body
_4 + _6 1 times vector_stmt costs 4 in body
_4 + _6 5 times scalar_stmt costs 20 in body
_8 1 times unaligned_store (misalign -1) costs 12 in body

and as usual the savings from the wide loads and store outweigh the loss on
the arithmetic side, but it's a close call:

Vector inside of loop cost: 60
Scalar iteration cost: 40

so 2*40 > 60 and the vectorization is deemed profitable. The easiest fix is
to avoid vectorizing; another possibility is to adhere to the vectorizer's
decision and expand to the lowered sequence immediately from within the
vectorizer itself. The original goal was to remove this hard cap on the
number of elements, but this bug shows that work was incomplete. I'm going to
re-instantiate the hard cap and revisit for GCC 14.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] [11/12/13 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (3 preceding siblings ...)
  2023-02-10 10:00 ` rguenth at gcc dot gnu.org
@ 2023-02-10 10:07 ` rguenth at gcc dot gnu.org
  2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
  ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-02-10 10:07 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
            Summary|Poor codegen when summing   |[11/12/13 Regression] Poor
                    |two arrays without AVX or   |codegen when summing two
                    |SSE                         |arrays without AVX or SSE
   Target Milestone|---                         |11.4

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] [11/12/13 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (4 preceding siblings ...)
  2023-02-10 10:07 ` [Bug tree-optimization/108724] [11/12/13 Regression] " rguenth at gcc dot gnu.org
@ 2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
  2023-02-10 11:22 ` [Bug tree-optimization/108724] [11/12 " rguenth at gcc dot gnu.org
  ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: cvs-commit at gcc dot gnu.org @ 2023-02-10 11:22 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:dc87e1391c55c666c7ff39d4f0dea87666f25468

commit r13-5771-gdc87e1391c55c666c7ff39d4f0dea87666f25468
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 10 11:07:30 2023 +0100

    tree-optimization/108724 - vectorized code getting piecewise expanded

    This fixes an oversight from when the hard limits on using generic
    vectors for the vectorizer were removed to enable both SLP and BB
    vectorization to use those.  The vectorizer relies on vector lowering
    to expand plus, minus and negate to bit operations, but vector
    lowering has a hard limit on the minimum number of elements per work
    item.  Vectorizer costs for the testcase at hand work out to vectorize
    a loop with just two work items per vector, and that causes
    element-wise expansion and spilling.

    The fix for now is to re-instantiate the hard limit, matching what
    vector lowering does.  For the future the way to go is to emit the
    lowered sequence directly from the vectorizer instead.

            PR tree-optimization/108724
            * tree-vect-stmts.cc (vectorizable_operation): Avoid using
            word_mode vectors when vector lowering will decompose them
            to elementwise operations.
            * gcc.target/i386/pr108724.c: New testcase.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] [11/12 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (5 preceding siblings ...)
  2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
@ 2023-02-10 11:22 ` rguenth at gcc dot gnu.org
  2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
  ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-02-10 11:22 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |13.0
            Summary|[11/12/13 Regression] Poor  |[11/12 Regression] Poor
                    |codegen when summing two    |codegen when summing two
                    |arrays without AVX or SSE   |arrays without AVX or SSE

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed on trunk so far.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] [11/12 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (6 preceding siblings ...)
  2023-02-10 11:22 ` [Bug tree-optimization/108724] [11/12 " rguenth at gcc dot gnu.org
@ 2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
  2023-05-05  8:34 ` [Bug tree-optimization/108724] [11 " rguenth at gcc dot gnu.org
  ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: cvs-commit at gcc dot gnu.org @ 2023-03-15  9:48 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-12 branch has been updated by Richard Biener
<rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:21e7145aaf582c263e69a3ee05dfa9d42bdbd1aa

commit r12-9258-g21e7145aaf582c263e69a3ee05dfa9d42bdbd1aa
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 10 11:07:30 2023 +0100

    tree-optimization/108724 - vectorized code getting piecewise expanded

    This fixes an oversight from when the hard limits on using generic
    vectors for the vectorizer were removed to enable both SLP and BB
    vectorization to use those.  The vectorizer relies on vector lowering
    to expand plus, minus and negate to bit operations, but vector
    lowering has a hard limit on the minimum number of elements per work
    item.  Vectorizer costs for the testcase at hand work out to vectorize
    a loop with just two work items per vector, and that causes
    element-wise expansion and spilling.

    The fix for now is to re-instantiate the hard limit, matching what
    vector lowering does.  For the future the way to go is to emit the
    lowered sequence directly from the vectorizer instead.

            PR tree-optimization/108724
            * tree-vect-stmts.cc (vectorizable_operation): Avoid using
            word_mode vectors when vector lowering will decompose them
            to elementwise operations.
            * gcc.target/i386/pr108724.c: New testcase.

    (cherry picked from commit dc87e1391c55c666c7ff39d4f0dea87666f25468)

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug tree-optimization/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (7 preceding siblings ...)
  2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
@ 2023-05-05  8:34 ` rguenth at gcc dot gnu.org
  2023-05-05 12:06 ` [Bug target/108724] " rguenth at gcc dot gnu.org
  ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-05-05  8:34 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that the GCC 11 branch isn't affected by what's fixed on trunk and GCC
12: the 11 branch still uses vect_worthwhile_without_simd_p, which checks
vect_min_worthwhile_factor. Instead, the GCC 11 branch correctly doesn't
vectorize the add but vectorizes the stores only, which results in stack
spilling to build the DImode values we then store. We're also doing this in
odd ways.

In GCC 13/12 we have fixed this in the costing:

node 0x4156628 1 times vec_construct costs 100 in prologue

while the GCC 11 branch costs this as 8. Not sure where this happens in the
target. Hmm, I think this happens in error.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug target/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (8 preceding siblings ...)
  2023-05-05  8:34 ` [Bug tree-optimization/108724] [11 " rguenth at gcc dot gnu.org
@ 2023-05-05 12:06 ` rguenth at gcc dot gnu.org
  2023-05-23 12:55 ` rguenth at gcc dot gnu.org
  2023-05-29 10:08 ` jakub at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-05-05 12:06 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target
             Target|                            |x86_64-*-* i?86-*-*

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
And the remaining issue with GCC 11 would be that we fail to account for the
GPR -> XMM move. Or the remaining issue for _all_ branches is that we fail to
realize that emulated "vector" CTORs are even more expensive, since we lack a
good way to materialize the CTOR in a GPR (generic RTL expansion fails to
consider using shift + and, for example). Not sure what a good expansion of a
V2SImode, V4HImode or V8QImode CTOR to a GPR DImode reg would look like.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug target/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (9 preceding siblings ...)
  2023-05-05 12:06 ` [Bug target/108724] " rguenth at gcc dot gnu.org
@ 2023-05-23 12:55 ` rguenth at gcc dot gnu.org
  2023-05-29 10:08 ` jakub at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread

From: rguenth at gcc dot gnu.org @ 2023-05-23 12:55 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk we're back to vectorizing, but as intended with DImode, which makes
us save half of the loads and stores, and we think the extended required
arithmetic covers up for that (by quite some margin).

        movabsq $9223372034707292159, %rcx
        movq    (%rdx), %rax
        movq    (%rsi), %rsi
        movq    %rcx, %rdx
        andq    %rax, %rdx
        andq    %rsi, %rcx
        xorq    %rsi, %rax
        addq    %rcx, %rdx
        movabsq $-9223372034707292160, %rcx
        andq    %rcx, %rax
        xorq    %rdx, %rax
        movq    %rax, (%rdi)

vs

        movl    (%rdx), %eax
        addl    (%rsi), %eax
        movl    %eax, (%rdi)
        movl    4(%rdx), %eax
        addl    4(%rsi), %eax
        movl    %eax, 4(%rdi)

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [Bug target/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  ` (10 preceding siblings ...)
  2023-05-23 12:55 ` rguenth at gcc dot gnu.org
@ 2023-05-29 10:08 ` jakub at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread

From: jakub at gcc dot gnu.org @ 2023-05-29 10:08 UTC (permalink / raw)
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.4                        |11.5

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-05-29 10:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
2023-02-09  9:37 ` crazylht at gmail dot com
2023-02-09 13:54 ` rguenth at gcc dot gnu.org
2023-02-10 10:00 ` rguenth at gcc dot gnu.org
2023-02-10 10:07 ` [Bug tree-optimization/108724] [11/12/13 Regression] " rguenth at gcc dot gnu.org
2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
2023-02-10 11:22 ` [Bug tree-optimization/108724] [11/12 " rguenth at gcc dot gnu.org
2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
2023-05-05  8:34 ` [Bug tree-optimization/108724] [11 " rguenth at gcc dot gnu.org
2023-05-05 12:06 ` [Bug target/108724] " rguenth at gcc dot gnu.org
2023-05-23 12:55 ` rguenth at gcc dot gnu.org
2023-05-29 10:08 ` jakub at gcc dot gnu.org