public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3
@ 2021-12-20 20:54 husseydevin at gmail dot com
  2021-12-20 20:59 ` [Bug middle-end/103781] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: husseydevin at gmail dot com @ 2021-12-20 20:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

            Bug ID: 103781
           Summary: [AArch64, 11 regr.] Failed partial vectorization of
                    mulv2di3
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

As of GCC 11, the AArch64 backend is very greedy in trying to vectorize
mulv2di3. However, there is no mulv2di3 routine so it extracts from the vector.

The bad codegen should be obvious. 

#include <stdint.h>

void fma_u64(uint64_t *restrict acc, const uint64_t *restrict x, const uint64_t
*restrict y)
{
    for (int i = 0; i < 16384; i++){
        acc[0] += *x++ * *y++;
        acc[1] += *x++ * *y++;
    }
}

gcc-11 -O3

fma_u64:
.LFB0:
        .cfi_startproc
        ldr     q1, [x0]
        add     x6, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x4, [x1], 16
        ldr     x5, [x2], 16
        ldr     x3, [x1, -8]
        mul     x4, x4, x5
        ldr     x5, [x2, -8]
        fmov    d0, x4
        ins     v0.d[1], x5
        mul     x3, x3, x5
        ins     v0.d[1], x3
        add     v1.2d, v1.2d, v0.2d
        cmp     x1, x6
        bne     .L2
        str     q1, [x0]
        ret
        .cfi_endproc

GCC 10.2.1 emits better code.

fma_u64:
.LFB0:
        .cfi_startproc
        ldp     x4, x3, [x0]
        add     x9, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x8, [x1], 16
        ldr     x7, [x2], 16
        ldr     x6, [x1, -8]
        ldr     x5, [x2, -8]
        madd    x4, x8, x7, x4
        madd    x3, x6, x5, x3
        cmp     x9, x1
        bne     .L2
        stp     x4, x3, [x0]
        ret
        .cfi_endproc

However, the ideal code would be a 2 iteration unroll.

Side note: why not ldp in the loop?

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-01-26  0:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
2021-12-20 20:59 ` [Bug middle-end/103781] " pinskia at gcc dot gnu.org
2021-12-20 21:12 ` [Bug target/103781] Cost model for SLP for aarch64 is not so good still husseydevin at gmail dot com
2021-12-20 21:13 ` pinskia at gcc dot gnu.org
2021-12-20 21:51 ` [Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good husseydevin at gmail dot com
2023-12-16  4:37 ` pinskia at gcc dot gnu.org
2024-01-26  0:47 ` pinskia at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).