[Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3

* [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3
@ 2021-12-20 20:54 husseydevin at gmail dot com
  2021-12-20 20:59 ` [Bug middle-end/103781] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: husseydevin at gmail dot com @ 2021-12-20 20:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

            Bug ID: 103781
           Summary: [AArch64, 11 regr.] Failed partial vectorization of
                    mulv2di3
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: husseydevin at gmail dot com
  Target Milestone: ---

As of GCC 11, the AArch64 backend is very greedy in trying to vectorize
mulv2di3. However, there is no mulv2di3 pattern, so it does the multiplies in
scalar registers and then inserts the results into a vector just for the add.

The bad codegen should be obvious.

#include <stdint.h>

void fma_u64(uint64_t *restrict acc, const uint64_t *restrict x,
             const uint64_t *restrict y)
{
    for (int i = 0; i < 16384; i++){
        acc[0] += *x++ * *y++;
        acc[1] += *x++ * *y++;
    }
}

gcc-11 -O3

fma_u64:
.LFB0:
        .cfi_startproc
        ldr     q1, [x0]
        add     x6, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x4, [x1], 16
        ldr     x5, [x2], 16
        ldr     x3, [x1, -8]
        mul     x4, x4, x5
        ldr     x5, [x2, -8]
        fmov    d0, x4
        ins     v0.d[1], x5
        mul     x3, x3, x5
        ins     v0.d[1], x3
        add     v1.2d, v1.2d, v0.2d
        cmp     x1, x6
        bne     .L2
        str     q1, [x0]
        ret
        .cfi_endproc

GCC 10.2.1 emits better code.

fma_u64:
.LFB0:
        .cfi_startproc
        ldp     x4, x3, [x0]
        add     x9, x1, 262144
        .p2align 3,,7
.L2:
        ldr     x8, [x1], 16
        ldr     x7, [x2], 16
        ldr     x6, [x1, -8]
        ldr     x5, [x2, -8]
        madd    x4, x8, x7, x4
        madd    x3, x6, x5, x3
        cmp     x9, x1
        bne     .L2
        stp     x4, x3, [x0]
        ret
        .cfi_endproc

However, the ideal code would unroll the loop by two iterations.
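
For illustration only, a two-iteration unroll of the test case could be written by hand as below. This is a sketch, not compiler output: the name `fma_u64_unroll2` and the scalar locals `a0`/`a1` are hypothetical, and it assumes the same fixed trip count of 16384 as the original loop.

```c
#include <stdint.h>

/* Hypothetical hand-unrolled variant of fma_u64 (not from the report).
   The two sums live in scalar locals so that each update can map onto a
   scalar multiply-add, as in the GCC 10 output. */
void fma_u64_unroll2(uint64_t *restrict acc, const uint64_t *restrict x,
                     const uint64_t *restrict y)
{
    uint64_t a0 = acc[0], a1 = acc[1];
    for (int i = 0; i < 16384; i += 2) {
        /* first original iteration */
        a0 += x[0] * y[0];
        a1 += x[1] * y[1];
        /* second original iteration */
        a0 += x[2] * y[2];
        a1 += x[3] * y[3];
        x += 4;
        y += 4;
    }
    acc[0] = a0;
    acc[1] = a1;
}
```

Whether a given GCC version turns this into four `madd` instructions per iteration still depends on the cost model in use.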

Side note: why not ldp in the loop?

* [Bug middle-end/103781] [AArch64, 11 regr.] Failed partial vectorization of mulv2di3
  2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
@ 2021-12-20 20:59 ` pinskia at gcc dot gnu.org
  2021-12-20 21:12 ` [Bug target/103781] Cost model for SLP for aarch64 is not so good still husseydevin at gmail dot com
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-20 20:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>Side note: why not ldp in the loop?
Because LDP formation is badly done in general (file a separate bug for
that). It is a known issue that ldp/stp formation is not very good.

>As of GCC 11, the AArch64 backend is very greedy in trying to vectorize mulv2di3.

No, what you are actually seeing is SLP, and since a V2DI mul does not exist,
the multiplies are not vectorized.
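
As a rough illustration of that shape, the GCC 11 output quoted in the report corresponds to something like the following, written by hand with the GCC vector extension. The function name `fma_u64_slp_shape` and the `u64x2` typedef are hypothetical, and the instruction comments are only an approximate mapping to the quoted assembly.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: two 64-bit lanes in one 128-bit vector. */
typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* Hypothetical name; mimics by hand what SLP produced in the GCC 11 output:
   only the add is a vector operation, the multiplies stay scalar. */
void fma_u64_slp_shape(uint64_t *restrict acc, const uint64_t *restrict x,
                       const uint64_t *restrict y)
{
    u64x2 vacc;
    memcpy(&vacc, acc, sizeof vacc);   /* roughly: ldr q1, [x0] */
    for (int i = 0; i < 16384; i++) {
        uint64_t p0 = x[0] * y[0];     /* scalar mul */
        uint64_t p1 = x[1] * y[1];     /* scalar mul */
        u64x2 prod = { p0, p1 };       /* roughly: fmov + ins, build the vector */
        vacc += prod;                  /* roughly: add v1.2d, v1.2d, v0.2d */
        x += 2;
        y += 2;
    }
    memcpy(acc, &vacc, sizeof vacc);   /* roughly: str q1, [x0] */
}
```

The cost of moving both products from general registers into a vector register on every iteration is what makes this worse than two scalar `madd` instructions.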

* [Bug target/103781] Cost model for SLP for aarch64 is not so good still
  2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
  2021-12-20 20:59 ` [Bug middle-end/103781] " pinskia at gcc dot gnu.org
@ 2021-12-20 21:12 ` husseydevin at gmail dot com
  2021-12-20 21:13 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: husseydevin at gmail dot com @ 2021-12-20 21:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #2 from Devin Hussey <husseydevin at gmail dot com> ---
Yeah, my bad. I meant SLP; I get them mixed up all the time.

* [Bug target/103781] Cost model for SLP for aarch64 is not so good still
  2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
  2021-12-20 20:59 ` [Bug middle-end/103781] " pinskia at gcc dot gnu.org
  2021-12-20 21:12 ` [Bug target/103781] Cost model for SLP for aarch64 is not so good still husseydevin at gmail dot com
@ 2021-12-20 21:13 ` pinskia at gcc dot gnu.org
  2021-12-20 21:51 ` [Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good husseydevin at gmail dot com
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-12-20 21:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I think it is the generic (and cortex-a53) cost model which is bad;
-mcpu=cortex-a57 and -mcpu=neoverse-n1 are fine.

* [Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good
  2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
                   ` (2 preceding siblings ...)
  2021-12-20 21:13 ` pinskia at gcc dot gnu.org
@ 2021-12-20 21:51 ` husseydevin at gmail dot com
  2023-12-16  4:37 ` pinskia at gcc dot gnu.org
  2024-01-26  0:47 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: husseydevin at gmail dot com @ 2021-12-20 21:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

--- Comment #4 from Devin Hussey <husseydevin at gmail dot com> ---
Makes sense because the multiplier is what, 5 cycles on an A53?

* [Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good
  2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
                   ` (3 preceding siblings ...)
  2021-12-20 21:51 ` [Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good husseydevin at gmail dot com
@ 2023-12-16  4:37 ` pinskia at gcc dot gnu.org
  2024-01-26  0:47 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-12-16  4:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I know that the generic cost model has changed on the trunk but I am not sure
this one is fixed ...

* [Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good
  2021-12-20 20:54 [Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3 husseydevin at gmail dot com
                   ` (4 preceding siblings ...)
  2023-12-16  4:37 ` pinskia at gcc dot gnu.org
@ 2024-01-26  0:47 ` pinskia at gcc dot gnu.org
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-01-26  0:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-01-26
     Ever confirmed|0                           |1

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.

Note that if SVE is turned on, we get:
```
.L2:
        ldr     q30, [x1], 16
        ldr     q29, [x2], 16
        mul     z29.d, z30.d, z29.d
        add     v31.2d, v31.2d, v29.2d
        cmp     x1, x3
        bne     .L2
```
That is the inner loop on the trunk, which is 100% what you want, as the
multiply is then vectorized too.
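
For comparison, the fully vectorized form can be sketched at the source level with the GCC vector extension. The name `fma_u64_vec` and the `u64x2` typedef are hypothetical. Whether the multiply becomes a single vector instruction is an assumption that depends on the target flags; where no 64-bit vector multiply is available, the compiler has to lower it back to scalar operations.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: two 64-bit lanes in one 128-bit vector. */
typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* Hypothetical name; both the multiply and the add are written as
   two-lane vector operations. */
void fma_u64_vec(uint64_t *restrict acc, const uint64_t *restrict x,
                 const uint64_t *restrict y)
{
    u64x2 vacc;
    memcpy(&vacc, acc, sizeof vacc);
    for (int i = 0; i < 16384; i++) {
        u64x2 vx, vy;
        memcpy(&vx, x, sizeof vx);     /* load two elements of x */
        memcpy(&vy, y, sizeof vy);     /* load two elements of y */
        vacc += vx * vy;               /* lane-wise multiply, then add */
        x += 2;
        y += 2;
    }
    memcpy(acc, &vacc, sizeof vacc);
}
```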
