public inbox for gcc-bugs@sourceware.org
* [Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation
@ 2020-06-29 12:49 rsandifo at gcc dot gnu.org
  2021-08-12  8:01 ` [Bug target/95962] " tnfchris at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2020-06-29 12:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

            Bug ID: 95962
           Summary: Inefficient code for simple arm_neon.h iota operation
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
            Blocks: 95958
  Target Milestone: ---
            Target: aarch64*-*-*

For:

#include <arm_neon.h>

int32x4_t
foo (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  return vld1q_s32 (array);
}

we produce:

foo:
.LFB4217:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x0, 2
        mov     x1, 4294967296
        movk    x0, 0x3, lsl 32
        stp     x1, x0, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

In contrast, clang produces essentially perfect code:

        adrp    x8, .LCPI0_0
        ldr     q0, [x8, :lo12:.LCPI0_0]
        ret

I think the problem is a combination of two things:

- __builtin_aarch64_ld1v4si & co. are treated as general
  functions rather than pure functions, so in principle
  they could write to the given address.  This stops us
  promoting the array to a constant.

- The loads could be reduced to native gimple-level
  operations, at least on little-endian targets.

IMO this is a bug rather than an enhancement.  Intrinsics only
exist to optimise code, and what GCC is doing falls short
of what users should reasonably expect.
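
To illustrate the first point, here is a minimal sketch of what the
"pure" attribute promises.  The helper sum4 is hypothetical and is not
a GCC builtin; it only stands in for a load-like function that reads
through its pointer argument.  Whether a given compiler then promotes
the array to a constant is an assumption, not something this sketch
guarantees.

```c
#include <stdint.h>

/* Hypothetical helper, not a GCC builtin.  "pure" tells the compiler
   that the function may read memory but has no side effects, so it
   cannot write to the array it is given.  */
__attribute__ ((pure, noinline)) static int32_t
sum4 (const int32_t *p)
{
  return p[0] + p[1] + p[2] + p[3];
}

int32_t
sum_iota (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  /* The compiler is allowed to assume that this call leaves
     array unchanged.  */
  return sum4 (array);
}
```

Without the attribute, a call to an opaque function that receives
&array has to be treated as a possible store to the array.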


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64


* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
  2020-06-29 12:49 [Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation rsandifo at gcc dot gnu.org
@ 2021-08-12  8:01 ` tnfchris at gcc dot gnu.org
  2021-08-20 11:52 ` rsandifo at gcc dot gnu.org
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-08-12  8:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-08-12
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #1 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
We generate the correct code at -O3 but not -O2.

At -O3 we generate

foo:
        adrp    x0, .LC0
        sub     sp, sp, #16
        ldr     q0, [x0, #:lo12:.LC0]
        add     sp, sp, 16
        ret

where the problem seems to be that at -O2 store merging has broken up the
construction of `array` into two separate memory accesses:

  MEM <unsigned long> [(int *)&array] = 4294967296;
  MEM <unsigned long> [(int *)&array + 8B] = 12884901890;

whereas at -O3 we still have a single assignment:

  MEM <vector(4) int> [(int *)&array] = { 0, 1, 2, 3 };

I'm not sure that making these loads gimple-level operations would help, since
we'd still have the explicit MEMs created by store merging.

Perhaps we should just make store-merging allow TImode merges and split them in
the backend if needed.
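
The two 64-bit constants in the merged stores are just pairs of adjacent
array elements packed together.  A small sketch of the arithmetic,
assuming a little-endian target (the element at the lower address lands
in the low 32 bits); pack_pair_le is a name made up for this example:

```c
#include <stdint.h>

/* Pack two adjacent 32-bit elements into the 64-bit value that a
   single little-endian store would write: lo goes in bits 0-31 and
   hi in bits 32-63.  */
static uint64_t
pack_pair_le (uint32_t lo, uint32_t hi)
{
  return (uint64_t) lo | ((uint64_t) hi << 32);
}
```

Elements { 0, 1 } pack to 1 << 32 = 4294967296, and { 2, 3 } pack to
2 + (3 << 32) = 12884901890, matching the two MEM stores above.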


* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
  2020-06-29 12:49 [Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation rsandifo at gcc dot gnu.org
  2021-08-12  8:01 ` [Bug target/95962] " tnfchris at gcc dot gnu.org
@ 2021-08-20 11:52 ` rsandifo at gcc dot gnu.org
  2021-11-15 15:09 ` tnfchris at gcc dot gnu.org
  2021-12-03 17:05 ` rsandifo at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-08-20 11:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

--- Comment #2 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #1)
> We generate the correct code at -O3 but not -O2.
> 
> At -O3 we generate
> 
> foo:
>         adrp    x0, .LC0
>         sub     sp, sp, #16
>         ldr     q0, [x0, #:lo12:.LC0]
>         add     sp, sp, 16
>         ret
> 
> where the problem seems to be that at -O2 store merging has broken up the
> construction of `array` into two separate memory accesses:
> 
>   MEM <unsigned long> [(int *)&array] = 4294967296;
>   MEM <unsigned long> [(int *)&array + 8B] = 12884901890;
> 
> whereas at -O3 we still have a single assignment:
> 
>   MEM <vector(4) int> [(int *)&array] = { 0, 1, 2, 3 };
> 
> I'm not sure that making these loads gimple-level operations would help,
> since we'd still have the explicit MEMs created by store merging.
If we folded them to gimple loads, the gimple optimisers should replace
the MEM with an assignment of the VECTOR_CST { 0, 1, 2, 3 } to an SSA name,
with the function returning the SSA name.

expand will convert this back into a memory access, in the form of
an RTL constant pool load.  But that will avoid the stack temporary
and thus the pointless stack adjustments.
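
As a source-level analogy only (not the actual GCC change), the folding
described above would make the first function below behave like the
second.  The sketch uses GNU C vector extensions instead of arm_neon.h
so that it is not tied to AArch64; the v4si typedef and both function
names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Four 32-bit lanes, using the GNU C vector extension.  */
typedef int32_t v4si __attribute__ ((vector_size (16)));

/* The load written as an ordinary memory access that the optimisers
   can see through, rather than as an opaque builtin call.  */
v4si
iota_via_load (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  v4si v;
  memcpy (&v, array, sizeof v);
  return v;
}

/* What the optimisers can reduce it to: the function simply returns
   a vector constant, with no stack temporary.  */
v4si
iota_constant (void)
{
  return (v4si) { 0, 1, 2, 3 };
}
```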


* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
  2020-06-29 12:49 [Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation rsandifo at gcc dot gnu.org
  2021-08-12  8:01 ` [Bug target/95962] " tnfchris at gcc dot gnu.org
  2021-08-20 11:52 ` rsandifo at gcc dot gnu.org
@ 2021-11-15 15:09 ` tnfchris at gcc dot gnu.org
  2021-12-03 17:05 ` rsandifo at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2021-11-15 15:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
This is now fixed on trunk, at least for ld1/st1.

Was this ticket about the general problem for loads or just the ld1/st1
examples?

I'll leave it open since we still need to do the rest of the loads/stores.


* [Bug target/95962] Inefficient code for simple arm_neon.h iota operation
  2020-06-29 12:49 [Bug target/95962] New: Inefficient code for simple arm_neon.h iota operation rsandifo at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-11-15 15:09 ` tnfchris at gcc dot gnu.org
@ 2021-12-03 17:05 ` rsandifo at gcc dot gnu.org
  3 siblings, 0 replies; 5+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2021-12-03 17:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> This is now fixed on trunk, at least for ld1/st1.
Nice!

> Was this ticket about the general problem for loads or just the ld1/st1
> examples?
> 
> I'll leave it open since we still need to do the rest of the loads/stores.
Yeah, agree we should keep it open for {ld,st}[234].


