[Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
@ 2024-05-28 10:59 manolis.tsamis at vrull dot eu
  2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: manolis.tsamis at vrull dot eu @ 2024-05-28 10:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

            Bug ID: 115258
           Summary: [14 Regression][aarch64] Additional XORs generated
                    after
                    r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---

For the following testcase:

  typedef int veci __attribute__ ((vector_size (4 * sizeof (int))));
  void fun (veci *a, veci *b, veci *c) {
    *c = __builtin_shufflevector (*a, *b, 0, 5, 2, 7);
  }

After this commit we generate (at -O2)

        adrp    x3, .LC0
        ldr     q30, [x1]
        ldr     q31, [x0]
        ldr     q29, [x3, #:lo12:.LC0]
        eor     v31.16b, v31.16b, v30.16b
        eor     v30.16b, v31.16b, v30.16b
        eor     v31.16b, v31.16b, v30.16b
        tbl     v30.16b, {v30.16b - v31.16b}, v29.16b
        str     q30, [x2]
        ret

Instead of

        adrp    x3, .LC0
        ldr     q30, [x0]
        ldr     q31, [x1]
        ldr     q29, [x3, #:lo12:.LC0]
        tbl     v30.16b, {v30.16b - v31.16b}, v29.16b
        str     q30, [x2]
        ret

The 3 newly introduced eor instructions just swap the values of v30 and v31,
which are loaded in the reverse order compared to the old code.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug rtl-optimization/115258] [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
@ 2024-05-28 12:03 ` rguenth at gcc dot gnu.org
  2024-05-28 14:50 ` [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290 pinskia at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-28 12:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |14.2

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
  2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
@ 2024-05-28 14:50 ` pinskia at gcc dot gnu.org
  2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-28 14:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-05-28
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.

Though the register allocation before was also just "lucky".
(define_insn_and_split "aarch64_combinev16qi"
  [(set (match_operand:V2x16QI 0 "register_operand" "=w")
        (unspec:V2x16QI [(match_operand:V16QI 1 "register_operand" "w")
                         (match_operand:V16QI 2 "register_operand" "w")]
                        UNSPEC_CONCAT))]
  "TARGET_SIMD"
  "#"
  "&& reload_completed"
  [(const_int 0)]
{
  aarch64_split_combinev16qi (operands);
  DONE;
}
[(set_attr "type" "multiple")]
)


It seems like for little-endian the constraint on 1 should include 0. But I
could be wrong.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
  2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
  2024-05-28 14:50 ` [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290 pinskia at gcc dot gnu.org
@ 2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
  2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2024-05-29 10:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rsandifo at gcc dot gnu.org

--- Comment #2 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
I agree with Andrew that this seems mostly to be luck.  On that basis, I'm not
sure (either way) whether we should backport the fix.

Now that we're hopefully getting better at tracking and allocating subregs, it
probably makes sense to allow the split before reload.  Doing that (and
adjusting the split accordingly) seems to fix the testcase.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
                   ` (2 preceding siblings ...)
  2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
@ 2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
  2024-05-29 15:53 ` [Bug target/115258] [14 " rsandifo at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-05-29 15:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

--- Comment #3 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec

commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
Author: Richard Sandiford <richard.sandiford@arm.com>
Date:   Wed May 29 16:43:33 2024 +0100

    aarch64: Split aarch64_combinev16qi before RA [PR115258]

    Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
    purpose is to put the two input data vectors into consecutive registers.
    This aarch64_combinev16qi was then split after reload into individual
    moves (from the first input to the first half of the output, and from
    the second input to the second half of the output).

    In the worst case, the RA might allocate things so that the destination
    of the aarch64_combinev16qi is the second input followed by the first
    input.  In that case, the split form of aarch64_combinev16qi uses three
    eors to swap the registers around.

    This PR is about a test where this worst case occurred.  And given the
    insn description, that allocation doesn't semm unreasonable.

    early-ra should (hopefully) mean that we're now better at allocating
    subregs of vector registers.  The upcoming RA subreg patches should
    improve things further.  The best fix for the PR therefore seems
    to be to split the combination before RA, so that the RA can see
    the underlying moves.

    Perhaps it even makes sense to do this at expand time, avoiding the need
    for aarch64_combinev16qi entirely.  That deserves more experimentation
    though.

    gcc/
            PR target/115258
            * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
            the split before reload.
            * config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
Generalize
            into a form that handles pseudo registers.

    gcc/testsuite/
            PR target/115258
            * gcc.target/aarch64/pr115258.c: New test.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
                   ` (3 preceding siblings ...)
  2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
@ 2024-05-29 15:53 ` rsandifo at gcc dot gnu.org
  2024-08-01  9:41 ` jakub at gcc dot gnu.org
  2024-09-18  3:12 ` kugan at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rsandifo at gcc dot gnu.org @ 2024-05-29 15:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Richard Sandiford <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[14/15 Regression] register |[14 Regression] register
                   |swaps for vector perm in    |swaps for vector perm in
                   |some cases after r14-6290   |some cases after r14-6290

--- Comment #4 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
Leaving open in case we do decide to backport.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
                   ` (4 preceding siblings ...)
  2024-05-29 15:53 ` [Bug target/115258] [14 " rsandifo at gcc dot gnu.org
@ 2024-08-01  9:41 ` jakub at gcc dot gnu.org
  2024-09-18  3:12 ` kugan at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: jakub at gcc dot gnu.org @ 2024-08-01  9:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|14.2                        |14.3

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 14.2 is being released, retargeting bugs to GCC 14.3.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Bug target/115258] [14 Regression] register swaps for vector perm in some cases after r14-6290
  2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
                   ` (5 preceding siblings ...)
  2024-08-01  9:41 ` jakub at gcc dot gnu.org
@ 2024-09-18  3:12 ` kugan at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: kugan at gcc dot gnu.org @ 2024-09-18  3:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115258

kugan at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kugan at gcc dot gnu.org

--- Comment #6 from kugan at gcc dot gnu.org ---
This (In reply to GCC Commits from comment #3)
> The trunk branch has been updated by Richard Sandiford
> <rsandifo@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> 
> commit r15-906-g39263ed2d39ac1cebde59bc5e72ddcad5dc7a1ec
> Author: Richard Sandiford <richard.sandiford@arm.com>
> Date:   Wed May 29 16:43:33 2024 +0100
> 
>     aarch64: Split aarch64_combinev16qi before RA [PR115258]
>     
>     Two-vector TBL instructions are fed by an aarch64_combinev16qi, whose
>     purpose is to put the two input data vectors into consecutive registers.
>     This aarch64_combinev16qi was then split after reload into individual
>     moves (from the first input to the first half of the output, and from
>     the second input to the second half of the output).
>     
>     In the worst case, the RA might allocate things so that the destination
>     of the aarch64_combinev16qi is the second input followed by the first
>     input.  In that case, the split form of aarch64_combinev16qi uses three
>     eors to swap the registers around.
>     
>     This PR is about a test where this worst case occurred.  And given the
>     insn description, that allocation doesn't semm unreasonable.
>     
>     early-ra should (hopefully) mean that we're now better at allocating
>     subregs of vector registers.  The upcoming RA subreg patches should
>     improve things further.  The best fix for the PR therefore seems
>     to be to split the combination before RA, so that the RA can see
>     the underlying moves.
>     
>     Perhaps it even makes sense to do this at expand time, avoiding the need
>     for aarch64_combinev16qi entirely.  That deserves more experimentation
>     though.
>     
>     gcc/
>             PR target/115258
>             * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Allow
>             the split before reload.
>             * config/aarch64/aarch64.cc (aarch64_split_combinev16qi):
> Generalize
>             into a form that handles pseudo registers.
>     
>     gcc/testsuite/
>             PR target/115258
>             * gcc.target/aarch64/pr115258.c: New test.

This is causing performance regression in some TSVC kernels and others. Here is
an example:
https://godbolt.org/z/r91nYEEsP

We now get:
.L3:
        add     x3, x26, x0
        add     x2, x25, x0
        add     x3, x3, 65536
        add     x2, x2, 65536
        sub     x0, x0, #16
        ldr     q31, [x3, 62448]
        mov     v28.16b, v31.16b
        mov     v29.16b, v31.16b
        tbl     v31.16b, {v28.16b - v29.16b}, v30.16b
        fadd    v31.4s, v31.4s, v25.4s
        mov     v26.16b, v31.16b
        mov     v27.16b, v31.16b
        tbl     v31.16b, {v26.16b - v27.16b}, v30.16b
        str     q31, [x2, 62448]
        cmp     x0, x27
        bne     .L3

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-09-18  3:12 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-28 10:59 [Bug rtl-optimization/115258] New: [14 Regression][aarch64] Additional XORs generated after r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182 manolis.tsamis at vrull dot eu
2024-05-28 12:03 ` [Bug rtl-optimization/115258] " rguenth at gcc dot gnu.org
2024-05-28 14:50 ` [Bug target/115258] [14/15 Regression] register swaps for vector perm in some cases after r14-6290 pinskia at gcc dot gnu.org
2024-05-29 10:02 ` rsandifo at gcc dot gnu.org
2024-05-29 15:43 ` cvs-commit at gcc dot gnu.org
2024-05-29 15:53 ` [Bug target/115258] [14 " rsandifo at gcc dot gnu.org
2024-08-01  9:41 ` jakub at gcc dot gnu.org
2024-09-18  3:12 ` kugan at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).