[Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
@ 2024-04-16 12:34 nsz at gcc dot gnu.org
  2024-04-16 12:55 ` [Bug target/114741] " pinskia at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: nsz at gcc dot gnu.org @ 2024-04-16 12:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

            Bug ID: 114741
           Summary: [14 regression] aarch64 sve: unnecessary fmov for
                    scalar int bit operations
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nsz at gcc dot gnu.org
  Target Milestone: ---

void foo(unsigned i, unsigned *p)
{
    *p = i & 1;
}

with gcc -march=armv8-a+sve -O2 compiles to

foo:
        fmov    s31, w0
        and     z31.s, z31.s, #1
        str     s31, [x1]
        ret

instead of

foo:
        and     w0, w0, 1
        str     w0, [x1]
        ret

it is wrong with -mcpu=generic but good e.g. with -mcpu=neoverse-v1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
@ 2024-04-16 12:55 ` pinskia at gcc dot gnu.org
  2024-04-16 14:55 ` wilco at gcc dot gnu.org
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-16 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |14.0

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
  2024-04-16 12:55 ` [Bug target/114741] " pinskia at gcc dot gnu.org
@ 2024-04-16 14:55 ` wilco at gcc dot gnu.org
  2024-04-16 15:45 ` wilco at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: wilco at gcc dot gnu.org @ 2024-04-16 14:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
This example always goes wrong:

void foo2(unsigned *p)
{
    *p &= 1;
}

Eg. with -mcpu=neoverse-v1:

        ldr     s31, [x0]
        and     z31.s, z31.s, #1
        str     s31, [x0]
        ret

This doesn't make any sense since there are usually fewer vector units than
integer ALUs, and the typically have higher latency.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
  2024-04-16 12:55 ` [Bug target/114741] " pinskia at gcc dot gnu.org
  2024-04-16 14:55 ` wilco at gcc dot gnu.org
@ 2024-04-16 15:45 ` wilco at gcc dot gnu.org
  2024-04-16 16:42 ` pinskia at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: wilco at gcc dot gnu.org @ 2024-04-16 15:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #2 from Wilco <wilco at gcc dot gnu.org> ---
It looks like the underlying bug is '^' being incorrectly treated like '?' in
record_reg_classes (which is never used during reload). Fixing that results in
the expected code being generated in all cases. It looks this issue was
introduced in the original commit d1457701461d5a49ca6b5d8a6d1c83a37a6dc771

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-04-16 15:45 ` wilco at gcc dot gnu.org
@ 2024-04-16 16:42 ` pinskia at gcc dot gnu.org
  2024-04-16 18:03 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-16 16:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-04-16
             Status|UNCONFIRMED                 |NEW
                 CC|                            |pinskia at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-04-16 16:42 ` pinskia at gcc dot gnu.org
@ 2024-04-16 18:03 ` pinskia at gcc dot gnu.org
  2024-04-16 19:18 ` tnfchris at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-04-16 18:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Wilco from comment #2)
> It looks like the underlying bug is '^' being incorrectly treated like '?'
> in record_reg_classes (which is never used during reload). Fixing that
> results in the expected code being generated in all cases. It looks this
> issue was introduced in the original commit
> d1457701461d5a49ca6b5d8a6d1c83a37a6dc771

static const struct cpu_regmove_cost generic_armv9_a_regmove_cost =
{
  1, /* GP2GP  */
  /* Spilling to int<->fp instead of memory is recommended so set
     realistic costs compared to memmov_cost.  */
  3, /* GP2FP  */
  2, /* FP2GP  */
  2 /* FP2FP  */
};

Note these costs are broken.
TARGET_REGISTER_MOVE_COST  has this to say about the special value 2:
```
If reload sees an insn consisting of a single set between two hard registers,
and if TARGET_REGISTER_MOVE_COST applied to their classes returns a value of 2,
reload does not check to ensure that the constraints of the insn are met.
Setting a cost of other than 2 will allow reload to verify that the constraints
are met. You should do this if the ‘movm’ pattern’s constraints do not allow
such copying.
```

The way I implemented this for thunderx was have GP2GP being cost of 2 and then
be relative from there. That gave better code generation in general and even
less spilling. I know I wrote about this before too.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-04-16 18:03 ` pinskia at gcc dot gnu.org
@ 2024-04-16 19:18 ` tnfchris at gcc dot gnu.org
  2024-04-16 19:20 ` tnfchris at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-16 19:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org

--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Wilco from comment #2)
> > It looks like the underlying bug is '^' being incorrectly treated like '?'
> > in record_reg_classes (which is never used during reload). Fixing that
> > results in the expected code being generated in all cases. It looks this
> > issue was introduced in the original commit
> > d1457701461d5a49ca6b5d8a6d1c83a37a6dc771
> 
> static const struct cpu_regmove_cost generic_armv9_a_regmove_cost =
> {
>   1, /* GP2GP  */
>   /* Spilling to int<->fp instead of memory is recommended so set
>      realistic costs compared to memmov_cost.  */
>   3, /* GP2FP  */
>   2, /* FP2GP  */
>   2 /* FP2FP  */
> };
> 
> 
> Note these costs are broken.
> TARGET_REGISTER_MOVE_COST  has this to say about the special value 2:
> ```
> If reload sees an insn consisting of a single set between two hard
> registers, and if TARGET_REGISTER_MOVE_COST applied to their classes returns
> a value of 2, reload does not check to ensure that the constraints of the
> insn are met. Setting a cost of other than 2 will allow reload to verify
> that the constraints are met. You should do this if the ‘movm’ pattern’s
> constraints do not allow such copying.
> ```
> 
> The way I implemented this for thunderx was have GP2GP being cost of 2 and
> then be relative from there. That gave better code generation in general and
> even less spilling. I know I wrote about this before too.

I don't think this is related to this at all
the old generic costs, which armv8 was taken from doesn't use 2, and is broken
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/tuning_models/generic.h#L42
generic-armv8-a doesn't use 2 and is broken
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/tuning_models/generic_armv8_a.h#L43
neoverse-n2 uses 2 and isn't broken
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/tuning_models/neoversen2.h

I think the issue is what Wilco mentioned before. 
The GCC docs claim that ^ shouldn't affect initial costing/IRA, but it clearly
does.

it's penalizing the r->r alternative during initial costing and not just during
lra if a reload is needed.

so the question to Vlad is it correct that ^ is treated the same way as ?
during initial costing? i.e. it's penalizing the alternative.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-04-16 19:18 ` tnfchris at gcc dot gnu.org
@ 2024-04-16 19:20 ` tnfchris at gcc dot gnu.org
  2024-04-17 13:33 ` law at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-16 19:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
and the exact armv9-a cost model you quoted, also does the right codegen.
https://godbolt.org/z/obafoT6cj

There is just an inexplicable penalty being applied to the r->r alternative.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-04-16 19:20 ` tnfchris at gcc dot gnu.org
@ 2024-04-17 13:33 ` law at gcc dot gnu.org
  2024-04-17 13:47 ` wilco at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: law at gcc dot gnu.org @ 2024-04-17 13:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Jeffrey A. Law <law at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org
           Priority|P3                          |P2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-04-17 13:33 ` law at gcc dot gnu.org
@ 2024-04-17 13:47 ` wilco at gcc dot gnu.org
  2024-04-17 13:49 ` tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: wilco at gcc dot gnu.org @ 2024-04-17 13:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #7 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #6)
> and the exact armv9-a cost model you quoted, also does the right codegen.
> https://godbolt.org/z/obafoT6cj
> 
> There is just an inexplicable penalty being applied to the r->r alternative.

Indeed it is not related to cost model - building SPEC shows a significant
regression (~1%) with -mcpu=neoverse-v1 due to AND immediate being quite common
in scalar code. The '^' incorrectly forces many cases to use the SVE
alternative.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-04-17 13:47 ` wilco at gcc dot gnu.org
@ 2024-04-17 13:49 ` tnfchris at gcc dot gnu.org
  2024-04-18 10:49 ` cvs-commit at gcc dot gnu.org
  2024-04-18 10:51 ` tnfchris at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-17 13:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |tnfchris at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Indeed, Regtesting a patch that fixes it, so mine...

It looks like ^ doesn't work well when there's a choice of multiple register
files.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2024-04-17 13:49 ` tnfchris at gcc dot gnu.org
@ 2024-04-18 10:49 ` cvs-commit at gcc dot gnu.org
  2024-04-18 10:51 ` tnfchris at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2024-04-18 10:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

--- Comment #9 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:a2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6

commit r14-10014-ga2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Thu Apr 18 11:47:42 2024 +0100

    AArch64: remove reliance on register allocator for simd/gpreg costing.
[PR114741]

    In PR114741 we see that we have a regression in codegen when SVE is enable
where
    the simple testcase:

    void foo(unsigned v, unsigned *p)
    {
        *p = v & 1;
    }

    generates

    foo:
            fmov    s31, w0
            and     z31.s, z31.s, #1
            str     s31, [x1]
            ret

    instead of:

    foo:
            and     w0, w0, 1
            str     w0, [x1]
            ret

    This causes an impact it not just codesize but also performance.  This is
caused
    by the use of the ^ constraint modifier in the pattern <optab><mode>3.

    The documentation states that this modifier should only have an effect on
the
    alternative costing in that a particular alternative is to be preferred
unless
    a non-psuedo reload is needed.

    The pattern was trying to convey that whenever both r and w are required,
that
    it should prefer r unless a reload is needed.  This is because if a reload
is
    needed then we can construct the constants more flexibly on the SIMD side.

    We were using this so simplify the implementation and to get generic cases
such
    as:

    double negabs (double x)
    {
       unsigned long long y;
       memcpy (&y, &x, sizeof(double));
       y = y | (1UL << 63);
       memcpy (&x, &y, sizeof(double));
       return x;
    }

    which don't go through an expander.
    However the implementation of ^ in the register allocator is not according
to
    the documentation in that it also has an effect during coloring.  During
initial
    register class selection it applies a penalty to a class, similar to how ?
does.

    In this example the penalty makes the use of GP regs expensive enough that
it no
    longer considers them:

        r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
    ;;        3--> b  0: i   9 r106=r105&0x1
        :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
                             PR_HI_REGS+0(0):model 4

    which is not the expected behavior.  For GCC 14 this is a conservative fix.

    1. we remove the ^ modifier from the logical optabs.

    2. In order not to regress copysign we then move the copysign expansion to
       directly use the SIMD variant.  Since copysign only supports floating
point
       modes this is fine and no longer relies on the register allocator to
select
       the right alternative.

    It once again regresses the general case, but this case wasn't optimized in
    earlier GCCs either so it's not a regression in GCC 14.  This change gives
    strict better codegen than earlier GCCs and still optimizes the important
cases.

    gcc/ChangeLog:

            PR target/114741
            * config/aarch64/aarch64.md (<optab><mode>3): Remove ^ from alt 2.
            (copysign<GPF:mode>3): Use SIMD version of IOR directly.

    gcc/testsuite/ChangeLog:

            PR target/114741
            * gcc.target/aarch64/fneg-abs_2.c: Update codegen.
            * gcc.target/aarch64/fneg-abs_4.c: xfail for now.
            * gcc.target/aarch64/pr114741.c: New test.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/114741] [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations
  2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2024-04-18 10:49 ` cvs-commit at gcc dot gnu.org
@ 2024-04-18 10:51 ` tnfchris at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-18 10:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114741

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #10 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Fixed, thanks for the report.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-04-18 10:51 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-16 12:34 [Bug target/114741] New: [14 regression] aarch64 sve: unnecessary fmov for scalar int bit operations nsz at gcc dot gnu.org
2024-04-16 12:55 ` [Bug target/114741] " pinskia at gcc dot gnu.org
2024-04-16 14:55 ` wilco at gcc dot gnu.org
2024-04-16 15:45 ` wilco at gcc dot gnu.org
2024-04-16 16:42 ` pinskia at gcc dot gnu.org
2024-04-16 18:03 ` pinskia at gcc dot gnu.org
2024-04-16 19:18 ` tnfchris at gcc dot gnu.org
2024-04-16 19:20 ` tnfchris at gcc dot gnu.org
2024-04-17 13:33 ` law at gcc dot gnu.org
2024-04-17 13:47 ` wilco at gcc dot gnu.org
2024-04-17 13:49 ` tnfchris at gcc dot gnu.org
2024-04-18 10:49 ` cvs-commit at gcc dot gnu.org
2024-04-18 10:51 ` tnfchris at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).