* [Bug target/98737] Atomic operation on x86 not optimized to use flags
2021-01-18 19:59 [Bug target/98737] New: Atomic operation on x86 not optimized to use flags drepper.fsp+rhbz at gmail dot com
From: rguenth at gcc dot gnu.org @ 2021-01-19 7:55 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What            |Removed      |Added
   ------------------------------------------------------------
   Status                  |UNCONFIRMED  |NEW
   Last reconfirmed        |             |2021-01-19
   Keywords                |             |missed-optimization
   Ever confirmed          |0            |1
   Target                  |x86          |x86_64-*-* i?86-*-*
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  Probably difficult to achieve since the atomics map to
parallel/UNSPEC:

(insn 7 6 8 (parallel [
            (set (reg:DI 83 [ _2 ])
                (unspec_volatile:DI [
                        (mem/v:DI (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff7ff5b40 a>) [-1 S8 A64])
                        (const_int 3 [0x3])
                    ] UNSPECV_XCHG))
            (set (mem/v:DI (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff7ff5b40 a>) [-1 S8 A64])
                (plus:DI (mem/v:DI (symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff7ff5b40 a>) [-1 S8 A64])
                    (reg:DI 86)))
            (clobber (reg:CC 17 flags))
        ]) "t.c":5:10 -1
     (nil))
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: ubizjak at gmail dot com @ 2021-01-19 8:37 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #2 from Uroš Bizjak <ubizjak at gmail dot com> ---
This can be optimized with peephole2; we already have a similar case in sync.md:
;; This peephole2 and following insn optimize
;; __sync_fetch_and_add (x, -N) == N into just lock {add,sub,inc,dec}
;; followed by testing of flags instead of lock xadd and comparisons.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: redi at gcc dot gnu.org @ 2021-01-19 9:53 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #3 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #2)
> This can be optimized with peephole2, we already have similar case in
> sync.md:
>
> ;; This peephole2 and following insn optimize
> ;; __sync_fetch_and_add (x, -N) == N into just lock {add,sub,inc,dec}
> ;; followed by testing of flags instead of lock xadd and comparisons.
That doesn't seem to do anything though; I see the same bad code for

_Bool h(long b)
{
  return __sync_fetch_and_add(&a, -b) == b;
}
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: jakub at gcc dot gnu.org @ 2021-01-19 10:01 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What  |Removed  |Added
   ------------------------------------------------
   CC            |         |jakub at gcc dot gnu.org
--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
That peephole handles only constants, i.e.

_Bool h(void)
{
  return __sync_fetch_and_add(&a, -32) == 32;
}

_Bool i(void)
{
  return __sync_fetch_and_add(&a, 64) == -64;
}

Anyway, for #c0 the peephole2 would need to match the UNSPEC_XCHG followed by
the operation again followed by a comparison, because it is op_fetch rather
than fetch_op.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: jakub at gcc dot gnu.org @ 2021-01-26 15:09 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What      |Removed                        |Added
   ------------------------------------------------------------------------
   Assignee          |unassigned at gcc dot gnu.org  |jakub at gcc dot gnu.org
   Status            |NEW                            |ASSIGNED
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 50058
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50058&action=edit
gcc11-pr98737.patch
Untested fix.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: drepper.fsp+rhbz at gmail dot com @ 2021-01-26 16:45 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #6 from Ulrich Drepper <drepper.fsp+rhbz at gmail dot com> ---
(In reply to Jakub Jelinek from comment #5)
> Created attachment 50058 [details]
> gcc11-pr98737.patch
>
> Untested fix.
This only handles sub?
The same applies to add, or, and, xor. Maybe nand? Can this patch be
generalized?
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: jakub at gcc dot gnu.org @ 2021-01-26 17:11 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The sub fix won't be the same as the add one; perhaps xor/or/and can be
handled by the same peephole2, but even for that I'm not sure.  Though e.g.
trying __atomic_or_fetch (&a, b, ...) == 0 doesn't seem to be something people
would use.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: drepper.fsp+rhbz at gmail dot com @ 2021-01-26 20:10 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #8 from Ulrich Drepper <drepper.fsp+rhbz at gmail dot com> ---
(In reply to Jakub Jelinek from comment #7)
> The sub fix won't be the same as would add, perhaps xor/or/and can be
> handled by the same peephole2, but even for that I'm not sure.
Just a proposal, but I can see myself using code like this.
> Though e.g.
> trying __atomic_or_fetch (&a, b, ...) == 0 doesn't seem to be something
> people would use.
I can see this being valid.  If b is a variable of some type (e.g.,
representing a flag), a could be the set of all flags set, and the result of
the or operation being non-zero could mean some work based on the flags needs
to be done.
The alternative is

  if ((b != 0 && __atomic_or_fetch(&a, b, ...)) || a != 0)
    ...

Unconditionally performing the or is likely faster than the additional tests
and jumps.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: jakub at gcc dot gnu.org @ 2021-12-14 10:20 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What                 |Removed  |Added
   ---------------------------------------------
   Attachment #50058 is obsolete|0        |1
--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 51997
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51997&action=edit
gcc12-pr98737.patch
Updated patch which uses an internal fn and optabs for this.
Only the *_fetch and not the fetch_* atomics are handled though.
For the latter, I wonder if we shouldn't instead generally try to replace
__atomic_fetch_add (&x, y, whatever) + y with __atomic_add_fetch (&x, y,
whatever), and __atomic_add_fetch (&x, y, whatever) - y with
__atomic_fetch_add (&x, y, whatever), etc.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: cvs-commit at gcc dot gnu.org @ 2022-01-03 13:17 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:
https://gcc.gnu.org/g:6362627b27f395b054f359244fcfcb15ac0ac2ab
commit r12-6190-g6362627b27f395b054f359244fcfcb15ac0ac2ab
Author: Jakub Jelinek <jakub@redhat.com>
Date: Mon Jan 3 14:02:23 2022 +0100
i386, fab: Optimize __atomic_{add,sub,and,or,xor}_fetch (x, y, z)
{==,!=,<,<=,>,>=} 0 [PR98737]
On Wed, Jan 27, 2021 at 12:27:13PM +0100, Ulrich Drepper via Gcc-patches
wrote:
> On 1/27/21 11:37 AM, Jakub Jelinek wrote:
> > Would equality comparison against 0 handle the most common cases.
> >
> > The user can write it as
> > __atomic_sub_fetch (x, y, z) == 0
> > or
> > __atomic_fetch_sub (x, y, z) - y == 0
> > though, so the expansion code would need to be able to cope with both.
>
> Please also keep !=0, <0, <=0, >0, and >=0 in mind. They all can be
> useful and can be handled with the flags.
<= 0 and > 0 don't really work well with lock {add,sub,inc,dec}; x86 doesn't
have comparisons that would look solely at both SF and ZF and not at other
flags (and emitting two separate conditional jumps or two setcc insns and
oring them together looks awful).
But the rest can work.
Here is a patch that adds internal functions and optabs for these,
recognizes them at the same spot as e.g. .ATOMIC_BIT_TEST_AND* internal
functions (fold all builtins pass) and expands them appropriately (or for
the <= 0 and > 0 cases of +/- FAILs and let's middle-end fall back).
So far I have handled just the op_fetch builtins, IMHO instead of handling
also __atomic_fetch_sub (x, y, z) - y == 0 etc. we should canonicalize
__atomic_fetch_sub (x, y, z) - y to __atomic_sub_fetch (x, y, z) (and vice
versa).
2022-01-03 Jakub Jelinek <jakub@redhat.com>
PR target/98737
* internal-fn.def (ATOMIC_ADD_FETCH_CMP_0, ATOMIC_SUB_FETCH_CMP_0,
ATOMIC_AND_FETCH_CMP_0, ATOMIC_OR_FETCH_CMP_0,
ATOMIC_XOR_FETCH_CMP_0):
New internal fns.
* internal-fn.h (ATOMIC_OP_FETCH_CMP_0_EQ,
ATOMIC_OP_FETCH_CMP_0_NE,
ATOMIC_OP_FETCH_CMP_0_LT, ATOMIC_OP_FETCH_CMP_0_LE,
ATOMIC_OP_FETCH_CMP_0_GT, ATOMIC_OP_FETCH_CMP_0_GE): New
enumerators.
* internal-fn.c (expand_ATOMIC_ADD_FETCH_CMP_0,
expand_ATOMIC_SUB_FETCH_CMP_0, expand_ATOMIC_AND_FETCH_CMP_0,
expand_ATOMIC_OR_FETCH_CMP_0, expand_ATOMIC_XOR_FETCH_CMP_0): New
functions.
* optabs.def (atomic_add_fetch_cmp_0_optab,
atomic_sub_fetch_cmp_0_optab, atomic_and_fetch_cmp_0_optab,
atomic_or_fetch_cmp_0_optab, atomic_xor_fetch_cmp_0_optab): New
direct optabs.
* builtins.h (expand_ifn_atomic_op_fetch_cmp_0): Declare.
* builtins.c (expand_ifn_atomic_op_fetch_cmp_0): New function.
* tree-ssa-ccp.c: Include internal-fn.h.
(optimize_atomic_bit_test_and): Add . before internal fn call
in function comment. Change return type from void to bool and
return true only if successfully replaced.
(optimize_atomic_op_fetch_cmp_0): New function.
(pass_fold_builtins::execute): Use optimize_atomic_op_fetch_cmp_0
for BUILT_IN_ATOMIC_{ADD,SUB,AND,OR,XOR}_FETCH_{1,2,4,8,16} and
BUILT_IN_SYNC_{ADD,SUB,AND,OR,XOR}_AND_FETCH_{1,2,4,8,16},
for *XOR* ones only if optimize_atomic_bit_test_and failed.
* config/i386/sync.md
(atomic_<plusminus_mnemonic>_fetch_cmp_0<mode>,
atomic_<logic>_fetch_cmp_0<mode>): New define_expand patterns.
(atomic_add_fetch_cmp_0<mode>_1, atomic_sub_fetch_cmp_0<mode>_1,
atomic_<logic>_fetch_cmp_0<mode>_1): New define_insn patterns.
* doc/md.texi (atomic_add_fetch_cmp_0<mode>,
atomic_sub_fetch_cmp_0<mode>, atomic_and_fetch_cmp_0<mode>,
atomic_or_fetch_cmp_0<mode>, atomic_xor_fetch_cmp_0<mode>):
Document
new named patterns.
* gcc.target/i386/pr98737-1.c: New test.
* gcc.target/i386/pr98737-2.c: New test.
* gcc.target/i386/pr98737-3.c: New test.
* gcc.target/i386/pr98737-4.c: New test.
* gcc.target/i386/pr98737-5.c: New test.
* gcc.target/i386/pr98737-6.c: New test.
* gcc.target/i386/pr98737-7.c: New test.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: jakub at gcc dot gnu.org @ 2022-01-12 15:50 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 52170
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52170&action=edit
gcc12-pr98737-canon.patch
Untested patch to fix up cases where users use the fetch_op or op_fetch atomic
when they should be using the other one.
* [Bug target/98737] Atomic operation on x86 not optimized to use flags
From: cvs-commit at gcc dot gnu.org @ 2022-01-14 11:07 UTC
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98737
--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:
https://gcc.gnu.org/g:9896e96d4cae00d0f4d2b694284cb30bbd9c80fc
commit r12-6577-g9896e96d4cae00d0f4d2b694284cb30bbd9c80fc
Author: Jakub Jelinek <jakub@redhat.com>
Date: Fri Jan 14 12:04:59 2022 +0100
forwprop: Canonicalize atomic fetch_op op x to op_fetch or vice versa
[PR98737]
When writing the PR98737 fix, I've handled just the case where people
use __atomic_op_fetch (p, x, y) etc.
But some people actually use the other builtins, like
__atomic_fetch_op (p, x, y) op x.
The following patch canonicalizes the latter to the former and vice versa
when possible, provided the result of the builtin has a single use and, if
that use is a cast of the same precision, that cast's lhs also has a single
use.
For all ops of +, -, &, | and ^ we can do those
__atomic_fetch_op (p, x, y) op x -> __atomic_op_fetch (p, x, y)
(and __sync too) opts, but cases of INTEGER_CST and SSA_NAME x
behave differently. For INTEGER_CST, typically - x is
canonicalized to + (-x), while for SSA_NAME we need to handle various
casts, which sometimes happen on the second argument of the builtin
(there can be even two subsequent casts for char/short due to the
promotions we do) and there can be a cast on the argument of op too.
And all ops but - are commutative.
For the other direction, i.e.
__atomic_op_fetch (p, x, y) rop x -> __atomic_fetch_op (p, x, y)
we can't handle op of & and |, those aren't reversible, for
op + rop is -, for - rop is + and for ^ rop is ^, otherwise the same
stuff as above applies.
And, there is another case, we canonicalize
x - y == 0 (or != 0) and x ^ y == 0 (or != 0) to x == y (or x != y)
and for constant y x + y == 0 (or != 0) to x == -y (or != -y),
so the patch also virtually undoes those canonicalizations, because
e.g. for the earlier PR98737 patch but even generally, it is better
if a result of atomic op fetch is compared against 0 than doing
atomic fetch op and compare it to some variable or non-zero constant.
As for debug info, for non-reversible operations (& and |) the patch
resets debug stmts if there are any, for -fnon-call-exceptions too
(didn't want to include debug temps right before all uses), but
otherwise it emits (on richi's request) the reverse operation from
the result as a new setter of the old lhs, so that later DCE fixes
up the debug info.
On the emitted assembly for the testcases which are fairly large,
I see substantial decreases of the *.s size:
-rw-rw-r--. 1 jakub jakub 116897 Jan 13 09:58 pr98737-1.svanilla
-rw-rw-r--. 1 jakub jakub 93861 Jan 13 09:57 pr98737-1.spatched
-rw-rw-r--. 1 jakub jakub 70257 Jan 13 09:57 pr98737-2.svanilla
-rw-rw-r--. 1 jakub jakub 67537 Jan 13 09:57 pr98737-2.spatched
There are some functions where due to RA we get one more instruction
than previously, but most of them are smaller even when not hitting
the PR98737 previous patch's optimizations.
2022-01-14 Jakub Jelinek <jakub@redhat.com>
PR target/98737
* tree-ssa-forwprop.c (simplify_builtin_call): Canonicalize
__atomic_fetch_op (p, x, y) op x into __atomic_op_fetch (p, x, y)
and __atomic_op_fetch (p, x, y) iop x into
__atomic_fetch_op (p, x, y).
* gcc.dg/tree-ssa/pr98737-1.c: New test.
* gcc.dg/tree-ssa/pr98737-2.c: New test.