From: "thiago at kde dot org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/110184] New: [i386] Missed optimisation: atomic
         operations should use PF, ZF and SF
Date: Thu, 08 Jun 2023 22:55:32 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110184

            Bug ID: 110184
           Summary: [i386] Missed optimisation: atomic operations should
                    use PF, ZF and SF
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: thiago at kde dot org
  Target Milestone: ---

Follow-up from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

The x86 locked ALU operations always set PF, ZF and
SF, so the atomic builtins could use those to emit more optimal code
instead of a cmpxchg loop.

Given:

    template <auto Op> int atomic_rmw_op(std::atomic_int &i)
    {
        int old = Op(i);
        if (old == 0)
            return 1;
        if (old < 0)
            return 2;
        return 0;
    }

-------

Starting with the non-standard __atomic_OP_fetch, the current code for

    inline int andn_fetch_1(std::atomic_int &i)
    {
        return __atomic_and_fetch((int *)&i, ~1, 0);
    }

is

    .L33:
        movl    %eax, %edx
        andl    $-2, %edx
        lock cmpxchgl %edx, (%rdi)
        jne     .L33
        movl    %edx, %eax
        shrl    $31, %eax
        addl    %eax, %eax        // eax = 2 if edx < 0
        testl   %edx, %edx
        movl    $1, %edx
        cmove   %edx, %eax

But it could be more optimally written as:

        movl    $1, %ecx
        movl    $2, %edx
        xorl    %eax, %eax
        lock andl $-2, (%rdi)
        cmove   %ecx, %eax
        cmovs   %edx, %eax

The other __atomic_OP_fetch operations are very similar. I note that GCC
already realises that if you perform __atomic_and_fetch(ptr, 1), the
result can't have the sign bit set.

-------

For the standard atomic_fetch_OP operations, there are a couple of
caveats:

fetch_and: only if the retrieved value is ANDed again with the same
pattern; for example:

    int pattern = 0x80000001;
    return i.fetch_and(pattern, std::memory_order_relaxed) & pattern;

This appears to be partially implemented, depending on what the pattern
is. For example, it generates the optimal code for pattern = 3, 15,
0x7fffffff, 0x80000000. It appears to be related to testing for either
SF or ZF, but not both.

fetch_or: always for SF, in the useful case where the pattern being ORed
doesn't already contain the sign bit. If it does (a "non-useful case"),
the comparison is a constant; likewise for ZF, because ZF is never set
if the pattern isn't zero.

fetch_xor: always, because the original value is reconstructible. Avoid
generating unnecessary code in case the caller already does the XOR
itself, as in:

    return i.fetch_xor(1, std::memory_order_relaxed) ^ 1;

See https://gcc.godbolt.org/z/n9bMnaE4e for full results.