From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 0060D3858CDB; Tue,  7 Nov 2023 18:38:28 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0060D3858CDB
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1699382309;
	bh=XvHz/seyQjLrwMx/zVw1sC+3Gx1FeHWfYMPYKhAG6mY=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=FHsEsoUsCz9yC/aeSKB7KlPWCmYTncUySq4Ot0TX/G/IlmAWBd7h58qk3m9wCFnQS
	 NKy3h99fSNsvwc2J9Wk36dA8jca3RtVKBbQYJi2my4+XDL+T0131h/sUnQpffI7yAe
	 OH0reaYR7Ix5fBPSnf69Ngx8mEOdLgvc7ePWkbjQ=
From: "tkoenig at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient
 handling of 128-bit arguments
Date: Tue, 07 Nov 2023 18:38:28 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: missed-optimization, ra
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tkoenig at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.5
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-97756-4-sfkc3zI7ST@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-97756-4@http.gcc.gnu.org/bugzilla/>
References: <bug-97756-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756
--- Comment #13 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Patrick Palka from comment #3)
> Perhaps related to this PR: On x86_64, the following basic wrapper around
> int128 addition
>
>   __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
>
> gets compiled (/w -O3, -O2 or -Os) to the seemingly suboptimal
>
>         movq    %rdi, %r9
>         movq    %rdx, %rax
>         movq    %rsi, %r8
>         movq    %rcx, %rdx
>         addq    %r9, %rax
>         adcq    %r8, %rdx
>         ret
>
> Clang does:
>
>         movq    %rdi, %rax
>         addq    %rdx, %rax
>         adcq    %rcx, %rsi
>         movq    %rsi, %rdx
>         retq

With current trunk, this is now

        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret

so it looks OK.
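
For anyone who wants to re-check this locally, here is a minimal sketch (not
part of the PR's test case). f() is the wrapper from comment #3; the helpers
add128_halves(), halves_match() and self_check() are hypothetical names made
up for illustration. They spell out the same addition in 64-bit halves, which
is roughly what the addq/adcq pair above computes, assuming the usual x86_64
convention of passing each 128-bit argument as a low/high register pair.

```c
#include <assert.h>
#include <stdint.h>

/* The wrapper from comment #3: plain 128-bit addition. */
__uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }

/* Hypothetical helper: the same addition done on 64-bit halves.
   The low add corresponds to addq, the high add plus carry to adcq. */
void add128_halves(uint64_t xlo, uint64_t xhi,
                   uint64_t ylo, uint64_t yhi,
                   uint64_t *lo, uint64_t *hi)
{
    uint64_t l = xlo + ylo;        /* wraps modulo 2^64 */
    uint64_t carry = l < xlo;      /* 1 if the low add overflowed */
    *lo = l;
    *hi = xhi + yhi + carry;       /* high halves plus carry-in */
}

/* Hypothetical helper: returns 1 if f() and the half-wise version agree. */
int halves_match(uint64_t xlo, uint64_t xhi, uint64_t ylo, uint64_t yhi)
{
    __uint128_t x = ((__uint128_t)xhi << 64) | xlo;
    __uint128_t y = ((__uint128_t)yhi << 64) | ylo;
    __uint128_t s = f(x, y);
    uint64_t lo, hi;

    add128_halves(xlo, xhi, ylo, yhi, &lo, &hi);
    return lo == (uint64_t)s && hi == (uint64_t)(s >> 64);
}

/* Hypothetical self-check covering a carry out of the low half
   and a wrap-around of the full 128-bit value. */
void self_check(void)
{
    assert(halves_match(UINT64_MAX, 0, 1, 0));
    assert(halves_match(UINT64_MAX, UINT64_MAX, 1, 0));
    assert(halves_match(1, 2, 3, 4));
}
```

Compiling just f() with gcc -O2 -S and looking at the assembly is enough to
compare against the sequences quoted above; the helpers are only there to make
the carry handling explicit.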

The original test case regressed a bit; it is now 39 instructions.