From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 57B4F3858C5F; Thu,  1 Jun 2023 07:58:54 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 57B4F3858C5F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1685606334;
	bh=d+aNjkmJZxsPUb7/c/0iTd1osGB99X74PpMAXW4QkiU=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=LZHSUf+c8mwXpdDvsJ+O/VtLnLqzN7OGHRZM2dItDpFQIap/783chbd/1VwgfzSR0
	 Pj9tIZqXfkuU+GX3LVoSbyt0R2tMEEuXLon4SXwtwnCo8nkOBRShWESmVidM1Mr8o1
	 aQs3B4xFhq3P+k3ZQSkzcSNDGstqPpUfNPKcczYs=
From: "slash.tmp at free dot fr" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/105617] [12/13/14 Regression] Slp is maybe too
 aggressive in some/many cases
Date: Thu, 01 Jun 2023 07:58:52 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.1.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: slash.tmp at free dot fr
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.4
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-105617-4-C8oiCUKFJT@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-105617-4@http.gcc.gnu.org/bugzilla/>
References: <bug-105617-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #18 from Mason <slash.tmp at free dot fr> ---
Hello Michael_S,

As far as I can see, massaging the source as follows helps GCC generate
optimal code (in terms of instruction count; I'm not yet convinced about
the scheduling).

#include <x86intrin.h>
typedef unsigned long long u64;
void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}
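
For checking the result of add4i on any target, a portable reference
might look like the sketch below. It spells out the carry chain that the
four _addcarry_u64 calls express: a 256-bit addition, least significant
limb first. The name add4_ref and the returned carry-out are assumptions
made for illustration; add4i above discards the final carry.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical reference helper (not from the bug report):
   dst = A + B over four 64-bit limbs, least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
static unsigned add4_ref(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned carry = 0;
  for (int i = 0; i < 4; i++) {
    u64 a = A[i];            /* read both inputs first, so dst may alias A or B */
    u64 b = B[i];
    u64 s = a + b;           /* wraps modulo 2^64 */
    unsigned c1 = s < a;     /* carry out of a + b */
    u64 t = s + carry;
    unsigned c2 = t < s;     /* carry out of adding the incoming carry */
    dst[i] = t;
    carry = c1 | c2;         /* the two carries can never both be set */
  }
  return carry;
}
```

Comparing dst from add4i against dst from add4_ref on inputs that force a
carry through every limb (e.g. A all ones, B = 1) should exercise the
whole chain.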


On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.