From: "slash.tmp at free dot fr"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
Date: Thu, 01 Jun 2023 07:58:52 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.1.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Target-Milestone: 12.4

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #18 from Mason ---
Hello Michael_S,

As far as I can see, massaging the source helps GCC generate optimal code
(in terms of instruction count, not convinced about scheduling).

#include <immintrin.h>

typedef unsigned long long u64;

void add4i(u64 dst[4], const u64 A[4], const u64 B[4])
{
  unsigned char c = 0;
  c = _addcarry_u64(c, A[0], B[0], dst+0);
  c = _addcarry_u64(c, A[1], B[1], dst+1);
  c = _addcarry_u64(c, A[2], B[2], dst+2);
  c = _addcarry_u64(c, A[3], B[3], dst+3);
}

On godbolt, gcc-{11.4, 12.3, 13.1, trunk} -O3 -march=znver1 all generate
the expected:

add4i:
        movq    (%rdx), %rax
        addq    (%rsi), %rax
        movq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    8(%rdx), %rax
        movq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    16(%rdx), %rax
        movq    %rax, 16(%rdi)
        movq    24(%rdx), %rax
        adcq    24(%rsi), %rax
        movq    %rax, 24(%rdi)
        ret

I'll run a few benchmarks to test optimal scheduling.
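
Before benchmarking, it may help to cross-check the intrinsic version against a
portable reference. The following is only a sketch: add4_ref and
check_add4_ref are hypothetical names that do not appear in the bug report, and
the carry is recovered with unsigned comparisons rather than _addcarry_u64, so
the code is not expected to compile to the add/adc chain shown above.

```c
#include <assert.h>

typedef unsigned long long u64;

/* Hypothetical portable reference: 256-bit add, least significant limb
   first. Returns the final carry-out (0 or 1). */
static unsigned add4_ref(u64 dst[4], const u64 A[4], const u64 B[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        u64 s = A[i] + B[i];       /* may wrap around modulo 2^64 */
        unsigned c1 = s < A[i];    /* carry out of A[i] + B[i] */
        u64 t = s + carry;         /* add the incoming carry */
        unsigned c2 = t < s;       /* carry out of adding the carry-in */
        dst[i] = t;
        carry = c1 | c2;           /* c1 and c2 are never both set */
    }
    return carry;
}

/* Small self-check: the carry must ripple through all four limbs. */
static void check_add4_ref(void)
{
    const u64 a[4] = { ~0ULL, ~0ULL, ~0ULL, ~0ULL };
    const u64 b[4] = { 1, 0, 0, 0 };
    u64 d[4];
    unsigned c = add4_ref(d, a, b);
    assert(c == 1);
    for (int i = 0; i < 4; i++)
        assert(d[i] == 0);
}
```

Feeding the same inputs to add4i and add4_ref and comparing the four output
limbs should give identical results; note that add4i as posted discards the
final carry, whereas this sketch returns it.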