From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48)
	id 2EE333858C1F; Fri, 16 Jun 2023 14:56:10 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2EE333858C1F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1686927370;
	bh=LaOt40vJB5PcRaUwT2oEozWdpUQ3Fu0vVOLcnLqn+k4=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=lSsmlbaPO5iApBNybSlRwOls9pwrWk2/CRbGmXLX696R5p2z0I/VBro+0XGlpCvXA
	 m/henrEt5Uh+1qacUY/H56jr6iyczGNddvz4ozOCW5eqt/QwxqqPI5qNnSHmpwdVnN
	 Niz2LcszsojswSZn0UFGP50jWOVDRDcUoAGIPZBo=
From: "already5chosen at yahoo dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases
Date: Fri, 16 Jun 2023 14:56:08 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.1.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: already5chosen at yahoo dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.4
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617

--- Comment #21 from Michael_S ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
>
> This would be the optimal code, right?
>
> add4i:
> # rdi = dst, rsi = a, rdx = b
>   movq 0(%rdx), %r8
>   movq 8(%rdx), %rax
>   movq 16(%rdx), %rcx
>   movq 24(%rdx), %rdx
>   addq 0(%rsi), %r8
>   adcq 8(%rsi), %rax
>   adcq 16(%rsi), %rcx
>   adcq 24(%rsi), %rdx
>   movq %r8, 0(%rdi)
>   movq %rax, 8(%rdi)
>   movq %rcx, 16(%rdi)
>   movq %rdx, 24(%rdi)
>   ret
>

If one does not care deeply about latency (which is likely for a function that
stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first
8 lines of the code generated by trunk, to produce slightly lower latency on
the majority of modern CPUs.
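
For reference, here is a minimal portable C sketch of what the quoted assembly computes. It is an illustration only, not the test case from this bug: the name add4i_ref and the use of uint64_t limbs are assumptions, and the carry is tracked with comparisons rather than any intrinsic. Like the assembly, it reads every input limb before it writes anything to dst, which is what keeps it correct when dst overlaps a or b. C source order says nothing about instruction scheduling, so this shows the semantics only, not the interleaved-load ordering discussed above.

```c
#include <stdint.h>

/* Hypothetical reference for the quoted add4i (name and types are
   assumptions): dst = a + b over four 64-bit limbs, least significant
   limb first. The final carry out is dropped, as in the assembly. */
void add4i_ref(uint64_t dst[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t sum[4];
    unsigned carry = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t x = a[i];
        uint64_t y = b[i];
        uint64_t s = x + y;        /* may wrap around */
        unsigned c1 = s < x;       /* carry out of x + y */
        uint64_t t = s + carry;    /* add the carry in */
        unsigned c2 = t < s;       /* carry out of adding the carry in */
        sum[i] = t;
        carry = c1 | c2;           /* at most one of c1, c2 can be set */
    }

    /* Store only after all loads are done, so dst may overlap a or b. */
    for (int i = 0; i < 4; i++)
        dst[i] = sum[i];
}
```

Calling add4i_ref(buf + 1, buf, buf) on a five-element buffer is the kind of partially overlapping case that a store-as-you-go version would get wrong.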