From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 2EE333858C1F; Fri, 16 Jun 2023 14:56:10 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2EE333858C1F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1686927370;
	bh=LaOt40vJB5PcRaUwT2oEozWdpUQ3Fu0vVOLcnLqn+k4=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=lSsmlbaPO5iApBNybSlRwOls9pwrWk2/CRbGmXLX696R5p2z0I/VBro+0XGlpCvXA
	 m/henrEt5Uh+1qacUY/H56jr6iyczGNddvz4ozOCW5eqt/QwxqqPI5qNnSHmpwdVnN
	 Niz2LcszsojswSZn0UFGP50jWOVDRDcUoAGIPZBo=
From: "already5chosen at yahoo dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/105617] [12/13/14 Regression] Slp is maybe too
 aggressive in some/many cases
Date: Fri, 16 Jun 2023 14:56:08 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.1.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: already5chosen at yahoo dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.4
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-105617-4-VNADzZo6Ab@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-105617-4@http.gcc.gnu.org/bugzilla/>
References: <bug-105617-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617
--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Mason from comment #20)
> Doh! You're right.
> I come from a background where overlapping/aliasing inputs are heresy,
> thus got blindsided :(
>
> This would be the optimal code, right?
>
> add4i:
> # rdi = dst, rsi = a, rdx = b
> 	movq	 0(%rdx), %r8
> 	movq	 8(%rdx), %rax
> 	movq	16(%rdx), %rcx
> 	movq	24(%rdx), %rdx
> 	addq	 0(%rsi), %r8
> 	adcq	 8(%rsi), %rax
> 	adcq	16(%rsi), %rcx
> 	adcq	24(%rsi), %rdx
> 	movq	%r8,   0(%rdi)
> 	movq	%rax,  8(%rdi)
> 	movq	%rcx, 16(%rdi)
> 	movq	%rdx, 24(%rdi)
> 	ret
>

If one does not care deeply about latency (which is likely for a function that
stores its result into memory), then that looks good enough.
But if one does care deeply, then I'd expect interleaved loads, as in the first 8
lines of the code generated by trunk, to produce slightly lower latency on the
majority of modern CPUs.
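
For reference, here is a minimal C sketch of what the quoted assembly computes.
It is not the test case from this bug. The signature, the limb order (least
significant first) and the helper add4i_check are assumptions made for
illustration. Like the quoted assembly, it finishes all loads before the first
store, so dst may overlap a or b, and the carry out of the top limb is dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of the quoted add4i: dst = a + b over four
   64-bit limbs, least significant limb first (assumed layout). */
void add4i(uint64_t dst[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t sum[4];
    uint64_t carry = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t t = a[i] + carry;   /* wraps only if a[i] is all ones and carry is 1 */
        uint64_t c1 = t < carry;     /* carry out of the first addition */
        sum[i] = t + b[i];
        uint64_t c2 = sum[i] < t;    /* carry out of the second addition */
        carry = c1 | c2;             /* at most one of c1, c2 can be set */
    }

    /* Store only after every input limb has been read, so that
       overlapping dst/a/b still gives the right result. */
    for (int i = 0; i < 4; i++)
        dst[i] = sum[i];
}

/* Hypothetical self-check: carry ripples through two limbs while dst aliases a. */
void add4i_check(void)
{
    uint64_t x[4] = { UINT64_MAX, UINT64_MAX, 0, 0 };
    const uint64_t one[4] = { 1, 0, 0, 0 };

    add4i(x, x, one);
    assert(x[0] == 0 && x[1] == 0 && x[2] == 1 && x[3] == 0);
}
```

Whether a compiler turns this into an addq/adcq chain like the one quoted above
depends on the compiler version and flags, so this is a statement of the
intended semantics rather than a codegen recipe.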