From: "rsandifo at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/113441] [14 Regression] Fail to fold the last
 element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7
Date: Mon, 04 Mar 2024 16:16:26 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
--- Comment #41 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #40)
> So I wonder if we can use "local costing" to decide a gather is always OK
> compared to the alternative with peeling for gaps.  On x86 gather tends
> to be slow compared to open-coding it.
Yeah, on SVE gathers are generally “enabling” instructions rather than
something to use for their own sake.

I suppose one problem is that we currently only try to use gathers for
single-element groups.  If we make a local decision to use gathers while
keeping that restriction, we could end up using gathers “unnecessarily” while
still needing to peel for gaps for (say) a two-element group.

That is, it's only better to use gathers than contiguous loads if by doing that
we avoid all need to peel for gaps (and if the cost of peeling for gaps was
high enough to justify the cost of using gathers over consecutive loads).
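
To make the trade-off concrete, here is a minimal sketch in plain C (not GCC
internals) of a single-element group with a group size of 2, i.e. a loop that
reads a[2 * i].  The names VF, last_index_scalar and last_index_contiguous are
invented for illustration, and the sketch assumes that each vector iteration
covers VF scalar iterations using full-width contiguous loads.  It compares
the highest element index touched by the scalar loop with the highest index
touched by such a vector loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vectorization factor: scalar iterations per vector iteration. */
enum { VF = 4 };

/* Scalar loop: for (i = 0; i < n; i++) sum += a[2 * i];
   The highest element index read is 2 * (n - 1).  */
size_t last_index_scalar(size_t n)
{
    assert(n > 0);
    return 2 * (n - 1);
}

/* Vector loop with contiguous loads: each vector iteration reads the
   2 * VF consecutive elements starting at a[2 * i] and then discards
   the odd-indexed ones.  Assumes n is a multiple of VF.  */
size_t last_index_contiguous(size_t n)
{
    assert(n > 0 && n % VF == 0);
    size_t first_of_last_iter = 2 * (n - VF);
    return first_of_last_iter + 2 * VF - 1;
}
```

With n = 8 the scalar loop stops at a[14], whereas the contiguous version also
reads a[15], an element in the trailing gap that the original program never
accesses.  Peeling for gaps exists to avoid that overread; a gather of only the
even indices would avoid it too, but at the cost of using a gather.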

One of the things on the list to do (once everything is SLP!) is to support
loads with gaps directly via predication, so that we never load elements that
aren't needed.  E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a
32-bit load (LD1W .S) to load only even-indexed elements.  So a single-element
group with a group size of 2 could be done cheaply with just consecutive loads,
without peeling for gaps.
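
The predicate trick can be modelled in portable C as a rough sketch.  This is
a simulation rather than real SVE code: the 128-bit vector length and the
names model_ptrue_d and model_ld1w_s are assumptions made for illustration.
It relies on SVE predicates holding one bit per vector byte, with an element
being active when the bit for its lowest byte is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed vector length for the sketch: 128 bits = 16 bytes. */
enum { VL_BYTES = 16, WORD_LANES = VL_BYTES / 4 };

/* Model of PTRUE Pd.D: one predicate bit per vector byte, with only
   the lowest bit of each 8-byte granule set.  */
void model_ptrue_d(bool pred[VL_BYTES])
{
    for (int b = 0; b < VL_BYTES; b++)
        pred[b] = (b % 8 == 0);
}

/* Model of LD1W { Zt.S }, Pg/Z, [base]: 32-bit lane e is active if
   predicate bit 4 * e is set.  Inactive lanes are zeroed and their
   memory is never read.  Returns the number of lanes actually read.  */
int model_ld1w_s(uint32_t dst[WORD_LANES], const bool pred[VL_BYTES],
                 const uint32_t *base)
{
    assert(base != NULL);
    int lanes_read = 0;
    for (int e = 0; e < WORD_LANES; e++) {
        if (pred[4 * e]) {
            dst[e] = base[e];   /* even lanes: bits 0, 8, ... are set */
            lanes_read++;
        } else {
            dst[e] = 0;         /* odd lanes: bits 4, 12, ... are clear */
        }
    }
    return lanes_read;
}
```

In this model, loading from a three-element array { 10, 11, 12 } yields the
lanes { 10, 0, 12, 0 } and never touches the fourth element, which is the
property that would make peeling for gaps unnecessary.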