From: rguenth at gcc dot gnu.org
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/107451] [11/12/13 Regression] Segmentation fault with vectorized code since r11-6434
Date: Thu, 17 Nov 2022 14:48:26 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Peeling for gaps also isn't a good fix here.  One could envision a case
with a load reaching even three iterations ahead, with something like

 for (i = 0; i < n; i++) {
        dot[0] += x[ix] * y[ix] ;
        dot[1] += x[ix] * y[ix] ;
        dot[2] += x[ix] * y[ix] ;
        dot[3] += x[ix] * y[ix] ;
        ix += inc_x ;
 }

or similar.  The root cause is how we generate code for VMAT_STRIDED_SLP:
we first generate loads to fill a contiguous output vector and only then
create the permute using the pieces that are actually necessary.

We could simply fail if 'nloads' is bigger than 'vf', or cap 'nloads' and
fail if we then cannot generate the permutation.

When we force VMAT_ELEMENTWISE the very same issue arises, but later
optimization eliminates the unnecessary loads, avoiding the problem:

  _62 = *ivtmp_64;
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  _59 = *ivtmp_60;
  _58 = MEM[(const double *)ivtmp_60 + 8B];
  ivtmp_57 = ivtmp_60 + _75;
  vect_cst__48 = {_62, _61, _59, _58};
  vect__4.12_47 = VEC_PERM_EXPR <...>;

that just becomes

  _62 = MEM[(const double *)ivtmp_64];
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  vect__4.12_47 = {_61, _62, _61, _62};

With cost modeling and VMAT_ELEMENTWISE we fall back to SSE vectorization,
which works fine.

I fear the proper fix is to integrate load emission with
vect_transform_slp_perm_load somehow; we shouldn't rely on follow-up
simplifications to fix what the vectorizer emits here.

Since we have no fallback, detecting the situation and avoiding it
completely would mean not vectorizing the code (with AVX).
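
To make the codegen difference above concrete, here is a hand-written C
sketch (not the vectorizer's actual output and not GCC source; the V4DF
typedefs, the function names, and the fixed { 1, 0, 1, 0 } permute mask are
illustrative, chosen to mirror the dump above).  The first function mimics
the current VMAT_STRIDED_SLP order -- fill a contiguous vector with all
scalar loads first, then permute -- where the unused lanes can read past
the end of the array on the last iteration; the second loads only the
lanes the permutation actually uses:

  /* Illustrative only: AVX-sized vectors of four doubles / four
     8-byte integers (for the shuffle mask).  */
  typedef double v4df __attribute__ ((vector_size (32)));
  typedef long long v4di __attribute__ ((vector_size (32)));

  /* Current VMAT_STRIDED_SLP order: fill a contiguous vector first,
     then permute.  tmp2/tmp3 are dead after the permute but may read
     past the end of x.  */
  v4df
  strided_slp_style (const double *x, long ix)
  {
    double tmp0 = x[ix];
    double tmp1 = x[ix + 1];
    double tmp2 = x[ix + 2];   /* not selected by the permute */
    double tmp3 = x[ix + 3];   /* not selected by the permute */
    v4df vec = { tmp0, tmp1, tmp2, tmp3 };
    return __builtin_shuffle (vec, (v4di) { 1, 0, 1, 0 });
  }

  /* What the simplified VMAT_ELEMENTWISE code (or a load emission
     integrated with vect_transform_slp_perm_load) boils down to:
     only the lanes the permute selects are loaded.  */
  v4df
  needed_lanes_only (const double *x, long ix)
  {
    return (v4df) { x[ix + 1], x[ix], x[ix + 1], x[ix] };
  }

Whether a fix caps 'nloads' or integrates the permute into load emission,
the end state should look like the second variant: no loads beyond what
the permutation consumes.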