From: rguenth at gcc dot gnu.org
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/107451] [11/12/13 Regression] Segmentation fault with vectorized code since r11-6434
Date: Thu, 17 Nov 2022 14:48:26 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Peeling for gaps also isn't a good fix here.  One could envision a case
with a load reaching even three iterations ahead, with something like

 for (i = 0; i < n; i++) {
        dot[0] += x[ix] * y[ix] ;
        dot[1] += x[ix] * y[ix] ;
        dot[2] += x[ix] * y[ix] ;
        dot[3] += x[ix] * y[ix] ;
        ix += inc_x ;
 }

or similar.  The root cause is how we generate code for VMAT_STRIDED_SLP:
we first generate loads to fill a contiguous output vector and only then
create the permute using the pieces that are actually necessary.

We could simply fail if 'nloads' is bigger than 'vf', or cap 'nloads' and
fail if we then cannot generate the permutation.

When we force VMAT_ELEMENTWISE the very same issue arises, but later
optimization eliminates the unnecessary loads, avoiding the problem:

  _62 = *ivtmp_64;
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  _59 = *ivtmp_60;
  _58 = MEM[(const double *)ivtmp_60 + 8B];
  ivtmp_57 = ivtmp_60 + _75;
  vect_cst__48 = {_62, _61, _59, _58};
  vect__4.12_47 = VEC_PERM_EXPR <...>;

that just becomes

  _62 = MEM[(const double *)ivtmp_64];
  _61 = MEM[(const double *)ivtmp_64 + 8B];
  ivtmp_60 = ivtmp_64 + _75;
  vect__4.12_47 = {_61, _62, _61, _62};

With cost modeling and VMAT_ELEMENTWISE we fall back to SSE vectorization,
which works fine.

I fear the proper fix is to integrate load emission with
vect_transform_slp_perm_load somehow; we shouldn't rely on follow-up
simplifications to fix what the vectorizer emits here.

Since we have no fallback, detecting the situation and avoiding it
completely would mean not vectorizing the code (with AVX).
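
To make the codegen difference above concrete, here is a hand-written C
sketch (not the vectorizer's actual output and not GCC source; the V4DF
typedefs, the function names, and the fixed { 1, 0, 1, 0 } permute mask are
illustrative, chosen to mirror the dump above).  The first function mimics
the current VMAT_STRIDED_SLP order -- fill a contiguous vector with all
scalar loads first, then permute -- where the unused lanes can read past
the end of the array on the last iteration; the second loads only the
lanes the permutation actually uses:

  /* Illustrative only: AVX-sized vectors of four doubles / four
     8-byte integers (for the shuffle mask).  */
  typedef double v4df __attribute__ ((vector_size (32)));
  typedef long long v4di __attribute__ ((vector_size (32)));

  /* Current VMAT_STRIDED_SLP order: fill a contiguous vector first,
     then permute.  tmp2/tmp3 are dead after the permute but may read
     past the end of x.  */
  v4df
  strided_slp_style (const double *x, long ix)
  {
    double tmp0 = x[ix];
    double tmp1 = x[ix + 1];
    double tmp2 = x[ix + 2];   /* not selected by the permute */
    double tmp3 = x[ix + 3];   /* not selected by the permute */
    v4df vec = { tmp0, tmp1, tmp2, tmp3 };
    return __builtin_shuffle (vec, (v4di) { 1, 0, 1, 0 });
  }

  /* What the simplified VMAT_ELEMENTWISE code (or a load emission
     integrated with vect_transform_slp_perm_load) boils down to:
     only the lanes the permute selects are loaded.  */
  v4df
  needed_lanes_only (const double *x, long ix)
  {
    return (v4df) { x[ix + 1], x[ix], x[ix + 1], x[ix] };
  }

Whether a fix caps 'nloads' or integrates the permute into load emission,
the end state should look like the second variant: no loads beyond what
the permutation consumes.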