From: "guojiufu at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison
Date: Tue, 19 May 2020 09:39:18 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 9.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: sudi at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Jiu Fu Guo changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org

--- Comment #25 from Jiu Fu Guo ---
(In reply to Jakub Jelinek
from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len != max)
>     if (p[len] != cur[len])
>       break;
> end:
>   return cur[len];
> }
>
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

With this kind of widening, we can see ~5% performance improvement for xz on
some platforms.
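
For illustration, the widening idea can be sketched at the source level as
below. This is only a minimal sketch under my own assumptions, not what GCC
emits and not the xz code: the helper names match_len_bytewise and
match_len_widened are hypothetical, and the interface (return the count of
equal leading bytes, at most max) is simplified compared to func2 above. It
uses memcpy for the 64-bit loads instead of may_alias pointers, so it does not
depend on pos or cur being aligned, and it never reads past max bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference version: count equal leading bytes of a and b, one at a time. */
static int match_len_bytewise(const unsigned char *a, const unsigned char *b,
                              int max)
{
  int len = 0;
  while (len < max && a[len] == b[len])
    ++len;
  return len;
}

/* Widened version (hypothetical helper): compare 8 bytes per iteration. */
static int match_len_widened(const unsigned char *a, const unsigned char *b,
                             int max)
{
  assert(max >= 0);
  int len = 0;

  /* Only enter the wide loop while a full 8-byte chunk is still in range. */
  while (max - len >= 8)
    {
      uint64_t wa, wb;
      /* memcpy gives an unaligned-safe, aliasing-safe 64-bit load. */
      memcpy(&wa, a + len, sizeof wa);
      memcpy(&wb, b + len, sizeof wb);
      if (wa != wb)
        break;                  /* the mismatch is inside this chunk */
      len += 8;
    }

  /* Locate the exact mismatch, or finish the tail, byte by byte. */
  while (len < max && a[len] == b[len])
    ++len;
  return len;
}
```

Locating the mismatch with a byte loop keeps the sketch independent of
endianness; a count-trailing-zeros on wa ^ wb would be faster on little-endian
targets, but that is a further assumption about the platform.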