From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 524FE39540C4; Tue, 19 May 2020 09:39:18 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 524FE39540C4
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
 s=default; t=1589881158;
 bh=Xm6T4EXm/qXHlltzAjSZ3/0irJpvxWj2Uva5HOFu+lU=;
 h=From:To:Subject:Date:In-Reply-To:References:From;
 b=h+FkVhjxFCdSb8rfOintwb1KRsNyA0qbzv+EyD2hEdyUqb56wA+Sx3EEiUhHshmUc
 QWkb9gB3QEGeILiIYFix5Pbe+wni7RWn6Fa19LTmycef/4jXHayhg2wbr+3wknCtAM
 IR5uyf0rtPbyrSw21nrQk2EsHbGQLL4hmlPFYDQw=
From: "guojiufu at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/88398] vectorization failure for a small loop
 to do byte comparison
Date: Tue, 19 May 2020 09:39:18 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 9.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: guojiufu at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: sudi at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-88398-4-cfiMoYQbCY@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-88398-4@http.gcc.gnu.org/bugzilla/>
References: <bug-88398-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <http://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <http://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 19 May 2020 09:39:18 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Jiu Fu Guo <guojiufu at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org
--- Comment #25 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)
> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len !=3D max)
>     if (p[len] !=3D cur[len])
>       break;
> end:
>   return cur[len];
> }
>
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

With this kind of widening, we can see a ~5% performance improvement for xz on
some platforms.
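
For illustration only, here is a minimal hand-written sketch of the widening
idea in portable C.  It is not what GCC emits and not the code from xz; the
name match_len_wide and its signature are made up for this example.  It
assumes len <= max and that all max bytes of both buffers are readable, which
is exactly what the compiler cannot prove by itself for the original loop.
memcpy is used for the 8-byte loads so that no alignment check or may_alias
cast is needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from GCC or xz): return the first index i in
   [len, max) with a[i] != b[i], or max if the two ranges are equal.
   Assumes len <= max and that a[0..max-1] and b[0..max-1] are readable.  */
static size_t
match_len_wide (const unsigned char *a, const unsigned char *b,
                size_t len, size_t max)
{
  /* Wide phase: compare 8 bytes per iteration.  memcpy keeps the loads
     valid for any alignment and avoids strict-aliasing problems.  */
  while (len + sizeof (uint64_t) <= max)
    {
      uint64_t wa, wb;
      memcpy (&wa, a + len, sizeof wa);
      memcpy (&wb, b + len, sizeof wb);
      if (wa != wb)
        break;                  /* The mismatch is within these 8 bytes.  */
      len += sizeof (uint64_t);
    }

  /* Byte phase: find the exact mismatch position, or finish the tail.  */
  while (len < max && a[len] == b[len])
    ++len;
  return len;
}
```

Unlike the byte loop, the wide phase may read up to 7 bytes past the first
mismatch (still below max), so it is only valid when the caller knows those
bytes are accessible.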