From: "evstupac at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/52252] New: An opportunity for x86 gcc vectorizer (gain up to 3 times)
Date: Tue, 14 Feb 2012 22:42:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52252

             Bug #: 52252
           Summary: An opportunity for x86 gcc vectorizer (gain up to 3 times)
    Classification: Unclassified
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: evstupac@gmail.com


This is an example of byte conversion from RGB (Red Green Blue) to CMYK (Cyan
Magenta Yellow blacK):

#define byte unsigned char
#define MIN(a, b) ((a) > (b)?(b):(a))

void convert_image(byte *in, byte *out, int size) {
    int i;
    for(i = 0; i < size; i++) {
        byte r = in[0];
        byte g = in[1];
        byte b = in[2];
        byte c, m, y, k, tmp;
        c = 255 - r;
        m = 255 - g;
        y = 255 - b;
        tmp = MIN(m, y);
        k = MIN(c, tmp);
        out[0] = c - k;
        out[1] = m - k;
        out[2] = y - k;
        out[3] = k;
        in += 3;
        out += 4;
    }
}

Here trunk gcc for Arm unrolls the loop by 2 and vectorizes it using neon;
gcc for x86 does not vectorize it.

There are 2 tricky points in this loop:
1) It converts 3 bytes into 4
2) We need to shuffle bytes after the load:
   Let 0123456789ABCDEF be 16 bytes in the “in” array (first rgb is 012, next 345…)
   To compute the vector minimum we need to place bytes 0, 1, 2 into 3 different
   vectors.

Gcc for Arm does this with 2 special loads:
    vld3.8 {d16, d18, d20}, [r2]!
    vld3.8 {d17, d19, d21}, [r2]
putting bytes 0 and 3 into q8 (d16, d17),
        bytes 1 and 4 into q9 (d18, d19),
        bytes 2 and 5 into q10 (d20, d21).

After all vector transformations it writes the result with 2 special stores:
    vst4.8 {d8, d10, d12, d14}, [r3]!
    vst4.8 {d9, d11, d13, d15}, [r3]

However, x86 gcc can do the same loads:
    movq (%edi),%mm5
    movq %mm5,%mm7
    movq %mm5,%mm6
    pshufb %mm3,%mm5 /*0x00ffffff03ffffff*/
    pshufb %mm2,%mm6 /*0x01ffffff04ffffff*/
    pshufb %mm1,%mm7 /*0x02ffffff05ffffff*/
    /* %mm5 – r, %mm6 – g, %mm7 – b */

And the same stores:
    pslld $0x8,%mm6
    pslld $0x10,%mm7
    pslld $0x18,%mm4
    pxor %mm5,%mm6
    pxor %mm7,%mm4
    pxor %mm6,%mm4
    pshufb %mm0,%mm4 /*0x0001020304050607*/ /*here redundant*/
    movq %mm4,(%esi)
    /* %mm5 – c, %mm6 – m, %mm7 – y, %mm4 – k */

The pshufb here does nothing, so it could be removed; we would need to shuffle
the bytes only if we stored fewer than 4 of them.

Moreover, x86 gcc can unroll not only by 2 but by 4, with the following loads:
    movdqu (%edi),%xmm5
    movdqa %xmm5,%xmm7
    movdqa %xmm5,%xmm6
    pshufb %xmm3,%xmm5 /*0x00ffffff03ffffff06ffffff09ffffff*/
    pshufb %xmm2,%xmm6 /*0x01ffffff04ffffff07ffffff0affffff*/
    pshufb %xmm1,%xmm7 /*0x02ffffff05ffffff08ffffff0bffffff*/
    /* %xmm5 – r, %xmm6
       – g, %xmm7 – b */

And stores:
    pslld $0x8,%xmm6
    pslld $0x10,%xmm7
    pslld $0x18,%xmm4
    pxor %xmm5,%xmm6
    pxor %xmm7,%xmm4
    pxor %xmm6,%xmm4
    pshufb %xmm0,%xmm4 /*0x000102030405060708090a0b0c0d0e0f*/ /*here redundant*/
    movdqa %xmm4,(%esi)
    /* %xmm5 – c, %xmm6 – m, %xmm7 – y, %xmm4 – k */
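
The 4-pixel shuffle-and-pack scheme described above can be sanity-checked with a small portable sketch. This is not the code gcc would emit and it uses no intrinsics: pshufb16 is a hypothetical helper that emulates the byte-wise effect of pshufb on a 16-byte register, and MASK_R, MASK_G, MASK_B, convert_4_pixels and emulation_matches_reference are illustrative names that do not appear in the report. The arithmetic between the loads and the stores is not shown in the report, so the per-lane subtraction and minimum below are an assumption taken from the scalar loop.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char byte;
#define MIN(a, b) ((a) > (b) ? (b) : (a))

/* Scalar reference: same computation as the loop in the report. */
static void convert_image_ref(const byte *in, byte *out, int size)
{
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        byte c = 255 - in[0], m = 255 - in[1], y = 255 - in[2];
        byte tmp = MIN(m, y);
        byte k = MIN(c, tmp);
        out[0] = c - k; out[1] = m - k; out[2] = y - k; out[3] = k;
        in += 3; out += 4;
    }
}

/* Hypothetical helper emulating pshufb on 16 bytes: a mask byte with the
   high bit set produces 0, otherwise its low 4 bits select a source byte. */
static void pshufb16(byte dst[16], const byte src[16], const byte mask[16])
{
    byte tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
    memcpy(dst, tmp, 16);
}

/* Shuffle masks from the report, lowest byte first. */
static const byte MASK_R[16] = {0, 0xff, 0xff, 0xff, 3, 0xff, 0xff, 0xff,
                                6, 0xff, 0xff, 0xff, 9, 0xff, 0xff, 0xff};
static const byte MASK_G[16] = {1, 0xff, 0xff, 0xff, 4, 0xff, 0xff, 0xff,
                                7, 0xff, 0xff, 0xff, 10, 0xff, 0xff, 0xff};
static const byte MASK_B[16] = {2, 0xff, 0xff, 0xff, 5, 0xff, 0xff, 0xff,
                                8, 0xff, 0xff, 0xff, 11, 0xff, 0xff, 0xff};

/* Converts 4 RGB pixels (12 bytes used out of a 16-byte load) to 16 CMYK
   bytes. Like movdqu, it assumes all 16 input bytes are readable. */
static void convert_4_pixels(const byte in16[16], byte out16[16])
{
    byte r[16], g[16], b[16];
    pshufb16(r, in16, MASK_R);   /* r in byte 0 of each 32-bit lane */
    pshufb16(g, in16, MASK_G);   /* g in byte 0 of each 32-bit lane */
    pshufb16(b, in16, MASK_B);   /* b in byte 0 of each 32-bit lane */

    for (int lane = 0; lane < 4; lane++) {
        int p = 4 * lane;
        byte c = 255 - r[p], m = 255 - g[p], y = 255 - b[p];
        byte tmp = MIN(m, y);
        byte k = MIN(c, tmp);
        /* Equivalent of pslld $8/$16/$24 followed by the pxor merges:
           m, y and k land in bytes 1, 2 and 3 of the lane. */
        out16[p]     = c - k;
        out16[p + 1] = m - k;
        out16[p + 2] = y - k;
        out16[p + 3] = k;
    }
}

/* Returns 1 if the emulated path agrees with the scalar loop on a sample. */
int emulation_matches_reference(void)
{
    byte in[16] = {255, 0, 0,  0, 255, 0,  10, 20, 30,  200, 200, 200,
                   0, 0, 0, 0};   /* 4 pixels plus 4 padding bytes */
    byte expect[16], got[16];
    convert_image_ref(in, expect, 4);
    convert_4_pixels(in, got);
    return memcmp(expect, got, 16) == 0;
}
```

Since each 32-bit lane already holds its bytes in c, m, y, k order after the merges, the sketch needs no final shuffle, which matches the remark that the last pshufb is redundant here.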