From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 20152 invoked by alias); 16 Sep 2011 02:01:18 -0000 Received: (qmail 20111 invoked by uid 22791); 16 Sep 2011 02:01:15 -0000 X-SWARE-Spam-Status: No, hits=-1.5 required=5.0 tests=AWL,BAYES_00,MSGID_MULTIPLE_AT,SPF_SOFTFAIL,TW_EB,TW_OV,TW_VZ,TW_ZB X-Spam-Check-By: sourceware.org Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.43rc1) with SMTP; Fri, 16 Sep 2011 02:00:59 +0000 Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Fri, 16 Sep 2011 03:00:54 +0100 Received: from jiangningsh02 ([10.1.255.212]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Fri, 16 Sep 2011 03:00:49 +0100 From: "Jiangning Liu" To: "'Richard Guenther'" Cc: References: <4e3762ed.030d440a.1143.2af2SMTPIN_ADDED@mx.google.com> In-Reply-To: Subject: RE: A case that PRE optimization hurts performance Date: Fri, 16 Sep 2011 02:01:00 -0000 Message-ID: <000001cc7414$7229c610$567d5230$@liu@arm.com> MIME-Version: 1.0 X-MC-Unique: 111091603005401101 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-owner@gcc.gnu.org X-SW-Source: 2011-09/txt/msg00175.txt.bz2 Hi Richard, I slightly changed the case to be like below, int f(char *t) { int s=3D0; while (*t && s !=3D 1) { switch (s) { case 0: /* path 1 */ s =3D 2; break; case 2: /* path 2 */ s =3D 3; /* changed */ break; default: /* path 3 */ if (*t =3D=3D '-')=20 s =3D 2; break; } t++; } return s; } "-O2" is still worse than "-O2 -fno-tree-pre".=20 "-O2 -fno-tree-pre" result is=20 f: pushl %ebp xorl %eax, %eax movl %esp, %ebp movl 8(%ebp), %edx movzbl (%edx), %ecx jmp .L14 .p2align 4,,7 .p2align 3 .L5: movl $2, %eax .L7: addl $1, %edx cmpl $1, %eax movzbl (%edx), %ecx je .L3 .L14: testb %cl, %cl je .L3 testl %eax, %eax je .L5 cmpl $2, %eax .p2align 4,,5 je .L17 cmpb $45, %cl .p2align 4,,5 je .L5 addl $1, %edx cmpl $1, %eax movzbl (%edx), %ecx jne .L14 .p2align 4,,7 .p2align 3 .L3: popl %ebp .p2align 4,,2 ret .p2align 4,,7 .p2align 3 .L17: movb $3, %al .p2align 4,,3 jmp .L7 While "-O2" result is=20 f: pushl %ebp xorl %eax, %eax movl %esp, %ebp movl 8(%ebp), %edx pushl %ebx movzbl (%edx), %ecx jmp .L14 .p2align 4,,7 .p2align 3 .L5: movl $1, %ebx movl $2, %eax .L7: addl $1, %edx testb %bl, %bl movzbl (%edx), %ecx je .L3 .L14: testb %cl, %cl je .L3 testl %eax, %eax je .L5 cmpl $2, %eax .p2align 4,,5 je .L16 cmpb $45, %cl .p2align 4,,5 je .L5 cmpl $1, %eax setne %bl addl $1, %edx testb %bl, %bl movzbl (%edx), %ecx jne .L14 .p2align 4,,7 .p2align 3 .L3: popl %ebx popl %ebp ret .p2align 4,,7 .p2align 3 .L16: movl $1, %ebx movb $3, %al jmp .L7 You may notice that register ebx is introduced, and some more instructions around ebx are generated as well. i.e. setne %bl testb %bl, %bl I agree with you that in theory PRE does the right thing to minimize the computation cost on gimple level. However, the problem is the cost of converting comparison result to a bool value is not considered, so it actually makes binary code worse. For this case, as I summarized below, to complete the same functionality "With PRE" is worse than "Without PRE" for all three paths, * Without PRE, Path1: movl $2, %eax cmpl $1, %eax je .L3 Path2: movb $3, %al cmpl $1, %eax je .L3 Path3: cmpl $1, %eax jne .L14 * With PRE, Path1: movl $1, %ebx movl $2, %eax testb %bl, %bl je .L3 Path2: movl $1, %ebx movb $3, %al testb %bl, %bl je .L3 Path3: cmpl $1, %eax setne %bl testb %bl, %bl jne .L14 Do you have any more thoughts? Thanks, -Jiangning > -----Original Message----- > From: Richard Guenther [mailto:richard.guenther@gmail.com] > Sent: Tuesday, August 02, 2011 5:23 PM > To: Jiangning Liu > Cc: gcc@gcc.gnu.org > Subject: Re: A case that PRE optimization hurts performance >=20 > On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu > wrote: > > Hi, > > > > For the following simple test case, PRE optimization hoists > computation > > (s!=3D1) into the default branch of the switch statement, and finally > causes > > very poor code generation. This problem occurs in both X86 and ARM, > and I > > believe it is also a problem for other targets. > > > > int f(char *t) { > > =A0 =A0int s=3D0; > > > > =A0 =A0while (*t && s !=3D 1) { > > =A0 =A0 =A0 =A0switch (s) { > > =A0 =A0 =A0 =A0case 0: > > =A0 =A0 =A0 =A0 =A0 =A0s =3D 2; > > =A0 =A0 =A0 =A0 =A0 =A0break; > > =A0 =A0 =A0 =A0case 2: > > =A0 =A0 =A0 =A0 =A0 =A0s =3D 1; > > =A0 =A0 =A0 =A0 =A0 =A0break; > > =A0 =A0 =A0 =A0default: > > =A0 =A0 =A0 =A0 =A0 =A0if (*t =3D=3D '-') > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0s =3D 1; > > =A0 =A0 =A0 =A0 =A0 =A0break; > > =A0 =A0 =A0 =A0} > > =A0 =A0 =A0 =A0t++; > > =A0 =A0} > > > > =A0 =A0return s; > > } > > > > Taking X86 as an example, with option "-O2" you may find 52 > instructions > > generated like below, > > > > 00000000 : > > =A0 0: =A0 55 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %ebp > > =A0 1: =A0 31 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 xor =A0 =A0%eax,%e= ax > > =A0 3: =A0 89 e5 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%esp,%e= bp > > =A0 5: =A0 57 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %edi > > =A0 6: =A0 56 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %esi > > =A0 7: =A0 53 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %ebx > > =A0 8: =A0 8b 55 08 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mov =A0 =A00x8(%ebp)= ,%edx > > =A0 b: =A0 0f b6 0a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0movzbl (%edx),%ecx > > =A0 e: =A0 84 c9 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %cl,%cl > > =A010: =A0 74 50 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 62 > > =A012: =A0 83 c2 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0add =A0 =A0$0x1,%edx > > =A015: =A0 85 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %eax,%eax > > =A017: =A0 75 23 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 jne =A0 =A03c > > =A019: =A0 8d b4 26 00 00 00 00 =A0 =A0lea =A0 =A00x0(%esi,%eiz,1),%esi > > =A020: =A0 0f b6 0a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0movzbl (%edx),%ecx > > =A023: =A0 84 c9 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %cl,%cl > > =A025: =A0 0f 95 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0setne =A0%al > > =A028: =A0 89 c7 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%eax,%e= di > > =A02a: =A0 b8 02 00 00 00 =A0 =A0 =A0 =A0 =A0mov =A0 =A0$0x2,%eax > > =A02f: =A0 89 fb =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%edi,%e= bx > > =A031: =A0 83 c2 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0add =A0 =A0$0x1,%edx > > =A034: =A0 84 db =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %bl,%bl > > =A036: =A0 74 2a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 62 > > =A038: =A0 85 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %eax,%eax > > =A03a: =A0 74 e4 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 20 > > =A03c: =A0 83 f8 02 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmp =A0 =A0$0x2,%eax > > =A03f: =A0 74 1f =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 60 > > =A041: =A0 80 f9 2d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmp =A0 =A0$0x2d,%cl > > =A044: =A0 74 22 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 68 > > =A046: =A0 0f b6 0a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0movzbl (%edx),%ecx > > =A049: =A0 83 f8 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmp =A0 =A0$0x1,%eax > > =A04c: =A0 0f 95 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0setne =A0%bl > > =A04f: =A0 89 df =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%ebx,%e= di > > =A051: =A0 84 c9 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %cl,%cl > > =A053: =A0 0f 95 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0setne =A0%bl > > =A056: =A0 89 de =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%ebx,%e= si > > =A058: =A0 21 f7 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 and =A0 =A0%esi,%e= di > > =A05a: =A0 eb d3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 jmp =A0 =A02f > > =A05c: =A0 8d 74 26 00 =A0 =A0 =A0 =A0 =A0 =A0 lea =A0 =A00x0(%esi,%eiz= ,1),%esi > > =A060: =A0 b0 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0$0x1,%al > > =A062: =A0 5b =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebx > > =A063: =A0 5e =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%esi > > =A064: =A0 5f =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%edi > > =A065: =A0 5d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebp > > =A066: =A0 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret > > =A067: =A0 90 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0nop > > =A068: =A0 b8 01 00 00 00 =A0 =A0 =A0 =A0 =A0mov =A0 =A0$0x1,%eax > > =A06d: =A0 5b =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebx > > =A06e: =A0 5e =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%esi > > =A06f: =A0 5f =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%edi > > =A070: =A0 5d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebp > > =A071: =A0 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret > > > > But with command line option "-O2 -fno-tree-pre", there are only 12 > > instructions generated, and the code would be very clean like below, > > > > 00000000 : > > =A0 0: =A0 55 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %ebp > > =A0 1: =A0 31 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 xor =A0 =A0%eax,%e= ax > > =A0 3: =A0 89 e5 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%esp,%e= bp > > =A0 5: =A0 8b 55 08 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mov =A0 =A00x8(%ebp)= ,%edx > > =A0 8: =A0 80 3a 00 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmpb =A0 $0x0,(%edx) > > =A0 b: =A0 74 0e =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 1b > > =A0 d: =A0 80 7a 01 00 =A0 =A0 =A0 =A0 =A0 =A0 cmpb =A0 $0x0,0x1(%edx) > > =A011: =A0 b0 02 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0$0x2,%al > > =A013: =A0 ba 01 00 00 00 =A0 =A0 =A0 =A0 =A0mov =A0 =A0$0x1,%edx > > =A018: =A0 0f 45 c2 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmovne %edx,%eax > > =A01b: =A0 5d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebp > > =A01c: =A0 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret > > > > Do you have any idea about this? >=20 > If you look at what PRE does it is clear that it's a profitable > transformation > as it reduces the number of instructions computed on the paths where > s is set to a constant. If you ask for size optimization you won't get > this transformation. >=20 > Now - on the tree level we don't see the if-conversion opportunity > (we don't have a conditional move instruction), but still we could > improve phiopt to handle switch statements - not sure what we > could do with the case in question though which is >=20 > : > switch (s_3) , case 0: , case 2: > >=20 > : > goto (); >=20 > : > if (D.2703_6 =3D=3D 45) > goto ; > else > goto (); >=20 > : >=20 > # s_2 =3D PHI <2(4), 1(3), 1(6), s_3(5)> > : >=20 > as the condition in at L3 complicates things. >=20 > The code is also really really special (and written in an awful > way ...). >=20 > Richard. >=20 > > Thanks, > > -Jiangning > > > > > > > >