From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-170227-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 20152 invoked by alias); 16 Sep 2011 02:01:18 -0000
Received: (qmail 20111 invoked by uid 22791); 16 Sep 2011 02:01:15 -0000
X-SWARE-Spam-Status: No, hits=-1.5 required=5.0	tests=AWL,BAYES_00,MSGID_MULTIPLE_AT,SPF_SOFTFAIL,TW_EB,TW_OV,TW_VZ,TW_ZB
X-Spam-Check-By: sourceware.org
Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44)    by sourceware.org (qpsmtpd/0.43rc1) with SMTP; Fri, 16 Sep 2011 02:00:59 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.21]) by service87.mimecast.com; Fri, 16 Sep 2011 03:00:54 +0100
Received: from jiangningsh02 ([10.1.255.212]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959);	 Fri, 16 Sep 2011 03:00:49 +0100
From: "Jiangning Liu" <jiangning.liu@arm.com>
To: "'Richard Guenther'" <richard.guenther@gmail.com>
Cc: <gcc@gcc.gnu.org>
References: <4e3762ed.030d440a.1143.2af2SMTPIN_ADDED@mx.google.com> <CAFiYyc0KyF4unL8hurQ77=rArSrxonFzz_MGKZDLZ=rSYZ__DQ@mail.gmail.com>
In-Reply-To: <CAFiYyc0KyF4unL8hurQ77=rArSrxonFzz_MGKZDLZ=rSYZ__DQ@mail.gmail.com>
Subject: RE: A case that PRE optimization hurts performance
Date: Fri, 16 Sep 2011 02:01:00 -0000
Message-ID: <000001cc7414$7229c610$567d5230$@liu@arm.com>
MIME-Version: 1.0
X-MC-Unique: 111091603005401101
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
X-SW-Source: 2011-09/txt/msg00175.txt.bz2

Hi Richard,

I slightly changed the case to be like below,

int f(char *t) {
    int s=3D0;

    while (*t && s !=3D 1) {
        switch (s) {
        case 0:   /* path 1 */
            s =3D 2;
            break;
        case 2:   /* path 2 */
            s =3D 3; /* changed */
            break;
        default:  /* path 3 */
            if (*t =3D=3D '-')=20
                s =3D 2;
            break;
        }
        t++;
    }

    return s;
}

"-O2" is still worse than "-O2 -fno-tree-pre".=20

"-O2 -fno-tree-pre" result is=20

f:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	movl	8(%ebp), %edx
	movzbl	(%edx), %ecx
	jmp	.L14
	.p2align 4,,7
	.p2align 3
.L5:
	movl	$2, %eax
.L7:
	addl	$1, %edx
	cmpl	$1, %eax
	movzbl	(%edx), %ecx
	je	.L3
.L14:
	testb	%cl, %cl
	je	.L3
	testl	%eax, %eax
	je	.L5
	cmpl	$2, %eax
	.p2align 4,,5
	je	.L17
	cmpb	$45, %cl
	.p2align 4,,5
	je	.L5
	addl	$1, %edx
	cmpl	$1, %eax
	movzbl	(%edx), %ecx
	jne	.L14
	.p2align 4,,7
	.p2align 3
.L3:
	popl	%ebp
	.p2align 4,,2
	ret
	.p2align 4,,7
	.p2align 3
.L17:
	movb	$3, %al
	.p2align 4,,3
	jmp	.L7

While "-O2" result is=20

f:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	movl	8(%ebp), %edx
	pushl	%ebx
	movzbl	(%edx), %ecx
	jmp	.L14
	.p2align 4,,7
	.p2align 3
.L5:
	movl	$1, %ebx
	movl	$2, %eax
.L7:
	addl	$1, %edx
	testb	%bl, %bl
	movzbl	(%edx), %ecx
	je	.L3
.L14:
	testb	%cl, %cl
	je	.L3
	testl	%eax, %eax
	je	.L5
	cmpl	$2, %eax
	.p2align 4,,5
	je	.L16
	cmpb	$45, %cl
	.p2align 4,,5
	je	.L5
	cmpl	$1, %eax
	setne	%bl
	addl	$1, %edx
	testb	%bl, %bl
	movzbl	(%edx), %ecx
	jne	.L14
	.p2align 4,,7
	.p2align 3
.L3:
	popl	%ebx
	popl	%ebp
	ret
	.p2align 4,,7
	.p2align 3
.L16:
	movl	$1, %ebx
	movb	$3, %al
	jmp	.L7

You may notice that register ebx is introduced, and some more instructions
around ebx are generated as well. i.e.

	setne	%bl
	testb	%bl, %bl

I agree with you that in theory PRE does the right thing to minimize the
computation cost on gimple level. However, the problem is the cost of
converting comparison result to a bool value is not considered, so it
actually makes binary code worse. For this case, as I summarized below, to
complete the same functionality "With PRE" is worse than "Without PRE" for
all three paths,

* Without PRE,

Path1:
	movl	$2, %eax
	cmpl	$1, %eax
	je	.L3

Path2:
	movb	$3, %al
	cmpl	$1, %eax
	je	.L3

Path3:
	cmpl	$1, %eax
	jne	.L14

* With PRE,

Path1:
	movl	$1, %ebx
	movl	$2, %eax
	testb	%bl, %bl
	je	.L3

Path2:
	movl	$1, %ebx
	movb	$3, %al
	testb	%bl, %bl
	je	.L3

Path3:
	cmpl	$1, %eax
	setne	%bl
	testb	%bl, %bl
	jne	.L14

Do you have any more thoughts?

Thanks,
-Jiangning

> -----Original Message-----
> From: Richard Guenther [mailto:richard.guenther@gmail.com]
> Sent: Tuesday, August 02, 2011 5:23 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A case that PRE optimization hurts performance
>=20
> On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <jiangning.liu@arm.com>
> wrote:
> > Hi,
> >
> > For the following simple test case, PRE optimization hoists
> computation
> > (s!=3D1) into the default branch of the switch statement, and finally
> causes
> > very poor code generation. This problem occurs in both X86 and ARM,
> and I
> > believe it is also a problem for other targets.
> >
> > int f(char *t) {
> > =A0 =A0int s=3D0;
> >
> > =A0 =A0while (*t && s !=3D 1) {
> > =A0 =A0 =A0 =A0switch (s) {
> > =A0 =A0 =A0 =A0case 0:
> > =A0 =A0 =A0 =A0 =A0 =A0s =3D 2;
> > =A0 =A0 =A0 =A0 =A0 =A0break;
> > =A0 =A0 =A0 =A0case 2:
> > =A0 =A0 =A0 =A0 =A0 =A0s =3D 1;
> > =A0 =A0 =A0 =A0 =A0 =A0break;
> > =A0 =A0 =A0 =A0default:
> > =A0 =A0 =A0 =A0 =A0 =A0if (*t =3D=3D '-')
> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0s =3D 1;
> > =A0 =A0 =A0 =A0 =A0 =A0break;
> > =A0 =A0 =A0 =A0}
> > =A0 =A0 =A0 =A0t++;
> > =A0 =A0}
> >
> > =A0 =A0return s;
> > }
> >
> > Taking X86 as an example, with option "-O2" you may find 52
> instructions
> > generated like below,
> >
> > 00000000 <f>:
> > =A0 0: =A0 55 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %ebp
> > =A0 1: =A0 31 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 xor =A0 =A0%eax,%e=
ax
> > =A0 3: =A0 89 e5 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%esp,%e=
bp
> > =A0 5: =A0 57 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %edi
> > =A0 6: =A0 56 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %esi
> > =A0 7: =A0 53 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %ebx
> > =A0 8: =A0 8b 55 08 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mov =A0 =A00x8(%ebp)=
,%edx
> > =A0 b: =A0 0f b6 0a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0movzbl (%edx),%ecx
> > =A0 e: =A0 84 c9 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %cl,%cl
> > =A010: =A0 74 50 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 62 <f+0=
x62>
> > =A012: =A0 83 c2 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0add =A0 =A0$0x1,%edx
> > =A015: =A0 85 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %eax,%eax
> > =A017: =A0 75 23 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 jne =A0 =A03c <f+0=
x3c>
> > =A019: =A0 8d b4 26 00 00 00 00 =A0 =A0lea =A0 =A00x0(%esi,%eiz,1),%esi
> > =A020: =A0 0f b6 0a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0movzbl (%edx),%ecx
> > =A023: =A0 84 c9 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %cl,%cl
> > =A025: =A0 0f 95 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0setne =A0%al
> > =A028: =A0 89 c7 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%eax,%e=
di
> > =A02a: =A0 b8 02 00 00 00 =A0 =A0 =A0 =A0 =A0mov =A0 =A0$0x2,%eax
> > =A02f: =A0 89 fb =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%edi,%e=
bx
> > =A031: =A0 83 c2 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0add =A0 =A0$0x1,%edx
> > =A034: =A0 84 db =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %bl,%bl
> > =A036: =A0 74 2a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 62 <f+0=
x62>
> > =A038: =A0 85 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %eax,%eax
> > =A03a: =A0 74 e4 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 20 <f+0=
x20>
> > =A03c: =A0 83 f8 02 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmp =A0 =A0$0x2,%eax
> > =A03f: =A0 74 1f =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 60 <f+0=
x60>
> > =A041: =A0 80 f9 2d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmp =A0 =A0$0x2d,%cl
> > =A044: =A0 74 22 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 68 <f+0=
x68>
> > =A046: =A0 0f b6 0a =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0movzbl (%edx),%ecx
> > =A049: =A0 83 f8 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmp =A0 =A0$0x1,%eax
> > =A04c: =A0 0f 95 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0setne =A0%bl
> > =A04f: =A0 89 df =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%ebx,%e=
di
> > =A051: =A0 84 c9 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 test =A0 %cl,%cl
> > =A053: =A0 0f 95 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0setne =A0%bl
> > =A056: =A0 89 de =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%ebx,%e=
si
> > =A058: =A0 21 f7 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 and =A0 =A0%esi,%e=
di
> > =A05a: =A0 eb d3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 jmp =A0 =A02f <f+0=
x2f>
> > =A05c: =A0 8d 74 26 00 =A0 =A0 =A0 =A0 =A0 =A0 lea =A0 =A00x0(%esi,%eiz=
,1),%esi
> > =A060: =A0 b0 01 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0$0x1,%al
> > =A062: =A0 5b =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebx
> > =A063: =A0 5e =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%esi
> > =A064: =A0 5f =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%edi
> > =A065: =A0 5d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebp
> > =A066: =A0 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret
> > =A067: =A0 90 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0nop
> > =A068: =A0 b8 01 00 00 00 =A0 =A0 =A0 =A0 =A0mov =A0 =A0$0x1,%eax
> > =A06d: =A0 5b =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebx
> > =A06e: =A0 5e =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%esi
> > =A06f: =A0 5f =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%edi
> > =A070: =A0 5d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebp
> > =A071: =A0 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret
> >
> > But with command line option "-O2 -fno-tree-pre", there are only 12
> > instructions generated, and the code would be very clean like below,
> >
> > 00000000 <f>:
> > =A0 0: =A0 55 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0push =A0 %ebp
> > =A0 1: =A0 31 c0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 xor =A0 =A0%eax,%e=
ax
> > =A0 3: =A0 89 e5 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0%esp,%e=
bp
> > =A0 5: =A0 8b 55 08 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0mov =A0 =A00x8(%ebp)=
,%edx
> > =A0 8: =A0 80 3a 00 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmpb =A0 $0x0,(%edx)
> > =A0 b: =A0 74 0e =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 je =A0 =A0 1b <f+0=
x1b>
> > =A0 d: =A0 80 7a 01 00 =A0 =A0 =A0 =A0 =A0 =A0 cmpb =A0 $0x0,0x1(%edx)
> > =A011: =A0 b0 02 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 mov =A0 =A0$0x2,%al
> > =A013: =A0 ba 01 00 00 00 =A0 =A0 =A0 =A0 =A0mov =A0 =A0$0x1,%edx
> > =A018: =A0 0f 45 c2 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0cmovne %edx,%eax
> > =A01b: =A0 5d =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0pop =A0 =A0%ebp
> > =A01c: =A0 c3 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret
> >
> > Do you have any idea about this?
>=20
> If you look at what PRE does it is clear that it's a profitable
> transformation
> as it reduces the number of instructions computed on the paths where
> s is set to a constant.  If you ask for size optimization you won't get
> this transformation.
>=20
> Now - on the tree level we don't see the if-conversion opportunity
> (we don't have a conditional move instruction), but still we could
> improve phiopt to handle switch statements - not sure what we
> could do with the case in question though which is
>=20
> <bb 3>:
>   switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>>
>=20
> <L11>:
>   goto <bb 7> (<L10>);
>=20
> <L3>:
>   if (D.2703_6 =3D=3D 45)
>     goto <bb 6>;
>   else
>     goto <bb 7> (<L10>);
>=20
> <bb 6>:
>=20
>   # s_2 =3D PHI <2(4), 1(3), 1(6), s_3(5)>
> <L10>:
>=20
> as the condition in at L3 complicates things.
>=20
> The code is also really really special (and written in an awful
> way ...).
>=20
> Richard.
>=20
> > Thanks,
> > -Jiangning
> >
> >
> >
> >