From: "Jiangning Liu" <jiangning.liu@arm.com>
To: "'Richard Guenther'" <richard.guenther@gmail.com>
Cc: <gcc@gcc.gnu.org>
Subject: RE: A case that PRE optimization hurts performance
Date: Fri, 16 Sep 2011 02:01:00 -0000 [thread overview]
Message-ID: <000001cc7414$7229c610$567d5230$@liu@arm.com> (raw)
In-Reply-To: <CAFiYyc0KyF4unL8hurQ77=rArSrxonFzz_MGKZDLZ=rSYZ__DQ@mail.gmail.com>
Hi Richard,
I slightly changed the case to be like below,
int f(char *t) {
int s=0;
while (*t && s != 1) {
switch (s) {
case 0: /* path 1 */
s = 2;
break;
case 2: /* path 2 */
s = 3; /* changed */
break;
default: /* path 3 */
if (*t == '-')
s = 2;
break;
}
t++;
}
return s;
}
"-O2" is still worse than "-O2 -fno-tree-pre".
"-O2 -fno-tree-pre" result is
f:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
movl 8(%ebp), %edx
movzbl (%edx), %ecx
jmp .L14
.p2align 4,,7
.p2align 3
.L5:
movl $2, %eax
.L7:
addl $1, %edx
cmpl $1, %eax
movzbl (%edx), %ecx
je .L3
.L14:
testb %cl, %cl
je .L3
testl %eax, %eax
je .L5
cmpl $2, %eax
.p2align 4,,5
je .L17
cmpb $45, %cl
.p2align 4,,5
je .L5
addl $1, %edx
cmpl $1, %eax
movzbl (%edx), %ecx
jne .L14
.p2align 4,,7
.p2align 3
.L3:
popl %ebp
.p2align 4,,2
ret
.p2align 4,,7
.p2align 3
.L17:
movb $3, %al
.p2align 4,,3
jmp .L7
While "-O2" result is
f:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
movl 8(%ebp), %edx
pushl %ebx
movzbl (%edx), %ecx
jmp .L14
.p2align 4,,7
.p2align 3
.L5:
movl $1, %ebx
movl $2, %eax
.L7:
addl $1, %edx
testb %bl, %bl
movzbl (%edx), %ecx
je .L3
.L14:
testb %cl, %cl
je .L3
testl %eax, %eax
je .L5
cmpl $2, %eax
.p2align 4,,5
je .L16
cmpb $45, %cl
.p2align 4,,5
je .L5
cmpl $1, %eax
setne %bl
addl $1, %edx
testb %bl, %bl
movzbl (%edx), %ecx
jne .L14
.p2align 4,,7
.p2align 3
.L3:
popl %ebx
popl %ebp
ret
.p2align 4,,7
.p2align 3
.L16:
movl $1, %ebx
movb $3, %al
jmp .L7
You may notice that register ebx is introduced, and some more instructions
around ebx are generated as well. i.e.
setne %bl
testb %bl, %bl
I agree with you that in theory PRE does the right thing to minimize the
computation cost on gimple level. However, the problem is the cost of
converting comparison result to a bool value is not considered, so it
actually makes binary code worse. For this case, as I summarized below, to
complete the same functionality "With PRE" is worse than "Without PRE" for
all three paths,
* Without PRE,
Path1:
movl $2, %eax
cmpl $1, %eax
je .L3
Path2:
movb $3, %al
cmpl $1, %eax
je .L3
Path3:
cmpl $1, %eax
jne .L14
* With PRE,
Path1:
movl $1, %ebx
movl $2, %eax
testb %bl, %bl
je .L3
Path2:
movl $1, %ebx
movb $3, %al
testb %bl, %bl
je .L3
Path3:
cmpl $1, %eax
setne %bl
testb %bl, %bl
jne .L14
Do you have any more thoughts?
Thanks,
-Jiangning
> -----Original Message-----
> From: Richard Guenther [mailto:richard.guenther@gmail.com]
> Sent: Tuesday, August 02, 2011 5:23 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A case that PRE optimization hurts performance
>
> On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <jiangning.liu@arm.com>
> wrote:
> > Hi,
> >
> > For the following simple test case, PRE optimization hoists
> computation
> > (s!=1) into the default branch of the switch statement, and finally
> causes
> > very poor code generation. This problem occurs in both X86 and ARM,
> and I
> > believe it is also a problem for other targets.
> >
> > int f(char *t) {
> > int s=0;
> >
> > while (*t && s != 1) {
> > switch (s) {
> > case 0:
> > s = 2;
> > break;
> > case 2:
> > s = 1;
> > break;
> > default:
> > if (*t == '-')
> > s = 1;
> > break;
> > }
> > t++;
> > }
> >
> > return s;
> > }
> >
> > Taking X86 as an example, with option "-O2" you may find 52
> instructions
> > generated like below,
> >
> > 00000000 <f>:
> > 0: 55 push %ebp
> > 1: 31 c0 xor %eax,%eax
> > 3: 89 e5 mov %esp,%ebp
> > 5: 57 push %edi
> > 6: 56 push %esi
> > 7: 53 push %ebx
> > 8: 8b 55 08 mov 0x8(%ebp),%edx
> > b: 0f b6 0a movzbl (%edx),%ecx
> > e: 84 c9 test %cl,%cl
> > 10: 74 50 je 62 <f+0x62>
> > 12: 83 c2 01 add $0x1,%edx
> > 15: 85 c0 test %eax,%eax
> > 17: 75 23 jne 3c <f+0x3c>
> > 19: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
> > 20: 0f b6 0a movzbl (%edx),%ecx
> > 23: 84 c9 test %cl,%cl
> > 25: 0f 95 c0 setne %al
> > 28: 89 c7 mov %eax,%edi
> > 2a: b8 02 00 00 00 mov $0x2,%eax
> > 2f: 89 fb mov %edi,%ebx
> > 31: 83 c2 01 add $0x1,%edx
> > 34: 84 db test %bl,%bl
> > 36: 74 2a je 62 <f+0x62>
> > 38: 85 c0 test %eax,%eax
> > 3a: 74 e4 je 20 <f+0x20>
> > 3c: 83 f8 02 cmp $0x2,%eax
> > 3f: 74 1f je 60 <f+0x60>
> > 41: 80 f9 2d cmp $0x2d,%cl
> > 44: 74 22 je 68 <f+0x68>
> > 46: 0f b6 0a movzbl (%edx),%ecx
> > 49: 83 f8 01 cmp $0x1,%eax
> > 4c: 0f 95 c3 setne %bl
> > 4f: 89 df mov %ebx,%edi
> > 51: 84 c9 test %cl,%cl
> > 53: 0f 95 c3 setne %bl
> > 56: 89 de mov %ebx,%esi
> > 58: 21 f7 and %esi,%edi
> > 5a: eb d3 jmp 2f <f+0x2f>
> > 5c: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
> > 60: b0 01 mov $0x1,%al
> > 62: 5b pop %ebx
> > 63: 5e pop %esi
> > 64: 5f pop %edi
> > 65: 5d pop %ebp
> > 66: c3 ret
> > 67: 90 nop
> > 68: b8 01 00 00 00 mov $0x1,%eax
> > 6d: 5b pop %ebx
> > 6e: 5e pop %esi
> > 6f: 5f pop %edi
> > 70: 5d pop %ebp
> > 71: c3 ret
> >
> > But with command line option "-O2 -fno-tree-pre", there are only 12
> > instructions generated, and the code would be very clean like below,
> >
> > 00000000 <f>:
> > 0: 55 push %ebp
> > 1: 31 c0 xor %eax,%eax
> > 3: 89 e5 mov %esp,%ebp
> > 5: 8b 55 08 mov 0x8(%ebp),%edx
> > 8: 80 3a 00 cmpb $0x0,(%edx)
> > b: 74 0e je 1b <f+0x1b>
> > d: 80 7a 01 00 cmpb $0x0,0x1(%edx)
> > 11: b0 02 mov $0x2,%al
> > 13: ba 01 00 00 00 mov $0x1,%edx
> > 18: 0f 45 c2 cmovne %edx,%eax
> > 1b: 5d pop %ebp
> > 1c: c3 ret
> >
> > Do you have any idea about this?
>
> If you look at what PRE does it is clear that it's a profitable
> transformation
> as it reduces the number of instructions computed on the paths where
> s is set to a constant. If you ask for size optimization you won't get
> this transformation.
>
> Now - on the tree level we don't see the if-conversion opportunity
> (we don't have a conditional move instruction), but still we could
> improve phiopt to handle switch statements - not sure what we
> could do with the case in question though which is
>
> <bb 3>:
> switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>>
>
> <L11>:
> goto <bb 7> (<L10>);
>
> <L3>:
> if (D.2703_6 == 45)
> goto <bb 6>;
> else
> goto <bb 7> (<L10>);
>
> <bb 6>:
>
> # s_2 = PHI <2(4), 1(3), 1(6), s_3(5)>
> <L10>:
>
> as the condition in at L3 complicates things.
>
> The code is also really really special (and written in an awful
> way ...).
>
> Richard.
>
> > Thanks,
> > -Jiangning
> >
> >
> >
> >
next prev parent reply other threads:[~2011-09-16 2:01 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <4e3762ed.030d440a.1143.2af2SMTPIN_ADDED@mx.google.com>
2011-08-02 9:23 ` Richard Guenther
2011-09-16 2:01 ` Jiangning Liu [this message]
[not found] ` <4e72add8.c9d0e30a.142a.ffffe60eSMTPIN_ADDED@mx.google.com>
2011-09-23 20:25 ` Richard Guenther
2011-09-26 8:34 ` Jiangning Liu
[not found] ` <4e802c3d.94ebd80a.4b1f.ffff8101SMTPIN_ADDED@mx.google.com>
2011-09-26 14:33 ` Richard Guenther
2011-09-27 7:30 ` Jeff Law
2011-09-27 11:15 ` Jiangning Liu
2011-09-27 13:44 ` Richard Guenther
2011-09-27 18:59 ` Jeff Law
2011-09-27 18:59 ` Richard Guenther
2011-09-27 21:56 ` Jeff Law
2011-08-02 2:37 Jiangning Liu
2011-08-15 8:40 ` Václav Zeman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='000001cc7414$7229c610$567d5230$@liu@arm.com' \
--to=jiangning.liu@arm.com \
--cc=gcc@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).