A case that PRE optimization hurts performance

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* A case that PRE optimization hurts performance
@ 2011-08-02  2:37 Jiangning Liu
  2011-08-15  8:40 ` Václav Zeman
  0 siblings, 1 reply; 13+ messages in thread
From: Jiangning Liu @ 2011-08-02  2:37 UTC (permalink / raw)
  To: gcc

Hi,

For the following simple test case, PRE optimization hoists computation
(s!=1) into the default branch of the switch statement, and finally causes
very poor code generation. This problem occurs in both X86 and ARM, and I
believe it is also a problem for other targets. 

int f(char *t) {
    int s=0;

    while (*t && s != 1) {
        switch (s) {
        case 0:
            s = 2;
            break;
        case 2:
            s = 1;
            break;
        default:
            if (*t == '-') 
                s = 1;
            break;
        }
        t++;
    }

    return s;
}

Taking X86 as an example, with option "-O2" you may find 52 instructions
generated like below,

00000000 <f>:
   0:	55                   	push   %ebp
   1:	31 c0                	xor    %eax,%eax
   3:	89 e5                	mov    %esp,%ebp
   5:	57                   	push   %edi
   6:	56                   	push   %esi
   7:	53                   	push   %ebx
   8:	8b 55 08             	mov    0x8(%ebp),%edx
   b:	0f b6 0a             	movzbl (%edx),%ecx
   e:	84 c9                	test   %cl,%cl
  10:	74 50                	je     62 <f+0x62>
  12:	83 c2 01             	add    $0x1,%edx
  15:	85 c0                	test   %eax,%eax
  17:	75 23                	jne    3c <f+0x3c>
  19:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi
  20:	0f b6 0a             	movzbl (%edx),%ecx
  23:	84 c9                	test   %cl,%cl
  25:	0f 95 c0             	setne  %al
  28:	89 c7                	mov    %eax,%edi
  2a:	b8 02 00 00 00       	mov    $0x2,%eax
  2f:	89 fb                	mov    %edi,%ebx
  31:	83 c2 01             	add    $0x1,%edx
  34:	84 db                	test   %bl,%bl
  36:	74 2a                	je     62 <f+0x62>
  38:	85 c0                	test   %eax,%eax
  3a:	74 e4                	je     20 <f+0x20>
  3c:	83 f8 02             	cmp    $0x2,%eax
  3f:	74 1f                	je     60 <f+0x60>
  41:	80 f9 2d             	cmp    $0x2d,%cl
  44:	74 22                	je     68 <f+0x68>
  46:	0f b6 0a             	movzbl (%edx),%ecx
  49:	83 f8 01             	cmp    $0x1,%eax
  4c:	0f 95 c3             	setne  %bl
  4f:	89 df                	mov    %ebx,%edi
  51:	84 c9                	test   %cl,%cl
  53:	0f 95 c3             	setne  %bl
  56:	89 de                	mov    %ebx,%esi
  58:	21 f7                	and    %esi,%edi
  5a:	eb d3                	jmp    2f <f+0x2f>
  5c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  60:	b0 01                	mov    $0x1,%al
  62:	5b                   	pop    %ebx
  63:	5e                   	pop    %esi
  64:	5f                   	pop    %edi
  65:	5d                   	pop    %ebp
  66:	c3                   	ret    
  67:	90                   	nop
  68:	b8 01 00 00 00       	mov    $0x1,%eax
  6d:	5b                   	pop    %ebx
  6e:	5e                   	pop    %esi
  6f:	5f                   	pop    %edi
  70:	5d                   	pop    %ebp
  71:	c3                   	ret    

But with command line option "-O2 -fno-tree-pre", there are only 12
instructions generated, and the code would be very clean like below,

00000000 <f>:
   0:	55                   	push   %ebp
   1:	31 c0                	xor    %eax,%eax
   3:	89 e5                	mov    %esp,%ebp
   5:	8b 55 08             	mov    0x8(%ebp),%edx
   8:	80 3a 00             	cmpb   $0x0,(%edx)
   b:	74 0e                	je     1b <f+0x1b>
   d:	80 7a 01 00          	cmpb   $0x0,0x1(%edx)
  11:	b0 02                	mov    $0x2,%al
  13:	ba 01 00 00 00       	mov    $0x1,%edx
  18:	0f 45 c2             	cmovne %edx,%eax
  1b:	5d                   	pop    %ebp
  1c:	c3                   	ret    

Do you have any idea about this?

Thanks,
-Jiangning



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
  2011-08-02  2:37 A case that PRE optimization hurts performance Jiangning Liu
@ 2011-08-15  8:40 ` Václav Zeman
  0 siblings, 0 replies; 13+ messages in thread
From: Václav Zeman @ 2011-08-15  8:40 UTC (permalink / raw)
  To: gcc

 On Tue, 2 Aug 2011 10:37:03 +0800, Jiangning Liu wrote:
> Hi,
>
> For the following simple test case, PRE optimization hoists 
> computation
> (s!=1) into the default branch of the switch statement, and finally 
> causes
> very poor code generation. This problem occurs in both X86 and ARM, 
> and I
> believe it is also a problem for other targets.
>
>[...]
>
> Do you have any idea about this?
 Fill a bug report to GCC Bugzilla <http://gcc.gnu.org/bugzilla/>

-- 
 VZ

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
  2011-09-27 18:59                 ` Richard Guenther
@ 2011-09-27 21:56                   ` Jeff Law
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Law @ 2011-09-27 21:56 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Jiangning Liu, gcc

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/27/11 09:00, Richard Guenther wrote:
>> The thing to remember is jump threading is tasked detecting
>> cases where the control statement has a predetermined destination
>> utilizing path sensitive information.   Expanding it to do
>> general path sensitive optimizations is possible, but at even
>> greater cost.
> 
> Yeah, I realize that.  It's just this case is the non-cfg form of
> 
> if (pretmp) if (D.1234) ...
> 
> where jump-threading would probably handle the CFG variant just
> fine.
Possibly.  I haven't looked at the RTL version in a long long time,
but it's still around cfgcleanup::thread_jump.   I've been meaning to
do some final experiments and declare that code as effectively dead
(yes, it still does stuff, but not much, at least not last time I looked).


WRT generalized path sensitive optimizations, the best work I've seen
in the space is from Bodik.  I've been wanting to play with some of
his ideas for path isolation to simplify our current jump threading
implementation.

In theory, if that experiment works out we could utilize that
infrastructure to implement other path sensitive analysis and
optimization.  Obviously, that's a ways out.

jeff
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJOgefZAAoJEBRtltQi2kC7JDEH+gOr0BWdN/Ru9oE7Fng1SJvX
adH9HnT0PhVhvbZQuF9Ag3HYYFOZumNgEI5HU2s3gPaFNU/KuUMcCBAD7V/1748P
L4soNu/dyHVZnKFtjx8MKEe9Jc+UMIvfdjewlsGHjzyBoXDIDaqzqQGmBLtM9MjK
gV/tKzdeYz0TB4QMKb27RIBtrZn+l6Ur6IC/PJBxB6qrimvxbet8qxgj+0RDf5XE
QSvCTS77QbfSw8k/5X1ag2lRKSZqX15w+p2b7S60XdvVd4pcIcyYKvyMDUcGy2hl
PEay76+XyFsO8bCcLg69i+UirBrfceE2CVzYYkMfsPxh2RgBViluTveeybblkLs=
=kkAA
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
  2011-09-27 13:44             ` Richard Guenther
@ 2011-09-27 18:59               ` Jeff Law
  2011-09-27 18:59                 ` Richard Guenther
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Law @ 2011-09-27 18:59 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Jiangning Liu, gcc

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/27/11 01:30, Richard Guenther wrote:

> 
>> It knows something about prephitmp.6_1 and thus could simplify 
>> D.2734_9 = prephitmp_6.1 & D.2732_7; to D.2734_9 = D.2732_7; But
>> admittedly I have no idea if DOM tries to simplify things other
>> than comparisons within the jump threading machinery ... (the
>> above block basically ends in if (D.2729_6 != 0 &&
>> prephitmp_6.1), so I'd guess it's worth to simplify the (complex)
>> conditional expression).
Jump threading will enter simplified expressions into the hash tables
for every statement in the block in an effort to utilize any
information available to determine the result of the comparison.
However, those simplifications aren't ever reflected back into the IL.

The thing to remember is jump threading is tasked detecting cases
where the control statement has a predetermined destination utilizing
path sensitive information.   Expanding it to do general path
sensitive optimizations is possible, but at even greater cost.

jeff
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJOgeQGAAoJEBRtltQi2kC7xX8H/1odgAgNVKdbWk4FtsNPx9xl
+xHtQdB2zycD0o0TMdQ+PcllXHaIK+ScIZEcxys1lN9HoEEyhyHPmQi7FZbD807y
oTswglsX7hnTM3fMCN62BTFwk6P0QNrbeZ4bsVxF5DkBmMLXbArQ7UEYD9mSHnDv
ixi06cUc/x5CXCAbcrAdUQI/9uF9d83myLs3rS3T0g2nPm+chGRN94Bm6ZbrB0hA
PjyKiZ4j3z6Yc5bs2GF19Rh7vfjD/9NhKF9t9YwuBky4luLv4RPFsT1JQM3+1yuT
tW5xk0et3eUnGFgiLUrF1mkHO/TkSv5iTsDI2tbbCkLCdH3w8Runj21/5qMWXxY=
=X10i
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
  2011-09-27 18:59               ` Jeff Law
@ 2011-09-27 18:59                 ` Richard Guenther
  2011-09-27 21:56                   ` Jeff Law
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Guenther @ 2011-09-27 18:59 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jiangning Liu, gcc

On Tue, Sep 27, 2011 at 4:56 PM, Jeff Law <law@redhat.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 09/27/11 01:30, Richard Guenther wrote:
>
>>
>>> It knows something about prephitmp.6_1 and thus could simplify
>>> D.2734_9 = prephitmp_6.1 & D.2732_7; to D.2734_9 = D.2732_7; But
>>> admittedly I have no idea if DOM tries to simplify things other
>>> than comparisons within the jump threading machinery ... (the
>>> above block basically ends in if (D.2729_6 != 0 &&
>>> prephitmp_6.1), so I'd guess it's worth to simplify the (complex)
>>> conditional expression).
> Jump threading will enter simplified expressions into the hash tables
> for every statement in the block in an effort to utilize any
> information available to determine the result of the comparison.
> However, those simplifications aren't ever reflected back into the IL.
>
> The thing to remember is jump threading is tasked detecting cases
> where the control statement has a predetermined destination utilizing
> path sensitive information.   Expanding it to do general path
> sensitive optimizations is possible, but at even greater cost.

Yeah, I realize that.  It's just this case is the non-cfg form of

if (pretmp)
  if (D.1234)
    ...

where jump-threading would probably handle the CFG variant just fine.

Oh well ;)

Richard.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
  2011-09-27  7:30           ` Jeff Law
  2011-09-27 11:15             ` Jiangning Liu
@ 2011-09-27 13:44             ` Richard Guenther
  2011-09-27 18:59               ` Jeff Law
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Guenther @ 2011-09-27 13:44 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jiangning Liu, gcc

On Mon, Sep 26, 2011 at 6:42 PM, Jeff Law <law@redhat.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 09/26/11 05:00, Richard Guenther wrote:
>> On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu
>> <jiangning.liu@arm.com> wrote:
>>>>> * Without PRE,
>>>>>
>>>>> Path1: movl    $2, %eax cmpl    $1, %eax je      .L3
>>>>>
>>>>> Path2: movb    $3, %al cmpl    $1, %eax je      .L3
>>>>>
>>>>> Path3: cmpl    $1, %eax jne     .L14
>>>>>
>>>>> * With PRE,
>>>>>
>>>>> Path1: movl    $1, %ebx movl    $2, %eax testb   %bl, %bl je
>>>>> .L3
>>>>>
>>>>> Path2: movl    $1, %ebx movb    $3, %al testb   %bl, %bl je
>>>>> .L3
>>>>>
>>>>> Path3: cmpl    $1, %eax setne   %bl testb   %bl, %bl jne
>>>>> .L14
>>>>>
>>>>> Do you have any more thoughts?
>>>>
>>>> It seems to me that with PRE all the testb %bl, %bl should be
>>>> evaluated at compile-time considering the preceeding movl $1,
>>>> %ebx.  Am I missing something?
>>>>
>>>
>>> Yes. Can this be done by PRE or any other optimizations in middle
>>> end?
>>
>> Hm, the paths as you quote them are obfuscated by missed
>> jump-threading. On the tree level we have
>>
>> # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)> # prephitmp.6_1 = PHI
>> <1(5), 1(4), 1(6), prephitmp.6_3(7)> <L10>: t_14 = t_24 + 1;
>> D.2729_6 = MEM[base: t_14, offset: 0B]; D.2732_7 = D.2729_6 != 0;
>> D.2734_9 = prephitmp.6_1 & D.2732_7; if (D.2734_9 != 0)
>>
>> where we could thread the cases with prephitmp.6_1 == 1,
>> ultimately removing the & and forwarding the D.2729_6 != 0 test.
>> Which would of course cause some code duplication.
>>
>> Jeff, you recently looked at tree jump-threading, can you see if
>> we can improve things on this particular testcase?
> There's nothing threading can do here because it doesn't know anything
> about the value MEM[t14].

It knows something about prephitmp.6_1 and thus could simplify
D.2734_9 = prephitmp_6.1 & D.2732_7; to D.2734_9 = D.2732_7;
But admittedly I have no idea if DOM tries to simplify things other than
comparisons within the jump threading machinery ... (the above
block basically ends in if (D.2729_6 != 0 && prephitmp_6.1), so I'd
guess it's worth to simplify the (complex) conditional expression).

Richard.

>
> Jeff
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iQEcBAEBAgAGBQJOgKuLAAoJEBRtltQi2kC75aIH/iikuOQXrMrQJFbQw0COXznB
> OGq8iXdGwTJGH13vxdItTE0upJp7RgUVLzuhdqj1elTLHv/ujYygMsQRNGKcc8tb
> GMLECmWDhZqQTFXcTJCgJNZiv7MH1PNELXSdIkkSnxY+pwyn9AX5D3+HcTSjGU6B
> 51AdUNVph/VSaVboAgcrFpu9S0pX9HVTqFy4JI83Lh613zDVSmPo14DDy7vjBvE9
> 2Srlvlw0srYup97bGmRqN8wT4ZLLlyYSB2rjEFc6jmgXVncxiteQYIUZpy0lcC0M
> q3j80aXjZ57/iWyAbqDr1jI5tbVKDBkRa9LL1jvn9534adiG4GrnSMPhoog0ibA=
> =azr5
> -----END PGP SIGNATURE-----
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A case that PRE optimization hurts performance
  2011-09-27  7:30           ` Jeff Law
@ 2011-09-27 11:15             ` Jiangning Liu
  2011-09-27 13:44             ` Richard Guenther
  1 sibling, 0 replies; 13+ messages in thread
From: Jiangning Liu @ 2011-09-27 11:15 UTC (permalink / raw)
  To: 'Jeff Law', Richard Guenther; +Cc: gcc



> -----Original Message-----
> From: Jeff Law [mailto:law@redhat.com]
> Sent: Tuesday, September 27, 2011 12:43 AM
> To: Richard Guenther
> Cc: Jiangning Liu; gcc@gcc.gnu.org
> Subject: Re: A case that PRE optimization hurts performance
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 09/26/11 05:00, Richard Guenther wrote:
> > On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu
> > <jiangning.liu@arm.com> wrote:
> >>>> * Without PRE,
> >>>>
> >>>> Path1: movl    $2, %eax cmpl    $1, %eax je      .L3
> >>>>
> >>>> Path2: movb    $3, %al cmpl    $1, %eax je      .L3
> >>>>
> >>>> Path3: cmpl    $1, %eax jne     .L14
> >>>>
> >>>> * With PRE,
> >>>>
> >>>> Path1: movl    $1, %ebx movl    $2, %eax testb   %bl, %bl je
> >>>> .L3
> >>>>
> >>>> Path2: movl    $1, %ebx movb    $3, %al testb   %bl, %bl je
> >>>> .L3
> >>>>
> >>>> Path3: cmpl    $1, %eax setne   %bl testb   %bl, %bl jne
> >>>> .L14
> >>>>
> >>>> Do you have any more thoughts?
> >>>
> >>> It seems to me that with PRE all the testb %bl, %bl should be
> >>> evaluated at compile-time considering the preceeding movl $1,
> >>> %ebx.  Am I missing something?
> >>>
> >>
> >> Yes. Can this be done by PRE or any other optimizations in middle
> >> end?
> >
> > Hm, the paths as you quote them are obfuscated by missed
> > jump-threading. On the tree level we have
> >
> > # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)> # prephitmp.6_1 = PHI
> > <1(5), 1(4), 1(6), prephitmp.6_3(7)> <L10>: t_14 = t_24 + 1;
> > D.2729_6 = MEM[base: t_14, offset: 0B]; D.2732_7 = D.2729_6 != 0;
> > D.2734_9 = prephitmp.6_1 & D.2732_7; if (D.2734_9 != 0)
> >
> > where we could thread the cases with prephitmp.6_1 == 1,
> > ultimately removing the & and forwarding the D.2729_6 != 0 test.
> > Which would of course cause some code duplication.
> >
> > Jeff, you recently looked at tree jump-threading, can you see if
> > we can improve things on this particular testcase?
> There's nothing threading can do here because it doesn't know anything
> about the value MEM[t14].
> 

Jeff, 

Could you please explain more about this? What information does jump
threading want to know on MEM[t14]? Do you mean it's hard to duplicate that
basic block due to some reasons?

Thanks,
-Jiangning

> 
> Jeff
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iQEcBAEBAgAGBQJOgKuLAAoJEBRtltQi2kC75aIH/iikuOQXrMrQJFbQw0COXznB
> OGq8iXdGwTJGH13vxdItTE0upJp7RgUVLzuhdqj1elTLHv/ujYygMsQRNGKcc8tb
> GMLECmWDhZqQTFXcTJCgJNZiv7MH1PNELXSdIkkSnxY+pwyn9AX5D3+HcTSjGU6B
> 51AdUNVph/VSaVboAgcrFpu9S0pX9HVTqFy4JI83Lh613zDVSmPo14DDy7vjBvE9
> 2Srlvlw0srYup97bGmRqN8wT4ZLLlyYSB2rjEFc6jmgXVncxiteQYIUZpy0lcC0M
> q3j80aXjZ57/iWyAbqDr1jI5tbVKDBkRa9LL1jvn9534adiG4GrnSMPhoog0ibA=
> =azr5
> -----END PGP SIGNATURE-----




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
  2011-09-26 14:33         ` Richard Guenther
@ 2011-09-27  7:30           ` Jeff Law
  2011-09-27 11:15             ` Jiangning Liu
  2011-09-27 13:44             ` Richard Guenther
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff Law @ 2011-09-27  7:30 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Jiangning Liu, gcc

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 09/26/11 05:00, Richard Guenther wrote:
> On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu
> <jiangning.liu@arm.com> wrote:
>>>> * Without PRE,
>>>> 
>>>> Path1: movl    $2, %eax cmpl    $1, %eax je      .L3
>>>> 
>>>> Path2: movb    $3, %al cmpl    $1, %eax je      .L3
>>>> 
>>>> Path3: cmpl    $1, %eax jne     .L14
>>>> 
>>>> * With PRE,
>>>> 
>>>> Path1: movl    $1, %ebx movl    $2, %eax testb   %bl, %bl je
>>>> .L3
>>>> 
>>>> Path2: movl    $1, %ebx movb    $3, %al testb   %bl, %bl je
>>>> .L3
>>>> 
>>>> Path3: cmpl    $1, %eax setne   %bl testb   %bl, %bl jne
>>>> .L14
>>>> 
>>>> Do you have any more thoughts?
>>> 
>>> It seems to me that with PRE all the testb %bl, %bl should be
>>> evaluated at compile-time considering the preceeding movl $1,
>>> %ebx.  Am I missing something?
>>> 
>> 
>> Yes. Can this be done by PRE or any other optimizations in middle
>> end?
> 
> Hm, the paths as you quote them are obfuscated by missed
> jump-threading. On the tree level we have
> 
> # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)> # prephitmp.6_1 = PHI
> <1(5), 1(4), 1(6), prephitmp.6_3(7)> <L10>: t_14 = t_24 + 1; 
> D.2729_6 = MEM[base: t_14, offset: 0B]; D.2732_7 = D.2729_6 != 0; 
> D.2734_9 = prephitmp.6_1 & D.2732_7; if (D.2734_9 != 0)
> 
> where we could thread the cases with prephitmp.6_1 == 1,
> ultimately removing the & and forwarding the D.2729_6 != 0 test.
> Which would of course cause some code duplication.
> 
> Jeff, you recently looked at tree jump-threading, can you see if
> we can improve things on this particular testcase?
There's nothing threading can do here because it doesn't know anything
about the value MEM[t14].


Jeff
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJOgKuLAAoJEBRtltQi2kC75aIH/iikuOQXrMrQJFbQw0COXznB
OGq8iXdGwTJGH13vxdItTE0upJp7RgUVLzuhdqj1elTLHv/ujYygMsQRNGKcc8tb
GMLECmWDhZqQTFXcTJCgJNZiv7MH1PNELXSdIkkSnxY+pwyn9AX5D3+HcTSjGU6B
51AdUNVph/VSaVboAgcrFpu9S0pX9HVTqFy4JI83Lh613zDVSmPo14DDy7vjBvE9
2Srlvlw0srYup97bGmRqN8wT4ZLLlyYSB2rjEFc6jmgXVncxiteQYIUZpy0lcC0M
q3j80aXjZ57/iWyAbqDr1jI5tbVKDBkRa9LL1jvn9534adiG4GrnSMPhoog0ibA=
=azr5
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
       [not found]       ` <4e802c3d.94ebd80a.4b1f.ffff8101SMTPIN_ADDED@mx.google.com>
@ 2011-09-26 14:33         ` Richard Guenther
  2011-09-27  7:30           ` Jeff Law
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Guenther @ 2011-09-26 14:33 UTC (permalink / raw)
  To: Jiangning Liu; +Cc: gcc, Jeff Law

On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu <jiangning.liu@arm.com> wrote:
>> > * Without PRE,
>> >
>> > Path1:
>> >        movl    $2, %eax
>> >        cmpl    $1, %eax
>> >        je      .L3
>> >
>> > Path2:
>> >        movb    $3, %al
>> >        cmpl    $1, %eax
>> >        je      .L3
>> >
>> > Path3:
>> >        cmpl    $1, %eax
>> >        jne     .L14
>> >
>> > * With PRE,
>> >
>> > Path1:
>> >        movl    $1, %ebx
>> >        movl    $2, %eax
>> >        testb   %bl, %bl
>> >        je      .L3
>> >
>> > Path2:
>> >        movl    $1, %ebx
>> >        movb    $3, %al
>> >        testb   %bl, %bl
>> >        je      .L3
>> >
>> > Path3:
>> >        cmpl    $1, %eax
>> >        setne   %bl
>> >        testb   %bl, %bl
>> >        jne     .L14
>> >
>> > Do you have any more thoughts?
>>
>> It seems to me that with PRE all the testb %bl, %bl
>> should be evaluated at compile-time considering the
>> preceeding movl $1, %ebx.  Am I missing something?
>>
>
> Yes. Can this be done by PRE or any other optimizations in middle end?

Hm, the paths as you quote them are obfuscated by missed jump-threading.
On the tree level we have

  # s_2 = PHI <2(5), 3(4), 2(6), s_25(7)>
  # prephitmp.6_1 = PHI <1(5), 1(4), 1(6), prephitmp.6_3(7)>
<L10>:
  t_14 = t_24 + 1;
  D.2729_6 = MEM[base: t_14, offset: 0B];
  D.2732_7 = D.2729_6 != 0;
  D.2734_9 = prephitmp.6_1 & D.2732_7;
  if (D.2734_9 != 0)

where we could thread the cases with prephitmp.6_1 == 1, ultimately
removing the & and forwarding the D.2729_6 != 0 test.  Which would
of course cause some code duplication.

Jeff, you recently looked at tree jump-threading, can you see if we
can improve things on this particular testcase?

Thanks,
Richard.

> Thanks,
> -Jiangning
>
>> Richard.
>>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A case that PRE optimization hurts performance
  2011-09-23 20:25     ` Richard Guenther
@ 2011-09-26  8:34       ` Jiangning Liu
       [not found]       ` <4e802c3d.94ebd80a.4b1f.ffff8101SMTPIN_ADDED@mx.google.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Jiangning Liu @ 2011-09-26  8:34 UTC (permalink / raw)
  To: 'Richard Guenther'; +Cc: gcc

> > * Without PRE,
> >
> > Path1:
> >        movl    $2, %eax
> >        cmpl    $1, %eax
> >        je      .L3
> >
> > Path2:
> >        movb    $3, %al
> >        cmpl    $1, %eax
> >        je      .L3
> >
> > Path3:
> >        cmpl    $1, %eax
> >        jne     .L14
> >
> > * With PRE,
> >
> > Path1:
> >        movl    $1, %ebx
> >        movl    $2, %eax
> >        testb   %bl, %bl
> >        je      .L3
> >
> > Path2:
> >        movl    $1, %ebx
> >        movb    $3, %al
> >        testb   %bl, %bl
> >        je      .L3
> >
> > Path3:
> >        cmpl    $1, %eax
> >        setne   %bl
> >        testb   %bl, %bl
> >        jne     .L14
> >
> > Do you have any more thoughts?
> 
> It seems to me that with PRE all the testb %bl, %bl
> should be evaluated at compile-time considering the
> preceeding movl $1, %ebx.  Am I missing something?
> 

Yes. Can this be done by PRE or any other optimizations in middle end?

Thanks,
-Jiangning

> Richard.
> 




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
       [not found]   ` <4e72add8.c9d0e30a.142a.ffffe60eSMTPIN_ADDED@mx.google.com>
@ 2011-09-23 20:25     ` Richard Guenther
  2011-09-26  8:34       ` Jiangning Liu
       [not found]       ` <4e802c3d.94ebd80a.4b1f.ffff8101SMTPIN_ADDED@mx.google.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Guenther @ 2011-09-23 20:25 UTC (permalink / raw)
  To: Jiangning Liu; +Cc: gcc

On Fri, Sep 16, 2011 at 4:00 AM, Jiangning Liu <jiangning.liu@arm.com> wrote:
> Hi Richard,
>
> I slightly changed the case to be like below,
>
> int f(char *t) {
>    int s=0;
>
>    while (*t && s != 1) {
>        switch (s) {
>        case 0:   /* path 1 */
>            s = 2;
>            break;
>        case 2:   /* path 2 */
>            s = 3; /* changed */
>            break;
>        default:  /* path 3 */
>            if (*t == '-')
>                s = 2;
>            break;
>        }
>        t++;
>    }
>
>    return s;
> }
>
> "-O2" is still worse than "-O2 -fno-tree-pre".
>
> "-O2 -fno-tree-pre" result is
>
> f:
>        pushl   %ebp
>        xorl    %eax, %eax
>        movl    %esp, %ebp
>        movl    8(%ebp), %edx
>        movzbl  (%edx), %ecx
>        jmp     .L14
>        .p2align 4,,7
>        .p2align 3
> .L5:
>        movl    $2, %eax
> .L7:
>        addl    $1, %edx
>        cmpl    $1, %eax
>        movzbl  (%edx), %ecx
>        je      .L3
> .L14:
>        testb   %cl, %cl
>        je      .L3
>        testl   %eax, %eax
>        je      .L5
>        cmpl    $2, %eax
>        .p2align 4,,5
>        je      .L17
>        cmpb    $45, %cl
>        .p2align 4,,5
>        je      .L5
>        addl    $1, %edx
>        cmpl    $1, %eax
>        movzbl  (%edx), %ecx
>        jne     .L14
>        .p2align 4,,7
>        .p2align 3
> .L3:
>        popl    %ebp
>        .p2align 4,,2
>        ret
>        .p2align 4,,7
>        .p2align 3
> .L17:
>        movb    $3, %al
>        .p2align 4,,3
>        jmp     .L7
>
> While "-O2" result is
>
> f:
>        pushl   %ebp
>        xorl    %eax, %eax
>        movl    %esp, %ebp
>        movl    8(%ebp), %edx
>        pushl   %ebx
>        movzbl  (%edx), %ecx
>        jmp     .L14
>        .p2align 4,,7
>        .p2align 3
> .L5:
>        movl    $1, %ebx
>        movl    $2, %eax
> .L7:
>        addl    $1, %edx
>        testb   %bl, %bl
>        movzbl  (%edx), %ecx
>        je      .L3
> .L14:
>        testb   %cl, %cl
>        je      .L3
>        testl   %eax, %eax
>        je      .L5
>        cmpl    $2, %eax
>        .p2align 4,,5
>        je      .L16
>        cmpb    $45, %cl
>        .p2align 4,,5
>        je      .L5
>        cmpl    $1, %eax
>        setne   %bl
>        addl    $1, %edx
>        testb   %bl, %bl
>        movzbl  (%edx), %ecx
>        jne     .L14
>        .p2align 4,,7
>        .p2align 3
> .L3:
>        popl    %ebx
>        popl    %ebp
>        ret
>        .p2align 4,,7
>        .p2align 3
> .L16:
>        movl    $1, %ebx
>        movb    $3, %al
>        jmp     .L7
>
> You may notice that register ebx is introduced, and some more instructions
> around ebx are generated as well. i.e.
>
>        setne   %bl
>        testb   %bl, %bl
>
> I agree with you that in theory PRE does the right thing to minimize the
> computation cost on gimple level. However, the problem is the cost of
> converting comparison result to a bool value is not considered, so it
> actually makes binary code worse. For this case, as I summarized below, to
> complete the same functionality "With PRE" is worse than "Without PRE" for
> all three paths,
>
> * Without PRE,
>
> Path1:
>        movl    $2, %eax
>        cmpl    $1, %eax
>        je      .L3
>
> Path2:
>        movb    $3, %al
>        cmpl    $1, %eax
>        je      .L3
>
> Path3:
>        cmpl    $1, %eax
>        jne     .L14
>
> * With PRE,
>
> Path1:
>        movl    $1, %ebx
>        movl    $2, %eax
>        testb   %bl, %bl
>        je      .L3
>
> Path2:
>        movl    $1, %ebx
>        movb    $3, %al
>        testb   %bl, %bl
>        je      .L3
>
> Path3:
>        cmpl    $1, %eax
>        setne   %bl
>        testb   %bl, %bl
>        jne     .L14
>
> Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl
should be evaluated at compile-time considering the
preceeding movl $1, %ebx.  Am I missing something?

Richard.

> Thanks,
> -Jiangning
>
>> -----Original Message-----
>> From: Richard Guenther [mailto:richard.guenther@gmail.com]
>> Sent: Tuesday, August 02, 2011 5:23 PM
>> To: Jiangning Liu
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: A case that PRE optimization hurts performance
>>
>> On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <jiangning.liu@arm.com>
>> wrote:
>> > Hi,
>> >
>> > For the following simple test case, PRE optimization hoists
>> computation
>> > (s!=1) into the default branch of the switch statement, and finally
>> causes
>> > very poor code generation. This problem occurs in both X86 and ARM,
>> and I
>> > believe it is also a problem for other targets.
>> >
>> > int f(char *t) {
>> >    int s=0;
>> >
>> >    while (*t && s != 1) {
>> >        switch (s) {
>> >        case 0:
>> >            s = 2;
>> >            break;
>> >        case 2:
>> >            s = 1;
>> >            break;
>> >        default:
>> >            if (*t == '-')
>> >                s = 1;
>> >            break;
>> >        }
>> >        t++;
>> >    }
>> >
>> >    return s;
>> > }
>> >
>> > Taking X86 as an example, with option "-O2" you may find 52
>> instructions
>> > generated like below,
>> >
>> > 00000000 <f>:
>> >   0:   55                      push   %ebp
>> >   1:   31 c0                   xor    %eax,%eax
>> >   3:   89 e5                   mov    %esp,%ebp
>> >   5:   57                      push   %edi
>> >   6:   56                      push   %esi
>> >   7:   53                      push   %ebx
>> >   8:   8b 55 08                mov    0x8(%ebp),%edx
>> >   b:   0f b6 0a                movzbl (%edx),%ecx
>> >   e:   84 c9                   test   %cl,%cl
>> >  10:   74 50                   je     62 <f+0x62>
>> >  12:   83 c2 01                add    $0x1,%edx
>> >  15:   85 c0                   test   %eax,%eax
>> >  17:   75 23                   jne    3c <f+0x3c>
>> >  19:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
>> >  20:   0f b6 0a                movzbl (%edx),%ecx
>> >  23:   84 c9                   test   %cl,%cl
>> >  25:   0f 95 c0                setne  %al
>> >  28:   89 c7                   mov    %eax,%edi
>> >  2a:   b8 02 00 00 00          mov    $0x2,%eax
>> >  2f:   89 fb                   mov    %edi,%ebx
>> >  31:   83 c2 01                add    $0x1,%edx
>> >  34:   84 db                   test   %bl,%bl
>> >  36:   74 2a                   je     62 <f+0x62>
>> >  38:   85 c0                   test   %eax,%eax
>> >  3a:   74 e4                   je     20 <f+0x20>
>> >  3c:   83 f8 02                cmp    $0x2,%eax
>> >  3f:   74 1f                   je     60 <f+0x60>
>> >  41:   80 f9 2d                cmp    $0x2d,%cl
>> >  44:   74 22                   je     68 <f+0x68>
>> >  46:   0f b6 0a                movzbl (%edx),%ecx
>> >  49:   83 f8 01                cmp    $0x1,%eax
>> >  4c:   0f 95 c3                setne  %bl
>> >  4f:   89 df                   mov    %ebx,%edi
>> >  51:   84 c9                   test   %cl,%cl
>> >  53:   0f 95 c3                setne  %bl
>> >  56:   89 de                   mov    %ebx,%esi
>> >  58:   21 f7                   and    %esi,%edi
>> >  5a:   eb d3                   jmp    2f <f+0x2f>
>> >  5c:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
>> >  60:   b0 01                   mov    $0x1,%al
>> >  62:   5b                      pop    %ebx
>> >  63:   5e                      pop    %esi
>> >  64:   5f                      pop    %edi
>> >  65:   5d                      pop    %ebp
>> >  66:   c3                      ret
>> >  67:   90                      nop
>> >  68:   b8 01 00 00 00          mov    $0x1,%eax
>> >  6d:   5b                      pop    %ebx
>> >  6e:   5e                      pop    %esi
>> >  6f:   5f                      pop    %edi
>> >  70:   5d                      pop    %ebp
>> >  71:   c3                      ret
>> >
>> > But with command line option "-O2 -fno-tree-pre", there are only 12
>> > instructions generated, and the code would be very clean like below,
>> >
>> > 00000000 <f>:
>> >   0:   55                      push   %ebp
>> >   1:   31 c0                   xor    %eax,%eax
>> >   3:   89 e5                   mov    %esp,%ebp
>> >   5:   8b 55 08                mov    0x8(%ebp),%edx
>> >   8:   80 3a 00                cmpb   $0x0,(%edx)
>> >   b:   74 0e                   je     1b <f+0x1b>
>> >   d:   80 7a 01 00             cmpb   $0x0,0x1(%edx)
>> >  11:   b0 02                   mov    $0x2,%al
>> >  13:   ba 01 00 00 00          mov    $0x1,%edx
>> >  18:   0f 45 c2                cmovne %edx,%eax
>> >  1b:   5d                      pop    %ebp
>> >  1c:   c3                      ret
>> >
>> > Do you have any idea about this?
>>
>> If you look at what PRE does it is clear that it's a profitable
>> transformation
>> as it reduces the number of instructions computed on the paths where
>> s is set to a constant.  If you ask for size optimization you won't get
>> this transformation.
>>
>> Now - on the tree level we don't see the if-conversion opportunity
>> (we don't have a conditional move instruction), but still we could
>> improve phiopt to handle switch statements - not sure what we
>> could do with the case in question though which is
>>
>> <bb 3>:
>>   switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>>
>>
>> <L11>:
>>   goto <bb 7> (<L10>);
>>
>> <L3>:
>>   if (D.2703_6 == 45)
>>     goto <bb 6>;
>>   else
>>     goto <bb 7> (<L10>);
>>
>> <bb 6>:
>>
>>   # s_2 = PHI <2(4), 1(3), 1(6), s_3(5)>
>> <L10>:
>>
>> as the condition in at L3 complicates things.
>>
>> The code is also really really special (and written in an awful
>> way ...).
>>
>> Richard.
>>
>> > Thanks,
>> > -Jiangning
>> >
>> >
>> >
>> >
>
>
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: A case that PRE optimization hurts performance
  2011-08-02  9:23 ` Richard Guenther
@ 2011-09-16  2:01   ` Jiangning Liu
       [not found]   ` <4e72add8.c9d0e30a.142a.ffffe60eSMTPIN_ADDED@mx.google.com>
  1 sibling, 0 replies; 13+ messages in thread
From: Jiangning Liu @ 2011-09-16  2:01 UTC (permalink / raw)
  To: 'Richard Guenther'; +Cc: gcc

Hi Richard,

I slightly changed the case to be like below,

int f(char *t) {
    int s=0;

    while (*t && s != 1) {
        switch (s) {
        case 0:   /* path 1 */
            s = 2;
            break;
        case 2:   /* path 2 */
            s = 3; /* changed */
            break;
        default:  /* path 3 */
            if (*t == '-') 
                s = 2;
            break;
        }
        t++;
    }

    return s;
}

"-O2" is still worse than "-O2 -fno-tree-pre". 

"-O2 -fno-tree-pre" result is 

f:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	movl	8(%ebp), %edx
	movzbl	(%edx), %ecx
	jmp	.L14
	.p2align 4,,7
	.p2align 3
.L5:
	movl	$2, %eax
.L7:
	addl	$1, %edx
	cmpl	$1, %eax
	movzbl	(%edx), %ecx
	je	.L3
.L14:
	testb	%cl, %cl
	je	.L3
	testl	%eax, %eax
	je	.L5
	cmpl	$2, %eax
	.p2align 4,,5
	je	.L17
	cmpb	$45, %cl
	.p2align 4,,5
	je	.L5
	addl	$1, %edx
	cmpl	$1, %eax
	movzbl	(%edx), %ecx
	jne	.L14
	.p2align 4,,7
	.p2align 3
.L3:
	popl	%ebp
	.p2align 4,,2
	ret
	.p2align 4,,7
	.p2align 3
.L17:
	movb	$3, %al
	.p2align 4,,3
	jmp	.L7

While "-O2" result is 

f:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	movl	8(%ebp), %edx
	pushl	%ebx
	movzbl	(%edx), %ecx
	jmp	.L14
	.p2align 4,,7
	.p2align 3
.L5:
	movl	$1, %ebx
	movl	$2, %eax
.L7:
	addl	$1, %edx
	testb	%bl, %bl
	movzbl	(%edx), %ecx
	je	.L3
.L14:
	testb	%cl, %cl
	je	.L3
	testl	%eax, %eax
	je	.L5
	cmpl	$2, %eax
	.p2align 4,,5
	je	.L16
	cmpb	$45, %cl
	.p2align 4,,5
	je	.L5
	cmpl	$1, %eax
	setne	%bl
	addl	$1, %edx
	testb	%bl, %bl
	movzbl	(%edx), %ecx
	jne	.L14
	.p2align 4,,7
	.p2align 3
.L3:
	popl	%ebx
	popl	%ebp
	ret
	.p2align 4,,7
	.p2align 3
.L16:
	movl	$1, %ebx
	movb	$3, %al
	jmp	.L7

You may notice that register ebx is introduced, and some more instructions
around ebx are generated as well. i.e.

	setne	%bl
	testb	%bl, %bl

I agree with you that in theory PRE does the right thing to minimize the
computation cost on gimple level. However, the problem is the cost of
converting comparison result to a bool value is not considered, so it
actually makes binary code worse. For this case, as I summarized below, to
complete the same functionality "With PRE" is worse than "Without PRE" for
all three paths,

* Without PRE,

Path1:
	movl	$2, %eax
	cmpl	$1, %eax
	je	.L3

Path2:
	movb	$3, %al
	cmpl	$1, %eax
	je	.L3

Path3:
	cmpl	$1, %eax
	jne	.L14

* With PRE,

Path1:
	movl	$1, %ebx
	movl	$2, %eax
	testb	%bl, %bl
	je	.L3

Path2:
	movl	$1, %ebx
	movb	$3, %al
	testb	%bl, %bl
	je	.L3

Path3:
	cmpl	$1, %eax
	setne	%bl
	testb	%bl, %bl
	jne	.L14

Do you have any more thoughts?

Thanks,
-Jiangning

> -----Original Message-----
> From: Richard Guenther [mailto:richard.guenther@gmail.com]
> Sent: Tuesday, August 02, 2011 5:23 PM
> To: Jiangning Liu
> Cc: gcc@gcc.gnu.org
> Subject: Re: A case that PRE optimization hurts performance
> 
> On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <jiangning.liu@arm.com>
> wrote:
> > Hi,
> >
> > For the following simple test case, PRE optimization hoists
> computation
> > (s!=1) into the default branch of the switch statement, and finally
> causes
> > very poor code generation. This problem occurs in both X86 and ARM,
> and I
> > believe it is also a problem for other targets.
> >
> > int f(char *t) {
> >    int s=0;
> >
> >    while (*t && s != 1) {
> >        switch (s) {
> >        case 0:
> >            s = 2;
> >            break;
> >        case 2:
> >            s = 1;
> >            break;
> >        default:
> >            if (*t == '-')
> >                s = 1;
> >            break;
> >        }
> >        t++;
> >    }
> >
> >    return s;
> > }
> >
> > Taking X86 as an example, with option "-O2" you may find 52
> instructions
> > generated like below,
> >
> > 00000000 <f>:
> >   0:   55                      push   %ebp
> >   1:   31 c0                   xor    %eax,%eax
> >   3:   89 e5                   mov    %esp,%ebp
> >   5:   57                      push   %edi
> >   6:   56                      push   %esi
> >   7:   53                      push   %ebx
> >   8:   8b 55 08                mov    0x8(%ebp),%edx
> >   b:   0f b6 0a                movzbl (%edx),%ecx
> >   e:   84 c9                   test   %cl,%cl
> >  10:   74 50                   je     62 <f+0x62>
> >  12:   83 c2 01                add    $0x1,%edx
> >  15:   85 c0                   test   %eax,%eax
> >  17:   75 23                   jne    3c <f+0x3c>
> >  19:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
> >  20:   0f b6 0a                movzbl (%edx),%ecx
> >  23:   84 c9                   test   %cl,%cl
> >  25:   0f 95 c0                setne  %al
> >  28:   89 c7                   mov    %eax,%edi
> >  2a:   b8 02 00 00 00          mov    $0x2,%eax
> >  2f:   89 fb                   mov    %edi,%ebx
> >  31:   83 c2 01                add    $0x1,%edx
> >  34:   84 db                   test   %bl,%bl
> >  36:   74 2a                   je     62 <f+0x62>
> >  38:   85 c0                   test   %eax,%eax
> >  3a:   74 e4                   je     20 <f+0x20>
> >  3c:   83 f8 02                cmp    $0x2,%eax
> >  3f:   74 1f                   je     60 <f+0x60>
> >  41:   80 f9 2d                cmp    $0x2d,%cl
> >  44:   74 22                   je     68 <f+0x68>
> >  46:   0f b6 0a                movzbl (%edx),%ecx
> >  49:   83 f8 01                cmp    $0x1,%eax
> >  4c:   0f 95 c3                setne  %bl
> >  4f:   89 df                   mov    %ebx,%edi
> >  51:   84 c9                   test   %cl,%cl
> >  53:   0f 95 c3                setne  %bl
> >  56:   89 de                   mov    %ebx,%esi
> >  58:   21 f7                   and    %esi,%edi
> >  5a:   eb d3                   jmp    2f <f+0x2f>
> >  5c:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
> >  60:   b0 01                   mov    $0x1,%al
> >  62:   5b                      pop    %ebx
> >  63:   5e                      pop    %esi
> >  64:   5f                      pop    %edi
> >  65:   5d                      pop    %ebp
> >  66:   c3                      ret
> >  67:   90                      nop
> >  68:   b8 01 00 00 00          mov    $0x1,%eax
> >  6d:   5b                      pop    %ebx
> >  6e:   5e                      pop    %esi
> >  6f:   5f                      pop    %edi
> >  70:   5d                      pop    %ebp
> >  71:   c3                      ret
> >
> > But with command line option "-O2 -fno-tree-pre", there are only 12
> > instructions generated, and the code would be very clean like below,
> >
> > 00000000 <f>:
> >   0:   55                      push   %ebp
> >   1:   31 c0                   xor    %eax,%eax
> >   3:   89 e5                   mov    %esp,%ebp
> >   5:   8b 55 08                mov    0x8(%ebp),%edx
> >   8:   80 3a 00                cmpb   $0x0,(%edx)
> >   b:   74 0e                   je     1b <f+0x1b>
> >   d:   80 7a 01 00             cmpb   $0x0,0x1(%edx)
> >  11:   b0 02                   mov    $0x2,%al
> >  13:   ba 01 00 00 00          mov    $0x1,%edx
> >  18:   0f 45 c2                cmovne %edx,%eax
> >  1b:   5d                      pop    %ebp
> >  1c:   c3                      ret
> >
> > Do you have any idea about this?
> 
> If you look at what PRE does it is clear that it's a profitable
> transformation
> as it reduces the number of instructions computed on the paths where
> s is set to a constant.  If you ask for size optimization you won't get
> this transformation.
> 
> Now - on the tree level we don't see the if-conversion opportunity
> (we don't have a conditional move instruction), but still we could
> improve phiopt to handle switch statements - not sure what we
> could do with the case in question though which is
> 
> <bb 3>:
>   switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>>
> 
> <L11>:
>   goto <bb 7> (<L10>);
> 
> <L3>:
>   if (D.2703_6 == 45)
>     goto <bb 6>;
>   else
>     goto <bb 7> (<L10>);
> 
> <bb 6>:
> 
>   # s_2 = PHI <2(4), 1(3), 1(6), s_3(5)>
> <L10>:
> 
> as the condition in at L3 complicates things.
> 
> The code is also really really special (and written in an awful
> way ...).
> 
> Richard.
> 
> > Thanks,
> > -Jiangning
> >
> >
> >
> >




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: A case that PRE optimization hurts performance
       [not found] <4e3762ed.030d440a.1143.2af2SMTPIN_ADDED@mx.google.com>
@ 2011-08-02  9:23 ` Richard Guenther
  2011-09-16  2:01   ` Jiangning Liu
       [not found]   ` <4e72add8.c9d0e30a.142a.ffffe60eSMTPIN_ADDED@mx.google.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Richard Guenther @ 2011-08-02  9:23 UTC (permalink / raw)
  To: Jiangning Liu; +Cc: gcc

On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu <jiangning.liu@arm.com> wrote:
> Hi,
>
> For the following simple test case, PRE optimization hoists computation
> (s!=1) into the default branch of the switch statement, and finally causes
> very poor code generation. This problem occurs in both X86 and ARM, and I
> believe it is also a problem for other targets.
>
> int f(char *t) {
>    int s=0;
>
>    while (*t && s != 1) {
>        switch (s) {
>        case 0:
>            s = 2;
>            break;
>        case 2:
>            s = 1;
>            break;
>        default:
>            if (*t == '-')
>                s = 1;
>            break;
>        }
>        t++;
>    }
>
>    return s;
> }
>
> Taking X86 as an example, with option "-O2" you may find 52 instructions
> generated like below,
>
> 00000000 <f>:
>   0:   55                      push   %ebp
>   1:   31 c0                   xor    %eax,%eax
>   3:   89 e5                   mov    %esp,%ebp
>   5:   57                      push   %edi
>   6:   56                      push   %esi
>   7:   53                      push   %ebx
>   8:   8b 55 08                mov    0x8(%ebp),%edx
>   b:   0f b6 0a                movzbl (%edx),%ecx
>   e:   84 c9                   test   %cl,%cl
>  10:   74 50                   je     62 <f+0x62>
>  12:   83 c2 01                add    $0x1,%edx
>  15:   85 c0                   test   %eax,%eax
>  17:   75 23                   jne    3c <f+0x3c>
>  19:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
>  20:   0f b6 0a                movzbl (%edx),%ecx
>  23:   84 c9                   test   %cl,%cl
>  25:   0f 95 c0                setne  %al
>  28:   89 c7                   mov    %eax,%edi
>  2a:   b8 02 00 00 00          mov    $0x2,%eax
>  2f:   89 fb                   mov    %edi,%ebx
>  31:   83 c2 01                add    $0x1,%edx
>  34:   84 db                   test   %bl,%bl
>  36:   74 2a                   je     62 <f+0x62>
>  38:   85 c0                   test   %eax,%eax
>  3a:   74 e4                   je     20 <f+0x20>
>  3c:   83 f8 02                cmp    $0x2,%eax
>  3f:   74 1f                   je     60 <f+0x60>
>  41:   80 f9 2d                cmp    $0x2d,%cl
>  44:   74 22                   je     68 <f+0x68>
>  46:   0f b6 0a                movzbl (%edx),%ecx
>  49:   83 f8 01                cmp    $0x1,%eax
>  4c:   0f 95 c3                setne  %bl
>  4f:   89 df                   mov    %ebx,%edi
>  51:   84 c9                   test   %cl,%cl
>  53:   0f 95 c3                setne  %bl
>  56:   89 de                   mov    %ebx,%esi
>  58:   21 f7                   and    %esi,%edi
>  5a:   eb d3                   jmp    2f <f+0x2f>
>  5c:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
>  60:   b0 01                   mov    $0x1,%al
>  62:   5b                      pop    %ebx
>  63:   5e                      pop    %esi
>  64:   5f                      pop    %edi
>  65:   5d                      pop    %ebp
>  66:   c3                      ret
>  67:   90                      nop
>  68:   b8 01 00 00 00          mov    $0x1,%eax
>  6d:   5b                      pop    %ebx
>  6e:   5e                      pop    %esi
>  6f:   5f                      pop    %edi
>  70:   5d                      pop    %ebp
>  71:   c3                      ret
>
> But with command line option "-O2 -fno-tree-pre", there are only 12
> instructions generated, and the code would be very clean like below,
>
> 00000000 <f>:
>   0:   55                      push   %ebp
>   1:   31 c0                   xor    %eax,%eax
>   3:   89 e5                   mov    %esp,%ebp
>   5:   8b 55 08                mov    0x8(%ebp),%edx
>   8:   80 3a 00                cmpb   $0x0,(%edx)
>   b:   74 0e                   je     1b <f+0x1b>
>   d:   80 7a 01 00             cmpb   $0x0,0x1(%edx)
>  11:   b0 02                   mov    $0x2,%al
>  13:   ba 01 00 00 00          mov    $0x1,%edx
>  18:   0f 45 c2                cmovne %edx,%eax
>  1b:   5d                      pop    %ebp
>  1c:   c3                      ret
>
> Do you have any idea about this?

If you look at what PRE does it is clear that it's a profitable transformation
as it reduces the number of instructions computed on the paths where
s is set to a constant.  If you ask for size optimization you won't get
this transformation.

Now - on the tree level we don't see the if-conversion opportunity
(we don't have a conditional move instruction), but still we could
improve phiopt to handle switch statements - not sure what we
could do with the case in question though which is

<bb 3>:
  switch (s_3) <default: <L3>, case 0: <L11>, case 2: <L10>>

<L11>:
  goto <bb 7> (<L10>);

<L3>:
  if (D.2703_6 == 45)
    goto <bb 6>;
  else
    goto <bb 7> (<L10>);

<bb 6>:

  # s_2 = PHI <2(4), 1(3), 1(6), s_3(5)>
<L10>:

as the condition in at L3 complicates things.

The code is also really really special (and written in an awful way ...).

Richard.

> Thanks,
> -Jiangning
>
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-09-27 15:12 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-02  2:37 A case that PRE optimization hurts performance Jiangning Liu
2011-08-15  8:40 ` Václav Zeman
     [not found] <4e3762ed.030d440a.1143.2af2SMTPIN_ADDED@mx.google.com>
2011-08-02  9:23 ` Richard Guenther
2011-09-16  2:01   ` Jiangning Liu
     [not found]   ` <4e72add8.c9d0e30a.142a.ffffe60eSMTPIN_ADDED@mx.google.com>
2011-09-23 20:25     ` Richard Guenther
2011-09-26  8:34       ` Jiangning Liu
     [not found]       ` <4e802c3d.94ebd80a.4b1f.ffff8101SMTPIN_ADDED@mx.google.com>
2011-09-26 14:33         ` Richard Guenther
2011-09-27  7:30           ` Jeff Law
2011-09-27 11:15             ` Jiangning Liu
2011-09-27 13:44             ` Richard Guenther
2011-09-27 18:59               ` Jeff Law
2011-09-27 18:59                 ` Richard Guenther
2011-09-27 21:56                   ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).