public inbox for gcc-help@gcc.gnu.org
* x86 gcc lacks simple optimization
@ 2013-12-06  8:31 Konstantin Vladimirov
  2013-12-06  9:28 ` David Brown
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Konstantin Vladimirov @ 2013-12-06  8:31 UTC (permalink / raw)
  To: gcc, gcc-help

Hi,

Consider code:

int foo(char *t, char *v, int w)
{
    int i;

    for (i = 1; i != w; ++i)
    {
        int x = i << 2;
        v[x + 4] = t[x + 4];
    }

    return 0;
}

Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:

gcc -O2 -m32 -S test.c

You will see a loop formed like this:

.L5:
leal 0(,%eax,4), %edx
addl $1, %eax
movzbl 4(%edi,%edx), %ecx
cmpl %ebx, %eax
movb %cl, 4(%esi,%edx)
jne .L5

But it can be easily simplified to something like this:

.L5:
addl $1, %eax
movzbl (%esi,%eax,4), %edx
cmpl %ecx, %eax
movb %dl, (%ebx,%eax,4)
jne .L5

(i.e. left shift may be moved to address).

First, a question for the gcc-help mailing list: maybe there are some
options I've missed, and there IS a way to tell gcc my intention to do
this?

And a second question, for the gcc developers mailing list: I am working
on a private backend and want to add this optimization to it. What
would you advise - a custom gimple pass, an rtl pass, modifying some
existing pass, etc.?

---
With best regards, Konstantin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: x86 gcc lacks simple optimization
  2013-12-06  8:31 x86 gcc lacks simple optimization Konstantin Vladimirov
@ 2013-12-06  9:28 ` David Brown
  2013-12-06 10:09 ` Jakub Jelinek
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: David Brown @ 2013-12-06  9:28 UTC (permalink / raw)
  To: Konstantin Vladimirov, gcc, gcc-help

On 06/12/13 09:30, Konstantin Vladimirov wrote:
> Hi,
> 
> Consider code:
> 
> int foo(char *t, char *v, int w)
> {
> int i;
> 
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
> 
> return 0;
> }
> 
> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
> 
> gcc -O2 -m32 -S test.c
> 
> You will see loop, formed like:
> 
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
> 
> But it can be easily simplified to something like this:
> 
> .L5:
> addl $1, %eax
> movzbl (%esi,%eax,4), %edx
> cmpl %ecx, %eax
> movb %dl, (%ebx,%eax,4)
> jne .L5
> 
> (i.e. left shift may be moved to address).
> 
> First question to gcc-help maillist. May be there are some options,
> that I've missed, and there IS a way to explain gcc my intention to do
> this?
> 
> And second question to gcc developers mail list. I am working on
> private backend and want to add this optimization to my backend. What
> do you advise me to do -- custom gimple pass, or rtl pass, or modify
> some existent pass, etc?
> 

Hi,

Usually the gcc developers are not keen on emails going to both the help
and development list - they prefer to keep them separate.

My first thought when someone reports a "missed optimisation" issue,
especially with the x86 target, is: are you /sure/ this code is slower?
x86 chips are immensely complex, and the interplay between different
instructions, pipelines, superscalar execution, etc., means that code
that might appear faster can actually be slower.  So please check your
architecture flags (i.e., are you optimising for the "native" cpu, or
some other specific cpu - optimised code can differ between x86 cpus).
Then /measure/ the speed of the code to see if there is a real
difference.


Regarding your "private backend" - is this a modification of the x86
backend, or a completely different target?  If it is x86, then I think
the answer is "don't do it - work with the mainline code".  If it is
something else, then an x86-specific optimisation is of little use anyway.

mvh.,

David




* Re: x86 gcc lacks simple optimization
  2013-12-06  8:31 x86 gcc lacks simple optimization Konstantin Vladimirov
  2013-12-06  9:28 ` David Brown
@ 2013-12-06 10:09 ` Jakub Jelinek
  2013-12-06 10:10 ` Richard Biener
  2013-12-06 12:13 ` Marc Glisse
  3 siblings, 0 replies; 12+ messages in thread
From: Jakub Jelinek @ 2013-12-06 10:09 UTC (permalink / raw)
  To: Konstantin Vladimirov; +Cc: gcc, gcc-help

On Fri, Dec 06, 2013 at 12:30:54PM +0400, Konstantin Vladimirov wrote:
> Consider code:
> 
> int foo(char *t, char *v, int w)
> {
> int i;
> 
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
> 
> return 0;
> }

This is either the job of the ivopts pass (dunno why it doesn't consider
turning those memory accesses into TARGET_MEM_REF with the *4
multiplication in there), or of the combiner (which is limited to a
single use though; if you instead write, say, v[x + 4] = t[i]; in the
loop, it will for -m32 use the (...,4) addressing, but as you have two
uses it doesn't do it).
As others said, the question is whether using the more complex
addressing more than once is actually beneficial or not.

Anyway, while we're on this testcase, I wonder why VRP doesn't derive
ranges here.

  <bb 3>:
  # RANGE [-2147483648, 2147483647] NONZERO 0x000000000fffffffc
  x_5 = i_1 << 2;
  # RANGE ~[2147483648, 18446744071562067967] NONZERO 0x0fffffffffffffffc
  _6 = (sizetype) x_5;
  # RANGE ~[2147483652, 18446744071562067971] NONZERO 0x0fffffffffffffffc
  _7 = _6 + 4;
  # PT = nonlocal
  _9 = v_8(D) + _7;
  # PT = nonlocal
  _11 = t_10(D) + _7;
  _12 = *_11;
  *_9 = _12;
  i_14 = i_1 + 1;

  <bb 4>:
  # i_1 = PHI <1(2), i_14(3)>
  if (i_1 != w_4(D))
    goto <bb 3>;
  else
    goto <bb 5>;

As i is a signed integer with undefined overflow, and the loop starts
at 1 and i is only ever incremented, can't we derive
  # RANGE [1, 2147483647]
  # i_1 = PHI <1(2), i_14(3)>
and similarly for i_14?  We likely can't derive a similar range for
x_5 because at least in C++ it isn't undefined behavior if it
shifts into negative (or is it?), but at least with
x = i * 4;
instead we could.

	Jakub


* Re: x86 gcc lacks simple optimization
  2013-12-06  8:31 x86 gcc lacks simple optimization Konstantin Vladimirov
  2013-12-06  9:28 ` David Brown
  2013-12-06 10:09 ` Jakub Jelinek
@ 2013-12-06 10:10 ` Richard Biener
  2013-12-06 10:19   ` Konstantin Vladimirov
  2013-12-06 12:13 ` Marc Glisse
  3 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2013-12-06 10:10 UTC (permalink / raw)
  To: Konstantin Vladimirov; +Cc: GCC Development, GCC-help

On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
<konstantin.vladimirov@gmail.com> wrote:
> Hi,
>
> Consider code:
>
> int foo(char *t, char *v, int w)
> {
> int i;
>
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
>
> return 0;
> }
>
> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>
> gcc -O2 -m32 -S test.c
>
> You will see loop, formed like:
>
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
>
> But it can be easily simplified to something like this:
>
> .L5:
> addl $1, %eax
> movzbl (%esi,%eax,4), %edx
> cmpl %ecx, %eax
> movb %dl, (%ebx,%eax,4)
> jne .L5
>
> (i.e. left shift may be moved to address).
>
> First question to gcc-help maillist. May be there are some options,
> that I've missed, and there IS a way to explain gcc my intention to do
> this?
>
> And second question to gcc developers mail list. I am working on
> private backend and want to add this optimization to my backend. What
> do you advise me to do -- custom gimple pass, or rtl pass, or modify
> some existent pass, etc?

This looks like a deficiency in induction variable optimization.  Note
that i << 2 may overflow and this overflow does not invoke undefined
behavior but is in the implementation defined behavior category.

The issue in this case is likely that the SCEV infrastructure does not handle
left-shifts.

Richard.

> ---
> With best regards, Konstantin


* Re: x86 gcc lacks simple optimization
  2013-12-06 10:10 ` Richard Biener
@ 2013-12-06 10:19   ` Konstantin Vladimirov
  2013-12-06 10:25     ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Vladimirov @ 2013-12-06 10:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, GCC-help

Hi,

Nothing changes if everything is unsigned and we are guaranteed not to
invoke UB on overflow:

unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
{
    unsigned i;

    for (i = 1; i != w; ++i)
    {
        unsigned x = i << 2;
        v[x + 4] = t[x + 4];
    }

    return 0;
}

yields:

.L5:
leal 0(,%eax,4), %edx
addl $1, %eax
movzbl 4(%edi,%edx), %ecx
cmpl %ebx, %eax
movb %cl, 4(%esi,%edx)
jne .L5

What is the SCEV infrastructure (scalar evolutions, I'm guessing?), and
which files/passes should I look in?

---
With best regards, Konstantin

On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
> <konstantin.vladimirov@gmail.com> wrote:
>> Hi,
>>
>> Consider code:
>>
>> int foo(char *t, char *v, int w)
>> {
>> int i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> int x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>
>> gcc -O2 -m32 -S test.c
>>
>> You will see loop, formed like:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> But it can be easily simplified to something like this:
>>
>> .L5:
>> addl $1, %eax
>> movzbl (%esi,%eax,4), %edx
>> cmpl %ecx, %eax
>> movb %dl, (%ebx,%eax,4)
>> jne .L5
>>
>> (i.e. left shift may be moved to address).
>>
>> First question to gcc-help maillist. May be there are some options,
>> that I've missed, and there IS a way to explain gcc my intention to do
>> this?
>>
>> And second question to gcc developers mail list. I am working on
>> private backend and want to add this optimization to my backend. What
>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>> some existent pass, etc?
>
> This looks like a deficiency in induction variable optimization.  Note
> that i << 2 may overflow and this overflow does not invoke undefined
> behavior but is in the implementation defined behavior category.
>
> The issue in this case is likely that the SCEV infrastructure does not handle
> left-shifts.
>
> Richard.
>
>> ---
>> With best regards, Konstantin


* Re: x86 gcc lacks simple optimization
  2013-12-06 10:19   ` Konstantin Vladimirov
@ 2013-12-06 10:25     ` Richard Biener
  2013-12-06 11:31       ` H.J. Lu
  2013-12-06 13:52       ` Konstantin Vladimirov
  0 siblings, 2 replies; 12+ messages in thread
From: Richard Biener @ 2013-12-06 10:25 UTC (permalink / raw)
  To: Konstantin Vladimirov; +Cc: GCC Development, GCC-help

On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
<konstantin.vladimirov@gmail.com> wrote:
> Hi,
>
> nothing changes if everything is unsigned and we are guaranteed to not
> raise UB on overflow:
>
> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
> {
> unsigned i;
>
> for (i = 1; i != w; ++i)
> {
> unsigned x = i << 2;
> v[x + 4] = t[x + 4];
> }
>
> return 0;
> }
>
> yields:
>
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
>
> What is SCEV infrastructure (guessing scalar evolutions?) and what
> files/passes to look in?

tree-scalar-evolution.c, look at where it handles MULT_EXPR but
lacks LSHIFT_EXPR support.

Richard.

> ---
> With best regards, Konstantin
>
> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
>> <konstantin.vladimirov@gmail.com> wrote:
>>> Hi,
>>>
>>> Consider code:
>>>
>>> int foo(char *t, char *v, int w)
>>> {
>>> int i;
>>>
>>> for (i = 1; i != w; ++i)
>>> {
>>> int x = i << 2;
>>> v[x + 4] = t[x + 4];
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>>
>>> gcc -O2 -m32 -S test.c
>>>
>>> You will see loop, formed like:
>>>
>>> .L5:
>>> leal 0(,%eax,4), %edx
>>> addl $1, %eax
>>> movzbl 4(%edi,%edx), %ecx
>>> cmpl %ebx, %eax
>>> movb %cl, 4(%esi,%edx)
>>> jne .L5
>>>
>>> But it can be easily simplified to something like this:
>>>
>>> .L5:
>>> addl $1, %eax
>>> movzbl (%esi,%eax,4), %edx
>>> cmpl %ecx, %eax
>>> movb %dl, (%ebx,%eax,4)
>>> jne .L5
>>>
>>> (i.e. left shift may be moved to address).
>>>
>>> First question to gcc-help maillist. May be there are some options,
>>> that I've missed, and there IS a way to explain gcc my intention to do
>>> this?
>>>
>>> And second question to gcc developers mail list. I am working on
>>> private backend and want to add this optimization to my backend. What
>>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>>> some existent pass, etc?
>>
>> This looks like a deficiency in induction variable optimization.  Note
>> that i << 2 may overflow and this overflow does not invoke undefined
>> behavior but is in the implementation defined behavior category.
>>
>> The issue in this case is likely that the SCEV infrastructure does not handle
>> left-shifts.
>>
>> Richard.
>>
>>> ---
>>> With best regards, Konstantin


* Re: x86 gcc lacks simple optimization
  2013-12-06 10:25     ` Richard Biener
@ 2013-12-06 11:31       ` H.J. Lu
  2013-12-06 13:52       ` Konstantin Vladimirov
  1 sibling, 0 replies; 12+ messages in thread
From: H.J. Lu @ 2013-12-06 11:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: Konstantin Vladimirov, GCC Development, GCC-help

On Fri, Dec 6, 2013 at 2:25 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
> <konstantin.vladimirov@gmail.com> wrote:
>> Hi,
>>
>> nothing changes if everything is unsigned and we are guaranteed to not
>> raise UB on overflow:
>>
>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
>> {
>> unsigned i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> unsigned x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> yields:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> What is SCEV infrastructure (guessing scalar evolutions?) and what
>> files/passes to look in?
>
> tree-scalar-evolution.c, look at where it handles MULT_EXPR but
> lacks LSHIFT_EXPR support.
>

For
--
int foo(char *t, char *v, int w)
{
    int i;

    for (i = 1; i != w; ++i)
    {
        int x = i * 2;
        v[x + 4] = t[x + 4];
    }

    return 0;
}
---

-O2 gives:

.L6:
    movzbl    4(%esi,%eax,2), %edx
    movb    %dl, 4(%ebx,%eax,2)
    addl    $1, %eax
    cmpl    %ecx, %eax
    jne    .L6


-- 
H.J.


* Re: x86 gcc lacks simple optimization
  2013-12-06  8:31 x86 gcc lacks simple optimization Konstantin Vladimirov
                   ` (2 preceding siblings ...)
  2013-12-06 10:10 ` Richard Biener
@ 2013-12-06 12:13 ` Marc Glisse
  3 siblings, 0 replies; 12+ messages in thread
From: Marc Glisse @ 2013-12-06 12:13 UTC (permalink / raw)
  To: Konstantin Vladimirov; +Cc: gcc, gcc-help

On Fri, 6 Dec 2013, Konstantin Vladimirov wrote:

> Consider code:
>
> int foo(char *t, char *v, int w)
> {
> int i;
>
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;

A side note, but something too few people seem to be aware of: writing 
i<<2 can pessimize code compared to i*4 (and it is never faster). That is 
because, at a high level, signed multiplication overflow is undefined 
behavior while shift isn't. At a low level, gcc knows it can implement *4 
as a shift anyway.

> v[x + 4] = t[x + 4];
> }
>
> return 0;
> }

-- 
Marc Glisse


* Re: x86 gcc lacks simple optimization
  2013-12-06 10:25     ` Richard Biener
  2013-12-06 11:31       ` H.J. Lu
@ 2013-12-06 13:52       ` Konstantin Vladimirov
  2013-12-06 14:17         ` Richard Biener
  1 sibling, 1 reply; 12+ messages in thread
From: Konstantin Vladimirov @ 2013-12-06 13:52 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Development, GCC-help

Hi,

Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c
and now it yields code like (x86 again):

.L5:
movzbl 4(%esi,%eax,4), %edx
movb %dl, 4(%ebx,%eax,4)
addl $1, %eax
cmpl %ecx, %eax
jne .L5

So the excessive lea is gone. That's great, thank you so much. But I
wonder what else I can do to move the add earlier and simplify the
memory accesses (I am guessing this is some kind of arithmetic
re-association; I'm still not sure where to look). It is important for
the architecture I am working on. What would you advise?

---
With best regards, Konstantin

On Fri, Dec 6, 2013 at 2:25 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
> <konstantin.vladimirov@gmail.com> wrote:
>> Hi,
>>
>> nothing changes if everything is unsigned and we are guaranteed to not
>> raise UB on overflow:
>>
>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
>> {
>> unsigned i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> unsigned x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> yields:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> What is SCEV infrastructure (guessing scalar evolutions?) and what
>> files/passes to look in?
>
> tree-scalar-evolution.c, look at where it handles MULT_EXPR but
> lacks LSHIFT_EXPR support.
>
> Richard.
>
>> ---
>> With best regards, Konstantin
>>
>> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
>>> <konstantin.vladimirov@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Consider code:
>>>>
>>>> int foo(char *t, char *v, int w)
>>>> {
>>>> int i;
>>>>
>>>> for (i = 1; i != w; ++i)
>>>> {
>>>> int x = i << 2;
>>>> v[x + 4] = t[x + 4];
>>>> }
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>>>
>>>> gcc -O2 -m32 -S test.c
>>>>
>>>> You will see loop, formed like:
>>>>
>>>> .L5:
>>>> leal 0(,%eax,4), %edx
>>>> addl $1, %eax
>>>> movzbl 4(%edi,%edx), %ecx
>>>> cmpl %ebx, %eax
>>>> movb %cl, 4(%esi,%edx)
>>>> jne .L5
>>>>
>>>> But it can be easily simplified to something like this:
>>>>
>>>> .L5:
>>>> addl $1, %eax
>>>> movzbl (%esi,%eax,4), %edx
>>>> cmpl %ecx, %eax
>>>> movb %dl, (%ebx,%eax,4)
>>>> jne .L5
>>>>
>>>> (i.e. left shift may be moved to address).
>>>>
>>>> First question to gcc-help maillist. May be there are some options,
>>>> that I've missed, and there IS a way to explain gcc my intention to do
>>>> this?
>>>>
>>>> And second question to gcc developers mail list. I am working on
>>>> private backend and want to add this optimization to my backend. What
>>>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>>>> some existent pass, etc?
>>>
>>> This looks like a deficiency in induction variable optimization.  Note
>>> that i << 2 may overflow and this overflow does not invoke undefined
>>> behavior but is in the implementation defined behavior category.
>>>
>>> The issue in this case is likely that the SCEV infrastructure does not handle
>>> left-shifts.
>>>
>>> Richard.
>>>
>>>> ---
>>>> With best regards, Konstantin


* Re: x86 gcc lacks simple optimization
  2013-12-06 13:52       ` Konstantin Vladimirov
@ 2013-12-06 14:17         ` Richard Biener
  2013-12-06 20:54           ` Jeff Law
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2013-12-06 14:17 UTC (permalink / raw)
  To: Konstantin Vladimirov; +Cc: GCC Development, GCC-help

On Fri, Dec 6, 2013 at 2:52 PM, Konstantin Vladimirov
<konstantin.vladimirov@gmail.com> wrote:
> Hi,
>
> Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c
> and now it yields code like (x86 again):
>
> .L5:
> movzbl 4(%esi,%eax,4), %edx
> movb %dl, 4(%ebx,%eax,4)
> addl $1, %eax
> cmpl %ecx, %eax
> jne .L5
>
> So, excessive lea is gone. It is great, thank you so much. But I
> wonder what else can I do to move add upper to simplify memory
> accesses (I am guessing, this is some arithmetical re-associations,
> still not sure where to look). For architecture, I am working on, it
> is important. What would you advise?

You need to look at IVOPTs and how it arrives at the choice of
induction variables.

Richard.

> ---
> With best regards, Konstantin
>
> On Fri, Dec 6, 2013 at 2:25 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
>> <konstantin.vladimirov@gmail.com> wrote:
>>> Hi,
>>>
>>> nothing changes if everything is unsigned and we are guaranteed to not
>>> raise UB on overflow:
>>>
>>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
>>> {
>>> unsigned i;
>>>
>>> for (i = 1; i != w; ++i)
>>> {
>>> unsigned x = i << 2;
>>> v[x + 4] = t[x + 4];
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> yields:
>>>
>>> .L5:
>>> leal 0(,%eax,4), %edx
>>> addl $1, %eax
>>> movzbl 4(%edi,%edx), %ecx
>>> cmpl %ebx, %eax
>>> movb %cl, 4(%esi,%edx)
>>> jne .L5
>>>
>>> What is SCEV infrastructure (guessing scalar evolutions?) and what
>>> files/passes to look in?
>>
>> tree-scalar-evolution.c, look at where it handles MULT_EXPR but
>> lacks LSHIFT_EXPR support.
>>
>> Richard.
>>
>>> ---
>>> With best regards, Konstantin
>>>
>>> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
>>>> <konstantin.vladimirov@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Consider code:
>>>>>
>>>>> int foo(char *t, char *v, int w)
>>>>> {
>>>>> int i;
>>>>>
>>>>> for (i = 1; i != w; ++i)
>>>>> {
>>>>> int x = i << 2;
>>>>> v[x + 4] = t[x + 4];
>>>>> }
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>>>>
>>>>> gcc -O2 -m32 -S test.c
>>>>>
>>>>> You will see loop, formed like:
>>>>>
>>>>> .L5:
>>>>> leal 0(,%eax,4), %edx
>>>>> addl $1, %eax
>>>>> movzbl 4(%edi,%edx), %ecx
>>>>> cmpl %ebx, %eax
>>>>> movb %cl, 4(%esi,%edx)
>>>>> jne .L5
>>>>>
>>>>> But it can be easily simplified to something like this:
>>>>>
>>>>> .L5:
>>>>> addl $1, %eax
>>>>> movzbl (%esi,%eax,4), %edx
>>>>> cmpl %ecx, %eax
>>>>> movb %dl, (%ebx,%eax,4)
>>>>> jne .L5
>>>>>
>>>>> (i.e. left shift may be moved to address).
>>>>>
>>>>> First question to gcc-help maillist. May be there are some options,
>>>>> that I've missed, and there IS a way to explain gcc my intention to do
>>>>> this?
>>>>>
>>>>> And second question to gcc developers mail list. I am working on
>>>>> private backend and want to add this optimization to my backend. What
>>>>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>>>>> some existent pass, etc?
>>>>
>>>> This looks like a deficiency in induction variable optimization.  Note
>>>> that i << 2 may overflow and this overflow does not invoke undefined
>>>> behavior but is in the implementation defined behavior category.
>>>>
>>>> The issue in this case is likely that the SCEV infrastructure does not handle
>>>> left-shifts.
>>>>
>>>> Richard.
>>>>
>>>>> ---
>>>>> With best regards, Konstantin


* Re: x86 gcc lacks simple optimization
  2013-12-06 14:17         ` Richard Biener
@ 2013-12-06 20:54           ` Jeff Law
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Law @ 2013-12-06 20:54 UTC (permalink / raw)
  To: Richard Biener, Konstantin Vladimirov; +Cc: GCC Development, GCC-help

On 12/06/13 07:17, Richard Biener wrote:
> On Fri, Dec 6, 2013 at 2:52 PM, Konstantin Vladimirov
> <konstantin.vladimirov@gmail.com> wrote:
>> Hi,
>>
>> Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c
>> and now it yields code like (x86 again):
>>
>> .L5:
>> movzbl 4(%esi,%eax,4), %edx
>> movb %dl, 4(%ebx,%eax,4)
>> addl $1, %eax
>> cmpl %ecx, %eax
>> jne .L5
>>
>> So, excessive lea is gone. It is great, thank you so much. But I
>> wonder what else can I do to move add upper to simplify memory
>> accesses (I am guessing, this is some arithmetical re-associations,
>> still not sure where to look). For architecture, I am working on, it
>> is important. What would you advise?
>
> You need to look at IVOPTs and how it arrives at the choice of
> induction variables.
Konstantin,

You might want to work with Ben Cheng from ARM; he's already poking at
this code and may have some ideas.

jeff


* Re: x86 gcc lacks simple optimization
@ 2013-12-06  9:40 Konstantin Vladimirov
  0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Vladimirov @ 2013-12-06  9:40 UTC (permalink / raw)
  To: David Brown; +Cc: gcc-help, gcc

Hi,

The x86 example was only for ease of reproduction. I am pretty sure
this is an architecture-independent issue. Say, on ARM:

.L2:
mov ip, r3, asl #2
add ip, ip, #4
add r3, r3, #1
ldrb r4, [r0, ip] @ zero_extendqisi2
cmp r3, r2
strb r4, [r1, ip]
bne .L2

May be improved to:

.L2:
add r3, r3, #1
ldrb ip, [r0, r3, asl #2] @ zero_extendqisi2
cmp r3, r2
strb ip, [r1, r3, asl #2]
bne .L2

And so on. I myself feel more comfortable with x86, but that is only a
matter of taste.

To get the improved version of the code, I just did by hand what the
compiler is expected to do automatically, i.e. rewrote things as:

int foo(char *t, char *v, int w)
{
    int i;

    for (i = 1; i != w; ++i)
    {
        v[(i + 1) << 2] = t[(i + 1) << 2];
    }

    return 0;
}

The private backend I am working on isn't a modification of any
existing one; it is written from scratch.

---
With best regards, Konstantin

On Fri, Dec 6, 2013 at 1:27 PM, David Brown <david@westcontrol.com> wrote:
> On 06/12/13 09:30, Konstantin Vladimirov wrote:
>> Hi,
>>
>> Consider code:
>>
>> int foo(char *t, char *v, int w)
>> {
>> int i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> int x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>
>> gcc -O2 -m32 -S test.c
>>
>> You will see loop, formed like:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> But it can be easily simplified to something like this:
>>
>> .L5:
>> addl $1, %eax
>> movzbl (%esi,%eax,4), %edx
>> cmpl %ecx, %eax
>> movb %dl, (%ebx,%eax,4)
>> jne .L5
>>
>> (i.e. left shift may be moved to address).
>>
>> First question to gcc-help maillist. May be there are some options,
>> that I've missed, and there IS a way to explain gcc my intention to do
>> this?
>>
>> And second question to gcc developers mail list. I am working on
>> private backend and want to add this optimization to my backend. What
>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>> some existent pass, etc?
>>
>
> Hi,
>
> Usually the gcc developers are not keen on emails going to both the help
> and development list - they prefer to keep them separate.
>
> My first thought when someone finds a "missed optimisation" issue,
> especially with the x86 target, is are you /sure/ this code is slower?
> x86 chips are immensely complex, and the interplay between different
> instructions, pipelines, superscaling, etc., means that code that might
> appear faster, can actually be slower.  So please check your
> architecture flags (i.e., are you optimising for the "native" cpu, or
> any other specific cpu - optimised code can be different for different
> x86 cpus).  Then /measure/ the speed of the code to see if there is a
> real difference.
>
>
> Regarding your "private backend" - is this a modification of the x86
> backend, or a completely different target?  If it is x86, then I think
> the answer is "don't do it - work with the mainline code".  If it is
> something else, then an x86-specific optimisation is of little use anyway.
>
> mvh.,
>
> David
>
>
>


end of thread, other threads:[~2013-12-06 20:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-06  8:31 x86 gcc lacks simple optimization Konstantin Vladimirov
2013-12-06  9:28 ` David Brown
2013-12-06 10:09 ` Jakub Jelinek
2013-12-06 10:10 ` Richard Biener
2013-12-06 10:19   ` Konstantin Vladimirov
2013-12-06 10:25     ` Richard Biener
2013-12-06 11:31       ` H.J. Lu
2013-12-06 13:52       ` Konstantin Vladimirov
2013-12-06 14:17         ` Richard Biener
2013-12-06 20:54           ` Jeff Law
2013-12-06 12:13 ` Marc Glisse
2013-12-06  9:40 Konstantin Vladimirov
