Hi guys,

Thanks for the review and for your responses - please find my answers below.

> Yes, I looked at the patch in detail this week (I am currently
> travelling, with sporadic internet access, and my days before leaving
> were extremely hectic).  The patch is OK except for the expr.c bits,
> which I cannot approve myself and which ought to go in separately
> anyway.

You probably meant the changes in emit-rtl.c.  Originally they were made
to overcome an issue with determining the alignment of a MEM_REF, but
now they seem to be unneeded - everything works well even without them.

> The reason I took so long to decide on the patch is that I did some
> experiments with the SSE loop, and it is hard to find scenarios with
> the current alignment/block-size code where it is a win over a library
> call on the current implementation (or with Ondra's).  So I did not
> find a block-size/chip combination where the current inline SSE loop
> would be the default codegen strategy.  One of the cases where the
> inline loop wins is when the extra register pressure imposed by the
> call causes IRA to not do a very good job on the surrounding code.

As I mentioned earlier, this implementation doesn't pretend to be
finished and completely tuned - it's just a base for future work in this
direction.  Nevertheless, I think that even now it could be useful for
cases where the alignment is statically known and the block size is big
enough - I doubt that a libcall would beat inlining there, due to the
call overhead (but yes, currently that's just my assumption).

> Michael, my apologies for taking so long to decide here.

No problem, Jan, thanks for the review.

> Do you think you can work on the memset and
> move_by_pieces/store_by_pieces bits?  I think the by-pieces code is a
> lot more promising than the actual inline SSE loops.

I think I'll find some time to implement memset (in i386.c) - but I'm
going to restrict the SSE loops to the cases where we are zeroing
memory.  That means we'll give up if we're filling memory with some
other value.  I suppose that zeroing is the most common use of memset,
so it'll be enough.  What do you think, does that sound reasonable?  (A
rough sketch of what I mean follows below, before the ChangeLog.)

As for move_by_pieces and store_by_pieces - I thought about them.  An
implementation there would be much more complicated and much harder to
tune, for an obvious reason: we would need a common implementation that
works for all architectures.  Instead, I've been thinking of adding yet
another algorithm for expanding memcpy/memset in i386.c - something like
fully_unrolled_loop.  It would perform the copy without any loop at all
and would presumably be useful for copying small blocks.  With that
implemented, we could change MOVE_RATIO so that memcpy/memset is always
expanded with some algorithm from i386.c rather than with
move_by_pieces/store_by_pieces.  I think such an implementation could be
optimized much better than anything in move_by_pieces.  (A second sketch
below illustrates the idea.)

> It does not measure real-world issues like icache pressure etc.  I
> would like more definitive data.  One possibility is to compile gcc
> with various options and repeatedly run it to compile a simple project
> for a day or so.

I agree, but we still need some tests to measure performance - whether
those tests are SPEC or something else.

I attached the updated patch - it's the same as the previous one, but
without the emit-rtl.c changes.  Is it OK for trunk?

The patch is bootstrapped and regtested on i386 and x86-64.  SPEC2000
also passes without regressions (I checked only stability, not
performance).
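Not part of the patch, just to make the memset point above concrete: a
minimal sketch, written with SSE2 intrinsics rather than as expander
code, of what a vector_loop expansion of memset (dst, 0, n) conceptually
produces.  It assumes a 16-byte-aligned destination and a size that is a
multiple of 64; the real expander would of course emit RTL and handle a
misaligned head and tail in prologue/epilogue code.

/* Minimal sketch, assuming dst is 16-byte aligned and n is a multiple
   of 64.  The function name is made up for illustration.  */
#include <emmintrin.h>
#include <stddef.h>

static void
zero_vector_loop (char *dst, size_t n)
{
  __m128i zero = _mm_setzero_si128 ();          /* pxor %xmm0, %xmm0 */
  for (size_t i = 0; i < n; i += 64)
    {
      /* Four aligned 16-byte stores per iteration.  */
      _mm_store_si128 ((__m128i *) (dst + i),      zero);
      _mm_store_si128 ((__m128i *) (dst + i + 16), zero);
      _mm_store_si128 ((__m128i *) (dst + i + 32), zero);
      _mm_store_si128 ((__m128i *) (dst + i + 48), zero);
    }
}

Filling with a non-zero byte would additionally require broadcasting the
value into the vector register, which is one more reason to start with
the zeroing case only.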
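Likewise, a toy illustration of the fully_unrolled_loop idea for small,
statically known block sizes: straight-line code with no loop and no
branches.  The function is hypothetical; each memcpy of a uint64_t is
just the portable C spelling of the single (possibly unaligned) 8-byte
load or store the expander would emit.

#include <stdint.h>
#include <string.h>

/* Copy exactly 32 bytes with straight-line 8-byte moves.  */
static void
copy32 (void *dst, const void *src)
{
  const char *s = (const char *) src;
  char *d = (char *) dst;
  uint64_t t;
  memcpy (&t, s, 8);       memcpy (d, &t, 8);
  memcpy (&t, s + 8, 8);   memcpy (d + 8, &t, 8);
  memcpy (&t, s + 16, 8);  memcpy (d + 16, &t, 8);
  memcpy (&t, s + 24, 8);  memcpy (d + 24, &t, 8);
}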
Changelog:

2013-07-02  Michael Zolotukhin

        * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
        * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
        adjust_address instead of change_address to keep info about
        alignment.
        (emit_strmov): Remove.
        (emit_memmov): New function.
        (expand_movmem_epilogue): Refactor to properly handle bigger
        sizes.
        (expand_movmem_prologue): Likewise and return updated rtx for
        destination.
        (expand_constant_movmem_prologue): Likewise and return updated
        rtx for destination and source.
        (decide_alignment): Refactor, handle vector_loop.
        (ix86_expand_movmem): Likewise.
        (ix86_expand_setmem): Likewise.
        * config/i386/i386.opt (Enum): Add vector_loop to option
        stringop_alg.

Thanks,
Michael

> Honza
>
> > Ondra

On 30 June 2013 23:22, Ondřej Bílka wrote:
> On Sun, Jun 30, 2013 at 11:06:47AM +0200, Jan Hubicka wrote:
>> > On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin wrote:
>> > > Ping.
>> > >
>> > > On 20 June 2013 20:56, Michael Zolotukhin wrote:
>> > >> It seems that one of the tests needed a small fix.
>> > >> Attached is a corrected version.
>> >
>> > Jan, do you plan to review this patch?  It touches the area that
>> > you worked on extensively some time ago, so your expert opinion
>> > would be much appreciated here.
>>
>> Yes, I looked at the patch in detail this week (I am currently
>> travelling, with sporadic internet access, and my days before leaving
>> were extremely hectic).  The patch is OK except for the expr.c bits,
>> which I cannot approve myself and which ought to go in separately
>> anyway.
>>
>> The reason I took so long to decide on the patch is that I did some
>> experiments with the SSE loop, and it is hard to find scenarios with
>> the current alignment/block-size code where it is a win over a
>> library call on the current implementation (or with Ondra's).  So I
>> did not find a block-size/chip combination where the current inline
>> SSE loop would be the default codegen strategy.  One of the cases
>> where the inline loop wins is when the extra register pressure
>> imposed by the call causes IRA to not do a very good job on the
>> surrounding code.
>>
> It does not measure real-world issues like icache pressure etc.  I
> would like more definitive data.  One possibility is to compile gcc
> with various options and repeatedly run it to compile a simple project
> for a day or so.
>
> I am interested up to which size inlining code makes sense; my guess
> is that after 64 bytes and no FP registers, inlining is unproductive.
>
> I tried this to see how LD_PRELOADing memcpy affects performance, and
> it is measurable after about a day; the performance impact is small
> (in my case it was about 0.5%), so you really need that long to
> converge.
>
> Ondra

--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.