Hi guys,

Thanks for the review and for your responses - please find my answers below.

> Yes, I looked at the patch in detail this week (I am currently
> travelling, with sporadic internet access, and my days before leaving
> were extremely hectic).  The patch is OK except for the expr.c bits,
> which I cannot approve myself and which ought to go in separately
> anyway.

You probably meant the changes in emit-rtl.c.  Originally they were made
to overcome an issue with determining the alignment of a MEM_REF, but
now they seem to be unneeded - everything works well even without them.

> The reason I took so long to decide on the patch is that I did some
> experiments with the SSE loop, and it is hard to find scenarios with
> the current alignment/block-size code where it is a win over a library
> call on the current implementation (or with Ondra's).  So I did not
> find a block-size/chip combination where the current inline SSE loop
> would be the default codegen strategy.  One of the cases where the
> inline loop wins is when the extra register pressure imposed by the
> call causes IRA to not do a very good job on the surrounding code.

As I mentioned earlier, this implementation doesn't pretend to be
finished and completely tuned - it's just a base for future work in this
direction.  Nevertheless, I think that even now it could be useful for
cases where the alignment is statically known and the block size is big
enough - I doubt that a libcall would beat inlining there, due to the
call overhead (but yes, currently that's just my assumption).

> Michael, my apologies for taking so long to decide here.

No problem, Jan, thanks for the review.

> Do you think you can work on the memset and
> move_by_pieces/store_by_pieces bits?  I think the by-pieces code is a
> lot more promising than the actual inline SSE loops.

I think I'll find some time to implement memset (in i386.c) - but I'm
going to restrict the SSE loops to the cases where we are zeroing
memory.  That means we'll give up if we're filling memory with some
other value.  I suppose that zeroing is the most common use of memset,
so it'll be enough.  What do you think, does that sound reasonable?  (A
rough sketch of what I mean follows below, before the ChangeLog.)

As for move_by_pieces and store_by_pieces - I thought about them.  An
implementation there would be much more complicated and much harder to
tune, for an obvious reason: we would need a common implementation that
works for all architectures.  Instead, I've been thinking of adding yet
another algorithm for expanding memcpy/memset in i386.c - something like
fully_unrolled_loop.  It would perform the copy without any loop at all
and would presumably be useful for copying small blocks.  With that
implemented, we could change MOVE_RATIO so that memcpy/memset is always
expanded with some algorithm from i386.c rather than with
move_by_pieces/store_by_pieces.  I think such an implementation could be
optimized much better than anything in move_by_pieces.  (A second sketch
below illustrates the idea.)

> It does not measure real-world issues like icache pressure etc.  I
> would like more definitive data.  One possibility is to compile gcc
> with various options and repeatedly run it to compile a simple project
> for a day or so.

I agree, but we still need some tests to measure performance - whether
those tests are SPEC or something else.

I attached the updated patch - it's the same as the previous one, but
without the emit-rtl.c changes.  Is it OK for trunk?

The patch is bootstrapped and regtested on i386 and x86-64.  SPEC2000
also passes without regressions (I checked only stability, not
performance).
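Not part of the patch, just to make the memset point above concrete: a
minimal sketch, written with SSE2 intrinsics rather than as expander
code, of what a vector_loop expansion of memset (dst, 0, n) conceptually
produces.  It assumes a 16-byte-aligned destination and a size that is a
multiple of 64; the real expander would of course emit RTL and handle a
misaligned head and tail in prologue/epilogue code.

/* Minimal sketch, assuming dst is 16-byte aligned and n is a multiple
   of 64.  The function name is made up for illustration.  */
#include <emmintrin.h>
#include <stddef.h>

static void
zero_vector_loop (char *dst, size_t n)
{
  __m128i zero = _mm_setzero_si128 ();          /* pxor %xmm0, %xmm0 */
  for (size_t i = 0; i < n; i += 64)
    {
      /* Four aligned 16-byte stores per iteration.  */
      _mm_store_si128 ((__m128i *) (dst + i),      zero);
      _mm_store_si128 ((__m128i *) (dst + i + 16), zero);
      _mm_store_si128 ((__m128i *) (dst + i + 32), zero);
      _mm_store_si128 ((__m128i *) (dst + i + 48), zero);
    }
}

Filling with a non-zero byte would additionally require broadcasting the
value into the vector register, which is one more reason to start with
the zeroing case only.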
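Likewise, a toy illustration of the fully_unrolled_loop idea for small,
statically known block sizes: straight-line code with no loop and no
branches.  The function is hypothetical; each memcpy of a uint64_t is
just the portable C spelling of the single (possibly unaligned) 8-byte
load or store the expander would emit.

#include <stdint.h>
#include <string.h>

/* Copy exactly 32 bytes with straight-line 8-byte moves.  */
static void
copy32 (void *dst, const void *src)
{
  const char *s = (const char *) src;
  char *d = (char *) dst;
  uint64_t t;
  memcpy (&t, s, 8);       memcpy (d, &t, 8);
  memcpy (&t, s + 8, 8);   memcpy (d + 8, &t, 8);
  memcpy (&t, s + 16, 8);  memcpy (d + 16, &t, 8);
  memcpy (&t, s + 24, 8);  memcpy (d + 24, &t, 8);
}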
Changelog:

2013-07-02  Michael Zolotukhin

        * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
        * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
        adjust_address instead of change_address to keep info about
        alignment.
        (emit_strmov): Remove.
        (emit_memmov): New function.
        (expand_movmem_epilogue): Refactor to properly handle bigger
        sizes.
        (expand_movmem_prologue): Likewise and return updated rtx for
        destination.
        (expand_constant_movmem_prologue): Likewise and return updated
        rtx for destination and source.
        (decide_alignment): Refactor, handle vector_loop.
        (ix86_expand_movmem): Likewise.
        (ix86_expand_setmem): Likewise.
        * config/i386/i386.opt (Enum): Add vector_loop to option
        stringop_alg.

Thanks,
Michael

> Honza
>
> > Ondra

On 30 June 2013 23:22, Ondřej Bílka wrote:
> On Sun, Jun 30, 2013 at 11:06:47AM +0200, Jan Hubicka wrote:
>> > On Tue, Jun 25, 2013 at 3:36 PM, Michael Zolotukhin wrote:
>> > > Ping.
>> > >
>> > > On 20 June 2013 20:56, Michael Zolotukhin wrote:
>> > >> It seems that one of the tests needed a small fix.
>> > >> Attached is a corrected version.
>> >
>> > Jan, do you plan to review this patch?  It touches the area that
>> > you worked on extensively some time ago, so your expert opinion
>> > would be much appreciated here.
>>
>> Yes, I looked at the patch in detail this week (I am currently
>> travelling, with sporadic internet access, and my days before leaving
>> were extremely hectic).  The patch is OK except for the expr.c bits,
>> which I cannot approve myself and which ought to go in separately
>> anyway.
>>
>> The reason I took so long to decide on the patch is that I did some
>> experiments with the SSE loop, and it is hard to find scenarios with
>> the current alignment/block-size code where it is a win over a
>> library call on the current implementation (or with Ondra's).  So I
>> did not find a block-size/chip combination where the current inline
>> SSE loop would be the default codegen strategy.  One of the cases
>> where the inline loop wins is when the extra register pressure
>> imposed by the call causes IRA to not do a very good job on the
>> surrounding code.
>>
> It does not measure real-world issues like icache pressure etc.  I
> would like more definitive data.  One possibility is to compile gcc
> with various options and repeatedly run it to compile a simple project
> for a day or so.
>
> I am interested up to which size inlining code makes sense; my guess
> is that after 64 bytes and no FP registers, inlining is unproductive.
>
> I tried this to see how LD_PRELOADing memcpy affects performance, and
> it is measurable after about a day; the performance impact is small
> (in my case it was about 0.5%), so you really need that long to
> converge.
>
> Ondra

--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.