From: Michael Zolotukhin <michael.v.zolotukhin@gmail.com>
To: "Ondřej Bílka" <neleai@seznam.cz>
Cc: Jan Hubicka <hubicka@ucw.cz>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH, x86] Use vector moves in memmove expanding
Date: Fri, 12 Apr 2013 11:10:00 -0000
Message-ID: <CANtU078ZtSUA6wmFpxLjy8YWofGAbfjqP58fVJ1DECBLhCabCQ@mail.gmail.com>
In-Reply-To: <20130412085415.GA16101@domone.kolej.mff.cuni.cz>
> I did some profiling of builtin implementation, download this
> http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
Nice data, thanks!
Could you please describe what memcpy_new_builtin is here? Is it how
GCC expanded memcpy with this patch?
Is this a comparison between the libcall, the libcall with your version of
glibc, and the expanded memmove with the implementation from this patch?
Michael
On 12 April 2013 12:54, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Apr 11, 2013 at 04:32:30PM +0400, Michael Zolotukhin wrote:
>> > 128 is about the upper bound you can expand with SSE moves.
>> > Tuning did not take code size into account and measured only the case
>> > where the code is in a tight loop.
>> > For GPR moves the limit is around 64.
>> Thanks for the data - I've not performed measurements with this
>> implementation yet, but we surely should adjust thresholds to avoid
>> performance degradations on small sizes.
>>
>
> I did some profiling of builtin implementation, download this
> http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
>
> see files results_rand/result.html and results_rand_noicache/result.html
>
> memcpy_new_builtin calls the builtin for sizes x0, x1, ..., x5 and
> memcpy_new otherwise.
> I did the same for memcpy_glibc to see the variance.
>
> memcpy_new does not call builtin.
>
> To regenerate the graphs on another arch, run the benchmarks script.
> To use another builtin, change how the variant/builtin.c file is
> compiled in the Makefile.
>
> The builtins are faster by the cost of the function call that inlining
> avoids; I did not add that, as I do not know an estimate of this cost.
>
>> Michael
>>
>> On 10 April 2013 22:53, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
>> >> > Hi, I am writing memcpy for libc. It avoids computed jumps and is
>> >> > much faster on small strings (variant for Sandy Bridge attached).
>> >>
>> >> I'm not sure I get what you meant - could you please explain what
>> >> computed jumps are?
>> > Computed goto. See Duff's device; it works almost exactly the same.
>> >>
>> >> > You must also check performance with a cold instruction cache.
>> >> > Now memcpy(x,y,128) takes 126 bytes, which is too much.
>> >>
>> >> > Do not align for small sizes. The dependency caused by this erases any gains
>> >> > that you might get. Keep in mind that in 55% of cases the data are already
>> >> > aligned.
>> >>
>> >> Other algorithms are still available and we can use them for small
>> >> sizes. E.g. for sizes <128 we could emit loop with GPR-moves and don't
>> >> use vector instructions in it.
>> >
>> > 128 is about the upper bound you can expand with SSE moves.
>> > Tuning did not take code size into account and measured only the case
>> > where the code is in a tight loop.
>> > For GPR moves the limit is around 64.
>> >
>> > What matters is which code has the best performance/size ratio.
>> >> But that's tuning and I haven't worked on it yet - I'm going to
>> >> measure the performance of all algorithms on all sizes and thus determine
>> >> for which sizes each algorithm is preferable.
>> >> What I did in this patch is introduce some infrastructure to allow
>> >> emitting vector moves in movmem expansion - tuning is certainly
>> >> possible and needed, but that's out of the scope of the patch.
>> >>
>> >> On 10 April 2013 21:43, Ondřej Bílka <neleai@seznam.cz> wrote:
>> >> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
>> >> >> Hi,
>> >> >> This patch adds a new algorithm for expanding movmem on x86 and
>> >> >> refactors the existing implementation a bit. This is a reincarnation of the patch
>> >> >> that was sent a couple of years ago but wasn't checked in - now I have reworked it
>> >> >> from scratch and divided it into several more manageable parts.
>> >> >>
>> >> > Hi, I am writing memcpy for libc. It avoids computed jumps and is
>> >> > much faster on small strings (variant for Sandy Bridge attached).
>> >> >
>> >> >> For now this algorithm isn't used, because cost_models are tuned to
>> >> >> use existing ones. I believe the new algorithm will give better
>> >> >> performance, but I'll leave cost-models tuning for a separate patch.
>> >> >>
>> >> > You must also check performance with a cold instruction cache.
>> >> > Now memcpy(x,y,128) takes 126 bytes, which is too much.
>> >> >
>> >> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
>> >> >> well. Probably, there is another way of getting info about alignment -
>> >> >> if so, please let me know.
>> >> >>
>> >> > Do not align for small sizes. The dependency caused by this erases any gains
>> >> > that you might get. Keep in mind that in 55% of cases the data are already
>> >> > aligned.
>> >> >
>> >> > Also, in my tests the best way to handle the prologue is to first copy
>> >> > the last 16 bytes and then loop.
>> >> >
>> >> >> Similar improvements could be done in expanding of memset, but that's
>> >> >> in progress now and I'm going to proceed with it if this patch is ok.
>> >> >>
>> >> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
>> >> >>
>> >> >> Is it ok for trunk?
>> >> >>
>> >> >> Changelog entry:
>> >> >>
>> >> >> 2013-04-10 Michael Zolotukhin <michael.v.zolotukhin@gmail.com>
>> >> >>
>> >> >> * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
>> >> >> * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
>> >> >> adjust_address instead of change_address to keep info about alignment.
>> >> >> (emit_strmov): Remove.
>> >> >> (emit_memmov): New function.
>> >> >> (expand_movmem_epilogue): Refactor to properly handle bigger sizes.
>> >> >> (expand_movmem_prologue): Likewise and return updated rtx for
>> >> >> destination.
>> >> >> (expand_constant_movmem_prologue): Likewise and return updated rtx for
>> >> >> destination and source.
>> >> >> (decide_alignment): Refactor, handle vector_loop.
>> >> >> (ix86_expand_movmem): Likewise.
>> >> >> (ix86_expand_setmem): Likewise.
>> >> >> * config/i386/i386.opt (Enum): Add vector_loop to option stringop_alg.
>> >> >> * emit-rtl.c (get_mem_align_offset): Compute alignment for MEM_REF.
>>
>> --
>> ---
>> Best regards,
>> Michael V. Zolotukhin,
>> Software Engineer
>> Intel Corporation.
>
> --
>
> Spider infestation in warm case parts
--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
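
The suggestion quoted above, to copy the last 16 bytes first and then loop, can be illustrated with a small sketch. This is not code from the patch or from the attached glibc variant; the names copy16 and copy_tail_first are hypothetical, and the sketch assumes non-overlapping buffers (memcpy semantics). The fixed-size 16-byte memcpy stands in for an unaligned vector move, which compilers typically (but not necessarily) emit for it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move exactly 16 bytes. A fixed-size memcpy like
   this is usually lowered to one unaligned 16-byte load/store pair. */
static void copy16(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, 16);
}

/* Sketch of a tail-first copy for non-overlapping buffers. */
void *copy_tail_first(void *dstv, const void *srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;

    if (n < 16) {
        /* Small sizes: plain byte loop, no alignment work at all. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
        return dstv;
    }

    /* Copy the last 16 bytes first; this covers the ragged tail. */
    copy16(dst + n - 16, src + n - 16);

    /* Then copy whole 16-byte blocks from the start. The final block
       may overlap the tail already written, which is harmless because
       it stores the same bytes again. i < n - 16 keeps i + 16 < n. */
    for (size_t i = 0; i < n - 16; i += 16)
        copy16(dst + i, src + i);

    return dstv;
}
```

With this shape no separate epilogue or size-dependent jump is needed: the overlapping tail store replaces the remainder handling, at the cost of writing up to 15 bytes twice.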