Re: [PATCH, x86] Use vector moves in memmove expanding

From: "Ondřej Bílka" <neleai@seznam.cz>
To: Michael Zolotukhin <michael.v.zolotukhin@gmail.com>
Cc: Jan Hubicka <hubicka@ucw.cz>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH, x86] Use vector moves in memmove expanding
Date: Sat, 13 Apr 2013 18:13:00 -0000
Message-ID: <20130413063608.GA5592@domone>
In-Reply-To: <CANtU078ZtSUA6wmFpxLjy8YWofGAbfjqP58fVJ1DECBLhCabCQ@mail.gmail.com>

On Fri, Apr 12, 2013 at 01:08:15PM +0400, Michael Zolotukhin wrote:
> > I did some profiling of builtin implementation, download this
> > http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
> Nice data, thanks!
> Could you please describe what is memcpy_new_builtin here? Is it how
> GCC expanded memcpy with this patch?
> Is this a comparison between libcall, libcall with your version of
> glibc, and expanded memmov with implementation from this patch?
>
 
I try to make the benchmarks self-contained, so now I measure the
libcall, the libcall with my version, and the current builtin expansion.

I updated my benchmark. One of the problems of measuring memcpy is that
most memory ops happen asynchronously, so this version should capture that.
(The padding should now be sufficient, but I have not subtracted it from
the time yet.)

Now memcpy_gcc_builtin measures the builtin for the first 100 sizes, then
switches to my implementation.

I added memcpy_new_builtin, which is currently the same as memcpy_gcc_builtin.

To add your implementation, compile the variant/builtin.c file into the
variant/builtin.s file, then run ./benchmark.
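
For readers without the tarball at hand, a file in the role of
variant/builtin.c could look roughly like the sketch below. It is a
hypothetical illustration, not the file from the benchmark: the names
memcpy_dispatch and memcpy_fallback and the C1/C10 macros are assumptions,
and memcpy_fallback merely stands in for whichever library routine is being
compared. The idea is that each size below 100 gets its own case, so that
the compiler sees a constant length and may expand the copy inline.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the library routine under test. */
static void *memcpy_fallback(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* One case per size, so the builtin sees a compile-time constant length.
   __builtin_memcpy is a GCC/Clang builtin; whether it is expanded inline
   for a given size depends on the target and the tuning options. */
#define C1(k)  case (k): return __builtin_memcpy(dst, src, (k));
#define C10(k) C1((k) + 0) C1((k) + 1) C1((k) + 2) C1((k) + 3) C1((k) + 4) \
               C1((k) + 5) C1((k) + 6) C1((k) + 7) C1((k) + 8) C1((k) + 9)

/* Sizes 0..99 go through the builtin, everything else to the fallback. */
void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    switch (n) {
    C10(0) C10(10) C10(20) C10(30) C10(40)
    C10(50) C10(60) C10(70) C10(80) C10(90)
    default:
        return memcpy_fallback(dst, src, n);
    }
}
```

Something like `gcc -O2 -S variant/builtin.c -o variant/builtin.s` would
then produce the assembly file; the exact flags used by the Makefile in the
tarball are not stated here. Note that the switch itself is dispatch
overhead that a real call site with a known constant size would not pay.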

Ondra
> Michael
> 
> On 12 April 2013 12:54, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Thu, Apr 11, 2013 at 04:32:30PM +0400, Michael Zolotukhin wrote:
> >> > 128 is about the upper bound you can expand with SSE moves.
> >> > Tuning did not take code size into account and measured only code
> >> > running in a tight loop.
> >> > For GPR moves the limit is around 64.
> >> Thanks for the data - I've not performed measurements with this
> >> implementation yet, but we surely should adjust thresholds to avoid
> >> performance degradations on small sizes.
> >>
> >
> > I did some profiling of builtin implementation, download this
> > http://kam.mff.cuni.cz/~ondra/memcpy_profile_builtin.tar.bz2
> >
> > see files results_rand/result.html and results_rand_noicache/result.html
> >
> > memcpy_new_builtin calls the builtin for sizes x0,x1...x5 and
> > memcpy_new otherwise.
> > I did the same for memcpy_glibc to see the variance.
> >
> > memcpy_new does not call builtin.
> >
> > To regenerate the graphs on another arch, run the benchmarks script.
> > To use another builtin, change how the Makefile compiles the
> > variant/builtin.c file.
> >
> > Builtins are faster by the cost of the function call that inlining
> > avoids; I did not add that, as I do not know an estimate of this cost.
> >
> >> Michael
> >>
> >> On 10 April 2013 22:53, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> > On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
> >> >> > Hi, I am writing memcpy for libc. It avoids computed jumps and is
> >> >> > much faster on small strings (variant for Sandy Bridge attached).
> >> >>
> >> >> I'm not sure I get what you meant - could you please explain what
> >> >> computed jumps are?
> >> > Computed goto. See Duff's device; it works almost exactly the same.
> >> >>
> >> >> > You must also check performance with cold instruction cache.
> >> >> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >> >>
> >> >> > Do not align for small sizes. The dependency this causes erases any gains
> >> >> > that you might get. Keep in mind that in 55% of cases the data are already
> >> >> > aligned.
> >> >>
> >> >> Other algorithms are still available and we can use them for small
> >> >> sizes. E.g. for sizes <128 we could emit a loop with GPR moves and not
> >> >> use vector instructions in it.
> >> >
> >> > 128 is about the upper bound you can expand with SSE moves.
> >> > Tuning did not take code size into account and measured only code
> >> > running in a tight loop.
> >> > For GPR moves the limit is around 64.
> >> >
> >> > What matters is which code has the best performance/size ratio.
> >> >> But that's tuning and I haven't worked on it yet - I'm going to
> >> >> measure the performance of all algorithms on all sizes and thus determine
> >> >> which algorithm is preferable for which sizes.
> >> >> What I did in this patch is introduce some infrastructure to allow
> >> >> emitting vector moves in movmem expansion - tuning is certainly
> >> >> possible and needed, but that's out of the scope of the patch.
> >> >>
> >> >> On 10 April 2013 21:43, Ondřej Bílka <neleai@seznam.cz> wrote:
> >> >> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
> >> >> >> Hi,
> >> >> >> This patch adds a new algorithm for expanding movmem on x86 and
> >> >> >> refactors the existing implementation a bit. This is a reincarnation of
> >> >> >> a patch that was sent a couple of years ago but wasn't checked in - now
> >> >> >> I have reworked it from scratch and divided it into several more
> >> >> >> manageable parts.
> >> >> >>
> >> >> > Hi, I am writing memcpy for libc. It avoids computed jumps and is
> >> >> > much faster on small strings (variant for Sandy Bridge attached).
> >> >> >
> >> >> >> For now this algorithm isn't used, because cost_models are tuned to
> >> >> >> use existing ones. I believe the new algorithm will give better
> >> >> >> performance, but I'll leave cost-models tuning for a separate patch.
> >> >> >>
> >> >> > You must also check performance with cold instruction cache.
> >> >> > Now memcpy(x,y,128) takes 126 bytes which is too much.
> >> >> >
> >> >> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
> >> >> >> well. Probably, there is another way of getting info about alignment -
> >> >> >> if so, please let me know.
> >> >> >>
> >> >> > Do not align for small sizes. The dependency this causes erases any gains
> >> >> > that you might get. Keep in mind that in 55% of cases the data are already
> >> >> > aligned.
> >> >> >
> >> >> > Also, in my tests the best way to handle the prologue is to first
> >> >> > copy the last 16 bytes and then loop.
> >> >> >
> >> >> >> Similar improvements could be done in expanding of memset, but that's
> >> >> >> in progress now and I'm going to proceed with it if this patch is ok.
> >> >> >>
> >> >> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
> >> >> >>
> >> >> >> Is it ok for trunk?
> >> >> >>
> >> >> >> Changelog entry:
> >> >> >>
> >> >> >> 2013-04-10  Michael Zolotukhin  <michael.v.zolotukhin@gmail.com>
> >> >> >>
> >> >> >>         * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
> >> >> >>         * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
> >> >> >>         adjust_address instead of change_address to keep info about alignment.
> >> >> >>         (emit_strmov): Remove.
> >> >> >>         (emit_memmov): New function.
> >> >> >>         (expand_movmem_epilogue): Refactor to properly handle bigger sizes.
> >> >> >>         (expand_movmem_epilogue): Likewise and return updated rtx for
> >> >> >>         destination.
> >> >> >>         (expand_constant_movmem_prologue): Likewise and return updated rtx for
> >> >> >>         destination and source.
> >> >> >>         (decide_alignment): Refactor, handle vector_loop.
> >> >> >>         (ix86_expand_movmem): Likewise.
> >> >> >>         (ix86_expand_setmem): Likewise.
> >> >> >>         * config/i386/i386.opt (Enum): Add vector_loop to option stringop_alg.
> >> >> >>         * emit-rtl.c (get_mem_align_offset): Compute alignment for MEM_REF.
> >>
> >> --
> >> ---
> >> Best regards,
> >> Michael V. Zolotukhin,
> >> Software Engineer
> >> Intel Corporation.
> >
> > --
> >
> > Spider infestation in warm case parts
> 
> 
> 
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
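
The prologue advice quoted above (copy the last 16 bytes first, then loop)
can be sketched in C roughly as follows. This is only an illustration of
the idea, not code from the patch or from the libc implementation discussed
in this thread: the name copy_tail_first is made up, the buffers are assumed
not to overlap, and n is assumed to be at least 16. Each fixed-size 16-byte
memcpy is the kind of copy a compiler can typically lower to one unaligned
vector load and store, though that depends on the target and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n >= 16 bytes between non-overlapping buffers.
   The last 16 bytes are copied first with one block move; the loop then
   copies whole 16-byte blocks from the start. Any bytes the loop does not
   reach are already covered by the tail copy, so no size-dependent
   epilogue is needed. */
void *copy_tail_first(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n >= 16);

    /* Tail: the final 16 bytes, possibly overlapping the last loop block. */
    memcpy(d + n - 16, s + n - 16, 16);

    /* Body: full blocks that end strictly before the end of the buffer. */
    for (size_t i = 0; i + 16 < n; i += 16)
        memcpy(d + i, s + i, 16);

    return dst;
}
```

Because the tail block may overlap the last loop block, some bytes are
written twice; with non-overlapping source and destination that is harmless
and avoids any branch on n modulo 16.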

-- 

doppler effect
