public inbox for libc-alpha@sourceware.org
From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
To: libc-alpha@sourceware.org
Subject: Re: [PATCH 0/2] Multiarch hooks for memcpy variants
Date: Tue, 15 Aug 2017 20:14:00 -0000
Message-ID: <3812a37d-3377-087b-ab21-d270d6f8a834@oracle.com>
In-Reply-To: <a62a7627-1d1e-6a2b-8197-f4b16bfbdcc6@oracle.com>

Apologies for the multiple versions of this message (again).
I've got to learn to edit my complete message outside email since I can't
seem to fix my habit of typing "^S" (save text) every time
I complete a paragraph. Unfortunately, Thunderbird treats
^S as a send command.
- patrick

On 8/15/2017 3:10 PM, Patrick McGehearty wrote:
> On 8/14/2017 8:22 AM, Wilco Dijkstra wrote:
>> Siddhesh Poyarekar wrote:
>>> The first part is not true for falkor since its implementation is a good
>>> 10-15% faster on the falkor chip due to its design differences.  glibc
>>> makes pretty extensive use of memcpy throughout, but I don't have data
>>> on how much difference a core-specific memcpy will make there, so I
>>> don't have enough grounds for a generic change.
>> 66% of memcpy calls are <=16 bytes. Assuming you can even get a 15% gain
>> for these small sizes (there is very little you can do differently), that's at most 1
>> cycle faster, so the PLT indirection is going to be more expensive.
> It is important not to overemphasize the frequency of short memcpy
> calls.  Even though a high percentage of memcpy calls are short, my
> experience is that a high percentage of the time spent in memcpy is
> on longer copies.
>
> The following example is just that, an example, not an expression of
> any specific real application behavior: if 66% of calls are <=16
> bytes (average length=8, say) but the average length of the remaining
> 1/3 of calls is 1K bytes (i.e. > 100 times as long), then the vast
> majority of the time in memcpy would be spent in the longer copies.
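>
> To make that arithmetic concrete (these are the assumed numbers from
> the example above, not measured data):
>
>    short calls: 0.66 *    8 =   5.3 weighted bytes per call
>    long calls:  0.34 * 1024 = 348.2 weighted bytes per call
>
> So about 98.5% of all bytes copied (348.2 / 353.5) come from the
> longer calls, and the split of time is similar if the throughput per
> byte is roughly comparable.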
>
> My experience with tuning libc memcpy off and on across multiple
> platforms is that copies of length > 256 bytes are the ones that
> affect overall application performance.  Really short copies, where
> the length and/or alignment might be known at compile time, are best
> handled by inlining the copy.
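>
> As an illustration of that last point (a hypothetical fragment, not
> code from glibc or any real application):
>
>     #include <string.h>
>     #include <stdint.h>
>
>     struct hdr { uint64_t a, b; };   /* 16 bytes, 8-byte aligned */
>
>     void copy_hdr(struct hdr *dst, const struct hdr *src)
>     {
>         /* The length (16) and the alignment are known at compile
>          * time, so GCC/Clang lower this to two 8-byte load/store
>          * pairs instead of emitting a call through the PLT. */
>         memcpy(dst, src, sizeof *dst);
>     }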
>
> I've produced platform specific optimizations for memcpy many times
> over the years.  By platform specific, I mean different code for
> different generations/platforms of the same architecture.  These
> efforts showed improvements from as little as 10% to as much as 250%,
> depending on how close the memory architecture of the latest platform
> is to the prior platform.
>
> Typical factors that can influence the best memcpy performance on a
> specific platform for a given architecture include the following (see
> the prefetch sketch after this list):
> * ideal prefetch distance, which depends on processor speed,
>   cache/memory latency, depth of memory subsystem queues, details of
>   memory subsystem priorities for prefetch vs demand fetch, and more
> * number of alu operations that can be issued per cycle
> * number of memory operations that can be issued per cycle
> * number of total instructions that can be issued per cycle
> * branch misprediction latency, branch predictor behavior, and other
>   branch issues
> * many other architectural features which can make occasional
>   differences
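>
> To make the prefetch-distance knob concrete, here is a minimal
> sketch (illustrative only: PREFETCH_DISTANCE is a placeholder value,
> not a recommendation, and a real memcpy would also deal with
> alignment, unrolling, and short lengths):
>
>     #include <stddef.h>
>     #include <string.h>
>
>     /* Exactly the kind of per-platform constant discussed above. */
>     #define PREFETCH_DISTANCE 256
>
>     void copy_large(char *dst, const char *src, size_t n)
>     {
>         size_t i = 0;
>         for (; i + 64 <= n; i += 64) {
>             /* Request the cache line we will need a few iterations
>              * from now; how far ahead is the per-platform tuning. */
>             __builtin_prefetch(src + i + PREFETCH_DISTANCE, 0, 0);
>             memcpy(dst + i, src + i, 64);  /* inlined by the compiler */
>         }
>         memcpy(dst + i, src + i, n - i);   /* copy the tail */
>     }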
>
> I find it hard to imagine a single generic memcpy library routine
> that can match the performance of a platform specific tuned routine
> over a typical range of copy lengths, assuming the architecture has
> been around long enough to go through several semiconductor process
> redesigns.  With dynamic linking, the overhead of using platform
> specific code for something frequently called should be relatively
> minimal.
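>
> (For the mechanism: glibc picks the platform specific routine once,
> at relocation time, via an ifunc resolver, so later calls pay only
> the usual PLT indirection.  A minimal sketch of the GNU ifunc
> pattern, with stand-in implementations and a made-up CPU check:)
>
>     #include <string.h>
>
>     static void *memcpy_generic(void *d, const void *s, size_t n)
>     { return memcpy(d, s, n); }                 /* stand-in */
>
>     static void *memcpy_platform(void *d, const void *s, size_t n)
>     { return memcpy(d, s, n); }                 /* stand-in */
>
>     static int have_fast_copy(void) { return 0; }  /* made-up check */
>
>     /* The resolver runs once, during dynamic relocation. */
>     static void *(*resolve_memcpy(void))(void *, const void *, size_t)
>     {
>         return have_fast_copy() ? memcpy_platform : memcpy_generic;
>     }
>
>     void *my_memcpy(void *, const void *, size_t)
>         __attribute__((ifunc("resolve_memcpy")));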
>
> I do agree a good generic version should be available, as the effort
> of finding the best tuning for a particular platform can take weeks,
> and not all architecture/platform combinations will get that intense
> attention.
>
> - patrick mcgehearty
>
>>> Your last point about hurting everything else is very valid though; it's
>>> very likely that adding an extra indirection in cases where
>>> __memcpy_generic is going to be called anyway is going to be expensive
>>> given that the bulk of the memcpy calls will be for small sizes of less
>>> than 1k.
>> Note that the falkor version does quite well in memcpy-random across
>> several microarchitectures, so I think parts of it could be moved into
>> the generic code.
>>
>>> Allowing a PLT only for __memcpy_chk and mempcpy would need a test case
>>> waiver in check_localplt and that would become a blanket OK for PLT
>>> usage for memcpy, which we don't want.  Hence my patch is probably the
>>> best compromise, especially since there is precedent for the approach in
>>> x86.
>> I still can't see any reason to even support these entry points in GLIBC, let
>> alone optimize them using ifuncs. The _chk functions should obviously be
>> inlined to avoid all the target specific complexity for no benefit. I think this
>> could trivially be done via the GLIBC headers already. (That's assuming they
>> are in any way performance critical.)
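>>
>> (To illustrate the header approach: the existing _FORTIFY_SOURCE
>> machinery already does roughly this.  The sketch below is a
>> simplified stand-in, not glibc's actual header:)
>>
>>     #include <string.h>
>>
>>     /* Real glibc entry point: checks n against the destination
>>      * object size and aborts on overflow. */
>>     extern void *__memcpy_chk(void *, const void *, size_t, size_t);
>>
>>     static inline void *
>>     memcpy_fortified(void *dst, const void *src, size_t n)
>>     {
>>         /* __builtin_object_size yields the known destination size,
>>          * or (size_t)-1 when it cannot be determined, in which case
>>          * the checked entry point is skipped entirely. */
>>         size_t bos = __builtin_object_size(dst, 0);
>>         if (bos == (size_t)-1)
>>             return memcpy(dst, src, n);
>>         return __memcpy_chk(dst, src, n, bos);
>>     }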
>>
>> Wilco
>

