From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
To: Noah Goldstein <goldstein.w.n@gmail.com>,
	Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: GNU C Library <libc-alpha@sourceware.org>,
	"H.J. Lu" <hjl.tools@gmail.com>
Subject: Re: [PATCH v2] x86-64: Optimize bzero
Date: Wed, 23 Feb 2022 09:09:44 -0300
Message-ID: <1e8bdcf4-36c3-704b-5580-84ff9662d1da@linaro.org>
In-Reply-To: <CAFUsyfJKpM+SpEt5ShCU8Dfu2+sp-rQMgmHX_zBzpc-Scvg6Ww@mail.gmail.com>



On 23/02/2022 05:12, Noah Goldstein wrote:
> On Tue, Feb 15, 2022 at 7:38 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>
>> Hi,
>>
>>> Is there any way it can be set up so that one C impl can cover all the
>>> arches that want to just leave `__memsetzero` as an alias to `memset`?
>>> I know they have incompatible interfaces that make it hard, but would
>>> a weak static inline in string.h work?
>>
>> No, that won't work. A C implementation similar to the current string/bzero.c
>> adds unacceptable overhead (since most targets just implement memset and
>> will continue to do so). An inline function in string.h would introduce
>> target-specific hacks into our headers, something we've been working hard
>> to remove over the years.
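For concreteness, here is a rough sketch of the two generic fallbacks being
discussed: the static inline in string.h suggested above and a bzero.c-style
out-of-line wrapper. Both are hypothetical illustrations, not actual glibc code:

```c
/* Hypothetical illustrations of the two generic fallbacks discussed
   above; neither is actual glibc code.  */
#include <stddef.h>
#include <string.h>

/* Variant 1: an inline fallback in string.h, as suggested above.
   No extra call overhead, but it means carrying target knowledge
   in the installed headers.  */
static inline void
__memsetzero_inline (void *s, size_t n)
{
  memset (s, 0, n);
}

/* Variant 2: a generic C file similar to string/bzero.c.  It can only
   forward to memset, so on targets whose only tuned routine is memset
   every call pays for an extra (tail) call.  */
void
__memsetzero (void *s, size_t n)
{
  memset (s, 0, n);
}
```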
>>
>> The only reasonable option is a target-specific optimization in GCC and LLVM
>> so that memsetzero is only emitted when it is known that an optimized glibc
>> implementation exists (similar to mempcpy).
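To spell out the mempcpy analogy: user code would keep calling plain memset,
and the compiler would substitute the zeroing entry point only when it knows
the target libc provides one. A conceptual before/after (my own sketch; no
such transformation exists today, and __memsetzero is only the symbol
proposed in this thread):

```c
#include <stddef.h>
#include <string.h>

/* __memsetzero is the symbol proposed in this thread, declared here
   only for illustration.  */
extern void __memsetzero (void *, size_t);

/* What the programmer writes.  */
void
clear_before (void *p, size_t n)
{
  memset (p, 0, n);
}

/* What a target-specific GCC/LLVM lowering would emit instead, but only
   when targeting a glibc known to provide an optimized __memsetzero,
   analogous to the existing mempcpy special-casing.  */
void
clear_after (void *p, size_t n)
{
  __memsetzero (p, n);
}
```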
>>
>>> It's worth noting that, between the two, `memset` is the cold function
>>> and `__memsetzero` is the hot one. Based on profiles of GCC11 and
>>> Python3.7.7, setting zero covers 99%+ of cases.
>>
>> There is no doubt memset of zero is by far the most common. What is in doubt
>> is whether micro-optimizing is worth it on modern cores. Does Python speed up
>> by a measurable amount if you use memsetzero?
> 
> I ran a few benchmarks for GCC/Python3.7.
> 
> There is no measurable benefit from using '__memsetzero' in Python3.7.
> 
> For GCC there are some cases where there is a consistent speedup,
> though it's not universal.
> 
> Times are the geomean (N=30) of memsetzero / memset
> (1.0 means no difference, less than 1 means an improvement, greater than
> 1 a regression).
> 
>  Size, N Funcs,  Type, memsetzero / memset
> small,       1, bench,             0.99986
> small,       1, build,             0.99378
> small,       1,  link,             0.99241
> small,      10, bench,             0.99712
> small,      10, build,             0.99393
> small,      10,  link,             0.99245
> small,     100, bench,             0.99659
> small,     100, build,             0.99271
> small,     100,  link,             0.99227
> small,     250, bench,             1.00195
> small,     250, build,             0.99609
> small,     250,  link,             0.99744
> large,     N/A, bench,             0.99930
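To make the metric explicit: each entry in the table is the geometric mean,
over the 30 runs, of the per-run ratio memsetzero time / memset time, i.e.
something along these lines (my own sketch of the computation, not the actual
harness):

```c
#include <math.h>
#include <stddef.h>

/* Sketch of the reported metric: the geometric mean of per-run ratios
   time_with_memsetzero / time_with_memset over n runs.  A value below
   1.0 means __memsetzero was faster on average.  */
static double
geomean_ratio (const double *t_memsetzero, const double *t_memset, size_t n)
{
  double log_sum = 0.0;
  for (size_t i = 0; i < n; i++)
    log_sum += log (t_memsetzero[i] / t_memset[i]);
  return exp (log_sum / (double) n);
}
```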
> 
> 
> The "small" size means the file was filled with essentially empty
> functions, i.e.
> ```
> int foo(void) { return 0; }
> ```
> 
> N Funcs refers to the number of these functions per file, so small-250 would
> be 250 empty functions per file.
> 
> Bench recompiled the same file 100 times.
> Build compiled all the files.
> Link linked all the files with a main that emitted one call per function.
> 
> The "large" size was a realistic file someone might compile (in this case
> a freeze of sqlite3.c).
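As a rough idea of what the "small, N funcs" inputs look like, a hypothetical
generator (not the actual harness used for the numbers above) could be as
simple as:

```c
#include <stdio.h>

/* Hypothetical generator for the "small" test inputs described above:
   writes NFUNCS essentially-empty functions into one C file, e.g.
   small-250 = 250 such functions per file.  Not the actual harness
   behind the numbers quoted above.  */
int
main (void)
{
  enum { NFUNCS = 250 };
  FILE *f = fopen ("small-250.c", "w");
  if (f == NULL)
    return 1;
  for (int i = 0; i < NFUNCS; i++)
    fprintf (f, "int foo_%d (void) { return 0; }\n", i);
  return fclose (f) == 0 ? 0 : 1;
}
```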
> 
> The performance improvement in the build/link steps across varying numbers of
> small functions per file was consistently in the ~0.8% range. Not mind-blowing,
> but I believe it's a genuine improvement.
> 
> I don't think this shows expected GCC usage is going to be faster, but
> I do think it shows that the effects of this change could be noticeable in an
> application.
> 

I hardly consider this marginal improvement a good reason to add a libc
symbol, especially because it is unlikely that most architectures will ever
provide an optimized version of it (the aarch64 maintainer has said they do
not plan to provide one), and newer architecture extensions (such as s390
mvcle) or compiler optimizations (such as PGO or LTO) might remove the
function call altogether.

> NB: I'm not exactly certain why 'bench' doesn't follow the same trend
> as build/link.
> The only thing I notice is that 'bench' takes longer (it is implemented in a
> Makefile loop), so possibly the constant '+ c' overhead term just dampens any
> performance differences. The math for this doesn't work out 100%, so there is
> still reason to be a bit skeptical.
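To spell the dampening argument out with made-up numbers: if the memset-related
work t gets 0.8% faster but every run also pays a constant overhead c that
memsetzero cannot affect, the measured ratio (0.992*t + c) / (t + c) drifts
back toward 1 as c grows:

```c
#include <stdio.h>

/* Illustration with made-up numbers of how a constant overhead term c
   dampens an underlying 0.8% improvement in the memset-related work t:
   the measured ratio (0.992*t + c) / (t + c) approaches 1 as c grows.  */
int
main (void)
{
  const double t = 1.0;          /* memset-related work (arbitrary units) */
  const double speedup = 0.992;  /* underlying 0.8% improvement */
  const double cs[] = { 0.0, 1.0, 10.0 };
  for (int i = 0; i < 3; i++)
    {
      double c = cs[i];
      printf ("c = %4.1f -> measured ratio %.4f\n",
              c, (speedup * t + c) / (t + c));
    }
  return 0;
}
```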

