From: Noah Goldstein
Date: Wed, 23 Feb 2022 02:12:13 -0600
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Wilco Dijkstra
Cc: GNU C Library, Adhemerval Zanella, "H.J. Lu"

On Tue, Feb 15, 2022 at 7:38 AM Wilco Dijkstra wrote:
>
> Hi,
>
> > Is there any way it can be set up so that one C implementation can
> > cover all the arches that want to just leave `__memsetzero` as an
> > alias to `memset`? I know they have incompatible interfaces, which
> > makes that hard, but would a weak static inline in string.h work?
>
> No, that won't work. A C implementation similar to the current
> string/bzero.c adds unacceptable overhead (since most targets just
> implement memset and will continue to do so). An inline function in
> string.h would introduce target hacks into our headers, something
> we've been working hard to remove over the years.
>
> The only reasonable option is a target-specific optimization in GCC
> and LLVM so that __memsetzero is only emitted when it is known that an
> optimized glibc implementation exists (similar to mempcpy).
>
> > It's worth noting that, of the two, `memset` is the cold function
> > and `__memsetzero` is the hot one. Based on profiles of GCC 11 and
> > Python 3.7.7, setting to zero covers 99%+ of cases.
>
> There is no doubt memset of zero is by far the most common case. What
> is in doubt is whether micro-optimizing it is worth it on modern
> cores. Does Python speed up by a measurable amount if you use
> __memsetzero?
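(As an aside on the generic-fallback point above: to make the overhead
concrete, a C fallback along the lines of string/bzero.c would just
forward to memset. A minimal sketch is below; the memset-like signature
and the function name are illustrative assumptions, not the actual
patch.)

```
#include <string.h>

/* Hypothetical generic fallback in the spirit of string/bzero.c:
   forward to memset with a zero fill value.  A target that only
   provides an optimized memset would pay the extra call and argument
   shuffling on every use, which is the overhead in question.  */
void *
memsetzero_fallback (void *dest, size_t len)
{
  return memset (dest, 0, len);
}
```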
Ran a few benchmarks for GCC and Python 3.7. There is no measurable
benefit to using '__memsetzero' in Python 3.7. For GCC there are some
cases with a consistent speedup, though it's not universal.

Times are the geomean (N=30) of memsetzero / memset (1.0 means no
difference, less than 1 means improvement, greater than 1 means
regression).

Size,  N Funcs, Type,  memsetzero / memset
small, 1,       bench, 0.99986
small, 1,       build, 0.99378
small, 1,       link,  0.99241
small, 10,      bench, 0.99712
small, 10,      build, 0.99393
small, 10,      link,  0.99245
small, 100,     bench, 0.99659
small, 100,     build, 0.99271
small, 100,     link,  0.99227
small, 250,     bench, 1.00195
small, 250,     build, 0.99609
small, 250,     link,  0.99744
large, N/A,     bench, 0.99930

The "small" size means the file was filled with essentially empty
functions, e.g.:

```
int foo(void) { return 0; }
```

"N Funcs" is the number of these functions per file, so small-250 means
250 empty functions per file. The three benchmark types are:

bench: recompiled the same file 100x
build: compiled all the files
link:  linked all the files with a main that emitted one call per
       function

The "large" size was a realistic file someone might compile (in this
case a freeze of sqlite3.c).

The performance improvement for the build/link steps, across the varying
numbers of small functions per file, was consistently in the ~0.8%
range. Not mind-blowing, but I believe it's a genuine improvement. I
don't think this shows that expected GCC usage is going to be faster,
but I do think it shows the effects of this change could be noticeable
in an application.

NB: I'm not exactly certain why 'bench' doesn't follow the same trend as
build/link. The only thing I notice is that 'bench' takes longer (it's
implemented in a Makefile loop), so possibly the '+ c' term just dampens
any performance differences: if each timed iteration costs t + c for
some fixed loop overhead c, the measured ratio (t1 + c) / (t2 + c) sits
closer to 1 than t1 / t2 does. The math for this doesn't work out 100%,
so there is still a bit to be skeptical of.

> Cheers,
> Wilco
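P.S. For concreteness, the 'link' driver above is essentially a
generated main that calls each of the empty functions once. A
reconstructed sketch is below; the function names and the sum idiom are
illustrative, since the generated files themselves aren't reproduced in
this mail:

```
/* Declarations for the generated empty functions; the real benchmark
   has N of these per file (foo_0 .. foo_N-1 are hypothetical names).  */
int foo_0 (void);
int foo_1 (void);

int
main (void)
{
  int sum = 0;
  /* One call per generated function ("one call per function" above);
     accumulating the results keeps the calls from being elided.  */
  sum += foo_0 ();
  sum += foo_1 ();
  return sum;
}
```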