From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <AS8PR08MB65342CBD569FB0206F8522D883349@AS8PR08MB6534.eurprd08.prod.outlook.com>
 <CAFUsyfJKpM+SpEt5ShCU8Dfu2+sp-rQMgmHX_zBzpc-Scvg6Ww@mail.gmail.com>
 <AS8PR08MB6534F65DF1686E29E8939B87833C9@AS8PR08MB6534.eurprd08.prod.outlook.com>
 <CAFUsyfKjzNw_nda31yn2FHRqqSAoutxcaTwdNVEA1EVs8F8B8g@mail.gmail.com>
In-Reply-To: <CAFUsyfKjzNw_nda31yn2FHRqqSAoutxcaTwdNVEA1EVs8F8B8g@mail.gmail.com>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Thu, 24 Feb 2022 17:21:22 -0600
Message-ID: <CAFUsyfKPLxRdpFLUEAB4Dnxz+ex1diAS_nVCpC1idf_BeDwWqg@mail.gmail.com>
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: GNU C Library <libc-alpha@sourceware.org>, 
 Adhemerval Zanella <adhemerval.zanella@linaro.org>,
 "H.J. Lu" <hjl.tools@gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Feb 24, 2022 at 4:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Thu, Feb 24, 2022 at 7:16 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> >
> > Hi Noah,
> >
> > > The performance improvement for the build/link step for varying numbers of
> > > small functions per file was consistently in the ~0.8% range. Not mind-blowing,
> > > but I believe it's a genuine improvement.
> >
> > I don't see how it's possible to get anywhere near 0.8%. I tried compiling a file with
> > 10000 empty functions, and the latest __memset_evex_unaligned_erms takes about
> > 1.16% of total time.
>
> Smart sanity check I'll start using.
>
> What method are you using for getting total function call overhead?
>
> Using `perf record` + `perf report`, I see a fair amount of variance but much
> higher memset overhead (counting `_*unaligned_erms` and `*_unaligned`)
> in `cc1` and `as`.
>
> From the average of 3 runs compiling a file with 1/10/100/1000 functions, I get:
>
> 1: 4.04%
> 10: 3.94%
> 100: 2.86%
> 1000: 2.68%
>
> So it's slightly less insane, arguing for the following speedups:
>
> 1: ~15%
> 10: ~15%
> 100: ~25% <--- this makes little to no sense
> 1000: ~15%
>
> Personally, I agree with you that those numbers seem too high though.
>
> In the best case micro-benchmark that stressed the p5 bottleneck
> this is about what we see.
>
> >
> > There are 81.5 million calls to memset in 48 billion cycles for this benchmark. That
> > means 6.8 cycles per memset call on average. A 0.8% speedup would require making
> > each memset 4.7 cycles faster, and that's not possible with bzero.
> >
> > To verify whether vpbroadcastb is a bottleneck I repeated it 16 times. This increased
> > the memset percentage to 1.86%, however the total cycles didn't change measurably.
> >
> > I'm not sure how you're measuring this, but it's clear what you're seeing is not a
> > speedup from bzero.

Bah, you're right. The non-memsetzero GCC was choosing my system's
memset implementation (2.31 + avx2).

Sorry, I'll rerun tonight and post an update.
> >
> > Cheers,
> > Wilco