From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <goldstein.w.n@gmail.com>
Received: from mail-pg1-x533.google.com (mail-pg1-x533.google.com
 [IPv6:2607:f8b0:4864:20::533])
 by sourceware.org (Postfix) with ESMTPS id 252473858432
 for <libc-alpha@sourceware.org>; Thu, 24 Feb 2022 22:58:52 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 252473858432
Received: by mail-pg1-x533.google.com with SMTP id 12so3045259pgd.0
 for <libc-alpha@sourceware.org>; Thu, 24 Feb 2022 14:58:52 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc;
 bh=MjBd2ntlCNicC2OJiHACsVEMFgOo3d34VqD4ph9V+P0=;
 b=HvjxDvUDAlM72ODj4TXc+AbLgunXtJgvfYkCIVt26sp9HPYnJdwGlV5Ojt1xTNaP7J
 s+VN01UghcmK8Mimo9KZoiO3b/SlUB04FgeKBJaKIUrq+Py10Z3ll7Xkiy2tXDUsqrWk
 iTTBIL9wqkaL5hrqOKWtVgpGCFusZ9LWCUTSUhilPo3SYCMEqtn4Z1r04UTvwoLc4Pjl
 xB9bQ1NRam4EGUi5qPf0mZykxbFOk1WI9Vrtul8/0JkPY1jRMVCzYKsCfbYlfEztFvf0
 p8F2IaCfSTUIuMKpo5AsbbSZovWK8OQjsI1aD4uR9sfOViNRs77axmuBHpkXce1j5o67
 MJUA==
X-Gm-Message-State: AOAM530RmOf1Jl5RMPDn59ZD11UkRysX6bZ7edEF3cUjrH4TV3x0H/dG
 wkMjaL1RZUpfsTau1tMb6r12jCYMtQDUDvdTYpk=
X-Google-Smtp-Source: ABdhPJxzfCJ/AGrp6i+7vhBIXFesylrcVDg7DarT1KHMEbfpTkxsp44/uTJMsHk3JABXbt35e8Rk4BRiUCFA7It9Ne8=
X-Received: by 2002:a05:6a00:de:b0:4e0:ca1a:9f07 with SMTP id
 e30-20020a056a0000de00b004e0ca1a9f07mr4904199pfj.11.1645743531232; Thu, 24
 Feb 2022 14:58:51 -0800 (PST)
MIME-Version: 1.0
References: <AS8PR08MB65342CBD569FB0206F8522D883349@AS8PR08MB6534.eurprd08.prod.outlook.com>
 <CAFUsyfJKpM+SpEt5ShCU8Dfu2+sp-rQMgmHX_zBzpc-Scvg6Ww@mail.gmail.com>
 <AS8PR08MB6534F65DF1686E29E8939B87833C9@AS8PR08MB6534.eurprd08.prod.outlook.com>
In-Reply-To: <AS8PR08MB6534F65DF1686E29E8939B87833C9@AS8PR08MB6534.eurprd08.prod.outlook.com>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Thu, 24 Feb 2022 16:58:40 -0600
Message-ID: <CAFUsyfKjzNw_nda31yn2FHRqqSAoutxcaTwdNVEA1EVs8F8B8g@mail.gmail.com>
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: GNU C Library <libc-alpha@sourceware.org>, 
 Adhemerval Zanella <adhemerval.zanella@linaro.org>,
 "H.J. Lu" <hjl.tools@gmail.com>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 24 Feb 2022 22:58:53 -0000

On Thu, Feb 24, 2022 at 7:16 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Noah,
>
> > The performance improvement for the build/link step for varying numbers of
> > small functions per file was consistently in the ~0.8% range. Not mind-blowing,
> > but I believe it's a genuine improvement.
>
> I don't see how it's possible to get anywhere near 0.8%. I tried compiling a file with
> 10000 empty functions, and the latest __memset_evex_unaligned_erms takes about
> 1.16% of total time.
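
For anyone reproducing this, here is a minimal sketch of how such a test file
could be generated. The function names (`f0`, `f1`, ...) and the helper name
`make_empty_functions` are hypothetical; the thread does not say how the
original file was produced.

```python
def make_empty_functions(count: int) -> str:
    """Return C source text that defines `count` empty functions."""
    # One definition per line: void f0(void) {}, void f1(void) {}, ...
    lines = [f"void f{i}(void) {{}}" for i in range(count)]
    return "\n".join(lines) + "\n"


# Small preview; use count=10000 to approximate the test described above.
print(make_empty_functions(3), end="")
```

Writing the output for `count=10000` to a `.c` file and compiling it under
`perf record` should give a comparable workload.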

Smart sanity check; I'll start using it.

What method are you using for getting total function call overhead?

I'm using `perf record` + `perf report` and see a fair amount of variance, but
much higher memset overhead (counting `*_unaligned_erms` and `*_unaligned`)
in `cc1` and `as`.

Averaged over 3 runs compiling a file with 1/10/100/1000 functions, I get:

1: 4.04%
10: 3.94%
100: 2.86%
1000: 2.68%

So it's slightly less insane, arguing for the following speedups:

1: ~15%
10: ~15%
100: ~25% <--- this makes little to no sense
1000: ~15%

I personally agree with you that those numbers seem too high, though.

In the best-case micro-benchmark that stressed the p5 bottleneck,
this is about what we see.
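
To show how the two lists above relate, here is a rough sketch. It assumes
(my reading, not something measured separately) that each speedup figure is
the fractional reduction in memset time, so the whole-run gain is the memset
share times that speedup. The helper name `implied_total_gain` is made up
for illustration.

```python
# memset share of total time from perf, in percent, keyed by functions per file
memset_share_pct = {1: 4.04, 10: 3.94, 100: 2.86, 1000: 2.68}
# approximate memset speedups listed above, in percent (assumed to mean
# "memset time shrinks by this much")
memset_speedup_pct = {1: 15.0, 10: 15.0, 100: 25.0, 1000: 15.0}


def implied_total_gain(share_pct: float, speedup_pct: float) -> float:
    """Whole-run improvement (percent) if memset alone gets faster."""
    return share_pct * speedup_pct / 100.0


for n in sorted(memset_share_pct):
    gain = implied_total_gain(memset_share_pct[n], memset_speedup_pct[n])
    print(f"{n:>4} functions: ~{gain:.2f}% of total time")
```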

>
> There are 81.5 million calls to memset in 48 billion cycles for this benchmark. That
> means 6.8 cycles per memset call on average. A 0.8% speedup would require making
> each memset 4.7 cycles faster, and that's not possible with bzero.
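
For reference, a quick sketch of that arithmetic. It assumes the 6.8 cycles
figure comes from applying the 1.16% memset share quoted earlier to the total
cycle count; the helper name `cycles_per_call` is hypothetical.

```python
TOTAL_CYCLES = 48e9      # total cycles for the benchmark, as quoted
MEMSET_CALLS = 81.5e6    # number of memset calls, as quoted
MEMSET_SHARE = 0.0116    # 1.16% of total time (assumed basis for 6.8 cycles)
CLAIMED_GAIN = 0.008     # the ~0.8% overall improvement under discussion


def cycles_per_call(fraction_of_total: float) -> float:
    """Cycles per memset call that correspond to a fraction of total cycles."""
    return TOTAL_CYCLES * fraction_of_total / MEMSET_CALLS


print(f"average cost per memset call:  {cycles_per_call(MEMSET_SHARE):.1f} cycles")
print(f"saving needed for a 0.8% gain: {cycles_per_call(CLAIMED_GAIN):.1f} cycles")
```

Under that assumption the script reproduces the 6.8 and 4.7 cycle figures.
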
>
> To verify whether vpbroadcastb is a bottleneck I repeated it 16 times. This increased
> the memset percentage to 1.86%, however the total cycles didn't change measurably.
>
> I'm not sure how you're measuring this, but it's clear what you're seeing is not a
> speedup from bzero.
>
> Cheers,
> Wilco