From: Noah Goldstein
Date: Thu, 24 Feb 2022 16:58:40 -0600
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Wilco Dijkstra
Cc: GNU C Library, Adhemerval Zanella, "H.J. Lu"

On Thu, Feb 24, 2022 at 7:16 AM Wilco Dijkstra wrote:
>
> Hi Noah,
>
> > The performance improvement for the build/link step for varying amount of
> > small functions per file was consistently in the ~.8% range. Not mind blowing
> > but I believe its a genuine improvement.
>
> I don't see how it's possible to get anywhere near 0.8%. I tried compiling a file with
> 10000 empty functions, and the latest __memset_exex_unaligned_erms takes about
> 1.16% of total time.

Smart sanity check, I'll start using it. What method are you using to get the
total function call overhead?

I'm using `perf record` + `perf report` and see a fair amount of variance, but
much higher memset overhead (counting `_*unaligned_erms` and `*_unaligned`)
in `cc1` and `as`.

From the average of 3 runs compiling a file with 1/10/100/1000 functions I get:

1: 4.04%
10: 3.94%
100: 2.86%
1000: 2.68%

So it's slightly less insane, arguing for the following speedups:

1: ~15%
10: ~15%
100: ~25% <--- this makes little to no sense
1000: ~15%

I personally agree with you that those numbers seem too high, though. In the
best-case micro-benchmark that stressed the p5 bottleneck this is about what
we see.

>
> There are 81.5 million calls to memset in 48 billion cycles for this benchmark. That
> means 6.8 cycles per memset call on average.
> A 0.8% speedup would require making
> each memset 4.7 cycles faster, and that's not possible with bzero.
>
> To verify whether vpbroadcastb is a bottleneck I repeated it 16 times. This increased
> the memset percentage to 1.86%, however the total cycles didn't change measurably.
>
> I'm not sure how you're measuring this, but it's clear what you're seeing is not a
> speedup from bzero.
>
> Cheers,
> Wilco
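
For reference, the per-call arithmetic in the quoted text can be rechecked with a short script. This is only a sketch built from the figures in the thread (48 billion cycles, 81.5 million memset calls, 1.16% memset share, ~0.8% claimed improvement). The function and variable names are illustrative and not part of any glibc tooling, and it assumes the 1.16% share was measured on the same 48-billion-cycle run.

```python
# Figures quoted in the thread (compile of a file with 10000 empty functions).
TOTAL_CYCLES = 48e9      # total cycles for the benchmark
MEMSET_CALLS = 81.5e6    # number of memset calls
MEMSET_SHARE = 0.0116    # memset share of total time (1.16%)
CLAIMED_SPEEDUP = 0.008  # claimed end-to-end improvement (~0.8%)


def cycles_per_call(total_cycles, calls, share):
    """Average cycles spent in each call, given its share of total time."""
    return total_cycles * share / calls


def saving_needed_per_call(total_cycles, calls, speedup):
    """Cycles each call must save for the whole run to get `speedup` faster."""
    return total_cycles * speedup / calls


per_call = cycles_per_call(TOTAL_CYCLES, MEMSET_CALLS, MEMSET_SHARE)
needed = saving_needed_per_call(TOTAL_CYCLES, MEMSET_CALLS, CLAIMED_SPEEDUP)

print(f"average cost per memset call: {per_call:.1f} cycles")
print(f"saving needed per call for a 0.8% win: {needed:.1f} cycles")
print(f"share of memset time that would have to disappear: {needed / per_call:.0%}")
```

Under those assumptions, a 0.8% overall gain would mean removing roughly two thirds of the time spent in each memset call, which is the mismatch the quoted text points at.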