From: Noah Goldstein
Date: Thu, 24 Feb 2022 17:21:22 -0600
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Wilco Dijkstra
Cc: GNU C Library, Adhemerval Zanella, "H.J. Lu"
List-Id: Libc-alpha mailing list

On Thu, Feb 24, 2022 at 4:58 PM Noah Goldstein wrote:
>
> On Thu, Feb 24, 2022 at 7:16 AM Wilco Dijkstra wrote:
> >
> > Hi Noah,
> >
> > > The performance improvement for the build/link step for varying amount of
> > > small functions per file was consistently in the ~0.8% range. Not mind blowing
> > > but I believe it's a genuine improvement.
> >
> > I don't see how it's possible to get anywhere near 0.8%. I tried compiling a file with
> > 10000 empty functions, and the latest __memset_evex_unaligned_erms takes about
> > 1.16% of total time.
>
> Smart sanity check; I'll start using it.
>
> What method are you using for getting total function call overhead?
>
> I'm using `perf record` + `perf report` and see a fair amount of variance, but
> much higher memset overhead (counting `*_unaligned_erms` and `*_unaligned`)
> in `cc1` and `as`.
>
> From the average of 3 runs compiling a file with 1/10/100/1000 functions I get:
>
> 1: 4.04%
> 10: 3.94%
> 100: 2.86%
> 1000: 2.68%
>
> So it's slightly less insane, arguing for the following speedups:
>
> 1: ~15%
> 10: ~15%
> 100: ~25% <--- this makes little to no sense
> 1000: ~15%
>
> I personally agree with you that those numbers seem too high, though.
>
> In the best case micro-benchmark that stressed the p5 bottleneck
> this is about what we see.
>
> >
> > There are 81.5 million calls to memset in 48 billion cycles for this benchmark.
> > That means 6.8 cycles per memset call on average. A 0.8% speedup would require
> > making each memset 4.7 cycles faster, and that's not possible with bzero.
> >
> > To verify whether vpbroadcastb is a bottleneck I repeated it 16 times. This increased
> > the memset percentage to 1.86%, however the total cycles didn't change measurably.
> >
> > I'm not sure how you're measuring this, but it's clear what you're seeing is not a
> > speedup from bzero.

Bah, you're right. The non-memsetzero GCC was choosing my system's memset
implementation (2.31 + avx2).

Sorry, I'll rerun tonight and post an update.

> >
> > Cheers,
> > Wilco
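
For reference, the cycle arithmetic in Wilco's sanity check can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the 1.16% memset share, the 81.5 million calls and the 48 billion total cycles all come from the same run of the 10000-empty-function compile.

```python
def cycles_per_call(total_cycles, calls, memset_share):
    """Average cycles spent in memset per call, given memset's share of all cycles."""
    return total_cycles * memset_share / calls


def saving_per_call_needed(total_cycles, calls, overall_speedup):
    """Cycles each memset call must get faster to yield the given overall speedup."""
    return total_cycles * overall_speedup / calls


# Figures quoted in the thread (assumed to describe one and the same run).
TOTAL_CYCLES = 48e9      # total cycles for the benchmark
MEMSET_CALLS = 81.5e6    # number of memset calls
MEMSET_SHARE = 0.0116    # memset takes about 1.16% of total time
CLAIMED_SPEEDUP = 0.008  # the ~0.8% build/link improvement under discussion

avg = cycles_per_call(TOTAL_CYCLES, MEMSET_CALLS, MEMSET_SHARE)
need = saving_per_call_needed(TOTAL_CYCLES, MEMSET_CALLS, CLAIMED_SPEEDUP)

print(f"average cycles per memset call: {avg:.1f}")
print(f"saving per call needed for 0.8%: {need:.1f}")
print(f"fraction of each call that would have to disappear: {need / avg:.0%}")
```

The last line expresses the same point as a ratio: the saving needed per call is the claimed overall speedup divided by memset's share of total time, so a small share makes a 0.8% overall gain implausible unless almost the whole call is eliminated.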