From: Noah Goldstein
Date: Wed, 23 Feb 2022 02:12:13 -0600
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Wilco Dijkstra
Cc: GNU C Library, Adhemerval Zanella, "H.J. Lu"

On Tue, Feb 15, 2022 at 7:38 AM Wilco Dijkstra wrote:
>
> Hi,
>
> > Is there any way it can be set up so that one C implementation can
> > cover all the arches that want to just leave `__memsetzero` as an
> > alias to `memset`? I know they have incompatible interfaces, which
> > makes that hard, but would a weak static inline in string.h work?
>
> No, that won't work. A C implementation similar to the current
> string/bzero.c adds unacceptable overhead (since most targets just
> implement memset and will continue to do so). An inline function in
> string.h would introduce target hacks into our headers, something
> we've been working hard to remove over the years.
>
> The only reasonable option is a target-specific optimization in GCC
> and LLVM so that __memsetzero is only emitted when it is known that an
> optimized glibc implementation exists (similar to mempcpy).
>
> > It's worth noting that, of the two, `memset` is the cold function
> > and `__memsetzero` is the hot one. Based on profiles of GCC 11 and
> > Python 3.7.7, setting to zero covers 99%+ of cases.
>
> There is no doubt memset of zero is by far the most common case. What
> is in doubt is whether micro-optimizing it is worth it on modern
> cores. Does Python speed up by a measurable amount if you use
> __memsetzero?
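(As an aside on the generic-fallback point above: to make the overhead
concrete, a C fallback along the lines of string/bzero.c would just
forward to memset. A minimal sketch is below; the memset-like signature
and the function name are illustrative assumptions, not the actual
patch.)

```
#include <string.h>

/* Hypothetical generic fallback in the spirit of string/bzero.c:
   forward to memset with a zero fill value.  A target that only
   provides an optimized memset would pay the extra call and argument
   shuffling on every use, which is the overhead in question.  */
void *
memsetzero_fallback (void *dest, size_t len)
{
  return memset (dest, 0, len);
}
```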
Ran a few benchmarks for GCC and Python 3.7. There is no measurable
benefit to using '__memsetzero' in Python 3.7. For GCC there are some
cases with a consistent speedup, though it's not universal.

Times are the geomean (N=30) of memsetzero / memset (1.0 means no
difference, less than 1 means improvement, greater than 1 means
regression).

Size,  N Funcs, Type,  memsetzero / memset
small, 1,       bench, 0.99986
small, 1,       build, 0.99378
small, 1,       link,  0.99241
small, 10,      bench, 0.99712
small, 10,      build, 0.99393
small, 10,      link,  0.99245
small, 100,     bench, 0.99659
small, 100,     build, 0.99271
small, 100,     link,  0.99227
small, 250,     bench, 1.00195
small, 250,     build, 0.99609
small, 250,     link,  0.99744
large, N/A,     bench, 0.99930

The "small" size means the file was filled with essentially empty
functions, e.g.:

```
int foo(void) { return 0; }
```

"N Funcs" is the number of these functions per file, so small-250 means
250 empty functions per file. The three benchmark types are:

bench: recompiled the same file 100x
build: compiled all the files
link:  linked all the files with a main that emitted one call per
       function

The "large" size was a realistic file someone might compile (in this
case a freeze of sqlite3.c).

The performance improvement for the build/link steps, across the varying
numbers of small functions per file, was consistently in the ~0.8%
range. Not mind-blowing, but I believe it's a genuine improvement. I
don't think this shows that expected GCC usage is going to be faster,
but I do think it shows the effects of this change could be noticeable
in an application.

NB: I'm not exactly certain why 'bench' doesn't follow the same trend as
build/link. The only thing I notice is that 'bench' takes longer (it's
implemented in a Makefile loop), so possibly the '+ c' term just dampens
any performance differences: if each timed iteration costs t + c for
some fixed loop overhead c, the measured ratio (t1 + c) / (t2 + c) sits
closer to 1 than t1 / t2 does. The math for this doesn't work out 100%,
so there is still a bit to be skeptical of.

> Cheers,
> Wilco
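P.S. For concreteness, the 'link' driver above is essentially a
generated main that calls each of the empty functions once. A
reconstructed sketch is below; the function names and the sum idiom are
illustrative, since the generated files themselves aren't reproduced in
this mail:

```
/* Declarations for the generated empty functions; the real benchmark
   has N of these per file (foo_0 .. foo_N-1 are hypothetical names).  */
int foo_0 (void);
int foo_1 (void);

int
main (void)
{
  int sum = 0;
  /* One call per generated function ("one call per function" above);
     accumulating the results keeps the calls from being elided.  */
  sum += foo_0 ();
  sum += foo_1 ();
  return sum;
}
```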