Message-ID: <1e8bdcf4-36c3-704b-5580-84ff9662d1da@linaro.org>
Date: Wed, 23 Feb 2022 09:09:44 -0300
From: Adhemerval Zanella
To: Noah Goldstein, Wilco Dijkstra
Cc: GNU C Library, "H.J. Lu"
Subject: Re: [PATCH v2] x86-64: Optimize bzero
List-Id: Libc-alpha mailing list

On 23/02/2022 05:12, Noah Goldstein wrote:
> On Tue, Feb 15, 2022 at 7:38 AM Wilco Dijkstra wrote:
>>
>> Hi,
>>
>>> Is there any way it can be set up so that one C impl can cover all the
>>> arch that want to just leave `__memsetzero` as an alias to `memset`?
>>> I know they have incompatible interfaces that make it hard but would
>>> a weak static inline in string.h work?
>>
>> No, that won't work. A C implementation similar to the current
>> string/bzero.c adds unacceptable overhead (since most targets just
>> implement memset and will continue to do so). An inline function in
>> string.h would introduce target hacks in our headers, something we've
>> been working hard to remove over the years.
>>
>> The only reasonable option is a target-specific optimization in GCC and
>> LLVM so that memsetzero is only emitted when it is known an optimized
>> GLIBC implementation exists (similar to mempcpy).
>>
>>> It's worth noting that between the two `memset` is the cold function
>>> and `__memsetzero` is the hot one. Based on profiles of GCC11 and
>>> Python3.7.7, setting zero covers 99%+ of cases.
>>
>> There is no doubt memset of zero is by far the most common. What is in
>> doubt is whether micro-optimizing is worth it on modern cores. Does
>> Python speed up by a measurable amount if you use memsetzero?
>
> Ran a few benchmarks for GCC/Python3.7.
>
> There is no measurable benefit using '__memsetzero' in Python3.7.
>
> For GCC there are some cases where there is a consistent speedup,
> though it's not universal.
>
> Times are geomean (N=30) of memsetzero / memset
> (1.0 means no difference, less than 1 means improvement, greater than
> 1 regression).
>
> Size,  N Funcs, Type,  memsetzero / memset
> small, 1,       bench, 0.99986
> small, 1,       build, 0.99378
> small, 1,       link,  0.99241
> small, 10,      bench, 0.99712
> small, 10,      build, 0.99393
> small, 10,      link,  0.99245
> small, 100,     bench, 0.99659
> small, 100,     build, 0.99271
> small, 100,     link,  0.99227
> small, 250,     bench, 1.00195
> small, 250,     build, 0.99609
> small, 250,     link,  0.99744
> large, N/A,     bench, 0.99930
>
> The "small" size basically means the file was filled with essentially
> empty functions, i.e.
> ```
> int foo(void) { return 0; }
> ```
>
> N Funcs refers to the number of these functions per file, so small-250
> would be 250 empty functions per file.
>
> Bench recompiled the same file 100x times.
> Build compiled all the files.
> Link linked all the files with a main that emitted 1x call per function.
>
> The "large" size was a realistic file someone might compile (in this
> case a freeze of sqlite3.c).
>
> The performance improvement for the build/link step for varying amounts
> of small functions per file was consistently in the ~0.8% range. Not
> mind-blowing, but I believe it's a genuine improvement.
>
> I don't think this shows expected GCC usage is going to be faster, but
> I do think it shows that the effects of this change could be noticeable
> in an application.

I hardly consider this marginal improvement a good reason to add a libc
symbol, especially because it is unlikely that most architectures will
ever provide an optimized version (the aarch64 maintainer said they are
not looking forward to it), and newer architecture extensions (such as
s390 mvcle) or compiler optimizations (such as PGO or LTO) might remove
the function call altogether.

> NB: I'm not exactly certain why 'bench' doesn't follow the same trend
> as build/link.
> The only thing I notice is that 'bench' takes longer (implemented in a
> Makefile loop), so possibly the '+ c' term just dampens any performance
> differences.
> The math for this doesn't work out 100%, so there is a bit to still be
> skeptical of.