From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi1-x235.google.com (mail-oi1-x235.google.com [IPv6:2607:f8b0:4864:20::235]) by sourceware.org (Postfix) with ESMTPS id 25FF53856DF1 for ; Wed, 4 May 2022 14:54:16 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 25FF53856DF1 Received: by mail-oi1-x235.google.com with SMTP id e189so1407419oia.8 for ; Wed, 04 May 2022 07:54:16 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent:subject :content-language:to:cc:references:from:in-reply-to :content-transfer-encoding; bh=Lr09kvAHLtUXcRPrHHR+2cjNAWM/hifNY7N/5eL7kz4=; b=pe9uMsiql2FjOys2m2/mWdlBAXdQMpSWA+DAWh1V7nG4EhoUXN6H86jQnsdae6whMt OOXZYMaDeWDYDsQfzorpwfLrp8yNCdoJTmWa1RIPsTuZ+8RmfOQrzPmDjhK+mvS2zNIz WlOiXqF5lJmqf25XMstUzfEpkHWvu0lAlKm4oGBBl3Tr2HKQp9faI4AWhbpAu/1c8zus Bb/PRF35xuVTK6ipZdtle3f9Tj2j8zT+C6fXmb2dAdCGLVOPFjvmmBHZkjevgHfU1Dpo j6NmCON6fKmfVwaQ42KGuANZnuZMY9xeBcBNlQxe8Pyt80FF3zwfVXnJUln363NBpTRh C1/Q== X-Gm-Message-State: AOAM532ICvv5OA+rikPBWlTmz10CZo3oMZaU9R3110b8M/l0f726AMiU 9WYpiOMxfUbBLPb8zmc3eCTCqQ== X-Google-Smtp-Source: ABdhPJxP0H8EQjSRIsPRDZLTlsKUaufd/8JuzGks3iyBEv8MBzgha9XotHEa5Js4UaLqHtky8z6xZw== X-Received: by 2002:a05:6808:124d:b0:325:788d:e23d with SMTP id o13-20020a056808124d00b00325788de23dmr4099093oiv.267.1651676055296; Wed, 04 May 2022 07:54:15 -0700 (PDT) Received: from ?IPV6:2804:431:c7cb:726:3ae8:3076:1dad:37? ([2804:431:c7cb:726:3ae8:3076:1dad:37]) by smtp.gmail.com with ESMTPSA id t21-20020a0568080b3500b00325643bce40sm4431492oij.0.2022.05.04.07.54.13 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 04 May 2022 07:54:14 -0700 (PDT) Message-ID: <14e0c415-0fda-c645-067f-9f7e85e1bb69@linaro.org> Date: Wed, 4 May 2022 11:54:12 -0300 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.0 Subject: Re: [PATCH v2] x86-64: Optimize bzero Content-Language: en-US To: "H.J. Lu" Cc: Sunil Pandey , Libc-stable Mailing List , GNU C Library References: <20220208224319.40271-1-hjl.tools@gmail.com> <1f75bda3-9e89-6860-a042-ef0406b072c1@linaro.org> <78cdba88-9e00-798a-846b-f0f77559bfd5@gmail.com> <0efdd4fe-4e35-cf1d-5731-13ed1c046cc6@oracle.com> <1ea64f9f-6ce8-5409-8b56-02f7481526d9@linaro.org> <1f5d5e63-f79b-9fc6-0f35-77d4abed7480@linaro.org> From: Adhemerval Zanella In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-7.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-stable@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-stable mailing list List-Unsubscribe: , List-Archive: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 04 May 2022 14:54:18 -0000 On 04/05/2022 11:50, H.J. Lu wrote: > On Wed, May 4, 2022 at 5:52 AM Adhemerval Zanella > wrote: >> >> >> >> On 04/05/2022 03:35, Sunil Pandey wrote: >>> On Mon, Feb 14, 2022 at 7:04 AM H.J. Lu via Libc-alpha >>> wrote: >>>> >>>> On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella via Libc-alpha >>>> wrote: >>>>> >>>>> >>>>> >>>>> On 14/02/2022 09:41, Noah Goldstein wrote: >>>>>> On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 12/02/2022 20:46, Noah Goldstein wrote: >>>>>>>> On Fri, Feb 11, 2022 at 7:01 AM Adhemerval Zanella via Libc-alpha >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 10/02/2022 18:07, Patrick McGehearty via Libc-alpha wrote: >>>>>>>>>> Just as another point of information, Solaris libc implemented >>>>>>>>>> bzero as moving arguments around appropriately then jumping to >>>>>>>>>> memset. Noone noticed enough to file a complaint. Of course, >>>>>>>>>> short fixed-length bzero was handled with in line stores of zero >>>>>>>>>> by the compiler. For long vector bzeroing, the overhead was >>>>>>>>>> negligible. >>>>>>>>>> >>>>>>>>>> When certain Sparc hardware implementations provided faster methods >>>>>>>>>> for zeroing a cache line at a time on cache line boundaries, >>>>>>>>>> memset added a single test for zero ifandonlyif the length of code >>>>>>>>>> to memset was over a threshold that seemed likely to make it >>>>>>>>>> worthwhile to use the faster method. The principal advantage >>>>>>>>>> of the fast zeroing operation is that it did not require data >>>>>>>>>> to move from memory to cache before writing zeros to memory, >>>>>>>>>> protecting cache locality in the face of large block zeroing. >>>>>>>>>> I was responsible for much of that optimization effort. >>>>>>>>>> Whether that optimization was really worth it is open for debate >>>>>>>>>> for a variety of reasons that I won't go into just now. >>>>>>>>> >>>>>>>>> Afaik this is pretty much what optimized memset implementations >>>>>>>>> does, if architecture allows it. For instance, aarch64 uses >>>>>>>>> 'dc zva' for sizes larger than 256 and powerpc uses dcbz with a >>>>>>>>> similar strategy. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Apps still used bzero or memset(target,zero,length) according to >>>>>>>>>> their preferences, but the code was unified under memset. >>>>>>>>>> >>>>>>>>>> I am inclined to agree with keeping bzero in the API for >>>>>>>>>> compatibility with old code/old binaries/old programmers. :-) >>>>>>>>> >>>>>>>>> The main driver to remove the bzero internal implementation is just >>>>>>>>> the *currently* gcc just do not generate bzero calls as default >>>>>>>>> (I couldn't find a single binary that calls bzero in my system). >>>>>>>> >>>>>>>> Does it make sense then to add '__memsetzero' so that we can have >>>>>>>> a function optimized for setting zero? >>>>>>> >>>>>>> Will it be really a huge gain instead of a microoptimization that will >>>>>>> just a bunch of more ifunc variants along with the maintenance cost >>>>>>> associated with this? >>>>>> Is there any way it can be setup so that one C impl can cover all the >>>>>> arch that want to just leave `__memsetzero` as an alias to `memset`? >>>>>> I know they have incompatible interfaces that make it hard but would >>>>>> a weak static inline in string.h work? >>>>>> >>>>>> For some of the shorter control flows (which are generally small sizes >>>>>> and very hot) we saw reasonable benefits on x86_64. >>>>>> >>>>>> The most significant was the EVEX/AVX2 [32, 64] case where it net >>>>>> us ~25% throughput. This is a pretty hot set value so it may be worth it. >>>>> >>>>> With different prototypes and semantics we won't be able to define an >>>>> alias. What we used to do, but we move away in recent version, was to >>>>> define static inline function that glue the two function if optimization >>>>> is set. >>>> >>>> I have >>>> >>>> /* NB: bzero returns void and __memsetzero returns void *. */ >>>> asm (".weak bzero"); >>>> asm ("bzero = __memsetzero"); >>>> asm (".global __bzero"); >>>> asm ("__bzero = __memsetzero"); >>>> >>>>>> >>>>>>> >>>>>>> My understanding is __memsetzero would maybe yield some gain in the >>>>>>> store mask generation (some architecture might have a zero register >>>>>>> or some instruction to generate one), however it would require to >>>>>>> use the same strategy as memset to use specific architecture instruction >>>>>>> that optimize cache utilization (dc zva, dcbz). >>>>>>> >>>>>>> So it would mostly require a lot of arch-specific code to to share >>>>>>> the memset code with __memsetzero (to avoid increasing code size), >>>>>>> so I am not sure if this is really a gain in the long term. >>>>>> >>>>>> It's worth noting that between the two `memset` is the cold function >>>>>> and `__memsetzero` is the hot one. Based on profiles of GCC11 and >>>>>> Python3.7.7 setting zero covers 99%+ cases. >>>>> >>>>> This is very workload specific and I think with more advance compiler >>>>> optimization like LTO and PGO such calls could most likely being >>>>> optimized by the compiler itself (either by inline or by create a >>>>> synthetic function to handle it). >>>>> >>>>> What I worried is such symbols might ended up as the AEBI memcpy variants >>>>> that was added as way to optimize when alignment is know to be multiple >>>>> of words, but it ended up not being implemented and also not being generated >>>>> by the compiler (at least not gcc). >>>> >>>> >>>> >>>> -- >>>> H.J. >>> >>> I would like to backport this patch to release branches. >>> Any comments or objections? >> >> Nothing really against, but as previous discussion we had on this maillist optimizing >> bzero does not yield much gain compared to memset (compiler won't generate libcall >> for loop transformation, among other shortcomings). My idea is to follow other >> architecture and just remove all x86_64 optimizations. > > We'd like to reduce the differences between master and release branches to help > future backports to release branches. > Ok, fair enough.