From: Adhemerval Zanella
Date: Wed, 4 May 2022 09:52:34 -0300
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Sunil Pandey, "H.J. Lu", Libc-stable Mailing List
Cc: GNU C Library

On 04/05/2022 03:35, Sunil Pandey wrote:
> On Mon, Feb 14, 2022 at 7:04 AM H.J. Lu via Libc-alpha wrote:
>>
>> On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella via Libc-alpha wrote:
>>>
>>> On 14/02/2022 09:41, Noah Goldstein wrote:
>>>> On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella wrote:
>>>>>
>>>>> On 12/02/2022 20:46, Noah Goldstein wrote:
>>>>>> On Fri, Feb 11, 2022 at 7:01 AM Adhemerval Zanella via Libc-alpha wrote:
>>>>>>>
>>>>>>> On 10/02/2022 18:07, Patrick McGehearty via Libc-alpha wrote:
>>>>>>>> Just as another point of information, Solaris libc implemented
>>>>>>>> bzero by moving the arguments around appropriately and then
>>>>>>>> jumping to memset.  No one noticed enough to file a complaint.
>>>>>>>> Of course, short fixed-length bzero was handled with inline
>>>>>>>> stores of zero by the compiler.  For long vector bzeroing, the
>>>>>>>> overhead was negligible.
>>>>>>>>
>>>>>>>> When certain SPARC hardware implementations provided faster
>>>>>>>> methods for zeroing a cache line at a time on cache line
>>>>>>>> boundaries, memset added a single test for zero, if and only if
>>>>>>>> the length passed to memset was over a threshold that seemed
>>>>>>>> likely to make it worthwhile to use the faster method.  The
>>>>>>>> principal advantage of the fast zeroing operation is that it did
>>>>>>>> not require data to move from memory to cache before writing
>>>>>>>> zeros to memory, protecting cache locality in the face of large
>>>>>>>> block zeroing.  I was responsible for much of that optimization
>>>>>>>> effort.  Whether that optimization was really worth it is open
>>>>>>>> to debate for a variety of reasons that I won't go into just now.
>>>>>>>
>>>>>>> Afaik this is pretty much what optimized memset implementations
>>>>>>> do, if the architecture allows it.  For instance, aarch64 uses
>>>>>>> 'dc zva' for sizes larger than 256 and powerpc uses dcbz with a
>>>>>>> similar strategy.
>>>>>>>
>>>>>>>> Apps still used bzero or memset(target,zero,length) according to
>>>>>>>> their preferences, but the code was unified under memset.
>>>>>>>>
>>>>>>>> I am inclined to agree with keeping bzero in the API for
>>>>>>>> compatibility with old code/old binaries/old programmers. :-)
>>>>>>>
>>>>>>> The main driver to remove the bzero internal implementation is
>>>>>>> just that *currently* gcc does not generate bzero calls by
>>>>>>> default (I couldn't find a single binary that calls bzero on my
>>>>>>> system).
>>>>>>
>>>>>> Does it make sense then to add '__memsetzero' so that we can have
>>>>>> a function optimized for setting zero?
>>>>>
>>>>> Will it really be a huge gain, or just a micro-optimization that
>>>>> adds a bunch more ifunc variants along with the maintenance cost
>>>>> associated with them?
>>>>
>>>> Is there any way it can be set up so that one C impl can cover all
>>>> the arches that want to just leave `__memsetzero` as an alias to
>>>> `memset`?  I know they have incompatible interfaces that make it
>>>> hard, but would a weak static inline in string.h work?
>>>>
>>>> For some of the shorter control flows (which are generally small
>>>> sizes and very hot) we saw reasonable benefits on x86_64.
>>>>
>>>> The most significant was the EVEX/AVX2 [32, 64] case, where it
>>>> netted us ~25% throughput.  This is a pretty hot set value, so it
>>>> may be worth it.
>>>
>>> With different prototypes and semantics we won't be able to define
>>> an alias.  What we used to do, but moved away from in recent
>>> versions, was to define a static inline function that glues the two
>>> functions together when optimization is enabled.
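
As an aside, since the weak static inline question came up above: the
glue meant here was essentially a header-level mapping of bzero onto
memset, only active when optimizing.  A minimal sketch of the idea
(illustrative only, not the exact header code that used to ship):

  /* Illustrative sketch, not the actual glibc header: rewrite
     bzero (s, n) into memset (s, 0, n) at the call site when
     optimizing, so no separately optimized symbol is needed.  */
  #if defined __OPTIMIZE__
  # define bzero(s, n) ((void) memset ((s), '\0', (n)))
  #endif

The translation happens entirely in the caller, so no new exported
symbol is involved.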
>>
>> I have
>>
>> /* NB: bzero returns void and __memsetzero returns void *. */
>> asm (".weak bzero");
>> asm ("bzero = __memsetzero");
>> asm (".global __bzero");
>> asm ("__bzero = __memsetzero");
>>
>>>>
>>>>>
>>>>> My understanding is that __memsetzero would maybe yield some gain
>>>>> in the store mask generation (some architectures might have a zero
>>>>> register or some instruction to generate one); however, it would
>>>>> require using the same strategy as memset, with the
>>>>> architecture-specific instructions that optimize cache utilization
>>>>> (dc zva, dcbz).
>>>>>
>>>>> So it would mostly require a lot of arch-specific code to share
>>>>> the memset code with __memsetzero (to avoid increasing code size),
>>>>> so I am not sure this is really a gain in the long term.
>>>>
>>>> It's worth noting that, between the two, `memset` is the cold
>>>> function and `__memsetzero` is the hot one.  Based on profiles of
>>>> GCC 11 and Python 3.7.7, setting zero covers 99%+ of cases.
>>>
>>> This is very workload specific, and I think that with more advanced
>>> compiler optimizations like LTO and PGO such calls could most likely
>>> be optimized by the compiler itself (either by inlining or by
>>> creating a synthetic function to handle them).
>>>
>>> What worries me is that such symbols might end up like the AEABI
>>> memcpy variants, which were added as a way to optimize the case
>>> where the alignment is known to be a multiple of the word size, but
>>> ended up not being implemented and also not being generated by the
>>> compiler (at least not by gcc).
>>
>> --
>> H.J.
>
> I would like to backport this patch to release branches.
> Any comments or objections?

Nothing really against it, but as per the previous discussion we had on
this mailing list, optimizing bzero does not yield much gain compared
to memset (the compiler won't generate the libcall for loop
transformations, among other shortcomings).  My idea is to follow the
other architectures and just remove all the x86_64 optimizations.
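
For reference, the generic C fallback that the other architectures
already rely on is essentially just a call into memset, along these
lines (a sketch, not a verbatim copy of string/bzero.c):

  /* Sketch of the generic fallback: bzero simply defers to memset.
     weak_alias is the glibc-internal aliasing macro.  */
  #include <string.h>

  void
  __bzero (void *dest, size_t len)
  {
    memset (dest, '\0', len);
  }
  weak_alias (__bzero, bzero)

So dropping the x86_64-specific entry points just means bzero callers
end up in whichever memset ifunc variant gets selected anyway.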