From: Noah Goldstein
Date: Mon, 14 Feb 2022 06:41:07 -0600
Subject: Re: [PATCH v2] x86-64: Optimize bzero
To: Adhemerval Zanella
Cc: Patrick McGehearty, GNU C Library
List-Id: Libc-alpha mailing list

On Mon, Feb 14, 2022 at 6:07 AM Adhemerval Zanella wrote:
>
> On 12/02/2022 20:46, Noah Goldstein wrote:
> > On Fri, Feb 11, 2022 at 7:01 AM Adhemerval Zanella via Libc-alpha
> > wrote:
> >>
> >> On 10/02/2022 18:07, Patrick McGehearty via Libc-alpha wrote:
> >>> Just as another point of information, Solaris libc implemented
> >>> bzero as moving arguments around appropriately then jumping to
> >>> memset. No one noticed enough to file a complaint. Of course,
> >>> short fixed-length bzero was handled with inline stores of zero
> >>> by the compiler. For long vector bzeroing, the overhead was
> >>> negligible.
> >>>
> >>> When certain Sparc hardware implementations provided faster methods
> >>> for zeroing a cache line at a time on cache line boundaries,
> >>> memset added a single test for zero if and only if the length of code
> >>> to memset was over a threshold that seemed likely to make it
> >>> worthwhile to use the faster method. The principal advantage
> >>> of the fast zeroing operation is that it did not require data
> >>> to move from memory to cache before writing zeros to memory,
> >>> protecting cache locality in the face of large block zeroing.
> >>> I was responsible for much of that optimization effort.
> >>> Whether that optimization was really worth it is open for debate
> >>> for a variety of reasons that I won't go into just now.
> >>
> >> Afaik this is pretty much what optimized memset implementations
> >> do, if the architecture allows it.
> >> For instance, aarch64 uses
> >> 'dc zva' for sizes larger than 256 and powerpc uses dcbz with a
> >> similar strategy.
> >>
> >>>
> >>> Apps still used bzero or memset(target,zero,length) according to
> >>> their preferences, but the code was unified under memset.
> >>>
> >>> I am inclined to agree with keeping bzero in the API for
> >>> compatibility with old code/old binaries/old programmers. :-)
> >>
> >> The main driver to remove the bzero internal implementation is just
> >> that *currently* gcc does not generate bzero calls by default
> >> (I couldn't find a single binary that calls bzero in my system).
> >
> > Does it make sense then to add '__memsetzero' so that we can have
> > a function optimized for setting zero?
>
> Will it really be a huge gain instead of a microoptimization that will
> just add a bunch more ifunc variants along with the maintenance cost
> associated with this?

Is there any way it can be set up so that one C impl can cover all the
arches that want to just leave `__memsetzero` as an alias to `memset`?
I know they have incompatible interfaces that make it hard, but would
a weak static inline in string.h work?

For some of the shorter control flows (which are generally small sizes
and very hot) we saw reasonable benefits on x86_64.

The most significant was the EVEX/AVX2 [32, 64] case where it netted us
~25% throughput. This is a pretty hot set value so it may be worth it.

>
> My understanding is __memsetzero would maybe yield some gain in the
> store mask generation (some architectures might have a zero register
> or some instruction to generate one), however it would need to
> use the same strategy as memset to use the architecture-specific
> instructions that optimize cache utilization (dc zva, dcbz).
>
> So it would mostly require a lot of arch-specific code to share
> the memset code with __memsetzero (to avoid increasing code size),
> so I am not sure if this is really a gain in the long term.

It's worth noting that, of the two, `memset` is the cold function and
`__memsetzero` is the hot one. Based on profiles of GCC 11 and
Python 3.7.7, setting zero covers 99%+ of cases.
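
To make the "one C impl" question above concrete, here is a minimal
sketch of what a generic fallback could look like. The name
`memsetzero_generic` and its `(void *, size_t)` signature are
hypothetical and chosen only for illustration; they are not taken from
the patch, and this is a plain static inline rather than a weak symbol.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical generic fallback (name and signature are assumptions):
   an arch without a dedicated zeroing routine forwards to memset with
   a constant zero fill value.  */
static inline void *
memsetzero_generic (void *dest, size_t len)
{
  /* The fill value is a compile-time constant here, so the only
     interface difference from memset is the missing int argument.  */
  return memset (dest, 0, len);
}
```

An arch with an optimized zeroing path would provide its own
definition in place of this wrapper; everything else would pay only
for the forwarding call, which the compiler can inline away.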