From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yb1-xb2e.google.com (mail-yb1-xb2e.google.com [IPv6:2607:f8b0:4864:20::b2e]) by sourceware.org (Postfix) with ESMTPS id 09020385625B for ; Wed, 18 May 2022 17:38:56 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 09020385625B Received: by mail-yb1-xb2e.google.com with SMTP id x2so4595995ybi.8 for ; Wed, 18 May 2022 10:38:56 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=5MnAGscZj9a1ixzgRlhufbWA7/i1WCWOI4gcb2l6BfU=; b=d6SSXCerEUX3Yhe0TrEj6BKXgY9Alwm/d8js+t9greUXiCx5byA8uq0VnTFDIBfe49 iT7TV/7H2fk2euYEif+G7NX3eYJkjlHNNYvdtT9kAwyVVFiiZWbtS+VeK6+sqo7UZ82s u0UZjtVWTdIHak4Ho7KjflyDyfC6TMQR/18Ed4bVPPCRdDit73oZs8nEWjztvP27FuBu GoCmEZli7nrw022zCSsRqUMy1NIN472XrK/Ir/YEcCGtTCEFrKGHj4FQmm0epS3Newyq FplH6s7SInhzdLpyUIcKKOhWt9ezxqHDafnjS1J2JXVYKAef2LAXL91bTRAOZHyGYyx/ wN0Q== X-Gm-Message-State: AOAM5312Pau2DJYOc9t3KSJhguzCkgtfTUu5xlk2po2qrjECAbbtsCse 0/9ZeNqvhtoTVve4gHJ0IJb2r3dNqfjb/60QFY9agPfK3lA= X-Google-Smtp-Source: ABdhPJyivWtnAAgSWDXoguMYaA8nmKEXEgiIf9O6uuMs8y7FgB7J4/UnMqwFgPezavcf/Ht7yYeKpBekBGNdXAo+fvs= X-Received: by 2002:a25:c004:0:b0:64d:6c11:a018 with SMTP id c4-20020a25c004000000b0064d6c11a018mr689150ybf.143.1652895535357; Wed, 18 May 2022 10:38:55 -0700 (PDT) MIME-Version: 1.0 References: <20220414041231.926415-1-goldstein.w.n@gmail.com> <20220516203004.38687-1-goldstein.w.n@gmail.com> <20220516203004.38687-6-goldstein.w.n@gmail.com> <0020ab3f-748a-f02b-da06-2b793694e168@gotplt.org> In-Reply-To: <0020ab3f-748a-f02b-da06-2b793694e168@gotplt.org> From: Noah Goldstein Date: Wed, 18 May 2022 12:38:44 -0500 Message-ID: Subject: Re: [PATCH v9 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h To: Siddhesh Poyarekar Cc: GNU C Library , Alexander Monakov Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-8.5 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2022 17:38:58 -0000 On Tue, May 17, 2022 at 12:12 AM Siddhesh Poyarekar wrote: > > Not sure why, but the series failed to apply on trybot. It's probably a > trybot bug because it applied just fine on my up to date copy. I think it may be related to the fact that earlier versions deleted elf/dl-new-hash.h. > > On 17/05/2022 02:00, Noah Goldstein via Libc-alpha wrote: > > Unroll slightly and enforce good instruction scheduling. This improves > > performance on out-of-order machines. The unrolling allows for > > pipelined multiplies. > > > > As well, as an optional sysdep, reorder the operations and prevent > > reassosiation for better scheduling and higher ILP. This commit > > reassociation. Later in the patch too. Fixed throughout in V10. > > > only adds the barrier for x86, although it should be either no > > change or a win for any architecture. > > > > Unrolling further started to induce slowdowns for sizes [0, 4] > > but can help the loop so if larger sizes are the target further > > unrolling can be beneficial. > > > > Results for _dl_new_hash > > Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz > > > > Time as Geometric Mean of N=30 runs > > Geometric of all benchmark New / Old: 0.674 > > type, length, New Time, Old Time, New Time / Old Time > > fixed, 0, 2.865, 2.72, 1.053 > > fixed, 1, 3.567, 2.489, 1.433 > > fixed, 2, 2.577, 3.649, 0.706 > > fixed, 3, 3.644, 5.983, 0.609 > > fixed, 4, 4.211, 6.833, 0.616 > > fixed, 5, 4.741, 9.372, 0.506 > > fixed, 6, 5.415, 9.561, 0.566 > > fixed, 7, 6.649, 10.789, 0.616 > > fixed, 8, 8.081, 11.808, 0.684 > > fixed, 9, 8.427, 12.935, 0.651 > > fixed, 10, 8.673, 14.134, 0.614 > > fixed, 11, 10.69, 15.408, 0.694 > > fixed, 12, 10.789, 16.982, 0.635 > > fixed, 13, 12.169, 18.411, 0.661 > > fixed, 14, 12.659, 19.914, 0.636 > > fixed, 15, 13.526, 21.541, 0.628 > > fixed, 16, 14.211, 23.088, 0.616 > > fixed, 32, 29.412, 52.722, 0.558 > > fixed, 64, 65.41, 142.351, 0.459 > > fixed, 128, 138.505, 295.625, 0.469 > > fixed, 256, 291.707, 601.983, 0.485 > > random, 2, 12.698, 12.849, 0.988 > > random, 4, 16.065, 15.857, 1.013 > > random, 8, 19.564, 21.105, 0.927 > > random, 16, 23.919, 26.823, 0.892 > > random, 32, 31.987, 39.591, 0.808 > > random, 64, 49.282, 71.487, 0.689 > > random, 128, 82.23, 145.364, 0.566 > > random, 256, 152.209, 298.434, 0.51 > > > > Co-authored-by: Alexander Monakov > > --- > > sysdeps/generic/dl-new-hash.h | 114 +++++++++++++++++++++++++++++ > > {elf => sysdeps/x86}/dl-new-hash.h | 16 +--- > > This breaks the benchmark build, but including just dl-new-hash.h in the > benchmark should fix it. Fixed in V10. > > > 2 files changed, 117 insertions(+), 13 deletions(-) > > create mode 100644 sysdeps/generic/dl-new-hash.h > > rename {elf => sysdeps/x86}/dl-new-hash.h (77%) > > > > diff --git a/sysdeps/generic/dl-new-hash.h b/sysdeps/generic/dl-new-hash.h > > new file mode 100644 > > index 0000000000..84aa7991a4 > > --- /dev/null > > +++ b/sysdeps/generic/dl-new-hash.h > > @@ -0,0 +1,114 @@ > > +/* _dl_new_hash for elf symbol lookup > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + . */ > > + > > +#ifndef _DL_NEW_HASH_H > > +#define _DL_NEW_HASH_H 1 > > + > > +#include > > +/* For __always_inline. */ > > +#include > > +/* For __glibc_unlikely. */ > > +#include > > Same header included twice. Missed that in V10. Will wait for feedback on the rest of the changes and update with V11. > > > + > > +/* The simplest implementation of _dl_new_hash is: > > + > > + _dl_new_hash (const char *s) > > + { > > + uint32_t h = 5381; > > + for (unsigned char c = *s; c != '\0'; c = *++s) > > + h = h * 33 + c; > > + return h; > > + } > > + > > + We can get better performance by slightly unrolling the > > + loop to pipeline the multiples. > > .. the multiples, which gcc cannot easily do due to dependencies across > iterations. > > > + As well, as an architecture specific option we add asm statements > > + to explicitly specifying order of operations to prevent > > to explicitly specify the order... Fixed in V10. > > > + reassosiation of instructions that lengthens the loop carried > > + dependency. This may have no affect as the compiler may have > > + ordered instructions the same way without it but in testing this > > + has not been the case for GCC. Improving GCC to reliably schedule > > + instructions ideally cannot be easily done. > > + > > + Architecture(s) that use the reassosiation barries are: > > + x86 > > + > > + Note it is very unlikely the reassosiation barriers would > > + de-optimize performance on any archictecture and with an imperfect > > architecture Fixed in V10. > > > + compiler it may help performance, especially on out-of-order cpus, > > + so it is suggested that the respective maintainers add them. */ > > Suggest: "architecture maintainers are encouraged to benchmark this with > __asm_reassociation_barrier defined to __asm__ like it is in x86." > Took your suggestion in V10. > > + > > + > > +#ifndef __asm_reassociation_barrier > > +# define __asm_reassociation_barrier(...) > > +#endif > > + > > +static __always_inline uint32_t > > +__attribute__ ((unused)) > > +_dl_new_hash (const char *str) > > +{ > > + const unsigned char *s = (const unsigned char *) str; > > + unsigned int h = 5381; > > + unsigned int c0, c1; > > + for (;;) > > + { > > + c0 = s[0]; > > + /* Since hashed string is normally not empty, this is unlikely on the > > + first iteration of the loop. */ > > + if (__glibc_unlikely (c0 == 0)) > > + return h; > > + > > + c1 = s[1]; > > + if (c1 == 0) > > + { > > + /* Ideal instruction scheduling is: > > Suggest: "Ideal computation order is" Took your suggestion in V10. > > > + c0 += h; > > + h *= 32; > > + h += c0; > > + > > + The __asm_reassociation_barrier() macro is a sysdep optional asm > > + statements to prevents reassosiation that would result in more > > + instruction interdependencies and worse scheduling. */ > > This bit is redundant with the description at the top near > __asm_reassociation_barrier. Dropped in V10. > > > + c0 += h; > > + __asm_reassociation_barrier("" : "+r"(h) : "r"(c0)); > > + h = h * 32 + c0; > > + return h; > > + } > > + > > + /* Ideal instruction scheduling is: > > Same: "Ideal computation order is" Took your suggestion in V10. > > > + c1 += c0; > > + h *= 33 * 33; > > + c0 *= 32; > > + c1 += c0; > > + h += c1; > > + > > + The __asm_reassociation_barrier() macro is a sysdep optional asm > > + statements to prevents reassosiation that would result in more > > + instruction interdependencies and worse scheduling. */ > > This too is redundant. Dropped in V10. > > > + c1 += c0; > > + __asm_reassociation_barrier("" : "+r"(c1), "+r"(c0)); > > + h *= 33 * 33; > > + c1 += c0 * 32; > > + __asm_reassociation_barrier("" : "+r"(c1)); > > + h += c1; > > + s += 2; > > + } > > +} > > + > > +#endif /* dl-new-hash.h */ > > diff --git a/elf/dl-new-hash.h b/sysdeps/x86/dl-new-hash.h > > similarity index 77% > > rename from elf/dl-new-hash.h > > rename to sysdeps/x86/dl-new-hash.h > > index b7a91ecc07..dd800265bf 100644 > > --- a/elf/dl-new-hash.h > > +++ b/sysdeps/x86/dl-new-hash.h > > @@ -19,19 +19,9 @@ > > #ifndef _DL_NEW_HASH_H > > #define _DL_NEW_HASH_H 1 > > No need to define this... Fixed in V10. > > > > > -#include > > -/* For __always_inline. */ > > -#include > > - > > -static __always_inline uint32_t > > -__attribute__ ((unused)) > > -_dl_new_hash (const char *s) > > -{ > > - uint32_t h = 5381; > > - for (unsigned char c = *s; c != '\0'; c = *++s) > > - h = h * 33 + c; > > - return h; > > -} > > +#define __asm_reassociation_barrier __asm__ > > > > +#undef _DL_NEW_HASH_H > > ... if it is unconditionally undefined here. > > > +#include > > > > #endif /* dl-new-hash.h */ >