From: Noah Goldstein
Date: Tue, 10 May 2022 12:17:20 -0500
Subject: Re: [PATCH v6 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h
To: Alexander Monakov
Cc: GNU C Library <libc-alpha@sourceware.org>, "H.J. Lu", "Carlos O'Donell"
List-Id: Libc-alpha mailing list

On Tue, May 10, 2022 at 11:49 AM Alexander Monakov wrote:
>
> On Tue, 10 May 2022, Noah Goldstein wrote:
>
> > Unroll slightly and enforce good instruction scheduling.  This improves
> > performance on out-of-order machines.  Note the unrolling allows
> > for pipelined multiplies, which helps a bit, but most of the gain
> > is from enforcing better instruction scheduling for more ILP.
> > Unrolling further started to induce slowdowns for sizes [0, 4],
> > but it can help the loop, so if larger sizes are the target, further
> > unrolling can be beneficial.
> >
> > Results for _dl_new_hash
> > Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>
> It seems the benchmark figures are unchanged from the first iteration of
> this patch; did the revision not affect them?

Didn't rerun; will do for V7.

> (more comments below)
>
> > --- a/elf/dl-new-hash.h
> > +++ b/elf/dl-new-hash.h
> > @@ -20,15 +20,55 @@
> >  #define _DL_NEW_HASH_H 1
> >
> >  #include <stdint.h>
> > +/* For __glibc_unlikely.  */
> > +#include <sys/cdefs.h>
> >
> >  static inline uint32_t
> >  __attribute__ ((unused))
> > -_dl_new_hash (const char *s)
> > +_dl_new_hash (const char *signed_s)
>
> This is technically inaccurate (whether plain 'char' is signed depends on
> the target), so if you're revising this further I'd suggest changing this
> too, to e.g. 'str'.

Will fix for V7.
> >  {
> > -  uint32_t h = 5381;
> > -  for (unsigned char c = *s; c != '\0'; c = *++s)
> > -    h = h * 33 + c;
> > -  return h;
>
> I think it would be nice to retain this loop as a comment to indicate what
> this function is supposed to implement.

Will fix for V7.

> > +  const unsigned char *s = (const unsigned char *) signed_s;
> > +  unsigned int h = 5381;
> > +  unsigned int c0, c1;
> > +  for (;;)
> > +    {
> > +      c0 = (unsigned int) *s;
>
> Surprised to see an explicit cast where plain assignment with implicit type
> conversion is doing the obvious thing.  Is it really necessary?

It's not; I just tend to err on the side of explicitness.

> > +      /* Unlikely length zero string so evens will be slightly less
> > +         common.  */
>
> I had trouble understanding this comment.  I'd suggest dropping it or
> rephrasing it like 'Since the hashed string is normally not empty, this is
> unlikely on the first iteration of the loop'.

Will fix for V7.

> > +      if (__glibc_unlikely (c0 == 0))
> > +        {
> > +          return h;
> > +        }
>
> Braces look unnecessary.

Will fix for V7.

> > +
> > +      c1 = (unsigned int) *(s + 1);
>
> Again an unnecessary explicit cast here (c1 = s[1] might be easier to
> read).  Alternatively, you could use 'c1 = *s++' here and above and drop
> the explicit s += 2 below; I expect the resulting assembly to be the same.

I generally think using `[]` access makes stylistic sense when incrementing
an index, and *(s + N) makes sense when incrementing a pointer.  I'm
generally in favor of leaving the casts/accesses as they are, but it's not
a hill worth dying on.  Let me know if it's important to you for V7 (it
will take a few hours to rerun the benchmarks).

> > +      if (c1 == 0)
> > +        {
> > +          c0 += h;
> > +          /* Ideal instruction scheduling is:
> > +                 c0 += h;
> > +                 h *= 32;
> > +                 h += c0;
> > +             The asm statement ensures the compiler can't mess that
> > +             up.  */
>
> The main concern here is preventing reassociation from 'h = h*32 + (c0 + h)'
> to 'h = (h*32 + h) + c0', not scheduling.
> We're using an empty asm to break up a sequence of additions.

Well, the reason (h * 32 + h) + c0 is worse is that it creates scheduling
dependencies, no?  If the comment says "to avoid reassociation", the natural
question is why reassociation matters; saying it's explicitly for scheduling
answers that.  I'll update the comment in V7 to mention both.

> Also note that this leads to a return, so it only saves a couple of cycles
> once per call, not inside the loop.
>
> > +          asm("" : "+r"(h) : "r"(c0));
> > +          h = h * 32 + c0;
> > +          return h;
>
> Wrong indentation here.

Will fix for V7.

> > +        }
> > +
> > +      /* Ideal instruction scheduling is:
> > +             c1 += c0;
> > +             h *= 33 * 33;
> > +             c0 *= 32;
> > +             c1 += c0;
> > +             h += c1;
> > +         The asm statements ensure the compiler can't mess that up.  */
>
> As above, we are placing empty asms mainly as reassociation barriers.

Will update to mention both.  It really is a missed optimization in GCC's
instruction scheduling pass, though.

> > +      c1 += c0;
> > +      asm("" : "+r"(c1), "+r"(c0));
> > +      h *= 33 * 33;
> > +      c1 += c0 * 32;
> > +      asm("" : "+r"(c1));
> > +      h += c1;
> > +      s += 2;
> > +    }
> >  }
>
> Alexander