From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yb1-xb31.google.com (mail-yb1-xb31.google.com [IPv6:2607:f8b0:4864:20::b31]) by sourceware.org (Postfix) with ESMTPS id 080333858C56 for ; Thu, 5 May 2022 18:04:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 080333858C56 Received: by mail-yb1-xb31.google.com with SMTP id r11so8975749ybg.6 for ; Thu, 05 May 2022 11:04:20 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=5xnAAu8H2FT6MzYqhz3Ig89E8KgORIReS28ymS5p5tA=; b=u62p+/ddgpRXPvT9UOJ5z0nLhlxfwe8Ne/IcT/ijmfgGCx8E9nbIUMkku+NU+4DzuI RWec0G0gPq8WOtVCLYf474GzbXCF1qwyvfdXWeen0M2k+EIJiHPMDzrpITnbGOlewtDj 8EkPSZ26LALFVZ/s80P3sB8kEoLWUd4CwThBx50rqVe9mp77gv0dO194y+pZtVxRfhdf U8ynhXAUkSR+j3phEEFYx/U9hjvldlDjOvBN+9eHE/YWbTNE0y2dg5AiuFDMPVrtgPBz AWamNfY+OWFMaTe8uNQHQiqn4745j6/7Z3G7ET5Ij/QrXO6x0tZuCFcduINbuUi5rb56 012A== X-Gm-Message-State: AOAM5318pgd19veaxjlFD54TaJvW3FqRHBMT8fht2OArafA7coq0oxf9 dm3z8sBYRp9LZUMb4l9P6siBR4Moap0qNAQwvYs= X-Google-Smtp-Source: ABdhPJx2y0ehyAss59SkcyFkBNy5kKcdEn/VSiU9hsnTaD4J/5tC9qFwj+Q8mSRUMhGM+eytUEyLG+0b64alvV90JAA= X-Received: by 2002:a05:6902:1002:b0:649:70c2:db58 with SMTP id w2-20020a056902100200b0064970c2db58mr17146895ybt.68.1651773859406; Thu, 05 May 2022 11:04:19 -0700 (PDT) MIME-Version: 1.0 References: <20220414041231.926415-1-goldstein.w.n@gmail.com> <20220425163601.3670626-1-goldstein.w.n@gmail.com> <20220425163601.3670626-6-goldstein.w.n@gmail.com> <991f2272-4ae6-b06b-ceb7-184e5d369118@ispras.ru> <6d61305f-58-705b-bfb8-cfa243c41d3e@ispras.ru> <741653a-20e0-6dc4-d5ce-c826677cf36@ispras.ru> <21d761a-a6ef-e9a2-1a38-a592503c81db@ispras.ru> In-Reply-To: <21d761a-a6ef-e9a2-1a38-a592503c81db@ispras.ru> From: Noah Goldstein Date: Thu, 5 May 2022 13:03:45 -0500 Message-ID: Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h To: Alexander Monakov Cc: GNU C Library Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-2.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 05 May 2022 18:04:21 -0000 On Thu, May 5, 2022 at 10:26 AM Alexander Monakov wrote: > > On Thu, 5 May 2022, Noah Goldstein wrote: > > > > The following variant of the original loop avoids those issues and > > > should run close to 2 cycles per iteration on most CPUs: > > > > > > static uint32_t > > > _dl_new_hash (const char *s) > > > { > > > uint32_t h = 5381, c; > > > const unsigned char *us = (const void *)s; > > > while ((c = *us++)) > > > { > > > c += h; > > > asm("" : "+r"(h) : "r"(c)); > > > h = h * 32 + c; > > > } > > > return h; > > > } > > > > > > > I'm not sure what you are getting at with the asm(). It seems the > > produce the exact > > same assembly: > > https://godbolt.org/z/93qGMaTTE > > They are definitely not the same even via your link. Loop body of dl_new_hash0: Oh wow. Must have 'looked at it' before it reloaded. I'm sorry! > > .L3: > (a) addl %eax, %edx > addq $1, %rcx > (b) sall $5, %eax > (c) addl %edx, %eax > movzbl -1(%rcx), %edx > testl %edx, %edx > jne .L3 > > and of dl_new_hash1: > > .L9: > (A) movl %r8d, %eax > addq $1, %rcx > (B) sall $5, %eax > (C) addl %edx, %eax > movzbl -1(%rcx), %edx > (D) addl %eax, %r8d > testl %edx, %edx > jne .L9 > > (in fact even the instruction count is not the same, 7 vs 8) > > In the first loop, (a) and (b) are independent, (c) depends on them both, and > on the next iteration (a) and (b) take the result of (c) from the previous. Thus > the dependencies are 2 cycles for one iteration. > > In the second loop, (B) depends on (A), (C) depends on (B), (D) depends on (C), > and on the next iteration (A) depends on (D) from the previous. Thus four > instructions form a dependency chain of 3 or 4 cycles depending if move > elimination happens or not. > > In any case, if you benchmark them both you should see the difference. > Okay that makes sense and indeed results in a substantial improvement. Totally happy with going with your version. Think there is still some benefit to the unrolled version because 1) It's less eager about hitting the LSD on newer processors (but that's really only an issue for strings > ~24 characters). 2) It bottlenecks less hard on `p6` because the `imul` goes `p0` and the branches are distributed between `p0` and `p6` instead of always on `p6`. 3) It still saves a few uops (although imul vs `add + shl` isn't really a meaningful save). Either way it will be an improvement. Little benchmark: https://godbolt.org/z/G6PvW4eTr Generally see hash2 winning the most. > Alexander