From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84]) by sourceware.org (Postfix) with ESMTPS id 591883858D3C for ; Thu, 5 May 2022 19:37:26 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 591883858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=ispras.ru Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru Received: from [10.10.3.121] (unknown [10.10.3.121]) by mail.ispras.ru (Postfix) with ESMTPS id 5B28F40D4004; Thu, 5 May 2022 19:37:20 +0000 (UTC) Date: Thu, 5 May 2022 22:37:20 +0300 (MSK) From: Alexander Monakov To: Noah Goldstein cc: GNU C Library Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h In-Reply-To: Message-ID: <9e22ae56-109d-8026-991-5c6af7b1730@ispras.ru> References: <20220414041231.926415-1-goldstein.w.n@gmail.com> <20220425163601.3670626-1-goldstein.w.n@gmail.com> <20220425163601.3670626-6-goldstein.w.n@gmail.com> <991f2272-4ae6-b06b-ceb7-184e5d369118@ispras.ru> <6d61305f-58-705b-bfb8-cfa243c41d3e@ispras.ru> <741653a-20e0-6dc4-d5ce-c826677cf36@ispras.ru> <21d761a-a6ef-e9a2-1a38-a592503c81db@ispras.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 05 May 2022 19:37:28 -0000 On Thu, 5 May 2022, Noah Goldstein wrote: > Okay that makes sense and indeed results in a substantial improvement. > > Totally happy with going with your version. > > Think there is still some benefit to the unrolled version because > > 1) It's less eager about hitting the LSD on newer processors > (but that's really only an issue for strings > ~24 characters). > > 2) It bottlenecks less hard on `p6` because the `imul` goes `p0` > and the branches are distributed between `p0` and `p6` instead of > always on `p6`. > > 3) It still saves a few uops (although imul vs `add + shl` isn't really > a meaningful save). Agreed; let me point out though that your variant is preferable provided that multiplication is at most 3 cycles (or integer multiply-add is at most 4). That's a given on not-too-old x86, but I'm not sure how things stand elsewhere. On the other hand, if a 3-cycle multiply-add is available, you variant is strongly preferable (this is generic code so we should try to think a bit about other architectures). I'd recommend to reword commit message of your patch 6/6 so it properly explains that the apparent 2x speedup is due to baseline case hitting two issues that slow it down by, well, almost 2x. > Either way it will be an improvement. > > Little benchmark: https://godbolt.org/z/G6PvW4eTr > > Generally see hash2 winning the most. I noticed that hash2 needs more magic empty asms, because the compiler may reassociate 'h = h + (32 * c0 + c1);' to 'h = (h + c1) + 32 * c0' which increases the dependency chain. I adjusted it like this: c1 += c0; asm("" : "+r"(c1), "+r"(c0)); h *= 33 * 33; c1 += c0 * 32; asm("" : "+r"(c1)); h += c1; s += 2; Alexander