From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84]) by sourceware.org (Postfix) with ESMTPS id AF16A385842B for ; Wed, 27 Apr 2022 15:02:08 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AF16A385842B Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=ispras.ru Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru Received: from [10.10.3.121] (unknown [10.10.3.121]) by mail.ispras.ru (Postfix) with ESMTPS id 0A4F04076267; Wed, 27 Apr 2022 15:02:06 +0000 (UTC) Date: Wed, 27 Apr 2022 18:02:05 +0300 (MSK) From: Alexander Monakov To: Noah Goldstein cc: libc-alpha@sourceware.org Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h In-Reply-To: <20220425163601.3670626-6-goldstein.w.n@gmail.com> Message-ID: <991f2272-4ae6-b06b-ceb7-184e5d369118@ispras.ru> References: <20220414041231.926415-1-goldstein.w.n@gmail.com> <20220425163601.3670626-1-goldstein.w.n@gmail.com> <20220425163601.3670626-6-goldstein.w.n@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 27 Apr 2022 15:02:11 -0000 On Mon, 25 Apr 2022, Noah Goldstein via Libc-alpha wrote: > Unroll slightly so some of the multiples can be pipelined on out-order > machines. Unrolling further started to induce slowdowns for sizes > [0, 4] but can help the loop so if larger sizes are the target > further unrolling can be beneficial. Note, the original algorithm does not need a literal multiplication (with 3cyc latency), as h*33 == (h << 5) + h: > - for (unsigned char c = *s; c != '\0'; c = *++s) > - h = h * 33 + c; in musl we spell this out as 'h += h*32 + c' to avoid GCC emitting a multiplication at -Os. In 'h = (h + c) + (h << 5)' critical path has latency of only 2 cycles, and (h + c) goes independent of (h << 5). (if the original loop is implemented with a multiplication, its critical patch has 4-cycle latency, 3cyc from the multiplication plus 1 from the addition) However, when you reroll the loop and overlap two iterations, multiplication by 33*33 no longer has this nice property and runs with two 4cyc paths overlapped (so effective critical path is the same as original). Alexander