From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84]) by sourceware.org (Postfix) with ESMTPS id 4FA5F3856260 for ; Thu, 5 May 2022 11:08:00 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4FA5F3856260 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=ispras.ru Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru Received: from [10.10.3.121] (unknown [10.10.3.121]) by mail.ispras.ru (Postfix) with ESMTPS id 8C7E740D4004; Thu, 5 May 2022 11:07:52 +0000 (UTC) Date: Thu, 5 May 2022 14:07:52 +0300 (MSK) From: Alexander Monakov To: Noah Goldstein cc: GNU C Library Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h In-Reply-To: Message-ID: <741653a-20e0-6dc4-d5ce-c826677cf36@ispras.ru> References: <20220414041231.926415-1-goldstein.w.n@gmail.com> <20220425163601.3670626-1-goldstein.w.n@gmail.com> <20220425163601.3670626-6-goldstein.w.n@gmail.com> <991f2272-4ae6-b06b-ceb7-184e5d369118@ispras.ru> <6d61305f-58-705b-bfb8-cfa243c41d3e@ispras.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 05 May 2022 11:08:01 -0000 On Wed, 4 May 2022, Alexander Monakov wrote: > Tigerlake belongs to Intel CPU generations that don't perform move > elimination for general registers (along with Icelake and Sandybridge). > So register copy in preparation to shift 'h' by 5 costs an extra cycle > on the critical path. > > Furthermore, h*33 + *c gets evaluated as '((h + h*32) + *c) instead of > '((h + *c) + h*32)', which prevents interleaving the additions and thus > puts one extra add on the critical path. > > (gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32') > > So due to the above issues, on Tigerlake you get 4 cycles for the original > loop, and also 4 cycles for the modified two-at-a-time loop. The following variant of the original loop avoids those issues and should run close to 2 cycles per iteration on most CPUs: static uint32_t _dl_new_hash (const char *s) { uint32_t h = 5381, c; const unsigned char *us = (const void *)s; while ((c = *us++)) { c += h; asm("" : "+r"(h) : "r"(c)); h = h * 32 + c; } return h; } Alexander