From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from frog.ash.relay.mailchannels.net (frog.ash.relay.mailchannels.net [23.83.222.63]) by sourceware.org (Postfix) with ESMTPS id 6F66C384B0C5 for ; Mon, 16 May 2022 14:12:39 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 6F66C384B0C5 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=gotplt.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gotplt.org X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org Received: from relay.mailchannels.net (localhost [127.0.0.1]) by relay.mailchannels.net (Postfix) with ESMTP id C1CC0121BFA; Mon, 16 May 2022 14:12:36 +0000 (UTC) Received: from pdx1-sub0-mail-a305.dreamhost.com (unknown [127.0.0.6]) (Authenticated sender: dreamhost) by relay.mailchannels.net (Postfix) with ESMTPA id 7999A121B62; Mon, 16 May 2022 14:12:35 +0000 (UTC) ARC-Seal: i=1; s=arc-2022; d=mailchannels.net; t=1652710355; a=rsa-sha256; cv=none; b=pxuvqQvPA8LX0W+wLJHang5c4M3ao4hvcPgwuIpVH5BFExriqqaplm55Fbxh8CejYHFvfp 9JZgYRZ+8VutEtIY/viPsqiWcAm3zMQv/XdBuEFQjxrt8ahsDLbAMmKmy4tvgmTiyBA4c4 MFZoeSuqRYw8D04f89tnwmPwKfurKtag34IOQo6uOMmuCggt80GuwE6bD4nDMEikTqvuRm yqtnbaEmLjlYlAQj8ZbVeFbluN3xy5rJB9nb7J46DT+NL6hE5qoLDTlggxmmnDCGcmL6i+ Pbmga+wnhlB/OVmpWVrZhGT2wY5v2OPmraU3JW8bxx+bdmvGh3T4xX39XQsGZQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=mailchannels.net; s=arc-2022; t=1652710355; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CGUPGMXdSiIj38Y4Ku/ZocDeCkJA9JSyG0hPyS+l04Q=; b=E2CzCugMwLhyJPd5lOLrb5QrkbVA4i8GwPhMsAXdI4z8QnAR+xcevaf+d15HN5rDLmA6q3 Jn1HDDt930yF28sOqwPXMLzoPMdU3lgCB2X6nmSXYCY9XvM4c+59ouzJv8gnXVbz9/OiuC sCyzKq+scfH3ycMYDQbcKK8V1b10xKgmVmDcZa2NK87GvRH8qXHBhBTfuHqHX3NV6Wd0Lm lWm0cWugk8tj5TDJU4a9DOIOG3r6Iw5YIR0TOYVjRthIL51uiMdZ+FkSW/tCdet8+RqHgi nR7wx4CVY27P+vMIZDIvYkEu/HMyxw0cpFwdojSRJaDd8IG/asyDoE8XqiItCA== ARC-Authentication-Results: i=1; rspamd-6fcfc4d76-rnszq; auth=pass smtp.auth=dreamhost smtp.mailfrom=siddhesh@gotplt.org X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org X-MC-Relay: Neutral X-MailChannels-SenderId: dreamhost|x-authsender|siddhesh@gotplt.org X-MailChannels-Auth-Id: dreamhost X-Dime-Tangy: 09ed3897719bfb4c_1652710356411_1589544384 X-MC-Loop-Signature: 1652710356411:3729483400 X-MC-Ingress-Time: 1652710356410 Received: from pdx1-sub0-mail-a305.dreamhost.com (pop.dreamhost.com [64.90.62.162]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384) by 100.124.238.93 (trex/6.7.1); Mon, 16 May 2022 14:12:36 +0000 Received: from [192.168.1.174] (unknown [1.186.223.88]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) (Authenticated sender: siddhesh@gotplt.org) by pdx1-sub0-mail-a305.dreamhost.com (Postfix) with ESMTPSA id 4L21RP5j42z36; Mon, 16 May 2022 07:12:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gotplt.org; s=dreamhost; t=1652710355; bh=CGUPGMXdSiIj38Y4Ku/ZocDeCkJA9JSyG0hPyS+l04Q=; h=Date:Subject:To:Cc:From:Content-Type:Content-Transfer-Encoding; b=ZBUgC+xySh7NZp6uHmoB1Tf5puT/EIJ2TSImuSriBIoh/9LnatwuIXvIzGzIE2QVW zBiHeLEnFF2FqFyuS6Q29RTtXFx91f8OAXgFlDLrqMFII/g71u6fdwxst5meQsgoik 4YLnad8cU9VFyHieJhfGc+/jPgBc6BWHoWiOlEKgAbP2dMGepAfnBD1M4+gO3UdPHS +cHqu3z7mNpUl++jEdfN8xl/GHJmD4B+ShGNKsED0sxHrXz3OFXMsmJSSXFJ+iMzJi zOFt7oAb/we1YXbBBme+gGZm2xAfmgGnY+lNQ5R/Vcza/b/FaJ993I0DX31DvM4CWR tFEhgrDK9+YSQ== Message-ID: <1b419b02-0dee-813b-de4c-1fdc0779174a@gotplt.org> Date: Mon, 16 May 2022 19:42:27 +0530 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0 Subject: Re: [PATCH v8 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h Content-Language: en-US To: Noah Goldstein , libc-alpha@sourceware.org Cc: Alexander Monakov References: <20220414041231.926415-1-goldstein.w.n@gmail.com> <20220511030635.154689-1-goldstein.w.n@gmail.com> <20220511030635.154689-6-goldstein.w.n@gmail.com> From: Siddhesh Poyarekar In-Reply-To: <20220511030635.154689-6-goldstein.w.n@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-3036.9 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, NICE_REPLY_A, RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 May 2022 14:12:46 -0000 On 11/05/2022 08:36, Noah Goldstein via Libc-alpha wrote: > Unroll slightly and enforce good instruction scheduling. This improves > performance on out-of-order machines. Note the unrolling allows > for pipelined multiplies which helps a bit, but most of the gain > is from enforcing better instruction scheduling for more ILP. > Unrolling further started to induce slowdowns for sizes [0, 4] > but can help the loop so if larger sizes are the target further > unrolling can be beneficial. > > Results for _dl_new_hash > Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz > > Time as Geometric Mean of N=30 runs > Geometric of all benchmark New / Old: 0.674 > type, length, New Time, Old Time, New Time / Old Time > fixed, 0, 2.865, 2.72, 1.053 > fixed, 1, 3.567, 2.489, 1.433 > fixed, 2, 2.577, 3.649, 0.706 > fixed, 3, 3.644, 5.983, 0.609 > fixed, 4, 4.211, 6.833, 0.616 > fixed, 5, 4.741, 9.372, 0.506 > fixed, 6, 5.415, 9.561, 0.566 > fixed, 7, 6.649, 10.789, 0.616 > fixed, 8, 8.081, 11.808, 0.684 > fixed, 9, 8.427, 12.935, 0.651 > fixed, 10, 8.673, 14.134, 0.614 > fixed, 11, 10.69, 15.408, 0.694 > fixed, 12, 10.789, 16.982, 0.635 > fixed, 13, 12.169, 18.411, 0.661 > fixed, 14, 12.659, 19.914, 0.636 > fixed, 15, 13.526, 21.541, 0.628 > fixed, 16, 14.211, 23.088, 0.616 > fixed, 32, 29.412, 52.722, 0.558 > fixed, 64, 65.41, 142.351, 0.459 > fixed, 128, 138.505, 295.625, 0.469 > fixed, 256, 291.707, 601.983, 0.485 > random, 2, 12.698, 12.849, 0.988 > random, 4, 16.065, 15.857, 1.013 > random, 8, 19.564, 21.105, 0.927 > random, 16, 23.919, 26.823, 0.892 > random, 32, 31.987, 39.591, 0.808 > random, 64, 49.282, 71.487, 0.689 > random, 128, 82.23, 145.364, 0.566 > random, 256, 152.209, 298.434, 0.51 > > Co-authored-by: Alexander Monakov > --- > elf/dl-new-hash.h | 66 +++++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 61 insertions(+), 5 deletions(-) > > diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h > index 40d88c81f9..5269f6eb98 100644 > --- a/elf/dl-new-hash.h > +++ b/elf/dl-new-hash.h > @@ -20,15 +20,71 @@ > #define _DL_NEW_HASH_H 1 > > #include > +/* For __glibc_unlikely. */ > +#include > + > +/* The simplest implementation of _dl_new_hash is: > + > +_dl_new_hash (const char *s) > +{ > + uint32_t h = 5381; > + for (unsigned char c = *s; c != '\0'; c = *++s) > + h = h * 33 + c; > + return h; > +} > + > + We can get better performance, however, by slightly unrolling the > + loop and explicitly specifying order of operations to prevent > + reassosiation of instructions and ensure ideal scheduling. */ > > static inline uint32_t > __attribute__ ((unused)) > -_dl_new_hash (const char *s) > +_dl_new_hash (const char *str) > { > - uint32_t h = 5381; > - for (unsigned char c = *s; c != '\0'; c = *++s) > - h = h * 33 + c; > - return h; > + const unsigned char *s = (const unsigned char *) str; > + unsigned int h = 5381; > + unsigned int c0, c1; > + for (;;) > + { > + c0 = s[0]; > + /* Unlikely length zero string so evens will be slightly less > + common. */ > + if (__glibc_unlikely (c0 == 0)) > + return h; > + > + c1 = s[1]; > + if (c1 == 0) > + { > + c0 += h; > + /* Ideal instruction scheduling is: > + c0 += h; > + h *= 32; > + h += c0; > + > + The asm statements are to prevent reassosiation that would result in > + more instruction interdependencies and worse scheduling. */ > + asm("" : "+r"(h) : "r"(c0)); There are a couple of things that seem problematic to me about this: - It seems like we're trying to fix a gcc issue in glibc. Couldn't we file a gcc bug and explore ways in which this could be supported in the compiler? In fact, it might make sense to do that for the original loop; it looks like a missed optimization that gcc ought to fix. IMO the bug should be filed even if we do end up with this micro-optimization in glibc. - The patch controls an instruction schedule so that it works well on out-of-order processors but then only quoting one microarchitecture. If it works well on TigerLake (and on x86 in general) then it might be better to add it as a sysdep override; I assumed that was the point of breaking the function out into its header anyway. If it is more generally useful then please share numbers to that effect in the commit message and also explicitly state in the comments why we're trying to exert this level of control on codegen in generic C code and why it is good for all architectures. > + h = h * 32 + c0; > + return h; > + } > + > + /* Ideal instruction scheduling is: > + c1 += c0; > + h *= 33 * 33; > + c0 *= 32; > + c1 += c0; > + h += c1; > + > + The asm statements are to prevent reassosiation that would result in > + more instruction interdependencies and worse scheduling. */ > + c1 += c0; > + asm("" : "+r"(c1), "+r"(c0)); > + h *= 33 * 33; > + c1 += c0 * 32; > + asm("" : "+r"(c1)); > + h += c1; > + s += 2; > + } > } > > Thanks, Siddhesh