From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <siddhesh@gotplt.org>
Received: from frog.ash.relay.mailchannels.net
 (frog.ash.relay.mailchannels.net [23.83.222.63])
 by sourceware.org (Postfix) with ESMTPS id 6F66C384B0C5
 for <libc-alpha@sourceware.org>; Mon, 16 May 2022 14:12:39 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 6F66C384B0C5
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=gotplt.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gotplt.org
X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org
Received: from relay.mailchannels.net (localhost [127.0.0.1])
 by relay.mailchannels.net (Postfix) with ESMTP id C1CC0121BFA;
 Mon, 16 May 2022 14:12:36 +0000 (UTC)
Received: from pdx1-sub0-mail-a305.dreamhost.com (unknown [127.0.0.6])
 (Authenticated sender: dreamhost)
 by relay.mailchannels.net (Postfix) with ESMTPA id 7999A121B62;
 Mon, 16 May 2022 14:12:35 +0000 (UTC)
ARC-Seal: i=1; s=arc-2022; d=mailchannels.net; t=1652710355; a=rsa-sha256;
 cv=none;
 b=pxuvqQvPA8LX0W+wLJHang5c4M3ao4hvcPgwuIpVH5BFExriqqaplm55Fbxh8CejYHFvfp
 9JZgYRZ+8VutEtIY/viPsqiWcAm3zMQv/XdBuEFQjxrt8ahsDLbAMmKmy4tvgmTiyBA4c4
 MFZoeSuqRYw8D04f89tnwmPwKfurKtag34IOQo6uOMmuCggt80GuwE6bD4nDMEikTqvuRm
 yqtnbaEmLjlYlAQj8ZbVeFbluN3xy5rJB9nb7J46DT+NL6hE5qoLDTlggxmmnDCGcmL6i+
 Pbmga+wnhlB/OVmpWVrZhGT2wY5v2OPmraU3JW8bxx+bdmvGh3T4xX39XQsGZQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=mailchannels.net; s=arc-2022; t=1652710355;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references:dkim-signature;
 bh=CGUPGMXdSiIj38Y4Ku/ZocDeCkJA9JSyG0hPyS+l04Q=;
 b=E2CzCugMwLhyJPd5lOLrb5QrkbVA4i8GwPhMsAXdI4z8QnAR+xcevaf+d15HN5rDLmA6q3
 Jn1HDDt930yF28sOqwPXMLzoPMdU3lgCB2X6nmSXYCY9XvM4c+59ouzJv8gnXVbz9/OiuC
 sCyzKq+scfH3ycMYDQbcKK8V1b10xKgmVmDcZa2NK87GvRH8qXHBhBTfuHqHX3NV6Wd0Lm
 lWm0cWugk8tj5TDJU4a9DOIOG3r6Iw5YIR0TOYVjRthIL51uiMdZ+FkSW/tCdet8+RqHgi
 nR7wx4CVY27P+vMIZDIvYkEu/HMyxw0cpFwdojSRJaDd8IG/asyDoE8XqiItCA==
ARC-Authentication-Results: i=1; rspamd-6fcfc4d76-rnszq;
 auth=pass smtp.auth=dreamhost smtp.mailfrom=siddhesh@gotplt.org
X-Sender-Id: dreamhost|x-authsender|siddhesh@gotplt.org
X-MC-Relay: Neutral
X-MailChannels-SenderId: dreamhost|x-authsender|siddhesh@gotplt.org
X-MailChannels-Auth-Id: dreamhost
X-Dime-Tangy: 09ed3897719bfb4c_1652710356411_1589544384
X-MC-Loop-Signature: 1652710356411:3729483400
X-MC-Ingress-Time: 1652710356410
Received: from pdx1-sub0-mail-a305.dreamhost.com (pop.dreamhost.com
 [64.90.62.162]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384)
 by 100.124.238.93 (trex/6.7.1); Mon, 16 May 2022 14:12:36 +0000
Received: from [192.168.1.174] (unknown [1.186.223.88])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits))
 (No client certificate requested)
 (Authenticated sender: siddhesh@gotplt.org)
 by pdx1-sub0-mail-a305.dreamhost.com (Postfix) with ESMTPSA id 4L21RP5j42z36; 
 Mon, 16 May 2022 07:12:33 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gotplt.org;
 s=dreamhost; t=1652710355;
 bh=CGUPGMXdSiIj38Y4Ku/ZocDeCkJA9JSyG0hPyS+l04Q=;
 h=Date:Subject:To:Cc:From:Content-Type:Content-Transfer-Encoding;
 b=ZBUgC+xySh7NZp6uHmoB1Tf5puT/EIJ2TSImuSriBIoh/9LnatwuIXvIzGzIE2QVW
 zBiHeLEnFF2FqFyuS6Q29RTtXFx91f8OAXgFlDLrqMFII/g71u6fdwxst5meQsgoik
 4YLnad8cU9VFyHieJhfGc+/jPgBc6BWHoWiOlEKgAbP2dMGepAfnBD1M4+gO3UdPHS
 +cHqu3z7mNpUl++jEdfN8xl/GHJmD4B+ShGNKsED0sxHrXz3OFXMsmJSSXFJ+iMzJi
 zOFt7oAb/we1YXbBBme+gGZm2xAfmgGnY+lNQ5R/Vcza/b/FaJ993I0DX31DvM4CWR
 tFEhgrDK9+YSQ==
Message-ID: <1b419b02-0dee-813b-de4c-1fdc0779174a@gotplt.org>
Date: Mon, 16 May 2022 19:42:27 +0530
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.8.0
Subject: Re: [PATCH v8 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h
Content-Language: en-US
To: Noah Goldstein <goldstein.w.n@gmail.com>, libc-alpha@sourceware.org
Cc: Alexander Monakov <amonakov@ispras.ru>
References: <20220414041231.926415-1-goldstein.w.n@gmail.com>
 <20220511030635.154689-1-goldstein.w.n@gmail.com>
 <20220511030635.154689-6-goldstein.w.n@gmail.com>
From: Siddhesh Poyarekar <siddhesh@gotplt.org>
In-Reply-To: <20220511030635.154689-6-goldstein.w.n@gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-3036.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, NICE_REPLY_A,
 RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 16 May 2022 14:12:46 -0000

On 11/05/2022 08:36, Noah Goldstein via Libc-alpha wrote:
> Unroll slightly and enforce good instruction scheduling. This improves
> performance on out-of-order machines. Note the unrolling allows
> for pipelined multiplies which helps a bit, but most of the gain
> is from enforcing better instruction scheduling for more ILP.
> Unrolling further started to induce slowdowns for sizes [0, 4]
> but can help the loop so if larger sizes are the target further
> unrolling can be beneficial.
> 
> Results for _dl_new_hash
> Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
> 
> Time as Geometric Mean of N=30 runs
> Geometric of all benchmark New / Old: 0.674
>    type, length, New Time, Old Time, New Time / Old Time
>   fixed,      0,    2.865,     2.72,               1.053
>   fixed,      1,    3.567,    2.489,               1.433
>   fixed,      2,    2.577,    3.649,               0.706
>   fixed,      3,    3.644,    5.983,               0.609
>   fixed,      4,    4.211,    6.833,               0.616
>   fixed,      5,    4.741,    9.372,               0.506
>   fixed,      6,    5.415,    9.561,               0.566
>   fixed,      7,    6.649,   10.789,               0.616
>   fixed,      8,    8.081,   11.808,               0.684
>   fixed,      9,    8.427,   12.935,               0.651
>   fixed,     10,    8.673,   14.134,               0.614
>   fixed,     11,    10.69,   15.408,               0.694
>   fixed,     12,   10.789,   16.982,               0.635
>   fixed,     13,   12.169,   18.411,               0.661
>   fixed,     14,   12.659,   19.914,               0.636
>   fixed,     15,   13.526,   21.541,               0.628
>   fixed,     16,   14.211,   23.088,               0.616
>   fixed,     32,   29.412,   52.722,               0.558
>   fixed,     64,    65.41,  142.351,               0.459
>   fixed,    128,  138.505,  295.625,               0.469
>   fixed,    256,  291.707,  601.983,               0.485
> random,      2,   12.698,   12.849,               0.988
> random,      4,   16.065,   15.857,               1.013
> random,      8,   19.564,   21.105,               0.927
> random,     16,   23.919,   26.823,               0.892
> random,     32,   31.987,   39.591,               0.808
> random,     64,   49.282,   71.487,               0.689
> random,    128,    82.23,  145.364,               0.566
> random,    256,  152.209,  298.434,                0.51
> 
> Co-authored-by: Alexander Monakov <amonakov@ispras.ru>
> ---
>   elf/dl-new-hash.h | 66 +++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 61 insertions(+), 5 deletions(-)
> 
> diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h
> index 40d88c81f9..5269f6eb98 100644
> --- a/elf/dl-new-hash.h
> +++ b/elf/dl-new-hash.h
> @@ -20,15 +20,71 @@
>   #define _DL_NEW_HASH_H 1
>   
>   #include <stdint.h>
> +/* For __glibc_unlikely.  */
> +#include <sys/cdefs.h>
> +
> +/* The simplest implementation of _dl_new_hash is:
> +
> +_dl_new_hash (const char *s)
> +{
> +   uint32_t h = 5381;
> +   for (unsigned char c = *s; c != '\0'; c = *++s)
> +     h = h * 33 + c;
> +   return h;
> +}
> +
> +   We can get better performance, however, by slightly unrolling the
> +   loop and explicitly specifying order of operations to prevent
> +   reassosiation of instructions and ensure ideal scheduling.  */
>   
>   static inline uint32_t
>   __attribute__ ((unused))
> -_dl_new_hash (const char *s)
> +_dl_new_hash (const char *str)
>   {
> -  uint32_t h = 5381;
> -  for (unsigned char c = *s; c != '\0'; c = *++s)
> -    h = h * 33 + c;
> -  return h;
> +  const unsigned char *s = (const unsigned char *) str;
> +  unsigned int h = 5381;
> +  unsigned int c0, c1;
> +  for (;;)
> +    {
> +      c0 = s[0];
> +      /* Unlikely length zero string so evens will be slightly less
> +	 common.  */
> +      if (__glibc_unlikely (c0 == 0))
> +	return h;
> +
> +      c1 = s[1];
> +      if (c1 == 0)
> +	{
> +	  c0 += h;
> +	  /* Ideal instruction scheduling is:
> +	 c0 += h;
> +	 h *= 32;
> +	 h += c0;
> +
> +	 The asm statements are to prevent reassosiation that would result in
> +	 more instruction interdependencies and worse scheduling.  */
> +	  asm("" : "+r"(h) : "r"(c0));

There are a couple of things that seem problematic to me about this:

- It seems like we're trying to fix a gcc issue in glibc.  Couldn't we 
file a gcc bug and explore ways in which this could be supported in the 
compiler?  In fact, it might make sense to do that for the original 
loop; it looks like a missed optimization that gcc ought to fix.  IMO 
the bug should be filed even if we do end up with this 
micro-optimization in glibc.

- The patch controls an instruction schedule so that it works well on 
out-of-order processors but then only quoting one microarchitecture.  If 
it works well on TigerLake (and on x86 in general) then it might be 
better to add it as a sysdep override; I assumed that was the point of 
breaking the function out into its header anyway.  If it is more 
generally useful then please share numbers to that effect in the commit 
message and also explicitly state in the comments why we're trying to 
exert this level of control on codegen in generic C code and why it is 
good for all architectures.

> +	  h = h * 32 + c0;
> +	  return h;
> +	}
> +
> +      /* Ideal instruction scheduling is:
> +	 c1 += c0;
> +	 h *= 33 * 33;
> +	 c0 *= 32;
> +	 c1 += c0;
> +	 h  += c1;
> +
> +	 The asm statements are to prevent reassosiation that would result in
> +	 more instruction interdependencies and worse scheduling.  */
> +      c1 += c0;
> +      asm("" : "+r"(c1), "+r"(c0));
> +      h *= 33 * 33;
> +      c1 += c0 * 32;
> +      asm("" : "+r"(c1));
> +      h += c1;
> +      s += 2;
> +    }
>   }
>   
>   

Thanks,
Siddhesh