From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <amonakov@ispras.ru>
Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84])
 by sourceware.org (Postfix) with ESMTPS id AF16A385842B
 for <libc-alpha@sourceware.org>; Wed, 27 Apr 2022 15:02:08 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AF16A385842B
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=ispras.ru
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru
Received: from [10.10.3.121] (unknown [10.10.3.121])
 by mail.ispras.ru (Postfix) with ESMTPS id 0A4F04076267;
 Wed, 27 Apr 2022 15:02:06 +0000 (UTC)
Date: Wed, 27 Apr 2022 18:02:05 +0300 (MSK)
From: Alexander Monakov <amonakov@ispras.ru>
To: Noah Goldstein <goldstein.w.n@gmail.com>
cc: libc-alpha@sourceware.org
Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h
In-Reply-To: <20220425163601.3670626-6-goldstein.w.n@gmail.com>
Message-ID: <991f2272-4ae6-b06b-ceb7-184e5d369118@ispras.ru>
References: <20220414041231.926415-1-goldstein.w.n@gmail.com>
 <20220425163601.3670626-1-goldstein.w.n@gmail.com>
 <20220425163601.3670626-6-goldstein.w.n@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Wed, 27 Apr 2022 15:02:11 -0000

On Mon, 25 Apr 2022, Noah Goldstein via Libc-alpha wrote:

> Unroll slightly so some of the multiples can be pipelined on out-order
> machines. Unrolling further started to induce slowdowns for sizes
> [0, 4] but can help the loop so if larger sizes are the target
> further unrolling can be beneficial.

Note, the original algorithm does not need a literal multiplication (with
3cyc latency), as h*33 == (h << 5) + h:

> -  for (unsigned char c = *s; c != '\0'; c = *++s)
> -    h = h * 33 + c;

in musl we spell this out as 'h += h*32 + c' to avoid GCC emitting a
multiplication at -Os.

In 'h = (h + c) + (h << 5)' critical path has latency of only 2 cycles,
and (h + c) goes independent of (h << 5).

(if the original loop is implemented with a multiplication, its critical
patch has 4-cycle latency, 3cyc from the multiplication plus 1 from the
addition)

However, when you reroll the loop and overlap two iterations, multiplication
by 33*33 no longer has this nice property and runs with two 4cyc paths
overlapped (so effective critical path is the same as original).

Alexander