From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <amonakov@ispras.ru>
Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84])
 by sourceware.org (Postfix) with ESMTPS id 591883858D3C
 for <libc-alpha@sourceware.org>; Thu,  5 May 2022 19:37:26 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 591883858D3C
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=ispras.ru
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru
Received: from [10.10.3.121] (unknown [10.10.3.121])
 by mail.ispras.ru (Postfix) with ESMTPS id 5B28F40D4004;
 Thu,  5 May 2022 19:37:20 +0000 (UTC)
Date: Thu, 5 May 2022 22:37:20 +0300 (MSK)
From: Alexander Monakov <amonakov@ispras.ru>
To: Noah Goldstein <goldstein.w.n@gmail.com>
cc: GNU C Library <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h
In-Reply-To: <CAFUsyfJqs3seMX0VVHKxoCaP68eobBP1XV6YhKG4rMRWjUPh3g@mail.gmail.com>
Message-ID: <9e22ae56-109d-8026-991-5c6af7b1730@ispras.ru>
References: <20220414041231.926415-1-goldstein.w.n@gmail.com>
 <20220425163601.3670626-1-goldstein.w.n@gmail.com>
 <20220425163601.3670626-6-goldstein.w.n@gmail.com>
 <991f2272-4ae6-b06b-ceb7-184e5d369118@ispras.ru>
 <CAFUsyfKeocq4VAusvnggq-NR=tOQTjrD0Z6r3CYCTjGQ=tGGSw@mail.gmail.com>
 <f54f1ec9-fc31-283f-bce9-59fd8bda98ad@ispras.ru>
 <CAFUsyfJS7NyL5cx9O3whMxg2+mbpyJAeTTkhCU3D20uAJzHvAQ@mail.gmail.com>
 <6d61305f-58-705b-bfb8-cfa243c41d3e@ispras.ru>
 <dc462cf-8f1d-e7c8-8a7f-b5face4ed25@ispras.ru>
 <741653a-20e0-6dc4-d5ce-c826677cf36@ispras.ru>
 <CAFUsyfJweGQVV7YhWYGa8vM45rWHfE+btJ0kb7yQi1+x8gw0zg@mail.gmail.com>
 <21d761a-a6ef-e9a2-1a38-a592503c81db@ispras.ru>
 <CAFUsyfJqs3seMX0VVHKxoCaP68eobBP1XV6YhKG4rMRWjUPh3g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Thu, 05 May 2022 19:37:28 -0000

On Thu, 5 May 2022, Noah Goldstein wrote:
> Okay that makes sense and indeed results in a substantial improvement.
> 
> Totally happy with going with your version.
> 
> Think there is still some benefit to the unrolled version because
> 
> 1) It's less eager about hitting the LSD on newer processors
> (but that's really only an issue for strings > ~24 characters).
> 
> 2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
> and the branches are distributed between `p0` and `p6` instead of
> always on `p6`.
> 
> 3) It still saves a few uops (although imul vs `add + shl` isn't really
> a meaningful save).

Agreed; let me point out though that your variant is preferable provided that
multiplication is at most 3 cycles (or integer multiply-add is at most 4).
That's a given on not-too-old x86, but I'm not sure how things stand elsewhere.

On the other hand, if a 3-cycle multiply-add is available, you variant is
strongly preferable (this is generic code so we should try to think a bit
about other architectures).

I'd recommend to reword commit message of your patch 6/6 so it properly
explains that the apparent 2x speedup is due to baseline case hitting
two issues that slow it down by, well, almost 2x.

> Either way it will be an improvement.
> 
> Little benchmark: https://godbolt.org/z/G6PvW4eTr
> 
> Generally see hash2 winning the most.

I noticed that hash2 needs more magic empty asms, because the compiler may
reassociate 'h = h + (32 * c0 + c1);' to 'h = (h + c1) + 32 * c0' which
increases the dependency chain. I adjusted it like this:

        c1 += c0;
        asm("" : "+r"(c1), "+r"(c0));
        h *= 33 * 33;
        c1 += c0 * 32;
        asm("" : "+r"(c1));
        h += c1;
        s += 2;

Alexander