Date: Thu, 28 Apr 2022 21:03:37 +0300 (MSK)
From: Alexander Monakov
To: Noah Goldstein
cc: GNU C Library
Subject: Re: [PATCH v3 6/6] elf: Optimize _dl_new_hash in dl-new-hash.h
Message-ID: <6d61305f-58-705b-bfb8-cfa243c41d3e@ispras.ru>

On Wed, 27 Apr 2022, Noah Goldstein via Libc-alpha wrote:

> I think it is the way you're doing your analysis as a loop-carried
> dependency. I.e really 7c per iteration with no unroll (although
> its fair the loads on address can speculate ahead so it will
> indeed be faster) vs 9c per 2x iterations.

Hm?
Right, the CPU will issue loads speculatively, so you shouldn't count load
latency as part of the critical path.

I don't understand how you get a 2x improvement on long strings. Did you run
the benchmark with rdtscp timing, i.e. with

    make USE_RDTSCP=1 bench

?

Alexander
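
To make the critical-path argument concrete, here is a minimal sketch of the
two loop shapes under discussion. It assumes the usual GNU symbol hash
(h = h * 33 + c, seeded with 5381). The names hash_simple and hash_unrolled2
are made up for illustration, and the unrolled variant is not the code from
the patch, only one plausible way to regroup the arithmetic.

```c
#include <stdint.h>

/* Byte-at-a-time form.  The loop-carried dependency runs through h only:
   one multiply and one add per byte.  The load of the next byte depends
   on the pointer, not on h, so the CPU can issue it ahead of time.  */
static uint32_t
hash_simple (const char *s)
{
  uint32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h;
}

/* Hypothetical 2x-unrolled form (an assumption, not the patch itself).
   (h * 33 + c0) * 33 + c1 is regrouped as h * (33 * 33) + (c0 * 33 + c1),
   so c0 * 33 + c1 is computed off the chain through h and that chain is
   still one multiply and one add, now per two bytes.  */
static uint32_t
hash_unrolled2 (const char *s)
{
  const unsigned char *p = (const unsigned char *) s;
  uint32_t h = 5381;
  while (p[0] != '\0')
    {
      uint32_t c0 = p[0];
      /* Safe to read: p[0] is nonzero, so p[1] is at worst the terminator.  */
      uint32_t c1 = p[1];
      if (c1 == '\0')
        return h * 33 + c0;     /* Odd-length tail: one ordinary step.  */
      h = h * (33 * 33) + (c0 * 33 + c1);
      p += 2;
    }
  return h;
}
```

Both functions return the same value for any NUL-terminated string, since
unsigned 32-bit arithmetic wraps identically in either grouping. What changes
is how many bytes are consumed per trip around the dependency chain through h,
which is the quantity to compare when counting cycles.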