Date: Fri, 6 May 2022 12:27:39 +0300 (MSK)
From: Alexander Monakov
To: Paul Zimmermann
cc: gcc-help@gcc.gnu.org, stephane.glondu@inria.fr, marc.glisse@inria.fr, sibid@uvic.ca
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99
Message-ID: <2b4e81-fee-e79-5ea0-bf658f20b4c2@ispras.ru>

On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:

> here are latency metrics (still on i5-4590):
>
>             | gcc-9 | gcc-10 | gcc-11 |
> ------------|-------|--------|--------|
> -std=c99    | 70.8  | 70.3   | 70.2   |
> -std=gnu18  | 59.5  | 59.5   | 59.5   |
>
> It thus seems the issue
> only appears for the reciprocal throughput.

Thanks. The primary issue here is a false dependency on the vcvtss2sd
instruction. In the snippet shown in Stéphane's email, the slower variant
begins with

    vcvtss2sd -0x4(%rsp),%xmm1,%xmm1

The cvtss2sd instruction is specified to leave the upper bits of the SSE
register unmodified, so here it merges the high bits of xmm1 with the result
of the float->double conversion (in the low bits) into the new xmm1. Unless
the CPU can track dependencies separately for vector register components
(AMD Zen2 is an example of a microarchitecture that apparently can), it has
to delay this instruction until the previous computation that modified xmm1
has completed.

This limits the degree to which separate cr_log10f calls can overlap,
affecting throughput. In latency measurements, the calls are already
serialized by the dependency over xmm0, so the additional false dependency
does not matter.

(So fma is a "red herring": depending on compiler version and flags,
register allocation places the last assignment to xmm1 differently.)

If you want to experiment, you can hand-edit the assembly to replace the
problematic instruction with variants that avoid the false dependency, such as

    vcvtss2sd %xmm0, %xmm0, %xmm1

or

    vpxor %xmm1, %xmm1, %xmm1
    vcvtss2sd -0x4(%rsp),%xmm1,%xmm1

GCC has code to do this automatically, but for some reason it doesn't work
for your function. I have reported it to the Bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504

Alexander
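
For readers who want to see the merge semantics discussed above spelled out,
here is a minimal sketch in plain C. It is only a toy model: the xmm_model
type and all function names are hypothetical, invented for this illustration,
and it captures which input bits flow into the result, not any timing. It
shows that the conversion result carries the old high half of the register
along, and that zeroing the register first (as the vpxor variant does) makes
the result independent of whatever was there before.

```c
#include <assert.h>

/* Toy model (an assumption for illustration): a 128-bit SSE register
   viewed as two doubles. */
typedef struct {
    double lo;  /* bits 63:0   */
    double hi;  /* bits 127:64 */
} xmm_model;

/* Models "vcvtss2sd mem, src1, dst": the low half is the converted float,
   the high half is copied from src1, so dst depends on src1. */
static xmm_model cvtss2sd_model(xmm_model src1, float x)
{
    xmm_model dst;
    dst.lo = (double)x;
    dst.hi = src1.hi;
    return dst;
}

/* Models "vpxor reg, reg, reg": all zeros, whatever the register held. */
static xmm_model zero_model(void)
{
    xmm_model z = { 0.0, 0.0 };
    return z;
}

/* Returns 1 if the conversion result changes when the previous register
   contents change, 0 otherwise. With zero_first set, the register is
   zeroed before the conversion, as in the vpxor variant. */
static int result_depends_on_old_value(int zero_first)
{
    xmm_model a = { 1.0, 111.0 };  /* two different "previous" values */
    xmm_model b = { 2.0, 222.0 };
    xmm_model ra, rb;

    if (zero_first) {
        a = zero_model();
        b = zero_model();
    }
    ra = cvtss2sd_model(a, 2.5f);
    rb = cvtss2sd_model(b, 2.5f);
    return ra.hi != rb.hi || ra.lo != rb.lo;
}

/* Self-check of the two properties the model is meant to show. */
static void merge_model_selfcheck(void)
{
    assert(result_depends_on_old_value(0) == 1);  /* merge: dependent   */
    assert(result_depends_on_old_value(1) == 0);  /* zeroed: independent */
}
```

Calling merge_model_selfcheck() from a test harness exercises both cases.
On real hardware the dependency shows up as scheduling delay rather than as
a different value, which this model does not attempt to reproduce.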