Date: Wed, 11 May 2022 16:26:05 +0300 (MSK)
From: Alexander Monakov
To: Paul Zimmermann
cc: gcc-help@gcc.gnu.org, stephane.glondu@inria.fr, marc.glisse@inria.fr, sibid@uvic.ca
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99

On Fri, 6 May 2022, Alexander Monakov wrote:

> The primary issue here is false dependency on vcvtss2sd instruction. In the
> snippet shown in Stéphane's email, the slower variant begins with
>
>   vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
>
> The cvtss2sd instruction is specified to take the upper bits of SSE register
> unmodified, so here it merges high bits of xmm1 with results of float->double
> conversion (in low bits) into new xmm1. Unless the CPU can track dependencies
> separately for vector register components, it has to delay this instruction
> until the previous computation that modified xmm1 has completed (AMD Zen2 is
> an example of a microarchitecture that apparently can).

For future reference, my statement in parentheses was a bit inaccurate: Zen 2
avoids the false dependency provided that xmm1 carries all-zeroes in its high
bits after being idiomatically zeroed (i.e. via pxor). Thanks to Andreas Abel
for pointing out that there is a limitation.

(Nevertheless, the "blessed" state seemingly survives context switches, so it
is quite useful, including for this testcase.)

Alexander
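P.S. To make the merge semantics discussed above concrete, here is a minimal sketch in C using the SSE2 intrinsic _mm_cvtss_sd, which maps to cvtss2sd. The helper names (convert_merge, convert_zeroed, merge_semantics_hold) are made up for illustration and are not taken from the testcase in this thread. The sketch only checks the architectural result, namely that the high half of the destination is carried over from its previous contents. It does not measure timing and says nothing about how any particular CPU tracks the dependency.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtss_sd, _mm_set_ss, _mm_setzero_pd */

/* Hypothetical helper: convert f to double in the low lane.
 * The high lane is copied from prev, so the result depends on
 * whatever computation last produced prev. */
static __m128d convert_merge(__m128d prev, float f)
{
    return _mm_cvtss_sd(prev, _mm_set_ss(f));
}

/* Hypothetical helper: same conversion, but merging into a zeroed
 * register, so the result does not depend on older register contents.
 * Whether the compiler emits a pxor-style zeroing idiom right before
 * the conversion is an assumption; it is up to the compiler. */
static __m128d convert_zeroed(float f)
{
    return _mm_cvtss_sd(_mm_setzero_pd(), _mm_set_ss(f));
}

/* Returns 1 if the observed lanes match the documented merge behaviour. */
static int merge_semantics_hold(void)
{
    double merged[2], zeroed[2];

    /* _mm_set_pd(hi, lo): high lane = 42.0, low lane = 7.0 */
    _mm_storeu_pd(merged, convert_merge(_mm_set_pd(42.0, 7.0), 1.5f));
    _mm_storeu_pd(zeroed, convert_zeroed(1.5f));

    return merged[0] == 1.5 && merged[1] == 42.0   /* high lane preserved */
        && zeroed[0] == 1.5 && zeroed[1] == 0.0;   /* high lane is zero   */
}
```

Calling merge_semantics_hold() from a small main() on an x86-64 build should return 1. The point of the second helper is that starting from a zeroed register removes the need to wait for an earlier writer of the destination, which is the situation the correction above is about.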