From: Paul Zimmermann
To: Alexander Monakov
Cc: gcc-help@gcc.gnu.org, stephane.glondu@inria.fr, marc.glisse@inria.fr, sibid@uvic.ca
Date: Sat, 07 May 2022 08:11:36 +0200
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99
In-Reply-To: <2b4e81-fee-e79-5ea0-bf658f20b4c2@ispras.ru> (message from Alexander Monakov on Fri, 6 May 2022 12:27:39 +0300 (MSK))

Thank you very much, Alexander, for your analysis and the Bugzilla report!

Paul

> Date: Fri, 6 May 2022 12:27:39 +0300 (MSK)
> From: Alexander Monakov
>
> On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:
>
> > here are latency metrics (still on i5-4590):
> >
> >             | gcc-9 | gcc-10 | gcc-11 |
> > ------------|-------|--------|--------|
> > -std=c99    | 70.8  | 70.3   | 70.2   |
> > -std=gnu18  | 59.5  | 59.5   | 59.5   |
> >
> > It thus seems the issue only appears for the reciprocal throughput.
>
> Thanks.
>
> The primary issue here is a false dependency on the vcvtss2sd instruction. In the
> snippet shown in Stéphane's email, the slower variant begins with
>
>     vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
>
> The cvtss2sd instruction is specified to take the upper bits of the SSE register
> unmodified, so here it merges the high bits of xmm1 with the result of the
> float->double conversion (in the low bits) into the new xmm1. Unless the CPU can
> track dependencies separately for vector register components, it has to delay
> this instruction until the previous computation that modified xmm1 has completed
> (AMD Zen2 is an example of a microarchitecture that apparently can).
>
> This limits the degree to which separate cr_log10f calls can overlap, affecting
> throughput. In latency measurements, the calls are already serialized by the
> dependency over xmm0, so the additional false dependency does not matter.
>
> (so fma is a "red herring", it's just that depending on compiler version and
> flags, register allocation will place the last assignment into xmm1 differently)
>
> If you want to experiment, you can hand-edit the assembly to replace the
> problematic instruction with variants that avoid the false dependency, such as
>
>     vcvtss2sd %xmm0, %xmm0, %xmm1
>
> or
>
>     vpxor %xmm1, %xmm1, %xmm1
>     vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
>
> GCC has code to do this automatically, but for some reason it doesn't work for
> your function. I have reported it to the Bugzilla:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504
>
> Alexander
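
The merge behaviour described in the quoted analysis can be sketched in plain C. This is only an illustrative model, not code from the thread or from GCC: the type xmm_model and the functions model_vcvtss2sd, model_vpxor_self and check_merge_semantics are hypothetical names, and the model covers only the low 128 bits of the register. It shows why the result of vcvtss2sd carries data from the old value of its first source register, and why zeroing that register first (or using a register that is already ready) removes the dependence on the stale value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the low 128 bits of an XMM register as two 64-bit lanes. */
typedef struct {
    uint64_t lo; /* bits 63:0   */
    uint64_t hi; /* bits 127:64 */
} xmm_model;

/* Return the raw bit pattern of a double. */
static uint64_t bits_of_double(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Model of "vcvtss2sd src2, src1, dst" (AT&T operand order):
 * the low lane receives the float from src2 converted to double,
 * the high lane is copied from src1. The copy of src1.hi is what
 * makes the instruction wait for the previous writer of src1. */
static xmm_model model_vcvtss2sd(xmm_model src1, float src2) {
    xmm_model dst;
    dst.lo = bits_of_double((double)src2);
    dst.hi = src1.hi;
    return dst;
}

/* Model of "vpxor reg, reg, reg": the result is all zeros whatever
 * the old contents were, so it does not depend on the stale value. */
static xmm_model model_vpxor_self(void) {
    xmm_model zero = { 0, 0 };
    return zero;
}

/* Small self-check of the two cases discussed in the thread. */
static void check_merge_semantics(void) {
    xmm_model stale = { 0x1111u, 0x2222u };

    /* Slow variant: the high lane of the result comes from the stale register. */
    xmm_model merged = model_vcvtss2sd(stale, 1.5f);
    assert(merged.lo == bits_of_double(1.5));
    assert(merged.hi == stale.hi);

    /* Workaround: zero the register first, so nothing stale is merged in. */
    xmm_model fresh = model_vcvtss2sd(model_vpxor_self(), 1.5f);
    assert(fresh.lo == bits_of_double(1.5));
    assert(fresh.hi == 0);
}
```

Calling check_merge_semantics() from a test harness exercises both cases. The model says nothing about timing; it only makes visible the data flow that forces the hardware to order the conversion after the previous write to xmm1.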