From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <amonakov@ispras.ru>
Received: from mail.ispras.ru (mail.ispras.ru [83.149.199.84])
 by sourceware.org (Postfix) with ESMTPS id 88C5438346A7
 for <gcc-help@gcc.gnu.org>; Fri,  6 May 2022 09:27:44 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 88C5438346A7
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=ispras.ru
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=ispras.ru
Received: from [10.10.3.121] (unknown [10.10.3.121])
 by mail.ispras.ru (Postfix) with ESMTPS id 7D2A040755D7;
 Fri,  6 May 2022 09:27:39 +0000 (UTC)
Date: Fri, 6 May 2022 12:27:39 +0300 (MSK)
From: Alexander Monakov <amonakov@ispras.ru>
To: Paul Zimmermann <Paul.Zimmermann@inria.fr>
cc: gcc-help@gcc.gnu.org, stephane.glondu@inria.fr, marc.glisse@inria.fr, 
 sibid@uvic.ca
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99
In-Reply-To: <mw7d6zf6iv.fsf@tomate.loria.fr>
Message-ID: <2b4e81-fee-e79-5ea0-bf658f20b4c2@ispras.ru>
References: <mw1qxbc954.fsf@tomate.loria.fr>
 <9f7e3aa9-8d46-1fbb-75b-1c8ad9a667f@ispras.ru>
 <c8517377-e695-06fe-0be4-b7e409d471b9@inria.fr>
 <c1a686bd-d2fa-a934-f931-6bf96f11e3d9@inria.fr>
 <4d36d96-2de9-f8ac-2d52-ea32b1cc6d9@grove.saclay.inria.fr>
 <74dc894-7774-e5bb-81-c5955c94ee4@ispras.ru> <mw7d6zf6iv.fsf@tomate.loria.fr>
MIME-Version: 1.0
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 KAM_NUMSUBJECT, KAM_SHORT, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 8BIT
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: gcc-help@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-help mailing list <gcc-help.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-help>,
 <mailto:gcc-help-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-help/>
List-Post: <mailto:gcc-help@gcc.gnu.org>
List-Help: <mailto:gcc-help-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-help>,
 <mailto:gcc-help-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 06 May 2022 09:27:47 -0000

On Fri, 6 May 2022, Paul Zimmermann via Gcc-help wrote:

> here are latency metrics (still on i5-4590):
> 
>             | gcc-9 | gcc-10 | gcc-11 |
> ------------|-------|--------|--------|
>  -std=c99   | 70.8  | 70.3   | 70.2   |
>  -std=gnu18 | 59.5  | 59.5   | 59.5   |
> 
> It thus seems the issue only appears for the reciprocal throughput.

Thanks.

The primary issue here is a false dependency in the vcvtss2sd instruction. In
the snippet shown in Stéphane's email, the slower variant begins with

    vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1

The cvtss2sd instruction is specified to pass the upper bits of an SSE register
through unmodified. So here it merges the high bits of xmm1 with the result of
the float->double conversion (in the low bits) into the new xmm1. Unless the CPU
can track dependencies separately for each component of a vector register, it
has to delay this instruction until the previous computation that modified xmm1
has completed. AMD Zen2 is apparently an example of a microarchitecture that
can do this.
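To make the merge semantics concrete, here is a small C sketch using the SSE2 intrinsic `_mm_cvtss_sd`, which corresponds to cvtss2sd. The low lane of the result gets the converted float, while the high lane is copied from the first argument. The helper names (`convert_with_merge`, `low_lane`, `high_lane`) are hypothetical and exist only for illustration.

```c
#include <emmintrin.h> /* SSE2: _mm_cvtss_sd, _mm_cvtsd_f64, _mm_unpackhi_pd */

/* Convert float f to double in the low lane; the high lane is taken from
   `merge`. This is the (v)cvtss2sd merge: the result depends on the old
   contents of another register even though only the low lane is "new". */
static __m128d convert_with_merge(__m128d merge, float f)
{
    return _mm_cvtss_sd(merge, _mm_set_ss(f));
}

/* Read back the low (element 0) and high (element 1) double lanes. */
static double low_lane(__m128d v)  { return _mm_cvtsd_f64(v); }
static double high_lane(__m128d v) { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }
```

For example, merging into a register that holds {low = -1.0, high = 42.0} and converting 1.5f yields {low = 1.5, high = 42.0}. The 42.0 is the part carried over from the previous value, and it is the source of the false dependency.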

This limits the degree to which separate cr_log10f calls can overlap, which
affects throughput. In latency measurements, the calls are already serialized by
the dependency through xmm0, so the additional false dependency does not matter.
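Below is a minimal sketch of the two measurement shapes. It assumes a generic float->float function and is not the actual benchmark harness used in this thread, which is not shown here. `placeholder_f`, `latency_chain` and `throughput_run` are hypothetical names, and `placeholder_f` is only a cheap stand-in for cr_log10f. In the latency loop each argument is the previous result, so calls cannot overlap anyway. In the throughput loop the inputs are independent, so only something like the false dependency above can serialize them.

```c
#include <stddef.h>

/* Hypothetical stand-in for the function under test (cr_log10f in the thread).
   Deliberately cheap and libm-free; substitute the real function to measure it. */
static float placeholder_f(float x)
{
    return 0.5f * x + 0.25f;
}

/* Latency shape: each call's argument is the previous call's result, so the
   calls are serialized through the return value (xmm0 on x86-64). */
static float latency_chain(float (*f)(float), float x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x = f(x);
    return x;
}

/* Reciprocal-throughput shape: inputs are independent, so consecutive calls
   may overlap in the pipeline unless a false dependency (such as a merge into
   a register written by the previous call) makes them wait for each other. */
static void throughput_run(float (*f)(float), const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
}
```

To get latency and reciprocal throughput respectively, time many iterations of each loop and divide by the iteration count. With the real function substituted, only the second figure is expected to be sensitive to the xmm1 merge.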

(So fma is a red herring: depending on compiler version and flags, register
allocation simply places the last write to xmm1 differently.)

If you want to experiment, you can hand-edit the assembly to replace the
problematic instruction with variants that avoid the false dependency, such as

    vcvtss2sd %xmm0, %xmm0, %xmm1

or

    vpxor %xmm1, %xmm1, %xmm1
    vcvtss2sd   -0x4(%rsp),%xmm1,%xmm1
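The second variant can be approximated from C with intrinsics. This is a hedged sketch, not the GCC fix tracked in the bug report below. Passing an explicitly zeroed first operand to `_mm_cvtss_sd` means the upper lane no longer comes from a previously computed value. The compiler is still free to pick its own instruction sequence, so inspect the generated assembly (for example with `gcc -O2 -S`) before relying on it. `convert_from_zero` is a hypothetical helper name.

```c
#include <emmintrin.h> /* SSE2: _mm_cvtss_sd, _mm_setzero_pd, _mm_set_ss */

/* Convert f to double with the upper lane merged from an all-zero register.
   A zeroing idiom (pxor/xorpd) has no input dependency, so the conversion no
   longer waits for whatever last wrote the merge register. Whether the
   compiler actually emits pxor + cvtss2sd here is not guaranteed. */
static __m128d convert_from_zero(float f)
{
    return _mm_cvtss_sd(_mm_setzero_pd(), _mm_set_ss(f));
}
```

The result has the converted value in the low lane and 0.0 in the high lane, independent of any earlier register contents.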

GCC has code to do this automatically, but for some reason it does not work for
your function. I have reported it to Bugzilla:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504

Alexander