From: Stéphane Glondu
To: Alexander Monakov, gcc-help@gcc.gnu.org
Cc: sibid@uvic.ca, Paul Zimmermann
Date: Thu, 5 May 2022 10:57:07 +0200
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99

On 03/05/2022 11:09, Alexander Monakov wrote:
>> Does anyone have a clue?
>
> I can reproduce a difference, but in my case it's simply because in
> -std=gnuXX mode (as opposed to -std=cXX) GCC enables FMA contraction,
> enabling the last few steps in the benchmarked function to use fma
> instead of separate mul/add instructions.
>
> (regarding __builtin_expect, it also makes a small difference in my
> case, it seems GCC generates some redundant code without it, but the
> difference is 10x smaller than what presence/absence of FMA gives)
>
> I think you might be able to figure it out on your end if you run both
> variants under 'perf stat', note how cycle count and instruction counts
> change, and then look at disassembly to see what changed. You can use
> 'perf record' and 'perf report' to easily see the hot code path; if you
> do that, I'd recommend to run it with the same sampling period in both
> cases, e.g. like this:
>
> perf record -e instructions:P -c 500000 ./perf ...

I did that. The hot code path corresponds to (from exp10f.c):

    double a = iln2h*z, ia = __builtin_floor(a), h = (a - ia) + iln2l*z;
    long i = ia, j = i&0xf, e = i - j;
    e >>= 4;
    double s = tb[j];
    b64u64_u su = {.u = (e + 0x3fful)<<52};
    s *= su.f;
    double h2 = h*h;
    double c0 = c[0] + h*c[1];
    double c2 = c[2] + h*c[3];
    double c4 = c[4] + h*c[5];
    c0 += h2*(c2 + h2*c4);
    double w = s*h;
    return s + w*c0;

With -std=c99, where the overall performance is 22 cycles, I get:

   4,03 │ 3a: vcvtss2sd -0x4(%rsp),%xmm0,%xmm0
   0,01 │     vmulsd ir.4+0x38,%xmm0,%xmm1
        │     vmulsd ir.4+0x40,%xmm0,%xmm0
   0,01 │     lea tb.1,%rdx
   3,06 │     vroundsd $0x9,%xmm1,%xmm1,%xmm2
   0,03 │     vsubsd %xmm2,%xmm1,%xmm1
  10,42 │     vcvttsd2si %xmm2,%rax
   0,01 │     vaddsd %xmm0,%xmm1,%xmm1
        │     mov %rax,%rcx
   0,02 │     vmulsd ir.4+0x58,%xmm1,%xmm0
   0,38 │     vmulsd %xmm1,%xmm1,%xmm5
   0,00 │     vmulsd ir.4+0x68,%xmm1,%xmm4
        │     sar $0x4,%rax
   0,00 │     add $0x3ff,%rax
   1,17 │     vaddsd ir.4+0x60,%xmm0,%xmm0
        │     shl $0x34,%rax
   0,02 │     vaddsd ir.4+0x70,%xmm4,%xmm4
   0,10 │     vmulsd %xmm5,%xmm0,%xmm0
   0,85 │     vmulsd ir.4+0x48,%xmm1,%xmm3
   0,00 │     and $0xf,%ecx
        │     vmovq %rax,%xmm6
   1,20 │     vmulsd (%rdx,%rcx,8),%xmm6,%xmm2
   0,65 │     vaddsd %xmm4,%xmm0,%xmm0
   0,00 │     vaddsd ir.4+0x50,%xmm3,%xmm3
   3,49 │     vmulsd %xmm5,%xmm0,%xmm0
  15,59 │     vmulsd %xmm2,%xmm1,%xmm1
   4,61 │     vaddsd %xmm3,%xmm0,%xmm0
  10,24 │     vmulsd %xmm1,%xmm0,%xmm0
  11,31 │     vaddsd %xmm2,%xmm0,%xmm0
  23,21 │     vcvtsd2ss %xmm0,%xmm0,%xmm0
   0,00 │   ← ret

With -std=gnu18, where the overall performance is 36 cycles, I get:

   0,02 │ 3a: vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
   0,01 │     vmulsd ir.4+0x40,%xmm1,%xmm0
        │     vmovsd ir.4+0x60,%xmm5
        │     vmovsd ir.4+0x50,%xmm4
        │     lea tb.1,%rdx
   0,13 │     vroundsd $0x9,%xmm0,%xmm0,%xmm2
   0,83 │     vsubsd %xmm2,%xmm0,%xmm0
  28,99 │     vcvttsd2si %xmm2,%rax
  63,49 │     vfmadd132sd 0x961(%rip),%xmm0,%xmm1
        │     vmovsd ir.4+0x70,%xmm0
        │     mov %rax,%rcx
        │     sar $0x4,%rax
   2,73 │     add $0x3ff,%rax
   1,99 │     vmulsd %xmm1,%xmm1,%xmm3
   0,00 │     vfmadd213sd 0x95f(%rip),%xmm1,%xmm5
   0,00 │     vfmadd213sd 0x966(%rip),%xmm1,%xmm0
        │     shl $0x34,%rax
        │     and $0xf,%ecx
        │     vmovq %rax,%xmm6
   0,17 │     vmulsd (%rdx,%rcx,8),%xmm6,%xmm2
        │     vfmadd213sd 0x92c(%rip),%xmm1,%xmm4
   0,04 │     vfmadd132sd %xmm3,%xmm5,%xmm0
   0,64 │     vmulsd %xmm2,%xmm1,%xmm1
   0,01 │     vfmadd132sd %xmm3,%xmm4,%xmm0
   0,46 │     vfmadd132sd %xmm1,%xmm2,%xmm0
   0,27 │     vcvtsd2ss %xmm0,%xmm0,%xmm0
        │   ← ret

The distribution of time is very different between the two cases: in the
first case, most of the time is spent at the end (computing w and the
return value, I suppose), whereas in the second case, most of the time
is spent in the first multiply-and-add (computing h). I do not
understand this change of behaviour.

Cheers,

-- 
Stéphane