From: Stéphane Glondu
To: Alexander Monakov, gcc-help@gcc.gnu.org
Cc: sibid@uvic.ca, Paul Zimmermann
Date: Thu, 5 May 2022 10:57:07 +0200
Subject: Re: slowdown with -std=gnu18 with respect to -std=c99

On 03/05/2022 11:09, Alexander Monakov wrote:
>> Does anyone have a clue?
>
> I can reproduce a difference, but in my case it's simply because in
> -std=gnuXX mode (as opposed to -std=cXX) GCC enables FMA contraction,
> enabling the last few steps in the benchmarked function to use fma
> instead of separate mul/add instructions.
>
> (regarding __builtin_expect, it also makes a small difference in my
> case, it seems GCC generates some redundant code without it, but the
> difference is 10x smaller than what presence/absence of FMA gives)
>
> I think you might be able to figure it out on your end if you run both
> variants under 'perf stat', note how cycle count and instruction counts
> change, and then look at disassembly to see what changed. You can use
> 'perf record' and 'perf report' to easily see the hot code path; if you
> do that, I'd recommend to run it with the same sampling period in both
> cases, e.g. like this:
>
> perf record -e instructions:P -c 500000 ./perf ...

I did that. The hot code path corresponds to (from exp10f.c):

    double a = iln2h*z, ia = __builtin_floor(a), h = (a - ia) + iln2l*z;
    long i = ia, j = i&0xf, e = i - j;
    e >>= 4;
    double s = tb[j];
    b64u64_u su = {.u = (e + 0x3fful)<<52};
    s *= su.f;
    double h2 = h*h;
    double c0 = c[0] + h*c[1];
    double c2 = c[2] + h*c[3];
    double c4 = c[4] + h*c[5];
    c0 += h2*(c2 + h2*c4);
    double w = s*h;
    return s + w*c0;

With -std=c99, where the overall performance is 22 cycles, I get:

   4,03 │ 3a: vcvtss2sd -0x4(%rsp),%xmm0,%xmm0
   0,01 │     vmulsd ir.4+0x38,%xmm0,%xmm1
        │     vmulsd ir.4+0x40,%xmm0,%xmm0
   0,01 │     lea tb.1,%rdx
   3,06 │     vroundsd $0x9,%xmm1,%xmm1,%xmm2
   0,03 │     vsubsd %xmm2,%xmm1,%xmm1
  10,42 │     vcvttsd2si %xmm2,%rax
   0,01 │     vaddsd %xmm0,%xmm1,%xmm1
        │     mov %rax,%rcx
   0,02 │     vmulsd ir.4+0x58,%xmm1,%xmm0
   0,38 │     vmulsd %xmm1,%xmm1,%xmm5
   0,00 │     vmulsd ir.4+0x68,%xmm1,%xmm4
        │     sar $0x4,%rax
   0,00 │     add $0x3ff,%rax
   1,17 │     vaddsd ir.4+0x60,%xmm0,%xmm0
        │     shl $0x34,%rax
   0,02 │     vaddsd ir.4+0x70,%xmm4,%xmm4
   0,10 │     vmulsd %xmm5,%xmm0,%xmm0
   0,85 │     vmulsd ir.4+0x48,%xmm1,%xmm3
   0,00 │     and $0xf,%ecx
        │     vmovq %rax,%xmm6
   1,20 │     vmulsd (%rdx,%rcx,8),%xmm6,%xmm2
   0,65 │     vaddsd %xmm4,%xmm0,%xmm0
   0,00 │     vaddsd ir.4+0x50,%xmm3,%xmm3
   3,49 │     vmulsd %xmm5,%xmm0,%xmm0
  15,59 │     vmulsd %xmm2,%xmm1,%xmm1
   4,61 │     vaddsd %xmm3,%xmm0,%xmm0
  10,24 │     vmulsd %xmm1,%xmm0,%xmm0
  11,31 │     vaddsd %xmm2,%xmm0,%xmm0
  23,21 │     vcvtsd2ss %xmm0,%xmm0,%xmm0
   0,00 │   ← ret

With -std=gnu18, where the overall performance is 36 cycles, I get:

   0,02 │ 3a: vcvtss2sd -0x4(%rsp),%xmm1,%xmm1
   0,01 │     vmulsd ir.4+0x40,%xmm1,%xmm0
        │     vmovsd ir.4+0x60,%xmm5
        │     vmovsd ir.4+0x50,%xmm4
        │     lea tb.1,%rdx
   0,13 │     vroundsd $0x9,%xmm0,%xmm0,%xmm2
   0,83 │     vsubsd %xmm2,%xmm0,%xmm0
  28,99 │     vcvttsd2si %xmm2,%rax
  63,49 │     vfmadd132sd 0x961(%rip),%xmm0,%xmm1
        │     vmovsd ir.4+0x70,%xmm0
        │     mov %rax,%rcx
        │     sar $0x4,%rax
   2,73 │     add $0x3ff,%rax
   1,99 │     vmulsd %xmm1,%xmm1,%xmm3
   0,00 │     vfmadd213sd 0x95f(%rip),%xmm1,%xmm5
   0,00 │     vfmadd213sd 0x966(%rip),%xmm1,%xmm0
        │     shl $0x34,%rax
        │     and $0xf,%ecx
        │     vmovq %rax,%xmm6
   0,17 │     vmulsd (%rdx,%rcx,8),%xmm6,%xmm2
        │     vfmadd213sd 0x92c(%rip),%xmm1,%xmm4
   0,04 │     vfmadd132sd %xmm3,%xmm5,%xmm0
   0,64 │     vmulsd %xmm2,%xmm1,%xmm1
   0,01 │     vfmadd132sd %xmm3,%xmm4,%xmm0
   0,46 │     vfmadd132sd %xmm1,%xmm2,%xmm0
   0,27 │     vcvtsd2ss %xmm0,%xmm0,%xmm0
        │   ← ret

The distribution of time is very different between the two cases: in the
first case, most of the time is spent at the end (computing w and the
return value, I suppose), whereas in the second case, most of the time
is spent in the first multiply-and-add (computing h). I do not
understand this change of behaviour.

Cheers,

-- 
Stéphane