From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <macro@embecosm.com>
Received: from mail-wr1-x436.google.com (mail-wr1-x436.google.com
 [IPv6:2a00:1450:4864:20::436])
 by sourceware.org (Postfix) with ESMTPS id 97BC638582B1
 for <gcc-patches@gcc.gnu.org>; Thu, 23 Jun 2022 13:39:34 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 97BC638582B1
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=embecosm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=embecosm.com
Received: by mail-wr1-x436.google.com with SMTP id o8so28034739wro.3
 for <gcc-patches@gcc.gnu.org>; Thu, 23 Jun 2022 06:39:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=embecosm.com; s=google;
 h=date:from:to:cc:subject:in-reply-to:message-id:references
 :user-agent:mime-version;
 bh=Pe7ImCMw36mXSWgZSD0/mFhhmJXoNS51xFi1W3ax/1o=;
 b=XZ20jblwDNTJ39rX6m+m59dvTI4NhbfLX28j+DwQVhdeWBvGQ/hNCUhthcfPYMcqcz
 PNxKL4H80skm2G28Qwn0CuxCif7d0B29+RIt34yLAqbcfeU1K/s7e5gjVHiCuS2uXhog
 6G7IeXs8xb5LkVBMomj9818tl0LcLJsA0caLNNjWDyQ3EUus5b0qxRM3TE1TUvJjk/06
 cRpZ/yLc3er3j+f2KDe06vAk1Jy45pp0jp49QYyf7AW4i6pJ9Wl1CPcxV2bR66jvJYGG
 Dq3qeajnDxIOf9HpsRpkryQIcG3qtz49MpRQ397nr9KhsGlNHIHymrBuRou8XTHr0cjr
 r+6A==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id
 :references:user-agent:mime-version;
 bh=Pe7ImCMw36mXSWgZSD0/mFhhmJXoNS51xFi1W3ax/1o=;
 b=uoVk/fJEoSfbcdMXHklYSbkZfTlZbzXrAqGLvTJwRX5cVjpgniNFZkRfLQsdlx0RPQ
 L/ZR5gvWep+m3UcDC+eSgEHmHrl7qpuso3j/w9piea0I3pUzJ1fNZNo1xU3kKeVkW9mi
 +FNJCYd5TUAyMZIMXtlpuC52twhR8nEEPy8G8/ZbVr+OwIvkN8Ag2O6dZRwJVxxpxmnp
 KTHCxudSpaOCe51zKOAS1Iwd5e6AeKmECiUCmH/MrqLWq2fBiWSTdRbHgHXVinHPN9bc
 YAZP0J88D9TKpQTsva7jCOhJl+uNCXQDOOD07t2/KvzWfc67LW5bp3Y2wJtEBbfUs1xq
 8H4w==
X-Gm-Message-State: AJIora8HexTOelI/eAYIeJeb79ne5uuVJngLoaynmmAEjKs96PpVQggv
 +hgU6KArFzS9LDh5DZTIKZCC2DgfiT5K9w==
X-Google-Smtp-Source: AGRyM1uhwzl+7/MrhsaYGV0/qHeSMEsS3ljJHx1ua/zE4p1e7wul1emTzl2YASjqtFUHWVQBJ5pRtg==
X-Received: by 2002:a05:6000:15c1:b0:21b:ad5d:64dd with SMTP id
 y1-20020a05600015c100b0021bad5d64ddmr1469838wry.642.1655991573163; 
 Thu, 23 Jun 2022 06:39:33 -0700 (PDT)
Received: from tpp.orcam.me.uk (tpp.orcam.me.uk.
 [2001:8b0:154:0:ea6a:64ff:fe24:f2fc])
 by smtp.gmail.com with ESMTPSA id
 t18-20020a05600c199200b003a032c88877sm2661619wmq.15.2022.06.23.06.39.30
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Thu, 23 Jun 2022 06:39:32 -0700 (PDT)
Date: Thu, 23 Jun 2022 14:39:27 +0100 (BST)
From: "Maciej W. Rozycki" <macro@embecosm.com>
To: gcc-patches@gcc.gnu.org
cc: Andrew Waterman <andrew@sifive.com>, Jim Wilson <jim.wilson.gcc@gmail.com>,
 Kito Cheng <kito.cheng@gmail.com>, Palmer Dabbelt <palmer@dabbelt.com>
Subject: Re: [PATCH] RISC-V: Split unordered FP comparisons into individual
 RTL insns
In-Reply-To: <alpine.DEB.2.20.2206082354490.10833@tpp.orcam.me.uk>
Message-ID: <alpine.DEB.2.20.2206220143490.10833@tpp.orcam.me.uk>
References: <alpine.DEB.2.20.2206082354490.10833@tpp.orcam.me.uk>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,
 SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 23 Jun 2022 13:39:37 -0000

On Thu, 9 Jun 2022, Maciej W. Rozycki wrote:

>  I'm yet running some benchmarking to see if the use of UNSPEC_VOLATILEs 
> makes any noticeable performance difference, but I suspect it does not as 
> the compiler could not do much about the original multiple-instruction 
> single RTL insns anyway.

 This has now finally completed.  I used SPECfp 2006 built at `-O3' and 
statically linked, which needs ~33 hours per run with the HiFive Unmatched 
board at its standard 1196MHz clock rate.  Here are the results merged by 
hand from original reports:

               Base     Base     Base     Base   Est Base  Est Base  Est Base   
Benchmarks     Ref.   Run Time Run Time Run Time   Ratio     Ratio     Ratio  
                       (base)  (length) (split)   (base)   (length)   (split)  
------------- ------  -------  -------  -------  --------  --------  -------- 
410.bwaves     13590    10353    10396    10370    1.31      1.31      1.31   
416.gamess     19580     9080     9410     9284    2.16      2.08      2.11   
433.milc        9180     5465     5475     5610    1.68      1.68      1.64   
434.zeusmp      9100     5773     5767     5761    1.58      1.58      1.58   
435.gromacs     7140     3605     3561     3545    1.98      2.00      2.01   
436.cactusADM  11950     7779     7658     7680    1.54      1.56      1.56   
437.leslie3d    9400    10280    10697    10274    0.914     0.879     0.915  
444.namd        8020     3141     3120     3129    2.55      2.57      2.56   
447.dealII     11440     3459     3490     3495    3.31      3.28      3.27   
450.soplex      8340     4698     4899     4781    1.78      1.70      1.74   
453.povray      5320     1953     1922     1916    2.72      2.77      2.78   
454.calculix    8250     4844     4857     4821    1.70      1.70      1.71   
459.GemsFDTD   10610     8703     8957     9028    1.22      1.18      1.18   
465.tonto       9840     4585     4539     4620    2.15      2.17      2.13   
470.lbm        13740    10172    10945    10789    1.35      1.26      1.27   
481.wrf        11170     8516     8646     8584    1.31      1.29      1.30   
482.sphinx3    19490     9240     9267     9280    2.11      2.10      2.10   
==============================================================================

The execution time reference (second column) is for a Sun Ultra Enterprise 
2 system from 1997, based on a 296MHz UltraSPARC II CPU, times are given 
in seconds (lower is better) and the ratios calculated are in relation to 
the reference (higher is better).

In the table above "base" results are with upstream master as at commit 
7b98910406b5 ("c++: ICE with template NEW_EXPR [PR105803]".  Then "length" 
results are with commit 72b185189f91 ("RISC-V: Reset the length to the 
default of 4 for FP comparisons") applied on top, as it does make changes 
to code produced even at `-O3' (where size matters less than speed), e.g.:

    46b2c:	8d01b787          	fld	fa5,-1840(gp) # 7760a8 <__SDATA_BEGIN__+0xd0>
-   46b30:	66f4b027          	fsd	fa5,1632(s1)
-   46b34:	a029                	j	46b3e <gciinp_+0x124>
-   46b36:	8c01b787          	fld	fa5,-1856(gp) # 776098 <__SDATA_BEGIN__+0xc0>
-   46b3a:	66f4b027          	fsd	fa5,1632(s1)
-   46b3e:	00ab67b7          	lui	a5,0xab6
-   46b42:	0a07b707          	fld	fa4,160(a5) # ab60a0 <runopt_>
-   46b46:	8d81b787          	fld	fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8>
-   46b4a:	a2f727d3          	feq.d	a5,fa4,fa5
-   46b4e:	18079fe3          	bnez	a5,474ec <gciinp_+0xad2>
-   46b52:	00afd7b7          	lui	a5,0xafd
-   46b56:	4607a703          	lw	a4,1120(a5) # afd460 <symtry_+0x47340>
-   46b5a:	4785                	li	a5,1
-   46b5c:	18f708e3          	beq	a4,a5,474ec <gciinp_+0xad2>
-   46b60:	00aaeab7          	lui	s5,0xaae
-   46b64:	d70a8a93          	addi	s5,s5,-656 # aadd70 <infoa_>
-   46b68:	008aa783          	lw	a5,8(s5)
-   46b6c:	8301b707          	fld	fa4,-2000(gp) # 776008 <__SDATA_BEGIN__+0x30>
-   46b70:	37fd                	addiw	a5,a5,-1
+   46b30:	00ab67b7          	lui	a5,0xab6
+   46b34:	0a07b707          	fld	fa4,160(a5) # ab60a0 <runopt_>
+   46b38:	66f4b027          	fsd	fa5,1632(s1)
+   46b3c:	8d81b787          	fld	fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8>
+   46b40:	a2f727d3          	feq.d	a5,fa4,fa5
+   46b44:	c39d                	beqz	a5,46b6a <gciinp_+0x150>
+   46b46:	8901b787          	fld	fa5,-1904(gp) # 776068 <__SDATA_BEGIN__+0x90>
+   46b4a:	66f4b027          	fsd	fa5,1632(s1)
+   46b4e:	a02d                	j	46b78 <gciinp_+0x15e>
+   46b50:	8c01b787          	fld	fa5,-1856(gp) # 776098 <__SDATA_BEGIN__+0xc0>
+   46b54:	66f4b027          	fsd	fa5,1632(s1)
+   46b58:	00ab67b7          	lui	a5,0xab6
+   46b5c:	0a07b707          	fld	fa4,160(a5) # ab60a0 <runopt_>
+   46b60:	8d81b787          	fld	fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8>
+   46b64:	a2f727d3          	feq.d	a5,fa4,fa5
+   46b68:	fff9                	bnez	a5,46b46 <gciinp_+0x12c>
+   46b6a:	00afd7b7          	lui	a5,0xafd
+   46b6e:	4607a703          	lw	a4,1120(a5) # afd460 <symtry_+0x47340>
+   46b72:	4785                	li	a5,1
+   46b74:	fcf709e3          	beq	a4,a5,46b46 <gciinp_+0x12c>
+   46b78:	00aaeab7          	lui	s5,0xaae
+   46b7c:	d70a8a93          	addi	s5,s5,-656 # aadd70 <infoa_>
+   46b80:	008aa783          	lw	a5,8(s5)
+   46b84:	8301b707          	fld	fa4,-2000(gp) # 776008 <__SDATA_BEGIN__+0x30>
+   46b88:	37fd                	addiw	a5,a5,-1

And finally "split" is with this patch also applied, changing code in 
places as well, e.g.:

@@ -4873598,13 +4873598,13 @@
   5f5744:	87bf70ef          	jal	ra,5ecfbe <_gfortrani_internal_error>
   5f5748:	8281b407          	fld	fs0,-2008(gp) # 776000 <__SDATA_BEGIN__+0x28>
   5f574c:	221c                	fld	fa5,0(a2)
-  5f574e:	0079a7b7          	lui	a5,0x79a
-  5f5752:	ac22                	fsd	fs0,24(sp)
-  5f5754:	a83e                	fsd	fa5,16(sp)
-  5f5756:	27c2                	fld	fa5,16(sp)
-  5f5758:	4907b707          	fld	fa4,1168(a5) # 79a490 <__global_pointer$+0x23cb8>
-  5f575c:	22f7a7d3          	fabs.d	fa5,fa5
-  5f5760:	00102773          	frflags	a4
+  5f574e:	ac22                	fsd	fs0,24(sp)
+  5f5750:	a83e                	fsd	fa5,16(sp)
+  5f5752:	27c2                	fld	fa5,16(sp)
+  5f5754:	22f7a7d3          	fabs.d	fa5,fa5
+  5f5758:	00102773          	frflags	a4
+  5f575c:	0079a7b7          	lui	a5,0x79a
+  5f5760:	4907b707          	fld	fa4,1168(a5) # 79a490 <__global_pointer$+0x23cb8>
   5f5764:	a2e787d3          	fle.d	a5,fa5,fa4
   5f5768:	00171073          	fsflags	a4
   5f576c:	2c078363          	beqz	a5,5f5a32 <determine_en_precision+0x328>

or:

@@ -4909410,9 +4909410,9 @@
   60eb8a:	a2f696d3          	flt.d	a3,fa3,fa5
   60eb8e:	00161073          	fsflags	a2
   60eb92:	ee81                	bnez	a3,60ebaa <__hypot+0x58>
-  60eb94:	f20707d3          	fmv.d.x	fa5,a4
-  60eb98:	22f7a7d3          	fabs.d	fa5,fa5
-  60eb9c:	00102673          	frflags	a2
+  60eb94:	00102673          	frflags	a2
+  60eb98:	f20707d3          	fmv.d.x	fa5,a4
+  60eb9c:	22f7a7d3          	fabs.d	fa5,fa5
   60eba0:	a2f696d3          	flt.d	a3,fa3,fa5
   60eba4:	00161073          	fsflags	a2
   60eba8:	c29d                	beqz	a3,60ebce <__hypot+0x7c>

(so no arithmetic FP instructions appear to be scheduled between FSFLAGS 
and FRFLAGS, though it's not clear to me how the compiler knows it is not 
allowed do it) or finally:

-   66204:	52cd754b          	fnmsub.d	fa0,fs10,fa2,fa0
-   66208:	40157553          	fcvt.s.d	fa0,fa0
-   6620c:	a0e517d3          	flt.s	a5,fa0,fa4
-   66210:	58079263          	bnez	a5,66794 <do_cg+0xd20>
-   66214:	00102773          	frflags	a4
-   66218:	a0e517d3          	flt.s	a5,fa0,fa4
-   6621c:	00171073          	fsflags	a4
-   66220:	220793e3          	bnez	a5,66c46 <do_cg+0x11d2>
-   66224:	580576d3          	fsqrt.s	fa3,fa0
+   6620c:	52cd754b          	fnmsub.d	fa0,fs10,fa2,fa0
+   66210:	40157553          	fcvt.s.d	fa0,fa0
+   66214:	a0e517d3          	flt.s	a5,fa0,fa4
+   66218:	58079063          	bnez	a5,66798 <do_cg+0xd1c>
+   6621c:	00102773          	frflags	a4
+   66220:	00171073          	fsflags	a4
+   66224:	220793e3          	bnez	a5,66c4a <do_cg+0x11ce>
+   66228:	580576d3          	fsqrt.s	fa3,fa0

(at least removing a redundant FLT.S instruction, although this doesn't 
seem optimal anyway as there appears no way for the second BNEZ branch to 
be ever taken, but I gather that's an unfortunate consequence of the 
volatility of `riscv_frflags'/`riscv_fsflags' RTL insns) and I was able to 
spot a place where an FMV.D instruction has been removed too, indicating a 
better register allocation.

 Results quoted above seem to suggest that in some cases a performance 
regression has resulted from the change, but that may not necessarily be 
the case given that the benchmarks have been run on a live even if lightly 
loaded Linux system.  Obtaining standard three samples would require ~4.5 
days per SPECfp 2006 iteration or almost a fortnight total.

 Therefore I chose to rerun only one of the worst offenders and the 
results are as follows:

                                  Estimated  
                Base     Base       Base     
Benchmarks      Ref.   Run Time     Ratio    
-------------- ------  ---------  ---------  
416.gamess      19580       9138       2.14 S
416.gamess      19580       9498       2.06 S
416.gamess      19580       9478       2.07 *

corresponding to the "split" result earlier on.  So the variation between 
runs is similar to the supposed loss of performance and therefore I think 
we do not need to be concerned.  If there's anything that we're missing, 
it's the tracking of IEEE exception flags, as I previously mentioned.

 I did not run benchmarking for `-fsignaling-nans'.  Relative figures are 
expected to be similar as the only difference is the presence of a FEQ.fmt 
instruction following FSFLAGS.  I've spotted this anomaly however:

-   3c670:	5a057553          	fsqrt.d	fa0,fa0
-   3c674:	f20006d3          	fmv.d.x	fa3,zero
-   3c678:	8d01b787          	fld	fa5,-1840(gp) # c5678 <__SDATA_BEGIN__+0xd0>
-   3c67c:	0ad57553          	fsub.d	fa0,fa0,fa3
-   3c680:	a2a797d3          	flt.d	a5,fa5,fa0
-   3c684:	e3cd                	bnez	a5,3c726 <_ZN9ResultSet5checkEv+0xe6>
-   3c686:	8d81b787          	fld	fa5,-1832(gp) # c5680 <__SDATA_BEGIN__+0xd8>
-   3c68a:	a2f517d3          	flt.d	a5,fa0,fa5
-   3c68e:	efc1                	bnez	a5,3c726 <_ZN9ResultSet5checkEv+0xe6>
-   3c690:	3578                	fld	fa4,232(a0)
-   3c692:	317c                	fld	fa5,224(a0)
-   3c694:	3968                	fld	fa0,240(a0)
-   3c696:	12e77753          	fmul.d	fa4,fa4,fa4
-   3c69a:	72f7f7c3          	fmadd.d	fa5,fa5,fa5,fa4
-   3c69e:	7aa57543          	fmadd.d	fa0,fa0,fa0,fa5
-   3c6a2:	00102773          	frflags	a4
-   3c6a6:	a2d517d3          	flt.d	a5,fa0,fa3
-   3c6aa:	00171073          	fsflags	a4
-   3c6ae:	a2d52053          	feq.d	zero,fa0,fa3
-   3c6b2:	efc9                	bnez	a5,3c74c <_ZN9ResultSet5checkEv+0x10c>
-   3c6b4:	5a057553          	fsqrt.d	fa0,fa0
+   3c66e:	5a057553          	fsqrt.d	fa0,fa0
+   3c672:	f20006d3          	fmv.d.x	fa3,zero
+   3c676:	8d01b787          	fld	fa5,-1840(gp) # c5678 <__SDATA_BEGIN__+0xd0>
+   3c67a:	0ad57553          	fsub.d	fa0,fa0,fa3
+   3c67e:	a2a797d3          	flt.d	a5,fa5,fa0
+   3c682:	e7cd                	bnez	a5,3c72c <_ZN9ResultSet5checkEv+0xee>
+   3c684:	8d81b787          	fld	fa5,-1832(gp) # c5680 <__SDATA_BEGIN__+0xd8>
+   3c688:	a2f517d3          	flt.d	a5,fa0,fa5
+   3c68c:	e3c5                	bnez	a5,3c72c <_ZN9ResultSet5checkEv+0xee>
+   3c68e:	3578                	fld	fa4,232(a0)
+   3c690:	317c                	fld	fa5,224(a0)
+   3c692:	3968                	fld	fa0,240(a0)
+   3c694:	12e77753          	fmul.d	fa4,fa4,fa4
+   3c698:	72f7f7c3          	fmadd.d	fa5,fa5,fa5,fa4
+   3c69c:	7aa57543          	fmadd.d	fa0,fa0,fa0,fa5
+   3c6a0:	00102773          	frflags	a4
+   3c6a4:	a2d517d3          	flt.d	a5,fa0,fa3
+   3c6a8:	00171073          	fsflags	a4
+   3c6ac:	f20007d3          	fmv.d.x	fa5,zero
+   3c6b0:	a2f52053          	feq.d	zero,fa0,fa5
+   3c6b4:	efd9                	bnez	a5,3c752 <_ZN9ResultSet5checkEv+0x114>
+   3c6b6:	5a057553          	fsqrt.d	fa0,fa0

where the compiler for some reason cannot realise it already has the value 
of 0.0 available in fa3 and instead uses an extra move to fa5 for the 
final FEQ.D.

>  No regressions with the GCC (with and w/o `-fsignaling-nans') and glibc 
> testsuites (as per commit 1fcbfb00fc67 ("RISC-V: Fix -fsignaling-nans for 
> glibc testsuite.")).  OK to apply?

 Any comments on the change, anyone?

  Maciej