public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: "Maciej W. Rozycki" <macro@embecosm.com>
To: gcc-patches@gcc.gnu.org
Cc: Andrew Waterman <andrew@sifive.com>,
	Jim Wilson <jim.wilson.gcc@gmail.com>,
	Kito Cheng <kito.cheng@gmail.com>,
	Palmer Dabbelt <palmer@dabbelt.com>
Subject: Re: [PATCH] RISC-V: Split unordered FP comparisons into individual RTL insns
Date: Thu, 23 Jun 2022 14:39:27 +0100 (BST)	[thread overview]
Message-ID: <alpine.DEB.2.20.2206220143490.10833@tpp.orcam.me.uk> (raw)
In-Reply-To: <alpine.DEB.2.20.2206082354490.10833@tpp.orcam.me.uk>

On Thu, 9 Jun 2022, Maciej W. Rozycki wrote:

>  I'm yet running some benchmarking to see if the use of UNSPEC_VOLATILEs 
> makes any noticeable performance difference, but I suspect it does not as 
> the compiler could not do much about the original multiple-instruction 
> single RTL insns anyway.

 This has now finally completed.  I used SPECfp 2006 built at `-O3' and 
statically linked, which needs ~33 hours per run with the HiFive Unmatched 
board at its standard 1196MHz clock rate.  Here are the results merged by 
hand from original reports:

               Base     Base     Base     Base   Est Base  Est Base  Est Base   
Benchmarks     Ref.   Run Time Run Time Run Time   Ratio     Ratio     Ratio  
                       (base)  (length) (split)   (base)   (length)   (split)  
------------- ------  -------  -------  -------  --------  --------  -------- 
410.bwaves     13590    10353    10396    10370    1.31      1.31      1.31   
416.gamess     19580     9080     9410     9284    2.16      2.08      2.11   
433.milc        9180     5465     5475     5610    1.68      1.68      1.64   
434.zeusmp      9100     5773     5767     5761    1.58      1.58      1.58   
435.gromacs     7140     3605     3561     3545    1.98      2.00      2.01   
436.cactusADM  11950     7779     7658     7680    1.54      1.56      1.56   
437.leslie3d    9400    10280    10697    10274    0.914     0.879     0.915  
444.namd        8020     3141     3120     3129    2.55      2.57      2.56   
447.dealII     11440     3459     3490     3495    3.31      3.28      3.27   
450.soplex      8340     4698     4899     4781    1.78      1.70      1.74   
453.povray      5320     1953     1922     1916    2.72      2.77      2.78   
454.calculix    8250     4844     4857     4821    1.70      1.70      1.71   
459.GemsFDTD   10610     8703     8957     9028    1.22      1.18      1.18   
465.tonto       9840     4585     4539     4620    2.15      2.17      2.13   
470.lbm        13740    10172    10945    10789    1.35      1.26      1.27   
481.wrf        11170     8516     8646     8584    1.31      1.29      1.30   
482.sphinx3    19490     9240     9267     9280    2.11      2.10      2.10   
==============================================================================

The execution time reference (second column) is for a Sun Ultra Enterprise 
2 system from 1997, based on a 296MHz UltraSPARC II CPU, times are given 
in seconds (lower is better) and the ratios calculated are in relation to 
the reference (higher is better).

In the table above "base" results are with upstream master as at commit 
7b98910406b5 ("c++: ICE with template NEW_EXPR [PR105803]".  Then "length" 
results are with commit 72b185189f91 ("RISC-V: Reset the length to the 
default of 4 for FP comparisons") applied on top, as it does make changes 
to code produced even at `-O3' (where size matters less than speed), e.g.:

    46b2c:	8d01b787          	fld	fa5,-1840(gp) # 7760a8 <__SDATA_BEGIN__+0xd0>
-   46b30:	66f4b027          	fsd	fa5,1632(s1)
-   46b34:	a029                	j	46b3e <gciinp_+0x124>
-   46b36:	8c01b787          	fld	fa5,-1856(gp) # 776098 <__SDATA_BEGIN__+0xc0>
-   46b3a:	66f4b027          	fsd	fa5,1632(s1)
-   46b3e:	00ab67b7          	lui	a5,0xab6
-   46b42:	0a07b707          	fld	fa4,160(a5) # ab60a0 <runopt_>
-   46b46:	8d81b787          	fld	fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8>
-   46b4a:	a2f727d3          	feq.d	a5,fa4,fa5
-   46b4e:	18079fe3          	bnez	a5,474ec <gciinp_+0xad2>
-   46b52:	00afd7b7          	lui	a5,0xafd
-   46b56:	4607a703          	lw	a4,1120(a5) # afd460 <symtry_+0x47340>
-   46b5a:	4785                	li	a5,1
-   46b5c:	18f708e3          	beq	a4,a5,474ec <gciinp_+0xad2>
-   46b60:	00aaeab7          	lui	s5,0xaae
-   46b64:	d70a8a93          	addi	s5,s5,-656 # aadd70 <infoa_>
-   46b68:	008aa783          	lw	a5,8(s5)
-   46b6c:	8301b707          	fld	fa4,-2000(gp) # 776008 <__SDATA_BEGIN__+0x30>
-   46b70:	37fd                	addiw	a5,a5,-1
+   46b30:	00ab67b7          	lui	a5,0xab6
+   46b34:	0a07b707          	fld	fa4,160(a5) # ab60a0 <runopt_>
+   46b38:	66f4b027          	fsd	fa5,1632(s1)
+   46b3c:	8d81b787          	fld	fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8>
+   46b40:	a2f727d3          	feq.d	a5,fa4,fa5
+   46b44:	c39d                	beqz	a5,46b6a <gciinp_+0x150>
+   46b46:	8901b787          	fld	fa5,-1904(gp) # 776068 <__SDATA_BEGIN__+0x90>
+   46b4a:	66f4b027          	fsd	fa5,1632(s1)
+   46b4e:	a02d                	j	46b78 <gciinp_+0x15e>
+   46b50:	8c01b787          	fld	fa5,-1856(gp) # 776098 <__SDATA_BEGIN__+0xc0>
+   46b54:	66f4b027          	fsd	fa5,1632(s1)
+   46b58:	00ab67b7          	lui	a5,0xab6
+   46b5c:	0a07b707          	fld	fa4,160(a5) # ab60a0 <runopt_>
+   46b60:	8d81b787          	fld	fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8>
+   46b64:	a2f727d3          	feq.d	a5,fa4,fa5
+   46b68:	fff9                	bnez	a5,46b46 <gciinp_+0x12c>
+   46b6a:	00afd7b7          	lui	a5,0xafd
+   46b6e:	4607a703          	lw	a4,1120(a5) # afd460 <symtry_+0x47340>
+   46b72:	4785                	li	a5,1
+   46b74:	fcf709e3          	beq	a4,a5,46b46 <gciinp_+0x12c>
+   46b78:	00aaeab7          	lui	s5,0xaae
+   46b7c:	d70a8a93          	addi	s5,s5,-656 # aadd70 <infoa_>
+   46b80:	008aa783          	lw	a5,8(s5)
+   46b84:	8301b707          	fld	fa4,-2000(gp) # 776008 <__SDATA_BEGIN__+0x30>
+   46b88:	37fd                	addiw	a5,a5,-1

And finally "split" is with this patch also applied, changing code in 
places as well, e.g.:

@@ -4873598,13 +4873598,13 @@
   5f5744:	87bf70ef          	jal	ra,5ecfbe <_gfortrani_internal_error>
   5f5748:	8281b407          	fld	fs0,-2008(gp) # 776000 <__SDATA_BEGIN__+0x28>
   5f574c:	221c                	fld	fa5,0(a2)
-  5f574e:	0079a7b7          	lui	a5,0x79a
-  5f5752:	ac22                	fsd	fs0,24(sp)
-  5f5754:	a83e                	fsd	fa5,16(sp)
-  5f5756:	27c2                	fld	fa5,16(sp)
-  5f5758:	4907b707          	fld	fa4,1168(a5) # 79a490 <__global_pointer$+0x23cb8>
-  5f575c:	22f7a7d3          	fabs.d	fa5,fa5
-  5f5760:	00102773          	frflags	a4
+  5f574e:	ac22                	fsd	fs0,24(sp)
+  5f5750:	a83e                	fsd	fa5,16(sp)
+  5f5752:	27c2                	fld	fa5,16(sp)
+  5f5754:	22f7a7d3          	fabs.d	fa5,fa5
+  5f5758:	00102773          	frflags	a4
+  5f575c:	0079a7b7          	lui	a5,0x79a
+  5f5760:	4907b707          	fld	fa4,1168(a5) # 79a490 <__global_pointer$+0x23cb8>
   5f5764:	a2e787d3          	fle.d	a5,fa5,fa4
   5f5768:	00171073          	fsflags	a4
   5f576c:	2c078363          	beqz	a5,5f5a32 <determine_en_precision+0x328>

or:

@@ -4909410,9 +4909410,9 @@
   60eb8a:	a2f696d3          	flt.d	a3,fa3,fa5
   60eb8e:	00161073          	fsflags	a2
   60eb92:	ee81                	bnez	a3,60ebaa <__hypot+0x58>
-  60eb94:	f20707d3          	fmv.d.x	fa5,a4
-  60eb98:	22f7a7d3          	fabs.d	fa5,fa5
-  60eb9c:	00102673          	frflags	a2
+  60eb94:	00102673          	frflags	a2
+  60eb98:	f20707d3          	fmv.d.x	fa5,a4
+  60eb9c:	22f7a7d3          	fabs.d	fa5,fa5
   60eba0:	a2f696d3          	flt.d	a3,fa3,fa5
   60eba4:	00161073          	fsflags	a2
   60eba8:	c29d                	beqz	a3,60ebce <__hypot+0x7c>

(so no arithmetic FP instructions appear to be scheduled between FSFLAGS 
and FRFLAGS, though it's not clear to me how the compiler knows it is not 
allowed do it) or finally:

-   66204:	52cd754b          	fnmsub.d	fa0,fs10,fa2,fa0
-   66208:	40157553          	fcvt.s.d	fa0,fa0
-   6620c:	a0e517d3          	flt.s	a5,fa0,fa4
-   66210:	58079263          	bnez	a5,66794 <do_cg+0xd20>
-   66214:	00102773          	frflags	a4
-   66218:	a0e517d3          	flt.s	a5,fa0,fa4
-   6621c:	00171073          	fsflags	a4
-   66220:	220793e3          	bnez	a5,66c46 <do_cg+0x11d2>
-   66224:	580576d3          	fsqrt.s	fa3,fa0
+   6620c:	52cd754b          	fnmsub.d	fa0,fs10,fa2,fa0
+   66210:	40157553          	fcvt.s.d	fa0,fa0
+   66214:	a0e517d3          	flt.s	a5,fa0,fa4
+   66218:	58079063          	bnez	a5,66798 <do_cg+0xd1c>
+   6621c:	00102773          	frflags	a4
+   66220:	00171073          	fsflags	a4
+   66224:	220793e3          	bnez	a5,66c4a <do_cg+0x11ce>
+   66228:	580576d3          	fsqrt.s	fa3,fa0

(at least removing a redundant FLT.S instruction, although this doesn't 
seem optimal anyway as there appears no way for the second BNEZ branch to 
be ever taken, but I gather that's an unfortunate consequence of the 
volatility of `riscv_frflags'/`riscv_fsflags' RTL insns) and I was able to 
spot a place where an FMV.D instruction has been removed too, indicating a 
better register allocation.

 Results quoted above seem to suggest that in some cases a performance 
regression has resulted from the change, but that may not necessarily be 
the case given that the benchmarks have been run on a live even if lightly 
loaded Linux system.  Obtaining standard three samples would require ~4.5 
days per SPECfp 2006 iteration or almost a fortnight total.

 Therefore I chose to rerun only one of the worst offenders and the 
results are as follows:

                                  Estimated  
                Base     Base       Base     
Benchmarks      Ref.   Run Time     Ratio    
-------------- ------  ---------  ---------  
416.gamess      19580       9138       2.14 S
416.gamess      19580       9498       2.06 S
416.gamess      19580       9478       2.07 *

corresponding to the "split" result earlier on.  So the variation between 
runs is similar to the supposed loss of performance and therefore I think 
we do not need to be concerned.  If there's anything that we're missing, 
it's the tracking of IEEE exception flags, as I previously mentioned.

 I did not run benchmarking for `-fsignaling-nans'.  Relative figures are 
expected to be similar as the only difference is the presence of a FEQ.fmt 
instruction following FSFLAGS.  I've spotted this anomaly however:

-   3c670:	5a057553          	fsqrt.d	fa0,fa0
-   3c674:	f20006d3          	fmv.d.x	fa3,zero
-   3c678:	8d01b787          	fld	fa5,-1840(gp) # c5678 <__SDATA_BEGIN__+0xd0>
-   3c67c:	0ad57553          	fsub.d	fa0,fa0,fa3
-   3c680:	a2a797d3          	flt.d	a5,fa5,fa0
-   3c684:	e3cd                	bnez	a5,3c726 <_ZN9ResultSet5checkEv+0xe6>
-   3c686:	8d81b787          	fld	fa5,-1832(gp) # c5680 <__SDATA_BEGIN__+0xd8>
-   3c68a:	a2f517d3          	flt.d	a5,fa0,fa5
-   3c68e:	efc1                	bnez	a5,3c726 <_ZN9ResultSet5checkEv+0xe6>
-   3c690:	3578                	fld	fa4,232(a0)
-   3c692:	317c                	fld	fa5,224(a0)
-   3c694:	3968                	fld	fa0,240(a0)
-   3c696:	12e77753          	fmul.d	fa4,fa4,fa4
-   3c69a:	72f7f7c3          	fmadd.d	fa5,fa5,fa5,fa4
-   3c69e:	7aa57543          	fmadd.d	fa0,fa0,fa0,fa5
-   3c6a2:	00102773          	frflags	a4
-   3c6a6:	a2d517d3          	flt.d	a5,fa0,fa3
-   3c6aa:	00171073          	fsflags	a4
-   3c6ae:	a2d52053          	feq.d	zero,fa0,fa3
-   3c6b2:	efc9                	bnez	a5,3c74c <_ZN9ResultSet5checkEv+0x10c>
-   3c6b4:	5a057553          	fsqrt.d	fa0,fa0
+   3c66e:	5a057553          	fsqrt.d	fa0,fa0
+   3c672:	f20006d3          	fmv.d.x	fa3,zero
+   3c676:	8d01b787          	fld	fa5,-1840(gp) # c5678 <__SDATA_BEGIN__+0xd0>
+   3c67a:	0ad57553          	fsub.d	fa0,fa0,fa3
+   3c67e:	a2a797d3          	flt.d	a5,fa5,fa0
+   3c682:	e7cd                	bnez	a5,3c72c <_ZN9ResultSet5checkEv+0xee>
+   3c684:	8d81b787          	fld	fa5,-1832(gp) # c5680 <__SDATA_BEGIN__+0xd8>
+   3c688:	a2f517d3          	flt.d	a5,fa0,fa5
+   3c68c:	e3c5                	bnez	a5,3c72c <_ZN9ResultSet5checkEv+0xee>
+   3c68e:	3578                	fld	fa4,232(a0)
+   3c690:	317c                	fld	fa5,224(a0)
+   3c692:	3968                	fld	fa0,240(a0)
+   3c694:	12e77753          	fmul.d	fa4,fa4,fa4
+   3c698:	72f7f7c3          	fmadd.d	fa5,fa5,fa5,fa4
+   3c69c:	7aa57543          	fmadd.d	fa0,fa0,fa0,fa5
+   3c6a0:	00102773          	frflags	a4
+   3c6a4:	a2d517d3          	flt.d	a5,fa0,fa3
+   3c6a8:	00171073          	fsflags	a4
+   3c6ac:	f20007d3          	fmv.d.x	fa5,zero
+   3c6b0:	a2f52053          	feq.d	zero,fa0,fa5
+   3c6b4:	efd9                	bnez	a5,3c752 <_ZN9ResultSet5checkEv+0x114>
+   3c6b6:	5a057553          	fsqrt.d	fa0,fa0

where the compiler for some reason cannot realise it already has the value 
of 0.0 available in fa3 and instead uses an extra move to fa5 for the 
final FEQ.D.

>  No regressions with the GCC (with and w/o `-fsignaling-nans') and glibc 
> testsuites (as per commit 1fcbfb00fc67 ("RISC-V: Fix -fsignaling-nans for 
> glibc testsuite.")).  OK to apply?

 Any comments on the change, anyone?

  Maciej

  reply	other threads:[~2022-06-23 13:39 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-09 13:44 Maciej W. Rozycki
2022-06-23 13:39 ` Maciej W. Rozycki [this message]
2022-06-23 16:44   ` Kito Cheng
2022-07-04 14:12     ` [PATCH v2] " Maciej W. Rozycki
2022-07-18 15:42       ` [PING][PATCH " Maciej W. Rozycki
2022-07-27 10:40         ` Kito Cheng
2022-07-28 13:20           ` Maciej W. Rozycki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.DEB.2.20.2206220143490.10833@tpp.orcam.me.uk \
    --to=macro@embecosm.com \
    --cc=andrew@sifive.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jim.wilson.gcc@gmail.com \
    --cc=kito.cheng@gmail.com \
    --cc=palmer@dabbelt.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).