From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr1-x436.google.com (mail-wr1-x436.google.com [IPv6:2a00:1450:4864:20::436]) by sourceware.org (Postfix) with ESMTPS id 97BC638582B1 for ; Thu, 23 Jun 2022 13:39:34 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 97BC638582B1 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=embecosm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=embecosm.com Received: by mail-wr1-x436.google.com with SMTP id o8so28034739wro.3 for ; Thu, 23 Jun 2022 06:39:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=embecosm.com; s=google; h=date:from:to:cc:subject:in-reply-to:message-id:references :user-agent:mime-version; bh=Pe7ImCMw36mXSWgZSD0/mFhhmJXoNS51xFi1W3ax/1o=; b=XZ20jblwDNTJ39rX6m+m59dvTI4NhbfLX28j+DwQVhdeWBvGQ/hNCUhthcfPYMcqcz PNxKL4H80skm2G28Qwn0CuxCif7d0B29+RIt34yLAqbcfeU1K/s7e5gjVHiCuS2uXhog 6G7IeXs8xb5LkVBMomj9818tl0LcLJsA0caLNNjWDyQ3EUus5b0qxRM3TE1TUvJjk/06 cRpZ/yLc3er3j+f2KDe06vAk1Jy45pp0jp49QYyf7AW4i6pJ9Wl1CPcxV2bR66jvJYGG Dq3qeajnDxIOf9HpsRpkryQIcG3qtz49MpRQ397nr9KhsGlNHIHymrBuRou8XTHr0cjr r+6A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version; bh=Pe7ImCMw36mXSWgZSD0/mFhhmJXoNS51xFi1W3ax/1o=; b=uoVk/fJEoSfbcdMXHklYSbkZfTlZbzXrAqGLvTJwRX5cVjpgniNFZkRfLQsdlx0RPQ L/ZR5gvWep+m3UcDC+eSgEHmHrl7qpuso3j/w9piea0I3pUzJ1fNZNo1xU3kKeVkW9mi +FNJCYd5TUAyMZIMXtlpuC52twhR8nEEPy8G8/ZbVr+OwIvkN8Ag2O6dZRwJVxxpxmnp KTHCxudSpaOCe51zKOAS1Iwd5e6AeKmECiUCmH/MrqLWq2fBiWSTdRbHgHXVinHPN9bc YAZP0J88D9TKpQTsva7jCOhJl+uNCXQDOOD07t2/KvzWfc67LW5bp3Y2wJtEBbfUs1xq 8H4w== X-Gm-Message-State: AJIora8HexTOelI/eAYIeJeb79ne5uuVJngLoaynmmAEjKs96PpVQggv +hgU6KArFzS9LDh5DZTIKZCC2DgfiT5K9w== X-Google-Smtp-Source: AGRyM1uhwzl+7/MrhsaYGV0/qHeSMEsS3ljJHx1ua/zE4p1e7wul1emTzl2YASjqtFUHWVQBJ5pRtg== X-Received: by 2002:a05:6000:15c1:b0:21b:ad5d:64dd with SMTP id y1-20020a05600015c100b0021bad5d64ddmr1469838wry.642.1655991573163; Thu, 23 Jun 2022 06:39:33 -0700 (PDT) Received: from tpp.orcam.me.uk (tpp.orcam.me.uk. [2001:8b0:154:0:ea6a:64ff:fe24:f2fc]) by smtp.gmail.com with ESMTPSA id t18-20020a05600c199200b003a032c88877sm2661619wmq.15.2022.06.23.06.39.30 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 23 Jun 2022 06:39:32 -0700 (PDT) Date: Thu, 23 Jun 2022 14:39:27 +0100 (BST) From: "Maciej W. Rozycki" To: gcc-patches@gcc.gnu.org cc: Andrew Waterman , Jim Wilson , Kito Cheng , Palmer Dabbelt Subject: Re: [PATCH] RISC-V: Split unordered FP comparisons into individual RTL insns In-Reply-To: Message-ID: References: User-Agent: Alpine 2.20 (DEB 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Jun 2022 13:39:37 -0000 On Thu, 9 Jun 2022, Maciej W. Rozycki wrote: > I'm yet running some benchmarking to see if the use of UNSPEC_VOLATILEs > makes any noticeable performance difference, but I suspect it does not as > the compiler could not do much about the original multiple-instruction > single RTL insns anyway. This has now finally completed. I used SPECfp 2006 built at `-O3' and statically linked, which needs ~33 hours per run with the HiFive Unmatched board at its standard 1196MHz clock rate. Here are the results merged by hand from original reports: Base Base Base Base Est Base Est Base Est Base Benchmarks Ref. Run Time Run Time Run Time Ratio Ratio Ratio (base) (length) (split) (base) (length) (split) ------------- ------ ------- ------- ------- -------- -------- -------- 410.bwaves 13590 10353 10396 10370 1.31 1.31 1.31 416.gamess 19580 9080 9410 9284 2.16 2.08 2.11 433.milc 9180 5465 5475 5610 1.68 1.68 1.64 434.zeusmp 9100 5773 5767 5761 1.58 1.58 1.58 435.gromacs 7140 3605 3561 3545 1.98 2.00 2.01 436.cactusADM 11950 7779 7658 7680 1.54 1.56 1.56 437.leslie3d 9400 10280 10697 10274 0.914 0.879 0.915 444.namd 8020 3141 3120 3129 2.55 2.57 2.56 447.dealII 11440 3459 3490 3495 3.31 3.28 3.27 450.soplex 8340 4698 4899 4781 1.78 1.70 1.74 453.povray 5320 1953 1922 1916 2.72 2.77 2.78 454.calculix 8250 4844 4857 4821 1.70 1.70 1.71 459.GemsFDTD 10610 8703 8957 9028 1.22 1.18 1.18 465.tonto 9840 4585 4539 4620 2.15 2.17 2.13 470.lbm 13740 10172 10945 10789 1.35 1.26 1.27 481.wrf 11170 8516 8646 8584 1.31 1.29 1.30 482.sphinx3 19490 9240 9267 9280 2.11 2.10 2.10 ============================================================================== The execution time reference (second column) is for a Sun Ultra Enterprise 2 system from 1997, based on a 296MHz UltraSPARC II CPU, times are given in seconds (lower is better) and the ratios calculated are in relation to the reference (higher is better). In the table above "base" results are with upstream master as at commit 7b98910406b5 ("c++: ICE with template NEW_EXPR [PR105803]". Then "length" results are with commit 72b185189f91 ("RISC-V: Reset the length to the default of 4 for FP comparisons") applied on top, as it does make changes to code produced even at `-O3' (where size matters less than speed), e.g.: 46b2c: 8d01b787 fld fa5,-1840(gp) # 7760a8 <__SDATA_BEGIN__+0xd0> - 46b30: 66f4b027 fsd fa5,1632(s1) - 46b34: a029 j 46b3e - 46b36: 8c01b787 fld fa5,-1856(gp) # 776098 <__SDATA_BEGIN__+0xc0> - 46b3a: 66f4b027 fsd fa5,1632(s1) - 46b3e: 00ab67b7 lui a5,0xab6 - 46b42: 0a07b707 fld fa4,160(a5) # ab60a0 - 46b46: 8d81b787 fld fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8> - 46b4a: a2f727d3 feq.d a5,fa4,fa5 - 46b4e: 18079fe3 bnez a5,474ec - 46b52: 00afd7b7 lui a5,0xafd - 46b56: 4607a703 lw a4,1120(a5) # afd460 - 46b5a: 4785 li a5,1 - 46b5c: 18f708e3 beq a4,a5,474ec - 46b60: 00aaeab7 lui s5,0xaae - 46b64: d70a8a93 addi s5,s5,-656 # aadd70 - 46b68: 008aa783 lw a5,8(s5) - 46b6c: 8301b707 fld fa4,-2000(gp) # 776008 <__SDATA_BEGIN__+0x30> - 46b70: 37fd addiw a5,a5,-1 + 46b30: 00ab67b7 lui a5,0xab6 + 46b34: 0a07b707 fld fa4,160(a5) # ab60a0 + 46b38: 66f4b027 fsd fa5,1632(s1) + 46b3c: 8d81b787 fld fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8> + 46b40: a2f727d3 feq.d a5,fa4,fa5 + 46b44: c39d beqz a5,46b6a + 46b46: 8901b787 fld fa5,-1904(gp) # 776068 <__SDATA_BEGIN__+0x90> + 46b4a: 66f4b027 fsd fa5,1632(s1) + 46b4e: a02d j 46b78 + 46b50: 8c01b787 fld fa5,-1856(gp) # 776098 <__SDATA_BEGIN__+0xc0> + 46b54: 66f4b027 fsd fa5,1632(s1) + 46b58: 00ab67b7 lui a5,0xab6 + 46b5c: 0a07b707 fld fa4,160(a5) # ab60a0 + 46b60: 8d81b787 fld fa5,-1832(gp) # 7760b0 <__SDATA_BEGIN__+0xd8> + 46b64: a2f727d3 feq.d a5,fa4,fa5 + 46b68: fff9 bnez a5,46b46 + 46b6a: 00afd7b7 lui a5,0xafd + 46b6e: 4607a703 lw a4,1120(a5) # afd460 + 46b72: 4785 li a5,1 + 46b74: fcf709e3 beq a4,a5,46b46 + 46b78: 00aaeab7 lui s5,0xaae + 46b7c: d70a8a93 addi s5,s5,-656 # aadd70 + 46b80: 008aa783 lw a5,8(s5) + 46b84: 8301b707 fld fa4,-2000(gp) # 776008 <__SDATA_BEGIN__+0x30> + 46b88: 37fd addiw a5,a5,-1 And finally "split" is with this patch also applied, changing code in places as well, e.g.: @@ -4873598,13 +4873598,13 @@ 5f5744: 87bf70ef jal ra,5ecfbe <_gfortrani_internal_error> 5f5748: 8281b407 fld fs0,-2008(gp) # 776000 <__SDATA_BEGIN__+0x28> 5f574c: 221c fld fa5,0(a2) - 5f574e: 0079a7b7 lui a5,0x79a - 5f5752: ac22 fsd fs0,24(sp) - 5f5754: a83e fsd fa5,16(sp) - 5f5756: 27c2 fld fa5,16(sp) - 5f5758: 4907b707 fld fa4,1168(a5) # 79a490 <__global_pointer$+0x23cb8> - 5f575c: 22f7a7d3 fabs.d fa5,fa5 - 5f5760: 00102773 frflags a4 + 5f574e: ac22 fsd fs0,24(sp) + 5f5750: a83e fsd fa5,16(sp) + 5f5752: 27c2 fld fa5,16(sp) + 5f5754: 22f7a7d3 fabs.d fa5,fa5 + 5f5758: 00102773 frflags a4 + 5f575c: 0079a7b7 lui a5,0x79a + 5f5760: 4907b707 fld fa4,1168(a5) # 79a490 <__global_pointer$+0x23cb8> 5f5764: a2e787d3 fle.d a5,fa5,fa4 5f5768: 00171073 fsflags a4 5f576c: 2c078363 beqz a5,5f5a32 or: @@ -4909410,9 +4909410,9 @@ 60eb8a: a2f696d3 flt.d a3,fa3,fa5 60eb8e: 00161073 fsflags a2 60eb92: ee81 bnez a3,60ebaa <__hypot+0x58> - 60eb94: f20707d3 fmv.d.x fa5,a4 - 60eb98: 22f7a7d3 fabs.d fa5,fa5 - 60eb9c: 00102673 frflags a2 + 60eb94: 00102673 frflags a2 + 60eb98: f20707d3 fmv.d.x fa5,a4 + 60eb9c: 22f7a7d3 fabs.d fa5,fa5 60eba0: a2f696d3 flt.d a3,fa3,fa5 60eba4: 00161073 fsflags a2 60eba8: c29d beqz a3,60ebce <__hypot+0x7c> (so no arithmetic FP instructions appear to be scheduled between FSFLAGS and FRFLAGS, though it's not clear to me how the compiler knows it is not allowed do it) or finally: - 66204: 52cd754b fnmsub.d fa0,fs10,fa2,fa0 - 66208: 40157553 fcvt.s.d fa0,fa0 - 6620c: a0e517d3 flt.s a5,fa0,fa4 - 66210: 58079263 bnez a5,66794 - 66214: 00102773 frflags a4 - 66218: a0e517d3 flt.s a5,fa0,fa4 - 6621c: 00171073 fsflags a4 - 66220: 220793e3 bnez a5,66c46 - 66224: 580576d3 fsqrt.s fa3,fa0 + 6620c: 52cd754b fnmsub.d fa0,fs10,fa2,fa0 + 66210: 40157553 fcvt.s.d fa0,fa0 + 66214: a0e517d3 flt.s a5,fa0,fa4 + 66218: 58079063 bnez a5,66798 + 6621c: 00102773 frflags a4 + 66220: 00171073 fsflags a4 + 66224: 220793e3 bnez a5,66c4a + 66228: 580576d3 fsqrt.s fa3,fa0 (at least removing a redundant FLT.S instruction, although this doesn't seem optimal anyway as there appears no way for the second BNEZ branch to be ever taken, but I gather that's an unfortunate consequence of the volatility of `riscv_frflags'/`riscv_fsflags' RTL insns) and I was able to spot a place where an FMV.D instruction has been removed too, indicating a better register allocation. Results quoted above seem to suggest that in some cases a performance regression has resulted from the change, but that may not necessarily be the case given that the benchmarks have been run on a live even if lightly loaded Linux system. Obtaining standard three samples would require ~4.5 days per SPECfp 2006 iteration or almost a fortnight total. Therefore I chose to rerun only one of the worst offenders and the results are as follows: Estimated Base Base Base Benchmarks Ref. Run Time Ratio -------------- ------ --------- --------- 416.gamess 19580 9138 2.14 S 416.gamess 19580 9498 2.06 S 416.gamess 19580 9478 2.07 * corresponding to the "split" result earlier on. So the variation between runs is similar to the supposed loss of performance and therefore I think we do not need to be concerned. If there's anything that we're missing, it's the tracking of IEEE exception flags, as I previously mentioned. I did not run benchmarking for `-fsignaling-nans'. Relative figures are expected to be similar as the only difference is the presence of a FEQ.fmt instruction following FSFLAGS. I've spotted this anomaly however: - 3c670: 5a057553 fsqrt.d fa0,fa0 - 3c674: f20006d3 fmv.d.x fa3,zero - 3c678: 8d01b787 fld fa5,-1840(gp) # c5678 <__SDATA_BEGIN__+0xd0> - 3c67c: 0ad57553 fsub.d fa0,fa0,fa3 - 3c680: a2a797d3 flt.d a5,fa5,fa0 - 3c684: e3cd bnez a5,3c726 <_ZN9ResultSet5checkEv+0xe6> - 3c686: 8d81b787 fld fa5,-1832(gp) # c5680 <__SDATA_BEGIN__+0xd8> - 3c68a: a2f517d3 flt.d a5,fa0,fa5 - 3c68e: efc1 bnez a5,3c726 <_ZN9ResultSet5checkEv+0xe6> - 3c690: 3578 fld fa4,232(a0) - 3c692: 317c fld fa5,224(a0) - 3c694: 3968 fld fa0,240(a0) - 3c696: 12e77753 fmul.d fa4,fa4,fa4 - 3c69a: 72f7f7c3 fmadd.d fa5,fa5,fa5,fa4 - 3c69e: 7aa57543 fmadd.d fa0,fa0,fa0,fa5 - 3c6a2: 00102773 frflags a4 - 3c6a6: a2d517d3 flt.d a5,fa0,fa3 - 3c6aa: 00171073 fsflags a4 - 3c6ae: a2d52053 feq.d zero,fa0,fa3 - 3c6b2: efc9 bnez a5,3c74c <_ZN9ResultSet5checkEv+0x10c> - 3c6b4: 5a057553 fsqrt.d fa0,fa0 + 3c66e: 5a057553 fsqrt.d fa0,fa0 + 3c672: f20006d3 fmv.d.x fa3,zero + 3c676: 8d01b787 fld fa5,-1840(gp) # c5678 <__SDATA_BEGIN__+0xd0> + 3c67a: 0ad57553 fsub.d fa0,fa0,fa3 + 3c67e: a2a797d3 flt.d a5,fa5,fa0 + 3c682: e7cd bnez a5,3c72c <_ZN9ResultSet5checkEv+0xee> + 3c684: 8d81b787 fld fa5,-1832(gp) # c5680 <__SDATA_BEGIN__+0xd8> + 3c688: a2f517d3 flt.d a5,fa0,fa5 + 3c68c: e3c5 bnez a5,3c72c <_ZN9ResultSet5checkEv+0xee> + 3c68e: 3578 fld fa4,232(a0) + 3c690: 317c fld fa5,224(a0) + 3c692: 3968 fld fa0,240(a0) + 3c694: 12e77753 fmul.d fa4,fa4,fa4 + 3c698: 72f7f7c3 fmadd.d fa5,fa5,fa5,fa4 + 3c69c: 7aa57543 fmadd.d fa0,fa0,fa0,fa5 + 3c6a0: 00102773 frflags a4 + 3c6a4: a2d517d3 flt.d a5,fa0,fa3 + 3c6a8: 00171073 fsflags a4 + 3c6ac: f20007d3 fmv.d.x fa5,zero + 3c6b0: a2f52053 feq.d zero,fa0,fa5 + 3c6b4: efd9 bnez a5,3c752 <_ZN9ResultSet5checkEv+0x114> + 3c6b6: 5a057553 fsqrt.d fa0,fa0 where the compiler for some reason cannot realise it already has the value of 0.0 available in fa3 and instead uses an extra move to fa5 for the final FEQ.D. > No regressions with the GCC (with and w/o `-fsignaling-nans') and glibc > testsuites (as per commit 1fcbfb00fc67 ("RISC-V: Fix -fsignaling-nans for > glibc testsuite.")). OK to apply? Any comments on the change, anyone? Maciej