From mboxrd@z Thu Jan 1 00:00:00 1970
Received: by sourceware.org (Postfix, from userid 1005) id D3AC9388E80D; Thu, 31 Mar 2022 14:00:22 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D3AC9388E80D
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Michael Meissner
To: gcc-cvs@gcc.gnu.org
Subject: [gcc(refs/users/meissner/heads/work084)] Optimize vec_splats of constant vec_extract for V2DI/V2DF, PR target 99293.
X-Act-Checkin: gcc
X-Git-Author: Michael Meissner
X-Git-Refname: refs/users/meissner/heads/work084
X-Git-Oldrev: 4b06b3b008d424f48bddb787c20897745383277e
X-Git-Newrev: a8827d3a698ff5e48f0ba97e2cf7d9b3061f3ef0
Message-Id: <20220331140022.D3AC9388E80D@sourceware.org>
Date: Thu, 31 Mar 2022 14:00:22 +0000 (GMT)
X-BeenThere: gcc-cvs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-cvs mailing list
X-List-Received-Date: Thu, 31 Mar 2022 14:00:22 -0000

https://gcc.gnu.org/g:a8827d3a698ff5e48f0ba97e2cf7d9b3061f3ef0

commit a8827d3a698ff5e48f0ba97e2cf7d9b3061f3ef0
Author: Michael Meissner
Date:   Mon Mar 28 23:04:55 2022 -0400

    Optimize vec_splats of constant vec_extract for V2DI/V2DF, PR target 99293.

    In PR target/99293, it was pointed out that doing:

            vector long long dest0, dest1, src;
            /* ... */
            dest0 = vec_splats (vec_extract (src, 0));
            dest1 = vec_splats (vec_extract (src, 1));

    would generate slower code.

    It generates the following code on power8:

            ;; vec_splats (vec_extract (src, 0))
            xxpermdi 0,34,34,3
            xxpermdi 34,0,0,0

            ;; vec_splats (vec_extract (src, 1))
            xxlor 0,34,34
            xxpermdi 34,0,0,0

    However on power9 and power10 it generates:

            ;; vec_splats (vec_extract (src, 0))
            mfvsrld 3,34
            mtvsrdd 34,9,9

            ;; vec_splats (vec_extract (src, 1))
            mfvsrd 9,34
            mtvsrdd 34,9,9

    This is due to the power9 having the mfvsrld instruction which can extract
    either 64-bit element into a GPR.
    While there are alternatives for both vector registers and GPR registers,
    the register allocator prefers to put DImode into GPR registers.

    However, in this case it is better to have a single combiner pattern that
    can generate a single xxpermdi, instead of doing 2 insns (the extract and
    then the concat). This is particularly true if the two operations are a
    move from vector register and a move to vector register.

    As Segher pointed out in a previous version of the patch, the combiner
    already tries to create a (vec_duplicate (vec_select ...)) pattern, but we
    didn't provide one.

    I rewrote the existing pattern vsx_xxspltd_ to use VEC_DUPLICATE instead of
    UNSPEC, so that the case from the PR would match.

    I have built Spec 2017 with this patch installed, and the cam4_r benchmark
    is the only benchmark that generated different code. On a power9, I did not
    notice any significant changes in the runtime of cam4_r.

    I have built bootstrap versions on the following systems. There were no
    regressions in the runs:

            Power9 little endian, --with-cpu=power9
            Power10 little endian, --with-cpu=power10
            Power8 big endian, --with-cpu=power8 (both 32-bit & 64-bit tests)

    Can I install this into the trunk? After a burn-in period, can I backport
    and install this into GCC 11 and GCC 10 branches?

    2022-03-28   Michael Meissner

    gcc/
            PR target/99293
            * config/rs6000/rs6000-p8swap.cc (rtx_is_swappable_p): Remove
            UNSPEC_VSX_XXSPLTD case.
            * config/rs6000/vsx.md (UNSPEC_VSX_XXSPLTD): Delete.
            (vsx_xxspltd_): Rewrite to use VEC_DUPLICATE.

    gcc/testsuite:
            PR target/99293
            * gcc.target/powerpc/builtins-1.c: Update insn count.
            * gcc.target/powerpc/pr99293.c: New test.

Diff:
---
 gcc/config/rs6000/rs6000-p8swap.cc            |  1 -
 gcc/config/rs6000/vsx.md                      | 38 ++++++++++++++++++++-------
 gcc/testsuite/gcc.target/powerpc/builtins-1.c |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr99293.c    | 36 +++++++++++++++++++++++++
 4 files changed, 66 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/rs6000-p8swap.cc b/gcc/config/rs6000/rs6000-p8swap.cc
index d301bc3fe59..1973d9c8245 100644
--- a/gcc/config/rs6000/rs6000-p8swap.cc
+++ b/gcc/config/rs6000/rs6000-p8swap.cc
@@ -805,7 +805,6 @@ rtx_is_swappable_p (rtx op, unsigned int *special)
           case UNSPEC_VUPKLU_V4SF:
             return 0;
           case UNSPEC_VSPLT_DIRECT:
-          case UNSPEC_VSX_XXSPLTD:
             *special = SH_SPLAT;
             return 1;
           case UNSPEC_REDUC_PLUS:
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 1b75538f42f..26226520335 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -296,7 +296,6 @@
    UNSPEC_VSX_XXPERM
    UNSPEC_VSX_XXSPLTW
-   UNSPEC_VSX_XXSPLTD
    UNSPEC_VSX_DIVSD
    UNSPEC_VSX_DIVUD
    UNSPEC_VSX_DIVSQ
@@ -3089,6 +3088,25 @@
 }
   [(set_attr "type" "vecperm")])
 
+;; Combiner patterns to allow creating XXPERMDI's to access either double
+;; word element in a vector register when used with VEC_DUPLICATE..
+(define_insn "*vsx_dup__1"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=wa")
+        (vec_duplicate:VSX_D
+         (vec_select:
+          (match_operand:VSX_D 1 "gpc_reg_operand" "wa")
+          (parallel [(match_operand:QI 2 "const_0_to_1_operand" "n")]))))]
+  "VECTOR_MEM_VSX_P (mode)"
+{
+  HOST_WIDE_INT dword = INTVAL (operands[2]);
+  if (!BYTES_BIG_ENDIAN)
+    dword = !dword;
+
+  operands[3] = GEN_INT (3*dword);
+  return "xxpermdi %x0,%x1,%x1,%3";
+}
+  [(set_attr "type" "vecperm")])
+
 ;; Special purpose concat using xxpermdi to glue two single precision values
 ;; together, relying on the fact that internally scalar floats are represented
 ;; as doubles.  This is used to initialize a V4SF vector with 4 floats
@@ -4673,16 +4691,18 @@
 ;; V2DF/V2DI splat for use by vec_splat builtin
 (define_insn "vsx_xxspltd_"
   [(set (match_operand:VSX_D 0 "vsx_register_operand" "=wa")
-        (unspec:VSX_D [(match_operand:VSX_D 1 "vsx_register_operand" "wa")
-                       (match_operand:QI 2 "u5bit_cint_operand" "i")]
-                      UNSPEC_VSX_XXSPLTD))]
+        (vec_duplicate:VSX_D
+         (vec_select:
+          (match_operand:VSX_D 1 "gpc_reg_operand" "wa")
+          (parallel [(match_operand:QI 2 "const_0_to_1_operand" "i")]))))]
   "VECTOR_MEM_VSX_P (mode)"
 {
-  if ((BYTES_BIG_ENDIAN && INTVAL (operands[2]) == 0)
-      || (!BYTES_BIG_ENDIAN && INTVAL (operands[2]) == 1))
-    return "xxpermdi %x0,%x1,%x1,0";
-  else
-    return "xxpermdi %x0,%x1,%x1,3";
+  HOST_WIDE_INT dword = INTVAL (operands[2]);
+  if (!BYTES_BIG_ENDIAN)
+    dword = !dword;
+
+  operands[3] = GEN_INT (3*dword);
+  return "xxpermdi %x0,%x1,%x1,%3";
 }
   [(set_attr "type" "vecperm")])
 
diff --git a/gcc/testsuite/gcc.target/powerpc/builtins-1.c b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
index 28cd1aa6b1a..98783668bce 100644
--- a/gcc/testsuite/gcc.target/powerpc/builtins-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
@@ -1035,4 +1035,4 @@ foo156 (vector unsigned short usa)
 /* { dg-final { scan-assembler-times {\mvmrglb\M} 3 } } */
 /* { dg-final { scan-assembler-times {\mvmrgew\M} 4 } } */
 /* { dg-final { scan-assembler-times {\mvsplth|xxsplth\M} 4 } } */
-/* { dg-final { scan-assembler-times {\mxxpermdi\M} 44 } } */
+/* { dg-final { scan-assembler-times {\mxxpermdi\M} 42 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr99293.c b/gcc/testsuite/gcc.target/powerpc/pr99293.c
new file mode 100644
index 00000000000..03c22f8f4de
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr99293.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O2 -mvsx" } */
+
+/* Test for PR 99263, which wants to do:
+        __builtin_vec_splats (__builtin_vec_extract (v, n))
+
+   where v is a V2DF or V2DI vector and n is either 0 or 1.  Previously the
+   compiler would do a direct move to the GPR registers to select the item and
+   a direct move from the GPR registers to do the splat.  */
+
+vector long long
+splat_dup_ll_0 (vector long long v)
+{
+  return __builtin_vec_splats (__builtin_vec_extract (v, 0));
+}
+
+vector long long
+splat_dup_ll_1 (vector long long v)
+{
+  return __builtin_vec_splats (__builtin_vec_extract (v, 1));
+}
+
+vector double
+splat_dup_d_0 (vector double v)
+{
+  return __builtin_vec_splats (__builtin_vec_extract (v, 0));
+}
+
+vector double
+splat_dup_d_1 (vector double v)
+{
+  return __builtin_vec_splats (__builtin_vec_extract (v, 1));
+}
+
+/* { dg-final { scan-assembler-times {\mxxpermdi\M} 4 } } */
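
The rewritten output code in vsx.md replaces the old four-way endianness test with "flip the element number on little endian, then multiply by 3". The sketch below is not part of the patch; it is a small host-side C model, under stated assumptions, that may help in checking that the two forms pick the same xxpermdi immediate. All names in it (vsx_reg, old_selector, new_selector, model_xxpermdi, check_selectors_agree) are hypothetical, and model_xxpermdi is a simplified reading of the instruction in which the high bit of the immediate selects the doubleword taken from the first source and the low bit selects the doubleword taken from the second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit VSX register as two doublewords.
   dw[0] is the most significant doubleword, dw[1] the least significant.  */
typedef struct { uint64_t dw[2]; } vsx_reg;

/* Immediate chosen by the old vsx_xxspltd_ output code (the removed
   if/else in the diff).  */
static int old_selector(bool big_endian, int element)
{
    if ((big_endian && element == 0) || (!big_endian && element == 1))
        return 0;
    return 3;
}

/* Immediate chosen by the new output code: flip the element number on
   little endian, then scale by 3 so the result is either 0 or 3.  */
static int new_selector(bool big_endian, int element)
{
    int dword = element;
    if (!big_endian)
        dword = !dword;
    return 3 * dword;
}

/* Simplified model of xxpermdi: the high immediate bit picks the
   doubleword of 'a' that lands in dw[0], the low bit picks the
   doubleword of 'b' that lands in dw[1].  */
static vsx_reg model_xxpermdi(vsx_reg a, vsx_reg b, int dm)
{
    vsx_reg t;
    t.dw[0] = a.dw[(dm >> 1) & 1];
    t.dw[1] = b.dw[dm & 1];
    return t;
}

/* Exhaustively compare the old and new selection logic.  */
static void check_selectors_agree(void)
{
    for (int be = 0; be <= 1; be++)
        for (int element = 0; element <= 1; element++)
            assert(old_selector(be != 0, element)
                   == new_selector(be != 0, element));
}
```

In this model, splatting element 0 on little endian uses immediate 3, which copies the least significant doubleword into both halves, while splatting element 0 on big endian uses immediate 0, which copies the most significant doubleword. Calling check_selectors_agree () walks all four endianness and element combinations.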