From: pinskia@gmail.com
To: Sebastian Pop
Cc: Benedikt Huber, GCC Patches, "Dr. Philipp Tomsich", "Kumar, Venkataramanan", Evandro Menezes, Kyrill Tkachov, Marcus Shawcroft, "Richard.Earnshaw@foss.arm.com", Ramana Radhakrishnan, Marcus Shawcroft
Subject: Re: [PATCH] 2015-07-31 Benedikt Huber Philipp Tomsich
Date: Thu, 03 Sep 2015 16:17:00 -0000
References: <1438362335-48036-1-git-send-email-benedikt.huber@theobroma-systems.com> <1438362335-48036-2-git-send-email-benedikt.huber@theobroma-systems.com> <0EEA3B43-319F-4E50-8CC4-2CB3F5C082C5@theobroma-systems.com>
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
X-SW-Source: 2015-09/txt/msg00285.txt.bz2

> On Sep 3, 2015, at 11:58 PM, Sebastian Pop wrote:
>
> On Wed, Aug 26, 2015 at 11:58 AM, Benedikt Huber wrote:
>> ping
>>
>> [PATCH v4][aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math
>>
>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02698.html
>>
>>> On 31 Jul 2015, at 19:05, Benedikt Huber wrote:
>>>
>>> * config/aarch64/aarch64-builtins.c: Builtins for rsqrt and
>>> rsqrtf.
>>> * config/aarch64/aarch64-opts.h: -mrecip has a default value
>>> depending on the core.
>>> * config/aarch64/aarch64-protos.h: Declare.
>>> * config/aarch64/aarch64-simd.md: Matching expressions for
>>> frsqrte and frsqrts.
>>> * config/aarch64/aarch64-tuning-flags.def: Added
>>> MRECIP_DEFAULT_ENABLED.
>>> * config/aarch64/aarch64.c: New functions. Emit rsqrt
>>> estimation code in fast math mode.
>>> * config/aarch64/aarch64.md: Added enum entries.
>>> * config/aarch64/aarch64.opt: Added options -mrecip and
>>> -mlow-precision-recip-sqrt.
>>> * testsuite/gcc.target/aarch64/rsqrt-asm-check.c: Assembly scans
>>> for frsqrte and frsqrts
>>> * testsuite/gcc.target/aarch64/rsqrt.c: Functional tests for rsqrt.
>
> The patch looks good to me.
> You will need an ARM/AArch64 maintainer to approve your patch: +Ramana

Except it is missing comments before almost all new functions.
Yes, the aarch64 backend does not follow that rule, but that does not mean you should not either.

Thanks,
Andrew

>
> Thanks,
> Sebastian
>
>>>
>>> Signed-off-by: Philipp Tomsich
>>> ---
>>>  gcc/ChangeLog                                      |  21 ++++
>>>  gcc/config/aarch64/aarch64-builtins.c              | 104 ++++++++++++++++++++
>>>  gcc/config/aarch64/aarch64-opts.h                  |   7 ++
>>>  gcc/config/aarch64/aarch64-protos.h                |   2 +
>>>  gcc/config/aarch64/aarch64-simd.md                 |  27 ++++++
>>>  gcc/config/aarch64/aarch64-tuning-flags.def        |   1 +
>>>  gcc/config/aarch64/aarch64.c                       | 106 +++++++++++++++++++-
>>>  gcc/config/aarch64/aarch64.md                      |   3 +
>>>  gcc/config/aarch64/aarch64.opt                     |   8 ++
>>>  gcc/doc/invoke.texi                                |  19 ++++
>>>  gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c |  63 ++++++++++++
>>>  gcc/testsuite/gcc.target/aarch64/rsqrt.c           | 107 +++++++++++++++++++++
>>>  12 files changed, 463 insertions(+), 5 deletions(-)
>>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c
>>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>>
>>> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
>>> index 3432adb..3bf3098 100644
>>> --- a/gcc/ChangeLog
>>> +++ b/gcc/ChangeLog
>>> @@ -1,3 +1,24 @@
>>> +2015-07-31 Benedikt Huber
>>> + Philipp Tomsich
>>> +
>>> + * config/aarch64/aarch64-builtins.c: Builtins for rsqrt and
>>> + rsqrtf.
>>> + * config/aarch64/aarch64-opts.h: -mrecip has a default value
>>> + depending on the core.
>>> + * config/aarch64/aarch64-protos.h: Declare.
>>> + * config/aarch64/aarch64-simd.md: Matching expressions for
>>> + frsqrte and frsqrts.
>>> + * config/aarch64/aarch64-tuning-flags.def: Added
>>> + MRECIP_DEFAULT_ENABLED.
>>> + * config/aarch64/aarch64.c: New functions. Emit rsqrt
>>> + estimation code in fast math mode.
>>> + * config/aarch64/aarch64.md: Added enum entries.
>>> + * config/aarch64/aarch64.opt: Added options -mrecip and
>>> + -mlow-precision-recip-sqrt.
>>> + * testsuite/gcc.target/aarch64/rsqrt-asm-check.c: Assembly scans
>>> + for frsqrte and frsqrts
>>> + * testsuite/gcc.target/aarch64/rsqrt.c: Functional tests for rsqrt.
>>> +
>>> 2015-07-08 Jiong Wang
>>>
>>> * config/aarch64/aarch64.c (aarch64_unspec_may_trap_p): New function.
>>> diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
>>> index b6c89b9..b4f443c 100644
>>> --- a/gcc/config/aarch64/aarch64-builtins.c
>>> +++ b/gcc/config/aarch64/aarch64-builtins.c
>>> @@ -335,6 +335,11 @@ enum aarch64_builtins
>>>    AARCH64_BUILTIN_GET_FPSR,
>>>    AARCH64_BUILTIN_SET_FPSR,
>>>
>>> +  AARCH64_BUILTIN_RSQRT_DF,
>>> +  AARCH64_BUILTIN_RSQRT_SF,
>>> +  AARCH64_BUILTIN_RSQRT_V2DF,
>>> +  AARCH64_BUILTIN_RSQRT_V2SF,
>>> +  AARCH64_BUILTIN_RSQRT_V4SF,
>>>    AARCH64_SIMD_BUILTIN_BASE,
>>>    AARCH64_SIMD_BUILTIN_LANE_CHECK,
>>>  #include "aarch64-simd-builtins.def"
>>> @@ -824,6 +829,42 @@ aarch64_init_crc32_builtins ()
>>>  }
>>>
>>>  void
>>> +aarch64_add_builtin_rsqrt (void)

Here.

>>> +{
>>> +  tree fndecl = NULL;
>>> +  tree ftype = NULL;
>>> +
>>> +  tree V2SF_type_node = build_vector_type (float_type_node, 2);
>>> +  tree V2DF_type_node = build_vector_type (double_type_node, 2);
>>> +  tree V4SF_type_node = build_vector_type (float_type_node, 4);
>>> +
>>> +  ftype = build_function_type_list (double_type_node, double_type_node, NULL_TREE);
>>> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_df",
>>> +    ftype, AARCH64_BUILTIN_RSQRT_DF, BUILT_IN_MD, NULL, NULL_TREE);
>>> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_DF] = fndecl;
>>> +
>>> +  ftype = build_function_type_list (float_type_node, float_type_node, NULL_TREE);
>>> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_sf",
>>> +    ftype, AARCH64_BUILTIN_RSQRT_SF, BUILT_IN_MD, NULL, NULL_TREE);
>>> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_SF] = fndecl;
>>> +
>>> +  ftype = build_function_type_list (V2DF_type_node, V2DF_type_node, NULL_TREE);
>>> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_v2df",
>>> +    ftype, AARCH64_BUILTIN_RSQRT_V2DF, BUILT_IN_MD, NULL, NULL_TREE);
>>> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2DF] = fndecl;
>>> +
>>> +  ftype = build_function_type_list (V2SF_type_node, V2SF_type_node, NULL_TREE);
>>> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_v2sf",
>>> +    ftype, AARCH64_BUILTIN_RSQRT_V2SF, BUILT_IN_MD, NULL, NULL_TREE);
>>> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2SF] = fndecl;
>>> +
>>> +  ftype = build_function_type_list (V4SF_type_node, V4SF_type_node, NULL_TREE);
>>> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_v4sf",
>>> +    ftype, AARCH64_BUILTIN_RSQRT_V4SF, BUILT_IN_MD, NULL, NULL_TREE);
>>> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V4SF] = fndecl;
>>> +}
>>> +
>>> +void
>>>  aarch64_init_builtins (void)
>>>  {
>>>    tree ftype_set_fpr
>>> @@ -848,6 +889,7 @@ aarch64_init_builtins (void)
>>>    aarch64_init_simd_builtins ();
>>>    if (TARGET_CRC32)
>>>      aarch64_init_crc32_builtins ();
>>> +  aarch64_add_builtin_rsqrt ();
>>>  }
>>>
>>>  tree
>>> @@ -1092,6 +1134,39 @@ aarch64_crc32_expand_builtin (int fcode, tree exp, rtx target)
>>>    return target;
>>>  }
>>>
>>> +static rtx
>>> +aarch64_expand_builtin_rsqrt (int fcode, tree exp, rtx target)
>>> +{

Likewise.

>>> +  rtx pat;
>>> +  tree arg0 = CALL_EXPR_ARG (exp, 0);
>>> +  rtx op0 = expand_normal (arg0);
>>> +
>>> +  enum insn_code c;
>>> +
>>> +  switch (fcode)
>>> +    {
>>> +      case AARCH64_BUILTIN_RSQRT_DF:
>>> +        c = CODE_FOR_rsqrt_df2; break;
>>> +      case AARCH64_BUILTIN_RSQRT_SF:
>>> +        c = CODE_FOR_rsqrt_sf2; break;
>>> +      case AARCH64_BUILTIN_RSQRT_V2DF:
>>> +        c = CODE_FOR_rsqrt_v2df2; break;
>>> +      case AARCH64_BUILTIN_RSQRT_V2SF:
>>> +        c = CODE_FOR_rsqrt_v2sf2; break;
>>> +      case AARCH64_BUILTIN_RSQRT_V4SF:
>>> +        c = CODE_FOR_rsqrt_v4sf2; break;
>>> +      default: gcc_unreachable ();
>>> +    }
>>> +
>>> +  if (!target)
>>> +    target = gen_reg_rtx (GET_MODE (op0));
>>> +
>>> +  pat = GEN_FCN (c) (target, op0);
>>> +  emit_insn (pat);
>>> +
>>> +  return target;
>>> +}
>>> +
>>>  /* Expand an expression EXP that calls a built-in function,
>>>     with result going to TARGET if that's convenient. */
>>>  rtx
>>> @@ -1139,6 +1214,13 @@ aarch64_expand_builtin (tree exp,
>>>    else if (fcode >= AARCH64_CRC32_BUILTIN_BASE && fcode <= AARCH64_CRC32_BUILTIN_MAX)
>>>      return aarch64_crc32_expand_builtin (fcode, exp, target);
>>>
>>> +  if (fcode == AARCH64_BUILTIN_RSQRT_DF
>>> +      || fcode == AARCH64_BUILTIN_RSQRT_SF
>>> +      || fcode == AARCH64_BUILTIN_RSQRT_V2DF
>>> +      || fcode == AARCH64_BUILTIN_RSQRT_V2SF
>>> +      || fcode == AARCH64_BUILTIN_RSQRT_V4SF)
>>> +    return aarch64_expand_builtin_rsqrt (fcode, exp, target);
>>> +
>>>    gcc_unreachable ();
>>>  }
>>>
>>> @@ -1296,6 +1378,28 @@ aarch64_builtin_vectorized_function (tree fndecl, tree type_out, tree type_in)
>>>    return NULL_TREE;
>>>  }
>>>
>>> +tree
>>> +aarch64_builtin_rsqrt (unsigned int fn, bool md_fn)
>>> +{

And here.

>>> +  if (md_fn)
>>> +    {
>>> +      if (fn == AARCH64_SIMD_BUILTIN_UNOP_sqrtv2df)
>>> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2DF];
>>> +      if (fn == AARCH64_SIMD_BUILTIN_UNOP_sqrtv2sf)
>>> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2SF];
>>> +      if (fn == AARCH64_SIMD_BUILTIN_UNOP_sqrtv4sf)
>>> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V4SF];
>>> +    }
>>> +  else
>>> +    {
>>> +      if (fn == BUILT_IN_SQRT)
>>> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_DF];
>>> +      if (fn == BUILT_IN_SQRTF)
>>> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_SF];
>>> +    }
>>> +  return NULL_TREE;
>>> +}
>>> +
>>>  #undef VAR1
>>>  #define VAR1(T, N, MAP, A) \
>>>    case AARCH64_SIMD_BUILTIN_##T##_##N##A:
>>> diff --git a/gcc/config/aarch64/aarch64-opts.h b/gcc/config/aarch64/aarch64-opts.h
>>> index 24bfd9f..75e9c67 100644
>>> --- a/gcc/config/aarch64/aarch64-opts.h
>>> +++ b/gcc/config/aarch64/aarch64-opts.h
>>> @@ -64,4 +64,11 @@ enum aarch64_code_model {
>>>    AARCH64_CMODEL_LARGE
>>>  };
>>>
>>> +/* Each core can have -mrecip enabled or disabled by default. */
>>> +enum aarch64_mrecip {
>>> +  AARCH64_MRECIP_OFF = 0,
>>> +  AARCH64_MRECIP_ON,
>>> +  AARCH64_MRECIP_DEFAULT,
>>> +};
>>> +
>>>  #endif
>>> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
>>> index 4062c27..8b8a389 100644
>>> --- a/gcc/config/aarch64/aarch64-protos.h
>>> +++ b/gcc/config/aarch64/aarch64-protos.h
>>> @@ -321,6 +321,8 @@ void aarch64_print_operand (FILE *, rtx, char);
>>>  void aarch64_print_operand_address (FILE *, rtx);
>>>  void aarch64_emit_call_insn (rtx);
>>>
>>> +void aarch64_emit_swrsqrt (rtx, rtx);
>>> +
>>>  /* Initialize builtins for SIMD intrinsics. */
>>>  void init_aarch64_simd_builtins (void);
>>>
>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>> index b90f938..ae81731 100644
>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>> @@ -353,6 +353,33 @@
>>>    [(set_attr "type" "neon_fp_mul_d_scalar_q")]
>>>  )
>>>
>>> +(define_insn "rsqrte_2"
>>> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
>>> +        (unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
>>> +                      UNSPEC_RSQRTE))]
>>> +  "TARGET_SIMD"
>>> +  "frsqrte\\t%0, %1"
>>> +  [(set_attr "type" "neon_fp_rsqrte_")])
>>> +
>>> +(define_insn "rsqrts_3"
>>> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
>>> +        (unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
>>> +                       (match_operand:VALLF 2 "register_operand" "w")]
>>> +                      UNSPEC_RSQRTS))]
>>> +  "TARGET_SIMD"
>>> +  "frsqrts\\t%0, %1, %2"
>>> +  [(set_attr "type" "neon_fp_rsqrts_")])
>>> +
>>> +(define_expand "rsqrt_2"
>>> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
>>> +        (unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
>>> +                      UNSPEC_RSQRT))]
>>> +  "TARGET_SIMD"
>>> +{
>>> +  aarch64_emit_swrsqrt (operands[0], operands[1]);
>>> +  DONE;
>>> +})
>>> +
>>>  (define_insn "*aarch64_mul3_elt_to_64v2df"
>>>    [(set (match_operand:DF 0 "register_operand" "=w")
>>>       (mult:DF
>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> index 01aaca8..97dbf00 100644
>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>> @@ -31,4 +31,5 @@
>>>     flags. */
>>>
>>>  AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS, 0)
>>> +AARCH64_EXTRA_TUNING_OPTION ("mrecip_default_enabled", MRECIP_DEFAULT_ENABLED, 1)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 6c13a078..76c0eee 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -364,7 +364,7 @@ static const struct tune_params generic_tunings =
>>>    1, /* vec_reassoc_width. */
>>>    2, /* min_div_recip_mul_sf. */
>>>    2, /* min_div_recip_mul_df. */
>>> -  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
>>> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED) /* tune_flags. */
>>>  };
>>>
>>>  static const struct tune_params cortexa53_tunings =
>>> @@ -386,7 +386,7 @@ static const struct tune_params cortexa53_tunings =
>>>    1, /* vec_reassoc_width. */
>>>    2, /* min_div_recip_mul_sf. */
>>>    2, /* min_div_recip_mul_df. */
>>> -  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
>>> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED) /* tune_flags. */
>>>  };
>>>
>>>  static const struct tune_params cortexa57_tunings =
>>> @@ -408,7 +408,8 @@ static const struct tune_params cortexa57_tunings =
>>>    1, /* vec_reassoc_width. */
>>>    2, /* min_div_recip_mul_sf. */
>>>    2, /* min_div_recip_mul_df. */
>>> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */
>>> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS /* tune_flags. */
>>> +   | AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)
>>>  };
>>>
>>>  static const struct tune_params cortexa72_tunings =
>>> @@ -430,7 +431,7 @@ static const struct tune_params cortexa72_tunings =
>>>    1, /* vec_reassoc_width. */
>>>    2, /* min_div_recip_mul_sf. */
>>>    2, /* min_div_recip_mul_df. */
>>> -  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
>>> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED) /* tune_flags. */
>>>  };
>>>
>>>  static const struct tune_params thunderx_tunings =
>>> @@ -472,7 +473,7 @@ static const struct tune_params xgene1_tunings =
>>>    1, /* vec_reassoc_width. */
>>>    2, /* min_div_recip_mul_sf. */
>>>    2, /* min_div_recip_mul_df. */
>>> -  (AARCH64_EXTRA_TUNE_NONE) /* tune_flags. */
>>> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED) /* tune_flags. */
>>>  };
>>>
>>>  /* Support for fine-grained override of the tuning structures. */
>>> @@ -6961,6 +6962,98 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
>>>    return aarch64_tune_params.memmov_cost;
>>>  }
>>>
>>> +extern tree aarch64_builtin_rsqrt (unsigned int fn, bool md_fn);

Why isn't that in a header file instead of here?

>>> +
>>> +static tree
>>> +aarch64_builtin_reciprocal (unsigned int fn,
>>> +                            bool md_fn,
>>> +                            bool)

And here.

>>> +{
>>> +  if (!flag_finite_math_only
>>> +      || flag_trapping_math
>>> +      || !flag_unsafe_math_optimizations
>>> +      || optimize_size
>>> +      || flag_mrecip == AARCH64_MRECIP_OFF
>>> +      || (flag_mrecip == AARCH64_MRECIP_DEFAULT
>>> +          && !(aarch64_tune_params.extra_tuning_flags
>>> +               & AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)))
>>> +    {
>>> +      return NULL_TREE;
>>> +    }
>>> +
>>> +  return aarch64_builtin_rsqrt (fn, md_fn);
>>> +}
>>> +
>>> +typedef rtx (*rsqrte_type) (rtx, rtx);
>>> +
>>> +rsqrte_type get_rsqrte_type (enum machine_mode mode)
>>> +{

And another one.

>>> +  switch (mode)
>>> +    {
>>> +      case DFmode: return gen_rsqrte_df2;
>>> +      case SFmode: return gen_rsqrte_sf2;
>>> +      case V2DFmode: return gen_rsqrte_v2df2;
>>> +      case V2SFmode: return gen_rsqrte_v2sf2;
>>> +      case V4SFmode: return gen_rsqrte_v4sf2;
>>> +      default: gcc_unreachable ();
>>> +    }
>>> +}
>>> +
>>> +typedef rtx (*rsqrts_type) (rtx, rtx, rtx);
>>> +
>>> +rsqrts_type get_rsqrts_type (enum machine_mode mode)

And another one.

>>> +{
>>> +  switch (mode)
>>> +    {
>>> +      case DFmode: return gen_rsqrts_df3;
>>> +      case SFmode: return gen_rsqrts_sf3;
>>> +      case V2DFmode: return gen_rsqrts_v2df3;
>>> +      case V2SFmode: return gen_rsqrts_v2sf3;
>>> +      case V4SFmode: return gen_rsqrts_v4sf3;
>>> +      default: gcc_unreachable ();
>>> +    }
>>> +}
>>> +
>>> +void
>>> +aarch64_emit_swrsqrt (rtx dst, rtx src)
>>> +{

One more.

>>> +  enum machine_mode mode = GET_MODE (src);
>>> +  gcc_assert (
>>> +    mode == SFmode || mode == V2SFmode || mode == V4SFmode ||
>>> +    mode == DFmode || mode == V2DFmode);
>>> +
>>> +  rtx xsrc = gen_reg_rtx (mode);
>>> +  emit_move_insn (xsrc, src);
>>> +  rtx x0 = gen_reg_rtx (mode);
>>> +
>>> +  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
>>> +
>>> +  bool double_mode = (mode == DFmode || mode == V2DFmode);
>>> +
>>> +  int iterations = 2;
>>> +  if (double_mode)
>>> +    iterations = 3;
>>> +
>>> +  if (flag_mrecip_low_precision_sqrt)
>>> +    iterations--;
>>> +
>>> +  for (int i = 0; i < iterations; ++i)
>>> +    {
>>> +      rtx x1 = gen_reg_rtx (mode);
>>> +      rtx x2 = gen_reg_rtx (mode);
>>> +      rtx x3 = gen_reg_rtx (mode);
>>> +      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
>>> +
>>> +      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
>>> +
>>> +      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
>>> +      x0 = x1;
>>> +    }
>>> +
>>> +  emit_move_insn (dst, x0);
>>> +  return;
>>> +}
>>> +
>>>  /* Return the number of instructions that can be issued per cycle. */
>>>  static int
>>>  aarch64_sched_issue_rate (void)
>>> @@ -12099,6 +12192,9 @@ aarch64_unspec_may_trap_p (const_rtx x, unsigned flags)
>>>  #undef TARGET_USE_BLOCKS_FOR_CONSTANT_P
>>>  #define TARGET_USE_BLOCKS_FOR_CONSTANT_P aarch64_use_blocks_for_constant_p
>>>
>>> +#undef TARGET_BUILTIN_RECIPROCAL
>>> +#define TARGET_BUILTIN_RECIPROCAL aarch64_builtin_reciprocal
>>> +
>>>  #undef TARGET_VECTOR_MODE_SUPPORTED_P
>>>  #define TARGET_VECTOR_MODE_SUPPORTED_P aarch64_vector_mode_supported_p
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 1e343fa..d7944b2 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -122,6 +122,9 @@
>>>      UNSPEC_VSTRUCTDUMMY
>>>      UNSPEC_SP_SET
>>>      UNSPEC_SP_TEST
>>> +    UNSPEC_RSQRT
>>> +    UNSPEC_RSQRTE
>>> +    UNSPEC_RSQRTS
>>>  ])
>>>
>>>  (define_c_enum "unspecv" [
>>> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>>> index 98ef9f6..7921b85 100644
>>> --- a/gcc/config/aarch64/aarch64.opt
>>> +++ b/gcc/config/aarch64/aarch64.opt
>>> @@ -124,3 +124,11 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)
>>>
>>>  EnumValue
>>>  Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
>>> +
>>> +mrecip
>>> +Common Report Var(flag_mrecip) Optimization Init(AARCH64_MRECIP_DEFAULT)
>>> +Generate software reciprocal square root for better throughput.
>>> +
>>> +mlow-precision-recip-sqrt
>>> +Common Var(flag_mrecip_low_precision_sqrt) Optimization
>>> +Run fewer approximation steps to reduce latency and precision.
>>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>>> index b28e5d6..bd922a3 100644
>>> --- a/gcc/doc/invoke.texi
>>> +++ b/gcc/doc/invoke.texi
>>> @@ -515,6 +515,8 @@ Objective-C and Objective-C++ Dialects}.
>>>  -mtls-dialect=desc -mtls-dialect=traditional @gol
>>>  -mfix-cortex-a53-835769 -mno-fix-cortex-a53-835769 @gol
>>>  -mfix-cortex-a53-843419 -mno-fix-cortex-a53-843419 @gol
>>> +-mrecip -mno-recip @gol
>>> +-mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt@gol
>>>  -march=@var{name} -mcpu=@var{name} -mtune=@var{name}}
>>>
>>>  @emph{Adapteva Epiphany Options}
>>> @@ -12426,6 +12428,23 @@ Enable or disable the workaround for the ARM Cortex-A53 erratum number 843419.
>>>  This erratum workaround is made at link time and this will only pass the
>>>  corresponding flag to the linker.
>>>
>>> +@item -mrecip
>>> +@item -mno-recip
>>> +@opindex mrecip
>>> +@opindex mno-recip
>>> +This option enables use of the
>>> +reciprocal square root estimate instructions with additional
>>> +Newton-Raphson steps to increase precision instead of doing a square root and
>>> +divide for floating-point arguments.
>>> +It can only be used together with @option{-ffast-math}.
>>> +
>>> +@item -mlow-precision-recip-sqrt
>>> +@item -mno-low-precision-recip-sqrt
>>> +@opindex -mlow-precision-recip-sqrt
>>> +@opindex -mno-low-precision-recip-sqrt
>>> +The square root estimate uses two steps instead of three for double-precision,
>>> +and one step instead of two for single-precision. Thus reducing latency and precision.
>>> +
>>>  @item -march=@var{name}
>>>  @opindex march
>>>  Specify the name of the target architecture, optionally suffixed by one or
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c b/gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c
>>> new file mode 100644
>>> index 0000000..d6cfe11
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c
>>> @@ -0,0 +1,63 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-options "-O3 --save-temps -fverbose-asm -ffast-math -mrecip" } */
>>> +
>>> +#include
>>> +
>>> +#define sqrt_float __builtin_sqrtf
>>> +#define sqrt_double __builtin_sqrt
>>> +
>>> +#define TESTTYPE(TYPE) \
>>> +typedef struct { \
>>> +  TYPE a; \
>>> +  TYPE b; \
>>> +  TYPE c; \
>>> +  TYPE d; \
>>> +} s4_##TYPE; \
>>> + \
>>> +typedef struct { \
>>> +  TYPE a; \
>>> +  TYPE b; \
>>> +} s2_##TYPE; \
>>> + \
>>> +s4_##TYPE rsqrtv4_##TYPE (s4_##TYPE i) \
>>> +{ \
>>> +  s4_##TYPE o; \
>>> +  o.a = 1.0 / sqrt_##TYPE (i.a); \
>>> +  o.b = 1.0 / sqrt_##TYPE (i.b); \
>>> +  o.c = 1.0 / sqrt_##TYPE (i.c); \
>>> +  o.d = 1.0 / sqrt_##TYPE (i.d); \
>>> +  return o; \
>>> +} \
>>> + \
>>> +s2_##TYPE rsqrtv2_##TYPE (s2_##TYPE i) \
>>> +{ \
>>> +  s2_##TYPE o; \
>>> +  o.a = 1.0 / sqrt_##TYPE (i.a); \
>>> +  o.b = 1.0 / sqrt_##TYPE (i.b); \
>>> +  return o; \
>>> +} \
>>> + \
>>> +TYPE rsqrt_##TYPE (TYPE i) \
>>> +{ \
>>> +  return 1.0 / sqrt_##TYPE (i); \
>>> +} \
>>> + \
>>> +
>>> +TESTTYPE(double)
>>> +TESTTYPE(float)
>>> +
>>> +/* { dg-final { scan-assembler-times "frsqrte\\td\[0-9\]+, d\[0-9\]+" 1 } } */
>>> +/* { dg-final { scan-assembler-times "frsqrts\\td\[0-9\]+, d\[0-9\]+, d\[0-9\]+" 3 } } */
>>> +
>>> +/* { dg-final { scan-assembler-times "frsqrte\\tv\[0-9\]+.2d, v\[0-9\]+.2d" 3 } } */
>>> +/* { dg-final { scan-assembler-times "frsqrts\\tv\[0-9\]+.2d, v\[0-9\]+.2d, v\[0-9\]+.2d" 9 } } */
>>> +
>>> +
>>> +/* { dg-final { scan-assembler-times "frsqrte\\ts\[0-9\]+, s\[0-9\]+" 1 } } */
>>> +/* { dg-final { scan-assembler-times "frsqrts\\ts\[0-9\]+, s\[0-9\]+, s\[0-9\]+" 2 } } */
>>> +
>>> +/* { dg-final { scan-assembler-times "frsqrte\\tv\[0-9\]+.4s, v\[0-9\]+.4s" 1 } } */
>>> +/* { dg-final { scan-assembler-times "frsqrts\\tv\[0-9\]+.4s, v\[0-9\]+.4s, v\[0-9\]+.4s" 2 } } */
>>> +
>>> +/* { dg-final { scan-assembler-times "frsqrte\\tv\[0-9\]+.2s, v\[0-9\]+.2s" 1 } } */
>>> +/* { dg-final { scan-assembler-times "frsqrts\\tv\[0-9\]+.2s, v\[0-9\]+.2s, v\[0-9\]+.2s" 2 } } */
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt.c b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>> new file mode 100644
>>> index 0000000..4a5c008
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>> @@ -0,0 +1,107 @@
>>> +/* { dg-do run } */
>>> +/* { dg-options "-O3 --save-temps -fverbose-asm -ffast-math -mrecip" } */
>>> +
>>> +#include
>>> +#include
>>> +
>>> +#include
>>> +
>>> +#define PI 3.141592653589793
>>> +#define SQRT2 1.4142135623730951
>>> +
>>> +#define PI_4 0.7853981633974483
>>> +#define SQRT1_2 0.7071067811865475
>>> +
>>> +/* 2^25+1, float has 24 significand bits
>>> + * according to Single-precision floating-point format. */
>>> +#define TESTA8_FLT 33554433
>>> +/* 2^54+1, double has 53 significand bits
>>> + * according to Double-precision floating-point format. */
>>> +#define TESTA8_DBL 18014398509481985
>>> +
>>> +#define SD(a, b) t_double ((#a), (a), (b));
>>> +#define SF(a, b) t_float ((#a), (a), (b));
>>> +
>>> +#define EPSILON_double __DBL_EPSILON__
>>> +#define EPSILON_float __FLT_EPSILON__
>>> +#define ABS_double __builtin_fabs
>>> +#define ABS_float __builtin_fabsf
>>> +#define SQRT_double __builtin_sqrt
>>> +#define SQRT_float __builtin_sqrtf
>>> +
>>> +extern void abort (void);
>>> +
>>> +#define TESTTYPE(TYPE) \
>>> +TYPE rsqrt_##TYPE (TYPE a) \
>>> +{ \
>>> +  return 1.0/SQRT_##TYPE(a); \
>>> +} \
>>> + \
>>> +int equals_##TYPE (TYPE a, TYPE b) \
>>> +{ \
>>> +  return (a == b || \
>>> +   (isnan (a) && isnan (b)) || \
>>> +   (ABS_##TYPE (a - b) < EPSILON_##TYPE)); \
>>> +} \
>>> + \
>>> +void t_##TYPE (const char *s, TYPE a, TYPE result) \
>>> +{ \
>>> +  TYPE r = rsqrt_##TYPE (a); \
>>> +  if (!equals_##TYPE (r, result)) \
>>> +  { \
>>> +    abort (); \
>>> +  } \
>>> +} \
>>> +
>>> +// printf ("Problem in %20s: %30.18A should be %30.18A\n", s, r, result); \
>>> +
>>> +TESTTYPE(double)
>>> +TESTTYPE(float)
>>> +
>>> +int main ()
>>> +{
>>> +  SD( 1.0/256,    0X1.00000000000000P+4 );
>>> +  SD( 1.0,        0X1.00000000000000P+0 );
>>> +  SD( -1.0,       NAN);
>>> +  SD( 11.0,       0X1.34BF63D1568260P-2 );
>>> +  SD( 0.0,        INFINITY);
>>> +  SD( INFINITY,   0X0.00000000000000P+0 );
>>> +  SD( NAN,        NAN);
>>> +  SD( -NAN,       -NAN);
>>> +  SD( DBL_MAX,    0X1.00000000000010P-512);
>>> +  SD( DBL_MIN,    0X1.00000000000000P+511);
>>> +  SD( PI,         0X1.20DD750429B6D0P-1 );
>>> +  SD( PI_4,       0X1.20DD750429B6D0P+0 );
>>> +  SD( SQRT2,      0X1.AE89F995AD3AE0P-1 );
>>> +  SD( SQRT1_2,    0X1.306FE0A31B7150P+0 );
>>> +  SD( -PI,        NAN);
>>> +  SD( -SQRT2,     NAN);
>>> +  SD( TESTA8_DBL, 0X1.00000000000000P-27 );
>>> +
>>> +  SF( 1.0/256,    0X1.00000000000000P+4 );
>>> +  SF( 1.0,        0X1.00000000000000P+0 );
>>> +  SF( -1.0,       NAN);
>>> +  SF( 11.0,       0X1.34BF6400000000P-2 );
>>> +  SF( 0.0,        INFINITY);
>>> +  SF( INFINITY,   0X0.00000000000000P+0 );
>>> +  SF( NAN,        NAN);
>>> +  SF( -NAN,       -NAN);
>>> +  SF( FLT_MAX,    0X1.00000200000000P-64 );
>>> +  SF( FLT_MIN,    0X1.00000000000000P+63 );
>>> +  SF( PI,         0X1.20DD7400000000P-1 );
>>> +  SF( PI_4,       0X1.20DD7400000000P+0 );
>>> +  SF( SQRT2,      0X1.AE89FA00000000P-1 );
>>> +  SF( SQRT1_2,    0X1.306FE000000000P+0 );
>>> +  SF( -PI,        NAN);
>>> +  SF( -SQRT2,     NAN);
>>> +  SF( TESTA8_FLT, 0X1.6A09E600000000P-13 );
>>> +
>>> +// With -ffast-math these return positive INF.
>>> +// SD( -0.0, -INFINITY);
>>> +// SF( -0.0, -INFINITY);
>>> +// The reason here is that -ffast-math flushes to zero.
>>> +// SD(DBL_MIN/256, 0X1.00000000000000P+515);
>>> +// SF(FLT_MIN/256, 0X1.00000000000000P+67 );
>>> +
>>> +  return 0;
>>> +}
>>> --
>>> 1.9.1
>>>
>>