From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-406104-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 117516 invoked by alias); 26 Aug 2015 16:58:28 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 117490 invoked by uid 89); 26 Aug 2015 16:58:27 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.6 required=5.0 tests=AWL,BAYES_00,KAM_LAZY_DOMAIN_SECURITY,RP_MATCHES_RCVD autolearn=ham version=3.3.2
X-HELO: mail.theobroma-systems.com
Received: from vegas.theobroma-systems.com (HELO mail.theobroma-systems.com) (144.76.126.164) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Wed, 26 Aug 2015 16:58:23 +0000
Received: from [86.59.122.178] (port=59735 helo=bhuber.lan)	by mail.theobroma-systems.com with esmtpsa (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:256)	(Exim 4.80)	(envelope-from <benedikt.huber@theobroma-systems.com>)	id 1ZUe1j-0000a9-0Y; Wed, 26 Aug 2015 18:58:19 +0200
Subject: Re: [PATCH] 2015-07-31  Benedikt Huber  <benedikt.huber@theobroma-systems.com> 	    Philipp Tomsich  <philipp.tomsich@theobroma-systems.com>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\))
Content-Type: multipart/signed; boundary="Apple-Mail=_8138A134-D688-4E16-8309-9AB97BD020ED"; protocol="application/pgp-signature"; micalg=pgp-sha512
X-Pgp-Agent: GPGMail 2.5
From: Benedikt Huber <benedikt.huber@theobroma-systems.com>
In-Reply-To: <1438362335-48036-2-git-send-email-benedikt.huber@theobroma-systems.com>
Date: Wed, 26 Aug 2015 17:39:00 -0000
Cc: "Dr. Philipp Tomsich" <philipp.tomsich@theobroma-systems.com>, "Kumar, Venkataramanan" <Venkataramanan.Kumar@amd.com>, pinskia@gmail.com, Evandro Menezes <e.menezes@samsung.com>, kyrylo.tkachov@arm.com, marcus.shawcroft@gmail.com, Richard.Earnshaw@foss.arm.com
Message-Id: <0EEA3B43-319F-4E50-8CC4-2CB3F5C082C5@theobroma-systems.com>
References: <1438362335-48036-1-git-send-email-benedikt.huber@theobroma-systems.com> <1438362335-48036-2-git-send-email-benedikt.huber@theobroma-systems.com>
To: gcc-patches@gcc.gnu.org
X-IsSubscribed: yes
X-SW-Source: 2015-08/txt/msg01641.txt.bz2


--Apple-Mail=_8138A134-D688-4E16-8309-9AB97BD020ED
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii
Content-length: 28696

ping

[PATCH v4][aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02698.html

> On 31 Jul 2015, at 19:05, Benedikt Huber <benedikt.huber@theobroma-systems.com> wrote:
>
> 	* config/aarch64/aarch64-builtins.c: Builtins for rsqrt and
> 	rsqrtf.
> 	* config/aarch64/aarch64-opts.h: -mrecip has a default value
> 	depending on the core.
> 	* config/aarch64/aarch64-protos.h: Declare.
> 	* config/aarch64/aarch64-simd.md: Matching expressions for
> 	frsqrte and frsqrts.
> 	* config/aarch64/aarch64-tuning-flags.def: Added
> 	MRECIP_DEFAULT_ENABLED.
> 	* config/aarch64/aarch64.c: New functions. Emit rsqrt
> 	estimation code in fast math mode.
> 	* config/aarch64/aarch64.md: Added enum entries.
> 	* config/aarch64/aarch64.opt: Added options -mrecip and
> 	-mlow-precision-recip-sqrt.
> 	* testsuite/gcc.target/aarch64/rsqrt-asm-check.c: Assembly scans
> 	for frsqrte and frsqrts.
> 	* testsuite/gcc.target/aarch64/rsqrt.c: Functional tests for rsqrt.
>
> Signed-off-by: Philipp Tomsich <philipp.tomsich@theobroma-systems.com>
> ---
> gcc/ChangeLog                                      |  21 ++++
> gcc/config/aarch64/aarch64-builtins.c              | 104 ++++++++++++++++++++
> gcc/config/aarch64/aarch64-opts.h                  |   7 ++
> gcc/config/aarch64/aarch64-protos.h                |   2 +
> gcc/config/aarch64/aarch64-simd.md                 |  27 ++++++
> gcc/config/aarch64/aarch64-tuning-flags.def        |   1 +
> gcc/config/aarch64/aarch64.c                       | 106 +++++++++++++++++++-
> gcc/config/aarch64/aarch64.md                      |   3 +
> gcc/config/aarch64/aarch64.opt                     |   8 ++
> gcc/doc/invoke.texi                                |  19 ++++
> gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c |  63 ++++++++++++
> gcc/testsuite/gcc.target/aarch64/rsqrt.c           | 107 +++++++++++++++++++++
> 12 files changed, 463 insertions(+), 5 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 3432adb..3bf3098 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,24 @@
> +2015-07-31  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
> +	    Philipp Tomsich  <philipp.tomsich@theobroma-systems.com>
> +
> +	* config/aarch64/aarch64-builtins.c: Builtins for rsqrt and
> +	rsqrtf.
> +	* config/aarch64/aarch64-opts.h: -mrecip has a default value
> +	depending on the core.
> +	* config/aarch64/aarch64-protos.h: Declare.
> +	* config/aarch64/aarch64-simd.md: Matching expressions for
> +	frsqrte and frsqrts.
> +	* config/aarch64/aarch64-tuning-flags.def: Added
> +	MRECIP_DEFAULT_ENABLED.
> +	* config/aarch64/aarch64.c: New functions. Emit rsqrt
> +	estimation code in fast math mode.
> +	* config/aarch64/aarch64.md: Added enum entries.
> +	* config/aarch64/aarch64.opt: Added options -mrecip and
> +	-mlow-precision-recip-sqrt.
> +	* testsuite/gcc.target/aarch64/rsqrt-asm-check.c: Assembly scans
> +	for frsqrte and frsqrts.
> +	* testsuite/gcc.target/aarch64/rsqrt.c: Functional tests for rsqrt.
> +
> 2015-07-08  Jiong Wang  <jiong.wang@arm.com>
>
> 	* config/aarch64/aarch64.c (aarch64_unspec_may_trap_p): New function.
> diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
> index b6c89b9..b4f443c 100644
> --- a/gcc/config/aarch64/aarch64-builtins.c
> +++ b/gcc/config/aarch64/aarch64-builtins.c
> @@ -335,6 +335,11 @@ enum aarch64_builtins
>   AARCH64_BUILTIN_GET_FPSR,
>   AARCH64_BUILTIN_SET_FPSR,
>
> +  AARCH64_BUILTIN_RSQRT_DF,
> +  AARCH64_BUILTIN_RSQRT_SF,
> +  AARCH64_BUILTIN_RSQRT_V2DF,
> +  AARCH64_BUILTIN_RSQRT_V2SF,
> +  AARCH64_BUILTIN_RSQRT_V4SF,
>   AARCH64_SIMD_BUILTIN_BASE,
>   AARCH64_SIMD_BUILTIN_LANE_CHECK,
> #include "aarch64-simd-builtins.def"
> @@ -824,6 +829,42 @@ aarch64_init_crc32_builtins ()
> }
>
> void
> +aarch64_add_builtin_rsqrt (void)
> +{
> +  tree fndecl = NULL;
> +  tree ftype = NULL;
> +
> +  tree V2SF_type_node = build_vector_type (float_type_node, 2);
> +  tree V2DF_type_node = build_vector_type (double_type_node, 2);
> +  tree V4SF_type_node = build_vector_type (float_type_node, 4);
> +
> +  ftype = build_function_type_list (double_type_node, double_type_node, NULL_TREE);
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_df",
> +    ftype, AARCH64_BUILTIN_RSQRT_DF, BUILT_IN_MD, NULL, NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_DF] = fndecl;
> +
> +  ftype = build_function_type_list (float_type_node, float_type_node, NULL_TREE);
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_sf",
> +    ftype, AARCH64_BUILTIN_RSQRT_SF, BUILT_IN_MD, NULL, NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_SF] = fndecl;
> +
> +  ftype = build_function_type_list (V2DF_type_node, V2DF_type_node, NULL_TREE);
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_v2df",
> +    ftype, AARCH64_BUILTIN_RSQRT_V2DF, BUILT_IN_MD, NULL, NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2DF] = fndecl;
> +
> +  ftype = build_function_type_list (V2SF_type_node, V2SF_type_node, NULL_TREE);
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_v2sf",
> +    ftype, AARCH64_BUILTIN_RSQRT_V2SF, BUILT_IN_MD, NULL, NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2SF] = fndecl;
> +
> +  ftype = build_function_type_list (V4SF_type_node, V4SF_type_node, NULL_TREE);
> +  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt_v4sf",
> +    ftype, AARCH64_BUILTIN_RSQRT_V4SF, BUILT_IN_MD, NULL, NULL_TREE);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V4SF] = fndecl;
> +}
> +
> +void
> aarch64_init_builtins (void)
> {
>   tree ftype_set_fpr
> @@ -848,6 +889,7 @@ aarch64_init_builtins (void)
>     aarch64_init_simd_builtins ();
>   if (TARGET_CRC32)
>     aarch64_init_crc32_builtins ();
> +  aarch64_add_builtin_rsqrt ();
> }
>
> tree
> @@ -1092,6 +1134,39 @@ aarch64_crc32_expand_builtin (int fcode, tree exp, rtx target)
>   return target;
> }
>
> +static rtx
> +aarch64_expand_builtin_rsqrt (int fcode, tree exp, rtx target)
> +{
> +  rtx pat;
> +  tree arg0 = CALL_EXPR_ARG (exp, 0);
> +  rtx op0 = expand_normal (arg0);
> +
> +  enum insn_code c;
> +
> +  switch (fcode)
> +    {
> +      case AARCH64_BUILTIN_RSQRT_DF:
> +        c = CODE_FOR_rsqrt_df2; break;
> +      case AARCH64_BUILTIN_RSQRT_SF:
> +        c = CODE_FOR_rsqrt_sf2; break;
> +      case AARCH64_BUILTIN_RSQRT_V2DF:
> +        c = CODE_FOR_rsqrt_v2df2; break;
> +      case AARCH64_BUILTIN_RSQRT_V2SF:
> +        c = CODE_FOR_rsqrt_v2sf2; break;
> +      case AARCH64_BUILTIN_RSQRT_V4SF:
> +        c = CODE_FOR_rsqrt_v4sf2; break;
> +      default: gcc_unreachable ();
> +    }
> +
> +  if (!target)
> +    target = gen_reg_rtx (GET_MODE (op0));
> +
> +  pat = GEN_FCN (c) (target, op0);
> +  emit_insn (pat);
> +
> +  return target;
> +}
> +
> /* Expand an expression EXP that calls a built-in function,
>    with result going to TARGET if that's convenient.  */
> rtx
> @@ -1139,6 +1214,13 @@ aarch64_expand_builtin (tree exp,
>   else if (fcode >= AARCH64_CRC32_BUILTIN_BASE && fcode <= AARCH64_CRC32_BUILTIN_MAX)
>     return aarch64_crc32_expand_builtin (fcode, exp, target);
>
> +  if (fcode == AARCH64_BUILTIN_RSQRT_DF
> +      || fcode == AARCH64_BUILTIN_RSQRT_SF
> +      || fcode == AARCH64_BUILTIN_RSQRT_V2DF
> +      || fcode == AARCH64_BUILTIN_RSQRT_V2SF
> +      || fcode == AARCH64_BUILTIN_RSQRT_V4SF)
> +    return aarch64_expand_builtin_rsqrt (fcode, exp, target);
> +
>   gcc_unreachable ();
> }
>
> @@ -1296,6 +1378,28 @@ aarch64_builtin_vectorized_function (tree fndecl, tree type_out, tree type_in)
>   return NULL_TREE;
> }
>
> +tree
> +aarch64_builtin_rsqrt (unsigned int fn, bool md_fn)
> +{
> +  if (md_fn)
> +    {
> +      if (fn == AARCH64_SIMD_BUILTIN_UNOP_sqrtv2df)
> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2DF];
> +      if (fn == AARCH64_SIMD_BUILTIN_UNOP_sqrtv2sf)
> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V2SF];
> +      if (fn == AARCH64_SIMD_BUILTIN_UNOP_sqrtv4sf)
> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_V4SF];
> +    }
> +  else
> +    {
> +      if (fn == BUILT_IN_SQRT)
> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_DF];
> +      if (fn == BUILT_IN_SQRTF)
> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_SF];
> +        return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT_SF];
> +    }
> +  return NULL_TREE;
> +}
> +
> #undef VAR1
> #define VAR1(T, N, MAP, A) \
>   case AARCH64_SIMD_BUILTIN_##T##_##N##A:
> diff --git a/gcc/config/aarch64/aarch64-opts.h b/gcc/config/aarch64/aarch64-opts.h
> index 24bfd9f..75e9c67 100644
> --- a/gcc/config/aarch64/aarch64-opts.h
> +++ b/gcc/config/aarch64/aarch64-opts.h
> @@ -64,4 +64,11 @@ enum aarch64_code_model {
>   AARCH64_CMODEL_LARGE
> };
>
> +/* Each core can have -mrecip enabled or disabled by default. */
> +enum aarch64_mrecip {
> +  AARCH64_MRECIP_OFF = 0,
> +  AARCH64_MRECIP_ON,
> +  AARCH64_MRECIP_DEFAULT,
> +};
> +
> #endif
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 4062c27..8b8a389 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -321,6 +321,8 @@ void aarch64_print_operand (FILE *, rtx, char);
> void aarch64_print_operand_address (FILE *, rtx);
> void aarch64_emit_call_insn (rtx);
>
> +void aarch64_emit_swrsqrt (rtx, rtx);
> +
> /* Initialize builtins for SIMD intrinsics.  */
> void init_aarch64_simd_builtins (void);
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index b90f938..ae81731 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -353,6 +353,33 @@
>   [(set_attr "type" "neon_fp_mul_d_scalar_q")]
> )
>
> +(define_insn "rsqrte_<mode>2"
> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
> +	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
> +		     UNSPEC_RSQRTE))]
> +  "TARGET_SIMD"
> +  "frsqrte\\t%<v>0<Vmtype>, %<v>1<Vmtype>"
> +  [(set_attr "type" "neon_fp_rsqrte_<Vetype><q>")])
> +
> +(define_insn "rsqrts_<mode>3"
> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
> +	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
> +               (match_operand:VALLF 2 "register_operand" "w")]
> +		     UNSPEC_RSQRTS))]
> +  "TARGET_SIMD"
> +  "frsqrts\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
> +  [(set_attr "type" "neon_fp_rsqrts_<Vetype><q>")])
> +
> +(define_expand "rsqrt_<mode>2"
> +  [(set (match_operand:VALLF 0 "register_operand" "=w")
> +	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
> +		     UNSPEC_RSQRT))]
> +  "TARGET_SIMD"
> +{
> +  aarch64_emit_swrsqrt (operands[0], operands[1]);
> +  DONE;
> +})
> +
> (define_insn "*aarch64_mul3_elt_to_64v2df"
>   [(set (match_operand:DF 0 "register_operand" "=w")
>      (mult:DF
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
> index 01aaca8..97dbf00 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -31,4 +31,5 @@
>      flags.  */
>
> AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS, 0)
> +AARCH64_EXTRA_TUNING_OPTION ("mrecip_default_enabled", MRECIP_DEFAULT_ENABLED, 1)
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 6c13a078..76c0eee 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -364,7 +364,7 @@ static const struct tune_params generic_tunings =
>   1,	/* vec_reassoc_width.  */
>   2,	/* min_div_recip_mul_sf.  */
>   2,	/* min_div_recip_mul_df.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)	/* tune_flags.  */
> };
>
> static const struct tune_params cortexa53_tunings =
> @@ -386,7 +386,7 @@ static const struct tune_params cortexa53_tunings =
>   1,	/* vec_reassoc_width.  */
>   2,	/* min_div_recip_mul_sf.  */
>   2,	/* min_div_recip_mul_df.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)	/* tune_flags.  */
> };
>
> static const struct tune_params cortexa57_tunings =
> @@ -408,7 +408,8 @@ static const struct tune_params cortexa57_tunings =
>   1,	/* vec_reassoc_width.  */
>   2,	/* min_div_recip_mul_sf.  */
>   2,	/* min_div_recip_mul_df.  */
> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS	/* tune_flags.  */
> +   | AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)
> };
>
> static const struct tune_params cortexa72_tunings =
> @@ -430,7 +431,7 @@ static const struct tune_params cortexa72_tunings =
>   1,	/* vec_reassoc_width.  */
>   2,	/* min_div_recip_mul_sf.  */
>   2,	/* min_div_recip_mul_df.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)	/* tune_flags.  */
> };
>
> static const struct tune_params thunderx_tunings =
> @@ -472,7 +473,7 @@ static const struct tune_params xgene1_tunings =
>   1,	/* vec_reassoc_width.  */
>   2,	/* min_div_recip_mul_sf.  */
>   2,	/* min_div_recip_mul_df.  */
> -  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)	/* tune_flags.  */
> };
>
> /* Support for fine-grained override of the tuning structures.  */
> @@ -6961,6 +6962,98 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
>   return aarch64_tune_params.memmov_cost;
> }
>
> +extern tree aarch64_builtin_rsqrt (unsigned int fn, bool md_fn);
> +
> +static tree
> +aarch64_builtin_reciprocal (unsigned int fn,
> +                            bool md_fn,
> +                            bool)
> +{
> +  if (!flag_finite_math_only
> +      || flag_trapping_math
> +      || !flag_unsafe_math_optimizations
> +      || optimize_size
> +      || flag_mrecip == AARCH64_MRECIP_OFF
> +      || (flag_mrecip == AARCH64_MRECIP_DEFAULT
> +          && !(aarch64_tune_params.extra_tuning_flags
> +               & AARCH64_EXTRA_TUNE_MRECIP_DEFAULT_ENABLED)))
> +  {
> +    return NULL_TREE;
> +  }
> +
> +  return aarch64_builtin_rsqrt (fn, md_fn);
> +}
> +
> +typedef rtx (*rsqrte_type) (rtx, rtx);
> +
> +rsqrte_type get_rsqrte_type (enum machine_mode mode)
> +{
> +  switch (mode)
> +  {
> +    case DFmode:   return gen_rsqrte_df2;
> +    case SFmode:   return gen_rsqrte_sf2;
> +    case V2DFmode: return gen_rsqrte_v2df2;
> +    case V2SFmode: return gen_rsqrte_v2sf2;
> +    case V4SFmode: return gen_rsqrte_v4sf2;
> +    default: gcc_unreachable ();
> +  }
> +}
> +
> +typedef rtx (*rsqrts_type) (rtx, rtx, rtx);
> +
> +rsqrts_type get_rsqrts_type (enum machine_mode mode)
> +{
> +  switch (mode)
> +  {
> +    case DFmode:   return gen_rsqrts_df3;
> +    case SFmode:   return gen_rsqrts_sf3;
> +    case V2DFmode: return gen_rsqrts_v2df3;
> +    case V2SFmode: return gen_rsqrts_v2sf3;
> +    case V4SFmode: return gen_rsqrts_v4sf3;
> +    default: gcc_unreachable ();
> +  }
> +}
> +
> +void
> +aarch64_emit_swrsqrt (rtx dst, rtx src)
> +{
> +  enum machine_mode mode = GET_MODE (src);
> +  gcc_assert (
> +    mode == SFmode || mode == V2SFmode || mode == V4SFmode ||
> +    mode == DFmode || mode == V2DFmode);
> +
> +  rtx xsrc = gen_reg_rtx (mode);
> +  emit_move_insn (xsrc, src);
> +  rtx x0 = gen_reg_rtx (mode);
> +
> +  emit_insn ((*get_rsqrte_type (mode)) (x0, xsrc));
> +
> +  bool double_mode = (mode == DFmode || mode == V2DFmode);
> +
> +  int iterations = 2;
> +  if (double_mode)
> +    iterations = 3;
> +
> +  if (flag_mrecip_low_precision_sqrt)
> +    iterations--;
> +
> +  for (int i = 0; i < iterations; ++i)
> +    {
> +      rtx x1 = gen_reg_rtx (mode);
> +      rtx x2 = gen_reg_rtx (mode);
> +      rtx x3 = gen_reg_rtx (mode);
> +      emit_set_insn (x2, gen_rtx_MULT (mode, x0, x0));
> +
> +      emit_insn ((*get_rsqrts_type (mode)) (x3, xsrc, x2));
> +
> +      emit_set_insn (x1, gen_rtx_MULT (mode, x0, x3));
> +      x0 = x1;
> +    }
> +
> +  emit_move_insn (dst, x0);
> +  return;
> +}
> +
> /* Return the number of instructions that can be issued per cycle.  */
> static int
> aarch64_sched_issue_rate (void)
> @@ -12099,6 +12192,9 @@ aarch64_unspec_may_trap_p (const_rtx x, unsigned flags)
> #undef TARGET_USE_BLOCKS_FOR_CONSTANT_P
> #define TARGET_USE_BLOCKS_FOR_CONSTANT_P aarch64_use_blocks_for_constant_p
>
> +#undef TARGET_BUILTIN_RECIPROCAL
> +#define TARGET_BUILTIN_RECIPROCAL aarch64_builtin_reciprocal
> +
> #undef TARGET_VECTOR_MODE_SUPPORTED_P
> #define TARGET_VECTOR_MODE_SUPPORTED_P aarch64_vector_mode_supported_p
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 1e343fa..d7944b2 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -122,6 +122,9 @@
>     UNSPEC_VSTRUCTDUMMY
>     UNSPEC_SP_SET
>     UNSPEC_SP_TEST
> +    UNSPEC_RSQRT
> +    UNSPEC_RSQRTE
> +    UNSPEC_RSQRTS
> ])
>
> (define_c_enum "unspecv" [
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 98ef9f6..7921b85 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -124,3 +124,11 @@ Enum(aarch64_abi) String(ilp32) Value(AARCH64_ABI_ILP32)
>
> EnumValue
> Enum(aarch64_abi) String(lp64) Value(AARCH64_ABI_LP64)
> +
> +mrecip
> +Common Report Var(flag_mrecip) Optimization Init(AARCH64_MRECIP_DEFAULT)
> +Generate software reciprocal square root for better throughput.
> +
> +mlow-precision-recip-sqrt
> +Common Var(flag_mrecip_low_precision_sqrt) Optimization
> +Run fewer approximation steps to reduce latency at the cost of precision.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index b28e5d6..bd922a3 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -515,6 +515,8 @@ Objective-C and Objective-C++ Dialects}.
> -mtls-dialect=desc  -mtls-dialect=traditional @gol
> -mfix-cortex-a53-835769  -mno-fix-cortex-a53-835769 @gol
> -mfix-cortex-a53-843419  -mno-fix-cortex-a53-843419 @gol
> +-mrecip -mno-recip @gol
> +-mlow-precision-recip-sqrt -mno-low-precision-recip-sqrt @gol
> -march=@var{name}  -mcpu=@var{name}  -mtune=@var{name}}
>
> @emph{Adapteva Epiphany Options}
> @@ -12426,6 +12428,23 @@ Enable or disable the workaround for the ARM Cortex-A53 erratum number 843419.
> This erratum workaround is made at link time and this will only pass the
> corresponding flag to the linker.
>
> +@item -mrecip
> +@item -mno-recip
> +@opindex mrecip
> +@opindex mno-recip
> +This option enables use of the
> +reciprocal square root estimate instructions with additional
> +Newton-Raphson steps to increase precision instead of doing a square root and
> +divide for floating-point arguments.
> +It can only be used together with @option{-ffast-math}.
> +
> +@item -mlow-precision-recip-sqrt
> +@item -mno-low-precision-recip-sqrt
> +@opindex mlow-precision-recip-sqrt
> +@opindex mno-low-precision-recip-sqrt
> +The reciprocal square root estimate uses two steps instead of three for double-precision,
> +and one step instead of two for single-precision, thus reducing both latency and precision.
> +
> @item -march=@var{name}
> @opindex march
> Specify the name of the target architecture, optionally suffixed by one or
> diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c b/gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c
> new file mode 100644
> index 0000000..d6cfe11
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/rsqrt-asm-check.c
> @@ -0,0 +1,63 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --save-temps -fverbose-asm -ffast-math -mrecip" } */
> +
> +#include <math.h>
> +
> +#define sqrt_float   __builtin_sqrtf
> +#define sqrt_double  __builtin_sqrt
> +
> +#define TESTTYPE(TYPE)                                          \
> +typedef struct {                                                \
> +  TYPE a;                                                       \
> +  TYPE b;                                                       \
> +  TYPE c;                                                       \
> +  TYPE d;                                                       \
> +} s4_##TYPE;                                                    \
> +                                                                \
> +typedef struct {                                                \
> +  TYPE a;                                                       \
> +  TYPE b;                                                       \
> +} s2_##TYPE;                                                    \
> +                                                                \
> +s4_##TYPE rsqrtv4_##TYPE (s4_##TYPE i)                          \
> +{                                                               \
> +  s4_##TYPE o;                                                  \
> +  o.a = 1.0 / sqrt_##TYPE (i.a);                                \
> +  o.b = 1.0 / sqrt_##TYPE (i.b);                                \
> +  o.c = 1.0 / sqrt_##TYPE (i.c);                                \
> +  o.d = 1.0 / sqrt_##TYPE (i.d);                                \
> +  return o;                                                     \
> +}                                                               \
> +                                                                \
> +s2_##TYPE rsqrtv2_##TYPE (s2_##TYPE i)                          \
> +{                                                               \
> +  s2_##TYPE o;                                                  \
> +  o.a = 1.0 / sqrt_##TYPE (i.a);                                \
> +  o.b = 1.0 / sqrt_##TYPE (i.b);                                \
> +  return o;                                                     \
> +}                                                               \
> +                                                                \
> +TYPE rsqrt_##TYPE (TYPE i)                                      \
> +{                                                               \
> +  return 1.0 / sqrt_##TYPE (i);                                 \
> +}                                                               \
> +                                                                \
> +
> +TESTTYPE(double)
> +TESTTYPE(float)
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\td\[0-9\]+, d\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\td\[0-9\]+, d\[0-9\]+, d\[0-9\]+" 3 } } */
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\tv\[0-9\]+.2d, v\[0-9\]+.2d" 3 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\tv\[0-9\]+.2d, v\[0-9\]+.2d, v\[0-9\]+.2d" 9 } } */
> +
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\ts\[0-9\]+, s\[0-9\]+" 1 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\ts\[0-9\]+, s\[0-9\]+, s\[0-9\]+" 2 } } */
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\tv\[0-9\]+.4s, v\[0-9\]+.4s" 1 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\tv\[0-9\]+.4s, v\[0-9\]+.4s, v\[0-9\]+.4s" 2 } } */
> +
> +/* { dg-final { scan-assembler-times "frsqrte\\tv\[0-9\]+.2s, v\[0-9\]+.2s" 1 } } */
> +/* { dg-final { scan-assembler-times "frsqrts\\tv\[0-9\]+.2s, v\[0-9\]+.2s, v\[0-9\]+.2s" 2 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt.c b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
> new file mode 100644
> index 0000000..4a5c008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
> @@ -0,0 +1,107 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 --save-temps -fverbose-asm -ffast-math -mrecip" } */
> +
> +#include <math.h>
> +#include <stdio.h>
> +
> +#include <float.h>
> +
> +#define PI    3.141592653589793
> +#define SQRT2 1.4142135623730951
> +
> +#define PI_4 0.7853981633974483
> +#define SQRT1_2 0.7071067811865475
> +
> +/* 2^25+1, float has 24 significand bits
> + *       according to Single-precision floating-point format.  */
> +#define TESTA8_FLT 33554433
> +/* 2^54+1, double has 53 significand bits
> + *       according to Double-precision floating-point format.  */
> +#define TESTA8_DBL 18014398509481985
> +
> +#define SD(a, b) t_double ((#a), (a), (b));
> +#define SF(a, b) t_float ((#a), (a), (b));
> +
> +#define EPSILON_double __DBL_EPSILON__
> +#define EPSILON_float __FLT_EPSILON__
> +#define ABS_double __builtin_fabs
> +#define ABS_float __builtin_fabsf
> +#define SQRT_double __builtin_sqrt
> +#define SQRT_float __builtin_sqrtf
> +
> +extern void abort (void);
> +
> +#define TESTTYPE(TYPE)                                                     \
> +TYPE rsqrt_##TYPE (TYPE a)                                                 \
> +{                                                                          \
> +  return 1.0/SQRT_##TYPE(a);                                               \
> +}                                                                          \
> +                                                                           \
> +int equals_##TYPE (TYPE a, TYPE b)                                         \
> +{                                                                          \
> +  return (a == b ||                                                        \
> +          (isnan (a) && isnan (b)) ||                                      \
> +          (ABS_##TYPE (a - b) < EPSILON_##TYPE));                          \
> +}                                                                          \
> +                                                                           \
> +void t_##TYPE (const char *s, TYPE a, TYPE result)                         \
> +{                                                                          \
> +  TYPE r = rsqrt_##TYPE (a);                                               \
> +  if (!equals_##TYPE (r, result))                                          \
> +  {                                                                        \
> +    abort ();                                                              \
> +  }                                                                        \
> +}                                                                          \
> +
> +//  printf ("Problem in %20s: %30.18A should be %30.18A\n", s, r, result); \
> +
> +TESTTYPE(double)
> +TESTTYPE(float)
> +
> +int main ()
> +{
> +  SD(    1.0/256, 0X1.00000000000000P+4  );
> +  SD(        1.0, 0X1.00000000000000P+0  );
> +  SD(       -1.0,                     NAN);
> +  SD(       11.0, 0X1.34BF63D1568260P-2  );
> +  SD(        0.0,                INFINITY);
> +  SD(   INFINITY, 0X0.00000000000000P+0  );
> +  SD(        NAN,                     NAN);
> +  SD(       -NAN,                    -NAN);
> +  SD(    DBL_MAX, 0X1.00000000000010P-512);
> +  SD(    DBL_MIN, 0X1.00000000000000P+511);
> +  SD(         PI, 0X1.20DD750429B6D0P-1  );
> +  SD(       PI_4, 0X1.20DD750429B6D0P+0  );
> +  SD(      SQRT2, 0X1.AE89F995AD3AE0P-1  );
> +  SD(    SQRT1_2, 0X1.306FE0A31B7150P+0  );
> +  SD(        -PI,                     NAN);
> +  SD(     -SQRT2,                     NAN);
> +  SD( TESTA8_DBL, 0X1.00000000000000P-27 );
> +
> +  SF(    1.0/256, 0X1.00000000000000P+4  );
> +  SF(        1.0, 0X1.00000000000000P+0  );
> +  SF(       -1.0,                     NAN);
> +  SF(       11.0, 0X1.34BF6400000000P-2  );
> +  SF(        0.0,                INFINITY);
> +  SF(   INFINITY, 0X0.00000000000000P+0  );
> +  SF(        NAN,                     NAN);
> +  SF(       -NAN,                    -NAN);
> +  SF(    FLT_MAX, 0X1.00000200000000P-64 );
> +  SF(    FLT_MIN, 0X1.00000000000000P+63 );
> +  SF(         PI, 0X1.20DD7400000000P-1  );
> +  SF(       PI_4, 0X1.20DD7400000000P+0  );
> +  SF(      SQRT2, 0X1.AE89FA00000000P-1  );
> +  SF(    SQRT1_2, 0X1.306FE000000000P+0  );
> +  SF(        -PI,                     NAN);
> +  SF(     -SQRT2,                     NAN);
> +  SF( TESTA8_FLT, 0X1.6A09E600000000P-13 );
> +
> +//   With -ffast-math these return positive INF.
> +//   SD(       -0.0,               -INFINITY);
> +//   SF(       -0.0,               -INFINITY);
> +//   The reason here is that -ffast-math flushes to zero.
> +//   SD(DBL_MIN/256, 0X1.00000000000000P+515);
> +//   SF(FLT_MIN/256, 0X1.00000000000000P+67 );
> +
> +  return 0;
> +}
> --
> 1.9.1
>
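
For anyone skimming the quoted patch: the loop in aarch64_emit_swrsqrt
refines the FRSQRTE estimate with the Newton-Raphson update
x' = x * (3 - d * x * x) / 2, where FRSQRTS supplies the (3 - a * b) / 2
term. A minimal C model of that arithmetic follows. It is only a sketch and
not part of the patch: the names rsqrt_step and refine_rsqrt are
hypothetical, and the initial estimate x0 is passed in by the caller as a
stand-in for the hardware estimate.

```c
#include <assert.h>

/* Models the FRSQRTS step term: (3 - a * b) / 2.  */
double
rsqrt_step (double a, double b)
{
  return (3.0 - a * b) / 2.0;
}

/* Refines an initial estimate X0 of 1/sqrt(D) with STEPS Newton-Raphson
   iterations, in the same order as the patch: square the current
   estimate, form the step term from the source and the square, then
   multiply the estimate by the step term.  X0 is an assumption of this
   sketch; in the patch it comes from FRSQRTE.  */
double
refine_rsqrt (double d, double x0, int steps)
{
  assert (d > 0.0 && steps >= 0);

  double x = x0;
  for (int i = 0; i < steps; ++i)
    {
      double x2 = x * x;               /* x2 = x0 * x0      */
      double x3 = rsqrt_step (d, x2);  /* x3 = (3 - d*x2)/2 */
      x = x * x3;                      /* x1 = x0 * x3      */
    }
  return x;
}
```

For example, refine_rsqrt (4.0, 0.498, 2) lands within about 1e-9 of 0.5.
Each step roughly doubles the number of correct bits, so assuming an
initial estimate good to roughly 8 bits, this is consistent with the
patch's choice of two steps for single and three for double precision.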


--Apple-Mail=_8138A134-D688-4E16-8309-9AB97BD020ED
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename=signature.asc
Content-Type: application/pgp-signature;
	name=signature.asc
Content-Description: Message signed with OpenPGP using GPGMail
Content-length: 496

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - https://gpgtools.org

iQEcBAEBCgAGBQJV3fAqAAoJEPg2VlzibIH6lZgH/ii/J9oVSP2/iEKmkaIuIVKN
1Ghxr9khG0Y/QH7TXku3Mfv0siG8dA2Uk/1WUuyTDAC1Vu+9RAGej41bFc9euge/
4pAuTWO/2mFY74ztfHOE0vBUbQTXWv27iBHtD6WZzgCJxVjbJsYzVRsP+Z0+IIJJ
AiIuWQEz93EkGkO8ryMXGSVFqjwNeN1kOL+YEFO8h5g7zdK5CWEXVhw6v6STS6sw
N7Y0OaO/QYm14WXHj+MVXU32I/zrEnTuqkdqqqa4TLaYKGas26Zj2z2UA4uf7ciH
cbixq8AOOqRn9UXR+YLsHydz43xcc3rpdILM52yDdpyeeVyCp5o0Fcd8tcaE01I=
=Ps5N
-----END PGP SIGNATURE-----

--Apple-Mail=_8138A134-D688-4E16-8309-9AB97BD020ED--