Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible

public inbox for gcc-patches@gcc.gnu.org

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
@ 2016-09-13 12:16 Wilco Dijkstra
  2016-09-13 16:10 ` Joseph Myers
  2016-09-21 14:51 ` Richard Earnshaw (lists)
  0 siblings, 2 replies; 32+ messages in thread
From: Wilco Dijkstra @ 2016-09-13 12:16 UTC (permalink / raw)
  To: Jakub Jelinek, Tamar Christina; +Cc: GCC Patches, rguenther, Jeff Law, nd

Jakub wrote:
> On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
> > This patch adds an optimized route to the fpclassify builtin
> > for floating point numbers which are similar to IEEE-754 in format.
> > 
> > The goal is to make it faster by:
> > 1. Trying to determine the most common case first
> >    (e.g. the float is a Normal number) and then the
> >    rest. The amount of code generated at -O2 are
> >    about the same +/- 1 instruction, but the code
> >    is much better.
> > 2. Using integer operation in the optimized path.
> 
> Is it generally preferable to use integer operations for this instead
> of floating point operations?  I mean various targets have quite high costs
> of moving data in between the general purpose and floating point register
> file, often it has to go through memory etc.

It is generally preferable indeed - there was a *very* long discussion about integer
vs FP on the GLIBC mailing list when I updated math.h to use the GCC builtins a
while back (the GLIBC implementation used a non-inlined unoptimized integer
implementation, so an inlined FP implementation seemed a good intermediate solution).

Integer operations are generally lower latency and enable bit manipulation tricks like the
fast early exit. The FP version requires execution of 5 branches for a "normal" FP value
and loads several floating point immediates. There are also many targets with emulated
floating point types, so 5 calls to the comparison lib function would be seriously slow.
Note that using so many FP comparisons is not just slow: they also aren't correct for
signalling NaNs, so this patch also fixes bug 66462 for fpclassify.
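
For comparison, the floating-point comparison chain described in the patch's
own comment can be sketched in plain C roughly as follows. This is only an
illustration, not the code GCC generates: classify_double_fp is a made-up
name, and a conditional negate stands in for fabs so that the sketch has no
libm dependency.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

static_assert (FLT_RADIX == 2, "binary floating point assumed");

/* Hypothetical helper: classify X using floating-point comparisons only,
   in the order isnan, infinity, normal, zero, subnormal.  */
int
classify_double_fp (double x)
{
  if (isnan (x))
    return FP_NAN;

  /* Stand-in for fabs (x); -0.0 stays -0.0, which still compares equal
     to zero below.  */
  double ax = x < 0 ? -x : x;

  if (ax == INFINITY)
    return FP_INFINITE;
  if (ax >= DBL_MIN)
    return FP_NORMAL;
  return x == 0 ? FP_ZERO : FP_SUBNORMAL;
}
```

Every value passes through at least the NaN test and the infinity comparison
before it can be reported as normal, and each step is a floating-point
comparison rather than a bit test.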

I would suggest someone with access to a machine with slow FP moves (POWER?)
to benchmark this using the fpclassify test (glibc/benchtests/bench-math-inlines.c)
so we know for sure.

Wilco

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13 12:16 [PATCH] Optimise the fpclassify builtin to perform integer operations when possible Wilco Dijkstra
@ 2016-09-13 16:10 ` Joseph Myers
  2016-09-21 14:51 ` Richard Earnshaw (lists)
  1 sibling, 0 replies; 32+ messages in thread
From: Joseph Myers @ 2016-09-13 16:10 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Jakub Jelinek, Tamar Christina, GCC Patches, rguenther, Jeff Law, nd

On Tue, 13 Sep 2016, Wilco Dijkstra wrote:

> I would suggest someone with access to a machine with slow FP moves (POWER?)
> to benchmark this using the fpclassify test (glibc/benchtests/bench-math-inlines.c)
> so we know for sure.

And if for some operations on some architectures the floating-point 
version is faster, that just means we need a hook to choose between them 
(in the default -fno-signaling-nans case, since -fsignaling-nans should 
always use the integer version).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13 12:16 [PATCH] Optimise the fpclassify builtin to perform integer operations when possible Wilco Dijkstra
  2016-09-13 16:10 ` Joseph Myers
@ 2016-09-21 14:51 ` Richard Earnshaw (lists)
  1 sibling, 0 replies; 32+ messages in thread
From: Richard Earnshaw (lists) @ 2016-09-21 14:51 UTC (permalink / raw)
  To: Wilco Dijkstra, Jakub Jelinek, Tamar Christina
  Cc: GCC Patches, rguenther, Jeff Law, nd

On 13/09/16 12:35, Wilco Dijkstra wrote:
> Jakub wrote:
>> On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
>>> This patch adds an optimized route to the fpclassify builtin
>>> for floating point numbers which are similar to IEEE-754 in format.
>>>
>>> The goal is to make it faster by:
>>> 1. Trying to determine the most common case first
>>>    (e.g. the float is a Normal number) and then the
>>>    rest. The amount of code generated at -O2 are
>>>    about the same +/- 1 instruction, but the code
>>>    is much better.
>>> 2. Using integer operation in the optimized path.
>>
>> Is it generally preferable to use integer operations for this instead
>> of floating point operations?  I mean various targets have quite high costs
>> of moving data in between the general purpose and floating point register
>> file, often it has to go through memory etc.
> 
> It is generally preferable indeed - there was a *very* long discussion about integer
> vs FP on the GLIBC mailing list when I updated math.h to use the GCC builtins a
> while back (the GLIBC implementation used a non-inlined unoptimized integer
> implementation, so an inlined FP implementation seemed a good intermediate solution).
> 
> Integer operations are generally lower latency and enable bit manipulation tricks like the
> fast early exit. The FP version requires execution of 5 branches for a "normal" FP value
> and loads several floating point immediates. There are also many targets with emulated
> floating point types, so 5 calls to the comparison lib function would be seriously slow.
> Note using so many FP comparisons is not just slow but they aren't correct for signalling
> NaNs, so this patch also fixes bug 66462 for fpclassify.

And don't forget that getting the results of a floating-point comparison
back to the branch unit may be no faster than transferring the value in
the first place.

R.

> 
> I would suggest someone with access to a machine with slow FP moves (POWER?)
> to benchmark this using the fpclassify test (glibc/benchtests/bench-math-inlines.c)
> so we know for sure.
> 
> Wilco
> 


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
@ 2016-09-12 17:24 Moritz Klammler
  2016-09-12 20:08 ` Andrew Pinski
  0 siblings, 1 reply; 32+ messages in thread
From: Moritz Klammler @ 2016-09-12 17:24 UTC (permalink / raw)
  To: gcc-patches; +Cc: Tamar Christina




Tamar Christina <Tamar.Christina@arm.com> writes:

> Hi All,
>
> This patch adds an optimized route to the fpclassify builtin
> for floating point numbers which are similar to IEEE-754 in format.
>
> [...]

I might be the least competent person on this list to review this patch
but nevertheless read it out of interest and stumbled over a comment
that I believe could be improved for clarity.

    diff --git a/gcc/real.h b/gcc/real.h
    index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
    --- a/gcc/real.h
    +++ b/gcc/real.h
    @@ -161,6 +161,15 @@ struct real_format
       bool has_signed_zero;
       bool qnan_msb_set;
       bool canonical_nan_lsbs_set;
    +
    +  /* This flag indicates whether the format can be used in the optimized
    +     code paths for the __builtin_fpclassify function and friends.
    +     The format has to have the same NaN and INF representation as normal
    +     IEEE floats (e.g. exp must have all bits set), most significant bit must be
    +     sign bit, followed by exp bits of at most 32 bits.  Lastly the floating
    +     point number must be representable as an integer.  The base of the number
    +     also must be base 2.  */
    +  bool is_binary_ieee_compatible;
       const char *name;
     };

My first issue is that

> The format has to have the same NaN and INF representation as normal
> IEEE floats

is kind of an oxymoron because NaNs and INFs are not "normal" IEEE
floats.

Second,

> the floating point number must be representable as an integer

is also somewhat misleading because it could be interpreted in the
(obviously nonsensical) way that the floating-point *values* have to be
integral.  (I think it should be possible to *interpret* not *represent*
them as integers.)

So I would like to suggest the following rewording.

> This flag indicates whether the format is suitable for the optimized
> code paths for the __builtin_fpclassify function and friends.  For
> this, the format must be a base 2 representation with the sign bit as
> the most-significant bit followed by (exp <= 32) exponent bits
> followed by the mantissa bits.  It must be possible to interpret the
> bits of the floating-point representation as an integer.  NaNs and
> INFs must be represented by the same schema used by IEEE 754.  (NaNs
> must be represented by an exponent with all bits 1, any mantissa
> except all bits 0 and any sign bit.  +INF and -INF must be represented
> by an exponent with all bits 1, a mantissa with all bits 0 and a sign
> bit of 0 and 1 respectively.)
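
To make the intended bit patterns concrete, here is a minimal sketch for the
32-bit IEEE 754 binary format. The helper names are invented for illustration
and do not appear in the patch; the masks assume 1 sign bit, 8 exponent bits
and 23 mantissa bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (float) == sizeof (uint32_t), "float must be 32 bits");

#define EXP_MASK  UINT32_C (0x7f800000)	/* 8 exponent bits.  */
#define MANT_MASK UINT32_C (0x007fffff)	/* 23 mantissa bits.  */

/* Hypothetical helper: interpret (not convert) the bits of X as an
   integer.  */
uint32_t
float_bits (float x)
{
  uint32_t bits;
  memcpy (&bits, &x, sizeof bits);
  return bits;
}

/* NaN: exponent all ones, any nonzero mantissa, any sign.  */
bool
bits_is_nan (uint32_t bits)
{
  return (bits & EXP_MASK) == EXP_MASK && (bits & MANT_MASK) != 0;
}

/* +INF / -INF: exponent all ones, mantissa all zeros, sign 0 or 1.  */
bool
bits_is_inf (uint32_t bits)
{
  return (bits & EXP_MASK) == EXP_MASK && (bits & MANT_MASK) == 0;
}
```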

I hope this is clearer and still matches what the comment was supposed
to say.
-- 
OpenPGP:

Public Key:   http://openpgp.klammler.eu
Fingerprint:  2732 DA32 C8D0 EEEC A081  BE9D CF6C 5166 F393 A9C0


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 17:24 Moritz Klammler
@ 2016-09-12 20:08 ` Andrew Pinski
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Pinski @ 2016-09-12 20:08 UTC (permalink / raw)
  To: Moritz Klammler; +Cc: GCC Patches, Tamar Christina

On Mon, Sep 12, 2016 at 6:21 PM, Moritz Klammler <moritz@klammler.eu> wrote:
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
>
>> Hi All,
>>
>> This patch adds an optimized route to the fpclassify builtin
>> for floating point numbers which are similar to IEEE-754 in format.
>>
>> [...]
>
> I might be the least competent person on this list to review this patch
> but nevertheless read it out of interest and stumbled over a comment
> that I believe could be improved for clarity.
>
>     diff --git a/gcc/real.h b/gcc/real.h
>     index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
>     --- a/gcc/real.h
>     +++ b/gcc/real.h
>     @@ -161,6 +161,15 @@ struct real_format
>        bool has_signed_zero;
>        bool qnan_msb_set;
>        bool canonical_nan_lsbs_set;
>     +
>     +  /* This flag indicates whether the format can be used in the optimized
>     +     code paths for the __builtin_fpclassify function and friends.
>     +     The format has to have the same NaN and INF representation as normal
>     +     IEEE floats (e.g. exp must have all bits set), most significant bit must be
>     +     sign bit, followed by exp bits of at most 32 bits.  Lastly the floating
>     +     point number must be representable as an integer.  The base of the number
>     +     also must be base 2.  */
>     +  bool is_binary_ieee_compatible;
>        const char *name;
>      };
>
> My first issue is that
>
>> The format has to have the same NaN and INF representation as normal
>> IEEE floats
>
> is kind of an oxymoron because NaNs and INFs are not "normal" IEEE
> floats.

Let me clarify what was originally meant: some float formats use the
same layout as IEEE but don't support INF or NaNs (SPUv1 float, for
example; v2 supports both, though).

Thanks,
Andrew.


>
> Second,
>
>> the floating point number must be representable as an integer
>
> is also somewhat misleading because it could be interpreted in the
> (obviously nonsensical) way that the floating-point *values* have to be
> integral.  (I think it should be possible to *interpret* not *represent*
> them as integers.)
>
> So I would like to suggest the following rewording.
>
>> This flag indicates whether the format is suitable for the optimized
>> code paths for the __builtin_fpclassify function and friends.  For
>> this, the format must be a base 2 representation with the sign bit as
>> the most-significant bit followed by (exp <= 32) exponent bits
>> followed by the mantissa bits.  It must be possible to interpret the
>> bits of the floating-point representation as an integer.  NaNs and
>> INFs must be represented by the same schema used by IEEE 754.  (NaNs
>> must be represented by an exponent with all bits 1, any mantissa
>> except all bits 0 and any sign bit.  +INF and -INF must be represented
>> by an exponent with all bits 1, a mantissa with all bits 0 and a sign
>> bit of 0 and 1 respectively.)
>
> I Hope this is clearer and still matches what the comment was supposed
> to say.
> --
> OpenPGP:
>
> Public Key:   http://openpgp.klammler.eu
> Fingerprint:  2732 DA32 C8D0 EEEC A081  BE9D CF6C 5166 F393 A9C0


* [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
@ 2016-09-12 16:21 Tamar Christina
  2016-09-12 22:33 ` Joseph Myers
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Tamar Christina @ 2016-09-12 16:21 UTC (permalink / raw)
  To: GCC Patches, jakub, rguenther, law; +Cc: nd


Hi All,

This patch adds an optimized route to the fpclassify builtin
for floating point numbers which are similar to IEEE-754 in format.

The goal is to make it faster by:
1. Trying to determine the most common case first
   (e.g. the float is a Normal number) and then the
   rest. The amount of code generated at -O2 is
   about the same (+/- 1 instruction), but the code
   is much better.
2. Using integer operations in the optimized path.

At a high level, the optimized path uses integer operations
to perform the following:

  if (exponent bits aren't all set or unset)
     return Normal;
  else if (no bits are set in the number after masking out
	   the sign bit)
     return Zero;
  else if (exponent has no bits set)
     return Subnormal;
  else if (mantissa has no bits set)
     return Infinite;
  else
     return NaN;
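
As a rough illustration of these steps, the same order of checks can be
written in plain C for a 64-bit IEEE double. This is a sketch under stated
assumptions, not the tree code the patch builds: classify_double_int is a
made-up name, and the 11-bit exponent and 52-bit mantissa widths are
hard-coded rather than taken from real_format.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (double) == sizeof (uint64_t),
	       "double must be 64 bits");

/* Hypothetical helper: classify X using integer operations only.  */
int
classify_double_int (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);	/* Reinterpret, do not convert.  */

  const uint32_t exp_mask = 0x7ff;	/* 11 exponent bits.  */
  uint32_t exp = (uint32_t) (bits >> 52) & exp_mask;

  /* (exp + 1) & (exp_mask & ~1) is zero only when exp is all zeros or
     all ones, so a nonzero result means a normal number.  */
  if ((exp + 1) & (exp_mask & ~1u))
    return FP_NORMAL;

  /* Nothing set once the sign bit is masked out: +0.0 or -0.0.  */
  if ((bits & ~(UINT64_C (1) << 63)) == 0)
    return FP_ZERO;

  /* exp is now all zeros or all ones, so one bit tells them apart.  */
  if ((exp & 1) == 0)
    return FP_SUBNORMAL;

  /* Exponent all ones: a zero mantissa means infinity, otherwise NaN.  */
  if ((bits & ((UINT64_C (1) << 52) - 1)) == 0)
    return FP_INFINITE;
  return FP_NAN;
}
```

A normal number leaves after the first test, which is the early exit the
ordering is meant to provide.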

In case the optimization can't be applied the old
implementation is used as a fall-back.

A limitation of this new approach is that the exponent
of the floating-point type has to fit in 31 bits, and the type
has to have an IEEE-like format and IEEE-like encodings for NaN and INF
(e.g. for NaN and INF all bits of the exp must be set).

To determine this IEEE likeness a new boolean was added to real_format.

Regression tested on aarch64-none-linux and arm-none-linux-gnueabi
with no regressions. x86 uses its own implementation rather than
the fpclassify builtin.

As an example, AArch64 now generates the following for classification of doubles:

f:
	fmov	x1, d0
	mov	w0, 7
	sbfx	x2, x1, 52, 11
	add	w3, w2, 1
	tst	w3, 0x07FE
	bne	.L1
	mov	w0, 13
	tst	x1, 0x7fffffffffffffff
	beq	.L1
	mov	w0, 11
	tbz	x2, 0, .L1
	tst	x1, 0xfffffffffffff
	mov	w0, 3
	mov	w1, 5
	csel	w0, w0, w1, ne

.L1:
	ret

No new tests were added, as existing tests already cover the functionality.
The glibc benchmarks were run against the builtin and show a 31.3%
performance gain.

Ok for trunk?

Thanks,
Tamar

PS. I don't have commit rights, so if OK can someone apply the patch for me?

gcc/
2016-08-25  Tamar Christina  <tamar.christina@arm.com>
	    Wilco Dijkstra  <wilco.dijkstra@arm.com>

	* gcc/builtins.c (fold_builtin_fpclassify): Added optimized version.
	* gcc/real.h (real_format): Added is_binary_ieee_compatible field.
	* gcc/real.c (ieee_single_format): Set is_binary_ieee_compatible flag.
	(mips_single_format): Likewise.
	(motorola_single_format): Likewise.
	(spu_single_format): Likewise.
	(ieee_double_format): Likewise.
	(mips_double_format): Likewise.
	(motorola_double_format): Likewise.
	(ieee_extended_motorola_format): Likewise.
	(ieee_extended_intel_128_format): Likewise.
	(ieee_extended_intel_96_round_53_format): Likewise.
	(ibm_extended_format): Likewise.
	(mips_extended_format): Likewise.
	(ieee_quad_format): Likewise.
	(mips_quad_format): Likewise.
	(vax_f_format): Likewise.
	(vax_d_format): Likewise.
	(vax_g_format): Likewise.
	(decimal_single_format): Likewise.
	(decimal_quad_format): Likewise.
	(ieee_half_format): Likewise.
	(arm_half_format): Likewise.
	(real_internal_format): Likewise.

[-- Attachment #2: gcc-public.patch --]
[-- Type: text/x-patch; name=gcc-public.patch, Size: 11013 bytes --]

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 1073e35b17b1bc1f6974c71c940bd9d82bbbfc0f..58bf129f9a0228659fd3b976d38d021d1d5bd6bb 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -7947,10 +7947,8 @@ static tree
 fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
 {
   tree fp_nan, fp_infinite, fp_normal, fp_subnormal, fp_zero,
-    arg, type, res, tmp;
+    arg, type, res;
   machine_mode mode;
-  REAL_VALUE_TYPE r;
-  char buf[128];
 
   /* Verify the required arguments in the original call.  */
   if (nargs != 6
@@ -7970,14 +7968,143 @@ fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
   arg = args[5];
   type = TREE_TYPE (arg);
   mode = TYPE_MODE (type);
-  arg = builtin_save_expr (fold_build1_loc (loc, ABS_EXPR, type, arg));
+  const real_format *format = REAL_MODE_FORMAT (mode);
+
+  /*
+  For IEEE 754 types:
+
+  fpclassify (x) ->
+       !((exp + 1) & (exp_mask & ~1)) // exponent bits all set or all unset
+	 ? ((x & ~sign_mask) == 0 ? FP_ZERO :
+	   ((exp & exp_mask) == exp_mask
+	      ? (mantissa == 0 ? FP_INFINITE : FP_NAN) :
+	      FP_SUBNORMAL)):
+       FP_NORMAL.
+
+  Otherwise
+
+  fpclassify (x) ->
+       isnan (x) ? FP_NAN :
+	(fabs (x) == Inf ? FP_INFINITE :
+	   (fabs (x) >= DBL_MIN ? FP_NORMAL :
+	     (x == 0 ? FP_ZERO : FP_SUBNORMAL))).
+  */
+
+  /* Check if the number that is being classified is close enough to IEEE 754
+     format to be able to go in the early exit code.  */
+  if (format->is_binary_ieee_compatible)
+    {
+      gcc_assert (format->b == 2);
+
+      const tree int_type = integer_type_node;
+      const int exp_bits  = (GET_MODE_SIZE (mode) * BITS_PER_UNIT) - format->p;
+      const int exp_mask  = (1 << exp_bits) - 1;
+
+      tree exp, specials, exp_bitfield,
+	   const_arg0, const_arg1, const0, const1,
+	   not_sign_mask, zero_check, mantissa_mask,
+	   mantissa_any_set, exp_lsb_set, mask_check;
+      tree int_arg_type, int_arg;
+
+      /* Re-interpret the float as an unsigned integer type
+	 with equal precision.  */
+      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
+      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
+		  fold_build1_loc (loc, NOP_EXPR,
+				   build_pointer_type (int_arg_type),
+		    fold_build1_loc (loc, ADDR_EXPR,
+				     build_pointer_type (type), arg)));
+
+      /* Extract exp bits from the float, where we expect the exponent to be.
+	 We create a new type because BIT_FIELD_REF does not allow you to
+	 extract less bits than the precision of the storage variable.  */
+      exp_bitfield = fold_build3_loc (loc, BIT_FIELD_REF,
+			build_nonstandard_integer_type (exp_bits, 0), int_arg,
+			build_int_cst (int_type, exp_bits),
+			build_int_cst (int_type, format->p - 1));
+
+      /* Re-interpret the extracted exponent bits as a 32 bit int.
+	 This allows us to continue doing operations as int_type.  */
+      exp = fold_build1_loc (loc, NOP_EXPR, int_type, exp_bitfield);
+
+      /* Set up some often used constants.  */
+      const_arg0 = build_int_cst (int_arg_type, 0);
+      const_arg1 = build_int_cst (int_arg_type, 1);
+      const0 = build_int_cst (int_type, 0);
+      const1 = build_int_cst (int_type, 1);
+
+      /* 1) First check for 0 by first masking out sign bit.
+	 2) Then check for NaNs using a bit mask by checking first if the
+	    exponent has all bits set, if it does it can be either NaN or INF.
+	 3) Anything else are subnormal numbers.  */
+
+      /* ~(1 << location_sign_bit).
+	 This creates a mask that can be used to mask out the sign bit.  */
+      not_sign_mask = fold_build1_loc (loc, BIT_NOT_EXPR, int_arg_type,
+			fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
+			  const_arg1,
+			  build_int_cst (int_arg_type, format->signbit_rw)));
+
+      /* num & not_sign_mask == 0.
+	 This checks to see if the number is zero.  */
+      zero_check = fold_build2_loc (loc, EQ_EXPR, int_type, const_arg0,
+			 fold_build2_loc (loc, BIT_AND_EXPR, int_arg_type,
+			   int_arg, not_sign_mask));
+
+      /* b^(p-1) - 1, i.e. (1 << (p - 1)) - 1 when b == 2.
+	 This creates a mask to be used to check the mantissa value.  */
+      mantissa_mask = fold_build2_loc (loc, MINUS_EXPR, int_arg_type,
+			 fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
+			    build_int_cst (int_arg_type, format->b),
+			    build_int_cst (int_arg_type, format->p - 2)),
+			 const_arg1);
+
+      /* num & mantissa_mask != 0.  */
+      mantissa_any_set = fold_build2_loc (loc, NE_EXPR, int_type, const_arg0,
+			    fold_build2_loc (loc, BIT_AND_EXPR, int_arg_type,
+			      mantissa_mask, int_arg));
+
+      /* (exp & 1) != 0.
+	 This check can be used to check if the exp is all 0 or all 1.
+	 At the point it is used the exp is either all 1 or 0, so checking
+	 one bit is enough to disambiguate between the two.  */
+      exp_lsb_set = fold_build2_loc (loc, NE_EXPR, int_type, const0,
+			    fold_build2_loc (loc, BIT_AND_EXPR, int_type,
+					     exp, const1));
+
+      /* Combine the values together.  */
+      specials = fold_build3_loc (loc, COND_EXPR, int_type, zero_check, fp_zero,
+		   fold_build3_loc (loc, COND_EXPR, int_type, exp_lsb_set,
+		    fold_build3_loc (loc, COND_EXPR, int_type, mantissa_any_set,
+		      HONOR_NANS (mode) ? fp_nan : fp_normal,
+		      HONOR_INFINITIES (mode) ? fp_infinite : fp_normal),
+		    fp_subnormal));
+
+      /* Top level compare of the most general case,
+	 try to see if it's a normal real.  */
+
+      /* exp_mask & ~1.  */
+      mask_check = fold_build2_loc (loc, BIT_AND_EXPR, int_type,
+			  build_int_cst (int_type, exp_mask),
+			  fold_build1_loc (loc, BIT_NOT_EXPR, int_type,
+					   const1));
+
+      res = fold_build3_loc (loc, COND_EXPR, int_type,
+	       fold_build2_loc (loc, NE_EXPR, int_type, const0,
+		 /* (exp + 1) & mask_check.
+		    Check to see if exp is not all 0 or all 1.  */
+		 fold_build2_loc (loc, BIT_AND_EXPR, int_type,
+		   fold_build2_loc (loc, PLUS_EXPR, int_type, exp, const1),
+		     mask_check)),
+		   fp_normal, specials);
 
-  /* fpclassify(x) ->
-       isnan(x) ? FP_NAN :
-         (fabs(x) == Inf ? FP_INFINITE :
-	   (fabs(x) >= DBL_MIN ? FP_NORMAL :
-	     (x == 0 ? FP_ZERO : FP_SUBNORMAL))).  */
+      return res;
+    }
 
+  REAL_VALUE_TYPE r;
+  tree tmp;
+  char buf[128];
+  arg = builtin_save_expr (fold_build1_loc (loc, ABS_EXPR, type, arg));
   tmp = fold_build2_loc (loc, EQ_EXPR, integer_type_node, arg,
 		     build_real (type, dconst0));
   res = fold_build3_loc (loc, COND_EXPR, integer_type_node,
diff --git a/gcc/real.h b/gcc/real.h
index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
--- a/gcc/real.h
+++ b/gcc/real.h
@@ -161,6 +161,15 @@ struct real_format
   bool has_signed_zero;
   bool qnan_msb_set;
   bool canonical_nan_lsbs_set;
+
+  /* This flag indicates whether the format can be used in the optimized
+     code paths for the __builtin_fpclassify function and friends.
+     The format has to have the same NaN and INF representation as normal
+     IEEE floats (e.g. exp must have all bits set), most significant bit must be
+     sign bit, followed by exp bits of at most 32 bits.  Lastly the floating
+     point number must be representable as an integer.  The base of the number
+     also must be base 2.  */
+  bool is_binary_ieee_compatible;
   const char *name;
 };
 
diff --git a/gcc/real.c b/gcc/real.c
index 66e88e2ad366f7848609d157074c80420d778bcf..a9ad63072b5d5803eb048d30af5546e0b458f857 100644
--- a/gcc/real.c
+++ b/gcc/real.c
@@ -3052,6 +3052,7 @@ const struct real_format ieee_single_format =
     true,
     true,
     false,
+    true,
     "ieee_single"
   };
 
@@ -3075,6 +3076,7 @@ const struct real_format mips_single_format =
     true,
     false,
     true,
+    true,
     "mips_single"
   };
 
@@ -3098,6 +3100,7 @@ const struct real_format motorola_single_format =
     true,
     true,
     true,
+    true,
     "motorola_single"
   };
 
@@ -3132,6 +3135,7 @@ const struct real_format spu_single_format =
     true,
     false,
     false,
+    false,
     "spu_single"
   };
 \f
@@ -3343,6 +3347,7 @@ const struct real_format ieee_double_format =
     true,
     true,
     false,
+    true,
     "ieee_double"
   };
 
@@ -3366,6 +3371,7 @@ const struct real_format mips_double_format =
     true,
     false,
     true,
+    true,
     "mips_double"
   };
 
@@ -3389,6 +3395,7 @@ const struct real_format motorola_double_format =
     true,
     true,
     true,
+    true,
     "motorola_double"
   };
 \f
@@ -3735,6 +3742,7 @@ const struct real_format ieee_extended_motorola_format =
     true,
     true,
     true,
+    false,
     "ieee_extended_motorola"
   };
 
@@ -3758,6 +3766,7 @@ const struct real_format ieee_extended_intel_96_format =
     true,
     true,
     false,
+    false,
     "ieee_extended_intel_96"
   };
 
@@ -3781,6 +3790,7 @@ const struct real_format ieee_extended_intel_128_format =
     true,
     true,
     false,
+    false,
     "ieee_extended_intel_128"
   };
 
@@ -3806,6 +3816,7 @@ const struct real_format ieee_extended_intel_96_round_53_format =
     true,
     true,
     false,
+    false,
     "ieee_extended_intel_96_round_53"
   };
 \f
@@ -3896,6 +3907,7 @@ const struct real_format ibm_extended_format =
     true,
     true,
     false,
+    false,
     "ibm_extended"
   };
 
@@ -3919,6 +3931,7 @@ const struct real_format mips_extended_format =
     true,
     false,
     true,
+    false,
     "mips_extended"
   };
 
@@ -4184,6 +4197,7 @@ const struct real_format ieee_quad_format =
     true,
     true,
     false,
+    false,
     "ieee_quad"
   };
 
@@ -4207,6 +4221,7 @@ const struct real_format mips_quad_format =
     true,
     false,
     true,
+    false,
     "mips_quad"
   };
 \f
@@ -4509,6 +4524,7 @@ const struct real_format vax_f_format =
     false,
     false,
     false,
+    false,
     "vax_f"
   };
 
@@ -4532,6 +4548,7 @@ const struct real_format vax_d_format =
     false,
     false,
     false,
+    false,
     "vax_d"
   };
 
@@ -4555,6 +4572,7 @@ const struct real_format vax_g_format =
     false,
     false,
     false,
+    false,
     "vax_g"
   };
 \f
@@ -4633,6 +4651,7 @@ const struct real_format decimal_single_format =
     true,
     true,
     false,
+    false,
     "decimal_single"
   };
 
@@ -4657,6 +4676,7 @@ const struct real_format decimal_double_format =
     true,
     true,
     false,
+    false,
     "decimal_double"
   };
 
@@ -4681,6 +4701,7 @@ const struct real_format decimal_quad_format =
     true,
     true,
     false,
+    false,
     "decimal_quad"
   };
 \f
@@ -4820,6 +4841,7 @@ const struct real_format ieee_half_format =
     true,
     true,
     false,
+    false,
     "ieee_half"
   };
 
@@ -4846,6 +4868,7 @@ const struct real_format arm_half_format =
     true,
     false,
     false,
+    false,
     "arm_half"
   };
 \f
@@ -4893,6 +4916,7 @@ const struct real_format real_internal_format =
     true,
     true,
     false,
+    false,
     "real_internal"
   };
 \f


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 16:21 Tamar Christina
@ 2016-09-12 22:33 ` Joseph Myers
  2016-09-13 12:25   ` Tamar Christina
  2016-09-12 22:41 ` Joseph Myers
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Joseph Myers @ 2016-09-12 22:33 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd

On Mon, 12 Sep 2016, Tamar Christina wrote:

> Hi All,
> 
> This patch adds an optimized route to the fpclassify builtin
> for floating point numbers which are similar to IEEE-754 in format.

Similar changes may be useful for __builtin_isfinite, __builtin_isnan, 
__builtin_isinf, __builtin_isinf_sign, __builtin_isnormal.

Will your version always use only integer operations if the format is IEEE 
enough?  If so, it could be used by glibc's <math.h> if __SUPPORT_SNAN__ 
(-fsignaling-nans), except in the case where IBM long double is supported, 
whereas presently all those built-in functions are avoided by glibc 
<math.h> for -fsignaling-nans.  The same applies to integer versions of 
the other functions - whether or not they are beneficial in performance 
normally, they are correct for -fsignaling-nans, which the present 
built-in functions aren't.

(I intend to add issubnormal and iszero macros to glibc following TS 
18661-1; built-in versions of those would be useful as well.)

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 22:33 ` Joseph Myers
@ 2016-09-13 12:25   ` Tamar Christina
  0 siblings, 0 replies; 32+ messages in thread
From: Tamar Christina @ 2016-09-13 12:25 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Patches, jakub, rguenther, law, nd



On 12/09/16 23:28, Joseph Myers wrote:
> On Mon, 12 Sep 2016, Tamar Christina wrote:
>
> Similar changes may be useful for __builtin_isfinite, __builtin_isnan,
> __builtin_isinf, __builtin_isinf_sign, __builtin_isnormal.
>
> Will your version always use only integer operations if the format is IEEE
> enough?
Yes, it will. The idea was indeed to also do those calls, but to start
with this one first to see what the feedback would be. I believe there's
a ticket for that as well:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66462

Tamar


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 16:21 Tamar Christina
  2016-09-12 22:33 ` Joseph Myers
@ 2016-09-12 22:41 ` Joseph Myers
  2016-09-13 12:30   ` Tamar Christina
  2016-09-12 22:49 ` Joseph Myers
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Joseph Myers @ 2016-09-12 22:41 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd

On Mon, 12 Sep 2016, Tamar Christina wrote:

> A limitation with this new approach is that the exponent
> of the floating point has to fit in 31 bits and the floating
> point has to have an IEEE like format and values for NaN and INF
> (e.g. for NaN and INF all bits of the exp must be set).
> 
> To determine this IEEE likeness a new boolean was added to real_format.

Why is this boolean false for ieee_quad_format, mips_quad_format and 
ieee_half_format?  They should meet your description (even if the x86 / 
m68k "extended" formats don't because of the leading mantissa bit being 
set for infinities).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 22:41 ` Joseph Myers
@ 2016-09-13 12:30   ` Tamar Christina
  2016-09-13 12:44     ` Joseph Myers
  0 siblings, 1 reply; 32+ messages in thread
From: Tamar Christina @ 2016-09-13 12:30 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Patches, jakub, rguenther, law, nd



On 12/09/16 23:33, Joseph Myers wrote:
> Why is this boolean false for ieee_quad_format, mips_quad_format and
> ieee_half_format?  They should meet your description (even if the x86 /
> m68k "extended" formats don't because of the leading mantissa bit being
> set for infinities).
>
Ah, I played it a bit too safe there. I will change this, do some
re-testing and update the patch.

Tamar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13 12:30   ` Tamar Christina
@ 2016-09-13 12:44     ` Joseph Myers
  2016-09-15  9:08       ` Tamar Christina
  0 siblings, 1 reply; 32+ messages in thread
From: Joseph Myers @ 2016-09-13 12:44 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd

On Tue, 13 Sep 2016, Tamar Christina wrote:

> 
> 
> On 12/09/16 23:33, Joseph Myers wrote:
> > Why is this boolean false for ieee_quad_format, mips_quad_format and
> > ieee_half_format?  They should meet your description (even if the x86 /
> > m68k "extended" formats don't because of the leading mantissa bit being
> > set for infinities).
> > 
> Ah, I played it a bit too safe there. I will change this and do some 
> re-testing and update the patch.

It occurred to me that there might be an issue with your approach of 
overlaying the floating-point value with a single integer, when the quad 
formats are used on 32-bit systems where TImode isn't fully supported as a 
scalar mode.  However, if that's an issue the answer isn't to mark the 
formats as non-IEEE, it's to support ORing together the relevant parts of 
multiple words when determining whether the mantissa is nonzero (or some 
equivalent logic).
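
As an illustration only, here is a minimal sketch of the multi-word idea, assuming an IEEE binary128 value held as four 32-bit words with the most significant word first. The struct and function names are hypothetical and are not GCC internals; the point is just that the mantissa test only needs the OR of the relevant pieces.

```c
#include <assert.h>
#include <math.h>     /* FP_NAN, FP_INFINITE, FP_NORMAL, ... */
#include <stdint.h>

/* Hypothetical container: w[0] is the most significant word and holds
   the sign (1 bit), the exponent (15 bits) and the top 16 mantissa bits.
   The remaining 96 mantissa bits live in w[1]..w[3].  */
struct quad_words { uint32_t w[4]; };

static int
classify_quad_words (const struct quad_words *q)
{
  assert (q != 0);

  uint32_t exp = (q->w[0] >> 16) & 0x7fff;
  /* OR the mantissa pieces together; only zero/nonzero matters.  */
  uint32_t mant_any = (q->w[0] & 0xffff) | q->w[1] | q->w[2] | q->w[3];

  if (exp != 0 && exp != 0x7fff)
    return FP_NORMAL;
  if (exp == 0)
    return mant_any ? FP_SUBNORMAL : FP_ZERO;
  return mant_any ? FP_NAN : FP_INFINITE;
}
```

No 128-bit integer type is needed for this, only word-sized masks and ORs.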

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13 12:44     ` Joseph Myers
@ 2016-09-15  9:08       ` Tamar Christina
  2016-09-15 11:21         ` Wilco Dijkstra
  2016-09-15 13:05         ` Joseph Myers
  0 siblings, 2 replies; 32+ messages in thread
From: Tamar Christina @ 2016-09-15  9:08 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Patches, jakub, rguenther, law, nd, Wilco Dijkstra



On 13/09/16 13:43, Joseph Myers wrote:
> On Tue, 13 Sep 2016, Tamar Christina wrote:
>
>>
>> On 12/09/16 23:33, Joseph Myers wrote:
>>> Why is this boolean false for ieee_quad_format, mips_quad_format and
>>> ieee_half_format?  They should meet your description (even if the x86 /
>>> m68k "extended" formats don't because of the leading mantissa bit being
>>> set for infinities).
>>>
>> Ah, I played it a bit too safe there. I will change this and do some
>> re-testing and update the patch.
> It occurred to me that there might be an issue with your approach of
> overlaying the floating-point value with a single integer, when the quad
> formats are used on 32-bit systems where TImode isn't fully supported as a
> scalar mode.  However, if that's an issue the answer isn't to mark the
> formats as non-IEEE, it's to support ORing together the relevant parts of
> multiple words when determining whether the mantissa is nonzero (or some
> equivalent logic).
>
I have been trying to reproduce this on the architectures I have access to
but have been unable to so far. In practice, if this does happen, isn't it
the fault of the system for advertising partial TImode support and support
of IEEE types?

It seems to me that in order to do this, fpclassify would incur a rather
large cost in complexity. Also, wouldn't this be problematic for other
functions as well, such as expand_builtin_signbit?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-15  9:08       ` Tamar Christina
@ 2016-09-15 11:21         ` Wilco Dijkstra
  2016-09-15 12:56           ` Joseph Myers
  2016-09-15 13:05         ` Joseph Myers
  1 sibling, 1 reply; 32+ messages in thread
From: Wilco Dijkstra @ 2016-09-15 11:21 UTC (permalink / raw)
  To: Tamar Christina, Joseph Myers; +Cc: GCC Patches, jakub, rguenther, law, nd

Tamar Christina wrote:
> On 13/09/16 13:43, Joseph Myers wrote:
> > On Tue, 13 Sep 2016, Tamar Christina wrote:
>>
> >> On 12/09/16 23:33, Joseph Myers wrote:
> >>> Why is this boolean false for ieee_quad_format, mips_quad_format and
> >>> ieee_half_format?  They should meet your description (even if the x86 /
> >>> m68k "extended" formats don't because of the leading mantissa bit being
> >>> set for infinities).
> >>>
> >> Ah, I played it a bit too safe there. I will change this and do some
> >> re-testing and update the patch.
> > It occurred to me that there might be an issue with your approach of
> > overlaying the floating-point value with a single integer, when the quad
> > formats are used on 32-bit systems where TImode isn't fully supported as a
> > scalar mode.  However, if that's an issue the answer isn't to mark the
> > formats as non-IEEE, it's to support ORing together the relevant parts of
> > multiple words when determining whether the mantissa is nonzero (or some
> > equivalent logic).
> >
> I have been trying to reproduce this on the architectures I have access to
> but have been unable to so far. In practice if this does happen though 
> isn't it the fault of the system for advertising partial TImode support and 
> support of IEEE types?
>
> It seems to me that in order for me to be able to do this fpclassify 
> would incur a rather large cost in complexity. Also wouldn't this be problematic 
> for other functions as well such as expand_builtin_signbit?

Yes, if there are targets which don't implement TImode operations then surely
they should be automatically split into DImode operations before or during Expand?
GCC's implementation of types larger than the register int type is generally extremely
poor as it is missing such an expansion (practically all compilers do this), so this
would improve things significantly.

So for now it would seem best to keep the boolean false for quad formats on 32-bit
targets.

Wilco

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-15 11:21         ` Wilco Dijkstra
@ 2016-09-15 12:56           ` Joseph Myers
  0 siblings, 0 replies; 32+ messages in thread
From: Joseph Myers @ 2016-09-15 12:56 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Tamar Christina, GCC Patches, jakub, rguenther, law, nd

On Thu, 15 Sep 2016, Wilco Dijkstra wrote:

> Yes, if there are targets which don't implement TImode operations then 
> surely they should be automatically split into DImode operations before 
> or during Expand?

The operations generally don't exist if the mode fails the 
scalar_mode_supported_p hook.  I don't know whether there are sufficient 
TImode operations for the bitwise operations you need here, even in the 
case where it fails that hook (and so you can't declare variables with 
that mode) - it's arithmetic, and the ABI support needed for argument 
passing, that are harder to do by splitting into smaller modes (and that 
GCC generally only handles in libgcc for 2-word operands, not for 4-word 
operands).

> So for now it would seem best to keep the boolean false for quad formats 
> on 32-bit targets.

This is a function of command-line options, not the format, so it can't go 
in the table.  The table should describe the format properties only.

Does the expansion work, in fact, for __float128 on 32-bit x86, given the 
boolean set to true (other relevant cases include 128-bit long double on 
32-bit s390 and 32-bit sparc with appropriate options to make long double 
128-bit)?  If it does, it may be OK to use modes that fail the 
scalar_mode_supported_p hook.  If something doesn't work in that case, the 
right way to avoid an expansion is not to set the boolean to false in the 
table of formats, it's to loop over supported integer modes seeing if 
there is one wide enough that also passes the scalar_mode_supported_p 
hook.
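
To make the suggested loop concrete, here is a small self-contained sketch. It does not use GCC's real mode machinery: the `int_mode_info` table and `find_int_mode` are hypothetical stand-ins, and the `scalar_supported` field merely models the answer the scalar_mode_supported_p hook would give.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a target's integer modes, ordered from
   narrowest to widest.  */
struct int_mode_info
{
  const char *name;
  unsigned bits;
  bool scalar_supported;   /* Models the scalar_mode_supported_p answer.  */
};

/* Return the narrowest supported mode with at least FLOAT_BITS bits,
   or NULL if the integer expansion should not be used.  */
static const struct int_mode_info *
find_int_mode (const struct int_mode_info *modes, size_t n,
	       unsigned float_bits)
{
  assert (modes != NULL || n == 0);

  for (size_t i = 0; i < n; i++)
    if (modes[i].bits >= float_bits && modes[i].scalar_supported)
      return &modes[i];
  return NULL;
}
```

With a table where the 128-bit entry is unsupported, a 128-bit float gets NULL and falls back to the old path, while the format table itself stays a pure description of the format.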

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-15  9:08       ` Tamar Christina
  2016-09-15 11:21         ` Wilco Dijkstra
@ 2016-09-15 13:05         ` Joseph Myers
  1 sibling, 0 replies; 32+ messages in thread
From: Joseph Myers @ 2016-09-15 13:05 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd, Wilco Dijkstra

On Thu, 15 Sep 2016, Tamar Christina wrote:

> a rather large cost in complexity. Also wouldn't this be problematic 
> for other functions as well such as expand_builtin_signbit?

expand_builtin_signbit computes a word number and the bit position in that 
word.  It has no problem with 128-bit types on 32-bit systems where the 
largest integer mode supported for scalar variables is DImode.
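
The word-number and bit-position idea can be illustrated at the source level as follows. This is only a sketch with a hypothetical function name, assuming the value is stored as 32-bit words with the least significant word first; it is not the actual expand_builtin_signbit code.

```c
#include <assert.h>
#include <stdint.h>

/* Sign bit of a value stored as NWORDS 32-bit words, least significant
   word first.  SIGN_BITPOS is the bit index within the whole value,
   e.g. 127 for IEEE binary128 or 63 for binary64.  */
static int
signbit_from_words (const uint32_t *words, unsigned nwords,
		    unsigned sign_bitpos)
{
  unsigned word = sign_bitpos / 32;   /* which word holds the bit */
  unsigned bit = sign_bitpos % 32;    /* position inside that word */

  assert (word < nwords);
  return (words[word] >> bit) & 1;
}
```

Only one word is ever inspected, which is why the width of the widest supported integer mode does not matter here.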

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 16:21 Tamar Christina
  2016-09-12 22:33 ` Joseph Myers
  2016-09-12 22:41 ` Joseph Myers
@ 2016-09-12 22:49 ` Joseph Myers
  2016-09-13 12:33   ` Tamar Christina
  2016-09-13  8:58 ` Jakub Jelinek
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Joseph Myers @ 2016-09-12 22:49 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd

Are you making endianness assumptions - specifically, does the 
reinterpretation as an integer require that WORDS_BIG_ENDIAN and 
FLOAT_WORDS_BIG_ENDIAN are the same?  If so, I think that's OK (in that 
the only target where they aren't the same seems to be pdp11 which doesn't 
use IEEE formats), but probably the code should check explicitly.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 22:49 ` Joseph Myers
@ 2016-09-13 12:33   ` Tamar Christina
  2016-09-13 12:48     ` Joseph Myers
  0 siblings, 1 reply; 32+ messages in thread
From: Tamar Christina @ 2016-09-13 12:33 UTC (permalink / raw)
  To: Joseph Myers; +Cc: GCC Patches, jakub, rguenther, law, nd


On 12/09/16 23:41, Joseph Myers wrote:
> Are you making endianness assumptions - specifically, does the
> reinterpretation as an integer require that WORDS_BIG_ENDIAN and
> FLOAT_WORDS_BIG_ENDIAN are the same?  If so, I think that's OK (in that
> the only target where they aren't the same seems to be pdp11 which doesn't
> use IEEE formats), but probably the code should check explicitly.
>
No, if I understood the question correctly then this should be OK,
since I always access the float as an integer of equivalent precision.
So a 64-bit float will be addressed as a 64-bit int.

Tamar


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13 12:33   ` Tamar Christina
@ 2016-09-13 12:48     ` Joseph Myers
  0 siblings, 0 replies; 32+ messages in thread
From: Joseph Myers @ 2016-09-13 12:48 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd

On Tue, 13 Sep 2016, Tamar Christina wrote:

> On 12/09/16 23:41, Joseph Myers wrote:
> > Are you making endianness assumptions - specifically, does the
> > reinterpretation as an integer require that WORDS_BIG_ENDIAN and
> > FLOAT_WORDS_BIG_ENDIAN are the same?  If so, I think that's OK (in that
> > the only target where they aren't the same seems to be pdp11 which doesn't
> > use IEEE formats), but probably the code should check explicitly.
> > 
> No, if I understood the question correctly then  this should be ok,
> since I always access the float as an integer of equivalent precision.
> So a 64bit float will be addressed as a 64bit int.

My point is that there are theoretically systems where the order of words 
in a 64-bit float is not the same as the order of words in a 64-bit 
integer.  Though it may be the case in practice that no such targets in 
GCC use IEEE formats (and that pdp11 is the only target without all of 
BYTES_BIG_ENDIAN, WORDS_BIG_ENDIAN and FLOAT_WORDS_BIG_ENDIAN the same).
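
A quick way to see what "the same order of words" means in practice is a runtime probe like the one below. It is a sketch that assumes double is IEEE binary64 and uses a hypothetical function name; on a target where float and integer word order differed, the two 32-bit halves would come out swapped and the comparison would fail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if a double reinterpreted as a 64-bit integer has the
   expected IEEE binary64 bit pattern, i.e. float and integer word
   order agree on this target.  */
static int
float_int_word_order_agrees (void)
{
  double one = 1.0;
  uint64_t bits;

  assert (sizeof bits == sizeof one);
  memcpy (&bits, &one, sizeof bits);
  /* 1.0 is sign 0, biased exponent 0x3ff, mantissa 0.  */
  return bits == UINT64_C (0x3ff0000000000000);
}
```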

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 16:21 Tamar Christina
                   ` (2 preceding siblings ...)
  2016-09-12 22:49 ` Joseph Myers
@ 2016-09-13  8:58 ` Jakub Jelinek
  2016-09-13 16:16   ` Jeff Law
  2016-09-16 19:53 ` Jeff Law
  2016-09-19 22:43 ` Michael Meissner
  5 siblings, 1 reply; 32+ messages in thread
From: Jakub Jelinek @ 2016-09-13  8:58 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, rguenther, law, nd

On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
> This patch adds an optimized route to the fpclassify builtin
> for floating point numbers which are similar to IEEE-754 in format.
> 
> The goal is to make it faster by:
> 1. Trying to determine the most common case first
>    (e.g. the float is a Normal number) and then the
>    rest. The amount of code generated at -O2 are
>    about the same +/- 1 instruction, but the code
>    is much better.
> 2. Using integer operation in the optimized path.

Is it generally preferable to use integer operations for this instead
of floating point operations?  I mean various targets have quite high costs
of moving data in between the general purpose and floating point register
file, often it has to go through memory etc.

	Jakub

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13  8:58 ` Jakub Jelinek
@ 2016-09-13 16:16   ` Jeff Law
  2016-09-14  8:31     ` Richard Biener
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff Law @ 2016-09-13 16:16 UTC (permalink / raw)
  To: Jakub Jelinek, Tamar Christina; +Cc: GCC Patches, rguenther, nd

On 09/13/2016 02:41 AM, Jakub Jelinek wrote:
> On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
>> This patch adds an optimized route to the fpclassify builtin
>> for floating point numbers which are similar to IEEE-754 in format.
>>
>> The goal is to make it faster by:
>> 1. Trying to determine the most common case first
>>    (e.g. the float is a Normal number) and then the
>>    rest. The amount of code generated at -O2 are
>>    about the same +/- 1 instruction, but the code
>>    is much better.
>> 2. Using integer operation in the optimized path.
>
> Is it generally preferable to use integer operations for this instead
> of floating point operations?  I mean various targets have quite high costs
> of moving data in between the general purpose and floating point register
> file, often it has to go through memory etc.
Bit testing/twiddling is obviously a trade-off for a non-addressable 
object.  I don't think there's any reasonable way to always generate the 
most efficient code as it's going to depend on (for example) register 
allocation behavior.

So what we're stuck doing is relying on the target costing bits to guide 
this kind of thing.

jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-13 16:16   ` Jeff Law
@ 2016-09-14  8:31     ` Richard Biener
  2016-09-15 16:02       ` Jeff Law
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Biener @ 2016-09-14  8:31 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jakub Jelinek, Tamar Christina, GCC Patches, rguenther, nd

On Tue, Sep 13, 2016 at 6:15 PM, Jeff Law <law@redhat.com> wrote:
> On 09/13/2016 02:41 AM, Jakub Jelinek wrote:
>>
>> On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
>>>
>>> This patch adds an optimized route to the fpclassify builtin
>>> for floating point numbers which are similar to IEEE-754 in format.
>>>
>>> The goal is to make it faster by:
>>> 1. Trying to determine the most common case first
>>>    (e.g. the float is a Normal number) and then the
>>>    rest. The amount of code generated at -O2 are
>>>    about the same +/- 1 instruction, but the code
>>>    is much better.
>>> 2. Using integer operation in the optimized path.
>>
>>
>> Is it generally preferable to use integer operations for this instead
>> of floating point operations?  I mean various targets have quite high
>> costs
>> of moving data in between the general purpose and floating point register
>> file, often it has to go through memory etc.
>
> Bit testing/twiddling is obviously a trade-off for a non-addressable object.
> I don't think there's any reasonable way to always generate the most
> efficient code as it's going to depend on (for example) register allocation
> behavior.
>
> So what we're stuck doing is relying on the target costing bits to guide
> this kind of thing.

I think the reason for this patch is to provide a general optimized
integer version.

The only reason to not use integer operations (compared to what
fold_builtin_classify does currently) is that the folding is done very early
at the moment and it's harder to optimize the integer bit-twiddling with more
FP context known.  Like if we know if (! isnan ()) then unless we also expand
that inline via bit-twiddling nothing will optimize the followup test from the
fpclassify.  This might be somewhat moot at the moment given our lack of FP
value-range propagation but it should be a general concern (of doing this too
early).
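
A source-level picture of that concern, as a sketch with a hypothetical function name: once isnan (x) is known to be false, the FP_NAN arm of the switch is dead, but removing it requires the compiler to relate the two tests, which is harder if one is lowered to integer bit tests early and the other is not.

```c
#include <assert.h>
#include <math.h>

/* Returns 0 for NaN, otherwise a small code for the class.  */
static int
describe (double x)
{
  if (isnan (x))
    return 0;

  switch (fpclassify (x))
    {
    case FP_NAN:
      /* Dead in principle: NaN was excluded by the test above.  */
      assert (!"NaN already excluded");
      return -1;
    case FP_INFINITE:
      return 1;
    case FP_ZERO:
      return 2;
    case FP_SUBNORMAL:
      return 3;
    default:
      return 4;   /* FP_NORMAL */
    }
}
```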

I think it asks for a FP (class) propagation pass somewhere (maybe as part of
complex lowering which already has a similar "coarse" lattice -- not that I like
its implementation very much) and doing the "lowering" there.

Not something that should block this patch though.

Richard.

> jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-14  8:31     ` Richard Biener
@ 2016-09-15 16:02       ` Jeff Law
  2016-09-15 16:28         ` Richard Biener
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff Law @ 2016-09-15 16:02 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jakub Jelinek, Tamar Christina, GCC Patches, rguenther, nd

On 09/14/2016 02:24 AM, Richard Biener wrote:
> On Tue, Sep 13, 2016 at 6:15 PM, Jeff Law <law@redhat.com> wrote:
>> On 09/13/2016 02:41 AM, Jakub Jelinek wrote:
>>>
>>> On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
>>>>
>>>> This patch adds an optimized route to the fpclassify builtin
>>>> for floating point numbers which are similar to IEEE-754 in format.
>>>>
>>>> The goal is to make it faster by:
>>>> 1. Trying to determine the most common case first
>>>>    (e.g. the float is a Normal number) and then the
>>>>    rest. The amount of code generated at -O2 are
>>>>    about the same +/- 1 instruction, but the code
>>>>    is much better.
>>>> 2. Using integer operation in the optimized path.
>>>
>>>
>>> Is it generally preferable to use integer operations for this instead
>>> of floating point operations?  I mean various targets have quite high
>>> costs
>>> of moving data in between the general purpose and floating point register
>>> file, often it has to go through memory etc.
>>
>> Bit testing/twiddling is obviously a trade-off for a non-addressable object.
>> I don't think there's any reasonable way to always generate the most
>> efficient code as it's going to depend on (for example) register allocation
>> behavior.
>>
>> So what we're stuck doing is relying on the target costing bits to guide
>> this kind of thing.
>
> I think the reason for this patch is to provide a general optimized
> integer version.
And just to be clear, that's fine with me.  While there are cases where 
bit twiddling hurts, I think bit twiddling is generally better.


> I think it asks for a FP (class) propagation pass somewhere (maybe as part of
> complex lowering which already has a similar "coarse" lattice -- not that I like
> its implementation very much) and doing the "lowering" there.
Not a bad idea -- I wonder how much a coarse tracking of the exceptional 
cases would allow later optimization.

>
> Not something that should block this patch though.
Agreed.

jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-15 16:02       ` Jeff Law
@ 2016-09-15 16:28         ` Richard Biener
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Biener @ 2016-09-15 16:28 UTC (permalink / raw)
  To: Jeff Law, Richard Biener; +Cc: Jakub Jelinek, Tamar Christina, GCC Patches, nd

On September 15, 2016 5:52:34 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>On 09/14/2016 02:24 AM, Richard Biener wrote:
>> On Tue, Sep 13, 2016 at 6:15 PM, Jeff Law <law@redhat.com> wrote:
>>> On 09/13/2016 02:41 AM, Jakub Jelinek wrote:
>>>>
>>>> On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
>>>>>
>>>>> This patch adds an optimized route to the fpclassify builtin
>>>>> for floating point numbers which are similar to IEEE-754 in format.
>>>>>
>>>>> The goal is to make it faster by:
>>>>> 1. Trying to determine the most common case first
>>>>>    (e.g. the float is a Normal number) and then the
>>>>>    rest. The amount of code generated at -O2 are
>>>>>    about the same +/- 1 instruction, but the code
>>>>>    is much better.
>>>>> 2. Using integer operation in the optimized path.
>>>>
>>>>
>>>> Is it generally preferable to use integer operations for this instead
>>>> of floating point operations?  I mean various targets have quite high
>>>> costs of moving data in between the general purpose and floating point
>>>> register file, often it has to go through memory etc.
>>>
>>> Bit testing/twiddling is obviously a trade-off for a non-addressable
>>> object.  I don't think there's any reasonable way to always generate the
>>> most efficient code as it's going to depend on (for example) register
>>> allocation behavior.
>>>
>>> So what we're stuck doing is relying on the target costing bits to guide
>>> this kind of thing.
>>
>> I think the reason for this patch is to provide a general optimized
>> integer version.
>And just to be clear, that's fine with me.  While there are cases where
>bit twiddling hurts, I think bit twiddling is generally better.
>
>
>> I think it asks for a FP (class) propagation pass somewhere (maybe as
>> part of complex lowering which already has a similar "coarse" lattice --
>> not that I like its implementation very much) and doing the "lowering"
>> there.
>Not a bad idea -- I wonder how much a coarse tracking of the exceptional
>cases would allow later optimization.

I guess it really depends on the ability to set ffast-math flags on individual stmts (or at least built-in calls).

Richard.

>>
>> Not something that should block this patch though.
>Agreed.
>
>jeff


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 16:21 Tamar Christina
                   ` (3 preceding siblings ...)
  2016-09-13  8:58 ` Jakub Jelinek
@ 2016-09-16 19:53 ` Jeff Law
  2016-09-20 12:14   ` Tamar Christina
  2016-09-19 22:43 ` Michael Meissner
  5 siblings, 1 reply; 32+ messages in thread
From: Jeff Law @ 2016-09-16 19:53 UTC (permalink / raw)
  To: Tamar Christina, GCC Patches, jakub, rguenther; +Cc: nd

On 09/12/2016 10:19 AM, Tamar Christina wrote:
> Hi All,
>
> This patch adds an optimized route to the fpclassify builtin
> for floating point numbers which are similar to IEEE-754 in format.
>
> The goal is to make it faster by:
> 1. Trying to determine the most common case first
>    (e.g. the float is a Normal number) and then the
>    rest. The amount of code generated at -O2 are
>    about the same ± 1 instruction, but the code
>    is much better.
> 2. Using integer operation in the optimized path.
>
> At a high level, the optimized path uses integer operations
> to perform the following:
>
>   if (exponent bits aren't all set or unset)
>      return Normal;
>   else if (no bits are set on the number after masking out
> 	   sign bits then)
>      return Zero;
>   else if (exponent has no bits set)
>      return Subnormal;
>   else if (mantissa has no bits set)
>      return Infinite;
>   else
>      return NaN;
>
> In case the optimization can't be applied the old
> implementation is used as a fall-back.
>
> A limitation with this new approach is that the exponent
> of the floating point has to fit in 31 bits and the floating
> point has to have an IEEE like format and values for NaN and INF
> (e.g. for NaN and INF all bits of the exp must be set).
>
> To determine this IEEE likeness a new boolean was added to real_format.
>
> Regression tests ran on aarch64-none-linux and arm-none-linux-gnueabi
> with no regressions. x86 uses its own implementation rather than
> the fpclassify builtin.
>
> As an example, Aarch64 now generates for classification of doubles:
>
> f:
> 	fmov	x1, d0
> 	mov	w0, 7
> 	sbfx	x2, x1, 52, 11
> 	add	w3, w2, 1
> 	tst	w3, 0x07FE
> 	bne	.L1
> 	mov	w0, 13
> 	tst	x1, 0x7fffffffffffffff
> 	beq	.L1
> 	mov	w0, 11
> 	tbz	x2, 0, .L1
> 	tst	x1, 0xfffffffffffff
> 	mov	w0, 3
> 	mov	w1, 5
> 	csel	w0, w0, w1, ne
>
> .L1:
> 	ret
>
> No new tests as there are existing tests to test functionality.
> glibc benchmarks ran against the builtin and this shows a 31.3%
> performance gain.
>
> Ok for trunk?
>
> Thanks,
> Tamar
>
> PS. I don't have commit rights so if OK can someone apply the patch for me.
>
> gcc/
> 2016-08-25  Tamar Christina  <tamar.christina@arm.com>
> 	    Wilco Dijkstra  <wilco.dijkstra@arm.com>
>
> 	* gcc/builtins.c (fold_builtin_fpclassify): Added optimized version.
> 	* gcc/real.h (real_format): Added is_ieee_compatible field.
> 	* gcc/real.c (ieee_single_format): Set is_ieee_compatible flag.
> 	(mips_single_format): Likewise.
> 	(motorola_single_format): Likewise.
> 	(spu_single_format): Likewise.
> 	(ieee_double_format): Likewise.
> 	(mips_double_format): Likewise.
> 	(motorola_double_format): Likewise.
> 	(ieee_extended_motorola_format): Likewise.
> 	(ieee_extended_intel_128_format): Likewise.
> 	(ieee_extended_intel_96_round_53_format): Likewise.
> 	(ibm_extended_format): Likewise.
> 	(mips_extended_format): Likewise.
> 	(ieee_quad_format): Likewise.
> 	(mips_quad_format): Likewise.
> 	(vax_f_format): Likewise.
> 	(vax_d_format): Likewise.
> 	(vax_g_format): Likewise.
> 	(decimal_single_format): Likewise.
> 	(decimal_quad_format): Likewise.
> 	(ieee_half_format): Likewise.
> 	(mips_single_format): Likewise.
> 	(arm_half_format): Likewise.
> 	(real_internal_format): Likewise.
>
>
> gcc-public.patch
>
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 1073e35b17b1bc1f6974c71c940bd9d82bbbfc0f..58bf129f9a0228659fd3b976d38d021d1d5bd6bb 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -7947,10 +7947,8 @@ static tree
>  fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
>  {
>    tree fp_nan, fp_infinite, fp_normal, fp_subnormal, fp_zero,
> -    arg, type, res, tmp;
> +    arg, type, res;
>    machine_mode mode;
> -  REAL_VALUE_TYPE r;
> -  char buf[128];
>
>    /* Verify the required arguments in the original call.  */
>    if (nargs != 6
> @@ -7970,14 +7968,143 @@ fold_builtin_fpclassify (location_t loc, tree *args, int nargs)
>    arg = args[5];
>    type = TREE_TYPE (arg);
>    mode = TYPE_MODE (type);
> -  arg = builtin_save_expr (fold_build1_loc (loc, ABS_EXPR, type, arg));
> +  const real_format *format = REAL_MODE_FORMAT (mode);
> +
> +  /*
> +  For IEEE 754 types:
> +
> +  fpclassify (x) ->
> +       !((exp + 1) & (exp_mask & ~1)) // exponent bits not all set or unset
> +	 ? (x & sign_mask == 0 ? FP_ZERO :
> +	   (exp & exp_mask == exp_mask
> +	      ? (mantisa == 0 ? FP_INFINITE : FP_NAN) :
> +	      FP_SUBNORMAL)):
> +       FP_NORMAL.
> +
> +  Otherwise
> +
> +  fpclassify (x) ->
> +       isnan (x) ? FP_NAN :
> +	(fabs (x) == Inf ? FP_INFINITE :
> +	   (fabs (x) >= DBL_MIN ? FP_NORMAL :
> +	     (x == 0 ? FP_ZERO : FP_SUBNORMAL))).
> +  */
> +
> +  /* Check if the number that is being classified is close enough to IEEE 754
> +     format to be able to go in the early exit code.  */
> +  if (format->is_binary_ieee_compatible)
> +    {
> +      gcc_assert (format->b == 2);
> +
> +      const tree int_type = integer_type_node;
> +      const int exp_bits  = (GET_MODE_SIZE (mode) * BITS_PER_UNIT) - format->p;
> +      const int exp_mask  = (1 << exp_bits) - 1;
> +
> +      tree exp, specials, exp_bitfield,
> +	   const_arg0, const_arg1, const0, const1,
> +	   not_sign_mask, zero_check, mantissa_mask,
> +	   mantissa_any_set, exp_lsb_set, mask_check;
> +      tree int_arg_type, int_arg;
Style nit.  Just use

   tree exp, specials, exp_bitfield;
   tree const_arg0, const_arg1, etc etc.


> +
> +      /* Re-interpret the float as an unsigned integer type
> +	 with equal precision.  */
> +      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
> +      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
> +		  fold_build1_loc (loc, NOP_EXPR,
> +				   build_pointer_type (int_arg_type),
> +		    fold_build1_loc (loc, ADDR_EXPR,
> +				     build_pointer_type (type), arg)));
Doesn't this make ARG addressable?  Which in turn means ARG won't be 
exposed to the gimple/ssa optimizers.    Or is it the case that when 
fpclassify is used its argument is already in memory (and thus addressable?)
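
For readers following the algorithm rather than the tree-building code, here is a source-level sketch of the integer classification for IEEE binary64. The function name is hypothetical and this is not the patch's code; it uses memcpy for the reinterpretation instead of a pointer cast, and it says nothing about how the tree-level construction should be done.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static int
fpclassify_bits (double x)
{
  uint64_t bits;

  assert (sizeof bits == sizeof x);
  memcpy (&bits, &x, sizeof bits);   /* reinterpret without a pointer cast */

  uint64_t exp = (bits >> 52) & 0x7ff;
  uint64_t mantissa = bits & UINT64_C (0xfffffffffffff);

  /* Early exit: (exp + 1) & 0x7fe is nonzero unless exp is 0 or 0x7ff.  */
  if ((exp + 1) & 0x7fe)
    return FP_NORMAL;
  /* Everything except the sign bit is zero.  */
  if ((bits & UINT64_C (0x7fffffffffffffff)) == 0)
    return FP_ZERO;
  if (exp == 0)
    return FP_SUBNORMAL;
  return mantissa ? FP_NAN : FP_INFINITE;
}
```

The common case of a normal number leaves after a single masked test, which matches the add/tst/bne sequence in the AArch64 output quoted earlier.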


> +      /* ~(1 << location_sign_bit).
> +	 This creates a mask that can be used to mask out the sign bit.  */
> +      not_sign_mask = fold_build1_loc (loc, BIT_NOT_EXPR, int_arg_type,
> +			fold_build2_loc (loc, LSHIFT_EXPR, int_arg_type,
> +			  const_arg1,
> +			  build_int_cst (int_arg_type, format->signbit_rw)));
Formatting nits.  When you have to wrap a call, the arguments are 
formatted like this

foo (arg, arg, arg, ...
      arg, arg

Given you've got calls to fold_build2_loc, build_int_cst, etc embedded 
inside other calls to fold_build2_loc, I'd just create some temporaries 
to hold the results of the inner calls.  That'll clean up the formatting 
significantly.


> +					     exp, const1));
> +
> +      /* Combine the values together.  */
> +      specials = fold_build3_loc (loc, COND_EXPR, int_type, zero_check, fp_zero,
> +		   fold_build3_loc (loc, COND_EXPR, int_type, exp_lsb_set,
> +		    fold_build3_loc (loc, COND_EXPR, int_type, mantissa_any_set,
> +		      HONOR_NANS (mode) ? fp_nan : fp_normal,
> +		      HONOR_INFINITIES (mode) ? fp_infinite : fp_normal),
> +		    fp_subnormal));
So this implies you're running on generic, not gimple, right?  Otherwise 
you can't generate these kinds of expressions.


> diff --git a/gcc/real.h b/gcc/real.h
> index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
> --- a/gcc/real.h
> +++ b/gcc/real.h
> @@ -161,6 +161,15 @@ struct real_format
>    bool has_signed_zero;
>    bool qnan_msb_set;
>    bool canonical_nan_lsbs_set;
> +
> +  /* This flag indicates whether the format can be used in the optimized
> +     code paths for the __builtin_fpclassify function and friends.
> +     The format has to have the same NaN and INF representation as normal
> +     IEEE floats (e.g. exp must have all bits set), most significant bit must be
> +     sign bit, followed by exp bits of at most 32 bits.  Lastly the floating
> +     point number must be representable as an integer.  The base of the number
> +     also must be base 2.  */
> +  bool is_binary_ieee_compatible;
>    const char *name;
>  };
I think Joseph has already commented on the contents of the initializer 
and a few more cases where we can use the optimized paths.

However, I do have a general question.  There are some targets which 
have FPUs that are basically IEEE, but don't support certain IEEE 
features like NaNs, denorms, etc.

Presumably all that's needed is for those targets to define a hook to 
describe which checks will always be false and you can check the hook's 
return value.  Right?


Can you please include some tests to verify you're getting the initial 
code generation you want?  Ideally there'd be execution tests too where 
you generate one of the special nodes, then call the __builtin and 
verify that you get the expected results back.  The latter in particular 
are key since it'll allow us to catch problems much earlier across the 
wide variety of targets GCC supports.
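
As a rough idea of the kind of execution test meant here, something along
these lines might do.  This is only a sketch, not taken from the patch: the
function names are made up, the dg-do directive is the usual testsuite
convention, and a real test would want float and long double variants too.

```c
/* { dg-do run } */
/* Hypothetical test sketch; not part of the submitted patch.  */
#include <assert.h>
#include <float.h>
#include <math.h>

/* Classify X with the builtin, using the C library's FP_* constants.  */
static int
classify (double x)
{
  return __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL,
                               FP_SUBNORMAL, FP_ZERO, x);
}

/* Feed one value of each class through the builtin.  The volatile
   qualifiers are there to discourage compile-time folding, so the
   generated classification code is what gets exercised.  */
static void
check_fpclassify (void)
{
  volatile double nan_v = __builtin_nan ("");
  volatile double inf_v = __builtin_inf ();
  volatile double norm_v = 1.5;
  volatile double sub_v = DBL_MIN / 2;  /* Below the smallest normal.  */
  volatile double zero_v = -0.0;

  assert (classify (nan_v) == FP_NAN);
  assert (classify (inf_v) == FP_INFINITE);
  assert (classify (norm_v) == FP_NORMAL);
  assert (classify (sub_v) == FP_SUBNORMAL);
  assert (classify (zero_v) == FP_ZERO);
}
```

Calling check_fpclassify () from main would make this a complete run test;
a real testsuite file would more likely call __builtin_abort than assert.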

I think you already had plans to post an updated patch.  Please include 
the fixes noted above in that update.

And just to be clear, I like where this is going, I just think we're 
going to need a couple iterations to iron everything out.

Jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-16 19:53 ` Jeff Law
@ 2016-09-20 12:14   ` Tamar Christina
  2016-09-20 14:52     ` Jeff Law
  0 siblings, 1 reply; 32+ messages in thread
From: Tamar Christina @ 2016-09-20 12:14 UTC (permalink / raw)
  To: Jeff Law, GCC Patches, jakub, rguenther; +Cc: nd



On 16/09/16 20:49, Jeff Law wrote:
> On 09/12/2016 10:19 AM, Tamar Christina wrote:
>> Hi All,
>> +
>> +      /* Re-interpret the float as an unsigned integer type
>> +     with equal precision.  */
>> +      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
>> +      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
>> +          fold_build1_loc (loc, NOP_EXPR,
>> +                   build_pointer_type (int_arg_type),
>> +            fold_build1_loc (loc, ADDR_EXPR,
>> +                     build_pointer_type (type), arg)));
> Doesn't this make ARG addressable?  Which in turn means ARG won't be 
> exposed to the gimple/ssa optimizers.    Or is it the case that when 
> fpclassify is used its argument is already in memory (and thus 
> addressable?)
>
I believe that it is the case that when fpclassify is used the argument 
is already addressable, but I am not 100% certain. I may be able to do 
this differently, so I'll come back to you on this one.
>> +                         exp, const1));
>> +
>> +      /* Combine the values together.  */
>> +      specials = fold_build3_loc (loc, COND_EXPR, int_type, zero_check, fp_zero,
>> +           fold_build3_loc (loc, COND_EXPR, int_type, exp_lsb_set,
>> +            fold_build3_loc (loc, COND_EXPR, int_type, mantissa_any_set,
>> +              HONOR_NANS (mode) ? fp_nan : fp_normal,
>> +              HONOR_INFINITIES (mode) ? fp_infinite : fp_normal),
>> +            fp_subnormal));
> So this implies you're running on generic, not gimple, right? 
> Otherwise you can't generate these kinds of expressions.
>

Yes this is generic.

>> diff --git a/gcc/real.h b/gcc/real.h
>> index 59af580e78f2637be84f71b98b45ec6611053222..36ded57cf4db7c30c935bdb24219a167480f39c8 100644
>> --- a/gcc/real.h
>> +++ b/gcc/real.h
>> @@ -161,6 +161,15 @@ struct real_format
>>    bool has_signed_zero;
>>    bool qnan_msb_set;
>>    bool canonical_nan_lsbs_set;
>> +
>> +  /* This flag indicates whether the format can be used in the optimized
>> +     code paths for the __builtin_fpclassify function and friends.
>> +     The format has to have the same NaN and INF representation as normal
>> +     IEEE floats (e.g. exp must have all bits set), most significant bit must be
>> +     sign bit, followed by exp bits of at most 32 bits.  Lastly the floating
>> +     point number must be representable as an integer.  The base of the number
>> +     also must be base 2.  */
>> +  bool is_binary_ieee_compatible;
>>    const char *name;
>>  };
> I think Joseph has already commented on the contents of the 
> initializer and a few more cases where we can use the optimized paths.
>
> However, I do have a general question.  There are some targets which 
> have FPUs that are basically IEEE, but don't support certain IEEE 
> features like NaNs, denorms, etc.
>
> Presumably all that's needed is for those targets to define a hook to 
> describe which checks will always be false and you can check the 
> hook's return value.  Right?
>
Yes, that should be enough. The lack of NaNs and infinities is already 
handled, though, but it is tied to the real format rather than to a 
particular target.
>
> Can you please include some tests to verify you're getting the initial 
> code generation you want?  Ideally there'd be execution tests too 
> where you generate one of the special nodes, then call the __builtin 
> and verify that you get the expected results back. The latter in 
> particular are key since it'll allow us to catch problems much earlier 
> across the wide variety of targets GCC supports.
>
I can add some code generation tests. I believe there are already some 
execution tests, which test both correct and incorrect output.

> I think you already had plans to post an updated patch.  Please 
> include the fixes noted above in that update.

Yes I will include your feedback in it. I'm currently waiting for some 
extra performance numbers.

Thanks,
Tamar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-20 12:14   ` Tamar Christina
@ 2016-09-20 14:52     ` Jeff Law
  2016-09-20 17:52       ` Joseph Myers
  2016-09-21  7:13       ` Richard Biener
  0 siblings, 2 replies; 32+ messages in thread
From: Jeff Law @ 2016-09-20 14:52 UTC (permalink / raw)
  To: Tamar Christina, GCC Patches, jakub, rguenther; +Cc: nd

On 09/20/2016 06:00 AM, Tamar Christina wrote:
>
>
> On 16/09/16 20:49, Jeff Law wrote:
>> On 09/12/2016 10:19 AM, Tamar Christina wrote:
>>> Hi All,
>>> +
>>> +      /* Re-interpret the float as an unsigned integer type
>>> +     with equal precision.  */
>>> +      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
>>> +      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
>>> +          fold_build1_loc (loc, NOP_EXPR,
>>> +                   build_pointer_type (int_arg_type),
>>> +            fold_build1_loc (loc, ADDR_EXPR,
>>> +                     build_pointer_type (type), arg)));
>> Doesn't this make ARG addressable?  Which in turn means ARG won't be
>> exposed to the gimple/ssa optimizers.    Or is it the case that when
>> fpclassify is used its argument is already in memory (and thus
>> addressable?)
>>
> I believe that it is the case that when fpclassify is used the argument
> is already addressable, but I am not 100% certain. I may be able to do
> this differently so I'll come back to you on this one.
The more I think about it, the more I suspect ARG is only going to 
already be marked as addressable if it has already had its address taken.

But I think we can look at this as an opportunity.  If ARG is already 
addressable, then it's most likely going to be living in memory (there 
are exceptions).  If ARG is most likely going to be living in memory, 
then we clearly want to use your fast integer path, regardless of the 
target.

If ARG is not addressable, then it's not as clear, as the object is 
likely going to be assigned into an FP register.  Integer operations on 
an FP register likely will force a sequence where we dump the 
register into memory, load from memory into a GPR, then bit test on the 
GPR.  That gets very expensive on some architectures.

Could we defer lowering in the case where the object is not addressable 
until gimple->rtl expansion time?  That's the best time to introduce 
target dependencies into the code we generate.

Jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-20 14:52     ` Jeff Law
@ 2016-09-20 17:52       ` Joseph Myers
  2016-09-21  7:13       ` Richard Biener
  1 sibling, 0 replies; 32+ messages in thread
From: Joseph Myers @ 2016-09-20 17:52 UTC (permalink / raw)
  To: Jeff Law; +Cc: Tamar Christina, GCC Patches, jakub, rguenther, nd

On Tue, 20 Sep 2016, Jeff Law wrote:

> Could we defer lowering in the case where the object is not addressable until
> gimple->rtl expansion time?  That's the best time to introduce target
> dependencies into the code we generate.

If we do that (remembering that -fsignaling-nans always wants the integer 
path), we need to make sure there are tests of fpclassify that reliably 
exercise both paths....

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-20 14:52     ` Jeff Law
  2016-09-20 17:52       ` Joseph Myers
@ 2016-09-21  7:13       ` Richard Biener
  1 sibling, 0 replies; 32+ messages in thread
From: Richard Biener @ 2016-09-21  7:13 UTC (permalink / raw)
  To: Jeff Law; +Cc: Tamar Christina, GCC Patches, jakub, nd

On Tue, 20 Sep 2016, Jeff Law wrote:

> On 09/20/2016 06:00 AM, Tamar Christina wrote:
> > 
> > 
> > On 16/09/16 20:49, Jeff Law wrote:
> > > On 09/12/2016 10:19 AM, Tamar Christina wrote:
> > > > Hi All,
> > > > +
> > > > +      /* Re-interpret the float as an unsigned integer type
> > > > +     with equal precision.  */
> > > > +      int_arg_type = build_nonstandard_integer_type (TYPE_PRECISION (type), 0);
> > > > +      int_arg = fold_build1_loc (loc, INDIRECT_REF, int_arg_type,
> > > > +          fold_build1_loc (loc, NOP_EXPR,
> > > > +                   build_pointer_type (int_arg_type),
> > > > +            fold_build1_loc (loc, ADDR_EXPR,
> > > > +                     build_pointer_type (type), arg)));
> > > Doesn't this make ARG addressable?  Which in turn means ARG won't be
> > > exposed to the gimple/ssa optimizers.    Or is it the case that when
> > > fpclassify is used its argument is already in memory (and thus
> > > addressable?)
> > > 
> > I believe that it is the case that when fpclassify is used the argument
> > is already addressable, but I am not 100% certain. I may be able to do
> > this differently so I'll come back to you on this one.
> The more I think about it, the more I suspect ARG is only going to already be
> marked as addressable if it has already had its address taken.

Sure, if it has its address taken ... but I don't see how
fpclassify requires the arg to be address taken.

> But I think we can look at this as an opportunity.  If ARG is already
> addressable, then it's most likely going to be living in memory (there are
> exceptions).  If ARG is most likely going to be living in memory, then we
> clearly want to use your fast integer path, regardless of the target.
> 
> If ARG is not addressable, then it's not as clear, as the object is likely
> going to be assigned into an FP register.  Integer operations on an FP
> register likely will force a sequence where we dump the register into memory,
> load from memory into a GPR, then bit test on the GPR.  That gets very
> expensive on some architectures.
> 
> Could we defer lowering in the case where the object is not addressable until
> gimple->rtl expansion time?  That's the best time to introduce target
> dependencies into the code we generate.

Note that GIMPLE doesn't require something to be addressable just because
you access random pieces of it.  The IL has tricks like allowing
MEM[&decl + CST] w/o actually marking decl TREE_ADDRESSABLE (and the
expanders trying to cope with that) and there is of course
BIT_FIELD_REF which you can use to extract arbitrary bits off any
entity without it living in memory (and again the expanders trying to
cope with that).
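
As a source-level analogy only (this is user C, not the GENERIC/GIMPLE the
patch builds), the same reinterpretation can be written without the
pointer cast used in the patch.  The helper names below are invented for
illustration, and whether a given compiler keeps the value out of memory is
an assumption that would need checking in the generated code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (float) == sizeof (uint32_t),
               "sketch assumes a 32-bit float");

/* Copy the object representation of X into an integer of equal size.
   Compilers commonly turn this fixed-size copy into a plain move.  */
static uint32_t
float_bits_memcpy (float x)
{
  uint32_t bits;
  memcpy (&bits, &x, sizeof bits);
  return bits;
}

/* The same reinterpretation through a union, which GCC documents as a
   supported form of type punning.  */
static uint32_t
float_bits_union (float x)
{
  union { float f; uint32_t u; } pun;
  pun.f = x;
  return pun.u;
}

/* Extract the biased exponent field, assuming IEEE binary32 layout
   (1 sign bit, 8 exponent bits, 23 mantissa bits).  */
static unsigned
float_exponent_field (float x)
{
  return (float_bits_memcpy (x) >> 23) & 0xffu;
}
```

Both helpers are expected to return the same bits for any input; the
exponent helper is the user-level counterpart of pulling a bit range out of
the value.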

So may I suggest to move the "folding" from builtins.c to gimplify.c
and simply emit GIMPLE directly there?  That would make it also clearer
that we are dealing with a lowering process rather than a "folding".

Doing it in GIMPLE lowering is another possibility - we lower things
like posix_memalign and setjmp there as well.

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-12 16:21 Tamar Christina
                   ` (4 preceding siblings ...)
  2016-09-16 19:53 ` Jeff Law
@ 2016-09-19 22:43 ` Michael Meissner
       [not found]   ` <41217f33-3861-dbb8-2f11-950ab30a7021@arm.com>
  5 siblings, 1 reply; 32+ messages in thread
From: Michael Meissner @ 2016-09-19 22:43 UTC (permalink / raw)
  To: Tamar Christina; +Cc: GCC Patches, jakub, rguenther, law, nd

On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
> Hi All,
> 
> This patch adds an optimized route to the fpclassify builtin
> for floating point numbers which are similar to IEEE-754 in format.
> 
> The goal is to make it faster by:
> 1. Trying to determine the most common case first
>    (e.g. the float is a Normal number) and then the
>    rest. The amount of code generated at -O2 are
>    about the same +/- 1 instruction, but the code
>    is much better.
> 2. Using integer operation in the optimized path.
> 
> At a high level, the optimized path uses integer operations
> to perform the following:
> 
>   if (exponent bits aren't all set or unset)
>      return Normal;
>   else if (no bits are set on the number after masking out
> 	   sign bits then)
>      return Zero;
>   else if (exponent has no bits set)
>      return Subnormal;
>   else if (mantissa has no bits set)
>      return Infinite;
>   else
>      return NaN;

I haven't looked at fpclassify.  I assume we can define a backend insn to do
the right thing?  One of the things we've noticed over the years with the
PowerPC is that it can be rather expensive to move things from the floating
point/vector unit to the integer registers and vice versa.  This is
particularly true if you have to do the transfer through the memory unit via
stores and loads of different sizes.
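
For reference, the integer sequence quoted above can be sketched in
portable C roughly as follows.  This is only an illustration under stated
assumptions: it handles IEEE binary64 double only, the masks and the
function name are made up here, and it does not claim to match the exact
trees the patch generates.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (double) == sizeof (uint64_t),
               "sketch assumes a 64-bit double");

#define SIGN_CLEAR UINT64_C (0x7fffffffffffffff)
#define EXP_MASK   UINT64_C (0x7ff0000000000000)
#define MANT_MASK  UINT64_C (0x000fffffffffffff)

/* Classify X using integer operations only, testing the common
   "normal" case first.  */
static int
fpclassify_int (double x)
{
  uint64_t bits;
  memcpy (&bits, &x, sizeof bits);
  bits &= SIGN_CLEAR;               /* Mask out the sign bit.  */
  uint64_t exp = bits & EXP_MASK;

  if (exp != 0 && exp != EXP_MASK)  /* Exponent neither all 0s nor all 1s.  */
    return FP_NORMAL;
  if (bits == 0)                    /* Nothing set besides the sign.  */
    return FP_ZERO;
  if (exp == 0)                     /* Zero exponent, nonzero mantissa.  */
    return FP_SUBNORMAL;
  if ((bits & MANT_MASK) == 0)      /* All-ones exponent, zero mantissa.  */
    return FP_INFINITE;
  return FP_NAN;
}
```

A normal value leaves after the first test, which is the early exit the
description refers to; the cost of getting the bits into an integer
register is exactly the part that varies by target.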

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

^ permalink raw reply	[flat|nested] 32+ messages in thread

[parent not found: <41217f33-3861-dbb8-2f11-950ab30a7021@arm.com>]

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
       [not found]   ` <41217f33-3861-dbb8-2f11-950ab30a7021@arm.com>
@ 2016-09-20 21:27     ` Michael Meissner
  2016-09-21  2:05       ` Joseph Myers
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Meissner @ 2016-09-20 21:27 UTC (permalink / raw)
  To: Tamar Christina; +Cc: Michael Meissner, GCC Patches, jakub, rguenther, law, nd

On Tue, Sep 20, 2016 at 01:19:07PM +0100, Tamar Christina wrote:
> On 19/09/16 23:16, Michael Meissner wrote:
> >On Mon, Sep 12, 2016 at 04:19:32PM +0000, Tamar Christina wrote:
> >>Hi All,
> >>
> >>This patch adds an optimized route to the fpclassify builtin
> >>for floating point numbers which are similar to IEEE-754 in format.
> >>
> >>The goal is to make it faster by:
> >>1. Trying to determine the most common case first
> >>    (e.g. the float is a Normal number) and then the
> >>    rest. The amount of code generated at -O2 are
> >>    about the same +/- 1 instruction, but the code
> >>    is much better.
> >>2. Using integer operation in the optimized path.
> >>
> >>At a high level, the optimized path uses integer operations
> >>to perform the following:
> >>
> >>   if (exponent bits aren't all set or unset)
> >>      return Normal;
> >>   else if (no bits are set on the number after masking out
> >>        sign bits then)
> >>      return Zero;
> >>   else if (exponent has no bits set)
> >>      return Subnormal;
> >>   else if (mantissa has no bits set)
> >>      return Infinite;
> >>   else
> >>      return NaN;
> >I haven't looked at fpclassify.  I assume we can define a backend insn to do
> >the right thing?  One of the things we've noticed over the years with the
> >PowerPC is that it can be rather expensive to move things from the floating
> >point/vector unit to the integer registers and vice versa.  This is
> >particularly true if you have to do the transfer through the memory unit via
> >stores and loads of different sizes.
> >
> Hmm, what do you mean with the right thing? Do you mean never to use the
> integer version?

On the forthcoming PowerPC with ISA 3.0 (power9), we have different ways to do
classification within the floating point unit.

For example, there is the XSTSTDCDP instruction that can set a condition code
register to whether the value is 0, NaN, Infinity, Denormal.  We might come up
with a clever set of tests to use 4 of these instructions to return the
appropriate FP_<xxx>.

Even if we want to do it by looking at the exponent, ISA 3.0 defines
instructions like XSXEXPDP that extracts the exponent from a double precision
value and returns it in a GPR register.

> If so then no, it currently determines it based on the format.
> I could potentially add a hook to allow backends to opt-in/out if
> there's a concern this might be slower.

It would be better to have a fpclassify<mode>2 pattern, and if it isn't
defined, then do the machine independent processing.  That is the way it is
done elsewhere.

> Though is the move that much slower that it negates the benefits we
> should get from not having to do
> 4 branches in the normal case?

It depends.  We have a lot of other stuff for ISA 3.0 on our plates, and
truthfully, we won't be able to answer the question about performance until we
get real hardware, but I would prefer not to be locked into an existing
implementation.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-20 21:27     ` Michael Meissner
@ 2016-09-21  2:05       ` Joseph Myers
  2016-09-21  8:32         ` Richard Biener
  0 siblings, 1 reply; 32+ messages in thread
From: Joseph Myers @ 2016-09-21  2:05 UTC (permalink / raw)
  To: Michael Meissner; +Cc: Tamar Christina, GCC Patches, jakub, rguenther, law, nd

On Tue, 20 Sep 2016, Michael Meissner wrote:

> It would be better to have a fpclassify<mode>2 pattern, and if it isn't
> defined, then do the machine independent processing.  That is the way it is
> done elsewhere.

But note:

* The __builtin_fpclassify function takes arguments for all the possible 
FP_* results, so the insn pattern would need to map the results to the 
arguments passed to __builtin_fpclassify.  (They are documented as needing 
to be constants, of type int.)

* Then you want that mapping step to get optimized away in the case of a 
comparison fpclassify (...) == FP_SUBNORMAL (for example), or a switch 
over possible results.  Will the RTL optimizers do that given the insns 
structured appropriately?

(For that matter, I don't know if the GIMPLE optimizers will optimize away 
such a mapping either, but they clearly should.  I've wondered what the 
right approach would be for making FLT_ROUNDS properly depend on the 
rounding mode - bug 30569, 
<https://gcc.gnu.org/ml/gcc/2013-11/msg00317.html> - where the same issues 
apply.  For boolean operations such as isnan you don't have such 
complications.)

* If flag_signaling_nans, then any pattern should work for signaling NaNs.
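
To make the first point concrete, here is a small sketch of the builtin
called with caller-chosen result codes.  The MY_* values and function names
are hypothetical; the only thing taken from the GCC documentation is the
argument order (NaN, infinite, normal, subnormal, zero, then the value).

```c
#include <assert.h>

/* Hypothetical result codes; any integer constants would do.  */
#define MY_NAN       10
#define MY_INF       11
#define MY_NORMAL    12
#define MY_SUBNORMAL 13
#define MY_ZERO      14

/* The builtin returns whichever of its first five arguments matches the
   class of X, so an insn pattern producing fixed codes would need a
   mapping step back to these values.  */
static int
classify_custom (double x)
{
  return __builtin_fpclassify (MY_NAN, MY_INF, MY_NORMAL,
                               MY_SUBNORMAL, MY_ZERO, x);
}

/* Spot-check the mapping for one value of four of the classes.  */
static void
check_mapping (void)
{
  assert (classify_custom (0.0) == MY_ZERO);
  assert (classify_custom (1.0) == MY_NORMAL);
  assert (classify_custom (__builtin_inf ()) == MY_INF);
  assert (classify_custom (__builtin_nan ("")) == MY_NAN);
}
```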

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] Optimise the fpclassify builtin to perform integer operations when possible
  2016-09-21  2:05       ` Joseph Myers
@ 2016-09-21  8:32         ` Richard Biener
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Biener @ 2016-09-21  8:32 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Michael Meissner, Tamar Christina, GCC Patches, jakub, law, nd

On Wed, 21 Sep 2016, Joseph Myers wrote:

> On Tue, 20 Sep 2016, Michael Meissner wrote:
> 
> > It would be better to have a fpclassify<mode>2 pattern, and if it isn't
> > defined, then do the machine independent processing.  That is the way it is
> > done elsewhere.
> 
> But note:
> 
> * The __builtin_fpclassify function takes arguments for all the possible 
> FP_* results, so the insn pattern would need to map the results to the 
> arguments passed to __builtin_fpclassify.  (They are documented as needing 
> to be constants, of type int.)

Yeah, that's the reason we "lower" this early.

> * Then you want that mapping step to get optimized away in the case of a 
> comparison fpclassify (...) == FP_SUBNORMAL (for example), or a switch 
> over possible results.  Will the RTL optimizers do that given the insns 
> structured appropriately?

I think it makes sense to fold fpclassify (...) == N to more specific
classification builtins that do not have this issue if possible.  OTOH
RTL expansion could detect some of the non-builtin ways to do such checks
and see if an optab exists as well.
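
A sketch of the equivalences such a fold would rely on, written as a user
level check.  The function names are made up; the builtins themselves
(__builtin_isnan, __builtin_isinf, __builtin_isnormal, __builtin_isfinite)
are existing GCC builtins.  FP_ZERO and FP_SUBNORMAL are left out here, as
this sketch does not assume a dedicated builtin for them.

```c
#include <assert.h>
#include <math.h>

/* Classify X with the builtin, using the C library's FP_* constants.  */
static int
classify (double x)
{
  return __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL,
                               FP_SUBNORMAL, FP_ZERO, x);
}

/* Assert that each comparison against an FP_* constant agrees with the
   more specific builtin answering the same question.  */
static void
check_equivalence (double x)
{
  int c = classify (x);

  assert ((c == FP_NAN) == (__builtin_isnan (x) != 0));
  assert ((c == FP_INFINITE) == (__builtin_isinf (x) != 0));
  assert ((c == FP_NORMAL) == (__builtin_isnormal (x) != 0));
  assert ((c != FP_NAN && c != FP_INFINITE)
          == (__builtin_isfinite (x) != 0));
}
```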

> (For that matter, I don't know if the GIMPLE optimizers will optimize away 
> such a mapping either, but they clearly should.  I've wondered what the 
> right approach would be for making FLT_ROUNDS properly depend on the 
> rounding mode - bug 30569, 
> <https://gcc.gnu.org/ml/gcc/2013-11/msg00317.html> - where the same issues 
> apply.  For boolean operations such as isnan you don't have such 
> complications.)

I think they do via jump-threading.

Richard.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-09-21 14:49 UTC | newest]

Thread overview: 32+ messages
2016-09-13 12:16 [PATCH] Optimise the fpclassify builtin to perform integer operations when possible Wilco Dijkstra
2016-09-13 16:10 ` Joseph Myers
2016-09-21 14:51 ` Richard Earnshaw (lists)
  -- strict thread matches above, loose matches on Subject: below --
2016-09-12 17:24 Moritz Klammler
2016-09-12 20:08 ` Andrew Pinski
2016-09-12 16:21 Tamar Christina
2016-09-12 22:33 ` Joseph Myers
2016-09-13 12:25   ` Tamar Christina
2016-09-12 22:41 ` Joseph Myers
2016-09-13 12:30   ` Tamar Christina
2016-09-13 12:44     ` Joseph Myers
2016-09-15  9:08       ` Tamar Christina
2016-09-15 11:21         ` Wilco Dijkstra
2016-09-15 12:56           ` Joseph Myers
2016-09-15 13:05         ` Joseph Myers
2016-09-12 22:49 ` Joseph Myers
2016-09-13 12:33   ` Tamar Christina
2016-09-13 12:48     ` Joseph Myers
2016-09-13  8:58 ` Jakub Jelinek
2016-09-13 16:16   ` Jeff Law
2016-09-14  8:31     ` Richard Biener
2016-09-15 16:02       ` Jeff Law
2016-09-15 16:28         ` Richard Biener
2016-09-16 19:53 ` Jeff Law
2016-09-20 12:14   ` Tamar Christina
2016-09-20 14:52     ` Jeff Law
2016-09-20 17:52       ` Joseph Myers
2016-09-21  7:13       ` Richard Biener
2016-09-19 22:43 ` Michael Meissner
     [not found]   ` <41217f33-3861-dbb8-2f11-950ab30a7021@arm.com>
2016-09-20 21:27     ` Michael Meissner
2016-09-21  2:05       ` Joseph Myers
2016-09-21  8:32         ` Richard Biener
