[RFC] Implementing detection of saturation and rounding arithmetic

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFC] Implementing detection of saturation and rounding arithmetic
@ 2021-05-11  5:37 Tamar Christina
  2021-05-11 10:03 ` David Brown
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Tamar Christina @ 2021-05-11  5:37 UTC (permalink / raw)
  To: gcc; +Cc: Richard Biener, Richard Sandiford

Hi All,

We are looking to implement saturation support in the compiler.  The aim is to
recognize both Scalar and Vector variant of typical saturating expressions.

As an example:

1. Saturating addition:
   char sat (char a, char b)
   {
      int tmp = a + b;
      return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
   }

2. Saturating abs:
   char sat (char a)
   {
      int tmp = abs (a);
      return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
   }

3. Rounding shifts
   char rndshift (char dc)
   {
      int round_const = 1 << (shift - 1);
      return (dc + round_const) >> shift;
   }

etc.

Of course the first issue is that C does not really have a single idiom for
expressing this.

At the RTL level we have ss_truncate and us_truncate and float_truncate for
truncation.

At the Tree level we have nothing for truncation (I believe) for scalars. For
Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
nothing actually generates this at the moment. it's just an unused tree code.

For rounding there doesn't seem to be any existing infrastructure.

The proposal to handle these are as follow, keep in mind that all of these also
exist in their scalar form, as such detecting them in the vectorizer would be
the wrong place.

1. Rounding:
   a) Use match.pd to rewrite various rounding idioms to shifts.
   b) Use backwards or forward prop to rewrite these to internal functions
      where even if the target does not support these rounding instructions they
      have a chance to provide a more efficient implementation than what would
      be generated normally.

2. Saturation:
   a) Use match.pd to rewrite the various saturation expressions into min/max
      operations which opens up the expressions to further optimizations.
   b) Use backwards or forward prop to convert to internal functions if the
      resulting min/max expression still meet the criteria for being a
      saturating expression.  This follows the algorithm as outlined in "The
      Software Vectorization handbook" by Aart J.C. Bik.

      We could get the right instructions by using combine if we don't rewrite
      the instructions to an internal function, however then during Vectorization
      we would overestimate the cost of performing the saturation.  The constants
      will the also be loaded into registers and so becomes a lot more difficult
      to cleanup solely in the backend.

The one thing I am wondering about is whether we would need an internal function
for all operations supported, or if it should be modelled as an internal FN which
just "marks" the operation as rounding/saturating. After all, the only difference
between a normal and saturating expression in RTL is the xx_truncate RTL surrounding
the expression.  Doing so would also mean that all targets whom have saturating
instructions would automatically benefit from this.

But it does mean a small adjustment to the costing, which would need to cost the
internal function call and the argument together as a whole.

Any feedback is appreciated to minimize the number of changes required to the
final patch.  Any objections to the outlined approach?

Thanks,
Tamar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11  5:37 [RFC] Implementing detection of saturation and rounding arithmetic Tamar Christina
@ 2021-05-11 10:03 ` David Brown
  2021-05-11 17:00   ` Joseph Myers
  2021-05-12  8:00   ` Tamar Christina
  2021-05-11 11:44 ` Richard Biener
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 17+ messages in thread
From: David Brown @ 2021-05-11 10:03 UTC (permalink / raw)
  To: Tamar Christina, gcc; +Cc: Richard Sandiford, Richard Biener

On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
> Hi All,
> 
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
> 
> As an example:
> 
> 1. Saturating addition:
>    char sat (char a, char b)
>    {
>       int tmp = a + b;
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }
> 
> 2. Saturating abs:
>    char sat (char a)
>    {
>       int tmp = abs (a);
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }
> 
> 3. Rounding shifts
>    char rndshift (char dc)
>    {
>       int round_const = 1 << (shift - 1);
>       return (dc + round_const) >> shift;
>    }
> 
> etc.
> 

I can't comment on the implementation part - I don't know anything about it.

However, in your examples above I see a few points.

One is your use of "char".  "char" is a type that varies in signedness
from target to target (and also depending on compiler flags), and is
slightly different in C and C++ ('a' has char type in C++, int type in
C).  If you must use "char" in arithmetic contexts, I recommend using
"signed char" or "unsigned char" explicitly.

I would rather recommend you use the size-specific <stdint.h> types -
int8_t, etc., - as being more appropriate for this kind of thing.
(AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to
see the sizes you need for the "tmp" value as you make functions for
bigger sizes - remember that on some gcc targets, "int" is 16-bit.

It is also worth noting that gcc already has support for saturating
types on some targets:

<https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>

My testing of these (quite a long time ago) left me with a feeling that
it was not a feature anyone had worked hard to optimise - certainly it
did not make use of saturating arithmetic instructions available on some
of the targets I tested (ARM Cortex M4, for example).  But it is
possible that there are things here that would be of use to you.  (I am
not convinced that it is worth spending time optimising the
implementation of these - I don't think the N1169 types are much used by
anyone.)

While it is always good that the compiler can spot patterns in generic C
code and generate optimal instruction sequences, another possibility
here would be a set of built-in functions for saturated and rounding
arithmetic.  That would take the guesswork out of it for users - if
there code requires efficient saturated addition, they can use
__builtin_add_sat and get the best their target can offer (just like
__builtin_add_overflow, and that kind of thing).  And it might be easier
to implement in the compiler.

I hope these comments give you a few ideas or useful thoughts.

David

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11  5:37 [RFC] Implementing detection of saturation and rounding arithmetic Tamar Christina
  2021-05-11 10:03 ` David Brown
@ 2021-05-11 11:44 ` Richard Biener
  2021-05-12  9:13   ` Tamar Christina
  2021-05-11 15:43 ` Segher Boessenkool
  2021-05-12  8:47 ` Richard Sandiford
  3 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2021-05-11 11:44 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc, Richard Sandiford, Jakub Jelinek

On Tue, 11 May 2021, Tamar Christina wrote:

> Hi All,
> 
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
> 
> As an example:
> 
> 1. Saturating addition:
>    char sat (char a, char b)
>    {
>       int tmp = a + b;
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }
> 
> 2. Saturating abs:
>    char sat (char a)
>    {
>       int tmp = abs (a);
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }
> 
> 3. Rounding shifts
>    char rndshift (char dc)
>    {
>       int round_const = 1 << (shift - 1);
>       return (dc + round_const) >> shift;
>    }
> 
> etc.
> 
> Of course the first issue is that C does not really have a single idiom for
> expressing this.
> 
> At the RTL level we have ss_truncate and us_truncate and float_truncate for
> truncation.
> 
> At the Tree level we have nothing for truncation (I believe) for scalars. For
> Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> nothing actually generates this at the moment. it's just an unused tree code.
> 
> For rounding there doesn't seem to be any existing infrastructure.
> 
> The proposal to handle these are as follow, keep in mind that all of these also
> exist in their scalar form, as such detecting them in the vectorizer would be
> the wrong place.
> 
> 1. Rounding:
>    a) Use match.pd to rewrite various rounding idioms to shifts.
>    b) Use backwards or forward prop to rewrite these to internal functions
>       where even if the target does not support these rounding instructions they
>       have a chance to provide a more efficient implementation than what would
>       be generated normally.
> 
> 2. Saturation:
>    a) Use match.pd to rewrite the various saturation expressions into min/max
>       operations which opens up the expressions to further optimizations.
>    b) Use backwards or forward prop to convert to internal functions if the
>       resulting min/max expression still meet the criteria for being a
>       saturating expression.  This follows the algorithm as outlined in "The
>       Software Vectorization handbook" by Aart J.C. Bik.
> 
>       We could get the right instructions by using combine if we don't rewrite
>       the instructions to an internal function, however then during Vectorization
>       we would overestimate the cost of performing the saturation.  The constants
>       will the also be loaded into registers and so becomes a lot more difficult
>       to cleanup solely in the backend.
> 
> The one thing I am wondering about is whether we would need an internal function
> for all operations supported, or if it should be modelled as an internal FN which
> just "marks" the operation as rounding/saturating. After all, the only difference
> between a normal and saturating expression in RTL is the xx_truncate RTL surrounding
> the expression.  Doing so would also mean that all targets whom have saturating
> instructions would automatically benefit from this.
> 
> But it does mean a small adjustment to the costing, which would need to cost the
> internal function call and the argument together as a whole.
> 
> Any feedback is appreciated to minimize the number of changes required to the
> final patch.  Any objections to the outlined approach?

I think it makes sense to pattern-match the operations on GIMPLE
and follow the approach take by __builtin_add_overflow & friends.
Maybe quickly check whether clang provides some builtins already
which we could implement.

There's some appeal to mimicing what RTL does - thus have
the saturation be represented as saturating truncation.
Maybe that's what users expect of builtins as well.

I'm not sure what the rounding shift would do - 'shift' isn't
an argument to rndshift here.  It feels like it's a
rounding division but only by powers of two.  Does
ROUND_DIV_EXPR already provide the desired semantics?

Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11  5:37 [RFC] Implementing detection of saturation and rounding arithmetic Tamar Christina
  2021-05-11 10:03 ` David Brown
  2021-05-11 11:44 ` Richard Biener
@ 2021-05-11 15:43 ` Segher Boessenkool
  2021-05-12  9:13   ` Tamar Christina
  2021-05-12  8:47 ` Richard Sandiford
  3 siblings, 1 reply; 17+ messages in thread
From: Segher Boessenkool @ 2021-05-11 15:43 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc, Richard Sandiford, Richard Biener

Hi!

On Tue, May 11, 2021 at 05:37:34AM +0000, Tamar Christina via Gcc wrote:
> 2. Saturating abs:
>    char sat (char a)
>    {
>       int tmp = abs (a);
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }

That can be done quite a bit better, branchless at least.  Same for all
examples here probably.

> 2. Saturation:
>    a) Use match.pd to rewrite the various saturation expressions into min/max
>       operations which opens up the expressions to further optimizations.

You'll have to do the operation in a bigger mode for that.  (This is
also true for rounding, in many cases).

This makes new internal functions more attractive / more feasible.

>       We could get the right instructions by using combine if we don't rewrite
>       the instructions to an internal function, however then during Vectorization
>       we would overestimate the cost of performing the saturation.  The constants
>       will the also be loaded into registers and so becomes a lot more difficult
>       to cleanup solely in the backend.

Combine is almost never the right answer if you want to combine more
than two or three RTL insns.  It can be done, but combine will always
write the combined instruction in simplest terms, which tends to mean
that if you combine more insns there can be very many outcomes that you
all need to recognise as insns in your machine description.

> The one thing I am wondering about is whether we would need an internal function
> for all operations supported, or if it should be modelled as an internal FN which
> just "marks" the operation as rounding/saturating. After all, the only difference
> between a normal and saturating expression in RTL is the xx_truncate RTL surrounding
> the expression.  Doing so would also mean that all targets whom have saturating
> instructions would automatically benefit from this.

I think you will have to experiment with both approaches to get a good
feeling for the tradeoff.


Segher

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11 10:03 ` David Brown
@ 2021-05-11 17:00   ` Joseph Myers
  2021-05-12  7:43     ` David Brown
  2021-05-12  9:13     ` Tamar Christina
  2021-05-12  8:00   ` Tamar Christina
  1 sibling, 2 replies; 17+ messages in thread
From: Joseph Myers @ 2021-05-11 17:00 UTC (permalink / raw)
  To: David Brown; +Cc: Tamar Christina, gcc, Richard Sandiford, Richard Biener

On Tue, 11 May 2021, David Brown wrote:

> It is also worth noting that gcc already has support for saturating
> types on some targets:
> 
> <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
> 
> My testing of these (quite a long time ago) left me with a feeling that
> it was not a feature anyone had worked hard to optimise - certainly it

The implementation isn't well-integrated with any optimizations for 
arithmetic on ordinary integer types / modes, because it has its own 
completely separate machine modes and operations on those.  I still think 
it would be better to have a GIMPLE pass that lowers from fixed-point 
types to saturating etc. operations on ordinary integer types, as I said 
in <https://gcc.gnu.org/legacy-ml/gcc-patches/2011-05/msg00846.html>.

Note however that such lowering should be more or less independent of 
what's being discussed in this thread - this thread is about better 
optimization of such operations on ordinary types (with or without 
built-in functions of some kind in addition to recognition of such 
operations written in generic C), which you can do independently of 
what's done with fixed-point types.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11 17:00   ` Joseph Myers
@ 2021-05-12  7:43     ` David Brown
  2021-05-12  9:13     ` Tamar Christina
  1 sibling, 0 replies; 17+ messages in thread
From: David Brown @ 2021-05-12  7:43 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Tamar Christina, gcc, Richard Sandiford, Richard Biener

On 11/05/2021 19:00, Joseph Myers wrote:
> On Tue, 11 May 2021, David Brown wrote:
> 
>> It is also worth noting that gcc already has support for saturating
>> types on some targets:
>>
>> <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
>>
>> My testing of these (quite a long time ago) left me with a feeling that
>> it was not a feature anyone had worked hard to optimise - certainly it
> 
> The implementation isn't well-integrated with any optimizations for 
> arithmetic on ordinary integer types / modes, because it has its own 
> completely separate machine modes and operations on those.  I still think 
> it would be better to have a GIMPLE pass that lowers from fixed-point 
> types to saturating etc. operations on ordinary integer types, as I said 
> in <https://gcc.gnu.org/legacy-ml/gcc-patches/2011-05/msg00846.html>.

That would make sense (to me anyway, with my limited knowledge),
especially if this work on ordinary integer types pays off.  That would
surely let you simplify the sat/accum/fract type handling while
simultaneously making it work on a wider variety of targets.

> 
> Note however that such lowering should be more or less independent of 
> what's being discussed in this thread - this thread is about better 
> optimization of such operations on ordinary types (with or without 
> built-in functions of some kind in addition to recognition of such 
> operations written in generic C), which you can do independently of 
> what's done with fixed-point types.
> 

Yes, indeed.  I mentioned them for comparison, and in case there were
ideas which could be copied.  I don't think the N1169/TR 18037 types are
much (if ever) used in real code - with Tamar's planned optimisations,
the use-cases for them will be even fewer.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11 10:03 ` David Brown
  2021-05-11 17:00   ` Joseph Myers
@ 2021-05-12  8:00   ` Tamar Christina
  2021-05-12  8:31     ` David Brown
  1 sibling, 1 reply; 17+ messages in thread
From: Tamar Christina @ 2021-05-12  8:00 UTC (permalink / raw)
  To: David Brown, gcc; +Cc: Richard Sandiford, Richard Biener

Hi David, 

> -----Original Message-----
> From: David Brown <david.brown@hesbynett.no>
> Sent: Tuesday, May 11, 2021 11:04 AM
> To: Tamar Christina <Tamar.Christina@arm.com>; gcc@gcc.gnu.org
> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Biener
> <rguenther@suse.de>
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >    char sat (char a, char b)
> >    {
> >       int tmp = a + b;
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 2. Saturating abs:
> >    char sat (char a)
> >    {
> >       int tmp = abs (a);
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 3. Rounding shifts
> >    char rndshift (char dc)
> >    {
> >       int round_const = 1 << (shift - 1);
> >       return (dc + round_const) >> shift;
> >    }
> >
> > etc.
> >
> 
> I can't comment on the implementation part - I don't know anything about it.
> 
> However, in your examples above I see a few points.
> 
> One is your use of "char".  "char" is a type that varies in signedness from
> target to target (and also depending on compiler flags), and is slightly
> different in C and C++ ('a' has char type in C++, int type in C).  If you must use
> "char" in arithmetic contexts, I recommend using "signed char" or "unsigned
> char" explicitly.
> 
> I would rather recommend you use the size-specific <stdint.h> types - int8_t,
> etc., - as being more appropriate for this kind of thing.
> (AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to see the
> sizes you need for the "tmp" value as you make functions for bigger sizes -
> remember that on some gcc targets, "int" is 16-bit.

Indeed, but unfortunately we're targeting existing code that's quite old, so the
C99 fixed sized types may not have been used.

> 
> It is also worth noting that gcc already has support for saturating types on
> some targets:
> 
> <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
> 
> My testing of these (quite a long time ago) left me with a feeling that it was
> not a feature anyone had worked hard to optimise - certainly it did not make
> use of saturating arithmetic instructions available on some of the targets I
> tested (ARM Cortex M4, for example).  But it is possible that there are things
> here that would be of use to you.  (I am not convinced that it is worth
> spending time optimising the implementation of these - I don't think the
> N1169 types are much used by
> anyone.)
> 

I did notice these and was wondering if it makes sense to use them, but I'd need
to check how well supported they are.  For instance would need to check if they
don't block vectorization and stuff like that.

> 
> While it is always good that the compiler can spot patterns in generic C code
> and generate optimal instruction sequences, another possibility here would
> be a set of built-in functions for saturated and rounding arithmetic.  That
> would take the guesswork out of it for users - if there code requires efficient
> saturated addition, they can use __builtin_add_sat and get the best their
> target can offer (just like __builtin_add_overflow, and that kind of thing).
> And it might be easier to implement in the compiler.
> 

We already have neon intrinics for this, but these operations happen quite often
in media processing and machine learning frameworks which don't use intrinsics
or builtins.  So the big gain for us here really is the idiom recognition 😊

Thanks for the feedback!

Cheers,
Tamar

> 
> I hope these comments give you a few ideas or useful thoughts.
> 
> David


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-12  8:00   ` Tamar Christina
@ 2021-05-12  8:31     ` David Brown
  0 siblings, 0 replies; 17+ messages in thread
From: David Brown @ 2021-05-12  8:31 UTC (permalink / raw)
  To: Tamar Christina, gcc; +Cc: Richard Sandiford, Richard Biener

On 12/05/2021 10:00, Tamar Christina wrote:
> Hi David, 
> 
>> -----Original Message-----
>> From: David Brown <david.brown@hesbynett.no>
>> Sent: Tuesday, May 11, 2021 11:04 AM
>> To: Tamar Christina <Tamar.Christina@arm.com>; gcc@gcc.gnu.org
>> Cc: Richard Sandiford <Richard.Sandiford@arm.com>; Richard Biener
>> <rguenther@suse.de>
>> Subject: Re: [RFC] Implementing detection of saturation and rounding
>> arithmetic
>>
>> On 11/05/2021 07:37, Tamar Christina via Gcc wrote:
>>> Hi All,
>>>
>>> We are looking to implement saturation support in the compiler.  The
>>> aim is to recognize both Scalar and Vector variant of typical saturating
>> expressions.
>>>
>>> As an example:
>>>
>>> 1. Saturating addition:
>>>    char sat (char a, char b)
>>>    {
>>>       int tmp = a + b;
>>>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>>>    }
>>>
>>> 2. Saturating abs:
>>>    char sat (char a)
>>>    {
>>>       int tmp = abs (a);
>>>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>>>    }
>>>
>>> 3. Rounding shifts
>>>    char rndshift (char dc)
>>>    {
>>>       int round_const = 1 << (shift - 1);
>>>       return (dc + round_const) >> shift;
>>>    }
>>>
>>> etc.
>>>
>>
>> I can't comment on the implementation part - I don't know anything about it.
>>
>> However, in your examples above I see a few points.
>>
>> One is your use of "char".  "char" is a type that varies in signedness from
>> target to target (and also depending on compiler flags), and is slightly
>> different in C and C++ ('a' has char type in C++, int type in C).  If you must use
>> "char" in arithmetic contexts, I recommend using "signed char" or "unsigned
>> char" explicitly.
>>
>> I would rather recommend you use the size-specific <stdint.h> types - int8_t,
>> etc., - as being more appropriate for this kind of thing.
>> (AFAIK all gcc targets have 8-bit CHAR.)  This also makes it easier to see the
>> sizes you need for the "tmp" value as you make functions for bigger sizes -
>> remember that on some gcc targets, "int" is 16-bit.
> 
> Indeed, but unfortunately we're targeting existing code that's quite old, so the
> C99 fixed sized types may not have been used.

I was thinking of your examples (for testing) and in your implementation
- and you have C99 in those cases.  If you make gcc optimisations that
you confirm work correctly for int8_t, int16_t, int32_t, int64_t, (and
possibly __int128), and the unsigned types, then they will automatically
work with fundamental types (signed char, short, etc.) and any other
typedefs users have in their code.


> 
>>
>> It is also worth noting that gcc already has support for saturating types on
>> some targets:
>>
>> <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
>>
>> My testing of these (quite a long time ago) left me with a feeling that it was
>> not a feature anyone had worked hard to optimise - certainly it did not make
>> use of saturating arithmetic instructions available on some of the targets I
>> tested (ARM Cortex M4, for example).  But it is possible that there are things
>> here that would be of use to you.  (I am not convinced that it is worth
>> spending time optimising the implementation of these - I don't think the
>> N1169 types are much used by
>> anyone.)
>>
> 
> I did notice these and was wondering if it makes sense to use them, but I'd need
> to check how well supported they are.  For instance would need to check if they
> don't block vectorization and stuff like that.
> 

I get the impression that they are likely to fail with vectorisation or
other more advanced optimisations (though I am no expert here).  I
mentioned them in case there is any inspiration or ideas you can copy,
rather than because I think they are useful.  I have not seen any use of
these types in real code, and I think the whole N1169 / TR 18037 was a
case of too little, too inconvenient to use, too late.  Few compilers
support it, fewer programmers use it.  (The syntax for named address
space is used by many embedded compilers, including gcc, but it is a
simple and obvious syntax extension that was used long before that TR.)

>>
>> While it is always good that the compiler can spot patterns in generic C code
>> and generate optimal instruction sequences, another possibility here would
>> be a set of built-in functions for saturated and rounding arithmetic.  That
>> would take the guesswork out of it for users - if there code requires efficient
>> saturated addition, they can use __builtin_add_sat and get the best their
>> target can offer (just like __builtin_add_overflow, and that kind of thing).
>> And it might be easier to implement in the compiler.
>>
> 
> We already have neon intrinics for this, but these operations happen quite often
> in media processing and machine learning frameworks which don't use intrinsics
> or builtins.  So the big gain for us here really is the idiom recognition 😊
> 

Target-specific intrinsics are useful, but it is even better when they
are supported as generic builtins in gcc.  That way they work on all
targets, user source code is more portable, and gcc target maintainers
avoid duplicating effort.  (And I personally will use them on ARM
microcontrollers that don't have Neon - so you are guaranteed at least
one user!)

> Thanks for the feedback!
> 

No problem, and good luck with this work - it sounds great.

David


> Cheers,
> Tamar
> 
>>
>> I hope these comments give you a few ideas or useful thoughts.
>>
>> David
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11  5:37 [RFC] Implementing detection of saturation and rounding arithmetic Tamar Christina
                   ` (2 preceding siblings ...)
  2021-05-11 15:43 ` Segher Boessenkool
@ 2021-05-12  8:47 ` Richard Sandiford
  2021-05-12  9:13   ` Tamar Christina
  2021-05-12  9:27   ` Richard Biener
  3 siblings, 2 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-05-12  8:47 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc, Richard Biener

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi All,
>
> We are looking to implement saturation support in the compiler.  The aim is to
> recognize both Scalar and Vector variant of typical saturating expressions.
>
> As an example:
>
> 1. Saturating addition:
>    char sat (char a, char b)
>    {
>       int tmp = a + b;
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }
>
> 2. Saturating abs:
>    char sat (char a)
>    {
>       int tmp = abs (a);
>       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
>    }
>
> 3. Rounding shifts
>    char rndshift (char dc)
>    {
>       int round_const = 1 << (shift - 1);
>       return (dc + round_const) >> shift;
>    }
>
> etc.
>
> Of course the first issue is that C does not really have a single idiom for
> expressing this.
>
> At the RTL level we have ss_truncate and us_truncate and float_truncate for
> truncation.
>
> At the Tree level we have nothing for truncation (I believe) for scalars. For
> Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> nothing actually generates this at the moment. it's just an unused tree code.
>
> For rounding there doesn't seem to be any existing infrastructure.
>
> The proposal to handle these are as follow, keep in mind that all of these also
> exist in their scalar form, as such detecting them in the vectorizer would be
> the wrong place.
>
> 1. Rounding:
>    a) Use match.pd to rewrite various rounding idioms to shifts.
>    b) Use backwards or forward prop to rewrite these to internal functions
>       where even if the target does not support these rounding instructions they
>       have a chance to provide a more efficient implementation than what would
>       be generated normally.
>
> 2. Saturation:
>    a) Use match.pd to rewrite the various saturation expressions into min/max
>       operations which opens up the expressions to further optimizations.
>    b) Use backwards or forward prop to convert to internal functions if the
>       resulting min/max expression still meet the criteria for being a
>       saturating expression.  This follows the algorithm as outlined in "The
>       Software Vectorization handbook" by Aart J.C. Bik.
>
>       We could get the right instructions by using combine if we don't rewrite
>       the instructions to an internal function, however then during Vectorization
>       we would overestimate the cost of performing the saturation.  The constants
>       will the also be loaded into registers and so becomes a lot more difficult
>       to cleanup solely in the backend.
>
> The one thing I am wondering about is whether we would need an internal function
> for all operations supported, or if it should be modelled as an internal FN which
> just "marks" the operation as rounding/saturating. After all, the only difference
> between a normal and saturating expression in RTL is the xx_truncate RTL surrounding
> the expression.  Doing so would also mean that all targets whom have saturating
> instructions would automatically benefit from this.

I might have misunderstood what you meant here, but the *_truncate
RTL codes are true truncations: the operand has to be wider than the
result.  Using this representation for general arithmetic is a problem
if you're operating at the maximum size that the target supports natively.
E.g. representing a 64-bit saturating addition as:

  - extend to 128 bits
  - do a 128-bit addition
  - truncate to 64 bits

is going to be hard to cost and code-generate on targets that don't support
native 128-bit operations (or at least, don't support them cheaply).
This might not be a problem when recognising C idioms, since the C source
code has to be able do the wider operation before truncating the result,
but it could be a problem if we provide built-in functions or if we want
to introduce compiler-generated saturating operations.

RTL already has per-operation saturation such as ss_plus/us_plus,
ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
ss_ashift/us_ashift and ss_abs.  I think we should do the same
in gimple, using internal functions like you say.

Thanks,
Richard


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11 11:44 ` Richard Biener
@ 2021-05-12  9:13   ` Tamar Christina
  2021-05-12 10:28     ` Richard Biener
  2021-05-12 13:06     ` Liu Hao
  0 siblings, 2 replies; 17+ messages in thread
From: Tamar Christina @ 2021-05-12  9:13 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, Richard Sandiford, Jakub Jelinek



> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, May 11, 2021 12:45 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc@gcc.gnu.org; Richard Sandiford <Richard.Sandiford@arm.com>;
> Jakub Jelinek <jakub@redhat.com>
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On Tue, 11 May 2021, Tamar Christina wrote:
> 
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >    char sat (char a, char b)
> >    {
> >       int tmp = a + b;
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 2. Saturating abs:
> >    char sat (char a)
> >    {
> >       int tmp = abs (a);
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 3. Rounding shifts
> >    char rndshift (char dc)
> >    {
> >       int round_const = 1 << (shift - 1);
> >       return (dc + round_const) >> shift;
> >    }
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single
> > idiom for expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and
> > float_truncate for truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for
> > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > but it looks like nothing actually generates this at the moment. it's just an
> unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of
> > these also exist in their scalar form, as such detecting them in the
> > vectorizer would be the wrong place.
> >
> > 1. Rounding:
> >    a) Use match.pd to rewrite various rounding idioms to shifts.
> >    b) Use backwards or forward prop to rewrite these to internal functions
> >       where even if the target does not support these rounding instructions
> they
> >       have a chance to provide a more efficient implementation than what
> would
> >       be generated normally.
> >
> > 2. Saturation:
> >    a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >       operations which opens up the expressions to further optimizations.
> >    b) Use backwards or forward prop to convert to internal functions if the
> >       resulting min/max expression still meet the criteria for being a
> >       saturating expression.  This follows the algorithm as outlined in "The
> >       Software Vectorization handbook" by Aart J.C. Bik.
> >
> >       We could get the right instructions by using combine if we don't rewrite
> >       the instructions to an internal function, however then during
> Vectorization
> >       we would overestimate the cost of performing the saturation.  The
> constants
> >       will the also be loaded into registers and so becomes a lot more difficult
> >       to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> >
> > But it does mean a small adjustment to the costing, which would need
> > to cost the internal function call and the argument together as a whole.
> >
> > Any feedback is appreciated to minimize the number of changes required
> > to the final patch.  Any objections to the outlined approach?
> 
> I think it makes sense to pattern-match the operations on GIMPLE and follow
> the approach take by __builtin_add_overflow & friends.
> Maybe quickly check whether clang provides some builtins already which we
> could implement.
> 

Hmm so taking a look at __builtin_add_overflow and friends it looks like they use
Internal functions like ADD_OVERFLOW. Biggest difference being that it sets a condition
flag.  This does me wonder if something like

int f (int a, int b)
{
    int res;
    if (__builtin_add_overflow (a, b, &res))
      {
          if (res >= 0)
            return INT_MAX;
          else
            return INT_MIN;
      }
    return res;
}

Should be recognized as saturating as well.  But yeah, following the same approach we
would rewrite the sequence to something like res = .ADD_SAT (a, b);

I know that in the past there was concerns about doing such pattern matching early which
is why I thought about following the same approach we do with the table based ctz recognition
and do it in backprop.

For completeness, the OVERFLOW version of ADD in AArch64 maps to ADDS, but the saturating to SQADD.

> There's some appeal to mimicing what RTL does - thus have the saturation be
> represented as saturating truncation.
> Maybe that's what users expect of builtins as well.

That would also certainly cut down on the number of internal FNs we would introduce. But, the
one technical issue I see is that I don't know how to ask the target if it supports it or not.

Normally the internal-fn are tied to a single optab right, so I'd need to have something that combines
The internal-fn and the next instruction into an optab so the target can expand to it.  It's because of this
that I wonder whether this added complexity is worth it.  Unless there's already such a mechanism In place?

> 
> I'm not sure what the rounding shift would do - 'shift' isn't an argument to
> rndshift here.  It feels like it's a rounding division but only by powers of two.
> Does ROUND_DIV_EXPR already provide the desired semantics?

Oops, yes shift should have been an argument to the function.
For the given example indeed I think this is ROUND_DIV_EXPR but we have other ones such as
Rounding half-ing addition, Will need to check if that's just a ROUND_DIV_EXPR over an ADD.

For completeness, the AArch64 instruction for this example is SRSHL.

Thanks!

Regards,
Tamar

> 
> Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11 15:43 ` Segher Boessenkool
@ 2021-05-12  9:13   ` Tamar Christina
  2021-05-12 12:20     ` Segher Boessenkool
  0 siblings, 1 reply; 17+ messages in thread
From: Tamar Christina @ 2021-05-12  9:13 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc, Richard Sandiford, Richard Biener

Hi,

> -----Original Message-----
> From: Segher Boessenkool <segher@kernel.crashing.org>
> Sent: Tuesday, May 11, 2021 4:43 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc@gcc.gnu.org; Richard Sandiford <Richard.Sandiford@arm.com>;
> Richard Biener <rguenther@suse.de>
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> Hi!
> 
> On Tue, May 11, 2021 at 05:37:34AM +0000, Tamar Christina via Gcc wrote:
> > 2. Saturating abs:
> >    char sat (char a)
> >    {
> >       int tmp = abs (a);
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> 
> That can be done quite a bit better, branchless at least.  Same for all
> examples here probably.

Do you mean in the example? Sure, I wanted to keep it simple 😊

> 
> > 2. Saturation:
> >    a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >       operations which opens up the expressions to further optimizations.
> 
> You'll have to do the operation in a bigger mode for that.  (This is also true for
> rounding, in many cases).
> 
> This makes new internal functions more attractive / more feasible.

True, but the internal function doesn't need to model the wider mode right?
So if you're saturating an int, the internal-fn would work on int,int,int.

> 
> >       We could get the right instructions by using combine if we don't rewrite
> >       the instructions to an internal function, however then during
> Vectorization
> >       we would overestimate the cost of performing the saturation.  The
> constants
> >       will the also be loaded into registers and so becomes a lot more difficult
> >       to cleanup solely in the backend.
> 
> Combine is almost never the right answer if you want to combine more than
> two or three RTL insns.  It can be done, but combine will always write the
> combined instruction in simplest terms, which tends to mean that if you
> combine more insns there can be very many outcomes that you all need to
> recognise as insns in your machine description.

Yeah, ideally I would like to handle it before it gets to expand.

> 
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> 
> I think you will have to experiment with both approaches to get a good
> feeling for the tradeoff.

Fair enough,

Thanks for the feedback!

Cheers,
Tamar
> 
> 
> Segher

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-11 17:00   ` Joseph Myers
  2021-05-12  7:43     ` David Brown
@ 2021-05-12  9:13     ` Tamar Christina
  1 sibling, 0 replies; 17+ messages in thread
From: Tamar Christina @ 2021-05-12  9:13 UTC (permalink / raw)
  To: Joseph Myers, David Brown; +Cc: gcc, Richard Sandiford, Richard Biener



> -----Original Message-----
> From: Joseph Myers <joseph@codesourcery.com>
> Sent: Tuesday, May 11, 2021 6:01 PM
> To: David Brown <david.brown@hesbynett.no>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; gcc@gcc.gnu.org; Richard
> Sandiford <Richard.Sandiford@arm.com>; Richard Biener
> <rguenther@suse.de>
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> On Tue, 11 May 2021, David Brown wrote:
> 
> > It is also worth noting that gcc already has support for saturating
> > types on some targets:
> >
> > <https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html>
> >
> > My testing of these (quite a long time ago) left me with a feeling
> > that it was not a feature anyone had worked hard to optimise -
> > certainly it
> 
> The implementation isn't well-integrated with any optimizations for
> arithmetic on ordinary integer types / modes, because it has its own
> completely separate machine modes and operations on those.  I still think it
> would be better to have a GIMPLE pass that lowers from fixed-point types to
> saturating etc. operations on ordinary integer types, as I said in
> <https://gcc.gnu.org/legacy-ml/gcc-patches/2011-05/msg00846.html>.
> 

Oh, thanks for the link! That's a very useful read!

Cheers,
Tamar

> Note however that such lowering should be more or less independent of
> what's being discussed in this thread - this thread is about better
> optimization of such operations on ordinary types (with or without built-in
> functions of some kind in addition to recognition of such operations written
> in generic C), which you can do independently of what's done with fixed-
> point types.
> 
> --
> Joseph S. Myers
> joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-12  8:47 ` Richard Sandiford
@ 2021-05-12  9:13   ` Tamar Christina
  2021-05-12  9:27   ` Richard Biener
  1 sibling, 0 replies; 17+ messages in thread
From: Tamar Christina @ 2021-05-12  9:13 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc, Richard Biener



> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Wednesday, May 12, 2021 9:48 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc@gcc.gnu.org; Richard Biener <rguenther@suse.de>
> Subject: Re: [RFC] Implementing detection of saturation and rounding
> arithmetic
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The
> > aim is to recognize both Scalar and Vector variant of typical saturating
> expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >    char sat (char a, char b)
> >    {
> >       int tmp = a + b;
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 2. Saturating abs:
> >    char sat (char a)
> >    {
> >       int tmp = abs (a);
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 3. Rounding shifts
> >    char rndshift (char dc)
> >    {
> >       int round_const = 1 << (shift - 1);
> >       return (dc + round_const) >> shift;
> >    }
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single
> > idiom for expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and
> > float_truncate for truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for
> > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > but it looks like nothing actually generates this at the moment. it's just an
> unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of
> > these also exist in their scalar form, as such detecting them in the
> > vectorizer would be the wrong place.
> >
> > 1. Rounding:
> >    a) Use match.pd to rewrite various rounding idioms to shifts.
> >    b) Use backwards or forward prop to rewrite these to internal functions
> >       where even if the target does not support these rounding instructions
> they
> >       have a chance to provide a more efficient implementation than what
> would
> >       be generated normally.
> >
> > 2. Saturation:
> >    a) Use match.pd to rewrite the various saturation expressions into
> min/max
> >       operations which opens up the expressions to further optimizations.
> >    b) Use backwards or forward prop to convert to internal functions if the
> >       resulting min/max expression still meet the criteria for being a
> >       saturating expression.  This follows the algorithm as outlined in "The
> >       Software Vectorization handbook" by Aart J.C. Bik.
> >
> >       We could get the right instructions by using combine if we don't rewrite
> >       the instructions to an internal function, however then during
> Vectorization
> >       we would overestimate the cost of performing the saturation.  The
> constants
> >       will the also be loaded into registers and so becomes a lot more difficult
> >       to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an
> > internal function for all operations supported, or if it should be
> > modelled as an internal FN which just "marks" the operation as
> > rounding/saturating. After all, the only difference between a normal
> > and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have
> saturating instructions would automatically benefit from this.
> 
> I might have misunderstood what you meant here, but the *_truncate RTL
> codes are true truncations: the operand has to be wider than the result.
> Using this representation for general arithmetic is a problem if you're
> operating at the maximum size that the target supports natively.
> E.g. representing a 64-bit saturating addition as:
> 
>   - extend to 128 bits
>   - do a 128-bit addition
>   - truncate to 64 bits
> 

Ah, that wasn't clear from the documentation.. The one for the normal truncate
mentions that the modes have to be wider but the _truncate and friends don't
mention this constraint.  That would indeed not work..

> is going to be hard to cost and code-generate on targets that don't support
> native 128-bit operations (or at least, don't support them cheaply).
> This might not be a problem when recognising C idioms, since the C source
> code has to be able do the wider operation before truncating the result, but
> it could be a problem if we provide built-in functions or if we want to
> introduce compiler-generated saturating operations.
> 
> RTL already has per-operation saturation such as ss_plus/us_plus,
> ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
> ss_ashift/us_ashift and ss_abs.  I think we should do the same in gimple,
> using internal functions like you say.

Oh, I didn't know about those. Indeed those look like a better fit here.
Having all the operations separate in RTL already does seem to imply
That separate internal-fns is the way to go.

Thanks!

Regards,
Tamar

> 
> Thanks,
> Richard


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-12  8:47 ` Richard Sandiford
  2021-05-12  9:13   ` Tamar Christina
@ 2021-05-12  9:27   ` Richard Biener
  1 sibling, 0 replies; 17+ messages in thread
From: Richard Biener @ 2021-05-12  9:27 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Tamar Christina, gcc

On Wed, 12 May 2021, Richard Sandiford wrote:

> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi All,
> >
> > We are looking to implement saturation support in the compiler.  The aim is to
> > recognize both Scalar and Vector variant of typical saturating expressions.
> >
> > As an example:
> >
> > 1. Saturating addition:
> >    char sat (char a, char b)
> >    {
> >       int tmp = a + b;
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 2. Saturating abs:
> >    char sat (char a)
> >    {
> >       int tmp = abs (a);
> >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> >    }
> >
> > 3. Rounding shifts
> >    char rndshift (char dc)
> >    {
> >       int round_const = 1 << (shift - 1);
> >       return (dc + round_const) >> shift;
> >    }
> >
> > etc.
> >
> > Of course the first issue is that C does not really have a single idiom for
> > expressing this.
> >
> > At the RTL level we have ss_truncate and us_truncate and float_truncate for
> > truncation.
> >
> > At the Tree level we have nothing for truncation (I believe) for scalars. For
> > Vector code there already seems to be VEC_PACK_SAT_EXPR but it looks like
> > nothing actually generates this at the moment. it's just an unused tree code.
> >
> > For rounding there doesn't seem to be any existing infrastructure.
> >
> > The proposal to handle these are as follow, keep in mind that all of these also
> > exist in their scalar form, as such detecting them in the vectorizer would be
> > the wrong place.
> >
> > 1. Rounding:
> >    a) Use match.pd to rewrite various rounding idioms to shifts.
> >    b) Use backwards or forward prop to rewrite these to internal functions
> >       where even if the target does not support these rounding instructions they
> >       have a chance to provide a more efficient implementation than what would
> >       be generated normally.
> >
> > 2. Saturation:
> >    a) Use match.pd to rewrite the various saturation expressions into min/max
> >       operations which opens up the expressions to further optimizations.
> >    b) Use backwards or forward prop to convert to internal functions if the
> >       resulting min/max expression still meet the criteria for being a
> >       saturating expression.  This follows the algorithm as outlined in "The
> >       Software Vectorization handbook" by Aart J.C. Bik.
> >
> >       We could get the right instructions by using combine if we don't rewrite
> >       the instructions to an internal function, however then during Vectorization
> >       we would overestimate the cost of performing the saturation.  The constants
> >       will the also be loaded into registers and so becomes a lot more difficult
> >       to cleanup solely in the backend.
> >
> > The one thing I am wondering about is whether we would need an internal function
> > for all operations supported, or if it should be modelled as an internal FN which
> > just "marks" the operation as rounding/saturating. After all, the only difference
> > between a normal and saturating expression in RTL is the xx_truncate RTL surrounding
> > the expression.  Doing so would also mean that all targets whom have saturating
> > instructions would automatically benefit from this.
> 
> I might have misunderstood what you meant here, but the *_truncate
> RTL codes are true truncations: the operand has to be wider than the
> result.  Using this representation for general arithmetic is a problem
> if you're operating at the maximum size that the target supports natively.
> E.g. representing a 64-bit saturating addition as:
> 
>   - extend to 128 bits
>   - do a 128-bit addition
>   - truncate to 64 bits
> 
> is going to be hard to cost and code-generate on targets that don't support
> native 128-bit operations (or at least, don't support them cheaply).
> This might not be a problem when recognising C idioms, since the C source
> code has to be able do the wider operation before truncating the result,
> but it could be a problem if we provide built-in functions or if we want
> to introduce compiler-generated saturating operations.
> 
> RTL already has per-operation saturation such as ss_plus/us_plus,
> ss_minus/us_minus, ss_neg/us_neg, ss_mult/us_mult, ss_div,
> ss_ashift/us_ashift and ss_abs.  I think we should do the same
> in gimple, using internal functions like you say.

I think that for followup optimizations using regular arithmetic
ops and just new saturating truncations is better.  Maybe we can
also do both, with first only matching the actual saturation
with a new tree code and then later match the optabs the target
actually supports (in ISEL for example)?

Truly saturating ops might provide an interesting example how
to deal with -ftrapv - one might think we can now simply
use the trapping optabs as internal functions to reflect
-ftrapv onto the IL ...

Richard.

> Thanks,
> Richard
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-12  9:13   ` Tamar Christina
@ 2021-05-12 10:28     ` Richard Biener
  2021-05-12 13:06     ` Liu Hao
  1 sibling, 0 replies; 17+ messages in thread
From: Richard Biener @ 2021-05-12 10:28 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc, Richard Sandiford, Jakub Jelinek

On Wed, 12 May 2021, Tamar Christina wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, May 11, 2021 12:45 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc@gcc.gnu.org; Richard Sandiford <Richard.Sandiford@arm.com>;
> > Jakub Jelinek <jakub@redhat.com>
> > Subject: Re: [RFC] Implementing detection of saturation and rounding
> > arithmetic
> > 
> > On Tue, 11 May 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > We are looking to implement saturation support in the compiler.  The
> > > aim is to recognize both Scalar and Vector variant of typical saturating
> > expressions.
> > >
> > > As an example:
> > >
> > > 1. Saturating addition:
> > >    char sat (char a, char b)
> > >    {
> > >       int tmp = a + b;
> > >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >    }
> > >
> > > 2. Saturating abs:
> > >    char sat (char a)
> > >    {
> > >       int tmp = abs (a);
> > >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >    }
> > >
> > > 3. Rounding shifts
> > >    char rndshift (char dc)
> > >    {
> > >       int round_const = 1 << (shift - 1);
> > >       return (dc + round_const) >> shift;
> > >    }
> > >
> > > etc.
> > >
> > > Of course the first issue is that C does not really have a single
> > > idiom for expressing this.
> > >
> > > At the RTL level we have ss_truncate and us_truncate and
> > > float_truncate for truncation.
> > >
> > > At the Tree level we have nothing for truncation (I believe) for
> > > scalars. For Vector code there already seems to be VEC_PACK_SAT_EXPR
> > > but it looks like nothing actually generates this at the moment. it's just an
> > unused tree code.
> > >
> > > For rounding there doesn't seem to be any existing infrastructure.
> > >
> > > The proposal to handle these are as follow, keep in mind that all of
> > > these also exist in their scalar form, as such detecting them in the
> > > vectorizer would be the wrong place.
> > >
> > > 1. Rounding:
> > >    a) Use match.pd to rewrite various rounding idioms to shifts.
> > >    b) Use backwards or forward prop to rewrite these to internal functions
> > >       where even if the target does not support these rounding instructions
> > they
> > >       have a chance to provide a more efficient implementation than what
> > would
> > >       be generated normally.
> > >
> > > 2. Saturation:
> > >    a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >       operations which opens up the expressions to further optimizations.
> > >    b) Use backwards or forward prop to convert to internal functions if the
> > >       resulting min/max expression still meet the criteria for being a
> > >       saturating expression.  This follows the algorithm as outlined in "The
> > >       Software Vectorization handbook" by Aart J.C. Bik.
> > >
> > >       We could get the right instructions by using combine if we don't rewrite
> > >       the instructions to an internal function, however then during
> > Vectorization
> > >       we would overestimate the cost of performing the saturation.  The
> > constants
> > >       will the also be loaded into registers and so becomes a lot more difficult
> > >       to cleanup solely in the backend.
> > >
> > > The one thing I am wondering about is whether we would need an
> > > internal function for all operations supported, or if it should be
> > > modelled as an internal FN which just "marks" the operation as
> > > rounding/saturating. After all, the only difference between a normal
> > > and saturating expression in RTL is the xx_truncate RTL surrounding
> > > the expression.  Doing so would also mean that all targets whom have
> > saturating instructions would automatically benefit from this.
> > >
> > > But it does mean a small adjustment to the costing, which would need
> > > to cost the internal function call and the argument together as a whole.
> > >
> > > Any feedback is appreciated to minimize the number of changes required
> > > to the final patch.  Any objections to the outlined approach?
> > 
> > I think it makes sense to pattern-match the operations on GIMPLE and follow
> > the approach take by __builtin_add_overflow & friends.
> > Maybe quickly check whether clang provides some builtins already which we
> > could implement.
> > 
> 
> Hmm so taking a look at __builtin_add_overflow and friends it looks like they use
> Internal functions like ADD_OVERFLOW. Biggest difference being that it sets a condition
> flag.  This does me wonder if something like
> 
> int f (int a, int b)
> {
>     int res;
>     if (__builtin_add_overflow (a, b, &res))
>       {
>           if (res >= 0)
>             return INT_MAX;
>           else
>             return INT_MIN;
>       }
>     return res;
> }
> 
> Should be recognized as saturating as well.  But yeah, following the same approach we
> would rewrite the sequence to something like res = .ADD_SAT (a, b);
> 
> I know that in the past there was concerns about doing such pattern matching early which
> is why I thought about following the same approach we do with the table based ctz recognition
> and do it in backprop.

I think the danger with matching the saturation part late is that jump
optimizations break it apart.  That might be also good of course(?)
in case for example we can prove the value never is < -128 but only
possibly > 127 so we can use a single MAX_EXPR.  Which makes me wonder
whether you see us recognizing MIN/MAX and you're just pattern matching
that?

> For completeness, the OVERFLOW version of ADD in AArch64 maps to ADDS, but the saturating to SQADD.
> 
> > There's some appeal to mimicing what RTL does - thus have the saturation be
> > represented as saturating truncation.
> > Maybe that's what users expect of builtins as well.
> 
> That would also certainly cut down on the number of internal FNs we would introduce. But, the
> one technical issue I see is that I don't know how to ask the target if it supports it or not.

We don't ask the target for overflow support either but have fallback
RTL expansion.  What I'd suggest is to match the saturation operation
separately and independent of target support and do a late combining
with the feeding arithmetic operations using direct internal functions
to the optabs when the target supports this.

So the saturation op would match (target-type)MIN<MAX<op, tt-min>, tt-max>
thus three GIMPLE stmts.  Thus we'd have a SAT_CONVERT_EXPR eventually
with restriction on sign-changes?

> Normally the internal-fn are ti
> The internal-fn and the next instruction into an optab so the target can expand to it.  It's because of this
> that I wonder whether this added complexity is worth it.  Unless there's already such a mechanism In place?

I think direct internal fns are prefered in case we look to restrict
things to what target support.

Richard.

> > 
> > I'm not sure what the rounding shift would do - 'shift' isn't an argument to
> > rndshift here.  It feels like it's a rounding division but only by powers of two.
> > Does ROUND_DIV_EXPR already provide the desired semantics?
> 
> Oops, yes shift should have been an argument to the function.
> For the given example indeed I think this is ROUND_DIV_EXPR but we have other ones such as
> Rounding half-ing addition, Will need to check if that's just a ROUND_DIV_EXPR over an ADD.
> 
> For completeness, the AArch64 instruction for this example is SRSHL.
> 
> Thanks!
> 
> Regards,
> Tamar
> 
> > 
> > Richard.
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-12  9:13   ` Tamar Christina
@ 2021-05-12 12:20     ` Segher Boessenkool
  0 siblings, 0 replies; 17+ messages in thread
From: Segher Boessenkool @ 2021-05-12 12:20 UTC (permalink / raw)
  To: Tamar Christina; +Cc: gcc, Richard Sandiford, Richard Biener

On Wed, May 12, 2021 at 09:13:38AM +0000, Tamar Christina wrote:
> > From: Segher Boessenkool <segher@kernel.crashing.org>
> > On Tue, May 11, 2021 at 05:37:34AM +0000, Tamar Christina via Gcc wrote:
> > > 2. Saturating abs:
> > >    char sat (char a)
> > >    {
> > >       int tmp = abs (a);
> > >       return tmp > 127 ? 127 : ((tmp < -128) ? -128 : tmp);
> > >    }
> > 
> > That can be done quite a bit better, branchless at least.  Same for all
> > examples here probably.
> 
> Do you mean in the example? Sure, I wanted to keep it simple 😊

Point taken...  I think you need to look at these issues sooner rather
than later though.

> > > 2. Saturation:
> > >    a) Use match.pd to rewrite the various saturation expressions into
> > min/max
> > >       operations which opens up the expressions to further optimizations.
> > 
> > You'll have to do the operation in a bigger mode for that.  (This is also true for
> > rounding, in many cases).
> > 
> > This makes new internal functions more attractive / more feasible.
> 
> True, but the internal function doesn't need to model the wider mode right?
> So if you're saturating an int, the internal-fn would work on int,int,int.

The ifn doesn't have to go to a wider mode, it's only if you want to
express it with more basic operations that you have to.  That is the
reason I find ifns more attractive for this, yup.

> > >       We could get the right instructions by using combine if we don't rewrite
> > >       the instructions to an internal function, however then during
> > Vectorization
> > >       we would overestimate the cost of performing the saturation.  The
> > constants
> > >       will the also be loaded into registers and so becomes a lot more difficult
> > >       to cleanup solely in the backend.
> > 
> > Combine is almost never the right answer if you want to combine more than
> > two or three RTL insns.  It can be done, but combine will always write the
> > combined instruction in simplest terms, which tends to mean that if you
> > combine more insns there can be very many outcomes that you all need to
> > recognise as insns in your machine description.
> 
> Yeah, ideally I would like to handle it before it gets to expand.

That is the right place yeah...  but it will need some RTL work, in
simplify-rtx at least.


Segher

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Implementing detection of saturation and rounding arithmetic
  2021-05-12  9:13   ` Tamar Christina
  2021-05-12 10:28     ` Richard Biener
@ 2021-05-12 13:06     ` Liu Hao
  1 sibling, 0 replies; 17+ messages in thread
From: Liu Hao @ 2021-05-12 13:06 UTC (permalink / raw)
  To: Tamar Christina, Richard Biener; +Cc: Jakub Jelinek, gcc, Richard Sandiford


[-- Attachment #1.1: Type: text/plain, Size: 1285 bytes --]

在 5/12/21 5:13 PM, Tamar Christina via Gcc 写道:
> 
> int f (int a, int b)
> {
>      int res;
>      if (__builtin_add_overflow (a, b, &res))
>        {
>            if (res >= 0)
>              return INT_MAX;
>            else
>              return INT_MIN;
>        }
>      return res;
> }
> 
> Should be recognized as saturating as well.  But yeah, following the same approach we
> would rewrite the sequence to something like res = .ADD_SAT (a, b);
> 

Is this a correct saturating addition implementation?

If the addition has overflowed, you get a positive result or zero for the sum of two negative 
numbers (or a negative one for two positive numbers); and it is not straightforward to write it this 
way.


This should be

   int f (int a, int b)
   {
     int res;
     if (__builtin_add_overflow (a, b, &res))
       {
         if (a >= 0)      /* changed from `res` to `a`  */
           return INT_MAX;
         else
           return INT_MIN;
       }
     return res;
   }

which can be optimized further as

   int f (int a, int b)
   {
     int res;
     if (__builtin_add_overflow (a, b, &res))
       res = (a >> sizeof(int) * CHAR_BIT - 1) ^ INT_MAX;
     return res;
   }


-- 
Best regards,
Liu Hao


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-05-12 13:06 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-11  5:37 [RFC] Implementing detection of saturation and rounding arithmetic Tamar Christina
2021-05-11 10:03 ` David Brown
2021-05-11 17:00   ` Joseph Myers
2021-05-12  7:43     ` David Brown
2021-05-12  9:13     ` Tamar Christina
2021-05-12  8:00   ` Tamar Christina
2021-05-12  8:31     ` David Brown
2021-05-11 11:44 ` Richard Biener
2021-05-12  9:13   ` Tamar Christina
2021-05-12 10:28     ` Richard Biener
2021-05-12 13:06     ` Liu Hao
2021-05-11 15:43 ` Segher Boessenkool
2021-05-12  9:13   ` Tamar Christina
2021-05-12 12:20     ` Segher Boessenkool
2021-05-12  8:47 ` Richard Sandiford
2021-05-12  9:13   ` Tamar Christina
2021-05-12  9:27   ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).