public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [RFC] GNU Vector Extension -- Packed Boolean Vectors
@ 2023-06-26  6:23 Tejas Belagod
  2023-06-26  8:50 ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-06-26  6:23 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 7831 bytes --]

Hi,

Packed Boolean Vectors
----------------------

I'd like to propose a feature addition to GNU Vector extensions to add packed
boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
been implemented in Clang recently[2].

With predication features being added to vector architectures (SVE, MVE, AVX),
it is a useful feature to have to model predication on targets.  This could
find its use in intrinsics or just used as is as a GNU vector extension being
mapped to underlying target features.  For example, the packed boolean vector
could directly map to a predicate register on SVE.

Also, this new packed boolean type GNU extension can be used with SVE ACLE
intrinsics to replace a fixed-length svbool_t.

Here are a few options to represent the packed boolean vector type.

1. __attribute__((vector_size (n))) where n represents bytes

  typedef bool vbool __attribute__ ((vector_size (1)));

In this approach, the shape of the boolean vector is unclear. IoW, it is not
clear if each bit in 'n' controls a byte or an element. On targets
like SVE, it would be natural to have each bit control a byte of the target
vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
bit would control one element/lane on the target vector(therefore resulting in a
'packed' layout with all significant bits at the LSB).

2. __attribute__((vector_size (n))) where n represents num of lanes

  typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
  typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));

Here the 'n' in the vector_size attribute represents the number of bits that
is needed to represent a vector quantity.  In this case, this packed boolean
vector can represent upto 'n' vector lanes. The size of the type is
rounded up the nearest byte.  For example, the sizeof v4bi in the above
example is 1.

In this approach, because of the nature of the representation, the n bits required
to represent the n lanes of the vector are packed at the LSB. This does not naturally
align with the SVE approach of each bit representing a byte of the target vector
and PBV therefore having an 'unpacked' layout.

More importantly, another drawback here is that the change in units for vector_size
might be confusing to programmers.  The units will have to be interpreted based on the
base type of the typedef. It does not offer any flexibility in terms of the layout of
the bool vector - it is fixed.

3. Combination of 1 and 2.

Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
unambiguously represent the layout of the PBV. Consider

  typedef bool vbool __attribute__((vector_size (s, n[, w])));

where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
is the number of bits of the PBV that represents a lane of the target vector. 'w' would
allow a target to force a certain layout of the PBV.

The 2-parameter form of vector_size allows the target to have an
implementation-defined layout of the PBV. The target is free to choose the 'w'
if it is not specified to mirror the target layout of predicate registers. For
eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.

As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
8 lanes of boolean which could be represented by

  typedef bool v8b __attribute__ ((vector_size (2, 8)));

SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2

and AVX would choose a layout where all 8 consecutive bits packed at LSB would
be significant i.e. w == 1.

This scheme would accomodate more than 1 target to effectively represent vector
bools that mirror the target properties.

4. A new attribite

This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
attribute to define the PBV and make it general enough to

* represent all targets flexibly (SVE, AVX etc)
* represent sub-byte length predicates
* have no change in units of vector_size/no new vector_size signature
* not have the number of bytes constrain representation

If we call the new attribute 'bool_vec' (for lack of a better name), consider

  typedef bool vbool __attribute__((bool_vec (n[, w])))

where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.

If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).

5. Behaviour of the packed vector boolean type.

Taking the example of one of the options above, following is an illustration of it's behavior

* ABI

  New ABI rules will need to be defined for this type - eg alignment, PCS,
  mangling etc

* Initialization:

  Packed Boolean Vectors(PBV) can be initialized like so:

    typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
    v4bi p = {false, true, false, false};

  Each value in the initizlizer constant is of type bool. The lowest numbered
  element in the const array corresponds to the LSbit of p, element 1 is
  assigned to bit 4 etc.

  p is effectively a 2-byte bitmask with value 0x0010

  With a different layout

    typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
    v4bi p = {false, true, false, false};

  p is effectively a 2-byte bitmask with value 0x0002

* Operations:

  Packed Boolean Vectors support the following operations:
  . unary ~
  . unary !
  . binary&,|andˆ
  . assignments &=, |= and ˆ=
  . comparisons <, <=, ==, !=, >= and >
  . Ternary operator ?:

  Operations are defined as applied to the individual elements i.e the bits
  that are significant in the PBV. Whether the PBVs are treated as bitmasks
  or otherwise is implementation-defined.

  Insignificant bits could affect results of comparisons or ternary operators.
  In such cases, it is implementation defined how the unused bits are treated.

  . Subscript operator []

  For the subscript operator, the packed boolean vector acts like a array of
  elements - the first or the 0th indexed element being the LSbit of the PBV.
  Subscript operator yields a scalar boolean value.
  For example:

    typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));

    // Subscript operator result yields a boolean value.
    // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
    bool foo (v8b p, int n) { p[3] = true; return p[1]; }

  Out of bounds access: OOB access can be determined at compile time given the
  strong typing of the PBVs.

  PBV does not support address of operator(&) for elements of PBVs.

  . Implicit conversion from integer vectors to PBVs

  We would like to support the output of comparison operations to be PBVs. This
  requires us to define the implicit conversion from an integer vector to PBV
  as the result of vector comparisons are integer vectors.

  To define this operation:

    bool_vector = vector <cmpop> vector

  There is no change in how vector <cmpop> vector behavior i.e. this comparison
  would still produce an int_vector type as it does now.

    temp_int_vec = vector <cmpop> vector
    bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec

  The implicit conversion from int_vec to bool I'd define simply to be:

    bool_vec[n] = (_Bool) int_vec[n]

  where the C11 standard rules apply
  6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
  is 0 if the value compares equal to 0; otherwise, the result is 1.


[1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
[2] https://reviews.llvm.org/D88905
[3] https://reviews.llvm.org/D81083

Thoughts?

Thanks,
Tejas.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-06-26  6:23 [RFC] GNU Vector Extension -- Packed Boolean Vectors Tejas Belagod
@ 2023-06-26  8:50 ` Richard Biener
  2023-06-27  6:30   ` Tejas Belagod
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-06-26  8:50 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> Packed Boolean Vectors
> ----------------------
>
> I'd like to propose a feature addition to GNU Vector extensions to add packed
> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> been implemented in Clang recently[2].
>
> With predication features being added to vector architectures (SVE, MVE, AVX),
> it is a useful feature to have to model predication on targets.  This could
> find its use in intrinsics or just used as is as a GNU vector extension being
> mapped to underlying target features.  For example, the packed boolean vector
> could directly map to a predicate register on SVE.
>
> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> intrinsics to replace a fixed-length svbool_t.
>
> Here are a few options to represent the packed boolean vector type.

The GIMPLE frontend uses a new 'vector_mask' attribute:

typedef int v8si __attribute__((vector_size(8*sizeof(int))));
typedef v8si v8sib __attribute__((vector_mask));

it get's you a vector type that's the appropriate (dependent on the
target) vector
mask type for the vector data type (v8si in this case).

> 1. __attribute__((vector_size (n))) where n represents bytes
>
>   typedef bool vbool __attribute__ ((vector_size (1)));
>
> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> clear if each bit in 'n' controls a byte or an element. On targets
> like SVE, it would be natural to have each bit control a byte of the target
> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> bit would control one element/lane on the target vector(therefore resulting in a
> 'packed' layout with all significant bits at the LSB).
>
> 2. __attribute__((vector_size (n))) where n represents num of lanes
>
>   typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
>   typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
>
> Here the 'n' in the vector_size attribute represents the number of bits that
> is needed to represent a vector quantity.  In this case, this packed boolean
> vector can represent upto 'n' vector lanes. The size of the type is
> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> example is 1.
>
> In this approach, because of the nature of the representation, the n bits required
> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> align with the SVE approach of each bit representing a byte of the target vector
> and PBV therefore having an 'unpacked' layout.
>
> More importantly, another drawback here is that the change in units for vector_size
> might be confusing to programmers.  The units will have to be interpreted based on the
> base type of the typedef. It does not offer any flexibility in terms of the layout of
> the bool vector - it is fixed.
>
> 3. Combination of 1 and 2.
>
> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> unambiguously represent the layout of the PBV. Consider
>
>   typedef bool vbool __attribute__((vector_size (s, n[, w])));
>
> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> allow a target to force a certain layout of the PBV.
>
> The 2-parameter form of vector_size allows the target to have an
> implementation-defined layout of the PBV. The target is free to choose the 'w'
> if it is not specified to mirror the target layout of predicate registers. For
> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
>
> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> 8 lanes of boolean which could be represented by
>
>   typedef bool v8b __attribute__ ((vector_size (2, 8)));
>
> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
>
> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> be significant i.e. w == 1.
>
> This scheme would accomodate more than 1 target to effectively represent vector
> bools that mirror the target properties.
>
> 4. A new attribite
>
> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> attribute to define the PBV and make it general enough to
>
> * represent all targets flexibly (SVE, AVX etc)
> * represent sub-byte length predicates
> * have no change in units of vector_size/no new vector_size signature
> * not have the number of bytes constrain representation
>
> If we call the new attribute 'bool_vec' (for lack of a better name), consider
>
>   typedef bool vbool __attribute__((bool_vec (n[, w])))
>
> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
>
> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
>
> 5. Behaviour of the packed vector boolean type.
>
> Taking the example of one of the options above, following is an illustration of it's behavior
>
> * ABI
>
>   New ABI rules will need to be defined for this type - eg alignment, PCS,
>   mangling etc
>
> * Initialization:
>
>   Packed Boolean Vectors(PBV) can be initialized like so:
>
>     typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
>     v4bi p = {false, true, false, false};
>
>   Each value in the initizlizer constant is of type bool. The lowest numbered
>   element in the const array corresponds to the LSbit of p, element 1 is
>   assigned to bit 4 etc.
>
>   p is effectively a 2-byte bitmask with value 0x0010
>
>   With a different layout
>
>     typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
>     v4bi p = {false, true, false, false};
>
>   p is effectively a 2-byte bitmask with value 0x0002
>
> * Operations:
>
>   Packed Boolean Vectors support the following operations:
>   . unary ~
>   . unary !
>   . binary&,|andˆ
>   . assignments &=, |= and ˆ=
>   . comparisons <, <=, ==, !=, >= and >
>   . Ternary operator ?:
>
>   Operations are defined as applied to the individual elements i.e the bits
>   that are significant in the PBV. Whether the PBVs are treated as bitmasks
>   or otherwise is implementation-defined.
>
>   Insignificant bits could affect results of comparisons or ternary operators.
>   In such cases, it is implementation defined how the unused bits are treated.
>
>   . Subscript operator []
>
>   For the subscript operator, the packed boolean vector acts like a array of
>   elements - the first or the 0th indexed element being the LSbit of the PBV.
>   Subscript operator yields a scalar boolean value.
>   For example:
>
>     typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
>
>     // Subscript operator result yields a boolean value.
>     // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
>     bool foo (v8b p, int n) { p[3] = true; return p[1]; }
>
>   Out of bounds access: OOB access can be determined at compile time given the
>   strong typing of the PBVs.
>
>   PBV does not support address of operator(&) for elements of PBVs.
>
>   . Implicit conversion from integer vectors to PBVs
>
>   We would like to support the output of comparison operations to be PBVs. This
>   requires us to define the implicit conversion from an integer vector to PBV
>   as the result of vector comparisons are integer vectors.
>
>   To define this operation:
>
>     bool_vector = vector <cmpop> vector
>
>   There is no change in how vector <cmpop> vector behavior i.e. this comparison
>   would still produce an int_vector type as it does now.
>
>     temp_int_vec = vector <cmpop> vector
>     bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
>
>   The implicit conversion from int_vec to bool I'd define simply to be:
>
>     bool_vec[n] = (_Bool) int_vec[n]
>
>   where the C11 standard rules apply
>   6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
>   is 0 if the value compares equal to 0; otherwise, the result is 1.
>
>
> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> [2] https://reviews.llvm.org/D88905
> [3] https://reviews.llvm.org/D81083
>
> Thoughts?
>
> Thanks,
> Tejas.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-06-26  8:50 ` Richard Biener
@ 2023-06-27  6:30   ` Tejas Belagod
  2023-06-27  7:28     ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-06-27  6:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 9290 bytes --]



From: Richard Biener <richard.guenther@gmail.com>
Date: Monday, June 26, 2023 at 2:23 PM
To: Tejas Belagod <Tejas.Belagod@arm.com>
Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> Packed Boolean Vectors
> ----------------------
>
> I'd like to propose a feature addition to GNU Vector extensions to add packed
> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> been implemented in Clang recently[2].
>
> With predication features being added to vector architectures (SVE, MVE, AVX),
> it is a useful feature to have to model predication on targets.  This could
> find its use in intrinsics or just used as is as a GNU vector extension being
> mapped to underlying target features.  For example, the packed boolean vector
> could directly map to a predicate register on SVE.
>
> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> intrinsics to replace a fixed-length svbool_t.
>
> Here are a few options to represent the packed boolean vector type.

The GIMPLE frontend uses a new 'vector_mask' attribute:

typedef int v8si __attribute__((vector_size(8*sizeof(int))));
typedef v8si v8sib __attribute__((vector_mask));

it get's you a vector type that's the appropriate (dependent on the
target) vector
mask type for the vector data type (v8si in this case).


Thanks Richard.

Having had a quick look at the implementation, it does seem to tick the boxes.

I must admit I haven't dug deep, but if the target hook allows the mask to be

defined in way that is target-friendly (and I don't know how much effort it will

be to migrate the attribute to more front-ends), it should do the job nicely.

Let me go back and dig a bit deeper and get back with questions if any.

Thanks,
Tejas.




> 1. __attribute__((vector_size (n))) where n represents bytes
>
>   typedef bool vbool __attribute__ ((vector_size (1)));
>
> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> clear if each bit in 'n' controls a byte or an element. On targets
> like SVE, it would be natural to have each bit control a byte of the target
> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> bit would control one element/lane on the target vector(therefore resulting in a
> 'packed' layout with all significant bits at the LSB).
>
> 2. __attribute__((vector_size (n))) where n represents num of lanes
>
>   typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
>   typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
>
> Here the 'n' in the vector_size attribute represents the number of bits that
> is needed to represent a vector quantity.  In this case, this packed boolean
> vector can represent upto 'n' vector lanes. The size of the type is
> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> example is 1.
>
> In this approach, because of the nature of the representation, the n bits required
> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> align with the SVE approach of each bit representing a byte of the target vector
> and PBV therefore having an 'unpacked' layout.
>
> More importantly, another drawback here is that the change in units for vector_size
> might be confusing to programmers.  The units will have to be interpreted based on the
> base type of the typedef. It does not offer any flexibility in terms of the layout of
> the bool vector - it is fixed.
>
> 3. Combination of 1 and 2.
>
> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> unambiguously represent the layout of the PBV. Consider
>
>   typedef bool vbool __attribute__((vector_size (s, n[, w])));
>
> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> allow a target to force a certain layout of the PBV.
>
> The 2-parameter form of vector_size allows the target to have an
> implementation-defined layout of the PBV. The target is free to choose the 'w'
> if it is not specified to mirror the target layout of predicate registers. For
> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
>
> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> 8 lanes of boolean which could be represented by
>
>   typedef bool v8b __attribute__ ((vector_size (2, 8)));
>
> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
>
> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> be significant i.e. w == 1.
>
> This scheme would accomodate more than 1 target to effectively represent vector
> bools that mirror the target properties.
>
> 4. A new attribite
>
> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> attribute to define the PBV and make it general enough to
>
> * represent all targets flexibly (SVE, AVX etc)
> * represent sub-byte length predicates
> * have no change in units of vector_size/no new vector_size signature
> * not have the number of bytes constrain representation
>
> If we call the new attribute 'bool_vec' (for lack of a better name), consider
>
>   typedef bool vbool __attribute__((bool_vec (n[, w])))
>
> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
>
> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
>
> 5. Behaviour of the packed vector boolean type.
>
> Taking the example of one of the options above, following is an illustration of it's behavior
>
> * ABI
>
>   New ABI rules will need to be defined for this type - eg alignment, PCS,
>   mangling etc
>
> * Initialization:
>
>   Packed Boolean Vectors(PBV) can be initialized like so:
>
>     typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
>     v4bi p = {false, true, false, false};
>
>   Each value in the initizlizer constant is of type bool. The lowest numbered
>   element in the const array corresponds to the LSbit of p, element 1 is
>   assigned to bit 4 etc.
>
>   p is effectively a 2-byte bitmask with value 0x0010
>
>   With a different layout
>
>     typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
>     v4bi p = {false, true, false, false};
>
>   p is effectively a 2-byte bitmask with value 0x0002
>
> * Operations:
>
>   Packed Boolean Vectors support the following operations:
>   . unary ~
>   . unary !
>   . binary&,|andˆ
>   . assignments &=, |= and ˆ=
>   . comparisons <, <=, ==, !=, >= and >
>   . Ternary operator ?:
>
>   Operations are defined as applied to the individual elements i.e the bits
>   that are significant in the PBV. Whether the PBVs are treated as bitmasks
>   or otherwise is implementation-defined.
>
>   Insignificant bits could affect results of comparisons or ternary operators.
>   In such cases, it is implementation defined how the unused bits are treated.
>
>   . Subscript operator []
>
>   For the subscript operator, the packed boolean vector acts like a array of
>   elements - the first or the 0th indexed element being the LSbit of the PBV.
>   Subscript operator yields a scalar boolean value.
>   For example:
>
>     typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
>
>     // Subscript operator result yields a boolean value.
>     // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
>     bool foo (v8b p, int n) { p[3] = true; return p[1]; }
>
>   Out of bounds access: OOB access can be determined at compile time given the
>   strong typing of the PBVs.
>
>   PBV does not support address of operator(&) for elements of PBVs.
>
>   . Implicit conversion from integer vectors to PBVs
>
>   We would like to support the output of comparison operations to be PBVs. This
>   requires us to define the implicit conversion from an integer vector to PBV
>   as the result of vector comparisons are integer vectors.
>
>   To define this operation:
>
>     bool_vector = vector <cmpop> vector
>
>   There is no change in how vector <cmpop> vector behavior i.e. this comparison
>   would still produce an int_vector type as it does now.
>
>     temp_int_vec = vector <cmpop> vector
>     bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
>
>   The implicit conversion from int_vec to bool I'd define simply to be:
>
>     bool_vec[n] = (_Bool) int_vec[n]
>
>   where the C11 standard rules apply
>   6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
>   is 0 if the value compares equal to 0; otherwise, the result is 1.
>
>
> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> [2] https://reviews.llvm.org/D88905
> [3] https://reviews.llvm.org/D81083
>
> Thoughts?
>
> Thanks,
> Tejas.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-06-27  6:30   ` Tejas Belagod
@ 2023-06-27  7:28     ` Richard Biener
  2023-06-28 11:26       ` Tejas Belagod
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-06-27  7:28 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>
>
>
>
>
> From: Richard Biener <richard.guenther@gmail.com>
> Date: Monday, June 26, 2023 at 2:23 PM
> To: Tejas Belagod <Tejas.Belagod@arm.com>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>
> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > Packed Boolean Vectors
> > ----------------------
> >
> > I'd like to propose a feature addition to GNU Vector extensions to add packed
> > boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> > been implemented in Clang recently[2].
> >
> > With predication features being added to vector architectures (SVE, MVE, AVX),
> > it is a useful feature to have to model predication on targets.  This could
> > find its use in intrinsics or just used as is as a GNU vector extension being
> > mapped to underlying target features.  For example, the packed boolean vector
> > could directly map to a predicate register on SVE.
> >
> > Also, this new packed boolean type GNU extension can be used with SVE ACLE
> > intrinsics to replace a fixed-length svbool_t.
> >
> > Here are a few options to represent the packed boolean vector type.
>
> The GIMPLE frontend uses a new 'vector_mask' attribute:
>
> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> typedef v8si v8sib __attribute__((vector_mask));
>
> it get's you a vector type that's the appropriate (dependent on the
> target) vector
> mask type for the vector data type (v8si in this case).
>
>
>
> Thanks Richard.
>
> Having had a quick look at the implementation, it does seem to tick the boxes.
>
> I must admit I haven't dug deep, but if the target hook allows the mask to be
>
> defined in way that is target-friendly (and I don't know how much effort it will
>
> be to migrate the attribute to more front-ends), it should do the job nicely.
>
> Let me go back and dig a bit deeper and get back with questions if any.

Let me add that the advantage of this is the compiler doesn't need
to support weird explicitely laid out packed boolean vectors that do
not match what the target supports and the user doesn't need to know
what the target supports (and thus have an #ifdef maze around explicitely
specified layouts).

It does remove some flexibility though, for example with -mavx512f -mavx512vl
you'll get AVX512 style masks for V4SImode data vectors but of course the
target sill supports SSE2/AVX2 style masks as well, but those would not be
available as "packed boolean vectors", though they are of course in fact
equal to V4SImode data vectors with -1 or 0 values, so in this particular
case it might not matter.

That said, the vector_mask attribute will get you V4SImode vectors with
signed boolean elements of 32 bits for V4SImode data vectors with
SSE2/AVX2.

Richard.

>
>
> Thanks,
>
> Tejas.
>
>
>
>
>
>
>
> > 1. __attribute__((vector_size (n))) where n represents bytes
> >
> >   typedef bool vbool __attribute__ ((vector_size (1)));
> >
> > In this approach, the shape of the boolean vector is unclear. IoW, it is not
> > clear if each bit in 'n' controls a byte or an element. On targets
> > like SVE, it would be natural to have each bit control a byte of the target
> > vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> > bit would control one element/lane on the target vector(therefore resulting in a
> > 'packed' layout with all significant bits at the LSB).
> >
> > 2. __attribute__((vector_size (n))) where n represents num of lanes
> >
> >   typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> >   typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> >
> > Here the 'n' in the vector_size attribute represents the number of bits that
> > is needed to represent a vector quantity.  In this case, this packed boolean
> > vector can represent upto 'n' vector lanes. The size of the type is
> > rounded up the nearest byte.  For example, the sizeof v4bi in the above
> > example is 1.
> >
> > In this approach, because of the nature of the representation, the n bits required
> > to represent the n lanes of the vector are packed at the LSB. This does not naturally
> > align with the SVE approach of each bit representing a byte of the target vector
> > and PBV therefore having an 'unpacked' layout.
> >
> > More importantly, another drawback here is that the change in units for vector_size
> > might be confusing to programmers.  The units will have to be interpreted based on the
> > base type of the typedef. It does not offer any flexibility in terms of the layout of
> > the bool vector - it is fixed.
> >
> > 3. Combination of 1 and 2.
> >
> > Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> > unambiguously represent the layout of the PBV. Consider
> >
> >   typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >
> > where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> > is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> > allow a target to force a certain layout of the PBV.
> >
> > The 2-parameter form of vector_size allows the target to have an
> > implementation-defined layout of the PBV. The target is free to choose the 'w'
> > if it is not specified to mirror the target layout of predicate registers. For
> > eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >
> > As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> > 8 lanes of boolean which could be represented by
> >
> >   typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >
> > SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> >
> > and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> > be significant i.e. w == 1.
> >
> > This scheme would accomodate more than 1 target to effectively represent vector
> > bools that mirror the target properties.
> >
> > 4. A new attribite
> >
> > This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> > attribute to define the PBV and make it general enough to
> >
> > * represent all targets flexibly (SVE, AVX etc)
> > * represent sub-byte length predicates
> > * have no change in units of vector_size/no new vector_size signature
> > * not have the number of bytes constrain representation
> >
> > If we call the new attribute 'bool_vec' (for lack of a better name), consider
> >
> >   typedef bool vbool __attribute__((bool_vec (n[, w])))
> >
> > where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> >
> > If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> > If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >
> > 5. Behaviour of the packed vector boolean type.
> >
> > Taking the example of one of the options above, following is an illustration of it's behavior
> >
> > * ABI
> >
> >   New ABI rules will need to be defined for this type - eg alignment, PCS,
> >   mangling etc
> >
> > * Initialization:
> >
> >   Packed Boolean Vectors(PBV) can be initialized like so:
> >
> >     typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >     v4bi p = {false, true, false, false};
> >
> >   Each value in the initizlizer constant is of type bool. The lowest numbered
> >   element in the const array corresponds to the LSbit of p, element 1 is
> >   assigned to bit 4 etc.
> >
> >   p is effectively a 2-byte bitmask with value 0x0010
> >
> >   With a different layout
> >
> >     typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >     v4bi p = {false, true, false, false};
> >
> >   p is effectively a 2-byte bitmask with value 0x0002
> >
> > * Operations:
> >
> >   Packed Boolean Vectors support the following operations:
> >   . unary ~
> >   . unary !
> >   . binary&,|andˆ
> >   . assignments &=, |= and ˆ=
> >   . comparisons <, <=, ==, !=, >= and >
> >   . Ternary operator ?:
> >
> >   Operations are defined as applied to the individual elements i.e the bits
> >   that are significant in the PBV. Whether the PBVs are treated as bitmasks
> >   or otherwise is implementation-defined.
> >
> >   Insignificant bits could affect results of comparisons or ternary operators.
> >   In such cases, it is implementation defined how the unused bits are treated.
> >
> >   . Subscript operator []
> >
> >   For the subscript operator, the packed boolean vector acts like a array of
> >   elements - the first or the 0th indexed element being the LSbit of the PBV.
> >   Subscript operator yields a scalar boolean value.
> >   For example:
> >
> >     typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >
> >     // Subscript operator result yields a boolean value.
> >     // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> >     bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >
> >   Out of bounds access: OOB access can be determined at compile time given the
> >   strong typing of the PBVs.
> >
> >   PBV does not support address of operator(&) for elements of PBVs.
> >
> >   . Implicit conversion from integer vectors to PBVs
> >
> >   We would like to support the output of comparison operations to be PBVs. This
> >   requires us to define the implicit conversion from an integer vector to PBV
> >   as the result of vector comparisons are integer vectors.
> >
> >   To define this operation:
> >
> >     bool_vector = vector <cmpop> vector
> >
> >   There is no change in how vector <cmpop> vector behavior i.e. this comparison
> >   would still produce an int_vector type as it does now.
> >
> >     temp_int_vec = vector <cmpop> vector
> >     bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> >
> >   The implicit conversion from int_vec to bool I'd define simply to be:
> >
> >     bool_vec[n] = (_Bool) int_vec[n]
> >
> >   where the C11 standard rules apply
> >   6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> >   is 0 if the value compares equal to 0; otherwise, the result is 1.
> >
> >
> > [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> > [2] https://reviews.llvm.org/D88905
> > [3] https://reviews.llvm.org/D81083
> >
> > Thoughts?
> >
> > Thanks,
> > Tejas.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-06-27  7:28     ` Richard Biener
@ 2023-06-28 11:26       ` Tejas Belagod
  2023-06-29 13:25         ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-06-28 11:26 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 12229 bytes --]



From: Richard Biener <richard.guenther@gmail.com>
Date: Tuesday, June 27, 2023 at 12:58 PM
To: Tejas Belagod <Tejas.Belagod@arm.com>
Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>
>
>
>
>
> From: Richard Biener <richard.guenther@gmail.com>
> Date: Monday, June 26, 2023 at 2:23 PM
> To: Tejas Belagod <Tejas.Belagod@arm.com>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>
> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > Packed Boolean Vectors
> > ----------------------
> >
> > I'd like to propose a feature addition to GNU Vector extensions to add packed
> > boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> > been implemented in Clang recently[2].
> >
> > With predication features being added to vector architectures (SVE, MVE, AVX),
> > it is a useful feature to have to model predication on targets.  This could
> > find its use in intrinsics or just used as is as a GNU vector extension being
> > mapped to underlying target features.  For example, the packed boolean vector
> > could directly map to a predicate register on SVE.
> >
> > Also, this new packed boolean type GNU extension can be used with SVE ACLE
> > intrinsics to replace a fixed-length svbool_t.
> >
> > Here are a few options to represent the packed boolean vector type.
>
> The GIMPLE frontend uses a new 'vector_mask' attribute:
>
> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> typedef v8si v8sib __attribute__((vector_mask));
>
> it get's you a vector type that's the appropriate (dependent on the
> target) vector
> mask type for the vector data type (v8si in this case).
>
>
>
> Thanks Richard.
>
> Having had a quick look at the implementation, it does seem to tick the boxes.
>
> I must admit I haven't dug deep, but if the target hook allows the mask to be
>
> defined in way that is target-friendly (and I don't know how much effort it will
>
> be to migrate the attribute to more front-ends), it should do the job nicely.
>
> Let me go back and dig a bit deeper and get back with questions if any.


Let me add that the advantage of this is the compiler doesn't need
to support weird explicitely laid out packed boolean vectors that do
not match what the target supports and the user doesn't need to know
what the target supports (and thus have an #ifdef maze around explicitely
specified layouts).

Sorry for the delayed response – I spent a day experimenting with vector_mask.

Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
to avoid having to sprinkle the code with ifdefs.

It does remove some flexibility though, for example with -mavx512f -mavx512vl
you'll get AVX512 style masks for V4SImode data vectors but of course the
target sill supports SSE2/AVX2 style masks as well, but those would not be
available as "packed boolean vectors", though they are of course in fact
equal to V4SImode data vectors with -1 or 0 values, so in this particular
case it might not matter.

That said, the vector_mask attribute will get you V4SImode vectors with
signed boolean elements of 32 bits for V4SImode data vectors with
SSE2/AVX2.

This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
and a ‘w’ specified for the type.

Given its current implementation, if vector_mask is exposed to the CFE, would there be any
major challenges wrt implementation or defining behaviour semantics? I played around with a
few examples from the testsuite and wrote some new ones. I mostly tried operations that
the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
to the CFE – initialization and subscript operations?


Thanks,
Tejas.



Richard.

>
>
> Thanks,
>
> Tejas.
>
>
>
>
>
>
>
> > 1. __attribute__((vector_size (n))) where n represents bytes
> >
> >   typedef bool vbool __attribute__ ((vector_size (1)));
> >
> > In this approach, the shape of the boolean vector is unclear. IoW, it is not
> > clear if each bit in 'n' controls a byte or an element. On targets
> > like SVE, it would be natural to have each bit control a byte of the target
> > vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> > bit would control one element/lane on the target vector(therefore resulting in a
> > 'packed' layout with all significant bits at the LSB).
> >
> > 2. __attribute__((vector_size (n))) where n represents num of lanes
> >
> >   typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> >   typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> >
> > Here the 'n' in the vector_size attribute represents the number of bits that
> > is needed to represent a vector quantity.  In this case, this packed boolean
> > vector can represent upto 'n' vector lanes. The size of the type is
> > rounded up the nearest byte.  For example, the sizeof v4bi in the above
> > example is 1.
> >
> > In this approach, because of the nature of the representation, the n bits required
> > to represent the n lanes of the vector are packed at the LSB. This does not naturally
> > align with the SVE approach of each bit representing a byte of the target vector
> > and PBV therefore having an 'unpacked' layout.
> >
> > More importantly, another drawback here is that the change in units for vector_size
> > might be confusing to programmers.  The units will have to be interpreted based on the
> > base type of the typedef. It does not offer any flexibility in terms of the layout of
> > the bool vector - it is fixed.
> >
> > 3. Combination of 1 and 2.
> >
> > Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> > unambiguously represent the layout of the PBV. Consider
> >
> >   typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >
> > where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> > is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> > allow a target to force a certain layout of the PBV.
> >
> > The 2-parameter form of vector_size allows the target to have an
> > implementation-defined layout of the PBV. The target is free to choose the 'w'
> > if it is not specified to mirror the target layout of predicate registers. For
> > eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >
> > As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> > 8 lanes of boolean which could be represented by
> >
> >   typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >
> > SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> >
> > and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> > be significant i.e. w == 1.
> >
> > This scheme would accomodate more than 1 target to effectively represent vector
> > bools that mirror the target properties.
> >
> > 4. A new attribite
> >
> > This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> > attribute to define the PBV and make it general enough to
> >
> > * represent all targets flexibly (SVE, AVX etc)
> > * represent sub-byte length predicates
> > * have no change in units of vector_size/no new vector_size signature
> > * not have the number of bytes constrain representation
> >
> > If we call the new attribute 'bool_vec' (for lack of a better name), consider
> >
> >   typedef bool vbool __attribute__((bool_vec (n[, w])))
> >
> > where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> >
> > If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> > If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >
> > 5. Behaviour of the packed vector boolean type.
> >
> > Taking the example of one of the options above, following is an illustration of it's behavior
> >
> > * ABI
> >
> >   New ABI rules will need to be defined for this type - eg alignment, PCS,
> >   mangling etc
> >
> > * Initialization:
> >
> >   Packed Boolean Vectors(PBV) can be initialized like so:
> >
> >     typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >     v4bi p = {false, true, false, false};
> >
> >   Each value in the initizlizer constant is of type bool. The lowest numbered
> >   element in the const array corresponds to the LSbit of p, element 1 is
> >   assigned to bit 4 etc.
> >
> >   p is effectively a 2-byte bitmask with value 0x0010
> >
> >   With a different layout
> >
> >     typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >     v4bi p = {false, true, false, false};
> >
> >   p is effectively a 2-byte bitmask with value 0x0002
> >
> > * Operations:
> >
> >   Packed Boolean Vectors support the following operations:
> >   . unary ~
> >   . unary !
> >   . binary&,|andˆ
> >   . assignments &=, |= and ˆ=
> >   . comparisons <, <=, ==, !=, >= and >
> >   . Ternary operator ?:
> >
> >   Operations are defined as applied to the individual elements i.e the bits
> >   that are significant in the PBV. Whether the PBVs are treated as bitmasks
> >   or otherwise is implementation-defined.
> >
> >   Insignificant bits could affect results of comparisons or ternary operators.
> >   In such cases, it is implementation defined how the unused bits are treated.
> >
> >   . Subscript operator []
> >
> >   For the subscript operator, the packed boolean vector acts like a array of
> >   elements - the first or the 0th indexed element being the LSbit of the PBV.
> >   Subscript operator yields a scalar boolean value.
> >   For example:
> >
> >     typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >
> >     // Subscript operator result yields a boolean value.
> >     // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> >     bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >
> >   Out of bounds access: OOB access can be determined at compile time given the
> >   strong typing of the PBVs.
> >
> >   PBV does not support address of operator(&) for elements of PBVs.
> >
> >   . Implicit conversion from integer vectors to PBVs
> >
> >   We would like to support the output of comparison operations to be PBVs. This
> >   requires us to define the implicit conversion from an integer vector to PBV
> >   as the result of vector comparisons are integer vectors.
> >
> >   To define this operation:
> >
> >     bool_vector = vector <cmpop> vector
> >
> >   There is no change in how vector <cmpop> vector behavior i.e. this comparison
> >   would still produce an int_vector type as it does now.
> >
> >     temp_int_vec = vector <cmpop> vector
> >     bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> >
> >   The implicit conversion from int_vec to bool I'd define simply to be:
> >
> >     bool_vec[n] = (_Bool) int_vec[n]
> >
> >   where the C11 standard rules apply
> >   6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> >   is 0 if the value compares equal to 0; otherwise, the result is 1.
> >
> >
> > [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> > [2] https://reviews.llvm.org/D88905
> > [3] https://reviews.llvm.org/D81083
> >
> > Thoughts?
> >
> > Thanks,
> > Tejas.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-06-28 11:26       ` Tejas Belagod
@ 2023-06-29 13:25         ` Richard Biener
  2023-07-03  6:50           ` Tejas Belagod
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-06-29 13:25 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>
>
>
>
>
> From: Richard Biener <richard.guenther@gmail.com>
> Date: Tuesday, June 27, 2023 at 12:58 PM
> To: Tejas Belagod <Tejas.Belagod@arm.com>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>
> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >
> >
> >
> >
> >
> > From: Richard Biener <richard.guenther@gmail.com>
> > Date: Monday, June 26, 2023 at 2:23 PM
> > To: Tejas Belagod <Tejas.Belagod@arm.com>
> > Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> > Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >
> > On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > >
> > > Packed Boolean Vectors
> > > ----------------------
> > >
> > > I'd like to propose a feature addition to GNU Vector extensions to add packed
> > > boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> > > been implemented in Clang recently[2].
> > >
> > > With predication features being added to vector architectures (SVE, MVE, AVX),
> > > it is a useful feature to have to model predication on targets.  This could
> > > find its use in intrinsics or just used as is as a GNU vector extension being
> > > mapped to underlying target features.  For example, the packed boolean vector
> > > could directly map to a predicate register on SVE.
> > >
> > > Also, this new packed boolean type GNU extension can be used with SVE ACLE
> > > intrinsics to replace a fixed-length svbool_t.
> > >
> > > Here are a few options to represent the packed boolean vector type.
> >
> > The GIMPLE frontend uses a new 'vector_mask' attribute:
> >
> > typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> > typedef v8si v8sib __attribute__((vector_mask));
> >
> > it get's you a vector type that's the appropriate (dependent on the
> > target) vector
> > mask type for the vector data type (v8si in this case).
> >
> >
> >
> > Thanks Richard.
> >
> > Having had a quick look at the implementation, it does seem to tick the boxes.
> >
> > I must admit I haven't dug deep, but if the target hook allows the mask to be
> >
> > defined in way that is target-friendly (and I don't know how much effort it will
> >
> > be to migrate the attribute to more front-ends), it should do the job nicely.
> >
> > Let me go back and dig a bit deeper and get back with questions if any.
>
>
> Let me add that the advantage of this is the compiler doesn't need
> to support weird explicitely laid out packed boolean vectors that do
> not match what the target supports and the user doesn't need to know
> what the target supports (and thus have an #ifdef maze around explicitely
> specified layouts).
>
> Sorry for the delayed response – I spent a day experimenting with vector_mask.
>
>
>
> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
>
> to avoid having to sprinkle the code with ifdefs.
>
>
> It does remove some flexibility though, for example with -mavx512f -mavx512vl
> you'll get AVX512 style masks for V4SImode data vectors but of course the
> target sill supports SSE2/AVX2 style masks as well, but those would not be
> available as "packed boolean vectors", though they are of course in fact
> equal to V4SImode data vectors with -1 or 0 values, so in this particular
> case it might not matter.
>
> That said, the vector_mask attribute will get you V4SImode vectors with
> signed boolean elements of 32 bits for V4SImode data vectors with
> SSE2/AVX2.
>
>
>
> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
>
> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
>
> and a ‘w’ specified for the type.
>
>
>
> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
>
> major challenges wrt implementation or defining behaviour semantics? I played around with a
>
> few examples from the testsuite and wrote some new ones. I mostly tried operations that
>
> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
>
> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
>
> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
>
> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
>
> to the CFE – initialization and subscript operations?

Yes, either that or restrict how the mask vectors can be used, thus
properly diagnose improper
uses.  A question would be for example how to write common mask test
operations like
if (any (mask)) or if (all (mask)).  Likewise writing merge operations
- do those as

 a = a | (mask ? b : 0);

thus use ternary ?: for this?  For initialization regular vector
syntax should work:

mtype mask = (mtype){ -1, -1, 0, 0, ... };

there's the question of the signedness of the mask elements.  GCC
internally uses signed
bools with values -1 for true and 0 for false.

Richard.

>
>
>
>
>
> Thanks,
>
> Tejas.
>
>
>
>
>
> Richard.
>
> >
> >
> > Thanks,
> >
> > Tejas.
> >
> >
> >
> >
> >
> >
> >
> > > 1. __attribute__((vector_size (n))) where n represents bytes
> > >
> > >   typedef bool vbool __attribute__ ((vector_size (1)));
> > >
> > > In this approach, the shape of the boolean vector is unclear. IoW, it is not
> > > clear if each bit in 'n' controls a byte or an element. On targets
> > > like SVE, it would be natural to have each bit control a byte of the target
> > > vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> > > bit would control one element/lane on the target vector(therefore resulting in a
> > > 'packed' layout with all significant bits at the LSB).
> > >
> > > 2. __attribute__((vector_size (n))) where n represents num of lanes
> > >
> > >   typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> > >   typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> > >
> > > Here the 'n' in the vector_size attribute represents the number of bits that
> > > is needed to represent a vector quantity.  In this case, this packed boolean
> > > vector can represent upto 'n' vector lanes. The size of the type is
> > > rounded up the nearest byte.  For example, the sizeof v4bi in the above
> > > example is 1.
> > >
> > > In this approach, because of the nature of the representation, the n bits required
> > > to represent the n lanes of the vector are packed at the LSB. This does not naturally
> > > align with the SVE approach of each bit representing a byte of the target vector
> > > and PBV therefore having an 'unpacked' layout.
> > >
> > > More importantly, another drawback here is that the change in units for vector_size
> > > might be confusing to programmers.  The units will have to be interpreted based on the
> > > base type of the typedef. It does not offer any flexibility in terms of the layout of
> > > the bool vector - it is fixed.
> > >
> > > 3. Combination of 1 and 2.
> > >
> > > Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> > > unambiguously represent the layout of the PBV. Consider
> > >
> > >   typedef bool vbool __attribute__((vector_size (s, n[, w])));
> > >
> > > where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> > > is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> > > allow a target to force a certain layout of the PBV.
> > >
> > > The 2-parameter form of vector_size allows the target to have an
> > > implementation-defined layout of the PBV. The target is free to choose the 'w'
> > > if it is not specified to mirror the target layout of predicate registers. For
> > > eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> > >
> > > As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> > > 8 lanes of boolean which could be represented by
> > >
> > >   typedef bool v8b __attribute__ ((vector_size (2, 8)));
> > >
> > > SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> > >
> > > and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> > > be significant i.e. w == 1.
> > >
> > > This scheme would accomodate more than 1 target to effectively represent vector
> > > bools that mirror the target properties.
> > >
> > > 4. A new attribite
> > >
> > > This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> > > attribute to define the PBV and make it general enough to
> > >
> > > * represent all targets flexibly (SVE, AVX etc)
> > > * represent sub-byte length predicates
> > > * have no change in units of vector_size/no new vector_size signature
> > > * not have the number of bytes constrain representation
> > >
> > > If we call the new attribute 'bool_vec' (for lack of a better name), consider
> > >
> > >   typedef bool vbool __attribute__((bool_vec (n[, w])))
> > >
> > > where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> > >
> > > If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> > > If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> > >
> > > 5. Behaviour of the packed vector boolean type.
> > >
> > > Taking the example of one of the options above, following is an illustration of it's behavior
> > >
> > > * ABI
> > >
> > >   New ABI rules will need to be defined for this type - eg alignment, PCS,
> > >   mangling etc
> > >
> > > * Initialization:
> > >
> > >   Packed Boolean Vectors(PBV) can be initialized like so:
> > >
> > >     typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> > >     v4bi p = {false, true, false, false};
> > >
> > >   Each value in the initizlizer constant is of type bool. The lowest numbered
> > >   element in the const array corresponds to the LSbit of p, element 1 is
> > >   assigned to bit 4 etc.
> > >
> > >   p is effectively a 2-byte bitmask with value 0x0010
> > >
> > >   With a different layout
> > >
> > >     typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> > >     v4bi p = {false, true, false, false};
> > >
> > >   p is effectively a 2-byte bitmask with value 0x0002
> > >
> > > * Operations:
> > >
> > >   Packed Boolean Vectors support the following operations:
> > >   . unary ~
> > >   . unary !
> > >   . binary&,|andˆ
> > >   . assignments &=, |= and ˆ=
> > >   . comparisons <, <=, ==, !=, >= and >
> > >   . Ternary operator ?:
> > >
> > >   Operations are defined as applied to the individual elements i.e the bits
> > >   that are significant in the PBV. Whether the PBVs are treated as bitmasks
> > >   or otherwise is implementation-defined.
> > >
> > >   Insignificant bits could affect results of comparisons or ternary operators.
> > >   In such cases, it is implementation defined how the unused bits are treated.
> > >
> > >   . Subscript operator []
> > >
> > >   For the subscript operator, the packed boolean vector acts like a array of
> > >   elements - the first or the 0th indexed element being the LSbit of the PBV.
> > >   Subscript operator yields a scalar boolean value.
> > >   For example:
> > >
> > >     typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> > >
> > >     // Subscript operator result yields a boolean value.
> > >     // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> > >     bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> > >
> > >   Out of bounds access: OOB access can be determined at compile time given the
> > >   strong typing of the PBVs.
> > >
> > >   PBV does not support address of operator(&) for elements of PBVs.
> > >
> > >   . Implicit conversion from integer vectors to PBVs
> > >
> > >   We would like to support the output of comparison operations to be PBVs. This
> > >   requires us to define the implicit conversion from an integer vector to PBV
> > >   as the result of vector comparisons are integer vectors.
> > >
> > >   To define this operation:
> > >
> > >     bool_vector = vector <cmpop> vector
> > >
> > >   There is no change in how vector <cmpop> vector behavior i.e. this comparison
> > >   would still produce an int_vector type as it does now.
> > >
> > >     temp_int_vec = vector <cmpop> vector
> > >     bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> > >
> > >   The implicit conversion from int_vec to bool I'd define simply to be:
> > >
> > >     bool_vec[n] = (_Bool) int_vec[n]
> > >
> > >   where the C11 standard rules apply
> > >   6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> > >   is 0 if the value compares equal to 0; otherwise, the result is 1.
> > >
> > >
> > > [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> > > [2] https://reviews.llvm.org/D88905
> > > [3] https://reviews.llvm.org/D81083
> > >
> > > Thoughts?
> > >
> > > Thanks,
> > > Tejas.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-06-29 13:25         ` Richard Biener
@ 2023-07-03  6:50           ` Tejas Belagod
  2023-07-03  8:01             ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-07-03  6:50 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 6/29/23 6:55 PM, Richard Biener wrote:
> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>
>>
>>
>>
>>
>> From: Richard Biener <richard.guenther@gmail.com>
>> Date: Tuesday, June 27, 2023 at 12:58 PM
>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>
>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>
>>>
>>>
>>>
>>>
>>> From: Richard Biener <richard.guenther@gmail.com>
>>> Date: Monday, June 26, 2023 at 2:23 PM
>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>
>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Packed Boolean Vectors
>>>> ----------------------
>>>>
>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
>>>> been implemented in Clang recently[2].
>>>>
>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
>>>> it is a useful feature to have to model predication on targets.  This could
>>>> find its use in intrinsics or just used as is as a GNU vector extension being
>>>> mapped to underlying target features.  For example, the packed boolean vector
>>>> could directly map to a predicate register on SVE.
>>>>
>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
>>>> intrinsics to replace a fixed-length svbool_t.
>>>>
>>>> Here are a few options to represent the packed boolean vector type.
>>>
>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
>>>
>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>> typedef v8si v8sib __attribute__((vector_mask));
>>>
>>> it get's you a vector type that's the appropriate (dependent on the
>>> target) vector
>>> mask type for the vector data type (v8si in this case).
>>>
>>>
>>>
>>> Thanks Richard.
>>>
>>> Having had a quick look at the implementation, it does seem to tick the boxes.
>>>
>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
>>>
>>> defined in way that is target-friendly (and I don't know how much effort it will
>>>
>>> be to migrate the attribute to more front-ends), it should do the job nicely.
>>>
>>> Let me go back and dig a bit deeper and get back with questions if any.
>>
>>
>> Let me add that the advantage of this is the compiler doesn't need
>> to support weird explicitely laid out packed boolean vectors that do
>> not match what the target supports and the user doesn't need to know
>> what the target supports (and thus have an #ifdef maze around explicitely
>> specified layouts).
>>
>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
>>
>>
>>
>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
>>
>> to avoid having to sprinkle the code with ifdefs.
>>
>>
>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
>> you'll get AVX512 style masks for V4SImode data vectors but of course the
>> target sill supports SSE2/AVX2 style masks as well, but those would not be
>> available as "packed boolean vectors", though they are of course in fact
>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
>> case it might not matter.
>>
>> That said, the vector_mask attribute will get you V4SImode vectors with
>> signed boolean elements of 32 bits for V4SImode data vectors with
>> SSE2/AVX2.
>>
>>
>>
>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
>>
>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
>>
>> and a ‘w’ specified for the type.
>>
>>
>>
>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
>>
>> major challenges wrt implementation or defining behaviour semantics? I played around with a
>>
>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
>>
>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
>>
>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
>>
>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
>>
>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
>>
>> to the CFE – initialization and subscript operations?
> 
> Yes, either that or restrict how the mask vectors can be used, thus
> properly diagnose improper
> uses. 

Indeed.

  A question would be for example how to write common mask test
> operations like
> if (any (mask)) or if (all (mask)). 

I see 2 options here. New builtins could support new types - they'd 
provide a target independent way to test any and all conditions. Another 
would be to let the target use its intrinsics to do them in the most 
efficient way possible (which the builtins would get lowered down to 
anyway).


  Likewise writing merge operations
> - do those as
> 
>   a = a | (mask ? b : 0);
> 
> thus use ternary ?: for this?  

Yes, like now, the ternary could just translate to

   {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }

One thing to flesh out is the semantics. Should we allow this operation 
as long as the number of elements are the same even if the mask type if 
different i.e.

   v4hib ? v4si : v4si;

I don't see why this can't be allowed as now we let

   v4si ? v4sf : v4sf;


For initialization regular vector
> syntax should work:
> 
> mtype mask = (mtype){ -1, -1, 0, 0, ... };
> 
> there's the question of the signedness of the mask elements.  GCC
> internally uses signed
> bools with values -1 for true and 0 for false.

One of the things is the value that represents true. This is largely 
target-dependent when it comes to the vector_mask type. When vector_mask 
types are created from GCC's internal representation of bool vectors 
(signed ints) the point about implicit/explicit conversions from signed 
int vect to mask types in the proposal covers this. So mask in

   v4sib mask = (v4sib){-1, -1, 0, 0, ... }

will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx 
on SVE. On AVX2/SSE they'd still be represented as vector of signed ints 
{-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this 
new mask type representations will have in the mid-end while being 
converted back and forth to and from GCC's internal representation, but 
I'm guessing this is already being handled at some level by the 
vector_mask type's current support?

Thanks,
Tejas.

> 
> Richard.
> 
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Tejas.
>>
>>
>>
>>
>>
>> Richard.
>>
>>>
>>>
>>> Thanks,
>>>
>>> Tejas.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> 1. __attribute__((vector_size (n))) where n represents bytes
>>>>
>>>>    typedef bool vbool __attribute__ ((vector_size (1)));
>>>>
>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
>>>> clear if each bit in 'n' controls a byte or an element. On targets
>>>> like SVE, it would be natural to have each bit control a byte of the target
>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
>>>> bit would control one element/lane on the target vector(therefore resulting in a
>>>> 'packed' layout with all significant bits at the LSB).
>>>>
>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
>>>>
>>>>    typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
>>>>    typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
>>>>
>>>> Here the 'n' in the vector_size attribute represents the number of bits that
>>>> is needed to represent a vector quantity.  In this case, this packed boolean
>>>> vector can represent upto 'n' vector lanes. The size of the type is
>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
>>>> example is 1.
>>>>
>>>> In this approach, because of the nature of the representation, the n bits required
>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
>>>> align with the SVE approach of each bit representing a byte of the target vector
>>>> and PBV therefore having an 'unpacked' layout.
>>>>
>>>> More importantly, another drawback here is that the change in units for vector_size
>>>> might be confusing to programmers.  The units will have to be interpreted based on the
>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
>>>> the bool vector - it is fixed.
>>>>
>>>> 3. Combination of 1 and 2.
>>>>
>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
>>>> unambiguously represent the layout of the PBV. Consider
>>>>
>>>>    typedef bool vbool __attribute__((vector_size (s, n[, w])));
>>>>
>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
>>>> allow a target to force a certain layout of the PBV.
>>>>
>>>> The 2-parameter form of vector_size allows the target to have an
>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
>>>> if it is not specified to mirror the target layout of predicate registers. For
>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
>>>>
>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
>>>> 8 lanes of boolean which could be represented by
>>>>
>>>>    typedef bool v8b __attribute__ ((vector_size (2, 8)));
>>>>
>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
>>>>
>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
>>>> be significant i.e. w == 1.
>>>>
>>>> This scheme would accomodate more than 1 target to effectively represent vector
>>>> bools that mirror the target properties.
>>>>
>>>> 4. A new attribite
>>>>
>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
>>>> attribute to define the PBV and make it general enough to
>>>>
>>>> * represent all targets flexibly (SVE, AVX etc)
>>>> * represent sub-byte length predicates
>>>> * have no change in units of vector_size/no new vector_size signature
>>>> * not have the number of bytes constrain representation
>>>>
>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
>>>>
>>>>    typedef bool vbool __attribute__((bool_vec (n[, w])))
>>>>
>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
>>>>
>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
>>>>
>>>> 5. Behaviour of the packed vector boolean type.
>>>>
>>>> Taking the example of one of the options above, following is an illustration of it's behavior
>>>>
>>>> * ABI
>>>>
>>>>    New ABI rules will need to be defined for this type - eg alignment, PCS,
>>>>    mangling etc
>>>>
>>>> * Initialization:
>>>>
>>>>    Packed Boolean Vectors(PBV) can be initialized like so:
>>>>
>>>>      typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
>>>>      v4bi p = {false, true, false, false};
>>>>
>>>>    Each value in the initizlizer constant is of type bool. The lowest numbered
>>>>    element in the const array corresponds to the LSbit of p, element 1 is
>>>>    assigned to bit 4 etc.
>>>>
>>>>    p is effectively a 2-byte bitmask with value 0x0010
>>>>
>>>>    With a different layout
>>>>
>>>>      typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
>>>>      v4bi p = {false, true, false, false};
>>>>
>>>>    p is effectively a 2-byte bitmask with value 0x0002
>>>>
>>>> * Operations:
>>>>
>>>>    Packed Boolean Vectors support the following operations:
>>>>    . unary ~
>>>>    . unary !
>>>>    . binary&,|andˆ
>>>>    . assignments &=, |= and ˆ=
>>>>    . comparisons <, <=, ==, !=, >= and >
>>>>    . Ternary operator ?:
>>>>
>>>>    Operations are defined as applied to the individual elements i.e the bits
>>>>    that are significant in the PBV. Whether the PBVs are treated as bitmasks
>>>>    or otherwise is implementation-defined.
>>>>
>>>>    Insignificant bits could affect results of comparisons or ternary operators.
>>>>    In such cases, it is implementation defined how the unused bits are treated.
>>>>
>>>>    . Subscript operator []
>>>>
>>>>    For the subscript operator, the packed boolean vector acts like a array of
>>>>    elements - the first or the 0th indexed element being the LSbit of the PBV.
>>>>    Subscript operator yields a scalar boolean value.
>>>>    For example:
>>>>
>>>>      typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
>>>>
>>>>      // Subscript operator result yields a boolean value.
>>>>      // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
>>>>      bool foo (v8b p, int n) { p[3] = true; return p[1]; }
>>>>
>>>>    Out of bounds access: OOB access can be determined at compile time given the
>>>>    strong typing of the PBVs.
>>>>
>>>>    PBV does not support address of operator(&) for elements of PBVs.
>>>>
>>>>    . Implicit conversion from integer vectors to PBVs
>>>>
>>>>    We would like to support the output of comparison operations to be PBVs. This
>>>>    requires us to define the implicit conversion from an integer vector to PBV
>>>>    as the result of vector comparisons are integer vectors.
>>>>
>>>>    To define this operation:
>>>>
>>>>      bool_vector = vector <cmpop> vector
>>>>
>>>>    There is no change in how vector <cmpop> vector behavior i.e. this comparison
>>>>    would still produce an int_vector type as it does now.
>>>>
>>>>      temp_int_vec = vector <cmpop> vector
>>>>      bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
>>>>
>>>>    The implicit conversion from int_vec to bool I'd define simply to be:
>>>>
>>>>      bool_vec[n] = (_Bool) int_vec[n]
>>>>
>>>>    where the C11 standard rules apply
>>>>    6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
>>>>    is 0 if the value compares equal to 0; otherwise, the result is 1.
>>>>
>>>>
>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
>>>> [2] https://reviews.llvm.org/D88905
>>>> [3] https://reviews.llvm.org/D81083
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>> Tejas.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-03  6:50           ` Tejas Belagod
@ 2023-07-03  8:01             ` Richard Biener
  2023-07-13 10:14               ` Tejas Belagod
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-07-03  8:01 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
>
> On 6/29/23 6:55 PM, Richard Biener wrote:
> > On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>
> >>
> >>
> >>
> >>
> >> From: Richard Biener <richard.guenther@gmail.com>
> >> Date: Tuesday, June 27, 2023 at 12:58 PM
> >> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>
> >> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Richard Biener <richard.guenther@gmail.com>
> >>> Date: Monday, June 26, 2023 at 2:23 PM
> >>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>
> >>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> >>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Packed Boolean Vectors
> >>>> ----------------------
> >>>>
> >>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
> >>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> >>>> been implemented in Clang recently[2].
> >>>>
> >>>> With predication features being added to vector architectures (SVE, MVE, AVX),
> >>>> it is a useful feature to have to model predication on targets.  This could
> >>>> find its use in intrinsics or just used as is as a GNU vector extension being
> >>>> mapped to underlying target features.  For example, the packed boolean vector
> >>>> could directly map to a predicate register on SVE.
> >>>>
> >>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> >>>> intrinsics to replace a fixed-length svbool_t.
> >>>>
> >>>> Here are a few options to represent the packed boolean vector type.
> >>>
> >>> The GIMPLE frontend uses a new 'vector_mask' attribute:
> >>>
> >>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>> typedef v8si v8sib __attribute__((vector_mask));
> >>>
> >>> it get's you a vector type that's the appropriate (dependent on the
> >>> target) vector
> >>> mask type for the vector data type (v8si in this case).
> >>>
> >>>
> >>>
> >>> Thanks Richard.
> >>>
> >>> Having had a quick look at the implementation, it does seem to tick the boxes.
> >>>
> >>> I must admit I haven't dug deep, but if the target hook allows the mask to be
> >>>
> >>> defined in way that is target-friendly (and I don't know how much effort it will
> >>>
> >>> be to migrate the attribute to more front-ends), it should do the job nicely.
> >>>
> >>> Let me go back and dig a bit deeper and get back with questions if any.
> >>
> >>
> >> Let me add that the advantage of this is the compiler doesn't need
> >> to support weird explicitely laid out packed boolean vectors that do
> >> not match what the target supports and the user doesn't need to know
> >> what the target supports (and thus have an #ifdef maze around explicitely
> >> specified layouts).
> >>
> >> Sorry for the delayed response – I spent a day experimenting with vector_mask.
> >>
> >>
> >>
> >> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
> >>
> >> to avoid having to sprinkle the code with ifdefs.
> >>
> >>
> >> It does remove some flexibility though, for example with -mavx512f -mavx512vl
> >> you'll get AVX512 style masks for V4SImode data vectors but of course the
> >> target sill supports SSE2/AVX2 style masks as well, but those would not be
> >> available as "packed boolean vectors", though they are of course in fact
> >> equal to V4SImode data vectors with -1 or 0 values, so in this particular
> >> case it might not matter.
> >>
> >> That said, the vector_mask attribute will get you V4SImode vectors with
> >> signed boolean elements of 32 bits for V4SImode data vectors with
> >> SSE2/AVX2.
> >>
> >>
> >>
> >> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
> >>
> >> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
> >>
> >> and a ‘w’ specified for the type.
> >>
> >>
> >>
> >> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
> >>
> >> major challenges wrt implementation or defining behaviour semantics? I played around with a
> >>
> >> few examples from the testsuite and wrote some new ones. I mostly tried operations that
> >>
> >> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
> >>
> >> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
> >>
> >> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
> >>
> >> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
> >>
> >> to the CFE – initialization and subscript operations?
> >
> > Yes, either that or restrict how the mask vectors can be used, thus
> > properly diagnose improper
> > uses.
>
> Indeed.
>
>   A question would be for example how to write common mask test
> > operations like
> > if (any (mask)) or if (all (mask)).
>
> I see 2 options here. New builtins could support new types - they'd
> provide a target independent way to test any and all conditions. Another
> would be to let the target use its intrinsics to do them in the most
> efficient way possible (which the builtins would get lowered down to
> anyway).
>
>
>   Likewise writing merge operations
> > - do those as
> >
> >   a = a | (mask ? b : 0);
> >
> > thus use ternary ?: for this?
>
> Yes, like now, the ternary could just translate to
>
>    {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
>
> One thing to flesh out is the semantics. Should we allow this operation
> as long as the number of elements are the same even if the mask type if
> different i.e.
>
>    v4hib ? v4si : v4si;
>
> I don't see why this can't be allowed as now we let
>
>    v4si ? v4sf : v4sf;
>
>
> For initialization regular vector
> > syntax should work:
> >
> > mtype mask = (mtype){ -1, -1, 0, 0, ... };
> >
> > there's the question of the signedness of the mask elements.  GCC
> > internally uses signed
> > bools with values -1 for true and 0 for false.
>
> One of the things is the value that represents true. This is largely
> target-dependent when it comes to the vector_mask type. When vector_mask
> types are created from GCC's internal representation of bool vectors
> (signed ints) the point about implicit/explicit conversions from signed
> int vect to mask types in the proposal covers this. So mask in
>
>    v4sib mask = (v4sib){-1, -1, 0, 0, ... }
>
> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
> new mask type representations will have in the mid-end while being
> converted back and forth to and from GCC's internal representation, but
> I'm guessing this is already being handled at some level by the
> vector_mask type's current support?

Yes, I would guess so.  Of course what the middle-end is currently exposed
to is simply what the vectorizer generates - once fuzzers discover this feature
we'll see "interesting" uses that might run into missed or wrong handling of
them.

So whatever we do on the side of exposing this to users a good portion
of testsuite coverage for the allowed use cases is important.

Richard.

>
> Thanks,
> Tejas.
>
> >
> > Richard.
> >
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Tejas.
> >>
> >>
> >>
> >>
> >>
> >> Richard.
> >>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Tejas.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> 1. __attribute__((vector_size (n))) where n represents bytes
> >>>>
> >>>>    typedef bool vbool __attribute__ ((vector_size (1)));
> >>>>
> >>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> >>>> clear if each bit in 'n' controls a byte or an element. On targets
> >>>> like SVE, it would be natural to have each bit control a byte of the target
> >>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> >>>> bit would control one element/lane on the target vector(therefore resulting in a
> >>>> 'packed' layout with all significant bits at the LSB).
> >>>>
> >>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
> >>>>
> >>>>    typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> >>>>    typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> >>>>
> >>>> Here the 'n' in the vector_size attribute represents the number of bits that
> >>>> is needed to represent a vector quantity.  In this case, this packed boolean
> >>>> vector can represent upto 'n' vector lanes. The size of the type is
> >>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> >>>> example is 1.
> >>>>
> >>>> In this approach, because of the nature of the representation, the n bits required
> >>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> >>>> align with the SVE approach of each bit representing a byte of the target vector
> >>>> and PBV therefore having an 'unpacked' layout.
> >>>>
> >>>> More importantly, another drawback here is that the change in units for vector_size
> >>>> might be confusing to programmers.  The units will have to be interpreted based on the
> >>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
> >>>> the bool vector - it is fixed.
> >>>>
> >>>> 3. Combination of 1 and 2.
> >>>>
> >>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> >>>> unambiguously represent the layout of the PBV. Consider
> >>>>
> >>>>    typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >>>>
> >>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> >>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> >>>> allow a target to force a certain layout of the PBV.
> >>>>
> >>>> The 2-parameter form of vector_size allows the target to have an
> >>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
> >>>> if it is not specified to mirror the target layout of predicate registers. For
> >>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >>>>
> >>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> >>>> 8 lanes of boolean which could be represented by
> >>>>
> >>>>    typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >>>>
> >>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> >>>>
> >>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> >>>> be significant i.e. w == 1.
> >>>>
> >>>> This scheme would accomodate more than 1 target to effectively represent vector
> >>>> bools that mirror the target properties.
> >>>>
> >>>> 4. A new attribite
> >>>>
> >>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> >>>> attribute to define the PBV and make it general enough to
> >>>>
> >>>> * represent all targets flexibly (SVE, AVX etc)
> >>>> * represent sub-byte length predicates
> >>>> * have no change in units of vector_size/no new vector_size signature
> >>>> * not have the number of bytes constrain representation
> >>>>
> >>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
> >>>>
> >>>>    typedef bool vbool __attribute__((bool_vec (n[, w])))
> >>>>
> >>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> >>>>
> >>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> >>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >>>>
> >>>> 5. Behaviour of the packed vector boolean type.
> >>>>
> >>>> Taking the example of one of the options above, following is an illustration of it's behavior
> >>>>
> >>>> * ABI
> >>>>
> >>>>    New ABI rules will need to be defined for this type - eg alignment, PCS,
> >>>>    mangling etc
> >>>>
> >>>> * Initialization:
> >>>>
> >>>>    Packed Boolean Vectors(PBV) can be initialized like so:
> >>>>
> >>>>      typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >>>>      v4bi p = {false, true, false, false};
> >>>>
> >>>>    Each value in the initizlizer constant is of type bool. The lowest numbered
> >>>>    element in the const array corresponds to the LSbit of p, element 1 is
> >>>>    assigned to bit 4 etc.
> >>>>
> >>>>    p is effectively a 2-byte bitmask with value 0x0010
> >>>>
> >>>>    With a different layout
> >>>>
> >>>>      typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >>>>      v4bi p = {false, true, false, false};
> >>>>
> >>>>    p is effectively a 2-byte bitmask with value 0x0002
> >>>>
> >>>> * Operations:
> >>>>
> >>>>    Packed Boolean Vectors support the following operations:
> >>>>    . unary ~
> >>>>    . unary !
> >>>>    . binary&,|andˆ
> >>>>    . assignments &=, |= and ˆ=
> >>>>    . comparisons <, <=, ==, !=, >= and >
> >>>>    . Ternary operator ?:
> >>>>
> >>>>    Operations are defined as applied to the individual elements i.e the bits
> >>>>    that are significant in the PBV. Whether the PBVs are treated as bitmasks
> >>>>    or otherwise is implementation-defined.
> >>>>
> >>>>    Insignificant bits could affect results of comparisons or ternary operators.
> >>>>    In such cases, it is implementation defined how the unused bits are treated.
> >>>>
> >>>>    . Subscript operator []
> >>>>
> >>>>    For the subscript operator, the packed boolean vector acts like a array of
> >>>>    elements - the first or the 0th indexed element being the LSbit of the PBV.
> >>>>    Subscript operator yields a scalar boolean value.
> >>>>    For example:
> >>>>
> >>>>      typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >>>>
> >>>>      // Subscript operator result yields a boolean value.
> >>>>      // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> >>>>      bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >>>>
> >>>>    Out of bounds access: OOB access can be determined at compile time given the
> >>>>    strong typing of the PBVs.
> >>>>
> >>>>    PBV does not support address of operator(&) for elements of PBVs.
> >>>>
> >>>>    . Implicit conversion from integer vectors to PBVs
> >>>>
> >>>>    We would like to support the output of comparison operations to be PBVs. This
> >>>>    requires us to define the implicit conversion from an integer vector to PBV
> >>>>    as the result of vector comparisons are integer vectors.
> >>>>
> >>>>    To define this operation:
> >>>>
> >>>>      bool_vector = vector <cmpop> vector
> >>>>
> >>>>    There is no change in how vector <cmpop> vector behavior i.e. this comparison
> >>>>    would still produce an int_vector type as it does now.
> >>>>
> >>>>      temp_int_vec = vector <cmpop> vector
> >>>>      bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> >>>>
> >>>>    The implicit conversion from int_vec to bool I'd define simply to be:
> >>>>
> >>>>      bool_vec[n] = (_Bool) int_vec[n]
> >>>>
> >>>>    where the C11 standard rules apply
> >>>>    6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> >>>>    is 0 if the value compares equal to 0; otherwise, the result is 1.
> >>>>
> >>>>
> >>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> >>>> [2] https://reviews.llvm.org/D88905
> >>>> [3] https://reviews.llvm.org/D81083
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Thanks,
> >>>> Tejas.
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-03  8:01             ` Richard Biener
@ 2023-07-13 10:14               ` Tejas Belagod
  2023-07-13 10:35                 ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-07-13 10:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 7/3/23 1:31 PM, Richard Biener wrote:
> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
>>
>> On 6/29/23 6:55 PM, Richard Biener wrote:
>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> From: Richard Biener <richard.guenther@gmail.com>
>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>>
>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From: Richard Biener <richard.guenther@gmail.com>
>>>>> Date: Monday, June 26, 2023 at 2:23 PM
>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>>>
>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Packed Boolean Vectors
>>>>>> ----------------------
>>>>>>
>>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
>>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
>>>>>> been implemented in Clang recently[2].
>>>>>>
>>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
>>>>>> it is a useful feature to have to model predication on targets.  This could
>>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
>>>>>> mapped to underlying target features.  For example, the packed boolean vector
>>>>>> could directly map to a predicate register on SVE.
>>>>>>
>>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
>>>>>> intrinsics to replace a fixed-length svbool_t.
>>>>>>
>>>>>> Here are a few options to represent the packed boolean vector type.
>>>>>
>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
>>>>>
>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>>>> typedef v8si v8sib __attribute__((vector_mask));
>>>>>
>>>>> it get's you a vector type that's the appropriate (dependent on the
>>>>> target) vector
>>>>> mask type for the vector data type (v8si in this case).
>>>>>
>>>>>
>>>>>
>>>>> Thanks Richard.
>>>>>
>>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
>>>>>
>>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
>>>>>
>>>>> defined in way that is target-friendly (and I don't know how much effort it will
>>>>>
>>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
>>>>>
>>>>> Let me go back and dig a bit deeper and get back with questions if any.
>>>>
>>>>
>>>> Let me add that the advantage of this is the compiler doesn't need
>>>> to support weird explicitely laid out packed boolean vectors that do
>>>> not match what the target supports and the user doesn't need to know
>>>> what the target supports (and thus have an #ifdef maze around explicitely
>>>> specified layouts).
>>>>
>>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
>>>>
>>>>
>>>>
>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
>>>>
>>>> to avoid having to sprinkle the code with ifdefs.
>>>>
>>>>
>>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
>>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
>>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
>>>> available as "packed boolean vectors", though they are of course in fact
>>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
>>>> case it might not matter.
>>>>
>>>> That said, the vector_mask attribute will get you V4SImode vectors with
>>>> signed boolean elements of 32 bits for V4SImode data vectors with
>>>> SSE2/AVX2.
>>>>
>>>>
>>>>
>>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
>>>>
>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
>>>>
>>>> and a ‘w’ specified for the type.
>>>>
>>>>
>>>>
>>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
>>>>
>>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
>>>>
>>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
>>>>
>>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
>>>>
>>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
>>>>
>>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
>>>>
>>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
>>>>
>>>> to the CFE – initialization and subscript operations?
>>>
>>> Yes, either that or restrict how the mask vectors can be used, thus
>>> properly diagnose improper
>>> uses.
>>
>> Indeed.
>>
>>    A question would be for example how to write common mask test
>>> operations like
>>> if (any (mask)) or if (all (mask)).
>>
>> I see 2 options here. New builtins could support new types - they'd
>> provide a target independent way to test any and all conditions. Another
>> would be to let the target use its intrinsics to do them in the most
>> efficient way possible (which the builtins would get lowered down to
>> anyway).
>>
>>
>>    Likewise writing merge operations
>>> - do those as
>>>
>>>    a = a | (mask ? b : 0);
>>>
>>> thus use ternary ?: for this?
>>
>> Yes, like now, the ternary could just translate to
>>
>>     {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
>>
>> One thing to flesh out is the semantics. Should we allow this operation
>> as long as the number of elements are the same even if the mask type if
>> different i.e.
>>
>>     v4hib ? v4si : v4si;
>>
>> I don't see why this can't be allowed as now we let
>>
>>     v4si ? v4sf : v4sf;
>>
>>
>> For initialization regular vector
>>> syntax should work:
>>>
>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
>>>
>>> there's the question of the signedness of the mask elements.  GCC
>>> internally uses signed
>>> bools with values -1 for true and 0 for false.
>>
>> One of the things is the value that represents true. This is largely
>> target-dependent when it comes to the vector_mask type. When vector_mask
>> types are created from GCC's internal representation of bool vectors
>> (signed ints) the point about implicit/explicit conversions from signed
>> int vect to mask types in the proposal covers this. So mask in
>>
>>     v4sib mask = (v4sib){-1, -1, 0, 0, ... }
>>
>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
>> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
>> new mask type representations will have in the mid-end while being
>> converted back and forth to and from GCC's internal representation, but
>> I'm guessing this is already being handled at some level by the
>> vector_mask type's current support?
> 
> Yes, I would guess so.  Of course what the middle-end is currently exposed
> to is simply what the vectorizer generates - once fuzzers discover this feature
> we'll see "interesting" uses that might run into missed or wrong handling of
> them.
> 
> So whatever we do on the side of exposing this to users a good portion
> of testsuite coverage for the allowed use cases is important.
> 
> Richard.
> 

Apologies for the long-ish reply, but here's a TLDR and gory details follow.

TLDR:
GIMPLE's vector_mask type semantics seems to be target-dependent, so 
elevating vector_mask to CFE with same semantics is undesirable. OTOH, 
changing vector_mask to have target-independent CFE semantics will cause 
dichotomy between its CFE and GFE behaviours. But vector_mask approach 
scales well for sizeless types. Is the solution to have something like 
vector_mask with defined target-independent type semantics, but call it 
something else to prevent conflation with GIMPLE, a viable option?

Details:
After some more analysis of the proposed options, here are some 
interesting findings:

vector_mask looked like a very interesting option until I ran into some 
semantic uncertainly. This code:

typedef int v8si __attribute__((vector_size(8*sizeof(int))));
typedef v8si v8sib __attribute__((vector_mask));

typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
typedef v8hi v8hib __attribute__((vector_mask));

v8si res;
v8hi resh;

v8hib __GIMPLE () foo (v8hib x, v8sib y)
{
   v8hib res;

   res = x & y;
   return res;
}

When compiled on AArch64, produces a type-mismatch error for binary 
expression involving '&' because the 'derived' types 'v8hib' and 'v8sib' 
  have a different target-layout. If the layout of these two 'derived' 
types match, then the above code has no issue. Which is the case on 
amdgcn-amdhsa target where it compiles without any error(amdgcn uses a 
scalar DImode mask mode). IoW such code seems to be allowed on some 
targets and not on others.

With the same code, I tried putting casts and it worked fine on AArch64 
and amdgcn. This target-specific behaviour of vector_mask derived types 
will be difficult to specify once we move it to the CFE - in fact we 
probably don't want target-specific behaviour once it moves to the CFE.

If we expose vector_mask to CFE, we'd have to specify consistent 
semantics for vector_mask types. We'd have to resolve ambiguities like 
'v4hib & v4sib' clearly to be able to specify the semantics of the type 
system involving vector_mask. If we do this, don't we run the risk of a 
dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming 
we'd want to retain vector_mask semantics as they are in GIMPLE.

If we want to enforce constant semantics for vector_mask in the CFE, one 
way is to treat vector_mask types as distinct if they're 'attached' to 
distinct data vector types. In such a scenario, vector_mask types 
attached to two data vector types with the same lane-width and number of 
lanes would be classified as distinct. For eg:

typedef int v8si __attribute__((vector_size(8*sizeof(int))));
typedef v8si v8sib __attribute__((vector_mask));

typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
typedef v8sf v8sfb __attribute__((vector_mask));

v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
{
   (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
}

This could be the case for unsigned vs signed int vectors too for eg - 
seems a bit unnecessary tbh.

Though vector_mask's being 'attached' to a type has its drawbacks, it 
does seem to have an advantage when sizeless types are considered. If we 
have to define a sizeless vector boolean type that is implied by the 
lane size, we could do something like

typedef svint32_t svbool32_t __attribute__((vector_mask));

int32_t foo (svint32_t a, svint32_t b)
{
   svbool32_t pred = a > b;

   return pred[2] ? a[2] : b[2];
}

This is harder to do in the other schemes proposed so far as they're 
size-based.

To be able to free the boolean from the base type (not size) and retain 
vector_mask's flexibility to declare sizeless types, we could have an 
attribute that is more flexibly-typed and only 'derives' the lane-size 
and number of lanes from its 'base' type without actually inheriting the 
actual base type(char, short, int etc) or its signedness. This creates a 
purer and stand-alone boolean type without the associated semantics' 
complexity of having to cast between two same-size types with the same 
number of lanes. Eg.

typedef int v8si __attribute__((vector_size(8*sizeof(int))));
typedef v8si v8b __attribute__((vector_bool));

However, with differing lane-sizes, there will have to be a cast as the 
'derived' element size is different which could impact the layout of the 
vector mask. Eg.

v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
{
   (v8sib)(x == y) & (i == j) ? i : (v8si){0};
}

Such conversions on targets like AVX512/AMDGCN will be a NOP, but 
non-trivial on SVE (depending on the implemented layout of the bool vector).

vector_bool decouples us from having to retain the behaviour of 
vector_mask and provides the flexibility of not having to cast across 
same-element-size vector types. Wrt to sizeless types, it could scale well.

typedef svint32_t svbool32_t __attribute__((vector_bool));
typedef svint16_t svbool16_t __attribute__((vector_bool));

int32_t foo (svint32_t a, svint32_t b)
{
   svbool32_t pred = a > b;

   return pred[2] ? a[2] : b[2];
}

int16_t bar (svint16_t a, svint16_t b)
{
   svbool16_t pred = a > b;

   return pred[2] ? a[2] : b[2];
}

On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on 
the target predicate.

Thoughts?

Thanks,
Tejas.

>>
>> Thanks,
>> Tejas.
>>
>>>
>>> Richard.
>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Tejas.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Richard.
>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tejas.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
>>>>>>
>>>>>>     typedef bool vbool __attribute__ ((vector_size (1)));
>>>>>>
>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
>>>>>> like SVE, it would be natural to have each bit control a byte of the target
>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
>>>>>> bit would control one element/lane on the target vector(therefore resulting in a
>>>>>> 'packed' layout with all significant bits at the LSB).
>>>>>>
>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
>>>>>>
>>>>>>     typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
>>>>>>     typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
>>>>>>
>>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
>>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
>>>>>> vector can represent upto 'n' vector lanes. The size of the type is
>>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
>>>>>> example is 1.
>>>>>>
>>>>>> In this approach, because of the nature of the representation, the n bits required
>>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
>>>>>> align with the SVE approach of each bit representing a byte of the target vector
>>>>>> and PBV therefore having an 'unpacked' layout.
>>>>>>
>>>>>> More importantly, another drawback here is that the change in units for vector_size
>>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
>>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
>>>>>> the bool vector - it is fixed.
>>>>>>
>>>>>> 3. Combination of 1 and 2.
>>>>>>
>>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
>>>>>> unambiguously represent the layout of the PBV. Consider
>>>>>>
>>>>>>     typedef bool vbool __attribute__((vector_size (s, n[, w])));
>>>>>>
>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
>>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
>>>>>> allow a target to force a certain layout of the PBV.
>>>>>>
>>>>>> The 2-parameter form of vector_size allows the target to have an
>>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
>>>>>> if it is not specified to mirror the target layout of predicate registers. For
>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
>>>>>>
>>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
>>>>>> 8 lanes of boolean which could be represented by
>>>>>>
>>>>>>     typedef bool v8b __attribute__ ((vector_size (2, 8)));
>>>>>>
>>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
>>>>>>
>>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
>>>>>> be significant i.e. w == 1.
>>>>>>
>>>>>> This scheme would accomodate more than 1 target to effectively represent vector
>>>>>> bools that mirror the target properties.
>>>>>>
>>>>>> 4. A new attribite
>>>>>>
>>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
>>>>>> attribute to define the PBV and make it general enough to
>>>>>>
>>>>>> * represent all targets flexibly (SVE, AVX etc)
>>>>>> * represent sub-byte length predicates
>>>>>> * have no change in units of vector_size/no new vector_size signature
>>>>>> * not have the number of bytes constrain representation
>>>>>>
>>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
>>>>>>
>>>>>>     typedef bool vbool __attribute__((bool_vec (n[, w])))
>>>>>>
>>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
>>>>>>
>>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
>>>>>>
>>>>>> 5. Behaviour of the packed vector boolean type.
>>>>>>
>>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
>>>>>>
>>>>>> * ABI
>>>>>>
>>>>>>     New ABI rules will need to be defined for this type - eg alignment, PCS,
>>>>>>     mangling etc
>>>>>>
>>>>>> * Initialization:
>>>>>>
>>>>>>     Packed Boolean Vectors(PBV) can be initialized like so:
>>>>>>
>>>>>>       typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
>>>>>>       v4bi p = {false, true, false, false};
>>>>>>
>>>>>>     Each value in the initizlizer constant is of type bool. The lowest numbered
>>>>>>     element in the const array corresponds to the LSbit of p, element 1 is
>>>>>>     assigned to bit 4 etc.
>>>>>>
>>>>>>     p is effectively a 2-byte bitmask with value 0x0010
>>>>>>
>>>>>>     With a different layout
>>>>>>
>>>>>>       typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
>>>>>>       v4bi p = {false, true, false, false};
>>>>>>
>>>>>>     p is effectively a 2-byte bitmask with value 0x0002
>>>>>>
>>>>>> * Operations:
>>>>>>
>>>>>>     Packed Boolean Vectors support the following operations:
>>>>>>     . unary ~
>>>>>>     . unary !
>>>>>>     . binary&,|andˆ
>>>>>>     . assignments &=, |= and ˆ=
>>>>>>     . comparisons <, <=, ==, !=, >= and >
>>>>>>     . Ternary operator ?:
>>>>>>
>>>>>>     Operations are defined as applied to the individual elements i.e the bits
>>>>>>     that are significant in the PBV. Whether the PBVs are treated as bitmasks
>>>>>>     or otherwise is implementation-defined.
>>>>>>
>>>>>>     Insignificant bits could affect results of comparisons or ternary operators.
>>>>>>     In such cases, it is implementation defined how the unused bits are treated.
>>>>>>
>>>>>>     . Subscript operator []
>>>>>>
>>>>>>     For the subscript operator, the packed boolean vector acts like a array of
>>>>>>     elements - the first or the 0th indexed element being the LSbit of the PBV.
>>>>>>     Subscript operator yields a scalar boolean value.
>>>>>>     For example:
>>>>>>
>>>>>>       typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
>>>>>>
>>>>>>       // Subscript operator result yields a boolean value.
>>>>>>       // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
>>>>>>       bool foo (v8b p, int n) { p[3] = true; return p[1]; }
>>>>>>
>>>>>>     Out of bounds access: OOB access can be determined at compile time given the
>>>>>>     strong typing of the PBVs.
>>>>>>
>>>>>>     PBV does not support address of operator(&) for elements of PBVs.
>>>>>>
>>>>>>     . Implicit conversion from integer vectors to PBVs
>>>>>>
>>>>>>     We would like to support the output of comparison operations to be PBVs. This
>>>>>>     requires us to define the implicit conversion from an integer vector to PBV
>>>>>>     as the result of vector comparisons are integer vectors.
>>>>>>
>>>>>>     To define this operation:
>>>>>>
>>>>>>       bool_vector = vector <cmpop> vector
>>>>>>
>>>>>>     There is no change in how vector <cmpop> vector behavior i.e. this comparison
>>>>>>     would still produce an int_vector type as it does now.
>>>>>>
>>>>>>       temp_int_vec = vector <cmpop> vector
>>>>>>       bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
>>>>>>
>>>>>>     The implicit conversion from int_vec to bool I'd define simply to be:
>>>>>>
>>>>>>       bool_vec[n] = (_Bool) int_vec[n]
>>>>>>
>>>>>>     where the C11 standard rules apply
>>>>>>     6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
>>>>>>     is 0 if the value compares equal to 0; otherwise, the result is 1.
>>>>>>
>>>>>>
>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
>>>>>> [2] https://reviews.llvm.org/D88905
>>>>>> [3] https://reviews.llvm.org/D81083
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> Thanks,
>>>>>> Tejas.
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-13 10:14               ` Tejas Belagod
@ 2023-07-13 10:35                 ` Richard Biener
  2023-07-14 10:18                   ` Tejas Belagod
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-07-13 10:35 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
>
> On 7/3/23 1:31 PM, Richard Biener wrote:
> > On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >>
> >> On 6/29/23 6:55 PM, Richard Biener wrote:
> >>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> From: Richard Biener <richard.guenther@gmail.com>
> >>>> Date: Tuesday, June 27, 2023 at 12:58 PM
> >>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>
> >>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> From: Richard Biener <richard.guenther@gmail.com>
> >>>>> Date: Monday, June 26, 2023 at 2:23 PM
> >>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>
> >>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> >>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Packed Boolean Vectors
> >>>>>> ----------------------
> >>>>>>
> >>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
> >>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> >>>>>> been implemented in Clang recently[2].
> >>>>>>
> >>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
> >>>>>> it is a useful feature to have to model predication on targets.  This could
> >>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
> >>>>>> mapped to underlying target features.  For example, the packed boolean vector
> >>>>>> could directly map to a predicate register on SVE.
> >>>>>>
> >>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> >>>>>> intrinsics to replace a fixed-length svbool_t.
> >>>>>>
> >>>>>> Here are a few options to represent the packed boolean vector type.
> >>>>>
> >>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
> >>>>>
> >>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>>
> >>>>> it get's you a vector type that's the appropriate (dependent on the
> >>>>> target) vector
> >>>>> mask type for the vector data type (v8si in this case).
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks Richard.
> >>>>>
> >>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
> >>>>>
> >>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
> >>>>>
> >>>>> defined in way that is target-friendly (and I don't know how much effort it will
> >>>>>
> >>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
> >>>>>
> >>>>> Let me go back and dig a bit deeper and get back with questions if any.
> >>>>
> >>>>
> >>>> Let me add that the advantage of this is the compiler doesn't need
> >>>> to support weird explicitely laid out packed boolean vectors that do
> >>>> not match what the target supports and the user doesn't need to know
> >>>> what the target supports (and thus have an #ifdef maze around explicitely
> >>>> specified layouts).
> >>>>
> >>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
> >>>>
> >>>>
> >>>>
> >>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
> >>>>
> >>>> to avoid having to sprinkle the code with ifdefs.
> >>>>
> >>>>
> >>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
> >>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
> >>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
> >>>> available as "packed boolean vectors", though they are of course in fact
> >>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
> >>>> case it might not matter.
> >>>>
> >>>> That said, the vector_mask attribute will get you V4SImode vectors with
> >>>> signed boolean elements of 32 bits for V4SImode data vectors with
> >>>> SSE2/AVX2.
> >>>>
> >>>>
> >>>>
> >>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
> >>>>
> >>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
> >>>>
> >>>> and a ‘w’ specified for the type.
> >>>>
> >>>>
> >>>>
> >>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
> >>>>
> >>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
> >>>>
> >>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
> >>>>
> >>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
> >>>>
> >>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
> >>>>
> >>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
> >>>>
> >>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
> >>>>
> >>>> to the CFE – initialization and subscript operations?
> >>>
> >>> Yes, either that or restrict how the mask vectors can be used, thus
> >>> properly diagnose improper
> >>> uses.
> >>
> >> Indeed.
> >>
> >>    A question would be for example how to write common mask test
> >>> operations like
> >>> if (any (mask)) or if (all (mask)).
> >>
> >> I see 2 options here. New builtins could support new types - they'd
> >> provide a target independent way to test any and all conditions. Another
> >> would be to let the target use its intrinsics to do them in the most
> >> efficient way possible (which the builtins would get lowered down to
> >> anyway).
> >>
> >>
> >>    Likewise writing merge operations
> >>> - do those as
> >>>
> >>>    a = a | (mask ? b : 0);
> >>>
> >>> thus use ternary ?: for this?
> >>
> >> Yes, like now, the ternary could just translate to
> >>
> >>     {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
> >>
> >> One thing to flesh out is the semantics. Should we allow this operation
> >> as long as the number of elements are the same even if the mask type if
> >> different i.e.
> >>
> >>     v4hib ? v4si : v4si;
> >>
> >> I don't see why this can't be allowed as now we let
> >>
> >>     v4si ? v4sf : v4sf;
> >>
> >>
> >> For initialization regular vector
> >>> syntax should work:
> >>>
> >>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
> >>>
> >>> there's the question of the signedness of the mask elements.  GCC
> >>> internally uses signed
> >>> bools with values -1 for true and 0 for false.
> >>
> >> One of the things is the value that represents true. This is largely
> >> target-dependent when it comes to the vector_mask type. When vector_mask
> >> types are created from GCC's internal representation of bool vectors
> >> (signed ints) the point about implicit/explicit conversions from signed
> >> int vect to mask types in the proposal covers this. So mask in
> >>
> >>     v4sib mask = (v4sib){-1, -1, 0, 0, ... }
> >>
> >> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
> >> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
> >> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
> >> new mask type representations will have in the mid-end while being
> >> converted back and forth to and from GCC's internal representation, but
> >> I'm guessing this is already being handled at some level by the
> >> vector_mask type's current support?
> >
> > Yes, I would guess so.  Of course what the middle-end is currently exposed
> > to is simply what the vectorizer generates - once fuzzers discover this feature
> > we'll see "interesting" uses that might run into missed or wrong handling of
> > them.
> >
> > So whatever we do on the side of exposing this to users a good portion
> > of testsuite coverage for the allowed use cases is important.
> >
> > Richard.
> >
>
> Apologies for the long-ish reply, but here's a TLDR and gory details follow.
>
> TLDR:
> GIMPLE's vector_mask type semantics seems to be target-dependent, so
> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
> changing vector_mask to have target-independent CFE semantics will cause
> dichotomy between its CFE and GFE behaviours. But vector_mask approach
> scales well for sizeless types. Is the solution to have something like
> vector_mask with defined target-independent type semantics, but call it
> something else to prevent conflation with GIMPLE, a viable option?
>
> Details:
> After some more analysis of the proposed options, here are some
> interesting findings:
>
> vector_mask looked like a very interesting option until I ran into some
> semantic uncertainly. This code:
>
> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> typedef v8si v8sib __attribute__((vector_mask));
>
> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
> typedef v8hi v8hib __attribute__((vector_mask));
>
> v8si res;
> v8hi resh;
>
> v8hib __GIMPLE () foo (v8hib x, v8sib y)
> {
>    v8hib res;
>
>    res = x & y;
>    return res;
> }
>
> When compiled on AArch64, produces a type-mismatch error for binary
> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
>   have a different target-layout. If the layout of these two 'derived'
> types match, then the above code has no issue. Which is the case on
> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
> scalar DImode mask mode). IoW such code seems to be allowed on some
> targets and not on others.
>
> With the same code, I tried putting casts and it worked fine on AArch64
> and amdgcn. This target-specific behaviour of vector_mask derived types
> will be difficult to specify once we move it to the CFE - in fact we
> probably don't want target-specific behaviour once it moves to the CFE.
>
> If we expose vector_mask to CFE, we'd have to specify consistent
> semantics for vector_mask types. We'd have to resolve ambiguities like
> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
> system involving vector_mask. If we do this, don't we run the risk of a
> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
> we'd want to retain vector_mask semantics as they are in GIMPLE.
>
> If we want to enforce constant semantics for vector_mask in the CFE, one
> way is to treat vector_mask types as distinct if they're 'attached' to
> distinct data vector types. In such a scenario, vector_mask types
> attached to two data vector types with the same lane-width and number of
> lanes would be classified as distinct. For eg:
>
> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> typedef v8si v8sib __attribute__((vector_mask));
>
> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
> typedef v8sf v8sfb __attribute__((vector_mask));
>
> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
> {
>    (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
> }
>
> This could be the case for unsigned vs signed int vectors too for eg -
> seems a bit unnecessary tbh.
>
> Though vector_mask's being 'attached' to a type has its drawbacks, it
> does seem to have an advantage when sizeless types are considered. If we
> have to define a sizeless vector boolean type that is implied by the
> lane size, we could do something like
>
> typedef svint32_t svbool32_t __attribute__((vector_mask));
>
> int32_t foo (svint32_t a, svint32_t b)
> {
>    svbool32_t pred = a > b;
>
>    return pred[2] ? a[2] : b[2];
> }
>
> This is harder to do in the other schemes proposed so far as they're
> size-based.
>
> To be able to free the boolean from the base type (not size) and retain
> vector_mask's flexibility to declare sizeless types, we could have an
> attribute that is more flexibly-typed and only 'derives' the lane-size
> and number of lanes from its 'base' type without actually inheriting the
> actual base type(char, short, int etc) or its signedness. This creates a
> purer and stand-alone boolean type without the associated semantics'
> complexity of having to cast between two same-size types with the same
> number of lanes. Eg.
>
> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> typedef v8si v8b __attribute__((vector_bool));
>
> However, with differing lane-sizes, there will have to be a cast as the
> 'derived' element size is different which could impact the layout of the
> vector mask. Eg.
>
> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
> {
>    (v8sib)(x == y) & (i == j) ? i : (v8si){0};
> }
>
> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
> non-trivial on SVE (depending on the implemented layout of the bool vector).
>
> vector_bool decouples us from having to retain the behaviour of
> vector_mask and provides the flexibility of not having to cast across
> same-element-size vector types. Wrt to sizeless types, it could scale well.
>
> typedef svint32_t svbool32_t __attribute__((vector_bool));
> typedef svint16_t svbool16_t __attribute__((vector_bool));
>
> int32_t foo (svint32_t a, svint32_t b)
> {
>    svbool32_t pred = a > b;
>
>    return pred[2] ? a[2] : b[2];
> }
>
> int16_t bar (svint16_t a, svint16_t b)
> {
>    svbool16_t pred = a > b;
>
>    return pred[2] ? a[2] : b[2];
> }
>
> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
> the target predicate.
>
> Thoughts?

The GIMPLE frontend accepts just what is valid on the target here.  Any
"plumbing" such as implicit conversions (if we do not want to require
explicit ones even when NOP) need to be done/enforced by the C frontend.

There's one issue I can see that wasn't mentioned yet - GCC currently
accepts

typedef long gv1024di __attribute__((vector_size(1024*8)));

even if there's no underlying support on the target which either has support
only for smaller vectors or no vectors at all.  Currently vector_mask will
simply fail to produce sth desirable here.  What's your idea of making
that not target dependent?  GCC will later lower operations with such
vectors, possibly splitting them up into sizes supported by the hardware
natively, possibly performing elementwise operations.  For the former
one would need to guess the "decomposition type" and based on that
select the mask type [layout]?

One idea would be to specify the mask layout follows the largest vector
kind supported by the target and if there is none follow the layout
of (signed?) _Bool [n]?  When there's no target support for vectors
GCC will generally use elementwise operations apart from some
special-cases.

While using a different name than vector_mask is certainly possible
it wouldn't me to decide that, but I'm also not yet convinced it's
really necessary.  As said, what the GIMPLE frontend accepts
or not shouldn't limit us here - just the actual chosen layout of the
boolean vectors.

Richard.

> Thanks,
> Tejas.
>
> >>
> >> Thanks,
> >> Tejas.
> >>
> >>>
> >>> Richard.
> >>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Tejas.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Richard.
> >>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Tejas.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
> >>>>>>
> >>>>>>     typedef bool vbool __attribute__ ((vector_size (1)));
> >>>>>>
> >>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> >>>>>> clear if each bit in 'n' controls a byte or an element. On targets
> >>>>>> like SVE, it would be natural to have each bit control a byte of the target
> >>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> >>>>>> bit would control one element/lane on the target vector(therefore resulting in a
> >>>>>> 'packed' layout with all significant bits at the LSB).
> >>>>>>
> >>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
> >>>>>>
> >>>>>>     typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> >>>>>>     typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> >>>>>>
> >>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
> >>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
> >>>>>> vector can represent upto 'n' vector lanes. The size of the type is
> >>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> >>>>>> example is 1.
> >>>>>>
> >>>>>> In this approach, because of the nature of the representation, the n bits required
> >>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> >>>>>> align with the SVE approach of each bit representing a byte of the target vector
> >>>>>> and PBV therefore having an 'unpacked' layout.
> >>>>>>
> >>>>>> More importantly, another drawback here is that the change in units for vector_size
> >>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
> >>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
> >>>>>> the bool vector - it is fixed.
> >>>>>>
> >>>>>> 3. Combination of 1 and 2.
> >>>>>>
> >>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> >>>>>> unambiguously represent the layout of the PBV. Consider
> >>>>>>
> >>>>>>     typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >>>>>>
> >>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> >>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> >>>>>> allow a target to force a certain layout of the PBV.
> >>>>>>
> >>>>>> The 2-parameter form of vector_size allows the target to have an
> >>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
> >>>>>> if it is not specified to mirror the target layout of predicate registers. For
> >>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >>>>>>
> >>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> >>>>>> 8 lanes of boolean which could be represented by
> >>>>>>
> >>>>>>     typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >>>>>>
> >>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> >>>>>>
> >>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> >>>>>> be significant i.e. w == 1.
> >>>>>>
> >>>>>> This scheme would accomodate more than 1 target to effectively represent vector
> >>>>>> bools that mirror the target properties.
> >>>>>>
> >>>>>> 4. A new attribite
> >>>>>>
> >>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> >>>>>> attribute to define the PBV and make it general enough to
> >>>>>>
> >>>>>> * represent all targets flexibly (SVE, AVX etc)
> >>>>>> * represent sub-byte length predicates
> >>>>>> * have no change in units of vector_size/no new vector_size signature
> >>>>>> * not have the number of bytes constrain representation
> >>>>>>
> >>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
> >>>>>>
> >>>>>>     typedef bool vbool __attribute__((bool_vec (n[, w])))
> >>>>>>
> >>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> >>>>>>
> >>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> >>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >>>>>>
> >>>>>> 5. Behaviour of the packed vector boolean type.
> >>>>>>
> >>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
> >>>>>>
> >>>>>> * ABI
> >>>>>>
> >>>>>>     New ABI rules will need to be defined for this type - eg alignment, PCS,
> >>>>>>     mangling etc
> >>>>>>
> >>>>>> * Initialization:
> >>>>>>
> >>>>>>     Packed Boolean Vectors(PBV) can be initialized like so:
> >>>>>>
> >>>>>>       typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >>>>>>       v4bi p = {false, true, false, false};
> >>>>>>
> >>>>>>     Each value in the initizlizer constant is of type bool. The lowest numbered
> >>>>>>     element in the const array corresponds to the LSbit of p, element 1 is
> >>>>>>     assigned to bit 4 etc.
> >>>>>>
> >>>>>>     p is effectively a 2-byte bitmask with value 0x0010
> >>>>>>
> >>>>>>     With a different layout
> >>>>>>
> >>>>>>       typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >>>>>>       v4bi p = {false, true, false, false};
> >>>>>>
> >>>>>>     p is effectively a 2-byte bitmask with value 0x0002
> >>>>>>
> >>>>>> * Operations:
> >>>>>>
> >>>>>>     Packed Boolean Vectors support the following operations:
> >>>>>>     . unary ~
> >>>>>>     . unary !
> >>>>>>     . binary&,|andˆ
> >>>>>>     . assignments &=, |= and ˆ=
> >>>>>>     . comparisons <, <=, ==, !=, >= and >
> >>>>>>     . Ternary operator ?:
> >>>>>>
> >>>>>>     Operations are defined as applied to the individual elements i.e the bits
> >>>>>>     that are significant in the PBV. Whether the PBVs are treated as bitmasks
> >>>>>>     or otherwise is implementation-defined.
> >>>>>>
> >>>>>>     Insignificant bits could affect results of comparisons or ternary operators.
> >>>>>>     In such cases, it is implementation defined how the unused bits are treated.
> >>>>>>
> >>>>>>     . Subscript operator []
> >>>>>>
> >>>>>>     For the subscript operator, the packed boolean vector acts like a array of
> >>>>>>     elements - the first or the 0th indexed element being the LSbit of the PBV.
> >>>>>>     Subscript operator yields a scalar boolean value.
> >>>>>>     For example:
> >>>>>>
> >>>>>>       typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >>>>>>
> >>>>>>       // Subscript operator result yields a boolean value.
> >>>>>>       // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> >>>>>>       bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >>>>>>
> >>>>>>     Out of bounds access: OOB access can be determined at compile time given the
> >>>>>>     strong typing of the PBVs.
> >>>>>>
> >>>>>>     PBV does not support address of operator(&) for elements of PBVs.
> >>>>>>
> >>>>>>     . Implicit conversion from integer vectors to PBVs
> >>>>>>
> >>>>>>     We would like to support the output of comparison operations to be PBVs. This
> >>>>>>     requires us to define the implicit conversion from an integer vector to PBV
> >>>>>>     as the result of vector comparisons are integer vectors.
> >>>>>>
> >>>>>>     To define this operation:
> >>>>>>
> >>>>>>       bool_vector = vector <cmpop> vector
> >>>>>>
> >>>>>>     There is no change in how vector <cmpop> vector behavior i.e. this comparison
> >>>>>>     would still produce an int_vector type as it does now.
> >>>>>>
> >>>>>>       temp_int_vec = vector <cmpop> vector
> >>>>>>       bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> >>>>>>
> >>>>>>     The implicit conversion from int_vec to bool I'd define simply to be:
> >>>>>>
> >>>>>>       bool_vec[n] = (_Bool) int_vec[n]
> >>>>>>
> >>>>>>     where the C11 standard rules apply
> >>>>>>     6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> >>>>>>     is 0 if the value compares equal to 0; otherwise, the result is 1.
> >>>>>>
> >>>>>>
> >>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> >>>>>> [2] https://reviews.llvm.org/D88905
> >>>>>> [3] https://reviews.llvm.org/D81083
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Tejas.
> >>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-13 10:35                 ` Richard Biener
@ 2023-07-14 10:18                   ` Tejas Belagod
  2023-07-17 12:16                     ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-07-14 10:18 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 7/13/23 4:05 PM, Richard Biener wrote:
> On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
>>
>> On 7/3/23 1:31 PM, Richard Biener wrote:
>>> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
>>>>
>>>> On 6/29/23 6:55 PM, Richard Biener wrote:
>>>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> From: Richard Biener <richard.guenther@gmail.com>
>>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>>>>
>>>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
>>>>>>> Date: Monday, June 26, 2023 at 2:23 PM
>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>>>>>
>>>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
>>>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Packed Boolean Vectors
>>>>>>>> ----------------------
>>>>>>>>
>>>>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
>>>>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
>>>>>>>> been implemented in Clang recently[2].
>>>>>>>>
>>>>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
>>>>>>>> it is a useful feature to have to model predication on targets.  This could
>>>>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
>>>>>>>> mapped to underlying target features.  For example, the packed boolean vector
>>>>>>>> could directly map to a predicate register on SVE.
>>>>>>>>
>>>>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
>>>>>>>> intrinsics to replace a fixed-length svbool_t.
>>>>>>>>
>>>>>>>> Here are a few options to represent the packed boolean vector type.
>>>>>>>
>>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
>>>>>>>
>>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>>>>>> typedef v8si v8sib __attribute__((vector_mask));
>>>>>>>
>>>>>>> it get's you a vector type that's the appropriate (dependent on the
>>>>>>> target) vector
>>>>>>> mask type for the vector data type (v8si in this case).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks Richard.
>>>>>>>
>>>>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
>>>>>>>
>>>>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
>>>>>>>
>>>>>>> defined in way that is target-friendly (and I don't know how much effort it will
>>>>>>>
>>>>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
>>>>>>>
>>>>>>> Let me go back and dig a bit deeper and get back with questions if any.
>>>>>>
>>>>>>
>>>>>> Let me add that the advantage of this is the compiler doesn't need
>>>>>> to support weird explicitely laid out packed boolean vectors that do
>>>>>> not match what the target supports and the user doesn't need to know
>>>>>> what the target supports (and thus have an #ifdef maze around explicitely
>>>>>> specified layouts).
>>>>>>
>>>>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
>>>>>>
>>>>>> to avoid having to sprinkle the code with ifdefs.
>>>>>>
>>>>>>
>>>>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
>>>>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
>>>>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
>>>>>> available as "packed boolean vectors", though they are of course in fact
>>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
>>>>>> case it might not matter.
>>>>>>
>>>>>> That said, the vector_mask attribute will get you V4SImode vectors with
>>>>>> signed boolean elements of 32 bits for V4SImode data vectors with
>>>>>> SSE2/AVX2.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
>>>>>>
>>>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
>>>>>>
>>>>>> and a ‘w’ specified for the type.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
>>>>>>
>>>>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
>>>>>>
>>>>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
>>>>>>
>>>>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
>>>>>>
>>>>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
>>>>>>
>>>>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
>>>>>>
>>>>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
>>>>>>
>>>>>> to the CFE – initialization and subscript operations?
>>>>>
>>>>> Yes, either that or restrict how the mask vectors can be used, thus
>>>>> properly diagnose improper
>>>>> uses.
>>>>
>>>> Indeed.
>>>>
>>>>     A question would be for example how to write common mask test
>>>>> operations like
>>>>> if (any (mask)) or if (all (mask)).
>>>>
>>>> I see 2 options here. New builtins could support new types - they'd
>>>> provide a target independent way to test any and all conditions. Another
>>>> would be to let the target use its intrinsics to do them in the most
>>>> efficient way possible (which the builtins would get lowered down to
>>>> anyway).
>>>>
>>>>
>>>>     Likewise writing merge operations
>>>>> - do those as
>>>>>
>>>>>     a = a | (mask ? b : 0);
>>>>>
>>>>> thus use ternary ?: for this?
>>>>
>>>> Yes, like now, the ternary could just translate to
>>>>
>>>>      {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
>>>>
>>>> One thing to flesh out is the semantics. Should we allow this operation
>>>> as long as the number of elements are the same even if the mask type if
>>>> different i.e.
>>>>
>>>>      v4hib ? v4si : v4si;
>>>>
>>>> I don't see why this can't be allowed as now we let
>>>>
>>>>      v4si ? v4sf : v4sf;
>>>>
>>>>
>>>> For initialization regular vector
>>>>> syntax should work:
>>>>>
>>>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
>>>>>
>>>>> there's the question of the signedness of the mask elements.  GCC
>>>>> internally uses signed
>>>>> bools with values -1 for true and 0 for false.
>>>>
>>>> One of the things is the value that represents true. This is largely
>>>> target-dependent when it comes to the vector_mask type. When vector_mask
>>>> types are created from GCC's internal representation of bool vectors
>>>> (signed ints) the point about implicit/explicit conversions from signed
>>>> int vect to mask types in the proposal covers this. So mask in
>>>>
>>>>      v4sib mask = (v4sib){-1, -1, 0, 0, ... }
>>>>
>>>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
>>>> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
>>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
>>>> new mask type representations will have in the mid-end while being
>>>> converted back and forth to and from GCC's internal representation, but
>>>> I'm guessing this is already being handled at some level by the
>>>> vector_mask type's current support?
>>>
>>> Yes, I would guess so.  Of course what the middle-end is currently exposed
>>> to is simply what the vectorizer generates - once fuzzers discover this feature
>>> we'll see "interesting" uses that might run into missed or wrong handling of
>>> them.
>>>
>>> So whatever we do on the side of exposing this to users a good portion
>>> of testsuite coverage for the allowed use cases is important.
>>>
>>> Richard.
>>>
>>
>> Apologies for the long-ish reply, but here's a TLDR and gory details follow.
>>
>> TLDR:
>> GIMPLE's vector_mask type semantics seems to be target-dependent, so
>> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
>> changing vector_mask to have target-independent CFE semantics will cause
>> dichotomy between its CFE and GFE behaviours. But vector_mask approach
>> scales well for sizeless types. Is the solution to have something like
>> vector_mask with defined target-independent type semantics, but call it
>> something else to prevent conflation with GIMPLE, a viable option?
>>
>> Details:
>> After some more analysis of the proposed options, here are some
>> interesting findings:
>>
>> vector_mask looked like a very interesting option until I ran into some
>> semantic uncertainly. This code:
>>
>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>> typedef v8si v8sib __attribute__((vector_mask));
>>
>> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
>> typedef v8hi v8hib __attribute__((vector_mask));
>>
>> v8si res;
>> v8hi resh;
>>
>> v8hib __GIMPLE () foo (v8hib x, v8sib y)
>> {
>>     v8hib res;
>>
>>     res = x & y;
>>     return res;
>> }
>>
>> When compiled on AArch64, produces a type-mismatch error for binary
>> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
>>    have a different target-layout. If the layout of these two 'derived'
>> types match, then the above code has no issue. Which is the case on
>> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
>> scalar DImode mask mode). IoW such code seems to be allowed on some
>> targets and not on others.
>>
>> With the same code, I tried putting casts and it worked fine on AArch64
>> and amdgcn. This target-specific behaviour of vector_mask derived types
>> will be difficult to specify once we move it to the CFE - in fact we
>> probably don't want target-specific behaviour once it moves to the CFE.
>>
>> If we expose vector_mask to CFE, we'd have to specify consistent
>> semantics for vector_mask types. We'd have to resolve ambiguities like
>> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
>> system involving vector_mask. If we do this, don't we run the risk of a
>> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
>> we'd want to retain vector_mask semantics as they are in GIMPLE.
>>
>> If we want to enforce constant semantics for vector_mask in the CFE, one
>> way is to treat vector_mask types as distinct if they're 'attached' to
>> distinct data vector types. In such a scenario, vector_mask types
>> attached to two data vector types with the same lane-width and number of
>> lanes would be classified as distinct. For eg:
>>
>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>> typedef v8si v8sib __attribute__((vector_mask));
>>
>> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
>> typedef v8sf v8sfb __attribute__((vector_mask));
>>
>> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
>> {
>>     (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
>> }
>>
>> This could be the case for unsigned vs signed int vectors too for eg -
>> seems a bit unnecessary tbh.
>>
>> Though vector_mask's being 'attached' to a type has its drawbacks, it
>> does seem to have an advantage when sizeless types are considered. If we
>> have to define a sizeless vector boolean type that is implied by the
>> lane size, we could do something like
>>
>> typedef svint32_t svbool32_t __attribute__((vector_mask));
>>
>> int32_t foo (svint32_t a, svint32_t b)
>> {
>>     svbool32_t pred = a > b;
>>
>>     return pred[2] ? a[2] : b[2];
>> }
>>
>> This is harder to do in the other schemes proposed so far as they're
>> size-based.
>>
>> To be able to free the boolean from the base type (not size) and retain
>> vector_mask's flexibility to declare sizeless types, we could have an
>> attribute that is more flexibly-typed and only 'derives' the lane-size
>> and number of lanes from its 'base' type without actually inheriting the
>> actual base type(char, short, int etc) or its signedness. This creates a
>> purer and stand-alone boolean type without the associated semantics'
>> complexity of having to cast between two same-size types with the same
>> number of lanes. Eg.
>>
>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>> typedef v8si v8b __attribute__((vector_bool));
>>
>> However, with differing lane-sizes, there will have to be a cast as the
>> 'derived' element size is different which could impact the layout of the
>> vector mask. Eg.
>>
>> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
>> {
>>     (v8sib)(x == y) & (i == j) ? i : (v8si){0};
>> }
>>
>> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
>> non-trivial on SVE (depending on the implemented layout of the bool vector).
>>
>> vector_bool decouples us from having to retain the behaviour of
>> vector_mask and provides the flexibility of not having to cast across
>> same-element-size vector types. Wrt to sizeless types, it could scale well.
>>
>> typedef svint32_t svbool32_t __attribute__((vector_bool));
>> typedef svint16_t svbool16_t __attribute__((vector_bool));
>>
>> int32_t foo (svint32_t a, svint32_t b)
>> {
>>     svbool32_t pred = a > b;
>>
>>     return pred[2] ? a[2] : b[2];
>> }
>>
>> int16_t bar (svint16_t a, svint16_t b)
>> {
>>     svbool16_t pred = a > b;
>>
>>     return pred[2] ? a[2] : b[2];
>> }
>>
>> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
>> the target predicate.
>>
>> Thoughts?
> 
> The GIMPLE frontend accepts just what is valid on the target here.  Any
> "plumbing" such as implicit conversions (if we do not want to require
> explicit ones even when NOP) need to be done/enforced by the C frontend.
> 

Sorry, I'm not sure I follow - correct me if I'm wrong here.

If we desire to define/allow operations like implicit/explicit 
conversion on vector_mask types in CFE, don't we have to start from a 
position of defining what makes vector_mask types distinct and therefore 
require implicit/explicit conversions?

IIUC, GFE's distinctness of vector_mask types depends on how the mask 
mode is implemented on the target. If implemented in CFE, vector_mask 
types' distinctness probably shouldn't be based on target layout and 
could be based on the type they're 'attached' to.

Wouldn't that diverge from target-specific GFE behaviour - or are you 
suggesting its OK for vector_mask type semantics to be different in CFE 
and GFE?

> There's one issue I can see that wasn't mentioned yet - GCC currently
> accepts
> 
> typedef long gv1024di __attribute__((vector_size(1024*8)));
> 
> even if there's no underlying support on the target which either has support
> only for smaller vectors or no vectors at all.  Currently vector_mask will
> simply fail to produce sth desirable here.  What's your idea of making
> that not target dependent?  GCC will later lower operations with such
> vectors, possibly splitting them up into sizes supported by the hardware
> natively, possibly performing elementwise operations.  For the former
> one would need to guess the "decomposition type" and based on that
> select the mask type [layout]?
> 
> One idea would be to specify the mask layout follows the largest vector
> kind supported by the target and if there is none follow the layout
> of (signed?) _Bool [n]?  When there's no target support for vectors
> GCC will generally use elementwise operations apart from some
> special-cases.
> 

That is a very good point - thanks for raising it. For when GCC chooses 
to lower to a vector type supported by the target, my initial thought 
would be to, as you say, choose a mask that has enough bits to represent 
the largest vector size with the smallest lane-width. The actual layout 
of the mask will depend on how the target implements its mask mode. 
Decomposition of vector_mask ought to follow the decomposition of the 
GNU vectors type and each decomposed vector_mask type ought to have 
enough bits to represent the decomposed GNU vector shape. It sounds nice 
on paper, but I haven't really worked through a design for this. Do you 
see any gotchas here?

> While using a different name than vector_mask is certainly possible
> it wouldn't me to decide that, but I'm also not yet convinced it's
> really necessary.  As said, what the GIMPLE frontend accepts
> or not shouldn't limit us here - just the actual chosen layout of the
> boolean vectors.
> 

I'm just concerned about creating an alternate vector_mask functionality 
in the CFE and risk not being consistent with GFE.

Thanks,
Tejas.

> Richard.
> 
>> Thanks,
>> Tejas.
>>
>>>>
>>>> Thanks,
>>>> Tejas.
>>>>
>>>>>
>>>>> Richard.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Tejas.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Tejas.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
>>>>>>>>
>>>>>>>>      typedef bool vbool __attribute__ ((vector_size (1)));
>>>>>>>>
>>>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
>>>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
>>>>>>>> like SVE, it would be natural to have each bit control a byte of the target
>>>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
>>>>>>>> bit would control one element/lane on the target vector(therefore resulting in a
>>>>>>>> 'packed' layout with all significant bits at the LSB).
>>>>>>>>
>>>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
>>>>>>>>
>>>>>>>>      typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
>>>>>>>>      typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
>>>>>>>>
>>>>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
>>>>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
>>>>>>>> vector can represent upto 'n' vector lanes. The size of the type is
>>>>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
>>>>>>>> example is 1.
>>>>>>>>
>>>>>>>> In this approach, because of the nature of the representation, the n bits required
>>>>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
>>>>>>>> align with the SVE approach of each bit representing a byte of the target vector
>>>>>>>> and PBV therefore having an 'unpacked' layout.
>>>>>>>>
>>>>>>>> More importantly, another drawback here is that the change in units for vector_size
>>>>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
>>>>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
>>>>>>>> the bool vector - it is fixed.
>>>>>>>>
>>>>>>>> 3. Combination of 1 and 2.
>>>>>>>>
>>>>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
>>>>>>>> unambiguously represent the layout of the PBV. Consider
>>>>>>>>
>>>>>>>>      typedef bool vbool __attribute__((vector_size (s, n[, w])));
>>>>>>>>
>>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
>>>>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
>>>>>>>> allow a target to force a certain layout of the PBV.
>>>>>>>>
>>>>>>>> The 2-parameter form of vector_size allows the target to have an
>>>>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
>>>>>>>> if it is not specified to mirror the target layout of predicate registers. For
>>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
>>>>>>>>
>>>>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
>>>>>>>> 8 lanes of boolean which could be represented by
>>>>>>>>
>>>>>>>>      typedef bool v8b __attribute__ ((vector_size (2, 8)));
>>>>>>>>
>>>>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
>>>>>>>>
>>>>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
>>>>>>>> be significant i.e. w == 1.
>>>>>>>>
>>>>>>>> This scheme would accomodate more than 1 target to effectively represent vector
>>>>>>>> bools that mirror the target properties.
>>>>>>>>
>>>>>>>> 4. A new attribite
>>>>>>>>
>>>>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
>>>>>>>> attribute to define the PBV and make it general enough to
>>>>>>>>
>>>>>>>> * represent all targets flexibly (SVE, AVX etc)
>>>>>>>> * represent sub-byte length predicates
>>>>>>>> * have no change in units of vector_size/no new vector_size signature
>>>>>>>> * not have the number of bytes constrain representation
>>>>>>>>
>>>>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
>>>>>>>>
>>>>>>>>      typedef bool vbool __attribute__((bool_vec (n[, w])))
>>>>>>>>
>>>>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
>>>>>>>>
>>>>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
>>>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
>>>>>>>>
>>>>>>>> 5. Behaviour of the packed vector boolean type.
>>>>>>>>
>>>>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
>>>>>>>>
>>>>>>>> * ABI
>>>>>>>>
>>>>>>>>      New ABI rules will need to be defined for this type - eg alignment, PCS,
>>>>>>>>      mangling etc
>>>>>>>>
>>>>>>>> * Initialization:
>>>>>>>>
>>>>>>>>      Packed Boolean Vectors(PBV) can be initialized like so:
>>>>>>>>
>>>>>>>>        typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
>>>>>>>>        v4bi p = {false, true, false, false};
>>>>>>>>
>>>>>>>>      Each value in the initizlizer constant is of type bool. The lowest numbered
>>>>>>>>      element in the const array corresponds to the LSbit of p, element 1 is
>>>>>>>>      assigned to bit 4 etc.
>>>>>>>>
>>>>>>>>      p is effectively a 2-byte bitmask with value 0x0010
>>>>>>>>
>>>>>>>>      With a different layout
>>>>>>>>
>>>>>>>>        typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
>>>>>>>>        v4bi p = {false, true, false, false};
>>>>>>>>
>>>>>>>>      p is effectively a 2-byte bitmask with value 0x0002
>>>>>>>>
>>>>>>>> * Operations:
>>>>>>>>
>>>>>>>>      Packed Boolean Vectors support the following operations:
>>>>>>>>      . unary ~
>>>>>>>>      . unary !
>>>>>>>>      . binary&,|andˆ
>>>>>>>>      . assignments &=, |= and ˆ=
>>>>>>>>      . comparisons <, <=, ==, !=, >= and >
>>>>>>>>      . Ternary operator ?:
>>>>>>>>
>>>>>>>>      Operations are defined as applied to the individual elements i.e the bits
>>>>>>>>      that are significant in the PBV. Whether the PBVs are treated as bitmasks
>>>>>>>>      or otherwise is implementation-defined.
>>>>>>>>
>>>>>>>>      Insignificant bits could affect results of comparisons or ternary operators.
>>>>>>>>      In such cases, it is implementation defined how the unused bits are treated.
>>>>>>>>
>>>>>>>>      . Subscript operator []
>>>>>>>>
>>>>>>>>      For the subscript operator, the packed boolean vector acts like a array of
>>>>>>>>      elements - the first or the 0th indexed element being the LSbit of the PBV.
>>>>>>>>      Subscript operator yields a scalar boolean value.
>>>>>>>>      For example:
>>>>>>>>
>>>>>>>>        typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
>>>>>>>>
>>>>>>>>        // Subscript operator result yields a boolean value.
>>>>>>>>        // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
>>>>>>>>        bool foo (v8b p, int n) { p[3] = true; return p[1]; }
>>>>>>>>
>>>>>>>>      Out of bounds access: OOB access can be determined at compile time given the
>>>>>>>>      strong typing of the PBVs.
>>>>>>>>
>>>>>>>>      PBV does not support address of operator(&) for elements of PBVs.
>>>>>>>>
>>>>>>>>      . Implicit conversion from integer vectors to PBVs
>>>>>>>>
>>>>>>>>      We would like to support the output of comparison operations to be PBVs. This
>>>>>>>>      requires us to define the implicit conversion from an integer vector to PBV
>>>>>>>>      as the result of vector comparisons are integer vectors.
>>>>>>>>
>>>>>>>>      To define this operation:
>>>>>>>>
>>>>>>>>        bool_vector = vector <cmpop> vector
>>>>>>>>
>>>>>>>>      There is no change in how vector <cmpop> vector behavior i.e. this comparison
>>>>>>>>      would still produce an int_vector type as it does now.
>>>>>>>>
>>>>>>>>        temp_int_vec = vector <cmpop> vector
>>>>>>>>        bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
>>>>>>>>
>>>>>>>>      The implicit conversion from int_vec to bool I'd define simply to be:
>>>>>>>>
>>>>>>>>        bool_vec[n] = (_Bool) int_vec[n]
>>>>>>>>
>>>>>>>>      where the C11 standard rules apply
>>>>>>>>      6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
>>>>>>>>      is 0 if the value compares equal to 0; otherwise, the result is 1.
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
>>>>>>>> [2] https://reviews.llvm.org/D88905
>>>>>>>> [3] https://reviews.llvm.org/D81083
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tejas.
>>>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-14 10:18                   ` Tejas Belagod
@ 2023-07-17 12:16                     ` Richard Biener
  2023-07-26  7:21                       ` Tejas Belagod
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-07-17 12:16 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Fri, Jul 14, 2023 at 12:18 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
>
> On 7/13/23 4:05 PM, Richard Biener wrote:
> > On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >>
> >> On 7/3/23 1:31 PM, Richard Biener wrote:
> >>> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >>>>
> >>>> On 6/29/23 6:55 PM, Richard Biener wrote:
> >>>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> From: Richard Biener <richard.guenther@gmail.com>
> >>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
> >>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>>
> >>>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> From: Richard Biener <richard.guenther@gmail.com>
> >>>>>>> Date: Monday, June 26, 2023 at 2:23 PM
> >>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>>>
> >>>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> >>>>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Packed Boolean Vectors
> >>>>>>>> ----------------------
> >>>>>>>>
> >>>>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
> >>>>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> >>>>>>>> been implemented in Clang recently[2].
> >>>>>>>>
> >>>>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
> >>>>>>>> it is a useful feature to have to model predication on targets.  This could
> >>>>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
> >>>>>>>> mapped to underlying target features.  For example, the packed boolean vector
> >>>>>>>> could directly map to a predicate register on SVE.
> >>>>>>>>
> >>>>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> >>>>>>>> intrinsics to replace a fixed-length svbool_t.
> >>>>>>>>
> >>>>>>>> Here are a few options to represent the packed boolean vector type.
> >>>>>>>
> >>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
> >>>>>>>
> >>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>>>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>>>>
> >>>>>>> it get's you a vector type that's the appropriate (dependent on the
> >>>>>>> target) vector
> >>>>>>> mask type for the vector data type (v8si in this case).
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks Richard.
> >>>>>>>
> >>>>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
> >>>>>>>
> >>>>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
> >>>>>>>
> >>>>>>> defined in way that is target-friendly (and I don't know how much effort it will
> >>>>>>>
> >>>>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
> >>>>>>>
> >>>>>>> Let me go back and dig a bit deeper and get back with questions if any.
> >>>>>>
> >>>>>>
> >>>>>> Let me add that the advantage of this is the compiler doesn't need
> >>>>>> to support weird explicitely laid out packed boolean vectors that do
> >>>>>> not match what the target supports and the user doesn't need to know
> >>>>>> what the target supports (and thus have an #ifdef maze around explicitely
> >>>>>> specified layouts).
> >>>>>>
> >>>>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
> >>>>>>
> >>>>>> to avoid having to sprinkle the code with ifdefs.
> >>>>>>
> >>>>>>
> >>>>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
> >>>>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
> >>>>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
> >>>>>> available as "packed boolean vectors", though they are of course in fact
> >>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
> >>>>>> case it might not matter.
> >>>>>>
> >>>>>> That said, the vector_mask attribute will get you V4SImode vectors with
> >>>>>> signed boolean elements of 32 bits for V4SImode data vectors with
> >>>>>> SSE2/AVX2.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
> >>>>>>
> >>>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
> >>>>>>
> >>>>>> and a ‘w’ specified for the type.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
> >>>>>>
> >>>>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
> >>>>>>
> >>>>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
> >>>>>>
> >>>>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
> >>>>>>
> >>>>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
> >>>>>>
> >>>>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
> >>>>>>
> >>>>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
> >>>>>>
> >>>>>> to the CFE – initialization and subscript operations?
> >>>>>
> >>>>> Yes, either that or restrict how the mask vectors can be used, thus
> >>>>> properly diagnose improper
> >>>>> uses.
> >>>>
> >>>> Indeed.
> >>>>
> >>>>     A question would be for example how to write common mask test
> >>>>> operations like
> >>>>> if (any (mask)) or if (all (mask)).
> >>>>
> >>>> I see 2 options here. New builtins could support new types - they'd
> >>>> provide a target independent way to test any and all conditions. Another
> >>>> would be to let the target use its intrinsics to do them in the most
> >>>> efficient way possible (which the builtins would get lowered down to
> >>>> anyway).
> >>>>
> >>>>
> >>>>     Likewise writing merge operations
> >>>>> - do those as
> >>>>>
> >>>>>     a = a | (mask ? b : 0);
> >>>>>
> >>>>> thus use ternary ?: for this?
> >>>>
> >>>> Yes, like now, the ternary could just translate to
> >>>>
> >>>>      {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
> >>>>
> >>>> One thing to flesh out is the semantics. Should we allow this operation
> >>>> as long as the number of elements are the same even if the mask type if
> >>>> different i.e.
> >>>>
> >>>>      v4hib ? v4si : v4si;
> >>>>
> >>>> I don't see why this can't be allowed as now we let
> >>>>
> >>>>      v4si ? v4sf : v4sf;
> >>>>
> >>>>
> >>>> For initialization regular vector
> >>>>> syntax should work:
> >>>>>
> >>>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
> >>>>>
> >>>>> there's the question of the signedness of the mask elements.  GCC
> >>>>> internally uses signed
> >>>>> bools with values -1 for true and 0 for false.
> >>>>
> >>>> One of the things is the value that represents true. This is largely
> >>>> target-dependent when it comes to the vector_mask type. When vector_mask
> >>>> types are created from GCC's internal representation of bool vectors
> >>>> (signed ints) the point about implicit/explicit conversions from signed
> >>>> int vect to mask types in the proposal covers this. So mask in
> >>>>
> >>>>      v4sib mask = (v4sib){-1, -1, 0, 0, ... }
> >>>>
> >>>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
> >>>> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
> >>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
> >>>> new mask type representations will have in the mid-end while being
> >>>> converted back and forth to and from GCC's internal representation, but
> >>>> I'm guessing this is already being handled at some level by the
> >>>> vector_mask type's current support?
> >>>
> >>> Yes, I would guess so.  Of course what the middle-end is currently exposed
> >>> to is simply what the vectorizer generates - once fuzzers discover this feature
> >>> we'll see "interesting" uses that might run into missed or wrong handling of
> >>> them.
> >>>
> >>> So whatever we do on the side of exposing this to users a good portion
> >>> of testsuite coverage for the allowed use cases is important.
> >>>
> >>> Richard.
> >>>
> >>
> >> Apologies for the long-ish reply, but here's a TLDR and gory details follow.
> >>
> >> TLDR:
> >> GIMPLE's vector_mask type semantics seems to be target-dependent, so
> >> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
> >> changing vector_mask to have target-independent CFE semantics will cause
> >> dichotomy between its CFE and GFE behaviours. But vector_mask approach
> >> scales well for sizeless types. Is the solution to have something like
> >> vector_mask with defined target-independent type semantics, but call it
> >> something else to prevent conflation with GIMPLE, a viable option?
> >>
> >> Details:
> >> After some more analysis of the proposed options, here are some
> >> interesting findings:
> >>
> >> vector_mask looked like a very interesting option until I ran into some
> >> semantic uncertainly. This code:
> >>
> >> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >> typedef v8si v8sib __attribute__((vector_mask));
> >>
> >> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
> >> typedef v8hi v8hib __attribute__((vector_mask));
> >>
> >> v8si res;
> >> v8hi resh;
> >>
> >> v8hib __GIMPLE () foo (v8hib x, v8sib y)
> >> {
> >>     v8hib res;
> >>
> >>     res = x & y;
> >>     return res;
> >> }
> >>
> >> When compiled on AArch64, produces a type-mismatch error for binary
> >> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
> >>    have a different target-layout. If the layout of these two 'derived'
> >> types match, then the above code has no issue. Which is the case on
> >> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
> >> scalar DImode mask mode). IoW such code seems to be allowed on some
> >> targets and not on others.
> >>
> >> With the same code, I tried putting casts and it worked fine on AArch64
> >> and amdgcn. This target-specific behaviour of vector_mask derived types
> >> will be difficult to specify once we move it to the CFE - in fact we
> >> probably don't want target-specific behaviour once it moves to the CFE.
> >>
> >> If we expose vector_mask to CFE, we'd have to specify consistent
> >> semantics for vector_mask types. We'd have to resolve ambiguities like
> >> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
> >> system involving vector_mask. If we do this, don't we run the risk of a
> >> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
> >> we'd want to retain vector_mask semantics as they are in GIMPLE.
> >>
> >> If we want to enforce constant semantics for vector_mask in the CFE, one
> >> way is to treat vector_mask types as distinct if they're 'attached' to
> >> distinct data vector types. In such a scenario, vector_mask types
> >> attached to two data vector types with the same lane-width and number of
> >> lanes would be classified as distinct. For eg:
> >>
> >> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >> typedef v8si v8sib __attribute__((vector_mask));
> >>
> >> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
> >> typedef v8sf v8sfb __attribute__((vector_mask));
> >>
> >> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
> >> {
> >>     (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
> >> }
> >>
> >> This could be the case for unsigned vs signed int vectors too for eg -
> >> seems a bit unnecessary tbh.
> >>
> >> Though vector_mask's being 'attached' to a type has its drawbacks, it
> >> does seem to have an advantage when sizeless types are considered. If we
> >> have to define a sizeless vector boolean type that is implied by the
> >> lane size, we could do something like
> >>
> >> typedef svint32_t svbool32_t __attribute__((vector_mask));
> >>
> >> int32_t foo (svint32_t a, svint32_t b)
> >> {
> >>     svbool32_t pred = a > b;
> >>
> >>     return pred[2] ? a[2] : b[2];
> >> }
> >>
> >> This is harder to do in the other schemes proposed so far as they're
> >> size-based.
> >>
> >> To be able to free the boolean from the base type (not size) and retain
> >> vector_mask's flexibility to declare sizeless types, we could have an
> >> attribute that is more flexibly-typed and only 'derives' the lane-size
> >> and number of lanes from its 'base' type without actually inheriting the
> >> actual base type(char, short, int etc) or its signedness. This creates a
> >> purer and stand-alone boolean type without the associated semantics'
> >> complexity of having to cast between two same-size types with the same
> >> number of lanes. Eg.
> >>
> >> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >> typedef v8si v8b __attribute__((vector_bool));
> >>
> >> However, with differing lane-sizes, there will have to be a cast as the
> >> 'derived' element size is different which could impact the layout of the
> >> vector mask. Eg.
> >>
> >> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
> >> {
> >>     (v8sib)(x == y) & (i == j) ? i : (v8si){0};
> >> }
> >>
> >> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
> >> non-trivial on SVE (depending on the implemented layout of the bool vector).
> >>
> >> vector_bool decouples us from having to retain the behaviour of
> >> vector_mask and provides the flexibility of not having to cast across
> >> same-element-size vector types. Wrt to sizeless types, it could scale well.
> >>
> >> typedef svint32_t svbool32_t __attribute__((vector_bool));
> >> typedef svint16_t svbool16_t __attribute__((vector_bool));
> >>
> >> int32_t foo (svint32_t a, svint32_t b)
> >> {
> >>     svbool32_t pred = a > b;
> >>
> >>     return pred[2] ? a[2] : b[2];
> >> }
> >>
> >> int16_t bar (svint16_t a, svint16_t b)
> >> {
> >>     svbool16_t pred = a > b;
> >>
> >>     return pred[2] ? a[2] : b[2];
> >> }
> >>
> >> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
> >> the target predicate.
> >>
> >> Thoughts?
> >
> > The GIMPLE frontend accepts just what is valid on the target here.  Any
> > "plumbing" such as implicit conversions (if we do not want to require
> > explicit ones even when NOP) need to be done/enforced by the C frontend.
> >
>
> Sorry, I'm not sure I follow - correct me if I'm wrong here.
>
> If we desire to define/allow operations like implicit/explicit
> conversion on vector_mask types in CFE, don't we have to start from a
> position of defining what makes vector_mask types distinct and therefore
> require implicit/explicit conversions?

We need to look at which operations we want to produce vector masks and
which operations consume them and what operations operate on them.

In GIMPLE comparisons produce them, conditionals consume them and
we allow bitwise ops to operate on them directly (GIMPLE doesn't have
logical && it just has bitwise &).

> IIUC, GFE's distinctness of vector_mask types depends on how the mask
> mode is implemented on the target. If implemented in CFE, vector_mask
> types' distinctness probably shouldn't be based on target layout and
> could be based on the type they're 'attached' to.

But since we eventually run on the target the layout should ideally
match that of the target.  Now, the question is whether that's ever
OK behavior - it effectively makes the mask somewhat opaque and
only "observable" by probing it in defined manners.

> Wouldn't that diverge from target-specific GFE behaviour - or are you
> suggesting its OK for vector_mask type semantics to be different in CFE
> and GFE?

It's definitely undesirable but as said I'm not sure it has to differ
[the layout].

> > There's one issue I can see that wasn't mentioned yet - GCC currently
> > accepts
> >
> > typedef long gv1024di __attribute__((vector_size(1024*8)));
> >
> > even if there's no underlying support on the target which either has support
> > only for smaller vectors or no vectors at all.  Currently vector_mask will
> > simply fail to produce sth desirable here.  What's your idea of making
> > that not target dependent?  GCC will later lower operations with such
> > vectors, possibly splitting them up into sizes supported by the hardware
> > natively, possibly performing elementwise operations.  For the former
> > one would need to guess the "decomposition type" and based on that
> > select the mask type [layout]?
> >
> > One idea would be to specify the mask layout follows the largest vector
> > kind supported by the target and if there is none follow the layout
> > of (signed?) _Bool [n]?  When there's no target support for vectors
> > GCC will generally use elementwise operations apart from some
> > special-cases.
> >
>
> That is a very good point - thanks for raising it. For when GCC chooses
> to lower to a vector type supported by the target, my initial thought
> would be to, as you say, choose a mask that has enough bits to represent
> the largest vector size with the smallest lane-width. The actual layout
> of the mask will depend on how the target implements its mask mode.
> Decomposition of vector_mask ought to follow the decomposition of the
> GNU vectors type and each decomposed vector_mask type ought to have
> enough bits to represent the decomposed GNU vector shape. It sounds nice
> on paper, but I haven't really worked through a design for this. Do you
> see any gotchas here?

Not really.  In the end it comes down to what the C writer is allowed to
do with a vector mask.  I would for example expect that I could do

 auto m = v1 < v2;
 _mm512_mask_sub_epi32 (a, m, b, c);

so generic masks should inter-operate with intrinsics (when the appropriate
ISA is enabled).  That works for the data vectors themselves for example
(quite some intrinsics are implemented with GCCs generic vector code).

I for example can't do

  _Bool lane2 = m[2];

to inspect lane two of a maks with AVX512.  I can do m & 2 but I wouldn't expect
that to work (should I?) with a vector_mask mask (it's at least not
valid directly
in GIMPLE).  There's _mm512_int2mask and _mm512_mask2int which transfer
between mask and int (but the mask types are really just typedefd to
integer typeS).

> > While using a different name than vector_mask is certainly possible
> > it wouldn't me to decide that, but I'm also not yet convinced it's
> > really necessary.  As said, what the GIMPLE frontend accepts
> > or not shouldn't limit us here - just the actual chosen layout of the
> > boolean vectors.
> >
>
> I'm just concerned about creating an alternate vector_mask functionality
> in the CFE and risk not being consistent with GFE.

I think it's more important to double-check usablilty from the users side.
If the implementation necessarily diverges from GIMPLE then we can
choose a different attribute name but then it will also inevitably have
code-generation (quality) issues as GIMPLE matches what the hardware
can do.

Richard.

> Thanks,
> Tejas.
>
> > Richard.
> >
> >> Thanks,
> >> Tejas.
> >>
> >>>>
> >>>> Thanks,
> >>>> Tejas.
> >>>>
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Tejas.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Richard.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Tejas.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
> >>>>>>>>
> >>>>>>>>      typedef bool vbool __attribute__ ((vector_size (1)));
> >>>>>>>>
> >>>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> >>>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
> >>>>>>>> like SVE, it would be natural to have each bit control a byte of the target
> >>>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> >>>>>>>> bit would control one element/lane on the target vector(therefore resulting in a
> >>>>>>>> 'packed' layout with all significant bits at the LSB).
> >>>>>>>>
> >>>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
> >>>>>>>>
> >>>>>>>>      typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> >>>>>>>>      typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> >>>>>>>>
> >>>>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
> >>>>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
> >>>>>>>> vector can represent upto 'n' vector lanes. The size of the type is
> >>>>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> >>>>>>>> example is 1.
> >>>>>>>>
> >>>>>>>> In this approach, because of the nature of the representation, the n bits required
> >>>>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> >>>>>>>> align with the SVE approach of each bit representing a byte of the target vector
> >>>>>>>> and PBV therefore having an 'unpacked' layout.
> >>>>>>>>
> >>>>>>>> More importantly, another drawback here is that the change in units for vector_size
> >>>>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
> >>>>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
> >>>>>>>> the bool vector - it is fixed.
> >>>>>>>>
> >>>>>>>> 3. Combination of 1 and 2.
> >>>>>>>>
> >>>>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> >>>>>>>> unambiguously represent the layout of the PBV. Consider
> >>>>>>>>
> >>>>>>>>      typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >>>>>>>>
> >>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> >>>>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> >>>>>>>> allow a target to force a certain layout of the PBV.
> >>>>>>>>
> >>>>>>>> The 2-parameter form of vector_size allows the target to have an
> >>>>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
> >>>>>>>> if it is not specified to mirror the target layout of predicate registers. For
> >>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >>>>>>>>
> >>>>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> >>>>>>>> 8 lanes of boolean which could be represented by
> >>>>>>>>
> >>>>>>>>      typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >>>>>>>>
> >>>>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> >>>>>>>>
> >>>>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> >>>>>>>> be significant i.e. w == 1.
> >>>>>>>>
> >>>>>>>> This scheme would accomodate more than 1 target to effectively represent vector
> >>>>>>>> bools that mirror the target properties.
> >>>>>>>>
> >>>>>>>> 4. A new attribite
> >>>>>>>>
> >>>>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> >>>>>>>> attribute to define the PBV and make it general enough to
> >>>>>>>>
> >>>>>>>> * represent all targets flexibly (SVE, AVX etc)
> >>>>>>>> * represent sub-byte length predicates
> >>>>>>>> * have no change in units of vector_size/no new vector_size signature
> >>>>>>>> * not have the number of bytes constrain representation
> >>>>>>>>
> >>>>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
> >>>>>>>>
> >>>>>>>>      typedef bool vbool __attribute__((bool_vec (n[, w])))
> >>>>>>>>
> >>>>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> >>>>>>>>
> >>>>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> >>>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >>>>>>>>
> >>>>>>>> 5. Behaviour of the packed vector boolean type.
> >>>>>>>>
> >>>>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
> >>>>>>>>
> >>>>>>>> * ABI
> >>>>>>>>
> >>>>>>>>      New ABI rules will need to be defined for this type - eg alignment, PCS,
> >>>>>>>>      mangling etc
> >>>>>>>>
> >>>>>>>> * Initialization:
> >>>>>>>>
> >>>>>>>>      Packed Boolean Vectors(PBV) can be initialized like so:
> >>>>>>>>
> >>>>>>>>        typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >>>>>>>>        v4bi p = {false, true, false, false};
> >>>>>>>>
> >>>>>>>>      Each value in the initizlizer constant is of type bool. The lowest numbered
> >>>>>>>>      element in the const array corresponds to the LSbit of p, element 1 is
> >>>>>>>>      assigned to bit 4 etc.
> >>>>>>>>
> >>>>>>>>      p is effectively a 2-byte bitmask with value 0x0010
> >>>>>>>>
> >>>>>>>>      With a different layout
> >>>>>>>>
> >>>>>>>>        typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >>>>>>>>        v4bi p = {false, true, false, false};
> >>>>>>>>
> >>>>>>>>      p is effectively a 2-byte bitmask with value 0x0002
> >>>>>>>>
> >>>>>>>> * Operations:
> >>>>>>>>
> >>>>>>>>      Packed Boolean Vectors support the following operations:
> >>>>>>>>      . unary ~
> >>>>>>>>      . unary !
> >>>>>>>>      . binary&,|andˆ
> >>>>>>>>      . assignments &=, |= and ˆ=
> >>>>>>>>      . comparisons <, <=, ==, !=, >= and >
> >>>>>>>>      . Ternary operator ?:
> >>>>>>>>
> >>>>>>>>      Operations are defined as applied to the individual elements i.e the bits
> >>>>>>>>      that are significant in the PBV. Whether the PBVs are treated as bitmasks
> >>>>>>>>      or otherwise is implementation-defined.
> >>>>>>>>
> >>>>>>>>      Insignificant bits could affect results of comparisons or ternary operators.
> >>>>>>>>      In such cases, it is implementation defined how the unused bits are treated.
> >>>>>>>>
> >>>>>>>>      . Subscript operator []
> >>>>>>>>
> >>>>>>>>      For the subscript operator, the packed boolean vector acts like a array of
> >>>>>>>>      elements - the first or the 0th indexed element being the LSbit of the PBV.
> >>>>>>>>      Subscript operator yields a scalar boolean value.
> >>>>>>>>      For example:
> >>>>>>>>
> >>>>>>>>        typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >>>>>>>>
> >>>>>>>>        // Subscript operator result yields a boolean value.
> >>>>>>>>        // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> >>>>>>>>        bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >>>>>>>>
> >>>>>>>>      Out of bounds access: OOB access can be determined at compile time given the
> >>>>>>>>      strong typing of the PBVs.
> >>>>>>>>
> >>>>>>>>      PBV does not support address of operator(&) for elements of PBVs.
> >>>>>>>>
> >>>>>>>>      . Implicit conversion from integer vectors to PBVs
> >>>>>>>>
> >>>>>>>>      We would like to support the output of comparison operations to be PBVs. This
> >>>>>>>>      requires us to define the implicit conversion from an integer vector to PBV
> >>>>>>>>      as the result of vector comparisons are integer vectors.
> >>>>>>>>
> >>>>>>>>      To define this operation:
> >>>>>>>>
> >>>>>>>>        bool_vector = vector <cmpop> vector
> >>>>>>>>
> >>>>>>>>      There is no change in how vector <cmpop> vector behavior i.e. this comparison
> >>>>>>>>      would still produce an int_vector type as it does now.
> >>>>>>>>
> >>>>>>>>        temp_int_vec = vector <cmpop> vector
> >>>>>>>>        bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> >>>>>>>>
> >>>>>>>>      The implicit conversion from int_vec to bool I'd define simply to be:
> >>>>>>>>
> >>>>>>>>        bool_vec[n] = (_Bool) int_vec[n]
> >>>>>>>>
> >>>>>>>>      where the C11 standard rules apply
> >>>>>>>>      6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> >>>>>>>>      is 0 if the value compares equal to 0; otherwise, the result is 1.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> >>>>>>>> [2] https://reviews.llvm.org/D88905
> >>>>>>>> [3] https://reviews.llvm.org/D81083
> >>>>>>>>
> >>>>>>>> Thoughts?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tejas.
> >>>>
> >>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-17 12:16                     ` Richard Biener
@ 2023-07-26  7:21                       ` Tejas Belagod
  2023-07-26 12:26                         ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tejas Belagod @ 2023-07-26  7:21 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

On 7/17/23 5:46 PM, Richard Biener wrote:
> On Fri, Jul 14, 2023 at 12:18 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
>>
>> On 7/13/23 4:05 PM, Richard Biener wrote:
>>> On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
>>>>
>>>> On 7/3/23 1:31 PM, Richard Biener wrote:
>>>>> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
>>>>>>
>>>>>> On 6/29/23 6:55 PM, Richard Biener wrote:
>>>>>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
>>>>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
>>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>>>>>>
>>>>>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
>>>>>>>>> Date: Monday, June 26, 2023 at 2:23 PM
>>>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
>>>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
>>>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
>>>>>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Packed Boolean Vectors
>>>>>>>>>> ----------------------
>>>>>>>>>>
>>>>>>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
>>>>>>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
>>>>>>>>>> been implemented in Clang recently[2].
>>>>>>>>>>
>>>>>>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
>>>>>>>>>> it is a useful feature to have to model predication on targets.  This could
>>>>>>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
>>>>>>>>>> mapped to underlying target features.  For example, the packed boolean vector
>>>>>>>>>> could directly map to a predicate register on SVE.
>>>>>>>>>>
>>>>>>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
>>>>>>>>>> intrinsics to replace a fixed-length svbool_t.
>>>>>>>>>>
>>>>>>>>>> Here are a few options to represent the packed boolean vector type.
>>>>>>>>>
>>>>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
>>>>>>>>>
>>>>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>>>>>>>> typedef v8si v8sib __attribute__((vector_mask));
>>>>>>>>>
>>>>>>>>> it get's you a vector type that's the appropriate (dependent on the
>>>>>>>>> target) vector
>>>>>>>>> mask type for the vector data type (v8si in this case).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks Richard.
>>>>>>>>>
>>>>>>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
>>>>>>>>>
>>>>>>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
>>>>>>>>>
>>>>>>>>> defined in way that is target-friendly (and I don't know how much effort it will
>>>>>>>>>
>>>>>>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
>>>>>>>>>
>>>>>>>>> Let me go back and dig a bit deeper and get back with questions if any.
>>>>>>>>
>>>>>>>>
>>>>>>>> Let me add that the advantage of this is the compiler doesn't need
>>>>>>>> to support weird explicitely laid out packed boolean vectors that do
>>>>>>>> not match what the target supports and the user doesn't need to know
>>>>>>>> what the target supports (and thus have an #ifdef maze around explicitely
>>>>>>>> specified layouts).
>>>>>>>>
>>>>>>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
>>>>>>>>
>>>>>>>> to avoid having to sprinkle the code with ifdefs.
>>>>>>>>
>>>>>>>>
>>>>>>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
>>>>>>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
>>>>>>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
>>>>>>>> available as "packed boolean vectors", though they are of course in fact
>>>>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
>>>>>>>> case it might not matter.
>>>>>>>>
>>>>>>>> That said, the vector_mask attribute will get you V4SImode vectors with
>>>>>>>> signed boolean elements of 32 bits for V4SImode data vectors with
>>>>>>>> SSE2/AVX2.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
>>>>>>>>
>>>>>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
>>>>>>>>
>>>>>>>> and a ‘w’ specified for the type.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
>>>>>>>>
>>>>>>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
>>>>>>>>
>>>>>>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
>>>>>>>>
>>>>>>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
>>>>>>>>
>>>>>>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
>>>>>>>>
>>>>>>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
>>>>>>>>
>>>>>>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
>>>>>>>>
>>>>>>>> to the CFE – initialization and subscript operations?
>>>>>>>
>>>>>>> Yes, either that or restrict how the mask vectors can be used, thus
>>>>>>> properly diagnose improper
>>>>>>> uses.
>>>>>>
>>>>>> Indeed.
>>>>>>
>>>>>>      A question would be for example how to write common mask test
>>>>>>> operations like
>>>>>>> if (any (mask)) or if (all (mask)).
>>>>>>
>>>>>> I see 2 options here. New builtins could support new types - they'd
>>>>>> provide a target independent way to test any and all conditions. Another
>>>>>> would be to let the target use its intrinsics to do them in the most
>>>>>> efficient way possible (which the builtins would get lowered down to
>>>>>> anyway).
>>>>>>
>>>>>>
>>>>>>      Likewise writing merge operations
>>>>>>> - do those as
>>>>>>>
>>>>>>>      a = a | (mask ? b : 0);
>>>>>>>
>>>>>>> thus use ternary ?: for this?
>>>>>>
>>>>>> Yes, like now, the ternary could just translate to
>>>>>>
>>>>>>       {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
>>>>>>
>>>>>> One thing to flesh out is the semantics. Should we allow this operation
>>>>>> as long as the number of elements are the same even if the mask type if
>>>>>> different i.e.
>>>>>>
>>>>>>       v4hib ? v4si : v4si;
>>>>>>
>>>>>> I don't see why this can't be allowed as now we let
>>>>>>
>>>>>>       v4si ? v4sf : v4sf;
>>>>>>
>>>>>>
>>>>>> For initialization regular vector
>>>>>>> syntax should work:
>>>>>>>
>>>>>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
>>>>>>>
>>>>>>> there's the question of the signedness of the mask elements.  GCC
>>>>>>> internally uses signed
>>>>>>> bools with values -1 for true and 0 for false.
>>>>>>
>>>>>> One of the things is the value that represents true. This is largely
>>>>>> target-dependent when it comes to the vector_mask type. When vector_mask
>>>>>> types are created from GCC's internal representation of bool vectors
>>>>>> (signed ints) the point about implicit/explicit conversions from signed
>>>>>> int vect to mask types in the proposal covers this. So mask in
>>>>>>
>>>>>>       v4sib mask = (v4sib){-1, -1, 0, 0, ... }
>>>>>>
>>>>>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
>>>>>> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
>>>>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
>>>>>> new mask type representations will have in the mid-end while being
>>>>>> converted back and forth to and from GCC's internal representation, but
>>>>>> I'm guessing this is already being handled at some level by the
>>>>>> vector_mask type's current support?
>>>>>
>>>>> Yes, I would guess so.  Of course what the middle-end is currently exposed
>>>>> to is simply what the vectorizer generates - once fuzzers discover this feature
>>>>> we'll see "interesting" uses that might run into missed or wrong handling of
>>>>> them.
>>>>>
>>>>> So whatever we do on the side of exposing this to users a good portion
>>>>> of testsuite coverage for the allowed use cases is important.
>>>>>
>>>>> Richard.
>>>>>
>>>>
>>>> Apologies for the long-ish reply, but here's a TLDR and gory details follow.
>>>>
>>>> TLDR:
>>>> GIMPLE's vector_mask type semantics seems to be target-dependent, so
>>>> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
>>>> changing vector_mask to have target-independent CFE semantics will cause
>>>> dichotomy between its CFE and GFE behaviours. But vector_mask approach
>>>> scales well for sizeless types. Is the solution to have something like
>>>> vector_mask with defined target-independent type semantics, but call it
>>>> something else to prevent conflation with GIMPLE, a viable option?
>>>>
>>>> Details:
>>>> After some more analysis of the proposed options, here are some
>>>> interesting findings:
>>>>
>>>> vector_mask looked like a very interesting option until I ran into some
>>>> semantic uncertainly. This code:
>>>>
>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>>> typedef v8si v8sib __attribute__((vector_mask));
>>>>
>>>> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
>>>> typedef v8hi v8hib __attribute__((vector_mask));
>>>>
>>>> v8si res;
>>>> v8hi resh;
>>>>
>>>> v8hib __GIMPLE () foo (v8hib x, v8sib y)
>>>> {
>>>>      v8hib res;
>>>>
>>>>      res = x & y;
>>>>      return res;
>>>> }
>>>>
>>>> When compiled on AArch64, produces a type-mismatch error for binary
>>>> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
>>>>     have a different target-layout. If the layout of these two 'derived'
>>>> types match, then the above code has no issue. Which is the case on
>>>> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
>>>> scalar DImode mask mode). IoW such code seems to be allowed on some
>>>> targets and not on others.
>>>>
>>>> With the same code, I tried putting casts and it worked fine on AArch64
>>>> and amdgcn. This target-specific behaviour of vector_mask derived types
>>>> will be difficult to specify once we move it to the CFE - in fact we
>>>> probably don't want target-specific behaviour once it moves to the CFE.
>>>>
>>>> If we expose vector_mask to CFE, we'd have to specify consistent
>>>> semantics for vector_mask types. We'd have to resolve ambiguities like
>>>> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
>>>> system involving vector_mask. If we do this, don't we run the risk of a
>>>> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
>>>> we'd want to retain vector_mask semantics as they are in GIMPLE.
>>>>
>>>> If we want to enforce constant semantics for vector_mask in the CFE, one
>>>> way is to treat vector_mask types as distinct if they're 'attached' to
>>>> distinct data vector types. In such a scenario, vector_mask types
>>>> attached to two data vector types with the same lane-width and number of
>>>> lanes would be classified as distinct. For eg:
>>>>
>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>>> typedef v8si v8sib __attribute__((vector_mask));
>>>>
>>>> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
>>>> typedef v8sf v8sfb __attribute__((vector_mask));
>>>>
>>>> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
>>>> {
>>>>      (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
>>>> }
>>>>
>>>> This could be the case for unsigned vs signed int vectors too for eg -
>>>> seems a bit unnecessary tbh.
>>>>
>>>> Though vector_mask's being 'attached' to a type has its drawbacks, it
>>>> does seem to have an advantage when sizeless types are considered. If we
>>>> have to define a sizeless vector boolean type that is implied by the
>>>> lane size, we could do something like
>>>>
>>>> typedef svint32_t svbool32_t __attribute__((vector_mask));
>>>>
>>>> int32_t foo (svint32_t a, svint32_t b)
>>>> {
>>>>      svbool32_t pred = a > b;
>>>>
>>>>      return pred[2] ? a[2] : b[2];
>>>> }
>>>>
>>>> This is harder to do in the other schemes proposed so far as they're
>>>> size-based.
>>>>
>>>> To be able to free the boolean from the base type (not size) and retain
>>>> vector_mask's flexibility to declare sizeless types, we could have an
>>>> attribute that is more flexibly-typed and only 'derives' the lane-size
>>>> and number of lanes from its 'base' type without actually inheriting the
>>>> actual base type(char, short, int etc) or its signedness. This creates a
>>>> purer and stand-alone boolean type without the associated semantics'
>>>> complexity of having to cast between two same-size types with the same
>>>> number of lanes. Eg.
>>>>
>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
>>>> typedef v8si v8b __attribute__((vector_bool));
>>>>
>>>> However, with differing lane-sizes, there will have to be a cast as the
>>>> 'derived' element size is different which could impact the layout of the
>>>> vector mask. Eg.
>>>>
>>>> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
>>>> {
>>>>      (v8sib)(x == y) & (i == j) ? i : (v8si){0};
>>>> }
>>>>
>>>> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
>>>> non-trivial on SVE (depending on the implemented layout of the bool vector).
>>>>
>>>> vector_bool decouples us from having to retain the behaviour of
>>>> vector_mask and provides the flexibility of not having to cast across
>>>> same-element-size vector types. Wrt to sizeless types, it could scale well.
>>>>
>>>> typedef svint32_t svbool32_t __attribute__((vector_bool));
>>>> typedef svint16_t svbool16_t __attribute__((vector_bool));
>>>>
>>>> int32_t foo (svint32_t a, svint32_t b)
>>>> {
>>>>      svbool32_t pred = a > b;
>>>>
>>>>      return pred[2] ? a[2] : b[2];
>>>> }
>>>>
>>>> int16_t bar (svint16_t a, svint16_t b)
>>>> {
>>>>      svbool16_t pred = a > b;
>>>>
>>>>      return pred[2] ? a[2] : b[2];
>>>> }
>>>>
>>>> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
>>>> the target predicate.
>>>>
>>>> Thoughts?
>>>
>>> The GIMPLE frontend accepts just what is valid on the target here.  Any
>>> "plumbing" such as implicit conversions (if we do not want to require
>>> explicit ones even when NOP) need to be done/enforced by the C frontend.
>>>
>>
>> Sorry, I'm not sure I follow - correct me if I'm wrong here.
>>
>> If we desire to define/allow operations like implicit/explicit
>> conversion on vector_mask types in CFE, don't we have to start from a
>> position of defining what makes vector_mask types distinct and therefore
>> require implicit/explicit conversions?
> 
> We need to look at which operations we want to produce vector masks and
> which operations consume them and what operations operate on them.
> 
> In GIMPLE comparisons produce them, conditionals consume them and
> we allow bitwise ops to operate on them directly (GIMPLE doesn't have
> logical && it just has bitwise &).
> 

Thanks for your thoughts - after I spent more cycles researching and 
experimenting, I think I understand the driving factors here. Comparison 
producers generate signed integer vectors of the same lane-width as the 
comparison operands. This means mixed type vectors can't be applied to 
conditional consumers or bitwise operators eg:

   v8hi foo (v8si a, v8si b, v8hi c, v8hi d)
   {
     return a > b || c > d; // error!
     return a > b || __builtin_convertvector (c > d, v8si); // OK.
     return a | b && c | d; // error!
     return a | b && __builtin_convertvector (c | d, v8si); // OK.
   }

Similarly, if we extend these 'stricter-typing' rules to vector_mask, it 
could look like:

typedef v4sib v4si __attribute__((vector_mask));
typedef v4hib v4hi __attribute__((vector_mask));

   v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
   {
     v8sib psi = a > b;
     v8hib phi = c > d;

     return psi || phi; // error!
     return psi || __builtin_convertvector (phi, v8sib); // OK.
     return psi | phi; // error!
     return psi | __builtin_convertvector (phi, v8sib); // OK.
   }

At GIMPLE stage, on targets where the layout allows it (eg AMDGCN), 
expressions like
   psi | __builtin_convertvector (phi, v8sib)
can be optimized to
   psi | phi
because __builtin_convertvector (phi, v8sib) is a NOP.

I think this could make vector_mask more portable across targets. If one 
wants to take CFE vector_mask code and run it on the GFE, it should 
work; while the reverse won't as CFE vector_mask rules are more restrictive.

Does this look like a sensible approach for progress?

>> IIUC, GFE's distinctness of vector_mask types depends on how the mask
>> mode is implemented on the target. If implemented in CFE, vector_mask
>> types' distinctness probably shouldn't be based on target layout and
>> could be based on the type they're 'attached' to.
> 
> But since we eventually run on the target the layout should ideally
> match that of the target.  Now, the question is whether that's ever
> OK behavior - it effectively makes the mask somewhat opaque and
> only "observable" by probing it in defined manners.
> 
>> Wouldn't that diverge from target-specific GFE behaviour - or are you
>> suggesting its OK for vector_mask type semantics to be different in CFE
>> and GFE?
> 
> It's definitely undesirable but as said I'm not sure it has to differ
> [the layout].
> 

I agree it is best to have a consistent layout of vector_mask across CFE 
and GFE and also implement it to match the target layout for optimal 
code quality.

For observability, I think it makes sense to allow operations that are 
relevant and have a consistent meaning irrespective of that target. Eg. 
'vector_mask & 2' might not mean the same thing on all targets, but 
vector_mask[2] does. Therefore, I think the opaqueness is useful and 
necessary to some extent.

Thanks,
Tejas.

>>> There's one issue I can see that wasn't mentioned yet - GCC currently
>>> accepts
>>>
>>> typedef long gv1024di __attribute__((vector_size(1024*8)));
>>>
>>> even if there's no underlying support on the target which either has support
>>> only for smaller vectors or no vectors at all.  Currently vector_mask will
>>> simply fail to produce sth desirable here.  What's your idea of making
>>> that not target dependent?  GCC will later lower operations with such
>>> vectors, possibly splitting them up into sizes supported by the hardware
>>> natively, possibly performing elementwise operations.  For the former
>>> one would need to guess the "decomposition type" and based on that
>>> select the mask type [layout]?
>>>
>>> One idea would be to specify the mask layout follows the largest vector
>>> kind supported by the target and if there is none follow the layout
>>> of (signed?) _Bool [n]?  When there's no target support for vectors
>>> GCC will generally use elementwise operations apart from some
>>> special-cases.
>>>
>>
>> That is a very good point - thanks for raising it. For when GCC chooses
>> to lower to a vector type supported by the target, my initial thought
>> would be to, as you say, choose a mask that has enough bits to represent
>> the largest vector size with the smallest lane-width. The actual layout
>> of the mask will depend on how the target implements its mask mode.
>> Decomposition of vector_mask ought to follow the decomposition of the
>> GNU vectors type and each decomposed vector_mask type ought to have
>> enough bits to represent the decomposed GNU vector shape. It sounds nice
>> on paper, but I haven't really worked through a design for this. Do you
>> see any gotchas here?
> 
> Not really.  In the end it comes down to what the C writer is allowed to
> do with a vector mask.  I would for example expect that I could do
> 
>   auto m = v1 < v2;
>   _mm512_mask_sub_epi32 (a, m, b, c);
> 
> so generic masks should inter-operate with intrinsics (when the appropriate
> ISA is enabled).  That works for the data vectors themselves for example
> (quite some intrinsics are implemented with GCCs generic vector code).
> 
> I for example can't do
> 
>    _Bool lane2 = m[2];
> 
> to inspect lane two of a maks with AVX512.  I can do m & 2 but I wouldn't expect
> that to work (should I?) with a vector_mask mask (it's at least not
> valid directly
> in GIMPLE).  There's _mm512_int2mask and _mm512_mask2int which transfer
> between mask and int (but the mask types are really just typedefd to
> integer typeS).
> 
>>> While using a different name than vector_mask is certainly possible
>>> it wouldn't me to decide that, but I'm also not yet convinced it's
>>> really necessary.  As said, what the GIMPLE frontend accepts
>>> or not shouldn't limit us here - just the actual chosen layout of the
>>> boolean vectors.
>>>
>>
>> I'm just concerned about creating an alternate vector_mask functionality
>> in the CFE and risk not being consistent with GFE.
> 
> I think it's more important to double-check usablilty from the users side.
> If the implementation necessarily diverges from GIMPLE then we can
> choose a different attribute name but then it will also inevitably have
> code-generation (quality) issues as GIMPLE matches what the hardware
> can do.
> 
> Richard.
> 
>> Thanks,
>> Tejas.
>>
>>> Richard.
>>>
>>>> Thanks,
>>>> Tejas.
>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Tejas.
>>>>>>
>>>>>>>
>>>>>>> Richard.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Tejas.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Richard.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Tejas.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
>>>>>>>>>>
>>>>>>>>>>       typedef bool vbool __attribute__ ((vector_size (1)));
>>>>>>>>>>
>>>>>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
>>>>>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
>>>>>>>>>> like SVE, it would be natural to have each bit control a byte of the target
>>>>>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
>>>>>>>>>> bit would control one element/lane on the target vector(therefore resulting in a
>>>>>>>>>> 'packed' layout with all significant bits at the LSB).
>>>>>>>>>>
>>>>>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
>>>>>>>>>>
>>>>>>>>>>       typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
>>>>>>>>>>       typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
>>>>>>>>>>
>>>>>>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
>>>>>>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
>>>>>>>>>> vector can represent upto 'n' vector lanes. The size of the type is
>>>>>>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
>>>>>>>>>> example is 1.
>>>>>>>>>>
>>>>>>>>>> In this approach, because of the nature of the representation, the n bits required
>>>>>>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
>>>>>>>>>> align with the SVE approach of each bit representing a byte of the target vector
>>>>>>>>>> and PBV therefore having an 'unpacked' layout.
>>>>>>>>>>
>>>>>>>>>> More importantly, another drawback here is that the change in units for vector_size
>>>>>>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
>>>>>>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
>>>>>>>>>> the bool vector - it is fixed.
>>>>>>>>>>
>>>>>>>>>> 3. Combination of 1 and 2.
>>>>>>>>>>
>>>>>>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
>>>>>>>>>> unambiguously represent the layout of the PBV. Consider
>>>>>>>>>>
>>>>>>>>>>       typedef bool vbool __attribute__((vector_size (s, n[, w])));
>>>>>>>>>>
>>>>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
>>>>>>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
>>>>>>>>>> allow a target to force a certain layout of the PBV.
>>>>>>>>>>
>>>>>>>>>> The 2-parameter form of vector_size allows the target to have an
>>>>>>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
>>>>>>>>>> if it is not specified to mirror the target layout of predicate registers. For
>>>>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
>>>>>>>>>>
>>>>>>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
>>>>>>>>>> 8 lanes of boolean which could be represented by
>>>>>>>>>>
>>>>>>>>>>       typedef bool v8b __attribute__ ((vector_size (2, 8)));
>>>>>>>>>>
>>>>>>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
>>>>>>>>>>
>>>>>>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
>>>>>>>>>> be significant i.e. w == 1.
>>>>>>>>>>
>>>>>>>>>> This scheme would accomodate more than 1 target to effectively represent vector
>>>>>>>>>> bools that mirror the target properties.
>>>>>>>>>>
>>>>>>>>>> 4. A new attribite
>>>>>>>>>>
>>>>>>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
>>>>>>>>>> attribute to define the PBV and make it general enough to
>>>>>>>>>>
>>>>>>>>>> * represent all targets flexibly (SVE, AVX etc)
>>>>>>>>>> * represent sub-byte length predicates
>>>>>>>>>> * have no change in units of vector_size/no new vector_size signature
>>>>>>>>>> * not have the number of bytes constrain representation
>>>>>>>>>>
>>>>>>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
>>>>>>>>>>
>>>>>>>>>>       typedef bool vbool __attribute__((bool_vec (n[, w])))
>>>>>>>>>>
>>>>>>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
>>>>>>>>>>
>>>>>>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
>>>>>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
>>>>>>>>>>
>>>>>>>>>> 5. Behaviour of the packed vector boolean type.
>>>>>>>>>>
>>>>>>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
>>>>>>>>>>
>>>>>>>>>> * ABI
>>>>>>>>>>
>>>>>>>>>>       New ABI rules will need to be defined for this type - eg alignment, PCS,
>>>>>>>>>>       mangling etc
>>>>>>>>>>
>>>>>>>>>> * Initialization:
>>>>>>>>>>
>>>>>>>>>>       Packed Boolean Vectors(PBV) can be initialized like so:
>>>>>>>>>>
>>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
>>>>>>>>>>         v4bi p = {false, true, false, false};
>>>>>>>>>>
>>>>>>>>>>       Each value in the initizlizer constant is of type bool. The lowest numbered
>>>>>>>>>>       element in the const array corresponds to the LSbit of p, element 1 is
>>>>>>>>>>       assigned to bit 4 etc.
>>>>>>>>>>
>>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0010
>>>>>>>>>>
>>>>>>>>>>       With a different layout
>>>>>>>>>>
>>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
>>>>>>>>>>         v4bi p = {false, true, false, false};
>>>>>>>>>>
>>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0002
>>>>>>>>>>
>>>>>>>>>> * Operations:
>>>>>>>>>>
>>>>>>>>>>       Packed Boolean Vectors support the following operations:
>>>>>>>>>>       . unary ~
>>>>>>>>>>       . unary !
>>>>>>>>>>       . binary&,|andˆ
>>>>>>>>>>       . assignments &=, |= and ˆ=
>>>>>>>>>>       . comparisons <, <=, ==, !=, >= and >
>>>>>>>>>>       . Ternary operator ?:
>>>>>>>>>>
>>>>>>>>>>       Operations are defined as applied to the individual elements i.e the bits
>>>>>>>>>>       that are significant in the PBV. Whether the PBVs are treated as bitmasks
>>>>>>>>>>       or otherwise is implementation-defined.
>>>>>>>>>>
>>>>>>>>>>       Insignificant bits could affect results of comparisons or ternary operators.
>>>>>>>>>>       In such cases, it is implementation defined how the unused bits are treated.
>>>>>>>>>>
>>>>>>>>>>       . Subscript operator []
>>>>>>>>>>
>>>>>>>>>>       For the subscript operator, the packed boolean vector acts like a array of
>>>>>>>>>>       elements - the first or the 0th indexed element being the LSbit of the PBV.
>>>>>>>>>>       Subscript operator yields a scalar boolean value.
>>>>>>>>>>       For example:
>>>>>>>>>>
>>>>>>>>>>         typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
>>>>>>>>>>
>>>>>>>>>>         // Subscript operator result yields a boolean value.
>>>>>>>>>>         // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
>>>>>>>>>>         bool foo (v8b p, int n) { p[3] = true; return p[1]; }
>>>>>>>>>>
>>>>>>>>>>       Out of bounds access: OOB access can be determined at compile time given the
>>>>>>>>>>       strong typing of the PBVs.
>>>>>>>>>>
>>>>>>>>>>       PBV does not support address of operator(&) for elements of PBVs.
>>>>>>>>>>
>>>>>>>>>>       . Implicit conversion from integer vectors to PBVs
>>>>>>>>>>
>>>>>>>>>>       We would like to support the output of comparison operations to be PBVs. This
>>>>>>>>>>       requires us to define the implicit conversion from an integer vector to PBV
>>>>>>>>>>       as the result of vector comparisons are integer vectors.
>>>>>>>>>>
>>>>>>>>>>       To define this operation:
>>>>>>>>>>
>>>>>>>>>>         bool_vector = vector <cmpop> vector
>>>>>>>>>>
>>>>>>>>>>       There is no change in how vector <cmpop> vector behavior i.e. this comparison
>>>>>>>>>>       would still produce an int_vector type as it does now.
>>>>>>>>>>
>>>>>>>>>>         temp_int_vec = vector <cmpop> vector
>>>>>>>>>>         bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
>>>>>>>>>>
>>>>>>>>>>       The implicit conversion from int_vec to bool I'd define simply to be:
>>>>>>>>>>
>>>>>>>>>>         bool_vec[n] = (_Bool) int_vec[n]
>>>>>>>>>>
>>>>>>>>>>       where the C11 standard rules apply
>>>>>>>>>>       6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
>>>>>>>>>>       is 0 if the value compares equal to 0; otherwise, the result is 1.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
>>>>>>>>>> [2] https://reviews.llvm.org/D88905
>>>>>>>>>> [3] https://reviews.llvm.org/D81083
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Tejas.
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-26  7:21                       ` Tejas Belagod
@ 2023-07-26 12:26                         ` Richard Biener
  2023-07-26 12:33                           ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-07-26 12:26 UTC (permalink / raw)
  To: Tejas Belagod; +Cc: gcc-patches

On Wed, Jul 26, 2023 at 9:21 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
>
> On 7/17/23 5:46 PM, Richard Biener wrote:
> > On Fri, Jul 14, 2023 at 12:18 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >>
> >> On 7/13/23 4:05 PM, Richard Biener wrote:
> >>> On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >>>>
> >>>> On 7/3/23 1:31 PM, Richard Biener wrote:
> >>>>> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >>>>>>
> >>>>>> On 6/29/23 6:55 PM, Richard Biener wrote:
> >>>>>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
> >>>>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
> >>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>>>>
> >>>>>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
> >>>>>>>>> Date: Monday, June 26, 2023 at 2:23 PM
> >>>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> >>>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> >>>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>>>>>
> >>>>>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> >>>>>>>>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Packed Boolean Vectors
> >>>>>>>>>> ----------------------
> >>>>>>>>>>
> >>>>>>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
> >>>>>>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> >>>>>>>>>> been implemented in Clang recently[2].
> >>>>>>>>>>
> >>>>>>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
> >>>>>>>>>> it is a useful feature to have to model predication on targets.  This could
> >>>>>>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
> >>>>>>>>>> mapped to underlying target features.  For example, the packed boolean vector
> >>>>>>>>>> could directly map to a predicate register on SVE.
> >>>>>>>>>>
> >>>>>>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> >>>>>>>>>> intrinsics to replace a fixed-length svbool_t.
> >>>>>>>>>>
> >>>>>>>>>> Here are a few options to represent the packed boolean vector type.
> >>>>>>>>>
> >>>>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
> >>>>>>>>>
> >>>>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>>>>>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>>>>>>
> >>>>>>>>> it get's you a vector type that's the appropriate (dependent on the
> >>>>>>>>> target) vector
> >>>>>>>>> mask type for the vector data type (v8si in this case).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks Richard.
> >>>>>>>>>
> >>>>>>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
> >>>>>>>>>
> >>>>>>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
> >>>>>>>>>
> >>>>>>>>> defined in way that is target-friendly (and I don't know how much effort it will
> >>>>>>>>>
> >>>>>>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
> >>>>>>>>>
> >>>>>>>>> Let me go back and dig a bit deeper and get back with questions if any.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Let me add that the advantage of this is the compiler doesn't need
> >>>>>>>> to support weird explicitely laid out packed boolean vectors that do
> >>>>>>>> not match what the target supports and the user doesn't need to know
> >>>>>>>> what the target supports (and thus have an #ifdef maze around explicitely
> >>>>>>>> specified layouts).
> >>>>>>>>
> >>>>>>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
> >>>>>>>>
> >>>>>>>> to avoid having to sprinkle the code with ifdefs.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
> >>>>>>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
> >>>>>>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
> >>>>>>>> available as "packed boolean vectors", though they are of course in fact
> >>>>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
> >>>>>>>> case it might not matter.
> >>>>>>>>
> >>>>>>>> That said, the vector_mask attribute will get you V4SImode vectors with
> >>>>>>>> signed boolean elements of 32 bits for V4SImode data vectors with
> >>>>>>>> SSE2/AVX2.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
> >>>>>>>>
> >>>>>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
> >>>>>>>>
> >>>>>>>> and a ‘w’ specified for the type.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
> >>>>>>>>
> >>>>>>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
> >>>>>>>>
> >>>>>>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
> >>>>>>>>
> >>>>>>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
> >>>>>>>>
> >>>>>>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
> >>>>>>>>
> >>>>>>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
> >>>>>>>>
> >>>>>>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
> >>>>>>>>
> >>>>>>>> to the CFE – initialization and subscript operations?
> >>>>>>>
> >>>>>>> Yes, either that or restrict how the mask vectors can be used, thus
> >>>>>>> properly diagnose improper
> >>>>>>> uses.
> >>>>>>
> >>>>>> Indeed.
> >>>>>>
> >>>>>>      A question would be for example how to write common mask test
> >>>>>>> operations like
> >>>>>>> if (any (mask)) or if (all (mask)).
> >>>>>>
> >>>>>> I see 2 options here. New builtins could support new types - they'd
> >>>>>> provide a target independent way to test any and all conditions. Another
> >>>>>> would be to let the target use its intrinsics to do them in the most
> >>>>>> efficient way possible (which the builtins would get lowered down to
> >>>>>> anyway).
> >>>>>>
> >>>>>>
> >>>>>>      Likewise writing merge operations
> >>>>>>> - do those as
> >>>>>>>
> >>>>>>>      a = a | (mask ? b : 0);
> >>>>>>>
> >>>>>>> thus use ternary ?: for this?
> >>>>>>
> >>>>>> Yes, like now, the ternary could just translate to
> >>>>>>
> >>>>>>       {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
> >>>>>>
> >>>>>> One thing to flesh out is the semantics. Should we allow this operation
> >>>>>> as long as the number of elements are the same even if the mask type if
> >>>>>> different i.e.
> >>>>>>
> >>>>>>       v4hib ? v4si : v4si;
> >>>>>>
> >>>>>> I don't see why this can't be allowed as now we let
> >>>>>>
> >>>>>>       v4si ? v4sf : v4sf;
> >>>>>>
> >>>>>>
> >>>>>> For initialization regular vector
> >>>>>>> syntax should work:
> >>>>>>>
> >>>>>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
> >>>>>>>
> >>>>>>> there's the question of the signedness of the mask elements.  GCC
> >>>>>>> internally uses signed
> >>>>>>> bools with values -1 for true and 0 for false.
> >>>>>>
> >>>>>> One of the things is the value that represents true. This is largely
> >>>>>> target-dependent when it comes to the vector_mask type. When vector_mask
> >>>>>> types are created from GCC's internal representation of bool vectors
> >>>>>> (signed ints) the point about implicit/explicit conversions from signed
> >>>>>> int vect to mask types in the proposal covers this. So mask in
> >>>>>>
> >>>>>>       v4sib mask = (v4sib){-1, -1, 0, 0, ... }
> >>>>>>
> >>>>>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
> >>>>>> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
> >>>>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
> >>>>>> new mask type representations will have in the mid-end while being
> >>>>>> converted back and forth to and from GCC's internal representation, but
> >>>>>> I'm guessing this is already being handled at some level by the
> >>>>>> vector_mask type's current support?
> >>>>>
> >>>>> Yes, I would guess so.  Of course what the middle-end is currently exposed
> >>>>> to is simply what the vectorizer generates - once fuzzers discover this feature
> >>>>> we'll see "interesting" uses that might run into missed or wrong handling of
> >>>>> them.
> >>>>>
> >>>>> So whatever we do on the side of exposing this to users a good portion
> >>>>> of testsuite coverage for the allowed use cases is important.
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>
> >>>> Apologies for the long-ish reply, but here's a TLDR and gory details follow.
> >>>>
> >>>> TLDR:
> >>>> GIMPLE's vector_mask type semantics seems to be target-dependent, so
> >>>> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
> >>>> changing vector_mask to have target-independent CFE semantics will cause
> >>>> dichotomy between its CFE and GFE behaviours. But vector_mask approach
> >>>> scales well for sizeless types. Is the solution to have something like
> >>>> vector_mask with defined target-independent type semantics, but call it
> >>>> something else to prevent conflation with GIMPLE, a viable option?
> >>>>
> >>>> Details:
> >>>> After some more analysis of the proposed options, here are some
> >>>> interesting findings:
> >>>>
> >>>> vector_mask looked like a very interesting option until I ran into some
> >>>> semantic uncertainly. This code:
> >>>>
> >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>
> >>>> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
> >>>> typedef v8hi v8hib __attribute__((vector_mask));
> >>>>
> >>>> v8si res;
> >>>> v8hi resh;
> >>>>
> >>>> v8hib __GIMPLE () foo (v8hib x, v8sib y)
> >>>> {
> >>>>      v8hib res;
> >>>>
> >>>>      res = x & y;
> >>>>      return res;
> >>>> }
> >>>>
> >>>> When compiled on AArch64, produces a type-mismatch error for binary
> >>>> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
> >>>>     have a different target-layout. If the layout of these two 'derived'
> >>>> types match, then the above code has no issue. Which is the case on
> >>>> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
> >>>> scalar DImode mask mode). IoW such code seems to be allowed on some
> >>>> targets and not on others.
> >>>>
> >>>> With the same code, I tried putting casts and it worked fine on AArch64
> >>>> and amdgcn. This target-specific behaviour of vector_mask derived types
> >>>> will be difficult to specify once we move it to the CFE - in fact we
> >>>> probably don't want target-specific behaviour once it moves to the CFE.
> >>>>
> >>>> If we expose vector_mask to CFE, we'd have to specify consistent
> >>>> semantics for vector_mask types. We'd have to resolve ambiguities like
> >>>> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
> >>>> system involving vector_mask. If we do this, don't we run the risk of a
> >>>> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
> >>>> we'd want to retain vector_mask semantics as they are in GIMPLE.
> >>>>
> >>>> If we want to enforce constant semantics for vector_mask in the CFE, one
> >>>> way is to treat vector_mask types as distinct if they're 'attached' to
> >>>> distinct data vector types. In such a scenario, vector_mask types
> >>>> attached to two data vector types with the same lane-width and number of
> >>>> lanes would be classified as distinct. For eg:
> >>>>
> >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>
> >>>> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
> >>>> typedef v8sf v8sfb __attribute__((vector_mask));
> >>>>
> >>>> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
> >>>> {
> >>>>      (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
> >>>> }
> >>>>
> >>>> This could be the case for unsigned vs signed int vectors too for eg -
> >>>> seems a bit unnecessary tbh.
> >>>>
> >>>> Though vector_mask's being 'attached' to a type has its drawbacks, it
> >>>> does seem to have an advantage when sizeless types are considered. If we
> >>>> have to define a sizeless vector boolean type that is implied by the
> >>>> lane size, we could do something like
> >>>>
> >>>> typedef svint32_t svbool32_t __attribute__((vector_mask));
> >>>>
> >>>> int32_t foo (svint32_t a, svint32_t b)
> >>>> {
> >>>>      svbool32_t pred = a > b;
> >>>>
> >>>>      return pred[2] ? a[2] : b[2];
> >>>> }
> >>>>
> >>>> This is harder to do in the other schemes proposed so far as they're
> >>>> size-based.
> >>>>
> >>>> To be able to free the boolean from the base type (not size) and retain
> >>>> vector_mask's flexibility to declare sizeless types, we could have an
> >>>> attribute that is more flexibly-typed and only 'derives' the lane-size
> >>>> and number of lanes from its 'base' type without actually inheriting the
> >>>> actual base type(char, short, int etc) or its signedness. This creates a
> >>>> purer and stand-alone boolean type without the associated semantics'
> >>>> complexity of having to cast between two same-size types with the same
> >>>> number of lanes. Eg.
> >>>>
> >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>> typedef v8si v8b __attribute__((vector_bool));
> >>>>
> >>>> However, with differing lane-sizes, there will have to be a cast as the
> >>>> 'derived' element size is different which could impact the layout of the
> >>>> vector mask. Eg.
> >>>>
> >>>> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
> >>>> {
> >>>>      (v8sib)(x == y) & (i == j) ? i : (v8si){0};
> >>>> }
> >>>>
> >>>> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
> >>>> non-trivial on SVE (depending on the implemented layout of the bool vector).
> >>>>
> >>>> vector_bool decouples us from having to retain the behaviour of
> >>>> vector_mask and provides the flexibility of not having to cast across
> >>>> same-element-size vector types. Wrt to sizeless types, it could scale well.
> >>>>
> >>>> typedef svint32_t svbool32_t __attribute__((vector_bool));
> >>>> typedef svint16_t svbool16_t __attribute__((vector_bool));
> >>>>
> >>>> int32_t foo (svint32_t a, svint32_t b)
> >>>> {
> >>>>      svbool32_t pred = a > b;
> >>>>
> >>>>      return pred[2] ? a[2] : b[2];
> >>>> }
> >>>>
> >>>> int16_t bar (svint16_t a, svint16_t b)
> >>>> {
> >>>>      svbool16_t pred = a > b;
> >>>>
> >>>>      return pred[2] ? a[2] : b[2];
> >>>> }
> >>>>
> >>>> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
> >>>> the target predicate.
> >>>>
> >>>> Thoughts?
> >>>
> >>> The GIMPLE frontend accepts just what is valid on the target here.  Any
> >>> "plumbing" such as implicit conversions (if we do not want to require
> >>> explicit ones even when NOP) need to be done/enforced by the C frontend.
> >>>
> >>
> >> Sorry, I'm not sure I follow - correct me if I'm wrong here.
> >>
> >> If we desire to define/allow operations like implicit/explicit
> >> conversion on vector_mask types in CFE, don't we have to start from a
> >> position of defining what makes vector_mask types distinct and therefore
> >> require implicit/explicit conversions?
> >
> > We need to look at which operations we want to produce vector masks and
> > which operations consume them and what operations operate on them.
> >
> > In GIMPLE comparisons produce them, conditionals consume them and
> > we allow bitwise ops to operate on them directly (GIMPLE doesn't have
> > logical && it just has bitwise &).
> >
>
> Thanks for your thoughts - after I spent more cycles researching and
> experimenting, I think I understand the driving factors here. Comparison
> producers generate signed integer vectors of the same lane-width as the
> comparison operands. This means mixed type vectors can't be applied to
> conditional consumers or bitwise operators eg:
>
>    v8hi foo (v8si a, v8si b, v8hi c, v8hi d)
>    {
>      return a > b || c > d; // error!
>      return a > b || __builtin_convertvector (c > d, v8si); // OK.
>      return a | b && c | d; // error!
>      return a | b && __builtin_convertvector (c | d, v8si); // OK.
>    }
>
> Similarly, if we extend these 'stricter-typing' rules to vector_mask, it
> could look like:
>
> typedef v4sib v4si __attribute__((vector_mask));
> typedef v4hib v4hi __attribute__((vector_mask));
>
>    v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
>    {
>      v8sib psi = a > b;
>      v8hib phi = c > d;
>
>      return psi || phi; // error!
>      return psi || __builtin_convertvector (phi, v8sib); // OK.
>      return psi | phi; // error!
>      return psi | __builtin_convertvector (phi, v8sib); // OK.
>    }
>
> At GIMPLE stage, on targets where the layout allows it (eg AMDGCN),
> expressions like
>    psi | __builtin_convertvector (phi, v8sib)
> can be optimized to
>    psi | phi
> because __builtin_convertvector (phi, v8sib) is a NOP.
>
> I think this could make vector_mask more portable across targets. If one
> wants to take CFE vector_mask code and run it on the GFE, it should
> work; while the reverse won't as CFE vector_mask rules are more restrictive.
>
> Does this look like a sensible approach for progress?

Yes, that looks good.

> >> IIUC, GFE's distinctness of vector_mask types depends on how the mask
> >> mode is implemented on the target. If implemented in CFE, vector_mask
> >> types' distinctness probably shouldn't be based on target layout and
> >> could be based on the type they're 'attached' to.
> >
> > But since we eventually run on the target the layout should ideally
> > match that of the target.  Now, the question is whether that's ever
> > OK behavior - it effectively makes the mask somewhat opaque and
> > only "observable" by probing it in defined manners.
> >
> >> Wouldn't that diverge from target-specific GFE behaviour - or are you
> >> suggesting its OK for vector_mask type semantics to be different in CFE
> >> and GFE?
> >
> > It's definitely undesirable but as said I'm not sure it has to differ
> > [the layout].
> >
>
> I agree it is best to have a consistent layout of vector_mask across CFE
> and GFE and also implement it to match the target layout for optimal
> code quality.
>
> For observability, I think it makes sense to allow operations that are
> relevant and have a consistent meaning irrespective of that target. Eg.
> 'vector_mask & 2' might not mean the same thing on all targets, but
> vector_mask[2] does. Therefore, I think the opaqueness is useful and
> necessary to some extent.

Yes.  The main question regarding to observability will be
things like sizeof or alignof and putting masks into addressable storage.

I think IBM folks have introduced some "opaque" types for their
matrix-multiplication accelerator where intrinsics need something to
work with but the observability of many aspect is restricted.  In
the middle-end we have OPAQUE_TYPE and MODE_OPAQUE
(but IIRC there can only be a single kind of that at the moment).
Interestingly an OPAQUE_TYPE does have a size.

Note one way out would be to make vector_mask types "decay"
to a value vector type.  Thus any time you try to observe it
you get a vector bool (a vector of actual 8 bit bool data elements)
and when you use a vector data type in mask context you
get a "mask conversion" aka vector bool != 0.  It would then be
up to the compiler to elide round-trips between mask and data.

That would mean sizeof (vector_mask) would be sizeof (vector bool)
even when for example the hardware would produce V4SImode
mask from a V4SFmode compare or when it would produce a
QImode 4-bit integer from the same?

Richard.

> Thanks,
> Tejas.
>
> >>> There's one issue I can see that wasn't mentioned yet - GCC currently
> >>> accepts
> >>>
> >>> typedef long gv1024di __attribute__((vector_size(1024*8)));
> >>>
> >>> even if there's no underlying support on the target which either has support
> >>> only for smaller vectors or no vectors at all.  Currently vector_mask will
> >>> simply fail to produce sth desirable here.  What's your idea of making
> >>> that not target dependent?  GCC will later lower operations with such
> >>> vectors, possibly splitting them up into sizes supported by the hardware
> >>> natively, possibly performing elementwise operations.  For the former
> >>> one would need to guess the "decomposition type" and based on that
> >>> select the mask type [layout]?
> >>>
> >>> One idea would be to specify the mask layout follows the largest vector
> >>> kind supported by the target and if there is none follow the layout
> >>> of (signed?) _Bool [n]?  When there's no target support for vectors
> >>> GCC will generally use elementwise operations apart from some
> >>> special-cases.
> >>>
> >>
> >> That is a very good point - thanks for raising it. For when GCC chooses
> >> to lower to a vector type supported by the target, my initial thought
> >> would be to, as you say, choose a mask that has enough bits to represent
> >> the largest vector size with the smallest lane-width. The actual layout
> >> of the mask will depend on how the target implements its mask mode.
> >> Decomposition of vector_mask ought to follow the decomposition of the
> >> GNU vectors type and each decomposed vector_mask type ought to have
> >> enough bits to represent the decomposed GNU vector shape. It sounds nice
> >> on paper, but I haven't really worked through a design for this. Do you
> >> see any gotchas here?
> >
> > Not really.  In the end it comes down to what the C writer is allowed to
> > do with a vector mask.  I would for example expect that I could do
> >
> >   auto m = v1 < v2;
> >   _mm512_mask_sub_epi32 (a, m, b, c);
> >
> > so generic masks should inter-operate with intrinsics (when the appropriate
> > ISA is enabled).  That works for the data vectors themselves for example
> > (quite some intrinsics are implemented with GCCs generic vector code).
> >
> > I for example can't do
> >
> >    _Bool lane2 = m[2];
> >
> > to inspect lane two of a maks with AVX512.  I can do m & 2 but I wouldn't expect
> > that to work (should I?) with a vector_mask mask (it's at least not
> > valid directly
> > in GIMPLE).  There's _mm512_int2mask and _mm512_mask2int which transfer
> > between mask and int (but the mask types are really just typedefd to
> > integer typeS).
> >
> >>> While using a different name than vector_mask is certainly possible
> >>> it wouldn't me to decide that, but I'm also not yet convinced it's
> >>> really necessary.  As said, what the GIMPLE frontend accepts
> >>> or not shouldn't limit us here - just the actual chosen layout of the
> >>> boolean vectors.
> >>>
> >>
> >> I'm just concerned about creating an alternate vector_mask functionality
> >> in the CFE and risk not being consistent with GFE.
> >
> > I think it's more important to double-check usablilty from the users side.
> > If the implementation necessarily diverges from GIMPLE then we can
> > choose a different attribute name but then it will also inevitably have
> > code-generation (quality) issues as GIMPLE matches what the hardware
> > can do.
> >
> > Richard.
> >
> >> Thanks,
> >> Tejas.
> >>
> >>> Richard.
> >>>
> >>>> Thanks,
> >>>> Tejas.
> >>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Tejas.
> >>>>>>
> >>>>>>>
> >>>>>>> Richard.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Tejas.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Richard.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Tejas.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool vbool __attribute__ ((vector_size (1)));
> >>>>>>>>>>
> >>>>>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> >>>>>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
> >>>>>>>>>> like SVE, it would be natural to have each bit control a byte of the target
> >>>>>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> >>>>>>>>>> bit would control one element/lane on the target vector(therefore resulting in a
> >>>>>>>>>> 'packed' layout with all significant bits at the LSB).
> >>>>>>>>>>
> >>>>>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
> >>>>>>>>>>
> >>>>>>>>>>       typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> >>>>>>>>>>       typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> >>>>>>>>>>
> >>>>>>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
> >>>>>>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
> >>>>>>>>>> vector can represent upto 'n' vector lanes. The size of the type is
> >>>>>>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> >>>>>>>>>> example is 1.
> >>>>>>>>>>
> >>>>>>>>>> In this approach, because of the nature of the representation, the n bits required
> >>>>>>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> >>>>>>>>>> align with the SVE approach of each bit representing a byte of the target vector
> >>>>>>>>>> and PBV therefore having an 'unpacked' layout.
> >>>>>>>>>>
> >>>>>>>>>> More importantly, another drawback here is that the change in units for vector_size
> >>>>>>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
> >>>>>>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
> >>>>>>>>>> the bool vector - it is fixed.
> >>>>>>>>>>
> >>>>>>>>>> 3. Combination of 1 and 2.
> >>>>>>>>>>
> >>>>>>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> >>>>>>>>>> unambiguously represent the layout of the PBV. Consider
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >>>>>>>>>>
> >>>>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> >>>>>>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> >>>>>>>>>> allow a target to force a certain layout of the PBV.
> >>>>>>>>>>
> >>>>>>>>>> The 2-parameter form of vector_size allows the target to have an
> >>>>>>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
> >>>>>>>>>> if it is not specified to mirror the target layout of predicate registers. For
> >>>>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >>>>>>>>>>
> >>>>>>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> >>>>>>>>>> 8 lanes of boolean which could be represented by
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >>>>>>>>>>
> >>>>>>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> >>>>>>>>>>
> >>>>>>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> >>>>>>>>>> be significant i.e. w == 1.
> >>>>>>>>>>
> >>>>>>>>>> This scheme would accomodate more than 1 target to effectively represent vector
> >>>>>>>>>> bools that mirror the target properties.
> >>>>>>>>>>
> >>>>>>>>>> 4. A new attribite
> >>>>>>>>>>
> >>>>>>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> >>>>>>>>>> attribute to define the PBV and make it general enough to
> >>>>>>>>>>
> >>>>>>>>>> * represent all targets flexibly (SVE, AVX etc)
> >>>>>>>>>> * represent sub-byte length predicates
> >>>>>>>>>> * have no change in units of vector_size/no new vector_size signature
> >>>>>>>>>> * not have the number of bytes constrain representation
> >>>>>>>>>>
> >>>>>>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool vbool __attribute__((bool_vec (n[, w])))
> >>>>>>>>>>
> >>>>>>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> >>>>>>>>>>
> >>>>>>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> >>>>>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >>>>>>>>>>
> >>>>>>>>>> 5. Behaviour of the packed vector boolean type.
> >>>>>>>>>>
> >>>>>>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
> >>>>>>>>>>
> >>>>>>>>>> * ABI
> >>>>>>>>>>
> >>>>>>>>>>       New ABI rules will need to be defined for this type - eg alignment, PCS,
> >>>>>>>>>>       mangling etc
> >>>>>>>>>>
> >>>>>>>>>> * Initialization:
> >>>>>>>>>>
> >>>>>>>>>>       Packed Boolean Vectors(PBV) can be initialized like so:
> >>>>>>>>>>
> >>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >>>>>>>>>>         v4bi p = {false, true, false, false};
> >>>>>>>>>>
> >>>>>>>>>>       Each value in the initizlizer constant is of type bool. The lowest numbered
> >>>>>>>>>>       element in the const array corresponds to the LSbit of p, element 1 is
> >>>>>>>>>>       assigned to bit 4 etc.
> >>>>>>>>>>
> >>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0010
> >>>>>>>>>>
> >>>>>>>>>>       With a different layout
> >>>>>>>>>>
> >>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >>>>>>>>>>         v4bi p = {false, true, false, false};
> >>>>>>>>>>
> >>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0002
> >>>>>>>>>>
> >>>>>>>>>> * Operations:
> >>>>>>>>>>
> >>>>>>>>>>       Packed Boolean Vectors support the following operations:
> >>>>>>>>>>       . unary ~
> >>>>>>>>>>       . unary !
> >>>>>>>>>>       . binary&,|andˆ
> >>>>>>>>>>       . assignments &=, |= and ˆ=
> >>>>>>>>>>       . comparisons <, <=, ==, !=, >= and >
> >>>>>>>>>>       . Ternary operator ?:
> >>>>>>>>>>
> >>>>>>>>>>       Operations are defined as applied to the individual elements i.e the bits
> >>>>>>>>>>       that are significant in the PBV. Whether the PBVs are treated as bitmasks
> >>>>>>>>>>       or otherwise is implementation-defined.
> >>>>>>>>>>
> >>>>>>>>>>       Insignificant bits could affect results of comparisons or ternary operators.
> >>>>>>>>>>       In such cases, it is implementation defined how the unused bits are treated.
> >>>>>>>>>>
> >>>>>>>>>>       . Subscript operator []
> >>>>>>>>>>
> >>>>>>>>>>       For the subscript operator, the packed boolean vector acts like a array of
> >>>>>>>>>>       elements - the first or the 0th indexed element being the LSbit of the PBV.
> >>>>>>>>>>       Subscript operator yields a scalar boolean value.
> >>>>>>>>>>       For example:
> >>>>>>>>>>
> >>>>>>>>>>         typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >>>>>>>>>>
> >>>>>>>>>>         // Subscript operator result yields a boolean value.
> >>>>>>>>>>         // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> >>>>>>>>>>         bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >>>>>>>>>>
> >>>>>>>>>>       Out of bounds access: OOB access can be determined at compile time given the
> >>>>>>>>>>       strong typing of the PBVs.
> >>>>>>>>>>
> >>>>>>>>>>       PBV does not support address of operator(&) for elements of PBVs.
> >>>>>>>>>>
> >>>>>>>>>>       . Implicit conversion from integer vectors to PBVs
> >>>>>>>>>>
> >>>>>>>>>>       We would like to support the output of comparison operations to be PBVs. This
> >>>>>>>>>>       requires us to define the implicit conversion from an integer vector to PBV
> >>>>>>>>>>       as the result of vector comparisons are integer vectors.
> >>>>>>>>>>
> >>>>>>>>>>       To define this operation:
> >>>>>>>>>>
> >>>>>>>>>>         bool_vector = vector <cmpop> vector
> >>>>>>>>>>
> >>>>>>>>>>       There is no change in how vector <cmpop> vector behavior i.e. this comparison
> >>>>>>>>>>       would still produce an int_vector type as it does now.
> >>>>>>>>>>
> >>>>>>>>>>         temp_int_vec = vector <cmpop> vector
> >>>>>>>>>>         bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> >>>>>>>>>>
> >>>>>>>>>>       The implicit conversion from int_vec to bool I'd define simply to be:
> >>>>>>>>>>
> >>>>>>>>>>         bool_vec[n] = (_Bool) int_vec[n]
> >>>>>>>>>>
> >>>>>>>>>>       where the C11 standard rules apply
> >>>>>>>>>>       6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> >>>>>>>>>>       is 0 if the value compares equal to 0; otherwise, the result is 1.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> >>>>>>>>>> [2] https://reviews.llvm.org/D88905
> >>>>>>>>>> [3] https://reviews.llvm.org/D81083
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Tejas.
> >>>>>>
> >>>>
> >>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-26 12:26                         ` Richard Biener
@ 2023-07-26 12:33                           ` Richard Biener
  2023-10-05 20:48                             ` Matthias Kretz
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2023-07-26 12:33 UTC (permalink / raw)
  To: Tejas Belagod, Matthias Kretz; +Cc: gcc-patches

On Wed, Jul 26, 2023 at 2:26 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Jul 26, 2023 at 9:21 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
> >
> > On 7/17/23 5:46 PM, Richard Biener wrote:
> > > On Fri, Jul 14, 2023 at 12:18 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
> > >>
> > >> On 7/13/23 4:05 PM, Richard Biener wrote:
> > >>> On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <tejas.belagod@arm.com> wrote:
> > >>>>
> > >>>> On 7/3/23 1:31 PM, Richard Biener wrote:
> > >>>>> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <tejas.belagod@arm.com> wrote:
> > >>>>>>
> > >>>>>> On 6/29/23 6:55 PM, Richard Biener wrote:
> > >>>>>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
> > >>>>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
> > >>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> > >>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> > >>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> > >>>>>>>>
> > >>>>>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod <Tejas.Belagod@arm.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> From: Richard Biener <richard.guenther@gmail.com>
> > >>>>>>>>> Date: Monday, June 26, 2023 at 2:23 PM
> > >>>>>>>>> To: Tejas Belagod <Tejas.Belagod@arm.com>
> > >>>>>>>>> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>
> > >>>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> > >>>>>>>>> <gcc-patches@gcc.gnu.org> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> Packed Boolean Vectors
> > >>>>>>>>>> ----------------------
> > >>>>>>>>>>
> > >>>>>>>>>> I'd like to propose a feature addition to GNU Vector extensions to add packed
> > >>>>>>>>>> boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
> > >>>>>>>>>> been implemented in Clang recently[2].
> > >>>>>>>>>>
> > >>>>>>>>>> With predication features being added to vector architectures (SVE, MVE, AVX),
> > >>>>>>>>>> it is a useful feature to have to model predication on targets.  This could
> > >>>>>>>>>> find its use in intrinsics or just used as is as a GNU vector extension being
> > >>>>>>>>>> mapped to underlying target features.  For example, the packed boolean vector
> > >>>>>>>>>> could directly map to a predicate register on SVE.
> > >>>>>>>>>>
> > >>>>>>>>>> Also, this new packed boolean type GNU extension can be used with SVE ACLE
> > >>>>>>>>>> intrinsics to replace a fixed-length svbool_t.
> > >>>>>>>>>>
> > >>>>>>>>>> Here are a few options to represent the packed boolean vector type.
> > >>>>>>>>>
> > >>>>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
> > >>>>>>>>>
> > >>>>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> > >>>>>>>>> typedef v8si v8sib __attribute__((vector_mask));
> > >>>>>>>>>
> > >>>>>>>>> it get's you a vector type that's the appropriate (dependent on the
> > >>>>>>>>> target) vector
> > >>>>>>>>> mask type for the vector data type (v8si in this case).
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Thanks Richard.
> > >>>>>>>>>
> > >>>>>>>>> Having had a quick look at the implementation, it does seem to tick the boxes.
> > >>>>>>>>>
> > >>>>>>>>> I must admit I haven't dug deep, but if the target hook allows the mask to be
> > >>>>>>>>>
> > >>>>>>>>> defined in way that is target-friendly (and I don't know how much effort it will
> > >>>>>>>>>
> > >>>>>>>>> be to migrate the attribute to more front-ends), it should do the job nicely.
> > >>>>>>>>>
> > >>>>>>>>> Let me go back and dig a bit deeper and get back with questions if any.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Let me add that the advantage of this is the compiler doesn't need
> > >>>>>>>> to support weird explicitely laid out packed boolean vectors that do
> > >>>>>>>> not match what the target supports and the user doesn't need to know
> > >>>>>>>> what the target supports (and thus have an #ifdef maze around explicitely
> > >>>>>>>> specified layouts).
> > >>>>>>>>
> > >>>>>>>> Sorry for the delayed response – I spent a day experimenting with vector_mask.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be portable enough
> > >>>>>>>>
> > >>>>>>>> to avoid having to sprinkle the code with ifdefs.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> It does remove some flexibility though, for example with -mavx512f -mavx512vl
> > >>>>>>>> you'll get AVX512 style masks for V4SImode data vectors but of course the
> > >>>>>>>> target sill supports SSE2/AVX2 style masks as well, but those would not be
> > >>>>>>>> available as "packed boolean vectors", though they are of course in fact
> > >>>>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this particular
> > >>>>>>>> case it might not matter.
> > >>>>>>>>
> > >>>>>>>> That said, the vector_mask attribute will get you V4SImode vectors with
> > >>>>>>>> signed boolean elements of 32 bits for V4SImode data vectors with
> > >>>>>>>> SSE2/AVX2.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> This sounds very much like what the scenario would be with NEON vs SVE. Coming to think
> > >>>>>>>>
> > >>>>>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ implied by the ‘base’ vector type
> > >>>>>>>>
> > >>>>>>>> and a ‘w’ specified for the type.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Given its current implementation, if vector_mask is exposed to the CFE, would there be any
> > >>>>>>>>
> > >>>>>>>> major challenges wrt implementation or defining behaviour semantics? I played around with a
> > >>>>>>>>
> > >>>>>>>> few examples from the testsuite and wrote some new ones. I mostly tried operations that
> > >>>>>>>>
> > >>>>>>>> the new type would have to support (unary, binary bitwise, initializations etc) – with a couple of exceptions
> > >>>>>>>>
> > >>>>>>>> most of the ops seem to be supported. I also triggered a couple of ICEs in some tests involving
> > >>>>>>>>
> > >>>>>>>> implicit conversions to wider/narrower vector_mask types (will raise reports for these). Correct me
> > >>>>>>>>
> > >>>>>>>> if I’m wrong here, but we’d probably have to support a couple of new ops if vector_mask is exposed
> > >>>>>>>>
> > >>>>>>>> to the CFE – initialization and subscript operations?
> > >>>>>>>
> > >>>>>>> Yes, either that or restrict how the mask vectors can be used, thus
> > >>>>>>> properly diagnose improper
> > >>>>>>> uses.
> > >>>>>>
> > >>>>>> Indeed.
> > >>>>>>
> > >>>>>>      A question would be for example how to write common mask test
> > >>>>>>> operations like
> > >>>>>>> if (any (mask)) or if (all (mask)).
> > >>>>>>
> > >>>>>> I see 2 options here. New builtins could support new types - they'd
> > >>>>>> provide a target independent way to test any and all conditions. Another
> > >>>>>> would be to let the target use its intrinsics to do them in the most
> > >>>>>> efficient way possible (which the builtins would get lowered down to
> > >>>>>> anyway).
> > >>>>>>
> > >>>>>>
> > >>>>>>      Likewise writing merge operations
> > >>>>>>> - do those as
> > >>>>>>>
> > >>>>>>>      a = a | (mask ? b : 0);
> > >>>>>>>
> > >>>>>>> thus use ternary ?: for this?
> > >>>>>>
> > >>>>>> Yes, like now, the ternary could just translate to
> > >>>>>>
> > >>>>>>       {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
> > >>>>>>
> > >>>>>> One thing to flesh out is the semantics. Should we allow this operation
> > >>>>>> as long as the number of elements are the same even if the mask type if
> > >>>>>> different i.e.
> > >>>>>>
> > >>>>>>       v4hib ? v4si : v4si;
> > >>>>>>
> > >>>>>> I don't see why this can't be allowed as now we let
> > >>>>>>
> > >>>>>>       v4si ? v4sf : v4sf;
> > >>>>>>
> > >>>>>>
> > >>>>>> For initialization regular vector
> > >>>>>>> syntax should work:
> > >>>>>>>
> > >>>>>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
> > >>>>>>>
> > >>>>>>> there's the question of the signedness of the mask elements.  GCC
> > >>>>>>> internally uses signed
> > >>>>>>> bools with values -1 for true and 0 for false.
> > >>>>>>
> > >>>>>> One of the things is the value that represents true. This is largely
> > >>>>>> target-dependent when it comes to the vector_mask type. When vector_mask
> > >>>>>> types are created from GCC's internal representation of bool vectors
> > >>>>>> (signed ints) the point about implicit/explicit conversions from signed
> > >>>>>> int vect to mask types in the proposal covers this. So mask in
> > >>>>>>
> > >>>>>>       v4sib mask = (v4sib){-1, -1, 0, 0, ... }
> > >>>>>>
> > >>>>>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
> > >>>>>> on SVE. On AVX2/SSE they'd still be represented as vector of signed ints
> > >>>>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications this
> > >>>>>> new mask type representations will have in the mid-end while being
> > >>>>>> converted back and forth to and from GCC's internal representation, but
> > >>>>>> I'm guessing this is already being handled at some level by the
> > >>>>>> vector_mask type's current support?
> > >>>>>
> > >>>>> Yes, I would guess so.  Of course what the middle-end is currently exposed
> > >>>>> to is simply what the vectorizer generates - once fuzzers discover this feature
> > >>>>> we'll see "interesting" uses that might run into missed or wrong handling of
> > >>>>> them.
> > >>>>>
> > >>>>> So whatever we do on the side of exposing this to users a good portion
> > >>>>> of testsuite coverage for the allowed use cases is important.
> > >>>>>
> > >>>>> Richard.
> > >>>>>
> > >>>>
> > >>>> Apologies for the long-ish reply, but here's a TLDR and gory details follow.
> > >>>>
> > >>>> TLDR:
> > >>>> GIMPLE's vector_mask type semantics seems to be target-dependent, so
> > >>>> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
> > >>>> changing vector_mask to have target-independent CFE semantics will cause
> > >>>> dichotomy between its CFE and GFE behaviours. But vector_mask approach
> > >>>> scales well for sizeless types. Is the solution to have something like
> > >>>> vector_mask with defined target-independent type semantics, but call it
> > >>>> something else to prevent conflation with GIMPLE, a viable option?
> > >>>>
> > >>>> Details:
> > >>>> After some more analysis of the proposed options, here are some
> > >>>> interesting findings:
> > >>>>
> > >>>> vector_mask looked like a very interesting option until I ran into some
> > >>>> semantic uncertainly. This code:
> > >>>>
> > >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> > >>>> typedef v8si v8sib __attribute__((vector_mask));
> > >>>>
> > >>>> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
> > >>>> typedef v8hi v8hib __attribute__((vector_mask));
> > >>>>
> > >>>> v8si res;
> > >>>> v8hi resh;
> > >>>>
> > >>>> v8hib __GIMPLE () foo (v8hib x, v8sib y)
> > >>>> {
> > >>>>      v8hib res;
> > >>>>
> > >>>>      res = x & y;
> > >>>>      return res;
> > >>>> }
> > >>>>
> > >>>> When compiled on AArch64, produces a type-mismatch error for binary
> > >>>> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
> > >>>>     have a different target-layout. If the layout of these two 'derived'
> > >>>> types match, then the above code has no issue. Which is the case on
> > >>>> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
> > >>>> scalar DImode mask mode). IoW such code seems to be allowed on some
> > >>>> targets and not on others.
> > >>>>
> > >>>> With the same code, I tried putting casts and it worked fine on AArch64
> > >>>> and amdgcn. This target-specific behaviour of vector_mask derived types
> > >>>> will be difficult to specify once we move it to the CFE - in fact we
> > >>>> probably don't want target-specific behaviour once it moves to the CFE.
> > >>>>
> > >>>> If we expose vector_mask to CFE, we'd have to specify consistent
> > >>>> semantics for vector_mask types. We'd have to resolve ambiguities like
> > >>>> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
> > >>>> system involving vector_mask. If we do this, don't we run the risk of a
> > >>>> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
> > >>>> we'd want to retain vector_mask semantics as they are in GIMPLE.
> > >>>>
> > >>>> If we want to enforce constant semantics for vector_mask in the CFE, one
> > >>>> way is to treat vector_mask types as distinct if they're 'attached' to
> > >>>> distinct data vector types. In such a scenario, vector_mask types
> > >>>> attached to two data vector types with the same lane-width and number of
> > >>>> lanes would be classified as distinct. For eg:
> > >>>>
> > >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> > >>>> typedef v8si v8sib __attribute__((vector_mask));
> > >>>>
> > >>>> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
> > >>>> typedef v8sf v8sfb __attribute__((vector_mask));
> > >>>>
> > >>>> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
> > >>>> {
> > >>>>      (a == b) & (v8sfb)(x == y) ? x : (v8si){0};
> > >>>> }
> > >>>>
> > >>>> This could be the case for unsigned vs signed int vectors too for eg -
> > >>>> seems a bit unnecessary tbh.
> > >>>>
> > >>>> Though vector_mask's being 'attached' to a type has its drawbacks, it
> > >>>> does seem to have an advantage when sizeless types are considered. If we
> > >>>> have to define a sizeless vector boolean type that is implied by the
> > >>>> lane size, we could do something like
> > >>>>
> > >>>> typedef svint32_t svbool32_t __attribute__((vector_mask));
> > >>>>
> > >>>> int32_t foo (svint32_t a, svint32_t b)
> > >>>> {
> > >>>>      svbool32_t pred = a > b;
> > >>>>
> > >>>>      return pred[2] ? a[2] : b[2];
> > >>>> }
> > >>>>
> > >>>> This is harder to do in the other schemes proposed so far as they're
> > >>>> size-based.
> > >>>>
> > >>>> To be able to free the boolean from the base type (not size) and retain
> > >>>> vector_mask's flexibility to declare sizeless types, we could have an
> > >>>> attribute that is more flexibly-typed and only 'derives' the lane-size
> > >>>> and number of lanes from its 'base' type without actually inheriting the
> > >>>> actual base type(char, short, int etc) or its signedness. This creates a
> > >>>> purer and stand-alone boolean type without the associated semantics'
> > >>>> complexity of having to cast between two same-size types with the same
> > >>>> number of lanes. Eg.
> > >>>>
> > >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> > >>>> typedef v8si v8b __attribute__((vector_bool));
> > >>>>
> > >>>> However, with differing lane-sizes, there will have to be a cast as the
> > >>>> 'derived' element size is different which could impact the layout of the
> > >>>> vector mask. Eg.
> > >>>>
> > >>>> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
> > >>>> {
> > >>>>      (v8sib)(x == y) & (i == j) ? i : (v8si){0};
> > >>>> }
> > >>>>
> > >>>> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
> > >>>> non-trivial on SVE (depending on the implemented layout of the bool vector).
> > >>>>
> > >>>> vector_bool decouples us from having to retain the behaviour of
> > >>>> vector_mask and provides the flexibility of not having to cast across
> > >>>> same-element-size vector types. Wrt to sizeless types, it could scale well.
> > >>>>
> > >>>> typedef svint32_t svbool32_t __attribute__((vector_bool));
> > >>>> typedef svint16_t svbool16_t __attribute__((vector_bool));
> > >>>>
> > >>>> int32_t foo (svint32_t a, svint32_t b)
> > >>>> {
> > >>>>      svbool32_t pred = a > b;
> > >>>>
> > >>>>      return pred[2] ? a[2] : b[2];
> > >>>> }
> > >>>>
> > >>>> int16_t bar (svint16_t a, svint16_t b)
> > >>>> {
> > >>>>      svbool16_t pred = a > b;
> > >>>>
> > >>>>      return pred[2] ? a[2] : b[2];
> > >>>> }
> > >>>>
> > >>>> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
> > >>>> the target predicate.
> > >>>>
> > >>>> Thoughts?
> > >>>
> > >>> The GIMPLE frontend accepts just what is valid on the target here.  Any
> > >>> "plumbing" such as implicit conversions (if we do not want to require
> > >>> explicit ones even when NOP) need to be done/enforced by the C frontend.
> > >>>
> > >>
> > >> Sorry, I'm not sure I follow - correct me if I'm wrong here.
> > >>
> > >> If we desire to define/allow operations like implicit/explicit
> > >> conversion on vector_mask types in CFE, don't we have to start from a
> > >> position of defining what makes vector_mask types distinct and therefore
> > >> require implicit/explicit conversions?
> > >
> > > We need to look at which operations we want to produce vector masks and
> > > which operations consume them and what operations operate on them.
> > >
> > > In GIMPLE comparisons produce them, conditionals consume them and
> > > we allow bitwise ops to operate on them directly (GIMPLE doesn't have
> > > logical && it just has bitwise &).
> > >
> >
> > Thanks for your thoughts - after I spent more cycles researching and
> > experimenting, I think I understand the driving factors here. Comparison
> > producers generate signed integer vectors of the same lane-width as the
> > comparison operands. This means mixed type vectors can't be applied to
> > conditional consumers or bitwise operators eg:
> >
> >    v8hi foo (v8si a, v8si b, v8hi c, v8hi d)
> >    {
> >      return a > b || c > d; // error!
> >      return a > b || __builtin_convertvector (c > d, v8si); // OK.
> >      return a | b && c | d; // error!
> >      return a | b && __builtin_convertvector (c | d, v8si); // OK.
> >    }
> >
> > Similarly, if we extend these 'stricter-typing' rules to vector_mask, it
> > could look like:
> >
> > typedef v4sib v4si __attribute__((vector_mask));
> > typedef v4hib v4hi __attribute__((vector_mask));
> >
> >    v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
> >    {
> >      v8sib psi = a > b;
> >      v8hib phi = c > d;
> >
> >      return psi || phi; // error!
> >      return psi || __builtin_convertvector (phi, v8sib); // OK.
> >      return psi | phi; // error!
> >      return psi | __builtin_convertvector (phi, v8sib); // OK.
> >    }
> >
> > At GIMPLE stage, on targets where the layout allows it (eg AMDGCN),
> > expressions like
> >    psi | __builtin_convertvector (phi, v8sib)
> > can be optimized to
> >    psi | phi
> > because __builtin_convertvector (phi, v8sib) is a NOP.
> >
> > I think this could make vector_mask more portable across targets. If one
> > wants to take CFE vector_mask code and run it on the GFE, it should
> > work; while the reverse won't as CFE vector_mask rules are more restrictive.
> >
> > Does this look like a sensible approach for progress?
>
> Yes, that looks good.
>
> > >> IIUC, GFE's distinctness of vector_mask types depends on how the mask
> > >> mode is implemented on the target. If implemented in CFE, vector_mask
> > >> types' distinctness probably shouldn't be based on target layout and
> > >> could be based on the type they're 'attached' to.
> > >
> > > But since we eventually run on the target the layout should ideally
> > > match that of the target.  Now, the question is whether that's ever
> > > OK behavior - it effectively makes the mask somewhat opaque and
> > > only "observable" by probing it in defined manners.
> > >
> > >> Wouldn't that diverge from target-specific GFE behaviour - or are you
> > >> suggesting its OK for vector_mask type semantics to be different in CFE
> > >> and GFE?
> > >
> > > It's definitely undesirable but as said I'm not sure it has to differ
> > > [the layout].
> > >
> >
> > I agree it is best to have a consistent layout of vector_mask across CFE
> > and GFE and also implement it to match the target layout for optimal
> > code quality.
> >
> > For observability, I think it makes sense to allow operations that are
> > relevant and have a consistent meaning irrespective of that target. Eg.
> > 'vector_mask & 2' might not mean the same thing on all targets, but
> > vector_mask[2] does. Therefore, I think the opaqueness is useful and
> > necessary to some extent.
>
> Yes.  The main question regarding to observability will be
> things like sizeof or alignof and putting masks into addressable storage.
>
> I think IBM folks have introduced some "opaque" types for their
> matrix-multiplication accelerator where intrinsics need something to
> work with but the observability of many aspect is restricted.  In
> the middle-end we have OPAQUE_TYPE and MODE_OPAQUE
> (but IIRC there can only be a single kind of that at the moment).
> Interestingly an OPAQUE_TYPE does have a size.
>
> Note one way out would be to make vector_mask types "decay"
> to a value vector type.  Thus any time you try to observe it
> you get a vector bool (a vector of actual 8 bit bool data elements)
> and when you use a vector data type in mask context you
> get a "mask conversion" aka vector bool != 0.  It would then be
> up to the compiler to elide round-trips between mask and data.
>
> That would mean sizeof (vector_mask) would be sizeof (vector bool)
> even when for example the hardware would produce V4SImode
> mask from a V4SFmode compare or when it would produce a
> QImode 4-bit integer from the same?

Btw, how the experimental SIMD C++ standard library handles
these issue might be also interesting to research (author CCed)

Richard.

> Richard.
>
> > Thanks,
> > Tejas.
> >
> > >>> There's one issue I can see that wasn't mentioned yet - GCC currently
> > >>> accepts
> > >>>
> > >>> typedef long gv1024di __attribute__((vector_size(1024*8)));
> > >>>
> > >>> even if there's no underlying support on the target which either has support
> > >>> only for smaller vectors or no vectors at all.  Currently vector_mask will
> > >>> simply fail to produce sth desirable here.  What's your idea of making
> > >>> that not target dependent?  GCC will later lower operations with such
> > >>> vectors, possibly splitting them up into sizes supported by the hardware
> > >>> natively, possibly performing elementwise operations.  For the former
> > >>> one would need to guess the "decomposition type" and based on that
> > >>> select the mask type [layout]?
> > >>>
> > >>> One idea would be to specify the mask layout follows the largest vector
> > >>> kind supported by the target and if there is none follow the layout
> > >>> of (signed?) _Bool [n]?  When there's no target support for vectors
> > >>> GCC will generally use elementwise operations apart from some
> > >>> special-cases.
> > >>>
> > >>
> > >> That is a very good point - thanks for raising it. For when GCC chooses
> > >> to lower to a vector type supported by the target, my initial thought
> > >> would be to, as you say, choose a mask that has enough bits to represent
> > >> the largest vector size with the smallest lane-width. The actual layout
> > >> of the mask will depend on how the target implements its mask mode.
> > >> Decomposition of vector_mask ought to follow the decomposition of the
> > >> GNU vectors type and each decomposed vector_mask type ought to have
> > >> enough bits to represent the decomposed GNU vector shape. It sounds nice
> > >> on paper, but I haven't really worked through a design for this. Do you
> > >> see any gotchas here?
> > >
> > > Not really.  In the end it comes down to what the C writer is allowed to
> > > do with a vector mask.  I would for example expect that I could do
> > >
> > >   auto m = v1 < v2;
> > >   _mm512_mask_sub_epi32 (a, m, b, c);
> > >
> > > so generic masks should inter-operate with intrinsics (when the appropriate
> > > ISA is enabled).  That works for the data vectors themselves for example
> > > (quite some intrinsics are implemented with GCCs generic vector code).
> > >
> > > I for example can't do
> > >
> > >    _Bool lane2 = m[2];
> > >
> > > to inspect lane two of a maks with AVX512.  I can do m & 2 but I wouldn't expect
> > > that to work (should I?) with a vector_mask mask (it's at least not
> > > valid directly
> > > in GIMPLE).  There's _mm512_int2mask and _mm512_mask2int which transfer
> > > between mask and int (but the mask types are really just typedefd to
> > > integer typeS).
> > >
> > >>> While using a different name than vector_mask is certainly possible
> > >>> it wouldn't me to decide that, but I'm also not yet convinced it's
> > >>> really necessary.  As said, what the GIMPLE frontend accepts
> > >>> or not shouldn't limit us here - just the actual chosen layout of the
> > >>> boolean vectors.
> > >>>
> > >>
> > >> I'm just concerned about creating an alternate vector_mask functionality
> > >> in the CFE and risk not being consistent with GFE.
> > >
> > > I think it's more important to double-check usablilty from the users side.
> > > If the implementation necessarily diverges from GIMPLE then we can
> > > choose a different attribute name but then it will also inevitably have
> > > code-generation (quality) issues as GIMPLE matches what the hardware
> > > can do.
> > >
> > > Richard.
> > >
> > >> Thanks,
> > >> Tejas.
> > >>
> > >>> Richard.
> > >>>
> > >>>> Thanks,
> > >>>> Tejas.
> > >>>>
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Tejas.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Richard.
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>>
> > >>>>>>>> Tejas.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Richard.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>>
> > >>>>>>>>> Tejas.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
> > >>>>>>>>>>
> > >>>>>>>>>>       typedef bool vbool __attribute__ ((vector_size (1)));
> > >>>>>>>>>>
> > >>>>>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, it is not
> > >>>>>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
> > >>>>>>>>>> like SVE, it would be natural to have each bit control a byte of the target
> > >>>>>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
> > >>>>>>>>>> bit would control one element/lane on the target vector(therefore resulting in a
> > >>>>>>>>>> 'packed' layout with all significant bits at the LSB).
> > >>>>>>>>>>
> > >>>>>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
> > >>>>>>>>>>
> > >>>>>>>>>>       typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
> > >>>>>>>>>>       typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));
> > >>>>>>>>>>
> > >>>>>>>>>> Here the 'n' in the vector_size attribute represents the number of bits that
> > >>>>>>>>>> is needed to represent a vector quantity.  In this case, this packed boolean
> > >>>>>>>>>> vector can represent upto 'n' vector lanes. The size of the type is
> > >>>>>>>>>> rounded up the nearest byte.  For example, the sizeof v4bi in the above
> > >>>>>>>>>> example is 1.
> > >>>>>>>>>>
> > >>>>>>>>>> In this approach, because of the nature of the representation, the n bits required
> > >>>>>>>>>> to represent the n lanes of the vector are packed at the LSB. This does not naturally
> > >>>>>>>>>> align with the SVE approach of each bit representing a byte of the target vector
> > >>>>>>>>>> and PBV therefore having an 'unpacked' layout.
> > >>>>>>>>>>
> > >>>>>>>>>> More importantly, another drawback here is that the change in units for vector_size
> > >>>>>>>>>> might be confusing to programmers.  The units will have to be interpreted based on the
> > >>>>>>>>>> base type of the typedef. It does not offer any flexibility in terms of the layout of
> > >>>>>>>>>> the bool vector - it is fixed.
> > >>>>>>>>>>
> > >>>>>>>>>> 3. Combination of 1 and 2.
> > >>>>>>>>>>
> > >>>>>>>>>> Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
> > >>>>>>>>>> unambiguously represent the layout of the PBV. Consider
> > >>>>>>>>>>
> > >>>>>>>>>>       typedef bool vbool __attribute__((vector_size (s, n[, w])));
> > >>>>>>>>>>
> > >>>>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
> > >>>>>>>>>> is the number of bits of the PBV that represents a lane of the target vector. 'w' would
> > >>>>>>>>>> allow a target to force a certain layout of the PBV.
> > >>>>>>>>>>
> > >>>>>>>>>> The 2-parameter form of vector_size allows the target to have an
> > >>>>>>>>>> implementation-defined layout of the PBV. The target is free to choose the 'w'
> > >>>>>>>>>> if it is not specified to mirror the target layout of predicate registers. For
> > >>>>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> > >>>>>>>>>>
> > >>>>>>>>>> As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
> > >>>>>>>>>> 8 lanes of boolean which could be represented by
> > >>>>>>>>>>
> > >>>>>>>>>>       typedef bool v8b __attribute__ ((vector_size (2, 8)));
> > >>>>>>>>>>
> > >>>>>>>>>> SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2
> > >>>>>>>>>>
> > >>>>>>>>>> and AVX would choose a layout where all 8 consecutive bits packed at LSB would
> > >>>>>>>>>> be significant i.e. w == 1.
> > >>>>>>>>>>
> > >>>>>>>>>> This scheme would accomodate more than 1 target to effectively represent vector
> > >>>>>>>>>> bools that mirror the target properties.
> > >>>>>>>>>>
> > >>>>>>>>>> 4. A new attribite
> > >>>>>>>>>>
> > >>>>>>>>>> This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
> > >>>>>>>>>> attribute to define the PBV and make it general enough to
> > >>>>>>>>>>
> > >>>>>>>>>> * represent all targets flexibly (SVE, AVX etc)
> > >>>>>>>>>> * represent sub-byte length predicates
> > >>>>>>>>>> * have no change in units of vector_size/no new vector_size signature
> > >>>>>>>>>> * not have the number of bytes constrain representation
> > >>>>>>>>>>
> > >>>>>>>>>> If we call the new attribute 'bool_vec' (for lack of a better name), consider
> > >>>>>>>>>>
> > >>>>>>>>>>       typedef bool vbool __attribute__((bool_vec (n[, w])))
> > >>>>>>>>>>
> > >>>>>>>>>> where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.
> > >>>>>>>>>>
> > >>>>>>>>>> If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
> > >>>>>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> > >>>>>>>>>>
> > >>>>>>>>>> 5. Behaviour of the packed vector boolean type.
> > >>>>>>>>>>
> > >>>>>>>>>> Taking the example of one of the options above, following is an illustration of it's behavior
> > >>>>>>>>>>
> > >>>>>>>>>> * ABI
> > >>>>>>>>>>
> > >>>>>>>>>>       New ABI rules will need to be defined for this type - eg alignment, PCS,
> > >>>>>>>>>>       mangling etc
> > >>>>>>>>>>
> > >>>>>>>>>> * Initialization:
> > >>>>>>>>>>
> > >>>>>>>>>>       Packed Boolean Vectors(PBV) can be initialized like so:
> > >>>>>>>>>>
> > >>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> > >>>>>>>>>>         v4bi p = {false, true, false, false};
> > >>>>>>>>>>
> > >>>>>>>>>>       Each value in the initizlizer constant is of type bool. The lowest numbered
> > >>>>>>>>>>       element in the const array corresponds to the LSbit of p, element 1 is
> > >>>>>>>>>>       assigned to bit 4 etc.
> > >>>>>>>>>>
> > >>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0010
> > >>>>>>>>>>
> > >>>>>>>>>>       With a different layout
> > >>>>>>>>>>
> > >>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> > >>>>>>>>>>         v4bi p = {false, true, false, false};
> > >>>>>>>>>>
> > >>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0002
> > >>>>>>>>>>
> > >>>>>>>>>> * Operations:
> > >>>>>>>>>>
> > >>>>>>>>>>       Packed Boolean Vectors support the following operations:
> > >>>>>>>>>>       . unary ~
> > >>>>>>>>>>       . unary !
> > >>>>>>>>>>       . binary&,|andˆ
> > >>>>>>>>>>       . assignments &=, |= and ˆ=
> > >>>>>>>>>>       . comparisons <, <=, ==, !=, >= and >
> > >>>>>>>>>>       . Ternary operator ?:
> > >>>>>>>>>>
> > >>>>>>>>>>       Operations are defined as applied to the individual elements i.e the bits
> > >>>>>>>>>>       that are significant in the PBV. Whether the PBVs are treated as bitmasks
> > >>>>>>>>>>       or otherwise is implementation-defined.
> > >>>>>>>>>>
> > >>>>>>>>>>       Insignificant bits could affect results of comparisons or ternary operators.
> > >>>>>>>>>>       In such cases, it is implementation defined how the unused bits are treated.
> > >>>>>>>>>>
> > >>>>>>>>>>       . Subscript operator []
> > >>>>>>>>>>
> > >>>>>>>>>>       For the subscript operator, the packed boolean vector acts like a array of
> > >>>>>>>>>>       elements - the first or the 0th indexed element being the LSbit of the PBV.
> > >>>>>>>>>>       Subscript operator yields a scalar boolean value.
> > >>>>>>>>>>       For example:
> > >>>>>>>>>>
> > >>>>>>>>>>         typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> > >>>>>>>>>>
> > >>>>>>>>>>         // Subscript operator result yields a boolean value.
> > >>>>>>>>>>         // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
> > >>>>>>>>>>         bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> > >>>>>>>>>>
> > >>>>>>>>>>       Out of bounds access: OOB access can be determined at compile time given the
> > >>>>>>>>>>       strong typing of the PBVs.
> > >>>>>>>>>>
> > >>>>>>>>>>       PBV does not support address of operator(&) for elements of PBVs.
> > >>>>>>>>>>
> > >>>>>>>>>>       . Implicit conversion from integer vectors to PBVs
> > >>>>>>>>>>
> > >>>>>>>>>>       We would like to support the output of comparison operations to be PBVs. This
> > >>>>>>>>>>       requires us to define the implicit conversion from an integer vector to PBV
> > >>>>>>>>>>       as the result of vector comparisons are integer vectors.
> > >>>>>>>>>>
> > >>>>>>>>>>       To define this operation:
> > >>>>>>>>>>
> > >>>>>>>>>>         bool_vector = vector <cmpop> vector
> > >>>>>>>>>>
> > >>>>>>>>>>       There is no change in how vector <cmpop> vector behavior i.e. this comparison
> > >>>>>>>>>>       would still produce an int_vector type as it does now.
> > >>>>>>>>>>
> > >>>>>>>>>>         temp_int_vec = vector <cmpop> vector
> > >>>>>>>>>>         bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec
> > >>>>>>>>>>
> > >>>>>>>>>>       The implicit conversion from int_vec to bool I'd define simply to be:
> > >>>>>>>>>>
> > >>>>>>>>>>         bool_vec[n] = (_Bool) int_vec[n]
> > >>>>>>>>>>
> > >>>>>>>>>>       where the C11 standard rules apply
> > >>>>>>>>>>       6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
> > >>>>>>>>>>       is 0 if the value compares equal to 0; otherwise, the result is 1.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> > >>>>>>>>>> [2] https://reviews.llvm.org/D88905
> > >>>>>>>>>> [3] https://reviews.llvm.org/D81083
> > >>>>>>>>>>
> > >>>>>>>>>> Thoughts?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>> Tejas.
> > >>>>>>
> > >>>>
> > >>
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
  2023-07-26 12:33                           ` Richard Biener
@ 2023-10-05 20:48                             ` Matthias Kretz
  0 siblings, 0 replies; 16+ messages in thread
From: Matthias Kretz @ 2023-10-05 20:48 UTC (permalink / raw)
  To: Tejas Belagod, Richard Biener; +Cc: gcc-patches

On Wednesday, 26 July 2023 06:33:41 MDT Richard Biener wrote:
> Btw, how the experimental SIMD C++ standard library handles
> these issue might be also interesting to research (author CCed)

I only skimmed over this thread now. FWIW, I would really like better 
support for AVX-512 bitmasks for the std::experimental::simd implementation 
(std::simd for C++26). I probably want better support for all the other 
targets that use bitmasks - but so far I only have experience with AVX512.

To make the AVX512 implementation of std::experimental::simd efficient I 
have to call intrinsics/builtins instead of directly expressing what I want 
to do using the [[gnu::vector_size]] types. There are some instances where 
I have to convert between bitmask and element-sized mask vectors - and to 
make that efficient I call all kinds of intrinsics/builtins. And from what 
I've seen, a bitmask -> mask vector -> bitmask conversion won't be 
recognized as a no-op (the other way around, as well).

At this point I have no technical input to this thread. But if there's 
anything you want me to test - whether it helps in the simd implementation 
- let me know.

-Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-10-05 20:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-26  6:23 [RFC] GNU Vector Extension -- Packed Boolean Vectors Tejas Belagod
2023-06-26  8:50 ` Richard Biener
2023-06-27  6:30   ` Tejas Belagod
2023-06-27  7:28     ` Richard Biener
2023-06-28 11:26       ` Tejas Belagod
2023-06-29 13:25         ` Richard Biener
2023-07-03  6:50           ` Tejas Belagod
2023-07-03  8:01             ` Richard Biener
2023-07-13 10:14               ` Tejas Belagod
2023-07-13 10:35                 ` Richard Biener
2023-07-14 10:18                   ` Tejas Belagod
2023-07-17 12:16                     ` Richard Biener
2023-07-26  7:21                       ` Tejas Belagod
2023-07-26 12:26                         ` Richard Biener
2023-07-26 12:33                           ` Richard Biener
2023-10-05 20:48                             ` Matthias Kretz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).