Hi,

Packed Boolean Vectors
----------------------

I'd like to propose a feature addition to GNU Vector extensions to add packed
boolean vectors (PBV).  This has been discussed in the past here[1] and a variant has
been implemented in Clang recently[2].

With predication features being added to vector architectures (SVE, MVE, AVX),
it is a useful feature to have to model predication on targets.  This could
find its use in intrinsics or just used as is as a GNU vector extension being
mapped to underlying target features.  For example, the packed boolean vector
could directly map to a predicate register on SVE.

Also, this new packed boolean type GNU extension can be used with SVE ACLE
intrinsics to replace a fixed-length svbool_t.

Here are a few options to represent the packed boolean vector type.

1. __attribute__((vector_size (n))) where n represents bytes

  typedef bool vbool __attribute__ ((vector_size (1)));

In this approach, the shape of the boolean vector is unclear. IoW, it is not
clear if each bit in 'n' controls a byte or an element. On targets
like SVE, it would be natural to have each bit control a byte of the target
vector (therefore resulting in an 'unpacked' layout of the PBV) and on AVX, each
bit would control one element/lane on the target vector(therefore resulting in a
'packed' layout with all significant bits at the LSB).

2. __attribute__((vector_size (n))) where n represents num of lanes

  typedef int v4si __attribute__ ((vector_size (4 * sizeof (int)));
  typedef bool v4bi __attribute__ ((vector_size (sizeof v4si / sizeof (v4si){0}[0])));

Here the 'n' in the vector_size attribute represents the number of bits that
is needed to represent a vector quantity.  In this case, this packed boolean
vector can represent upto 'n' vector lanes. The size of the type is
rounded up the nearest byte.  For example, the sizeof v4bi in the above
example is 1.

In this approach, because of the nature of the representation, the n bits required
to represent the n lanes of the vector are packed at the LSB. This does not naturally
align with the SVE approach of each bit representing a byte of the target vector
and PBV therefore having an 'unpacked' layout.

More importantly, another drawback here is that the change in units for vector_size
might be confusing to programmers.  The units will have to be interpreted based on the
base type of the typedef. It does not offer any flexibility in terms of the layout of
the bool vector - it is fixed.

3. Combination of 1 and 2.

Combining the best of 1 and 2, we can introduce extra parameters to vector_size that will
unambiguously represent the layout of the PBV. Consider

  typedef bool vbool __attribute__((vector_size (s, n[, w])));

where 's' is size in bytes, 'n' is the number of lanes and an optional 3rd parameter 'w'
is the number of bits of the PBV that represents a lane of the target vector. 'w' would
allow a target to force a certain layout of the PBV.

The 2-parameter form of vector_size allows the target to have an
implementation-defined layout of the PBV. The target is free to choose the 'w'
if it is not specified to mirror the target layout of predicate registers. For
eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.

As an example, to represent the result of a comparison on 2 int16x8_t, we'd need
8 lanes of boolean which could be represented by

  typedef bool v8b __attribute__ ((vector_size (2, 8)));

SVE would implement v8b layout to make every 2nd bit significant i.e. w == 2

and AVX would choose a layout where all 8 consecutive bits packed at LSB would
be significant i.e. w == 1.

This scheme would accomodate more than 1 target to effectively represent vector
bools that mirror the target properties.

4. A new attribite

This is based on a suggestion from Richard S in [3]. The idea is to introduce a new
attribute to define the PBV and make it general enough to

* represent all targets flexibly (SVE, AVX etc)
* represent sub-byte length predicates
* have no change in units of vector_size/no new vector_size signature
* not have the number of bytes constrain representation

If we call the new attribute 'bool_vec' (for lack of a better name), consider

  typedef bool vbool __attribute__((bool_vec (n[, w])))

where 'n' represents number of lanes/elements and the optional 'w' is bits-per-lane.

If 'w' is not specified, it and bytes-per-predicate are implementation-defined based on target.
If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).

5. Behaviour of the packed vector boolean type.

Taking the example of one of the options above, following is an illustration of it's behavior

* ABI

  New ABI rules will need to be defined for this type - eg alignment, PCS,
  mangling etc

* Initialization:

  Packed Boolean Vectors(PBV) can be initialized like so:

    typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
    v4bi p = {false, true, false, false};

  Each value in the initizlizer constant is of type bool. The lowest numbered
  element in the const array corresponds to the LSbit of p, element 1 is
  assigned to bit 4 etc.

  p is effectively a 2-byte bitmask with value 0x0010

  With a different layout

    typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
    v4bi p = {false, true, false, false};

  p is effectively a 2-byte bitmask with value 0x0002

* Operations:

  Packed Boolean Vectors support the following operations:
  . unary ~
  . unary !
  . binary&,|andˆ
  . assignments &=, |= and ˆ=
  . comparisons <, <=, ==, !=, >= and >
  . Ternary operator ?:

  Operations are defined as applied to the individual elements i.e the bits
  that are significant in the PBV. Whether the PBVs are treated as bitmasks
  or otherwise is implementation-defined.

  Insignificant bits could affect results of comparisons or ternary operators.
  In such cases, it is implementation defined how the unused bits are treated.

  . Subscript operator []

  For the subscript operator, the packed boolean vector acts like a array of
  elements - the first or the 0th indexed element being the LSbit of the PBV.
  Subscript operator yields a scalar boolean value.
  For example:

    typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));

    // Subscript operator result yields a boolean value.
    // x[3] is the 7th LSbit and x[1] is the 3rd LSbit of x.
    bool foo (v8b p, int n) { p[3] = true; return p[1]; }

  Out of bounds access: OOB access can be determined at compile time given the
  strong typing of the PBVs.

  PBV does not support address of operator(&) for elements of PBVs.

  . Implicit conversion from integer vectors to PBVs

  We would like to support the output of comparison operations to be PBVs. This
  requires us to define the implicit conversion from an integer vector to PBV
  as the result of vector comparisons are integer vectors.

  To define this operation:

    bool_vector = vector <cmpop> vector

  There is no change in how vector <cmpop> vector behavior i.e. this comparison
  would still produce an int_vector type as it does now.

    temp_int_vec = vector <cmpop> vector
    bool_vec = temp_int_vec // Implicit conversion from int_vec to bool_vec

  The implicit conversion from int_vec to bool I'd define simply to be:

    bool_vec[n] = (_Bool) int_vec[n]

  where the C11 standard rules apply
  6.3.1.2 Boolean type  When any scalar value is converted to _Bool, the result
  is 0 if the value compares equal to 0; otherwise, the result is 1.


[1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
[2] https://reviews.llvm.org/D88905
[3] https://reviews.llvm.org/D81083

Thoughts?

Thanks,
Tejas.