From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-x22a.google.com (mail-lj1-x22a.google.com [IPv6:2a00:1450:4864:20::22a]) by sourceware.org (Postfix) with ESMTPS id F3BFD3858417 for ; Wed, 26 Jul 2023 12:34:29 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F3BFD3858417 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lj1-x22a.google.com with SMTP id 38308e7fff4ca-2b70404a5a0so99504711fa.2 for ; Wed, 26 Jul 2023 05:34:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1690374868; x=1690979668; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=b12Uyrxp4nY3Nl2cqFvoiV2Q0wOfnPAQa7werL3RXN4=; b=U4FbpA5QvfH4X2x0fR3YYuhhzcIN6w6Hul3sKffMs1a/N5FpaXfZj8NAkJm60cM1bx ujmgsFCD1riOBjy5TdWmQ7u8iniVNl3j5u0im/moaOFS4njIWanAQXmLTSuPmwLRaE7F xJEuc0UEEwaIGX5i07N5XOQnGUWgwayU6VINpZ2ee9zE57+/4JJ4FtLAU16vuKK+h9fx l9BlY2aD7dUl+ZVIui1vCOPFrwhVznOKKL32hQHm8qVQWfV7Rfi9ZC+TE5le+/wfrK5R 29x7IvqhL3wIGX734NbgqL1W0rDSGimsTm6ktogmqZ6VcxgFgTr7sEVje4WHyEpE2fzU M+3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690374868; x=1690979668; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=b12Uyrxp4nY3Nl2cqFvoiV2Q0wOfnPAQa7werL3RXN4=; b=YgMfaOnH9hjaKnvv1n5pT1f0mGpz28lHGUyVsFGozRFd5+66f2hcSXlMq9a+hPqLG/ WnjBWbY9Umd/jpK4FvnYrbQAqJK/FP4XqVLr7dJckJMXMoMXn7NVMPhP673IJNuWMBJL nbvBHC6TxzILP3ONmIvhZHvEzlDIoQXBzJMsvBnqpOQAZs78Ou1zXHm9fz6ff568Z8zO 6zLvxivwts9lsmi2UiG7T1CMWV2b4RyDlRQVeob2C9Dx14a5PSJSWLHpshx4P9rzeZCz IFK6HOgNVoW4ygOxIlk5IJOAtP+AkwQKxkGqvnUCrZZK+LtuKVGAh4p8MbBs+hplc7QK /J4w== X-Gm-Message-State: ABy/qLa6uuckfPGHaCn0ArwwpGjjwGz78xWMi1UZ5dXkvfnvsFhuxr5n BBaTs2VDkj7vPVAFLcsxUdI/HrKtb1xovJ+fzMs= X-Google-Smtp-Source: APBJJlE9Gews1It6i3MlryMyWD/2qG3lW2QLnDQ0zHQDkxbCAcnJSPGtJMPI8q/5xlRDviIPVcsX8I2lgQk/pc4rCDI= X-Received: by 2002:a05:651c:214:b0:2b6:9bd3:840e with SMTP id y20-20020a05651c021400b002b69bd3840emr1412150ljn.21.1690374867601; Wed, 26 Jul 2023 05:34:27 -0700 (PDT) MIME-Version: 1.0 References: <87a51e61-271a-44d7-ed94-de45d32b2e18@arm.com> In-Reply-To: From: Richard Biener Date: Wed, 26 Jul 2023 14:33:41 +0200 Message-ID: Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors To: Tejas Belagod , Matthias Kretz Cc: "gcc-patches@gcc.gnu.org" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-1.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Wed, Jul 26, 2023 at 2:26=E2=80=AFPM Richard Biener wrote: > > On Wed, Jul 26, 2023 at 9:21=E2=80=AFAM Tejas Belagod wrote: > > > > On 7/17/23 5:46 PM, Richard Biener wrote: > > > On Fri, Jul 14, 2023 at 12:18=E2=80=AFPM Tejas Belagod wrote: > > >> > > >> On 7/13/23 4:05 PM, Richard Biener wrote: > > >>> On Thu, Jul 13, 2023 at 12:15=E2=80=AFPM Tejas Belagod wrote: > > >>>> > > >>>> On 7/3/23 1:31 PM, Richard Biener wrote: > > >>>>> On Mon, Jul 3, 2023 at 8:50=E2=80=AFAM Tejas Belagod wrote: > > >>>>>> > > >>>>>> On 6/29/23 6:55 PM, Richard Biener wrote: > > >>>>>>> On Wed, Jun 28, 2023 at 1:26=E2=80=AFPM Tejas Belagod wrote: > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> From: Richard Biener > > >>>>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM > > >>>>>>>> To: Tejas Belagod > > >>>>>>>> Cc: gcc-patches@gcc.gnu.org > > >>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vect= ors > > >>>>>>>> > > >>>>>>>> On Tue, Jun 27, 2023 at 8:30=E2=80=AFAM Tejas Belagod wrote: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> From: Richard Biener > > >>>>>>>>> Date: Monday, June 26, 2023 at 2:23 PM > > >>>>>>>>> To: Tejas Belagod > > >>>>>>>>> Cc: gcc-patches@gcc.gnu.org > > >>>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vec= tors > > >>>>>>>>> > > >>>>>>>>> On Mon, Jun 26, 2023 at 8:24=E2=80=AFAM Tejas Belagod via Gcc= -patches > > >>>>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>> Hi, > > >>>>>>>>>> > > >>>>>>>>>> Packed Boolean Vectors > > >>>>>>>>>> ---------------------- > > >>>>>>>>>> > > >>>>>>>>>> I'd like to propose a feature addition to GNU Vector extensi= ons to add packed > > >>>>>>>>>> boolean vectors (PBV). This has been discussed in the past = here[1] and a variant has > > >>>>>>>>>> been implemented in Clang recently[2]. > > >>>>>>>>>> > > >>>>>>>>>> With predication features being added to vector architecture= s (SVE, MVE, AVX), > > >>>>>>>>>> it is a useful feature to have to model predication on targe= ts. This could > > >>>>>>>>>> find its use in intrinsics or just used as is as a GNU vecto= r extension being > > >>>>>>>>>> mapped to underlying target features. For example, the pack= ed boolean vector > > >>>>>>>>>> could directly map to a predicate register on SVE. > > >>>>>>>>>> > > >>>>>>>>>> Also, this new packed boolean type GNU extension can be used= with SVE ACLE > > >>>>>>>>>> intrinsics to replace a fixed-length svbool_t. > > >>>>>>>>>> > > >>>>>>>>>> Here are a few options to represent the packed boolean vecto= r type. > > >>>>>>>>> > > >>>>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute: > > >>>>>>>>> > > >>>>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int)))); > > >>>>>>>>> typedef v8si v8sib __attribute__((vector_mask)); > > >>>>>>>>> > > >>>>>>>>> it get's you a vector type that's the appropriate (dependent = on the > > >>>>>>>>> target) vector > > >>>>>>>>> mask type for the vector data type (v8si in this case). > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thanks Richard. > > >>>>>>>>> > > >>>>>>>>> Having had a quick look at the implementation, it does seem t= o tick the boxes. > > >>>>>>>>> > > >>>>>>>>> I must admit I haven't dug deep, but if the target hook allow= s the mask to be > > >>>>>>>>> > > >>>>>>>>> defined in way that is target-friendly (and I don't know how = much effort it will > > >>>>>>>>> > > >>>>>>>>> be to migrate the attribute to more front-ends), it should do= the job nicely. > > >>>>>>>>> > > >>>>>>>>> Let me go back and dig a bit deeper and get back with questio= ns if any. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Let me add that the advantage of this is the compiler doesn't = need > > >>>>>>>> to support weird explicitely laid out packed boolean vectors t= hat do > > >>>>>>>> not match what the target supports and the user doesn't need t= o know > > >>>>>>>> what the target supports (and thus have an #ifdef maze around = explicitely > > >>>>>>>> specified layouts). > > >>>>>>>> > > >>>>>>>> Sorry for the delayed response =E2=80=93 I spent a day experim= enting with vector_mask. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Yeah, this is what option 4 in the RFC is trying to achieve = =E2=80=93 be portable enough > > >>>>>>>> > > >>>>>>>> to avoid having to sprinkle the code with ifdefs. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> It does remove some flexibility though, for example with -mavx= 512f -mavx512vl > > >>>>>>>> you'll get AVX512 style masks for V4SImode data vectors but of= course the > > >>>>>>>> target sill supports SSE2/AVX2 style masks as well, but those = would not be > > >>>>>>>> available as "packed boolean vectors", though they are of cour= se in fact > > >>>>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this= particular > > >>>>>>>> case it might not matter. > > >>>>>>>> > > >>>>>>>> That said, the vector_mask attribute will get you V4SImode vec= tors with > > >>>>>>>> signed boolean elements of 32 bits for V4SImode data vectors w= ith > > >>>>>>>> SSE2/AVX2. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> This sounds very much like what the scenario would be with NEO= N vs SVE. Coming to think > > >>>>>>>> > > >>>>>>>> of it, vector_mask resembles option 4 in the proposal with =E2= =80=98n=E2=80=99 implied by the =E2=80=98base=E2=80=99 vector type > > >>>>>>>> > > >>>>>>>> and a =E2=80=98w=E2=80=99 specified for the type. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Given its current implementation, if vector_mask is exposed to= the CFE, would there be any > > >>>>>>>> > > >>>>>>>> major challenges wrt implementation or defining behaviour sema= ntics? I played around with a > > >>>>>>>> > > >>>>>>>> few examples from the testsuite and wrote some new ones. I mos= tly tried operations that > > >>>>>>>> > > >>>>>>>> the new type would have to support (unary, binary bitwise, ini= tializations etc) =E2=80=93 with a couple of exceptions > > >>>>>>>> > > >>>>>>>> most of the ops seem to be supported. I also triggered a coupl= e of ICEs in some tests involving > > >>>>>>>> > > >>>>>>>> implicit conversions to wider/narrower vector_mask types (will= raise reports for these). Correct me > > >>>>>>>> > > >>>>>>>> if I=E2=80=99m wrong here, but we=E2=80=99d probably have to s= upport a couple of new ops if vector_mask is exposed > > >>>>>>>> > > >>>>>>>> to the CFE =E2=80=93 initialization and subscript operations? > > >>>>>>> > > >>>>>>> Yes, either that or restrict how the mask vectors can be used, = thus > > >>>>>>> properly diagnose improper > > >>>>>>> uses. > > >>>>>> > > >>>>>> Indeed. > > >>>>>> > > >>>>>> A question would be for example how to write common mask te= st > > >>>>>>> operations like > > >>>>>>> if (any (mask)) or if (all (mask)). > > >>>>>> > > >>>>>> I see 2 options here. New builtins could support new types - the= y'd > > >>>>>> provide a target independent way to test any and all conditions.= Another > > >>>>>> would be to let the target use its intrinsics to do them in the = most > > >>>>>> efficient way possible (which the builtins would get lowered dow= n to > > >>>>>> anyway). > > >>>>>> > > >>>>>> > > >>>>>> Likewise writing merge operations > > >>>>>>> - do those as > > >>>>>>> > > >>>>>>> a =3D a | (mask ? b : 0); > > >>>>>>> > > >>>>>>> thus use ternary ?: for this? > > >>>>>> > > >>>>>> Yes, like now, the ternary could just translate to > > >>>>>> > > >>>>>> {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... } > > >>>>>> > > >>>>>> One thing to flesh out is the semantics. Should we allow this op= eration > > >>>>>> as long as the number of elements are the same even if the mask = type if > > >>>>>> different i.e. > > >>>>>> > > >>>>>> v4hib ? v4si : v4si; > > >>>>>> > > >>>>>> I don't see why this can't be allowed as now we let > > >>>>>> > > >>>>>> v4si ? v4sf : v4sf; > > >>>>>> > > >>>>>> > > >>>>>> For initialization regular vector > > >>>>>>> syntax should work: > > >>>>>>> > > >>>>>>> mtype mask =3D (mtype){ -1, -1, 0, 0, ... }; > > >>>>>>> > > >>>>>>> there's the question of the signedness of the mask elements. G= CC > > >>>>>>> internally uses signed > > >>>>>>> bools with values -1 for true and 0 for false. > > >>>>>> > > >>>>>> One of the things is the value that represents true. This is lar= gely > > >>>>>> target-dependent when it comes to the vector_mask type. When vec= tor_mask > > >>>>>> types are created from GCC's internal representation of bool vec= tors > > >>>>>> (signed ints) the point about implicit/explicit conversions from= signed > > >>>>>> int vect to mask types in the proposal covers this. So mask in > > >>>>>> > > >>>>>> v4sib mask =3D (v4sib){-1, -1, 0, 0, ... } > > >>>>>> > > >>>>>> will probably end up being represented as 0x3xxxx on AVX512 and = 0x11xxx > > >>>>>> on SVE. On AVX2/SSE they'd still be represented as vector of sig= ned ints > > >>>>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramificati= ons this > > >>>>>> new mask type representations will have in the mid-end while bei= ng > > >>>>>> converted back and forth to and from GCC's internal representati= on, but > > >>>>>> I'm guessing this is already being handled at some level by the > > >>>>>> vector_mask type's current support? > > >>>>> > > >>>>> Yes, I would guess so. Of course what the middle-end is currentl= y exposed > > >>>>> to is simply what the vectorizer generates - once fuzzers discove= r this feature > > >>>>> we'll see "interesting" uses that might run into missed or wrong = handling of > > >>>>> them. > > >>>>> > > >>>>> So whatever we do on the side of exposing this to users a good po= rtion > > >>>>> of testsuite coverage for the allowed use cases is important. > > >>>>> > > >>>>> Richard. > > >>>>> > > >>>> > > >>>> Apologies for the long-ish reply, but here's a TLDR and gory detai= ls follow. > > >>>> > > >>>> TLDR: > > >>>> GIMPLE's vector_mask type semantics seems to be target-dependent, = so > > >>>> elevating vector_mask to CFE with same semantics is undesirable. O= TOH, > > >>>> changing vector_mask to have target-independent CFE semantics will= cause > > >>>> dichotomy between its CFE and GFE behaviours. But vector_mask appr= oach > > >>>> scales well for sizeless types. Is the solution to have something = like > > >>>> vector_mask with defined target-independent type semantics, but ca= ll it > > >>>> something else to prevent conflation with GIMPLE, a viable option? > > >>>> > > >>>> Details: > > >>>> After some more analysis of the proposed options, here are some > > >>>> interesting findings: > > >>>> > > >>>> vector_mask looked like a very interesting option until I ran into= some > > >>>> semantic uncertainly. This code: > > >>>> > > >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int)))); > > >>>> typedef v8si v8sib __attribute__((vector_mask)); > > >>>> > > >>>> typedef short v8hi __attribute__((vector_size(8*sizeof(short)))); > > >>>> typedef v8hi v8hib __attribute__((vector_mask)); > > >>>> > > >>>> v8si res; > > >>>> v8hi resh; > > >>>> > > >>>> v8hib __GIMPLE () foo (v8hib x, v8sib y) > > >>>> { > > >>>> v8hib res; > > >>>> > > >>>> res =3D x & y; > > >>>> return res; > > >>>> } > > >>>> > > >>>> When compiled on AArch64, produces a type-mismatch error for binar= y > > >>>> expression involving '&' because the 'derived' types 'v8hib' and '= v8sib' > > >>>> have a different target-layout. If the layout of these two 'de= rived' > > >>>> types match, then the above code has no issue. Which is the case o= n > > >>>> amdgcn-amdhsa target where it compiles without any error(amdgcn us= es a > > >>>> scalar DImode mask mode). IoW such code seems to be allowed on som= e > > >>>> targets and not on others. > > >>>> > > >>>> With the same code, I tried putting casts and it worked fine on AA= rch64 > > >>>> and amdgcn. This target-specific behaviour of vector_mask derived = types > > >>>> will be difficult to specify once we move it to the CFE - in fact = we > > >>>> probably don't want target-specific behaviour once it moves to the= CFE. > > >>>> > > >>>> If we expose vector_mask to CFE, we'd have to specify consistent > > >>>> semantics for vector_mask types. We'd have to resolve ambiguities = like > > >>>> 'v4hib & v4sib' clearly to be able to specify the semantics of the= type > > >>>> system involving vector_mask. If we do this, don't we run the risk= of a > > >>>> dichotomy between the CFE and GFE semantics of vector_mask? I'm as= suming > > >>>> we'd want to retain vector_mask semantics as they are in GIMPLE. > > >>>> > > >>>> If we want to enforce constant semantics for vector_mask in the CF= E, one > > >>>> way is to treat vector_mask types as distinct if they're 'attached= ' to > > >>>> distinct data vector types. In such a scenario, vector_mask types > > >>>> attached to two data vector types with the same lane-width and num= ber of > > >>>> lanes would be classified as distinct. For eg: > > >>>> > > >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int)))); > > >>>> typedef v8si v8sib __attribute__((vector_mask)); > > >>>> > > >>>> typedef float v8sf __attribute__((vector_size(8*sizeof(float)))); > > >>>> typedef v8sf v8sfb __attribute__((vector_mask)); > > >>>> > > >>>> v8si foo (v8sf x, v8sf y, v8si i, v8si j) > > >>>> { > > >>>> (a =3D=3D b) & (v8sfb)(x =3D=3D y) ? x : (v8si){0}; > > >>>> } > > >>>> > > >>>> This could be the case for unsigned vs signed int vectors too for = eg - > > >>>> seems a bit unnecessary tbh. > > >>>> > > >>>> Though vector_mask's being 'attached' to a type has its drawbacks,= it > > >>>> does seem to have an advantage when sizeless types are considered.= If we > > >>>> have to define a sizeless vector boolean type that is implied by t= he > > >>>> lane size, we could do something like > > >>>> > > >>>> typedef svint32_t svbool32_t __attribute__((vector_mask)); > > >>>> > > >>>> int32_t foo (svint32_t a, svint32_t b) > > >>>> { > > >>>> svbool32_t pred =3D a > b; > > >>>> > > >>>> return pred[2] ? a[2] : b[2]; > > >>>> } > > >>>> > > >>>> This is harder to do in the other schemes proposed so far as they'= re > > >>>> size-based. > > >>>> > > >>>> To be able to free the boolean from the base type (not size) and r= etain > > >>>> vector_mask's flexibility to declare sizeless types, we could have= an > > >>>> attribute that is more flexibly-typed and only 'derives' the lane-= size > > >>>> and number of lanes from its 'base' type without actually inheriti= ng the > > >>>> actual base type(char, short, int etc) or its signedness. This cre= ates a > > >>>> purer and stand-alone boolean type without the associated semantic= s' > > >>>> complexity of having to cast between two same-size types with the = same > > >>>> number of lanes. Eg. > > >>>> > > >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int)))); > > >>>> typedef v8si v8b __attribute__((vector_bool)); > > >>>> > > >>>> However, with differing lane-sizes, there will have to be a cast a= s the > > >>>> 'derived' element size is different which could impact the layout = of the > > >>>> vector mask. Eg. > > >>>> > > >>>> v8si foo (v8hi x, v8hi y, v8si i, v8si j) > > >>>> { > > >>>> (v8sib)(x =3D=3D y) & (i =3D=3D j) ? i : (v8si){0}; > > >>>> } > > >>>> > > >>>> Such conversions on targets like AVX512/AMDGCN will be a NOP, but > > >>>> non-trivial on SVE (depending on the implemented layout of the boo= l vector). > > >>>> > > >>>> vector_bool decouples us from having to retain the behaviour of > > >>>> vector_mask and provides the flexibility of not having to cast acr= oss > > >>>> same-element-size vector types. Wrt to sizeless types, it could sc= ale well. > > >>>> > > >>>> typedef svint32_t svbool32_t __attribute__((vector_bool)); > > >>>> typedef svint16_t svbool16_t __attribute__((vector_bool)); > > >>>> > > >>>> int32_t foo (svint32_t a, svint32_t b) > > >>>> { > > >>>> svbool32_t pred =3D a > b; > > >>>> > > >>>> return pred[2] ? a[2] : b[2]; > > >>>> } > > >>>> > > >>>> int16_t bar (svint16_t a, svint16_t b) > > >>>> { > > >>>> svbool16_t pred =3D a > b; > > >>>> > > >>>> return pred[2] ? a[2] : b[2]; > > >>>> } > > >>>> > > >>>> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint3= 2_t on > > >>>> the target predicate. > > >>>> > > >>>> Thoughts? > > >>> > > >>> The GIMPLE frontend accepts just what is valid on the target here. = Any > > >>> "plumbing" such as implicit conversions (if we do not want to requi= re > > >>> explicit ones even when NOP) need to be done/enforced by the C fron= tend. > > >>> > > >> > > >> Sorry, I'm not sure I follow - correct me if I'm wrong here. > > >> > > >> If we desire to define/allow operations like implicit/explicit > > >> conversion on vector_mask types in CFE, don't we have to start from = a > > >> position of defining what makes vector_mask types distinct and there= fore > > >> require implicit/explicit conversions? > > > > > > We need to look at which operations we want to produce vector masks a= nd > > > which operations consume them and what operations operate on them. > > > > > > In GIMPLE comparisons produce them, conditionals consume them and > > > we allow bitwise ops to operate on them directly (GIMPLE doesn't have > > > logical && it just has bitwise &). > > > > > > > Thanks for your thoughts - after I spent more cycles researching and > > experimenting, I think I understand the driving factors here. Compariso= n > > producers generate signed integer vectors of the same lane-width as the > > comparison operands. This means mixed type vectors can't be applied to > > conditional consumers or bitwise operators eg: > > > > v8hi foo (v8si a, v8si b, v8hi c, v8hi d) > > { > > return a > b || c > d; // error! > > return a > b || __builtin_convertvector (c > d, v8si); // OK. > > return a | b && c | d; // error! > > return a | b && __builtin_convertvector (c | d, v8si); // OK. > > } > > > > Similarly, if we extend these 'stricter-typing' rules to vector_mask, i= t > > could look like: > > > > typedef v4sib v4si __attribute__((vector_mask)); > > typedef v4hib v4hi __attribute__((vector_mask)); > > > > v8sib foo (v8si a, v8si b, v8hi c, v8hi d) > > { > > v8sib psi =3D a > b; > > v8hib phi =3D c > d; > > > > return psi || phi; // error! > > return psi || __builtin_convertvector (phi, v8sib); // OK. > > return psi | phi; // error! > > return psi | __builtin_convertvector (phi, v8sib); // OK. > > } > > > > At GIMPLE stage, on targets where the layout allows it (eg AMDGCN), > > expressions like > > psi | __builtin_convertvector (phi, v8sib) > > can be optimized to > > psi | phi > > because __builtin_convertvector (phi, v8sib) is a NOP. > > > > I think this could make vector_mask more portable across targets. If on= e > > wants to take CFE vector_mask code and run it on the GFE, it should > > work; while the reverse won't as CFE vector_mask rules are more restric= tive. > > > > Does this look like a sensible approach for progress? > > Yes, that looks good. > > > >> IIUC, GFE's distinctness of vector_mask types depends on how the mas= k > > >> mode is implemented on the target. If implemented in CFE, vector_mas= k > > >> types' distinctness probably shouldn't be based on target layout and > > >> could be based on the type they're 'attached' to. > > > > > > But since we eventually run on the target the layout should ideally > > > match that of the target. Now, the question is whether that's ever > > > OK behavior - it effectively makes the mask somewhat opaque and > > > only "observable" by probing it in defined manners. > > > > > >> Wouldn't that diverge from target-specific GFE behaviour - or are yo= u > > >> suggesting its OK for vector_mask type semantics to be different in = CFE > > >> and GFE? > > > > > > It's definitely undesirable but as said I'm not sure it has to differ > > > [the layout]. > > > > > > > I agree it is best to have a consistent layout of vector_mask across CF= E > > and GFE and also implement it to match the target layout for optimal > > code quality. > > > > For observability, I think it makes sense to allow operations that are > > relevant and have a consistent meaning irrespective of that target. Eg. > > 'vector_mask & 2' might not mean the same thing on all targets, but > > vector_mask[2] does. Therefore, I think the opaqueness is useful and > > necessary to some extent. > > Yes. The main question regarding to observability will be > things like sizeof or alignof and putting masks into addressable storage. > > I think IBM folks have introduced some "opaque" types for their > matrix-multiplication accelerator where intrinsics need something to > work with but the observability of many aspect is restricted. In > the middle-end we have OPAQUE_TYPE and MODE_OPAQUE > (but IIRC there can only be a single kind of that at the moment). > Interestingly an OPAQUE_TYPE does have a size. > > Note one way out would be to make vector_mask types "decay" > to a value vector type. Thus any time you try to observe it > you get a vector bool (a vector of actual 8 bit bool data elements) > and when you use a vector data type in mask context you > get a "mask conversion" aka vector bool !=3D 0. It would then be > up to the compiler to elide round-trips between mask and data. > > That would mean sizeof (vector_mask) would be sizeof (vector bool) > even when for example the hardware would produce V4SImode > mask from a V4SFmode compare or when it would produce a > QImode 4-bit integer from the same? Btw, how the experimental SIMD C++ standard library handles these issue might be also interesting to research (author CCed) Richard. > Richard. > > > Thanks, > > Tejas. > > > > >>> There's one issue I can see that wasn't mentioned yet - GCC current= ly > > >>> accepts > > >>> > > >>> typedef long gv1024di __attribute__((vector_size(1024*8))); > > >>> > > >>> even if there's no underlying support on the target which either ha= s support > > >>> only for smaller vectors or no vectors at all. Currently vector_ma= sk will > > >>> simply fail to produce sth desirable here. What's your idea of mak= ing > > >>> that not target dependent? GCC will later lower operations with su= ch > > >>> vectors, possibly splitting them up into sizes supported by the har= dware > > >>> natively, possibly performing elementwise operations. For the form= er > > >>> one would need to guess the "decomposition type" and based on that > > >>> select the mask type [layout]? > > >>> > > >>> One idea would be to specify the mask layout follows the largest ve= ctor > > >>> kind supported by the target and if there is none follow the layout > > >>> of (signed?) _Bool [n]? When there's no target support for vectors > > >>> GCC will generally use elementwise operations apart from some > > >>> special-cases. > > >>> > > >> > > >> That is a very good point - thanks for raising it. For when GCC choo= ses > > >> to lower to a vector type supported by the target, my initial though= t > > >> would be to, as you say, choose a mask that has enough bits to repre= sent > > >> the largest vector size with the smallest lane-width. The actual lay= out > > >> of the mask will depend on how the target implements its mask mode. > > >> Decomposition of vector_mask ought to follow the decomposition of th= e > > >> GNU vectors type and each decomposed vector_mask type ought to have > > >> enough bits to represent the decomposed GNU vector shape. It sounds = nice > > >> on paper, but I haven't really worked through a design for this. Do = you > > >> see any gotchas here? > > > > > > Not really. In the end it comes down to what the C writer is allowed= to > > > do with a vector mask. I would for example expect that I could do > > > > > > auto m =3D v1 < v2; > > > _mm512_mask_sub_epi32 (a, m, b, c); > > > > > > so generic masks should inter-operate with intrinsics (when the appro= priate > > > ISA is enabled). That works for the data vectors themselves for exam= ple > > > (quite some intrinsics are implemented with GCCs generic vector code)= . > > > > > > I for example can't do > > > > > > _Bool lane2 =3D m[2]; > > > > > > to inspect lane two of a maks with AVX512. I can do m & 2 but I woul= dn't expect > > > that to work (should I?) with a vector_mask mask (it's at least not > > > valid directly > > > in GIMPLE). There's _mm512_int2mask and _mm512_mask2int which transf= er > > > between mask and int (but the mask types are really just typedefd to > > > integer typeS). > > > > > >>> While using a different name than vector_mask is certainly possible > > >>> it wouldn't me to decide that, but I'm also not yet convinced it's > > >>> really necessary. As said, what the GIMPLE frontend accepts > > >>> or not shouldn't limit us here - just the actual chosen layout of t= he > > >>> boolean vectors. > > >>> > > >> > > >> I'm just concerned about creating an alternate vector_mask functiona= lity > > >> in the CFE and risk not being consistent with GFE. > > > > > > I think it's more important to double-check usablilty from the users = side. > > > If the implementation necessarily diverges from GIMPLE then we can > > > choose a different attribute name but then it will also inevitably ha= ve > > > code-generation (quality) issues as GIMPLE matches what the hardware > > > can do. > > > > > > Richard. > > > > > >> Thanks, > > >> Tejas. > > >> > > >>> Richard. > > >>> > > >>>> Thanks, > > >>>> Tejas. > > >>>> > > >>>>>> > > >>>>>> Thanks, > > >>>>>> Tejas. > > >>>>>> > > >>>>>>> > > >>>>>>> Richard. > > >>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> > > >>>>>>>> Tejas. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Richard. > > >>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Thanks, > > >>>>>>>>> > > >>>>>>>>> Tejas. > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes > > >>>>>>>>>> > > >>>>>>>>>> typedef bool vbool __attribute__ ((vector_size (1))); > > >>>>>>>>>> > > >>>>>>>>>> In this approach, the shape of the boolean vector is unclear= . IoW, it is not > > >>>>>>>>>> clear if each bit in 'n' controls a byte or an element. On t= argets > > >>>>>>>>>> like SVE, it would be natural to have each bit control a byt= e of the target > > >>>>>>>>>> vector (therefore resulting in an 'unpacked' layout of the P= BV) and on AVX, each > > >>>>>>>>>> bit would control one element/lane on the target vector(ther= efore resulting in a > > >>>>>>>>>> 'packed' layout with all significant bits at the LSB). > > >>>>>>>>>> > > >>>>>>>>>> 2. __attribute__((vector_size (n))) where n represents num o= f lanes > > >>>>>>>>>> > > >>>>>>>>>> typedef int v4si __attribute__ ((vector_size (4 * size= of (int))); > > >>>>>>>>>> typedef bool v4bi __attribute__ ((vector_size (sizeof = v4si / sizeof (v4si){0}[0]))); > > >>>>>>>>>> > > >>>>>>>>>> Here the 'n' in the vector_size attribute represents the num= ber of bits that > > >>>>>>>>>> is needed to represent a vector quantity. In this case, thi= s packed boolean > > >>>>>>>>>> vector can represent upto 'n' vector lanes. The size of the = type is > > >>>>>>>>>> rounded up the nearest byte. For example, the sizeof v4bi i= n the above > > >>>>>>>>>> example is 1. > > >>>>>>>>>> > > >>>>>>>>>> In this approach, because of the nature of the representatio= n, the n bits required > > >>>>>>>>>> to represent the n lanes of the vector are packed at the LSB= . This does not naturally > > >>>>>>>>>> align with the SVE approach of each bit representing a byte = of the target vector > > >>>>>>>>>> and PBV therefore having an 'unpacked' layout. > > >>>>>>>>>> > > >>>>>>>>>> More importantly, another drawback here is that the change i= n units for vector_size > > >>>>>>>>>> might be confusing to programmers. The units will have to b= e interpreted based on the > > >>>>>>>>>> base type of the typedef. It does not offer any flexibility = in terms of the layout of > > >>>>>>>>>> the bool vector - it is fixed. > > >>>>>>>>>> > > >>>>>>>>>> 3. Combination of 1 and 2. > > >>>>>>>>>> > > >>>>>>>>>> Combining the best of 1 and 2, we can introduce extra parame= ters to vector_size that will > > >>>>>>>>>> unambiguously represent the layout of the PBV. Consider > > >>>>>>>>>> > > >>>>>>>>>> typedef bool vbool __attribute__((vector_size (s, n[, = w]))); > > >>>>>>>>>> > > >>>>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and a= n optional 3rd parameter 'w' > > >>>>>>>>>> is the number of bits of the PBV that represents a lane of t= he target vector. 'w' would > > >>>>>>>>>> allow a target to force a certain layout of the PBV. > > >>>>>>>>>> > > >>>>>>>>>> The 2-parameter form of vector_size allows the target to hav= e an > > >>>>>>>>>> implementation-defined layout of the PBV. The target is free= to choose the 'w' > > >>>>>>>>>> if it is not specified to mirror the target layout of predic= ate registers. For > > >>>>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n. > > >>>>>>>>>> > > >>>>>>>>>> As an example, to represent the result of a comparison on 2 = int16x8_t, we'd need > > >>>>>>>>>> 8 lanes of boolean which could be represented by > > >>>>>>>>>> > > >>>>>>>>>> typedef bool v8b __attribute__ ((vector_size (2, 8))); > > >>>>>>>>>> > > >>>>>>>>>> SVE would implement v8b layout to make every 2nd bit signifi= cant i.e. w =3D=3D 2 > > >>>>>>>>>> > > >>>>>>>>>> and AVX would choose a layout where all 8 consecutive bits p= acked at LSB would > > >>>>>>>>>> be significant i.e. w =3D=3D 1. > > >>>>>>>>>> > > >>>>>>>>>> This scheme would accomodate more than 1 target to effective= ly represent vector > > >>>>>>>>>> bools that mirror the target properties. > > >>>>>>>>>> > > >>>>>>>>>> 4. A new attribite > > >>>>>>>>>> > > >>>>>>>>>> This is based on a suggestion from Richard S in [3]. The ide= a is to introduce a new > > >>>>>>>>>> attribute to define the PBV and make it general enough to > > >>>>>>>>>> > > >>>>>>>>>> * represent all targets flexibly (SVE, AVX etc) > > >>>>>>>>>> * represent sub-byte length predicates > > >>>>>>>>>> * have no change in units of vector_size/no new vector_size = signature > > >>>>>>>>>> * not have the number of bytes constrain representation > > >>>>>>>>>> > > >>>>>>>>>> If we call the new attribute 'bool_vec' (for lack of a bette= r name), consider > > >>>>>>>>>> > > >>>>>>>>>> typedef bool vbool __attribute__((bool_vec (n[, w]))) > > >>>>>>>>>> > > >>>>>>>>>> where 'n' represents number of lanes/elements and the option= al 'w' is bits-per-lane. > > >>>>>>>>>> > > >>>>>>>>>> If 'w' is not specified, it and bytes-per-predicate are impl= ementation-defined based on target. > > >>>>>>>>>> If 'w' is specified, sizeof (vbool) will be ceil (n*w/8). > > >>>>>>>>>> > > >>>>>>>>>> 5. Behaviour of the packed vector boolean type. > > >>>>>>>>>> > > >>>>>>>>>> Taking the example of one of the options above, following is= an illustration of it's behavior > > >>>>>>>>>> > > >>>>>>>>>> * ABI > > >>>>>>>>>> > > >>>>>>>>>> New ABI rules will need to be defined for this type - = eg alignment, PCS, > > >>>>>>>>>> mangling etc > > >>>>>>>>>> > > >>>>>>>>>> * Initialization: > > >>>>>>>>>> > > >>>>>>>>>> Packed Boolean Vectors(PBV) can be initialized like so= : > > >>>>>>>>>> > > >>>>>>>>>> typedef bool v4bi __attribute__ ((vector_size (2, 4,= 4))); > > >>>>>>>>>> v4bi p =3D {false, true, false, false}; > > >>>>>>>>>> > > >>>>>>>>>> Each value in the initizlizer constant is of type bool= . The lowest numbered > > >>>>>>>>>> element in the const array corresponds to the LSbit of= p, element 1 is > > >>>>>>>>>> assigned to bit 4 etc. > > >>>>>>>>>> > > >>>>>>>>>> p is effectively a 2-byte bitmask with value 0x0010 > > >>>>>>>>>> > > >>>>>>>>>> With a different layout > > >>>>>>>>>> > > >>>>>>>>>> typedef bool v4bi __attribute__ ((vector_size (2, 4,= 1))); > > >>>>>>>>>> v4bi p =3D {false, true, false, false}; > > >>>>>>>>>> > > >>>>>>>>>> p is effectively a 2-byte bitmask with value 0x0002 > > >>>>>>>>>> > > >>>>>>>>>> * Operations: > > >>>>>>>>>> > > >>>>>>>>>> Packed Boolean Vectors support the following operation= s: > > >>>>>>>>>> . unary ~ > > >>>>>>>>>> . unary ! > > >>>>>>>>>> . binary&,|and=CB=86 > > >>>>>>>>>> . assignments &=3D, |=3D and =CB=86=3D > > >>>>>>>>>> . comparisons <, <=3D, =3D=3D, !=3D, >=3D and > > > >>>>>>>>>> . Ternary operator ?: > > >>>>>>>>>> > > >>>>>>>>>> Operations are defined as applied to the individual el= ements i.e the bits > > >>>>>>>>>> that are significant in the PBV. Whether the PBVs are = treated as bitmasks > > >>>>>>>>>> or otherwise is implementation-defined. > > >>>>>>>>>> > > >>>>>>>>>> Insignificant bits could affect results of comparisons= or ternary operators. > > >>>>>>>>>> In such cases, it is implementation defined how the un= used bits are treated. > > >>>>>>>>>> > > >>>>>>>>>> . Subscript operator [] > > >>>>>>>>>> > > >>>>>>>>>> For the subscript operator, the packed boolean vector = acts like a array of > > >>>>>>>>>> elements - the first or the 0th indexed element being = the LSbit of the PBV. > > >>>>>>>>>> Subscript operator yields a scalar boolean value. > > >>>>>>>>>> For example: > > >>>>>>>>>> > > >>>>>>>>>> typedef bool v8b __attribute__ ((vector_size (2, 8, = 2))); > > >>>>>>>>>> > > >>>>>>>>>> // Subscript operator result yields a boolean value. > > >>>>>>>>>> // x[3] is the 7th LSbit and x[1] is the 3rd LSbit o= f x. > > >>>>>>>>>> bool foo (v8b p, int n) { p[3] =3D true; return p[1]= ; } > > >>>>>>>>>> > > >>>>>>>>>> Out of bounds access: OOB access can be determined at = compile time given the > > >>>>>>>>>> strong typing of the PBVs. > > >>>>>>>>>> > > >>>>>>>>>> PBV does not support address of operator(&) for elemen= ts of PBVs. > > >>>>>>>>>> > > >>>>>>>>>> . Implicit conversion from integer vectors to PBVs > > >>>>>>>>>> > > >>>>>>>>>> We would like to support the output of comparison oper= ations to be PBVs. This > > >>>>>>>>>> requires us to define the implicit conversion from an = integer vector to PBV > > >>>>>>>>>> as the result of vector comparisons are integer vector= s. > > >>>>>>>>>> > > >>>>>>>>>> To define this operation: > > >>>>>>>>>> > > >>>>>>>>>> bool_vector =3D vector vector > > >>>>>>>>>> > > >>>>>>>>>> There is no change in how vector vector behavi= or i.e. this comparison > > >>>>>>>>>> would still produce an int_vector type as it does now. > > >>>>>>>>>> > > >>>>>>>>>> temp_int_vec =3D vector vector > > >>>>>>>>>> bool_vec =3D temp_int_vec // Implicit conversion fro= m int_vec to bool_vec > > >>>>>>>>>> > > >>>>>>>>>> The implicit conversion from int_vec to bool I'd defin= e simply to be: > > >>>>>>>>>> > > >>>>>>>>>> bool_vec[n] =3D (_Bool) int_vec[n] > > >>>>>>>>>> > > >>>>>>>>>> where the C11 standard rules apply > > >>>>>>>>>> 6.3.1.2 Boolean type When any scalar value is convert= ed to _Bool, the result > > >>>>>>>>>> is 0 if the value compares equal to 0; otherwise, the = result is 1. > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434= .html > > >>>>>>>>>> [2] https://reviews.llvm.org/D88905 > > >>>>>>>>>> [3] https://reviews.llvm.org/D81083 > > >>>>>>>>>> > > >>>>>>>>>> Thoughts? > > >>>>>>>>>> > > >>>>>>>>>> Thanks, > > >>>>>>>>>> Tejas. > > >>>>>> > > >>>> > > >> > >