From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Guenther
To: gcc@gcc.gnu.org, richard.sandiford@linaro.org
Date: Tue, 22 Mar 2011 17:10:00 -0000
Subject: Re: RFC: Representing vector lane load/store operations

On Tue, Mar 22, 2011 at 5:52 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> This is an RFC about adding gimple and optab support for things like
> ARM's load-lane and store-lane instructions.  It builds on an earlier
> discussion between Ira and Julian, with the aim of allowing these
> instructions to be used by the vectoriser.
>
> These instructions operate on N vector registers of M elements each and
> on a sequence of 1 or M N-element structures.  They come in three forms:
>
>   - full load/store:
>
>       0<=I<M, 0<=J<N: element I of register J <---> field J of structure I
>
>     E.g., for N=3, M=4:
>
>          Registers                   Memory
>          ----------------            ---------------
>          RRRR  GGGG  BBBB    <--->   RGB RGB RGB RGB
>
>   - lane load/store:
>
>       given L, 0<=I<N: element L of register I <---> field I of the structure
>
>     E.g., for N=3, M=4, L=2:
>
>          Registers                   Memory
>          ----------------            ---------------
>          ..R.  ..G.  ..B.    <--->   RGB
>
>   - load-and-duplicate:
>
>       0<=I<M, 0<=J<N: element I of register J <---- field J of the structure
>
>     E.g. for N=3 V4HIs:
>
>          Registers                   Memory
>          ----------------            ----------------
>          RRRR  GGGG  BBBB    <----   RGB
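For concreteness, the full form above is what ARM's vld3/vst3 already
expose through arm_neon.h.  A rough sketch of the N=3, M=4 uint32 case
(the function and buffer names are made up for illustration;
vld3q_u32/vst3q_u32 map to vld3.32/vst3.32):

    #include <arm_neon.h>

    /* De-interleave four RGB uint32 triplets (vld3.32), do an ordinary
       vector add on one channel, and interleave back (vst3.32).  */
    void
    bump_red (uint32_t *pixels)   /* 12 uint32s: RGB RGB RGB RGB */
    {
      uint32x4x3_t v = vld3q_u32 (pixels);   /* val[0]=RRRR val[1]=GGGG val[2]=BBBB */
      v.val[0] = vaddq_u32 (v.val[0], vdupq_n_u32 (1));
      vst3q_u32 (pixels, v);                 /* back to RGB RGB RGB RGB */
    }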
>
> Starting points:
>
>   1) Memory references should be MEM_REFs at the gimple level.
>      We shouldn't add new tree codes for memory references.
>
>   2) Because of the large data involved (at least in the "full" case),
>      the gimple statement that represents the lane interleaving should
>      also have the MEM_REF.  The two shouldn't be split between
>      statements.
>
>   3) The ARM doubleword instructions allow the N vectors to be in
>      consecutive registers (DM, DM+1, ...) or in every second register
>      (DM, DM+2, ...).  However, the latter case is only interesting
>      if we're dealing with halves of quadword vectors.  It's therefore
>      reasonable to view the N vectors as one big value.
>
> (3) significantly simplifies things at the rtl level for ARM, because it
> avoids having to find some way of saying that N separate pseudos must
> be allocated to N consecutive hard registers.  If other targets allow the
> N vectors to be stored in arbitrary (non-consecutive) registers, then
> they could split the register up into subregs at expand time.
> The lower-subreg pass should then optimise things nicely.
>
> The easiest way of dealing with (1) and (2) seems to be to model the
> operations as built-in functions.  And if we do treat the N vectors as
> a single value, the load functions can simply return that value.  So we
> could have something like:
>
>   - full load/store:
>
>       combined_vectors = __builtin_load_lanes (memory);
>       memory = __builtin_store_lanes (combined_vectors);
>
>   - lane load/store:
>
>       combined_vectors = __builtin_load_lane (memory, combined_vectors, lane);
>       memory = __builtin_store_lane (combined_vectors, lane);
>
>   - load-and-duplicate:
>
>       combined_vectors = __builtin_load_dup (memory);
>
> We could then use normal component references to set or get the individual
> vectors of combined_vectors.  Does that sound OK so far?
>
> The question then is: what type should combined_vectors have?  (At this
> point I'm just talking about types, not modes.)  The main possibilities
> seemed to be:
>
> 1. an integer type
>
>     Pros
>       * Gimple registers can store integers.
>
>     Cons
>       * As Julian points out, GCC doesn't really support integer types
>         that are wider than 2 HOST_WIDE_INTs.  It would be good to
>         remove that restriction, but it might be a lot of work.
>
>       * We're not really using the type as an integer.
>
>       * The combination of the integer type and the __builtin_load_lanes
>         array argument wouldn't be enough to determine the correct
>         load operation.  __builtin_load_lanes would need something
>         like a vector count argument (N in the above description) as well.
>
> 2. a vector type
>
>     Pros
>       * Gimple registers can store vectors.
>
>     Cons
>       * For vld3, this would mean creating vector types with a
>         non-power-of-two number of elements.  GCC doesn't support those
>         yet, and you get ICEs as soon as you try to use them.  (Remember
>         that this is all about types, not modes.)
>
>         It _might_ be interesting to implement this support, but as
>         above, it would be a lot of work.  It also raises some tricky
>         semantic questions, such as: what is the alignment of the new
>         vectors?  Which leads to...
>
>       * The alignment of the type would be strange.  E.g. suppose
>         we're dealing with M=2, and use uint32xY_t to represent a
>         vector of Y uint32_ts.  The types and alignments would be:
>
>           N=2 uint32x4_t, alignment 16
>           N=3 uint32x6_t, alignment 8 (if we follow the convention for modes)
>           N=4 uint32x8_t, alignment 32
>
>         We don't need alignments greater than 8 in our intended use;
>         16 and 32 are overkill.
>
>       * We're not really using the type as a single vector,
>         but as a collection of vectors.
>
>       * The combination of the vector type and the __builtin_load_lanes
>         array argument wouldn't be enough to determine the correct
>         load operation.  __builtin_load_lanes would need something
>         like a vector count argument (N in the above description) as well.
>
> 3. an array-of-vectors type
>
>     Pros
>       * No support for new GCC features (large integers or non-power-of-two
>         vectors) is needed.
>
>       * The alignment of the type would be taken from the alignment of the
>         individual vectors, which is correct.
>
>       * It accurately reflects how the loaded value is going to be used.
>
>       * The type uniquely identifies the correct load operation,
>         without need for additional arguments.  (This is minor.)
>
>     Cons
>       * Gimple registers can't store array values.

Simple.  Just make them registers anyway (I did that in the past when
working on middle-end arrays).  You'd set DECL_GIMPLE_REG_P on the decl.

4. a vector-of-vectors type

    Cons
      * I don't think we want that ;)

Using an array type sounds like the only sensible option to me, apart
from using a large non-power-of-two vector type (but then you'd have
the issue of what operations operate on it, see below).
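As a point of reference, this array-of-vectors shape is roughly what
arm_neon.h already uses for vldN/vstN results.  The sketch below restates
it for the N=3, M=4 uint32 case (renamed so it doesn't clash with the real
uint32x4x3_t); the aggregate's alignment is simply that of the element
vector type, and component access is an ordinary array reference:

    #include <arm_neon.h>

    /* An "array of N vectors" aggregate for N=3, M=4 uint32.  */
    typedef struct rgb_vectors
    {
      uint32x4_t val[3];   /* val[0]=RRRR, val[1]=GGGG, val[2]=BBBB */
    } rgb_vectors;

    /* Getting one of the N vectors is just a component reference.  */
    static inline uint32x4_t
    get_channel (rgb_vectors v, int j)
    {
      return v.val[j];
    }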
> So I think the only disadvantage of using an array of vectors is that the
> result can never be a gimple register.  But that isn't much of a disadvantage
> really; the things we care about are the individual vectors, which can
> of course be treated as gimple registers.  I think our tracking of memory
> values is good enough for combined_vectors to be treated as such.
>
> These arrays of vectors would still need to have a non-BLK mode,
> so that they can be stored in _rtl_ registers.  But we need that anyway
> for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
> functions is very poor.
>
> So how about the following functions?  (Forgive the pascally syntax.)
>
>     __builtin_load_lanes (REF : array N*M of X)
>       returns array N of vector M of X
>       maps to vldN on ARM
>       in practice, the result would be used in assignments of the form:
>         vectorY = ARRAY_REF <result, Y>
>
>     __builtin_store_lanes (VECTORS : array N of vector M of X)
>       returns array N*M of X
>       maps to vstN on ARM
>       in practice, the argument would be populated by assignments of the form:
>         ARRAY_REF <VECTORS, Y> = vectorY
>
>     __builtin_load_lane (REF : array N of X,
>                          VECTORS : array N of vector M of X,
>                          LANE : integer)
>       returns array N of vector M of X
>       maps to vldN_lane on ARM
>
>     __builtin_store_lane (VECTORS : array N of vector M of X,
>                           LANE : integer)
>       returns array N of X
>       maps to vstN_lane on ARM
>
>     __builtin_load_dup (REF : array N of X)
>       returns array N of vector M of X
>       maps to vldN_dup on ARM
>
> I've hacked up a prototype of this and it seems to produce good code.
> What do you think?

How do you expect these to be used?  That is, would you ever expect
components of those large vectors/arrays to be used in operations
like add, or does the HW provide vector-lane variants for those as well?

Thus, will a loop like

  for (i=0; i<n; ++i)
    ...

be vectorized using these instructions together with ordinary vector
operations on the extracted vectors?

Richard.
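To make that question concrete, the kind of loop at issue might look like
the sketch below (struct and function names are made up): the interleaved
accesses are candidates for the full lane load/store form, while the
per-field arithmetic would stay an ordinary vector add on the extracted
vectors.

    #include <stdint.h>

    struct rgb { uint32_t r, g, b; };   /* interleaved RGB RGB RGB ... */

    void
    brighten (struct rgb *out, const struct rgb *in, int n)
    {
      int i;
      for (i = 0; i < n; i++)
        {
          /* The loads could become a vld3-style full lane load, the adds
             ordinary vector adds on the r/g/b vectors, and the stores a
             vst3-style full lane store.  */
          out[i].r = in[i].r + 1;
          out[i].g = in[i].g + 1;
          out[i].b = in[i].b + 1;
        }
    }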