RFC: Representing vector lane load/store operations

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* RFC: Representing vector lane load/store operations
@ 2011-03-22 16:52 Richard Sandiford
  2011-03-22 17:10 ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-22 16:52 UTC (permalink / raw)
  To: gcc

This is an RFC about adding gimple and optab support for things like
ARM's load-lane and store-lane instructions.  It builds on an earlier
discussion between Ira and Julian, with the aim of allowing these
instructions to be used by the vectoriser.

These instructions operate on N vector registers of M elements each and
on a sequence of 1 or M N-element structures.  They come in three forms:

  - full load/store:

      0<=I<N, 0<=J<M, register[I][J] = memory[J*M+I]

    E.g., for N=3, M=4:

         Registers                   Memory
         ----------------            ---------------
         RRRR  GGGG  BBBB    <--->   RGB RGB RGB RGB

  - lane load/store:

      given L, 0<=I<N register[I][L] = memory[I]

    E.g., for N=3. M=4, L=2:

         Registers                   Memory
         ----------------            ---------------
         ..R.  ..G.  ..B.    <--->   RGB

  - load-and-duplicate:

      0<=I<N, 0<=J<M, register[I][J] = memory[I]

    E.g. for N=3 V4HIs:

         Registers                   Memory
         ----------------            ----------------
         RRRR  GGGG  BBBB    <----   RGB

Starting points:

  1) Memory references should be MEM_REFs at the gimple level.
     We shouldn't add new tree codes for memory references.

  2) Because of the large data involved (at least in the "full" case),
     the gimple statement that represents the lane interleaving should
     also have the MEM_REF.  The two shouldn't be split between
     statements.

  3) The ARM doubleword instructions allow the N vectors to be in
     consecutive registers (DM, DM+1, ...) or in every second register
     (DM, DM+2, ...).  However, the latter case is only interesting
     if we're dealing with halves of quadword vectors.  It's therefore
     reasonable to view the N vectors as one big value.

(3) significantly simplifies things at the rtl level for ARM, because it
avoids having to find some way of saying that N separate pseudos must
be allocated to N consecutive hard registers.  If other targets allow the
N vectors to be stored in arbitrary (non-consecutive) registers, then
they could split the register up into subregs at expand time.
The lower-subreg pass should then optimise things nicely.

The easiest way of dealing with (1) and (2) seems to be to model the
operations as built-in functions.  And if we do treat the N vectors as
a single value, the load functions can simply return that value.  So we
could have something like:

  - full load/store:

      combined_vectors = __builtin_load_lanes (memory);
      memory = __builtin_store_lanes (combined_vectors);

  - lane load/store:

      combined_vectors = __builltin_load_lane (memory, combined_vectors, lane);
      memory = __builtin_store_lane (combined_vectors, lane);

  - load-and-duplicate:

      combined_vectors = __builtin_load_dup (memory);

We could then use normal component references to set or get the individual
vectors of combined_vectors.  Does that sound OK so far?

The question then is: what type should combined_vectors have?  (At this
point I'm just talking about types, not modes.)  The main possibilities
seemed to be:

1. an integer type

     Pros
       * Gimple registers can store integers.

     Cons
       * As Julian points out, GCC doesn't really support integer types
         that are wider than 2 HOST_WIDE_INTs.  It would be good to
         remove that restriction, but it might be a lot of work.

       * We're not really using the type as an integer.

       * The combination of the integer type and the __builtin_load_lanes
         array argument wouldn't be enough to determine the correct
         load operation.  __builtin_load_lanes would need something
         like a vector count argument (N in the above description) as well.

2. a vector type

     Pros
       * Gimple registers can store vectors.

     Cons
       * For vld3, this would mean creating vector types with non-power-
         of-two vectors.  GCC doesn't support those yet, and you get
         ICEs as soon as you try to use them.  (Remember that this is
         all about types, not modes.)

         It _might_ be interesting to implement this support, but as
         above, it would be a lot of work.  It also raises some tricky
         semantic questions, such as: what is the alignment of the new
         vectors? Which leads to...

       * The alignment of the type would be strange.  E.g. suppose
         we're dealing with M=2, and use uint32xY_t to represent a
         vector of Y uint32_ts.  The types and alignments would be:

           N=2 uint32x4_t, alignment 16
           N=3 uint32x6_t, alignment 8 (if we follow the convention for modes)
           N=4 uint32x8_t, alignment 32

         We don't need alignments greater than 8 in our intended use;
         16 and 32 are overkill.

       * We're not really using the type as a single vector,
         but as a collection of vectors.

       * The combination of the vector type and the __builtin_load_lanes
         array argument wouldn't be enough to determine the correct
         load operation.  __builtin_load_lanes would need something
         like a vector count argument (N in the above description) as well.

3. an array-of-vectors type

     Pros
       * No support for new GCC features (large integers or non-power-of-two
         vectors) is needed.

       * The alignment of the type would be taken from the alignment of the
         individual vectors, which is correct.

       * It accurately reflects how the loaded value is going to be used.

       * The type uniquely identifies the correct load operation,
         without need for additional arguments.  (This is minor.)

     Cons
       * Gimple registers can't store array values.

So I think the only disadvantage of using an array of vectors is that the
result can never be a gimple register.  But that isn't much of a disadvantage
really; the things we care about are the individual vectors, which can
of course be treated as gimple registers.  I think our tracking of memory
values is good enough for combined_vectors to be treated as such.

These arrays of vectors would still need to have a non-BLK mode,
so that they can be stored in _rtl_ registers.  But we need that anyway
for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
functions is very poor.

So how about the following functions?  (Forgive the pascally syntax.)

    __builtin_load_lanes (REF : array N*M of X)
      returns array N of vector M of X
      maps to vldN on ARM
      in practice, the result would be used in assignments of the form:
        vectorY = ARRAY_REF <result, Y>

    __builtin_store_lanes (VECTORS : array N of vector M of X)
      returns array N*M of X
      maps to vstN on ARM
      in practice, the argument would be populated by assignments of the form:
        ARRAY_REF <VECTORS, Y> = vectorY

    __builtin_load_lane (REF : array N of X,
			 VECTORS : array N of vector M of X,
			 LANE : integer)
      returns array N of vector M of X
      maps to vldN_lane on ARM

    __builtin_store_lane (VECTORS : array N of vector M of X,
			  LANE : integer)
      returns array N of X
      maps to vstN_lane on ARM

    __builtin_load_dup (REF : array N of X)
      returns array N of vector M of X
      maps to vldN_dup on ARM

I've hacked up a prototype of this and it seems to produce good code.
What do you think?

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-22 16:52 RFC: Representing vector lane load/store operations Richard Sandiford
@ 2011-03-22 17:10 ` Richard Guenther
  2011-03-22 19:43   ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-03-22 17:10 UTC (permalink / raw)
  To: gcc, richard.sandiford

On Tue, Mar 22, 2011 at 5:52 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> This is an RFC about adding gimple and optab support for things like
> ARM's load-lane and store-lane instructions.  It builds on an earlier
> discussion between Ira and Julian, with the aim of allowing these
> instructions to be used by the vectoriser.
>
> These instructions operate on N vector registers of M elements each and
> on a sequence of 1 or M N-element structures.  They come in three forms:
>
>  - full load/store:
>
>      0<=I<N, 0<=J<M, register[I][J] = memory[J*M+I]
>
>    E.g., for N=3, M=4:
>
>         Registers                   Memory
>         ----------------            ---------------
>         RRRR  GGGG  BBBB    <--->   RGB RGB RGB RGB
>
>  - lane load/store:
>
>      given L, 0<=I<N register[I][L] = memory[I]
>
>    E.g., for N=3. M=4, L=2:
>
>         Registers                   Memory
>         ----------------            ---------------
>         ..R.  ..G.  ..B.    <--->   RGB
>
>  - load-and-duplicate:
>
>      0<=I<N, 0<=J<M, register[I][J] = memory[I]
>
>    E.g. for N=3 V4HIs:
>
>         Registers                   Memory
>         ----------------            ----------------
>         RRRR  GGGG  BBBB    <----   RGB
>
> Starting points:
>
>  1) Memory references should be MEM_REFs at the gimple level.
>     We shouldn't add new tree codes for memory references.
>
>  2) Because of the large data involved (at least in the "full" case),
>     the gimple statement that represents the lane interleaving should
>     also have the MEM_REF.  The two shouldn't be split between
>     statements.
>
>  3) The ARM doubleword instructions allow the N vectors to be in
>     consecutive registers (DM, DM+1, ...) or in every second register
>     (DM, DM+2, ...).  However, the latter case is only interesting
>     if we're dealing with halves of quadword vectors.  It's therefore
>     reasonable to view the N vectors as one big value.
>
> (3) significantly simplifies things at the rtl level for ARM, because it
> avoids having to find some way of saying that N separate pseudos must
> be allocated to N consecutive hard registers.  If other targets allow the
> N vectors to be stored in arbitrary (non-consecutive) registers, then
> they could split the register up into subregs at expand time.
> The lower-subreg pass should then optimise things nicely.
>
> The easiest way of dealing with (1) and (2) seems to be to model the
> operations as built-in functions.  And if we do treat the N vectors as
> a single value, the load functions can simply return that value.  So we
> could have something like:
>
>  - full load/store:
>
>      combined_vectors = __builtin_load_lanes (memory);
>      memory = __builtin_store_lanes (combined_vectors);
>
>  - lane load/store:
>
>      combined_vectors = __builltin_load_lane (memory, combined_vectors, lane);
>      memory = __builtin_store_lane (combined_vectors, lane);
>
>  - load-and-duplicate:
>
>      combined_vectors = __builtin_load_dup (memory);
>
> We could then use normal component references to set or get the individual
> vectors of combined_vectors.  Does that sound OK so far?
>
> The question then is: what type should combined_vectors have?  (At this
> point I'm just talking about types, not modes.)  The main possibilities
> seemed to be:
>
> 1. an integer type
>
>     Pros
>       * Gimple registers can store integers.
>
>     Cons
>       * As Julian points out, GCC doesn't really support integer types
>         that are wider than 2 HOST_WIDE_INTs.  It would be good to
>         remove that restriction, but it might be a lot of work.
>
>       * We're not really using the type as an integer.
>
>       * The combination of the integer type and the __builtin_load_lanes
>         array argument wouldn't be enough to determine the correct
>         load operation.  __builtin_load_lanes would need something
>         like a vector count argument (N in the above description) as well.
>
> 2. a vector type
>
>     Pros
>       * Gimple registers can store vectors.
>
>     Cons
>       * For vld3, this would mean creating vector types with non-power-
>         of-two vectors.  GCC doesn't support those yet, and you get
>         ICEs as soon as you try to use them.  (Remember that this is
>         all about types, not modes.)
>
>         It _might_ be interesting to implement this support, but as
>         above, it would be a lot of work.  It also raises some tricky
>         semantic questions, such as: what is the alignment of the new
>         vectors? Which leads to...
>
>       * The alignment of the type would be strange.  E.g. suppose
>         we're dealing with M=2, and use uint32xY_t to represent a
>         vector of Y uint32_ts.  The types and alignments would be:
>
>           N=2 uint32x4_t, alignment 16
>           N=3 uint32x6_t, alignment 8 (if we follow the convention for modes)
>           N=4 uint32x8_t, alignment 32
>
>         We don't need alignments greater than 8 in our intended use;
>         16 and 32 are overkill.
>
>       * We're not really using the type as a single vector,
>         but as a collection of vectors.
>
>       * The combination of the vector type and the __builtin_load_lanes
>         array argument wouldn't be enough to determine the correct
>         load operation.  __builtin_load_lanes would need something
>         like a vector count argument (N in the above description) as well.
>
> 3. an array-of-vectors type
>
>     Pros
>       * No support for new GCC features (large integers or non-power-of-two
>         vectors) is needed.
>
>       * The alignment of the type would be taken from the alignment of the
>         individual vectors, which is correct.
>
>       * It accurately reflects how the loaded value is going to be used.
>
>       * The type uniquely identifies the correct load operation,
>         without need for additional arguments.  (This is minor.)
>
>     Cons
>       * Gimple registers can't store array values.

Simple.  Just make them registers anyway (I did that in the past
when working on middle-end arrays).  You'd set DECL_GIMPLE_REG_P
on the decl.

  4. a vector-of-vectors type

     Cons
        * I don't think we want that ;)

Using an array type sounds like the only sensible option to me apart
from using a large non-power-of-two vector type (but then you'd have
the issue of what operations operate on, see below).

> So I think the only disadvantage of using an array of vectors is that the
> result can never be a gimple register.  But that isn't much of a disadvantage
> really; the things we care about are the individual vectors, which can
> of course be treated as gimple registers.  I think our tracking of memory
> values is good enough for combined_vectors to be treated as such.
>
> These arrays of vectors would still need to have a non-BLK mode,
> so that they can be stored in _rtl_ registers.  But we need that anyway
> for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
> functions is very poor.
>
> So how about the following functions?  (Forgive the pascally syntax.)
>
>    __builtin_load_lanes (REF : array N*M of X)
>      returns array N of vector M of X
>      maps to vldN on ARM
>      in practice, the result would be used in assignments of the form:
>        vectorY = ARRAY_REF <result, Y>
>
>    __builtin_store_lanes (VECTORS : array N of vector M of X)
>      returns array N*M of X
>      maps to vstN on ARM
>      in practice, the argument would be populated by assignments of the form:
>        ARRAY_REF <VECTORS, Y> = vectorY
>
>    __builtin_load_lane (REF : array N of X,
>                         VECTORS : array N of vector M of X,
>                         LANE : integer)
>      returns array N of vector M of X
>      maps to vldN_lane on ARM
>
>    __builtin_store_lane (VECTORS : array N of vector M of X,
>                          LANE : integer)
>      returns array N of X
>      maps to vstN_lane on ARM
>
>    __builtin_load_dup (REF : array N of X)
>      returns array N of vector M of X
>      maps to vldN_dup on ARM
>
> I've hacked up a prototype of this and it seems to produce good code.
> What do you think?

How do you expect these to be used?  That is, would you ever expect
components of those large vectors/arrays be used in operations
like add, or does the HW provide vector-lane variants for those as well?

Thus, will

  for (i=0; i<N; ++i)
    X[i] = Y[i] + Z[i];

result in a single add per vector lane load or a single vector lane load
for M "unrolled" instances of (small) vector adds?  If the latter then
we have to think about indexing the vector lanes as well as allowing
partial stores (or have a vector-lane construct operation).  Representing
vector lanes as automatic memory (with array of vector type) makes
things easy, but eventually not very efficient.

I had new tree/stmt codes for array loads/stores for middle-end arrays.
Eventually the vector lane support can at least walk in the same direction
that middle-end arrays would ;)

Richard.

> Richard
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-22 17:10 ` Richard Guenther
@ 2011-03-22 19:43   ` Richard Sandiford
  2011-03-23  9:23     ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-22 19:43 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Guenther <richard.guenther@gmail.com> writes:
> Simple.  Just make them registers anyway (I did that in the past
> when working on middle-end arrays).  You'd set DECL_GIMPLE_REG_P
> on the decl.

OK, thanks, I'll give that a go.  TBH, I'm still hopeful we can
do without it, because we do seem to cope quite well as things stand.
But I suppose that might not hold true as the examples get more complicated.

>   4. a vector-of-vectors type
>
>      Cons
>         * I don't think we want that ;)

Yeah :-)

>>    __builtin_load_lanes (REF : array N*M of X)
>>      returns array N of vector M of X
>>      maps to vldN on ARM
>>      in practice, the result would be used in assignments of the form:
>>        vectorY = ARRAY_REF <result, Y>
>>
>>    __builtin_store_lanes (VECTORS : array N of vector M of X)
>>      returns array N*M of X
>>      maps to vstN on ARM
>>      in practice, the argument would be populated by assignments of the form:
>>        ARRAY_REF <VECTORS, Y> = vectorY
>>
>>    __builtin_load_lane (REF : array N of X,
>>                         VECTORS : array N of vector M of X,
>>                         LANE : integer)
>>      returns array N of vector M of X
>>      maps to vldN_lane on ARM
>>
>>    __builtin_store_lane (VECTORS : array N of vector M of X,
>>                          LANE : integer)
>>      returns array N of X
>>      maps to vstN_lane on ARM
>>
>>    __builtin_load_dup (REF : array N of X)
>>      returns array N of vector M of X
>>      maps to vldN_dup on ARM
>>
>> I've hacked up a prototype of this and it seems to produce good code.
>> What do you think?
>
> How do you expect these to be used?  That is, would you ever expect
> components of those large vectors/arrays be used in operations
> like add, or does the HW provide vector-lane variants for those as well?

The individual vectors would be used for add, etc.  That's what the
ARRAY_REF stuff above is supposed to be getting at.  So...

> Thus, will
>
>   for (i=0; i<N; ++i)
>     X[i] = Y[i] + Z[i];
>
> result in a single add per vector lane load or a single vector lane load
> for M "unrolled" instances of (small) vector adds?  If the latter then
> we have to think about indexing the vector lanes as well as allowing
> partial stores (or have a vector-lane construct operation).  Representing
> vector lanes as automatic memory (with array of vector type) makes
> things easy, but eventually not very efficient.

...Ira would know best, but I don't think it would be used for this
kind of loop.  It would be more something like:

   for (i=0; i<N; ++i)
     X[i] = Y[i].red + Y[i].blue + Y[i].green;
    
(not a realistic example).  You'd then have:

    compoundY = __builtin_load_lanes (Y);
    red = ARRAY_REF <compoundY, 0>
    green = ARRAY_REF <compoundY, 1>
    blue = ARRAY_REF <compoundY, 2>
    D1 = red + green
    D2 = D1 + blue
    MEM_REF <X> = D2;

My understanding is that'd we never do any operations besides ARRAY_REFs
on the compound value, and that the individual vectors would be treated
pretty much like any other.

> I had new tree/stmt codes for array loads/stores for middle-end arrays.
> Eventually the vector lane support can at least walk in the same direction
> that middle-end arrays would ;)

What's the status of the middle-end array stuff?  A quick search
showed up your paper, but is it still WIP, or has it already gone in?
(Showing my ignorance of tree-level stuff here. :-))  It does sound
like it'd be a good fit for these ops.

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-22 19:43   ` Richard Sandiford
@ 2011-03-23  9:23     ` Richard Guenther
  2011-03-23 10:38       ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-03-23  9:23 UTC (permalink / raw)
  To: Richard Guenther, gcc, rdsandiford

On Tue, Mar 22, 2011 at 8:43 PM, Richard Sandiford
<rdsandiford@googlemail.com> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> Simple.  Just make them registers anyway (I did that in the past
>> when working on middle-end arrays).  You'd set DECL_GIMPLE_REG_P
>> on the decl.
>
> OK, thanks, I'll give that a go.  TBH, I'm still hopeful we can
> do without it, because we do seem to cope quite well as things stand.
> But I suppose that might not hold true as the examples get more complicated.
>
>>   4. a vector-of-vectors type
>>
>>      Cons
>>         * I don't think we want that ;)
>
> Yeah :-)
>
>>>    __builtin_load_lanes (REF : array N*M of X)
>>>      returns array N of vector M of X
>>>      maps to vldN on ARM
>>>      in practice, the result would be used in assignments of the form:
>>>        vectorY = ARRAY_REF <result, Y>
>>>
>>>    __builtin_store_lanes (VECTORS : array N of vector M of X)
>>>      returns array N*M of X
>>>      maps to vstN on ARM
>>>      in practice, the argument would be populated by assignments of the form:
>>>        ARRAY_REF <VECTORS, Y> = vectorY
>>>
>>>    __builtin_load_lane (REF : array N of X,
>>>                         VECTORS : array N of vector M of X,
>>>                         LANE : integer)
>>>      returns array N of vector M of X
>>>      maps to vldN_lane on ARM
>>>
>>>    __builtin_store_lane (VECTORS : array N of vector M of X,
>>>                          LANE : integer)
>>>      returns array N of X
>>>      maps to vstN_lane on ARM
>>>
>>>    __builtin_load_dup (REF : array N of X)
>>>      returns array N of vector M of X
>>>      maps to vldN_dup on ARM
>>>
>>> I've hacked up a prototype of this and it seems to produce good code.
>>> What do you think?
>>
>> How do you expect these to be used?  That is, would you ever expect
>> components of those large vectors/arrays be used in operations
>> like add, or does the HW provide vector-lane variants for those as well?
>
> The individual vectors would be used for add, etc.  That's what the
> ARRAY_REF stuff above is supposed to be getting at.  So...
>
>> Thus, will
>>
>>   for (i=0; i<N; ++i)
>>     X[i] = Y[i] + Z[i];
>>
>> result in a single add per vector lane load or a single vector lane load
>> for M "unrolled" instances of (small) vector adds?  If the latter then
>> we have to think about indexing the vector lanes as well as allowing
>> partial stores (or have a vector-lane construct operation).  Representing
>> vector lanes as automatic memory (with array of vector type) makes
>> things easy, but eventually not very efficient.
>
> ...Ira would know best, but I don't think it would be used for this
> kind of loop.  It would be more something like:
>
>   for (i=0; i<N; ++i)
>     X[i] = Y[i].red + Y[i].blue + Y[i].green;
>
> (not a realistic example).  You'd then have:
>
>    compoundY = __builtin_load_lanes (Y);
>    red = ARRAY_REF <compoundY, 0>
>    green = ARRAY_REF <compoundY, 1>
>    blue = ARRAY_REF <compoundY, 2>
>    D1 = red + green
>    D2 = D1 + blue
>    MEM_REF <X> = D2;
>
> My understanding is that'd we never do any operations besides ARRAY_REFs
> on the compound value, and that the individual vectors would be treated
> pretty much like any other.

Ok, I thought it might be used to have a larger vectorization factor for
loads and stores, basically make further unrolling cheaper because you
don't have to duplicate the loads and stores.

>> I had new tree/stmt codes for array loads/stores for middle-end arrays.
>> Eventually the vector lane support can at least walk in the same direction
>> that middle-end arrays would ;)
>
> What's the status of the middle-end array stuff?  A quick search
> showed up your paper, but is it still WIP, or has it already gone in?
> (Showing my ignorance of tree-level stuff here. :-))  It does sound
> like it'd be a good fit for these ops.

Well, the work is basically suspended (though a lot of middle-end
surgery that was required went in) - I was stuck on the necessity
to have the Fortran frontend generate these expressions to have
testing on real code (rather than constructing examples from my
lame C frontend + builtins hack).  ISTR porting the patch to tuples,
the current patch seems to have two or three places that adjust
the middle-end in order to allow aggregate typed SSA names.

But as you have partial defs of the vector lane array the simplest
approach is probably to not make them a register.  Be prepared
for some surprises during RTL expansion though ;)

Richard.

> Richard
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23  9:23     ` Richard Guenther
@ 2011-03-23 10:38       ` Richard Sandiford
  2011-03-23 11:52         ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-23 10:38 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Guenther <richard.guenther@gmail.com> writes:
> But as you have partial defs of the vector lane array the simplest
> approach is probably to not make them a register.  Be prepared
> for some surprises during RTL expansion though ;)

OK.  It's there I'd like to start, specifically with:

  These arrays of vectors would still need to have a non-BLK mode,
  so that they can be stored in _rtl_ registers.  But we need that anyway
  for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
  functions is very poor.

because I'd like to fix the bad code we generate for intrinsics.

Thing is, this is going to be another case where the mode of a type
depends on the current target.  E.g. on ARM, we don't want to use
a 24-byte mode for an array of 3 2xSI vectors unless V2SI is also
available.  Both the mode of the vector type and the mode of the
array type will therefore depend on whether Neon is enabled.

I know you don't like the way we handle TYPE_MODE for vectors:

  #define TYPE_MODE(NODE) \
    (TREE_CODE (TYPE_CHECK (NODE)) == VECTOR_TYPE \
     ? vector_type_mode (NODE) : (NODE)->type.mode)

so I'm guessing you wouldn't be too happy to see ARRAY_TYPE popping
up there as well. :-)  What's the best way of handling this?

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 10:38       ` Richard Sandiford
@ 2011-03-23 11:52         ` Richard Guenther
  2011-03-23 12:18           ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-03-23 11:52 UTC (permalink / raw)
  To: Richard Guenther, gcc, richard.sandiford

On Wed, Mar 23, 2011 at 11:38 AM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> But as you have partial defs of the vector lane array the simplest
>> approach is probably to not make them a register.  Be prepared
>> for some surprises during RTL expansion though ;)
>
> OK.  It's there I'd like to start, specifically with:
>
>  These arrays of vectors would still need to have a non-BLK mode,
>  so that they can be stored in _rtl_ registers.  But we need that anyway
>  for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
>  functions is very poor.
>
> because I'd like to fix the bad code we generate for intrinsics.
>
> Thing is, this is going to be another case where the mode of a type
> depends on the current target.  E.g. on ARM, we don't want to use
> a 24-byte mode for an array of 3 2xSI vectors unless V2SI is also
> available.  Both the mode of the vector type and the mode of the
> array type will therefore depend on whether Neon is enabled.
>
> I know you don't like the way we handle TYPE_MODE for vectors:
>
>  #define TYPE_MODE(NODE) \
>    (TREE_CODE (TYPE_CHECK (NODE)) == VECTOR_TYPE \
>     ? vector_type_mode (NODE) : (NODE)->type.mode)
>
> so I'm guessing you wouldn't be too happy to see ARRAY_TYPE popping
> up there as well. :-)  What's the best way of handling this?

I'd say use either DECL_MODE at the point where we decide on
expanding vars (setting it from a target hook), or simply ask such
a hook at expansion time.  That should have worked for the target
atttribute stuff as well instead of dispatching in TYPE_MODE (types
are global and TYPE_MODE with the target attribute depends on
the context, but decls are local to the declaration context, so the
mode persists and is not dependent on the attribute). Might
need some surgery in places where we assume TYPE_MODE == DECL_MODE,
but I suspect it's mostly around RTL expansion.

Richard.

> Richard
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 11:52         ` Richard Guenther
@ 2011-03-23 12:18           ` Richard Sandiford
  2011-03-23 12:37             ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-23 12:18 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Guenther <richard.guenther@gmail.com> writes:
> On Wed, Mar 23, 2011 at 11:38 AM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Guenther <richard.guenther@gmail.com> writes:
>>> But as you have partial defs of the vector lane array the simplest
>>> approach is probably to not make them a register.  Be prepared
>>> for some surprises during RTL expansion though ;)
>>
>> OK.  It's there I'd like to start, specifically with:
>>
>>  These arrays of vectors would still need to have a non-BLK mode,
>>  so that they can be stored in _rtl_ registers.  But we need that anyway
>>  for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
>>  functions is very poor.
>>
>> because I'd like to fix the bad code we generate for intrinsics.
>>
>> Thing is, this is going to be another case where the mode of a type
>> depends on the current target.  E.g. on ARM, we don't want to use
>> a 24-byte mode for an array of 3 2xSI vectors unless V2SI is also
>> available.  Both the mode of the vector type and the mode of the
>> array type will therefore depend on whether Neon is enabled.
>>
>> I know you don't like the way we handle TYPE_MODE for vectors:
>>
>>  #define TYPE_MODE(NODE) \
>>    (TREE_CODE (TYPE_CHECK (NODE)) == VECTOR_TYPE \
>>     ? vector_type_mode (NODE) : (NODE)->type.mode)
>>
>> so I'm guessing you wouldn't be too happy to see ARRAY_TYPE popping
>> up there as well. :-)  What's the best way of handling this?
>
> I'd say use either DECL_MODE at the point where we decide on
> expanding vars (setting it from a target hook), or simply ask such
> a hook at expansion time.  That should have worked for the target
> atttribute stuff as well instead of dispatching in TYPE_MODE (types
> are global and TYPE_MODE with the target attribute depends on
> the context, but decls are local to the declaration context, so the
> mode persists and is not dependent on the attribute). Might
> need some surgery in places where we assume TYPE_MODE == DECL_MODE,
> but I suspect it's mostly around RTL expansion.

Hmm, but if we do that, when is it correct to look at TYPE_MODE?

E.g. when expanding the new __builtin_load_lanes function described
earlier, it wouldn't be valid to base the target register's mode on
TYPE_MODE, so I suppose we'd have to call the hook instead.  And if we
did revert the TYPE_MODE change for vector types, the vector optabs
would need to do the same thing.  Wouldn't we just end up replacing
most/all uses of TYPE_MODE with calls to the hook?  What would any
remaining uses of TYPE_MODE actually be testing?

E.g. I suppose we really ought to do the same thing for 128-bit types,
since this:

    /* TODO: This isn't correct, but as logic depends at the moment on
       host's instead of target's wide-integer.
       If there is a target not supporting TImode, but has an 128-bit
       integer-scalar register, this target check needs to be adjusted. */
    if (targetm.scalar_mode_supported_p (TImode))
      {
        int128_integer_type_node = make_signed_type (128);
        int128_unsigned_type_node = make_unsigned_type (128);
      }

seems to apply one value of scalar_mode_supported_p to the whole compilation.
(TImode support seems to depend on TARGET_ZARCH for s390.)

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 12:18           ` Richard Sandiford
@ 2011-03-23 12:37             ` Richard Guenther
  2011-03-23 13:01               ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-03-23 12:37 UTC (permalink / raw)
  To: Richard Guenther, gcc, richard.sandiford

On Wed, Mar 23, 2011 at 1:18 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> On Wed, Mar 23, 2011 at 11:38 AM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Guenther <richard.guenther@gmail.com> writes:
>>>> But as you have partial defs of the vector lane array the simplest
>>>> approach is probably to not make them a register.  Be prepared
>>>> for some surprises during RTL expansion though ;)
>>>
>>> OK.  It's there I'd like to start, specifically with:
>>>
>>>  These arrays of vectors would still need to have a non-BLK mode,
>>>  so that they can be stored in _rtl_ registers.  But we need that anyway
>>>  for ARM's arm_neon.h; the code that today's GCC produces for the intrinsic
>>>  functions is very poor.
>>>
>>> because I'd like to fix the bad code we generate for intrinsics.
>>>
>>> Thing is, this is going to be another case where the mode of a type
>>> depends on the current target.  E.g. on ARM, we don't want to use
>>> a 24-byte mode for an array of 3 2xSI vectors unless V2SI is also
>>> available.  Both the mode of the vector type and the mode of the
>>> array type will therefore depend on whether Neon is enabled.
>>>
>>> I know you don't like the way we handle TYPE_MODE for vectors:
>>>
>>>  #define TYPE_MODE(NODE) \
>>>    (TREE_CODE (TYPE_CHECK (NODE)) == VECTOR_TYPE \
>>>     ? vector_type_mode (NODE) : (NODE)->type.mode)
>>>
>>> so I'm guessing you wouldn't be too happy to see ARRAY_TYPE popping
>>> up there as well. :-)  What's the best way of handling this?
>>
>> I'd say use either DECL_MODE at the point where we decide on
>> expanding vars (setting it from a target hook), or simply ask such
>> a hook at expansion time.  That should have worked for the target
>> atttribute stuff as well instead of dispatching in TYPE_MODE (types
>> are global and TYPE_MODE with the target attribute depends on
>> the context, but decls are local to the declaration context, so the
>> mode persists and is not dependent on the attribute). Might
>> need some surgery in places where we assume TYPE_MODE == DECL_MODE,
>> but I suspect it's mostly around RTL expansion.
>
> Hmm, but if we do that, when is it correct to look at TYPE_MODE?

Most of the tree passes shouldn't care about TYPE_MODE (nor
DECL_MODE) and on RTL we shouldn't need to care about trees.

> E.g. when expanding the new __builtin_load_lanes function described
> earlier, it wouldn't be valid to base the target register's mode on
> TYPE_MODE, so I suppose we'd have to call the hook instead.

Well, you'd expand __builtin_load_lanes only if the mode is available, no?
So you know the mode in advance and don't need to get it from anywhere.

>  And if we
> did revert the TYPE_MODE change for vector types, the vector optabs
> would need to do the same thing.  Wouldn't we just end up replacing
> most/all uses of TYPE_MODE with calls to the hook?  What would any
> remaining uses of TYPE_MODE actually be testing?

I think a lot of TYPE_MODE users are just lazy, like the optabs should
get a mode input and not use a type - the vectorizer knows what target
support it targets for so it can supply a proper mode.  Alternatively
extract the mode from the operands instead, using DECL_MODE.

That said, I think given that target support can change across functions
using something global like TYPE_MODE is fundamentally flawed
(unless you start doing ugly things like that callback in the TYPE_MODE
implementation).

> E.g. I suppose we really ought to do the same thing for 128-bit types,
> since this:
>
>    /* TODO: This isn't correct, but as logic depends at the moment on
>       host's instead of target's wide-integer.
>       If there is a target not supporting TImode, but has an 128-bit
>       integer-scalar register, this target check needs to be adjusted. */
>    if (targetm.scalar_mode_supported_p (TImode))
>      {
>        int128_integer_type_node = make_signed_type (128);
>        int128_unsigned_type_node = make_unsigned_type (128);
>      }
>
> seems to apply one value of scalar_mode_supported_p to the whole compilation.
> (TImode support seems to depend on TARGET_ZARCH for s390.)

Well, it depends on where int128_integer_type_node is used.  I think
if the target with some settings supports TImode then we probably
want to have that type node.  At the point the user declares some vars
with it you can error out dependent on local support.  At expansion
time you'd need to check whether accesses in a given mode are
really "possible" and dispatch to BLKmode handling if they are not.

The tree level really doesn't care, and most TYPE_MODE uses there
are bogus - the valid ones want to check targetm.xxxx_mode_supported_p
instead.  During RTL expansion we have to deal with handling modes
we don't support (or ICE, as we do now with a lot of target attribute
uses).

For your case in question the vectorizer would create local vars with
that mode, knowing it is supported, so I don't see big problems for
that particular case.

Richard.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 12:37             ` Richard Guenther
@ 2011-03-23 13:01               ` Richard Sandiford
  2011-03-23 13:14                 ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-23 13:01 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Guenther <richard.guenther@gmail.com> writes:
>> Hmm, but if we do that, when is it correct to look at TYPE_MODE?
>
> Most of the tree passes shouldn't care about TYPE_MODE (nor
> DECL_MODE) and on RTL we shouldn't need to care about trees.

It sounds like you think it would be better to get rid of TYPE_MODE.
I can see why that's appealing, but it also sounds very ambitious :-)
As well as all the uses in the middle-end, targets have (wrongly) tended
to use type modes to define the ABI.  It might be quite difficult to
untangle the whole mess now.

Of course, that's also an argument in favour of what you say about
TYPE_MODE not changing unless we can help it...

> For your case in question the vectorizer would create local vars with
> that mode, knowing it is supported, so I don't see big problems for
> that particular case.

The problem is that I'd like to use this for intrinsics as well as for
automatic vectorisation.  E.g. I'd like:

typedef struct int8x16x4_t
{
  int8x16_t val[4];
} int8x16x4_t;

to have non-BLKmode as well.  arm_neon.h uses this type of structure
to represent compounds vectors.  But once the type is defined (with Neon
support enabled), there's nothing to stop someone using the type
(not the intrinsics) in a function that has Neon disabled.  We mustn't
use the special mode in such cases, because there aren't enough GPRs to
store it.  It should be treated as BLKmode instead.  Which I suppose
is the same situation as...

> > E.g. I suppose we really ought to do the same thing for 128-bit types,
> > since this:
> >
> >    /* TODO: This isn't correct, but as logic depends at the moment on
> >       host's instead of target's wide-integer.
> >       If there is a target not supporting TImode, but has an 128-bit
> >       integer-scalar register, this target check needs to be adjusted. */
> >    if (targetm.scalar_mode_supported_p (TImode))
> >      {
> >        int128_integer_type_node = make_signed_type (128);
> >        int128_unsigned_type_node = make_unsigned_type (128);
> >      }
> >
> > seems to apply one value of scalar_mode_supported_p to the whole compilation.
> > (TImode support seems to depend on TARGET_ZARCH for s390.)
>
> Well, it depends on where int128_integer_type_node is used.  I think
> if the target with some settings supports TImode then we probably
> want to have that type node.  At the point the user declares some vars
> with it you can error out dependent on local support. At expansion
> time you'd need to check whether accesses in a given mode are
> really "possible" and dispatch to BLKmode handling if they are not.

...this.  Do you mean that we'd error for local declarations, but fall
back to BLKmode for operations on already-defined (global) declarations?
I'm just worried that might be a bit inconsistent.

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 13:01               ` Richard Sandiford
@ 2011-03-23 13:14                 ` Richard Guenther
  2011-03-23 14:14                   ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-03-23 13:14 UTC (permalink / raw)
  To: Richard Guenther, gcc, richard.sandiford

On Wed, Mar 23, 2011 at 2:01 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>>> Hmm, but if we do that, when is it correct to look at TYPE_MODE?
>>
>> Most of the tree passes shouldn't care about TYPE_MODE (nor
>> DECL_MODE) and on RTL we shouldn't need to care about trees.
>
> It sounds like you think it would be better to get rid of TYPE_MODE.
> I can see why that's appealing, but it also sounds very ambitious :-)
> As well as all the uses in the middle-end, targets have (wrongly) tended
> to use type modes to define the ABI.  It might be quite difficult to
> untangle the whole mess now.
>
> Of course, that's also an argument in favour of what you say about
> TYPE_MODE not changing unless we can help it...

Indeed.

>> For your case in question the vectorizer would create local vars with
>> that mode, knowing it is supported, so I don't see big problems for
>> that particular case.
>
> The problem is that I'd like to use this for intrinsics as well as for
> automatic vectorisation.  E.g. I'd like:
>
> typedef struct int8x16x4_t
> {
>  int8x16_t val[4];
> } int8x16x4_t;
>
> to have non-BLKmode as well.  arm_neon.h uses this type of structure
> to represent compounds vectors.  But once the type is defined (with Neon
> support enabled), there's nothing to stop someone using the type
> (not the intrinsics) in a function that has Neon disabled.  We mustn't
> use the special mode in such cases, because there aren't enough GPRs to
> store it.  It should be treated as BLKmode instead.  Which I suppose
> is the same situation as...

I'd use non-BLKmode for the above unconditionally.

>> > E.g. I suppose we really ought to do the same thing for 128-bit types,
>> > since this:
>> >
>> >    /* TODO: This isn't correct, but as logic depends at the moment on
>> >       host's instead of target's wide-integer.
>> >       If there is a target not supporting TImode, but has an 128-bit
>> >       integer-scalar register, this target check needs to be adjusted. */
>> >    if (targetm.scalar_mode_supported_p (TImode))
>> >      {
>> >        int128_integer_type_node = make_signed_type (128);
>> >        int128_unsigned_type_node = make_unsigned_type (128);
>> >      }
>> >
>> > seems to apply one value of scalar_mode_supported_p to the whole compilation.
>> > (TImode support seems to depend on TARGET_ZARCH for s390.)
>>
>> Well, it depends on where int128_integer_type_node is used.  I think
>> if the target with some settings supports TImode then we probably
>> want to have that type node.  At the point the user declares some vars
>> with it you can error out dependent on local support. At expansion
>> time you'd need to check whether accesses in a given mode are
>> really "possible" and dispatch to BLKmode handling if they are not.
>
> ...this.  Do you mean that we'd error for local declarations, but fall
> back to BLKmode for operations on already-defined (global) declarations?
> I'm just worried that might be a bit inconsistent.

I'd say if somebody writes

v4sf float_vec;

void __attribute__((target("no-sse")))
foo (void)
{
  float_vec += float_vec;
}

he deserves to get a diagnostic.  Thus, even for global decls I think we
can reject such uses.  Complication arises whenever we do not see
a decl, like for

void foo(v4sf *x)
{
}

which we could of course reject (at function definition time) if an
unsupported type is used in this way.  But the function might
not even dereference that pointer ...

That said, I think for your case in question we should set possible
target attribute issues aside (because we have those issues already).
In that case you wouldn't need to touch TYPE_MODE at all as
it would have non-BLKmode as soon as you create a vector-lane
type or decl?

And I still think that only changing DECL_MODEs based on
target attributes and not TYPE_MODEs is appealing ;)

Richard.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 13:14                 ` Richard Guenther
@ 2011-03-23 14:14                   ` Richard Sandiford
  2011-03-23 14:28                     ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-23 14:14 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Guenther <richard.guenther@gmail.com> writes:
>>> For your case in question the vectorizer would create local vars with
>>> that mode, knowing it is supported, so I don't see big problems for
>>> that particular case.
>>
>> The problem is that I'd like to use this for intrinsics as well as for
>> automatic vectorisation.  E.g. I'd like:
>>
>> typedef struct int8x16x4_t
>> {
>>  int8x16_t val[4];
>> } int8x16x4_t;
>>
>> to have non-BLKmode as well.  arm_neon.h uses this type of structure
>> to represent compounds vectors.  But once the type is defined (with Neon
>> support enabled), there's nothing to stop someone using the type
>> (not the intrinsics) in a function that has Neon disabled.  We mustn't
>> use the special mode in such cases, because there aren't enough GPRs to
>> store it.  It should be treated as BLKmode instead.  Which I suppose
>> is the same situation as...
>
> I'd use non-BLKmode for the above unconditionally.

But without Neon, there aren't enough registers to store the structure.
Any use of the Neon mode would just lead to a reload failure.  Even if
we think it's not sensible to use the type without Neon, we need a better
diagnostic than that.

So I think this mode has to be conditional in exactly the way that
vector modes are, or be subject to the same diagnostics that you
were suggesting for 128-bit types.

I was actually thinking along the lines of having a target hook such as:

   array_mode_supported_p (tree elemtype, unsigned HOST_WIDE_INT size)

which would return true if ELEMTYPE[SIZE] should use non-BLKmode where
possible.  When it returns true, we'd pass 0 rather than 1 to this
mode_for_size_tree call (from the ARRAY_TYPE case in layout_type):

	    /* One-element arrays get the component type's mode.  */
	    if (simple_cst_equal (TYPE_SIZE (type),
				  TYPE_SIZE (TREE_TYPE (type))))
	      SET_TYPE_MODE (type, TYPE_MODE (TREE_TYPE (type)));
	    else
	      SET_TYPE_MODE (type, mode_for_size_tree (TYPE_SIZE (type),
						       MODE_INT, 1));

This would have the "advantage" (as I see it) of working with the
generic vector extensions too.  E.g. if a user defines their own
3-element-array-of-vector type, they would benefit from the same
handling as the Neon-specific intrinsics and the vectoriser-generated
arrays.

We still make generic vectors available when there's no underlying
hardware support, so I'd have expected these 3-element-array-of-vector
types to be available too.  That's why I prefer the idea of making the
mode conditional, as for vector types, rather than rejecting uses of
the type outright.

But from this:

> I'd say if somebody writes
>
> v4sf float_vec;
>
> void __attribute__((target("no-sse")))
> foo (void)
> {
>   float_vec += float_vec;
> }
>
> he deserves to get a diagnostic.  Thus, even for global decls I think we
> can reject such uses.  Complication arises whenever we do not see
> a decl, like for
>
> void foo(v4sf *x)
> {
> }
>
> which we could of course reject (at function definition time) if an
> unsupported type is used in this way.  But the function might
> not even dereference that pointer ...

it sounds like you think there's no point supporting generic vectors
when no underlying hardware support is available.

> And I still think that only changing DECL_MODEs based on
> target attributes and not TYPE_MODEs is appealing ;)

Understood.  I just think that, if we do that, we really should
get rid of TYPE_MODE (as a global property) as well, otherwise there'd
be even more chaos than there is now.  And that sounds like it could
be several months' work in itself.

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 14:14                   ` Richard Sandiford
@ 2011-03-23 14:28                     ` Richard Guenther
  2011-03-23 14:41                       ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-03-23 14:28 UTC (permalink / raw)
  To: Richard Guenther, gcc, richard.sandiford

On Wed, Mar 23, 2011 at 3:13 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>>>> For your case in question the vectorizer would create local vars with
>>>> that mode, knowing it is supported, so I don't see big problems for
>>>> that particular case.
>>>
>>> The problem is that I'd like to use this for intrinsics as well as for
>>> automatic vectorisation.  E.g. I'd like:
>>>
>>> typedef struct int8x16x4_t
>>> {
>>>  int8x16_t val[4];
>>> } int8x16x4_t;
>>>
>>> to have non-BLKmode as well.  arm_neon.h uses this type of structure
>>> to represent compounds vectors.  But once the type is defined (with Neon
>>> support enabled), there's nothing to stop someone using the type
>>> (not the intrinsics) in a function that has Neon disabled.  We mustn't
>>> use the special mode in such cases, because there aren't enough GPRs to
>>> store it.  It should be treated as BLKmode instead.  Which I suppose
>>> is the same situation as...
>>
>> I'd use non-BLKmode for the above unconditionally.
>
> But without Neon, there aren't enough registers to store the structure.
> Any use of the Neon mode would just lead to a reload failure.  Even if
> we think it's not sensible to use the type without Neon, we need a better
> diagnostic than that.
>
> So I think this mode has to be conditional in exactly the way that
> vector modes are, or be subject to the same diagnostics that you
> were suggesting for 128-bit types.
>
> I was actually thinking along the lines of having a target hook such as:
>
>   array_mode_supported_p (tree elemtype, unsigned HOST_WIDE_INT size)
>
> which would return true if ELEMTYPE[SIZE] should use non-BLKmode where
> possible.  When it returns true, we'd pass 0 rather than 1 to this
> mode_for_size_tree call (from the ARRAY_TYPE case in layout_type):
>
>            /* One-element arrays get the component type's mode.  */
>            if (simple_cst_equal (TYPE_SIZE (type),
>                                  TYPE_SIZE (TREE_TYPE (type))))
>              SET_TYPE_MODE (type, TYPE_MODE (TREE_TYPE (type)));
>            else
>              SET_TYPE_MODE (type, mode_for_size_tree (TYPE_SIZE (type),
>                                                       MODE_INT, 1));
>
> This would have the "advantage" (as I see it) of working with the
> generic vector extensions too.  E.g. if a user defines their own
> 3-element-array-of-vector type, they would benefit from the same
> handling as the Neon-specific intrinsics and the vectoriser-generated
> arrays.

So the 3-element-array-of-vector type has the vector mode of a single
element?  I'm confused.  I also don't see how a user could want to
have a non-BLK mode on such array types (consider them being
part of a struct - how would that affect argument passing and other
ABI details?).

> We still make generic vectors available when there's no underlying
> hardware support, so I'd have expected these 3-element-array-of-vector
> types to be available too.  That's why I prefer the idea of making the
> mode conditional, as for vector types, rather than rejecting uses of
> the type outright.
>
> But from this:
>
>> I'd say if somebody writes
>>
>> v4sf float_vec;
>>
>> void __attribute__((target("no-sse")))
>> foo (void)
>> {
>>   float_vec += float_vec;
>> }
>>
>> he deserves to get a diagnostic.  Thus, even for global decls I think we
>> can reject such uses.  Complication arises whenever we do not see
>> a decl, like for
>>
>> void foo(v4sf *x)
>> {
>> }
>>
>> which we could of course reject (at function definition time) if an
>> unsupported type is used in this way.  But the function might
>> not even dereference that pointer ...
>
> it sounds like you think there's no point supporting generic vectors
> when no underlying hardware support is available.

Well, I meant if the user compiles with -msse, declares such a
global var (which means it gets V4SFmode and not BLKmode)
and then uses it in a function where he explicitly disables SSE
then something is wrong.  If he declares a BLKmode global
then generic vector support will happily trigger and make it work.

I realize this is all a bit tricky and probably nobody properly designed
the target attribute stuff with all these details in mind.  But now
we have to live with it ... :(

>> And I still think that only changing DECL_MODEs based on
>> target attributes and not TYPE_MODEs is appealing ;)
>
> Understood.  I just think that, if we do that, we really should
> get rid of TYPE_MODE (as a global property) as well, otherwise there'd
> be even more chaos than there is now.  And that sounds like it could
> be several months' work in itself.

True.  But I like the idea of TYPE_MODE becoming even more "dynamic"
even less.

If it's just three element array-of-vector types you need why not expose
it via attribute((mode(xyz))) only?  You could alias that mode to BLKmode
if neon is not enabled ...

Richard.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 14:28                     ` Richard Guenther
@ 2011-03-23 14:41                       ` Richard Sandiford
  2011-03-29 12:50                         ` Richard Sandiford
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-23 14:41 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Guenther <richard.guenther@gmail.com> writes:
> On Wed, Mar 23, 2011 at 3:13 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Guenther <richard.guenther@gmail.com> writes:
>>>>> For your case in question the vectorizer would create local vars with
>>>>> that mode, knowing it is supported, so I don't see big problems for
>>>>> that particular case.
>>>>
>>>> The problem is that I'd like to use this for intrinsics as well as for
>>>> automatic vectorisation.  E.g. I'd like:
>>>>
>>>> typedef struct int8x16x4_t
>>>> {
>>>>  int8x16_t val[4];
>>>> } int8x16x4_t;
>>>>
>>>> to have non-BLKmode as well.  arm_neon.h uses this type of structure
>>>> to represent compounds vectors.  But once the type is defined (with Neon
>>>> support enabled), there's nothing to stop someone using the type
>>>> (not the intrinsics) in a function that has Neon disabled.  We mustn't
>>>> use the special mode in such cases, because there aren't enough GPRs to
>>>> store it.  It should be treated as BLKmode instead.  Which I suppose
>>>> is the same situation as...
>>>
>>> I'd use non-BLKmode for the above unconditionally.
>>
>> But without Neon, there aren't enough registers to store the structure.
>> Any use of the Neon mode would just lead to a reload failure.  Even if
>> we think it's not sensible to use the type without Neon, we need a better
>> diagnostic than that.
>>
>> So I think this mode has to be conditional in exactly the way that
>> vector modes are, or be subject to the same diagnostics that you
>> were suggesting for 128-bit types.
>>
>> I was actually thinking along the lines of having a target hook such as:
>>
>>   array_mode_supported_p (tree elemtype, unsigned HOST_WIDE_INT size)
>>
>> which would return true if ELEMTYPE[SIZE] should use non-BLKmode where
>> possible.  When it returns true, we'd pass 0 rather than 1 to this
>> mode_for_size_tree call (from the ARRAY_TYPE case in layout_type):
>>
>>            /* One-element arrays get the component type's mode.  */
>>            if (simple_cst_equal (TYPE_SIZE (type),
>>                                  TYPE_SIZE (TREE_TYPE (type))))
>>              SET_TYPE_MODE (type, TYPE_MODE (TREE_TYPE (type)));
>>            else
>>              SET_TYPE_MODE (type, mode_for_size_tree (TYPE_SIZE (type),
>>                                                       MODE_INT, 1));
>>
>> This would have the "advantage" (as I see it) of working with the
>> generic vector extensions too.  E.g. if a user defines their own
>> 3-element-array-of-vector type, they would benefit from the same
>> handling as the Neon-specific intrinsics and the vectoriser-generated
>> arrays.
>
> So the 3-element-array-of-vector type has the vector mode of a single
> element?

No, it has a wider, non-vector mode.  At the moment, ARM uses integer
modes for this, and after trying a few variations, I think that's
actually the best compromise.  So the uint8x16x4_t ought to have a
64-byte integer type(!), which ARM defines as XImode:

INT_MODE (XI, 64);

> I also don't see how a user could want to have a non-BLK mode on such
> array types (consider them being part of a struct - how would that
> affect argument passing and other ABI details?).

The point is that we shouldn't use the mode for the ABI anyway.  Even the
intrinsic-defined types (like uint8x16x4_t above) should be passed in
the same way as BLKmode structures would.

>>> I'd say if somebody writes
>>>
>>> v4sf float_vec;
>>>
>>> void __attribute__((target("no-sse")))
>>> foo (void)
>>> {
>>>   float_vec += float_vec;
>>> }
>>>
>>> he deserves to get a diagnostic.  Thus, even for global decls I think we
>>> can reject such uses.  Complication arises whenever we do not see
>>> a decl, like for
>>>
>>> void foo(v4sf *x)
>>> {
>>> }
>>>
>>> which we could of course reject (at function definition time) if an
>>> unsupported type is used in this way.  But the function might
>>> not even dereference that pointer ...
>>
>> it sounds like you think there's no point supporting generic vectors
>> when no underlying hardware support is available.
>
> Well, I meant if the user compiles with -msse, declares such a
> global var (which means it gets V4SFmode and not BLKmode)
> and then uses it in a function where he explicitly disables SSE
> then something is wrong.  If he declares a BLKmode global
> then generic vector support will happily trigger and make it work.

Ah, OK.  I'm just not sure whether, to take a MIPS example,
MIPS16 functions in a "-mno-mips16" compile should behave
differently from unannotated functions in a "-mips16" compile.

> If it's just three element array-of-vector types you need why not expose
> it via attribute((mode(xyz))) only?  You could alias that mode to BLKmode
> if neon is not enabled ...

I don't think that really changes anything.  Getting the non-BLK mode
on the array type seems like the easy part.  The difficult part is
dealing with the fallout when the array is defined in a Neon context
and used in a non-Neon context.

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-23 14:41                       ` Richard Sandiford
@ 2011-03-29 12:50                         ` Richard Sandiford
  2011-03-29 14:05                           ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Sandiford @ 2011-03-29 12:50 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Richard Sandiford <richard.sandiford@linaro.org> writes:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> Well, I meant if the user compiles with -msse, declares such a
>> global var (which means it gets V4SFmode and not BLKmode)
>> and then uses it in a function where he explicitly disables SSE
>> then something is wrong.  If he declares a BLKmode global
>> then generic vector support will happily trigger and make it work.
>
> Ah, OK.  I'm just not sure whether, to take a MIPS example,
> MIPS16 functions in a "-mno-mips16" compile should behave
> differently from unannotated functions in a "-mips16" compile.
>
>> If it's just three element array-of-vector types you need why not expose
>> it via attribute((mode(xyz))) only?  You could alias that mode to BLKmode
>> if neon is not enabled ...
>
> I don't think that really changes anything.  Getting the non-BLK mode
> on the array type seems like the easy part.  The difficult part is
> dealing with the fallout when the array is defined in a Neon context
> and used in a non-Neon context.

As a follow-up to this, I think the current definition of TYPE_MODE
is too restrictive even for the vector case.  Single-element structures
get the modes of their fields, and similarly for arrays.  So if we modify
the original 38240 testcase a bit, we still get a difference:

-------------------------------------------------------------------------
#if STRUCT
typedef struct {
  float x __attribute__ ((__vector_size__ (16), __may_alias__));
} V;
#else
typedef float V __attribute__ ((__vector_size__ (16), __may_alias__));
#endif

V __attribute__((target("sse"))) f(const V *ptr) { return *ptr; }
-------------------------------------------------------------------------

Without -DSTRUCT, this generates the same code regardless of whether
you compile with -msse.  But with -DSTRUCT, you get:

        movaps  (%rdi), %xmm0
        ret

with -msse and:

        movq    (%rdi), %rax
        movq    %rax, -24(%rsp)
        movq    8(%rdi), %rax
        movq    %rax, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret

with -mno-sse.

I think your argument is that most/all uses of TYPE_MODE are a mistake.
But I still think it makes sense to say that types have a natural mode
_in a given context_, just not globally.  So how about replacing it with
a current_mode_of_type function?  That makes it obvious that TYPE_MODE is
not a global property, and that it isn't really a simple accessor any more.
We could then make it recompute the mode for all types, possibly with a
cache if that's necessary for performance reasons.

Richard

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC: Representing vector lane load/store operations
  2011-03-29 12:50                         ` Richard Sandiford
@ 2011-03-29 14:05                           ` Richard Guenther
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Guenther @ 2011-03-29 14:05 UTC (permalink / raw)
  To: Richard Guenther, gcc, richard.sandiford

On Tue, Mar 29, 2011 at 2:44 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Sandiford <richard.sandiford@linaro.org> writes:
>> Richard Guenther <richard.guenther@gmail.com> writes:
>>> Well, I meant if the user compiles with -msse, declares such a
>>> global var (which means it gets V4SFmode and not BLKmode)
>>> and then uses it in a function where he explicitly disables SSE
>>> then something is wrong.  If he declares a BLKmode global
>>> then generic vector support will happily trigger and make it work.
>>
>> Ah, OK.  I'm just not sure whether, to take a MIPS example,
>> MIPS16 functions in a "-mno-mips16" compile should behave
>> differently from unannotated functions in a "-mips16" compile.
>>
>>> If it's just three element array-of-vector types you need why not expose
>>> it via attribute((mode(xyz))) only?  You could alias that mode to BLKmode
>>> if neon is not enabled ...
>>
>> I don't think that really changes anything.  Getting the non-BLK mode
>> on the array type seems like the easy part.  The difficult part is
>> dealing with the fallout when the array is defined in a Neon context
>> and used in a non-Neon context.
>
> As a follow-up to this, I think the current definition of TYPE_MODE
> is too restrictive even for the vector case.  Single-element structures
> get the modes of their fields, and similarly for arrays.  So if we modify
> the original 38240 testcase a bit, we still get a difference:
>
> -------------------------------------------------------------------------
> #if STRUCT
> typedef struct {
>  float x __attribute__ ((__vector_size__ (16), __may_alias__));
> } V;
> #else
> typedef float V __attribute__ ((__vector_size__ (16), __may_alias__));
> #endif
>
> V __attribute__((target("sse"))) f(const V *ptr) { return *ptr; }
> -------------------------------------------------------------------------
>
> Without -DSTRUCT, this generates the same code regardless of whether
> you compile with -msse.  But with -DSTRUCT, you get:
>
>        movaps  (%rdi), %xmm0
>        ret
>
> with -msse and:
>
>        movq    (%rdi), %rax
>        movq    %rax, -24(%rsp)
>        movq    8(%rdi), %rax
>        movq    %rax, -16(%rsp)
>        movdqa  -24(%rsp), %xmm0
>        ret
>
> with -mno-sse.
>
> I think your argument is that most/all uses of TYPE_MODE are a mistake.
> But I still think it makes sense to say that types have a natural mode
> _in a given context_, just not globally.  So how about replacing it with
> a current_mode_of_type function?  That makes it obvious that TYPE_MODE is
> not a global property, and that it isn't really a simple accessor any more.
> We could then make it recompute the mode for all types, possibly with a
> cache if that's necessary for performance reasons.

Well, ok.  That current_mode_of_type wouldn't make sense when for
example expanding global initializers (neither would looking at TYPE_MODE).
But - what's the natural mode to choose for global entities?  After all
we have to stick something into TYPE_MODE and DECL_MODE.

But yes, changing the TYPE_MODE users over to current_mode_of_type
(or rather mode_of_type_in_fn (struct function *, tree)) would be nice
(and then get rid of the TYPE_MODE hack).

Richard.

> Richard
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-03-29 13:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-22 16:52 RFC: Representing vector lane load/store operations Richard Sandiford
2011-03-22 17:10 ` Richard Guenther
2011-03-22 19:43   ` Richard Sandiford
2011-03-23  9:23     ` Richard Guenther
2011-03-23 10:38       ` Richard Sandiford
2011-03-23 11:52         ` Richard Guenther
2011-03-23 12:18           ` Richard Sandiford
2011-03-23 12:37             ` Richard Guenther
2011-03-23 13:01               ` Richard Sandiford
2011-03-23 13:14                 ` Richard Guenther
2011-03-23 14:14                   ` Richard Sandiford
2011-03-23 14:28                     ` Richard Guenther
2011-03-23 14:41                       ` Richard Sandiford
2011-03-29 12:50                         ` Richard Sandiford
2011-03-29 14:05                           ` Richard Guenther

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).