public inbox for gcc@gcc.gnu.org
* [tree-ssa] vectorizer related issues
@ 2003-12-16 15:34 Dorit Naishlos
  2003-12-16 23:55 ` Richard Henderson
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Dorit Naishlos @ 2003-12-16 15:34 UTC (permalink / raw)
  To: gcc





Hi,

I went through the exercise of vectorizing a simple loop in tree-ssa,
skipping some analysis but going all the way through the transformation, in
order to get an idea of the problems and issues that lie ahead. This note
first describes what I did in general, then raises a list of open
problems/questions, and ends with a couple of tree-ssa dumps. You are
encouraged to read & respond, at least to the second part! :-)

thanks,
dorit

Part I. The preliminary implemented vectorizer

The vectorizer transforms the following simple loop:

loop1.c:
  short a[N]; short b[N]; short c[N]; int i;

  for (i=0; i<N; i++){
    a[i] = b[i] + c[i];
  }

as if it had been manually vectorized by rewriting the source code into:

loop2.c:
  typedef int __attribute__((mode(V8HI))) v8hi;
  short a[N];  short b[N]; short c[N];   int i;
  v8hi *pa = (v8hi*)a, *pb = (v8hi*)b, *pc = (v8hi*)c;
  v8hi va, vb, vc;

  for (i=0; i<N/8; i++){
    vb = pb[i];
    vc = pc[i];
    va = vb + vc;
    pa[i] = va;
  }

The assembly generated by auto-vectorizing loop1.c is identical to that
generated by (manually vectorizing and) compiling loop2.c.
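As a sanity check on the transformation itself (not on the vectorizer), the
two loops can be compared directly. This is a minimal runnable sketch; it
assumes GCC's vector_size attribute as an equivalent spelling of the
mode(V8HI) attribute, and the function/array names are mine:

```c
#include <assert.h>
#include <string.h>

/* Sketch comparing loop1.c with the hand-vectorized loop2.c.
   Assumption: vector_size (16) is equivalent to mode(V8HI).  */
typedef short v8hi __attribute__ ((vector_size (16)));

#define N 16

/* Force 16-byte alignment so the vector loads/stores are safe.  */
short a1[N] __attribute__ ((aligned (16)));
short a2[N] __attribute__ ((aligned (16)));
short b[N] __attribute__ ((aligned (16)));
short c[N] __attribute__ ((aligned (16)));

void
scalar_loop (void)              /* loop1.c */
{
  int i;
  for (i = 0; i < N; i++)
    a1[i] = b[i] + c[i];
}

void
vector_loop (void)              /* loop2.c */
{
  v8hi *pa = (v8hi *) a2, *pb = (v8hi *) b, *pc = (v8hi *) c;
  int i;
  for (i = 0; i < N / 8; i++)
    pa[i] = pb[i] + pc[i];      /* vb = pb[i]; vc = pc[i]; pa[i] = vb + vc */
}
```

Filling b and c with arbitrary values and running both loops should leave
a1 and a2 identical.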

The transformation that loop1.c undergoes is demonstrated at the end of
this note, where you'll find a dump of the (SSA) trees before and after the
vectorizer (part III).

As mentioned, I still need to go back and fill in the implementation of a
few missing analyses. The main purpose of this exercise was to expose as
many issues/problems as possible early on; these follow in part II.

I will try to put together a design doc with some preliminary code for a
basic vectorizer; the questions below already reveal a lot about the way
the vectorizer works.


Part II. Open problems and questions

(1) Array addressing using pointers.
      As illustrated in the example above (loop2.c), the basic approach for
handling array accesses is to generate a pointer to a vector type ('pb'),
have it point to the array base ('pb = (v8hi *)b'), and use that to access
the array ('vb = pb[i]'). I was hoping to get away with generating an
ARRAY_REF expression for pb[i], but the RTL expander didn't like that. So I
generated the following sequence of stmts instead, inspired by what the
compiler currently generates at the tree level for the C stmt 'vb = pb[i]':

pb = &b
T.1 = (<unnamed type>)i_1;
T.2 = T.1 * 16;
T.3 = pb + T.2;
vb = *T.3;
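For illustration, the address arithmetic in the sequence above can be
replayed in plain C; lowered_address is a hypothetical helper mirroring
T.1..T.3, and the constant 16 is the size of the vector type in bytes:

```c
#include <assert.h>
#include <stddef.h>

#define N 16
short b[N];                     /* 16 shorts: two 16-byte vectors */

/* Hypothetical helper replaying the lowered sequence above:
     T.1 = (unsigned)i;   T.2 = T.1 * 16;   T.3 = pb + T.2;
   i.e. element i of a 16-byte vector type starts i * 16 bytes past pb.  */
void *
lowered_address (void *pb, int i)
{
  size_t t1 = (size_t) i;       /* T.1: cast the loop index        */
  size_t t2 = t1 * 16;          /* T.2: scale by the vector size   */
  return (char *) pb + t2;      /* T.3: base plus byte offset      */
}
```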

It's too bad that we lose the information that these accesses were simple
array references. Maybe there's a simple way to convince the expander to
digest the expression pb[i] (Q1.1: is there?). What's much more worrying is
that we lose this information even before vectorization takes place. This
goes back to the issue mentioned in
http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html.  With a lot of effort the
vectorizer could deal with pointer arithmetic, but I think this is exactly
the kind of effort we should save by working at the tree level.

Is anyone planning to fix this situation? That is, Q1.2: have the
front-end represent p[i] as an array_ref even when p is declared a
pointer, and Q1.3: have the RTL expander accept such array references.

This is probably the most critical infrastructure fix required for the
vectorizer to be applicable in the short term. If gcc continues to
represent array references this way, it would be pretty useless to
vectorize specifically for array accesses. Good handling of pointer
arithmetic and pointer aliasing (which have well-known limitations) will
then be vital.


(2) virtual defs/uses.
      This issue goes back to an action item on the tree-ssa todo list:
"SSA information for arrays
      The existing implementation treats arrays as an opaque object. A
definition to an array location is treated as a definition for the whole
array".

Q2.1: Is anyone planning to address this issue? (so that the vdefs/vuses of
arrays will be connected only to may/must aliased references?)

It's not a major obstacle at the moment because I'm simply ignoring the
virtual defs/uses, as they are currently not very informative. What I do
when I analyze cross-iteration def-use cycles is examine the def-use cycle
related to each of the loop phis. Virtual phis are ignored, because I
don't know whether they represent a real def-use cycle or not; I rely on
other analysis passes to check the data references that are responsible
for virtual defs/uses. For now, the only form of non-scalar data access I
allow is array references, so the analysis of the loop's scalar phis,
together with array data dependence tests, suffices. In the future, when
other forms of non-scalar accesses are allowed (structures, etc.), this
workaround for virtual defs/uses will need to be revisited.

By the way, (Q2.2:) is there an API that provides a mapping between a
virtual def/use and the operand it corresponds to? I guess in GIMPLE
finding this mapping is pretty straightforward because the grammar doesn't
allow a lot of flexibility. For example, I want to verify that any
vuse/vdef found corresponds to an array_ref (and not a structure or
pointer access). I currently do that as follows: if a (GIMPLE) stmt has
any vdefs (vuses), I check that it has exactly one vdef (vuse), I assume
that it corresponds to op0 (op1), and I expect that the stmt is of the
form: op0 = SSA_NAME/CONST (SSA_NAME = op1). Then I verify that op0 (op1)
is an ARRAY_REF. Is there a better way to do that?
Q2.3: I'm assuming that if a stmt has multiple vdefs/vuses then it's
either a non-vectorizable stmt (CALL_EXPR, asm) or there are aliasing
problems which also prevent vectorization. Are there cases involving stmts
with multiple vdefs/vuses that are relevant for vectorization?


(3) loop invariants, dead code, redundant code:
      The general approach taken by the vectorizer when generating vector
stmts is to insert the vector code sequence corresponding to a certain
(scalar) stmt X just before stmt X, and to remove stmt X only if it has
side effects (like changing memory). If stmt X does not have side effects,
we rely on dead code elimination to remove it.

Following this approach, the vectorizer currently does the bare minimum,
relying on subsequent optimization phases to take care of any loop
invariants, dead code, etc. This indeed happens; the RTL-level optimizers
currently take care of it, and the resulting code is as efficient as it
could be.

Taking a look at the tree dumps below, you'll find that there are some
(temporary!) inefficiencies introduced by this approach:
(3.1) some loop invariant code is generated inside the loop body (lines 3,
9 and 17) instead of in the loop prolog.
(3.2) a few values that can be reused (like T.7_17, lines 1,2) are instead
recomputed (variables T.12_22 & T.7_28, lines 7,8 & 15,16).
(3.3) some of the original (scalar) code, which is now redundant, is left
in the loop (lines 6, 12, 14).

The vectorizer can take care of these inefficiency issues by itself. Q3.1:
should it? It's simply a design decision. Cleaning up may help subsequent
(tree-level) optimizations and may also save compilation time and memory.
In any case, the very first version of the vectorizer will probably
continue to take this approach. If we decide that the vectorizer should
cleanup after itself, I'll incorporate these optimizations into the next
versions of the vectorizer.


(4) naming conventions:
      The vectorizer creates many new variables; looking at the tree
dumps, more informative names for these variables would be helpful. For
example, when creating a pointer that is used to point to array b, a
prefix like p_b could be used. Q4.1: Are there any naming conventions?


(5) stmt_info data structure:
      During the analysis phases the vectorizer records some information
per stmt; for example, a pointer to the vectorized version of each stmt.
Say stmt1:'b=x[i]' is vectorized into vec_stmt1:'vb=px[T]'. When
vectorizing the following stmt2:'a=b', the vectorizer has to find the def
of 'vb'; it does so by first finding stmt1 (via 'SSA_NAME_DEF_STMT(b)'),
then finding vec_stmt1 from stmt1 (via 'stmt_info(stmt1)'), and finally
finding the def 'vb' from vec_stmt1.

This scheme works for operands that are SSA_NAMEs. Operands that aren't
are currently limited to array references appearing in load/store
operations, and are handled differently.

I've actually used info-per-SSA_NAME so far, but I'm considering switching
to an info-per-stmt scheme since (a) I can't use it for non-SSA_NAME
operands (arrays), (b) I had to allocate an array of size
"next_ssa_version", which is a bit wasteful considering I'm only looking
at SSA_NAMEs that are used in certain loops, and (c) I expect to have more
information to record per stmt in the future. My questions are:
Q5.1: Is there an available field (in stmt/stmt_ann) for recording such
information? I didn't find a general field, like a pointer to void, that
each optimization pass could use for its own purposes. Q5.2: Is it OK to
add such a field (assuming the answer to Q5.1 is "no")?
Q5.3: In GIMPLE, are there stmts that have more than one def? I mean
vectorizable stmts, not CALL_EXPRs and such. Are they conceivable, even if
none exist today?
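One possible info-per-stmt scheme is a small pass-local hash table keyed
by the stmt pointer, which would sidestep Q5.1/Q5.2 entirely. A minimal
sketch, using opaque void* keys as hypothetical stand-ins for GCC's stmt
trees (not the real stmt/stmt_ann structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-size open-addressing table mapping a stmt pointer to a
   pass-local info record, so nothing has to live in the stmt
   annotation itself.  */

#define TABLE_SIZE 64           /* power of two; grow as needed */

struct stmt_vec_info
{
  void *stmt;                   /* the scalar stmt (key)        */
  void *vectorized_stmt;        /* its vectorized counterpart   */
};

static struct stmt_vec_info table[TABLE_SIZE];

static size_t
hash_stmt (void *stmt)
{
  /* Drop the low bits (pointers are aligned), mask into the table.  */
  return ((uintptr_t) stmt >> 3) & (TABLE_SIZE - 1);
}

void
set_stmt_info (void *stmt, void *vec_stmt)
{
  size_t i = hash_stmt (stmt);
  while (table[i].stmt && table[i].stmt != stmt)   /* linear probing */
    i = (i + 1) & (TABLE_SIZE - 1);
  table[i].stmt = stmt;
  table[i].vectorized_stmt = vec_stmt;
}

void *
get_stmt_info (void *stmt)
{
  size_t i = hash_stmt (stmt);
  while (table[i].stmt && table[i].stmt != stmt)
    i = (i + 1) & (TABLE_SIZE - 1);
  return table[i].stmt ? table[i].vectorized_stmt : NULL;
}
```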


(6) modeling the target vector support:
      One of the issues that I haven't addressed yet is the modeling of
the vector support that is available in the target machine. The approach I
took for now is to generate only vector operations which can be expressed
using existing scalar tree codes (PLUS, MULT, etc.), after verifying that
they are supported by checking the relevant optab at the relevant
machine_mode (e.g., add_optab->handlers[(int) V8HImode].insn_code). If the
value found is CODE_FOR_nothing, then there's no target support, and we
can't vectorize the stmt. Otherwise - the stmt is transformed. The only
target specific information I'm currently exposing is the size of the
vector (in bytes) (through a variable in defaults.h which each target can
define). In the case of loads/stores, for which there's no optab (Q6.1: is
there?), I assume that the target supports any load/store of the vector
size (i.e., if the vector size is 16 bytes, it's possible to load/store 16
QIs, 8 HIs, 4 SIs/SFs, etc.).

Q6.2: Sounds reasonable so far?

There are a lot of vector operations for which this approach is not
expressive enough (e.g., a dot-product operation and reduction operations
in general, operations that perform computations on complex numbers,
conditional operations, and more). Q6.3: Should we add new TREE_CODES (and
corresponding optabs?) for the new types of operations we want to express?
Or instead, Q6.4: should the vectorizer generate calls to target-specific
builtin functions? In this case we'll need to decide (Q6.5:) how to
expose them to the tree level, and (Q6.6:) how to address the problem
that the compiler doesn't attempt to perform any optimizations around
target-specific builtins, not even to SSA their arguments. Or, Q6.7:
should we introduce new generic builtins that are totally exposed to the
compiler?

A related issue is how much information to expose to the tree level. We
want the vectorizer to have enough information to decide whether it's
worthwhile to vectorize; also, some patterns can be detected and
transformed much more easily at the tree level. Others should be left to
the RTL level.


Part III. Here are the tree dumps for the loop that is auto-vectorized.

The SSA trees for loop1.c just before the vectorizer:
========================================================
  int i;
  short int c[12];
  short int b[12];
  short int a[12];
  short int T.4;
  short int T.3;
  short int T.2;
  short int T.1;

  # BLOCK 0
  # PRED: ENTRY (fallthru,exec)
  goto <L1>;
  # SUCC: 2 (exec)

  # BLOCK 1
  # PRED: 2 (true,exec)
<L0>:;
  #   VUSE <b_5>;
  T.1_6 = b[i_1];
  #   VUSE <c_7>;
  T.2_8 = c[i_1];
  T.3_9 = T.1_6 + T.2_8;
  #   a_10 = VDEF <a_2>;
  a[i_1] = T.3_9;
  i_11 = i_1 + 1;
  # SUCC: 2 (fallthru,exec)

  # BLOCK 2
  # PRED: 1 (fallthru,exec) 0 (exec)
  # a_2 = PHI <a_4(0), a_10(1)>;
  # i_1 = PHI <0(0), i_11(1)>;
<L1>:;
  if (i_1 <= 11) goto <L0>;
   else goto <L2>;


The SSA trees for loop1.c just after the vectorizer:
I added the following things manually to the dump file:
- line numbering in the loop body
- the actual type where the dump just emitted "unnamed type"
- comments with variable names in an attempt to make it more readable
=====================================================================
  <unnamed type (v8hi)> * T.19;
  <unnamed type (v8hi)> * T.18;           # pa
  <unnamed type (unsigned int)> T.17;
  <unnamed type (unsigned int)> T.16;
  <unnamed type (v8hi)> T.15;             # va
  <unnamed type (v8hi)> * T.14;
  <unnamed type (v8hi)> * T.13;           # pc
  <unnamed type (unsigned int)> T.12;
  <unnamed type (unsigned int)> T.11;
  <unnamed type (v8hi)> T.10;             # vc
  <unnamed type (v8hi)> * T.9;
  <unnamed type (v8hi)> * T.8;            # pb
  <unnamed type (unsigned int)> T.7;
  <unnamed type (unsigned int)> T.6;
  <unnamed type (v8hi)> T.5;              # vb
  int i;
  short int c[12];
  short int b[12];
  short int a[12];
  short int T.4;
  short int T.3;
  short int T.2;
  short int T.1;

  # BLOCK 0
  # PRED: ENTRY (fallthru,exec)
  goto <L1>;
  # SUCC: 2 (exec)

  # BLOCK 1
  # PRED: 2 (true,exec)
<L0>:;
1)   T.6_16 = (<unnamed type> (unsigned int))i_1;
2)   T.7_17 = T.6_16 * 16;
3)   T.8_18 = &b;                   #pb = b
4)   T.9_19 = T.7_17 + T.8_18;      #pb' = pb + (i * 16)
     #   VUSE <b_5>;
5)   T.5_20 = *T.9_19;              #vb = (* pb')
     #   VUSE <b_5>;
6)   T.1_6 = b[i_1];                #original scalar code
7)   T.11_21 = (<unnamed type> (unsigned int))i_1;
8)   T.12_22 = T.11_21 * 16;
9)   T.13_23 = &c;                  #pc = c
10)  T.14_24 = T.12_22 + T.13_23;   #pc' = pc + (i * 16)
     #   VUSE <c_7>;
11)  T.10_25 = *T.14_24;            #vc = (* pc')
     #   VUSE <c_7>;
12)  T.2_8 = c[i_1];                #original scalar code
13)  T.15_26 = T.5_20 + T.10_25;    #va = vb + vc
14)  T.3_9 = T.1_6 + T.2_8;         #original scalar code
15)  T.16_27 = (<unnamed type> (unsigned int))i_1;
16)  T.17_28 = T.16_27 * 16;
17)  T.18_29 = &a;                  #pa = a
18)  T.19_30 = T.17_28 + T.18_29;   #pa' = pa + (i * 16)
     #   a_10 = VDEF <a_2>;
19)  *T.19_30 = T.15_26;            #(* pa') = va
20)  i_11 = i_1 + 1;
  # SUCC: 2 (fallthru,exec)

  # BLOCK 2
  # PRED: 1 (fallthru,exec) 0 (exec)
  # a_2 = PHI <a_4(0), a_10(1)>;
  # i_1 = PHI <0(0), i_11(1)>;
<L1>:;
  if (i_1 <= 11) goto <L0>;
  else goto <L2>;
# SUCC: 3 (false,exec) 1 (true,exec)


* Re: [tree-ssa] vectorizer related issues
  2003-12-16 15:34 [tree-ssa] vectorizer related issues Dorit Naishlos
@ 2003-12-16 23:55 ` Richard Henderson
  2003-12-17 11:22 ` Pop Sébastian
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Richard Henderson @ 2003-12-16 23:55 UTC (permalink / raw)
  To: Dorit Naishlos; +Cc: gcc

On Tue, Dec 16, 2003 at 03:05:16PM +0200, Dorit Naishlos wrote:
> (1) Array addressing using pointers.
...
> Is anyone planning to fix this situation?

Yes.  I've been fixing it slowly, but there appear to be quite a lot
of hidden problems.  My last patch, which merely *preserves* array_ref
on arrays in situations that we lower to pointer arithmetic in the
front end, had to be reverted because of bootstrap miscompilation.
I've not yet figured out what went wrong.

> Q2.1: Is anyone planning to address this issue? (so that the vdefs/vuses of
> arrays will be connected only to may/must aliased references?)

Eh?  I thought this was already the case...

> The vectorizer can take care of these inefficiency issues by itself. Q3.1:
> should it? It's simply a design decision.

I think the answer is "how difficult is it to do locally relative to
re-running other scalar cleanup operations?"  I.e.  if it's simple
enough to do locally then do so, but don't work so hard as to duplicate
one of the existing scalar cleanup passes.

Certainly as a starting point I wouldn't bother doing any cleanups
yourself.  We should just re-run other optimizers as needed.  We can
decide what needs doing here later.

> Q4.1: Are there any naming conventions?

Not to my knowledge, but there should be.

> (6) modeling the target vector support:

Looking at optabs for specific operations is reasonable.

Whether a vector mode ought to be handlable at all can be determined
from VECTOR_MODE_SUPPORTED_P.  The interface isn't ideal, since you
have to iterate over all modes, but all the required information is
present.

Note that checking "the vector size" isn't reasonable, because
there is no fixed vector size.  Consider i386 which has both
MMX (8 byte vectors) and SSE (16 byte vectors).
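As a minimal illustration of this point, GCC's generic vector_size
attribute lets both widths be declared side by side, so a single
per-target vector-size constant cannot describe such a machine (the
typedef names here are mine):

```c
#include <assert.h>

/* Two vector widths coexisting on one target (cf. i386 MMX vs. SSE).  */
typedef short v4hi __attribute__ ((vector_size (8)));   /* 4 x HI:  8 bytes */
typedef short v8hi __attribute__ ((vector_size (16)));  /* 8 x HI: 16 bytes */
```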

> Q6.3: Should we add new TREE_CODES (and corresponding optabs?) for the
> new types of operations we want to express?

Generic concepts such as dot product and saturating arithmetic should
not be handled with target-specific additions.  Whether this should be
tree codes or target-independent builtins is an open question.  Myself
I kinda like tree codes, but I seem to recall we're pressed for space
there.

> (Q6.6: ) how to address the problem that the
> compiler doesn't attempt to perform any optimizations around target
> specific builtins, not even to SSA their arguments

This should be fixed, IMO.



r~


* Re: [tree-ssa] vectorizer related issues
  2003-12-16 15:34 [tree-ssa] vectorizer related issues Dorit Naishlos
  2003-12-16 23:55 ` Richard Henderson
@ 2003-12-17 11:22 ` Pop Sébastian
  2004-01-06  7:21 ` law
  2004-01-08 16:42 ` Diego Novillo
  3 siblings, 0 replies; 9+ messages in thread
From: Pop Sébastian @ 2003-12-17 11:22 UTC (permalink / raw)
  To: Dorit Naishlos; +Cc: gcc

On Tue, Dec 16, 2003 at 03:05:16PM +0200, Dorit Naishlos wrote:
> With a lot of effort the vectorizer could deal with pointer arithmetic,

I was playing with pointer arithmetic in one of the versions of the
scalar evolution analysis.  Pointer arithmetic can be handled by this
analysis, as described by Robert van Engelen in "An Efficient
Algorithm for Pointer-to-Array Access Conversion for Compiling and
Optimizing DSP Applications": http://www.cs.fsu.edu/~engelen/iwia.pdf

> Is anyone planning to fix this situation? i.e., Q1.2: have the front-end
> represent p[i] as an array_ref even when p is declared a pointer, and 

This is feasible as a preprocessing pass that promotes pointers to
references.

> 
> (4) naming conventions:
>       The vectorizer creates many new variables; looking at the tree-dumps,
> having more informative names for variables would be helpful; for example,
> when creating a pointer which is used to point to array b, maybe use p_b as
> a prefix. Q4.1: Are there any naming conventions?
> 

Since there are no naming conventions for the moment, I think they
should be documented in the function that creates the temp vars:

/* Create a new temporary variable declaration of type TYPE.  DOES push the
   variable into the current binding.  Further, assume that this is called
   only from gimplification or optimization, at which point the creation of
   certain types are bugs.  */

tree
create_tmp_var (tree type, const char *prefix)
{
...
}

A grep for create_tmp_var gives the following prefixes:

gimplify.c:713:	  temp = create_tmp_var (TREE_TYPE (ptr), "retval");
gimplify.c:748:  tmp_var = create_tmp_var (ptr_type_node, "saved_stack");
gimplify.c:2066:	  tmp = create_tmp_var (TREE_TYPE (expr), "iftmp");
gimplify.c:2657:      tree flag = create_tmp_var (boolean_type_node, "cleanup");
gimplify.c:3237:	  tree tmp = create_tmp_var (TREE_TYPE (*expr_p), "vol");
tree-alias-common.c:640:	      tree temp = create_tmp_var_raw (void_type_node, "aliastmp");
tree-alias-common.c:781:	  tree fakedecl = create_tmp_var_raw (TREE_VALUE (arg), "normarg");
tree-alias-common.c:806:      tree fakedecl = create_tmp_var_raw (void_type_node, "fakearg");
tree-alias-common.c:814:      rdecl = create_tmp_var_raw (TREE_TYPE (TREE_TYPE (decl)), "_rv_");
tree-alias-common.c:865:	  tree fakedecl = create_tmp_var_raw (TREE_VALUE (arg), "ptfarg");
tree-alias-common.c:883:  rdecl = create_tmp_var_raw (TREE_TYPE (type), "_rv_");
tree-cfg.c:415:	      var = create_tmp_var (ptr_type_node, "gotovar");
tree-dfa.c:2476:      tag = create_tmp_var_raw (tag_type, "MT");
tree-eh.c:590:		new = create_tmp_var (TREE_TYPE (old), "rettmp");
tree-eh.c:785:      save_eptr = create_tmp_var (ptr_type_node, "save_eptr");
tree-eh.c:786:      save_filt = create_tmp_var (integer_type_node, "save_filt");
tree-eh.c:1092:  finally_tmp = create_tmp_var (integer_type_node, "finally_tmp");
tree-sra.c:150:      sra_map[var_ix][field_ix] = create_tmp_var (type, "SR");
tree-sra.c:424:      tree stmt, tmp = create_tmp_var (TREE_TYPE (rhs), "T");
tree-ssa-pre.c:2999:  ei->temp = create_tmp_var (TREE_TYPE (ei->expr), "pretmp");


* Re: [tree-ssa] vectorizer related issues
  2003-12-16 15:34 [tree-ssa] vectorizer related issues Dorit Naishlos
  2003-12-16 23:55 ` Richard Henderson
  2003-12-17 11:22 ` Pop Sébastian
@ 2004-01-06  7:21 ` law
  2004-01-08 16:42 ` Diego Novillo
  3 siblings, 0 replies; 9+ messages in thread
From: law @ 2004-01-06  7:21 UTC (permalink / raw)
  To: Dorit Naishlos; +Cc: gcc

In message
<OF84E7BD0B.816A0013-ONC2256DFE.002C8227-C2256DFE.0047E4E4@il.ibm.com>,
Dorit Naishlos writes:
 >(1) Array addressing using pointers.
 >      As illustrated in the example above (loop2.c), the basic approach for
 >handling array accesses is to generate a pointer to a vector type ('pb'),
 >have it point to the array base ('pb = (v8hi *)b'), and use that to access
 >the array ('vb = pb[i]'). I was hoping to get away with generating an
 >ARRAY_REF expression for pb[i], but the RTL expander didn't like that. So I
 >generated the following sequence of stmts instead, inspired by what the
 >compiler currently generates at the tree level for the C stmt 'vb = pb[i]':
 >
 >pb = &b
 >T.1 = (<unnamed type>)i_1;
 >T.2 = T.1 * 16;
 >T.3 = pb + T.2;
 >vb = *T.3;
 >
 >It's too bad that we lose the information that these accesses were simple
 >array references. Maybe there's a simple way to convince the expander to
 >digest the expression pb[i] (Q1.1: is there?). What's much more worrying is
 >that we lose this information even before vectorization takes place. This
 >goes back to the issue mentioned in
 >http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html.  With a lot of effort the
 >vectorizer could deal with pointer arithmetic, but I think this is exactly
 >the kind of effort we should save, working at the tree level.
 >
 >Is anyone planning to fix this situation? i.e., Q1.2: have the front-end
 >represent p[i] as an array_ref even when p is declared a pointer, and Q1.3:
 >have the RTL expander accept such array references.
 >
 >This is probably the most critical infrastructure fix required for the
 >vectorizer to be applicable in the short term. If gcc continues to
 >represent array references this way, it would be pretty useless to
 >vectorize specifically for array accesses. Good handling of pointer
 >arithmetic and pointer aliasing (which have well-known limitations) will
 >then be vital.
Richard has been working on a lot of this stuff.  I wouldn't be terribly
surprised to find out that we're still missing cases.

If you don't build the ARRAY_REF precisely how the rest of the compiler
expects, then all kinds of bad things happen.  I ran into this when
trying to stop the compiler from building pointer arithmetic for 
array indexing.

I'd suggest focusing on fixing whatever issues you run into when trying
to generate ARRAY_REFs for stuff like vb = pb[i].  Probably the first
step is to stop declaring pb as a pointer type since that's what ultimately
triggers the creation of pointer arithmetic instead of ARRAY_REFs in the
early parts of the compiler.

[ If it's the case that Richard's work was supposed to stop this lowering
  from ARRAY_REF to pointer arithmetic, then your job would be to figure
  out why Richard's fixes aren't solving your problem.  See
  c-typeck.c::build_array_ref. ]


 >(2) virtual defs/uses.
 >      This issue goes back to an action item on the tree-ssa todo list:
 >"SSA information for arrays
 >      The existing implementation treats arrays as an opaque object. A
 >definition to an array location is treated as a definition for the whole
 >array".
 >
 >Q2.1: Is anyone planning to address this issue? (so that the vdefs/vuses of
 >arrays will be connected only to may/must aliased references?)
 >
 >It's not a major obstacle at the moment because I'm simply ignoring the
 >virtual defs/uses, as they are currently not very informative. What I do
 >when I analyze cross-iteration def-use cycles, is examine the def-use cycle
 >related to each of the loop phi's. Virtual phi's are ignored, because I
 >don't know whether they represent a real def-use cycle or not; I rely on
 >other analysis passes to check the data references that are responsible for
 >virtual defs-uses. For now, the only form of non-scalar data access I allow
 >is array references. So the analysis of the loop scalar phi's together with
 >array data dependence tests, suffice. In the future, when other forms of
 >non-scalar accesses are allowed (structure, etc) - this workaround virtual
 >defs/uses would need to be revisited.
Note it is possible to tell the difference between PHIs for virtual
operands and those for real operands.  I think 

is_gimple_reg (PHI_RESULT (phi))

will indicate that a particular PHI is for real operands rather than
virtual operands.

I don't think anyone is currently working on changing the way virtual 
operands work such that a store to an array memory would no longer be
treated as a write to the whole array.

In general, I don't think that's feasible since that could easily lead to
an explosion of memory tags and virtual operands.  I'd be curious to know
what other compilers do for this kind of issue (particularly SGI's since
they must deal with similar issues and our memory tags/virtual operand
scheme has a lot of similarities to what SGI does).


 >By the way, (Q2.2:) is there an API that provides a mapping between a
 >virtual def/use and the operand it corresponds to? I guess in GIMPLE
 >finding this mapping is pretty straightforward because the grammar doesn't
 >allow a lot of flexibility; For example, I want to verify that any
 >vuse/vdef found corresponds to an array_ref (and not structure or pointer
 >access).
Not that I'm aware of.  We haven't needed it so we haven't expanded the
APIs to provide this kind of detail.


 >I currently do that as follows: if a (GIMPLE) stmt has any vdefs
 >(vuses), I check that it has exactly one vdef (vuse), I assume that it
 >corresponds to op0 (op1), and I expect that the stmt is of the form: op0 =
 >SSA_NAME/CONST (SSA_NAME = op1). Then I verify that op0 (op1) is an
 >ARRAY_REF. Is there a better way to do that?
I think this kind of stuff would really need to be gathered during the
get_stmt_operands walk.  Otherwise I would expect it to "mostly" work, but
fail in strange and wonderful ways since you're using a heuristic to recover
this information.

 >Q2.3: I'm assuming that if a stmt has multiple vdefs/vuses then it's either
 >a non vectorizable stmt (CALL_EXPR, asm) or there are aliasing problems
 >which also prevent vectorization. Are there cases involving stmts with
 >multiple vdefs/vuses that are relevant for vectorization?
I'd expect there are -- if for no other reason than we generate far more
virtual operands than we actually need to :(


 >
 >(3) loop invariants, dead code, redundant code:
 >      The general approach taken by the vectorizer when generating vector
 >stmts, is to insert the vector code sequence corresponding to a certain
 >(scalar) stmt X just before stmt X, and remove stmt X only if it has side
 >effects (like changing memory). If stmt X does not have side effects, we
 >rely on dead code elimination for removing it.
Sounds quite reasonable to me -- and it's what I (and others) would prefer
as a design goal.


 >According to this approach, the vectorizer is currently doing the bare
 >minimum, relying on subsequent optimization phases to take care of any loop
 >invariants, dead code etc. 
Good.

 >The vectorizer can take care of these inefficiency issues by itself. Q3.1:
 >should it? It's simply a design decision. Cleaning up may help subsequent
 >(tree-level) optimizations and may also save compilation time and memory.
 >In any case, the very first version of the vectorizer will probably
 >continue to take this approach. If we decide that the vectorizer should
 >cleanup after itself, I'll incorporate these optimizations into the next
 >versions of the vectorizer.
I'd prefer to keep the cleanups separate.  If we find that we're not cleaning
up something, then that's a good indication some other pass needs to be beefed
up.  For example, we know we need a good dead store eliminator
(store meaning a store to memory rather than an assignment to a non-addressable
scalar).



 >(5) stmt_info data structure:
 >      During the analysis phases the vectorizer records some information
 >per stmt. For example, a pointer to the vectorized version of each stmt;
 >Say stmt1:'b=x[i]' is vectorized into vec_stmt1:'vb=px[T]'. When
 >vectorizing the following stmt2:'a=b', the vectorizer has to find the def
 >of 'vb'; it does so by first finding stmt1 (via 'SSA_NAME_DEF_STMT(b)'),
 >then finding vec_stmt1 from stmt1 (via 'stmt_info(stmt1)'), and finally
 >finding the def 'vb' from vec_stmt1.
 >
 >This scheme works for operands that are SSA_NAMEs. Operands that aren't,
 >are currently limited to array references appearing in load/store
 >operations, and are handled differently.
 >
 >I've actually used info-per-SSA_NAME so far, but I'm considering to switch
 >to an info-per-stmt scheme since (a) I can't use it for non SSA_NAME
 >operands (arrays), (b) I had to allocate an array of size
 >"next_ssa_version" which is a bit wasteful considering I'm only looking at
 >SSA_NAMEs that are used in certain loops, and (c) I expect to have more
 >information to record in the future per stmt. My questions are:
 >Q5.1: Is there an available field (in stmt/stmt_ann) for recording such
 >information? I didn't find a general field, like a pointer to void, that
 >each optimization pass could use for it's own purposes. Q5.2: Is it OK to
 >add such a field (assuming the answer to Q5.1 is "no")?
Couldn't you put this in a hash table?  Our hash tables automagically
expand as they get collisions.  Meaning they tend to be reasonably
memory efficient for this kind of thing.  With that kind of scheme you
could store whatever you need for use by the vectorizer, but not pollute
the statement annotation.

I wouldn't object strongly to putting a generic pointer into the
statement annotation as a stop-gap, but long term the whole annotation
concept has major problems and IMHO they need to be completely rethought.
If a pass can keep its data in some local data structure not attached to
the annotation it's going to make fixing the annotation mess a lot easier
(and make those passes much less likely to break as we muck around with
the annotation scheme).


 >Q5.3: In GIMPLE, are there stmts that have more than one def, and I mean
 >vectorizable stmts, not CALL_EXPRs and such? are they conceivable, even if
 >not exist today?
Real defs?  Hmmm, I would think the only thing that would create multiple
real defs would be calls and asms.


 >
 >(6) modeling the target vector support:
 >      One of the issues that I haven't addressed yet is the modelling of
 >the vector support that is available in the target machine. The approach I
 >took for now is to generate only vector operations which can be expressed
 >using existing scalar tree codes (PLUS, MULT etc), after verifying that
 >they are supported by checking the relevant optab at the relevant
 >machine_mode (e.g., add_optab->handlers[(int) V8HImode].insn_code). If the
 >value found is CODE_FOR_nothing, then there's no target support, and we
 >can't vectorize the stmt. Otherwise - the stmt is transformed. The only
 >target specific information I'm currently exposing is the size of the
 >vector (in bytes) (through a variable in defaults.h which each target can
 >define). In the case of loads/stores, for which there's no optab (Q6.1: is
 >there?), I assume that the target supports any loads/store of the vector
 >size (i.e., if the vector size is 16 bytes, it's possible to load/store 16
 >QIs, 8 HIs, 4 SIs/SFs, etc).
 >
 >Q6.2: Sounds reasonable so far?
Yes.  Though we could also consider vectorizing those loops and allowing
the expanders to break them down into serial code (or smaller vectors)
if the target doesn't directly support the width of the vectors we found.


 >There are a lot of vector operations for which this approach is not
 >expressive enough (e.g., a dot-product operation and reduction operations
 >in general, operations that perform computations on complex numbers,
 >conditional operations, and more). Q6.3: Should we add new TREE_CODES (and
 >corresponding optabs?) for the new types of operations we want to express?
 >Or instead, Q6.4: should the vectorizer generate calls to target-specific
 >builtin functions?  In this case we'll need to decide (Q6.5:) how to expose
 >them to the tree level, and (Q6.6:) how to address the problem that the
 >compiler doesn't attempt to perform any optimizations around
 >target-specific builtins, not even to SSA their arguments; or, Q6.7:
 >should we introduce new generic builtins that are totally exposed to the
 >compiler?
We're probably going to need new tree nodes for various things in this
space.  This isn't totally unexpected.

Note that as we create new nodes for vectorizing purposes we'll need to 
update the definition of the gimple grammar.


 >A related issue is how much information to expose to the tree level.  We
 >want the vectorizer to have enough information to decide whether it's
 >worthwhile to vectorize.  Also, some patterns can be detected and
 >transformed much more easily at the tree level; others should be left to
 >the RTL level.
I think we want to expose as little as possible.  Ideally we'd expose
nothing about the target to the tree optimizers -- in reality various
things are bound to need to leak through -- but I'd like it kept to a
minimum.  (Some things like BRANCH_COST already leak through, much to my
annoyance.)


Jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tree-ssa] vectorizer related issues
  2003-12-16 15:34 [tree-ssa] vectorizer related issues Dorit Naishlos
                   ` (2 preceding siblings ...)
  2004-01-06  7:21 ` law
@ 2004-01-08 16:42 ` Diego Novillo
  2004-01-08 23:50   ` Devang Patel
  3 siblings, 1 reply; 9+ messages in thread
From: Diego Novillo @ 2004-01-08 16:42 UTC (permalink / raw)
  To: Dorit Naishlos; +Cc: gcc

On Tue, 2003-12-16 at 08:05, Dorit Naishlos wrote:

> (2) virtual defs/uses.
>       This issue goes back to an action item on the tree-ssa todo list:
> "SSA information for arrays
>       The existing implementation treats arrays as an opaque object. A
> definition to an array location is treated as a definition for the whole
> array".
> 
> Q2.1: Is anyone planning to address this issue? (so that the vdefs/vuses of
> arrays will be connected only to may/must aliased references?)
> 
It's in my long term todo list but I haven't thought about it much.  One
vague idea was to try out the various SSA-for-arrays papers that are
floating about.  I don't know how they compare to the traditional
dependency analysis, though.


> By the way, (Q2.2:) is there an API that provides a mapping between a
> virtual def/use and the operand it corresponds to?
>
No, there isn't.  You're right.  GIMPLE is pretty explicit in memory
references.

> I currently do that as follows: if a (GIMPLE) stmt has any vdefs
> (vuses), I check that it has exactly one vdef (vuse), I assume that it
> corresponds to op0 (op1), and I expect that the stmt is of the form: op0 =
> SSA_NAME/CONST (SSA_NAME = op1). Then I verify that op0 (op1) is an
> ARRAY_REF. Is there a better way to do that?
>
Yes, that should be fine.  tree-sra.c does something similar for
structure references.  We could probably grow a few generic functions if
you find enough commonalities.

The only places you have to be careful are call-sites where you can have
multiple VDEFs for all call-clobbered variables or a single VDEF for
.GLOBAL_VAR (when num_call_sites x num_call_clobbered_vars is above some
threshold).

Other than that, I would first filter by statement code.  Array
references can only be found inside MODIFY_EXPRs.  To determine whether
it's an array or not, you can check the type of the variable in the
VDEF/VUSE operand.  If TREE_CODE (TREE_TYPE (SSA_NAME_VAR (VDEF_RESULT)))
== ARRAY_TYPE, then you have an array store.  Similarly for loads.

> Q3.3: I'm assuming that if a stmt has multiple vdefs/vuses then it's either
> a non-vectorizable stmt (CALL_EXPR, asm) or there are aliasing problems
> that also prevent vectorization.  Are there cases involving stmts with
> multiple vdefs/vuses that are relevant for vectorization?
> 
There shouldn't be.  In fact, that's something interesting to add to the
SSA verifier.

> The vectorizer can take care of these inefficiency issues by itself. Q3.1:
> should it?
>
In principle, I would avoid doing this.  I'd rather have the vectorizer
specify what scalar cleanups need to be performed after it runs.  This
would be part of the info we give to the pass manager that we briefly
discussed recently
(http://gcc.gnu.org/ml/gcc-patches/2003-12/msg00416.html).


> (4) naming conventions:
>       The vectorizer creates many new variables; looking at the tree-dumps,
> having more informative names for variables would be helpful; for example,
> when creating a pointer which is used to point to array b, maybe use p_b as
> a prefix. Q4.1: Are there any naming conventions?
> 
None.  Though we have loosely named them with a prefix that denotes
where they came from:

MT.n		Memory tags
SR.n		Scalar replacements
iftmp.n		Expanded ?: predicates
pretmp.n	PRE generated temporaries
T.n		Any other temporary

> 
> (5) stmt_info data structure:
>       During the analysis phases the vectorizer records some information
> per stmt. For example, a pointer to the vectorized version of each stmt;
> [ ... ]
>  Q5.2: Is it OK to
> add such a field (assuming the answer to Q5.1 is "no")?
>
Yes.  stmt_ann is meant to be an extensible scratch pad for this kind of
thing.  In the future we may want to split it up or convert some fields
into hash tables or have a more generic statement/variable annotation
machinery.

> Q5.3: In GIMPLE, are there stmts that have more than one def, and I mean
> vectorizable stmts, not CALL_EXPRs and such?  Are they conceivable, even
> if they don't exist today?
> 
Only asms.  Other than that, the only statements that generate values
are MODIFY_EXPR and CALL_EXPR.


Diego.


* Re: [tree-ssa] vectorizer related issues
  2004-01-08 16:42 ` Diego Novillo
@ 2004-01-08 23:50   ` Devang Patel
  2004-01-09  3:37     ` Diego Novillo
  0 siblings, 1 reply; 9+ messages in thread
From: Devang Patel @ 2004-01-08 23:50 UTC (permalink / raw)
  To: Diego Novillo; +Cc: Dorit Naishlos, gcc


On Jan 8, 2004, at 8:42 AM, Diego Novillo wrote:

> On Tue, 2003-12-16 at 08:05, Dorit Naishlos wrote:
>
>> (2) virtual defs/uses.
>>       This issue goes back to an action item on the tree-ssa todo 
>> list:
>> "SSA information for arrays
>>       The existing implementation treats arrays as an opaque object. A
>> definition to an array location is treated as a definition for the 
>> whole
>> array".
>>
>> Q2.1: Is anyone planning to address this issue? (so that the 
>> vdefs/vuses of
>> arrays will be connected only to may/must aliased references?)
>>
> It's in my long term todo list but I haven't thought about it much.  
> One
> vague idea was to try out the various SSA-for-arrays papers that are
> floating about.  I don't know how they compare to the traditional
> dependency analysis, though.

Recently I spent some time understanding
"Array SSA form and its use in Parallelization
   - Kathleen Knobe, Vivek Sarkar
  (POPL 98)"

Any thoughts about their approach?  What other papers do you have in mind?

--
Devang


* Re: [tree-ssa] vectorizer related issues
  2004-01-08 23:50   ` Devang Patel
@ 2004-01-09  3:37     ` Diego Novillo
  2004-01-10  1:12       ` Devang Patel
  0 siblings, 1 reply; 9+ messages in thread
From: Diego Novillo @ 2004-01-09  3:37 UTC (permalink / raw)
  To: Devang Patel; +Cc: Dorit Naishlos, gcc

On Thu, 2004-01-08 at 18:49, Devang Patel wrote:

> Recently I spent some time understanding
> "Array SSA form and its use in Parallelization
>    - Kathleen Knobe, Vivek Sarkar
>   (POPL 98)"
> 
> Any thoughts about their approach?
>
Yes, that's the one you'll see most often referenced.  From what I
remember, we could use the VDEFs as their define-phi nodes.  I wouldn't
want to use actual PHI_NODEs because those have specific control-flow
semantics that define-phis don't.

The "time-stamp" @ operators could be added to the virtual operands of
statements.  Though we may also want to add them as real statements.  We
will need to quickly collect iteration vectors with all the
corresponding index variables in nested loops for these time stamps.  I
haven't thought much about that and it's been a while since I last read
the paper.


>  What other papers do you have in mind?
> 
@INPROCEEDINGS{bib:collard-99,
  AUTHOR    = {Jean-Francois Collard},
  TITLE     = {Array {SSA} for Explicitly Parallel Programs},
  BOOKTITLE = {European Conference on Parallel Processing},
  PAGES     = {383--390},
  YEAR      = {1999},
}


Diego.


* Re: [tree-ssa] vectorizer related issues
  2004-01-09  3:37     ` Diego Novillo
@ 2004-01-10  1:12       ` Devang Patel
  0 siblings, 0 replies; 9+ messages in thread
From: Devang Patel @ 2004-01-10  1:12 UTC (permalink / raw)
  To: Diego Novillo; +Cc: Dorit Naishlos, gcc


On Jan 8, 2004, at 7:37 PM, Diego Novillo wrote:

> On Thu, 2004-01-08 at 18:49, Devang Patel wrote:
>
>> Recently I spent some time understanding
>> "Array SSA form and its use in Parallelization
>>    - Kathleen Knobe, Vivek Sarkar
>>   (POPL 98)"
>>
>> Any thoughts about their approach?
>>
> Yes, that's the one you'll see most often referenced.  From what I
> remember, we could use the VDEFs as their define-phi nodes.  I wouldn't
> want to use actual PHI_NODEs because those have specific control-flow
> semantics that define-phis don't.
>
> The "time-stamp" @ operators could be added to the virtual operands of
> statements.  Though we may also want to add them as real statements.  
> We
> will need to quickly collect iteration vectors with all the
> corresponding index variables in nested loops for these time stamps.

Space is also a concern for the @ operators, and they add runtime overhead.

--
Devang


* RE: [tree-ssa] vectorizer related issues
@ 2003-12-16 15:39 S. Bosscher
  0 siblings, 0 replies; 9+ messages in thread
From: S. Bosscher @ 2003-12-16 15:39 UTC (permalink / raw)
  To: 'Dorit Naishlos ', 'gcc@gcc.gnu.org '

Dorit Naishlos asked:
> Q6.3: Should we add new TREE_CODES (and corresponding optabs?) for the
> new types of operations we want to express?
> Or instead, Q6.4: should the vectorizer generate calls to target
> specific builtins?

I would prefer tree codes for generic operations and operations that can be
expressed as a sequence of simpler operations, but please note that we risk
running out of tree codes RSN. Obj-C++ can only barely fit in the existing
maximum of 256 tree codes :-/
Also, perhaps target builtins are better, since this is obviously all
target-specific anyway...

Gr.
Steven




Thread overview: 9+ messages
2003-12-16 15:34 [tree-ssa] vectorizer related issues Dorit Naishlos
2003-12-16 23:55 ` Richard Henderson
2003-12-17 11:22 ` Pop Sébastian
2004-01-06  7:21 ` law
2004-01-08 16:42 ` Diego Novillo
2004-01-08 23:50   ` Devang Patel
2004-01-09  3:37     ` Diego Novillo
2004-01-10  1:12       ` Devang Patel
2003-12-16 15:39 S. Bosscher
