public inbox for gcc-patches@gcc.gnu.org
* Re: [patch] [4.3 projects] outer-loop vectorization patch 2/n
       [not found] <OF23B76DCA.3E1B28C1-ONC2257334.0041189F-C2257334.0041E49C@LocalDomain>
@ 2007-08-11 17:01 ` Dorit Nuzman
  0 siblings, 0 replies; 4+ messages in thread
From: Dorit Nuzman @ 2007-08-11 17:01 UTC (permalink / raw)
  To: Dorit Nuzman; +Cc: gcc-patches, Jack Howarth

[-- Attachment #1: Type: text/plain, Size: 1410 bytes --]

Dorit Nuzman/Haifa/IBM wrote on 11/08/2007 14:59:43:

> Dorit Nuzman/Haifa/IBM wrote on 11/08/2007 13:38:26:
>
> > Hi,
> >
> > This is the second part of http://gcc.gnu.org/ml/gcc-
> > patches/2007-08/msg00461.html. It adds support for memory-references
> > in the inner-loop of outer-loop-vectorization. I'll use the
> > following example to describe the features that were added:
> ...
> >         * gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: New test.
>
> I fixed this testcase at the last minute (following a failure
> pointed out to me by Jack Howarth - thanks to Jack for testing the
> outer-loop patch on powerpc-apple-darwin9!). So I fixed one problem
> and introduced another - in the patch I sent there's a typo in this
> testcase: it includes "tree-vect.h" instead of "../../tree-vect.h".
> Sorry about that.
>

Another mixup: the testcase vect-outer-fir-lb.c that I posted included
changes that belong to the third part of the outer-loop patch...
:-(  So here's what the patch for this testcase should look like:

(See attached file: vect-outer-fir-lb.c.txt)

I apologize again for the confusion.
(I hope to repost an updated version of this patch with these two testcases
fixed soonish).

dorit


> dorit
>
> >
> > #### mainlineouterloopdiff23t.txt has been deleted (was saved in
> > repository MyAttachments Repository ->) from this note on 11 August
> > 2007 by Dorit Nuzman

[-- Attachment #2: vect-outer-fir-lb.c.txt --]
[-- Type: text/plain, Size: 1913 bytes --]

Index: testsuite/gcc.dg/vect/vect-outer-fir-lb.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-fir-lb.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-fir-lb.c	(revision 0)
***************
*** 0 ****
--- 1,80 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 64
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because the compiler doesn't realize 
+    that the inner-loop executes at least once (cause k<4), and so the
+    vectorizer doesn't know that the loop is countable.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     j = k;
+ 
+     do {
+       diff += in[j+i]*coeff[j]; 
+       j+=4;	
+     } while (j < M);
+ 
+     out[i] += diff;
+   }
+  }
+ 
+ }
+ 
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+   
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */


* Re: [patch] [4.3 projects] outer-loop vectorization patch 2/n
  2007-08-11 10:36 Dorit Nuzman
@ 2007-08-12 15:02 ` Dorit Nuzman
  0 siblings, 0 replies; 4+ messages in thread
From: Dorit Nuzman @ 2007-08-12 15:02 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 9518 bytes --]

Attached is the updated patch (part 2, updated to a more recent snapshot,
as there were a few changes that went into mainline in the meantime that
made the previous patch inapplicable)

Bootstrapped on powerpc64-linux,
bootstrapped with vectorization enabled on i386-linux,
and tested on the vectorizer testcases.

dorit

(See attached file: updated-outerloop-patch2.txt)

> Hi,

> This is the second part of
> http://gcc.gnu.org/ml/gcc-patches/2007-08/msg00461.html. It adds support
> for memory-references in the inner-loop of outer-loop-vectorization. I'll
> use the following example to describe the features that were added:

> for (i=0; i<N; i++){
> s=0;
> for (j=0; j<M; j++)
> s += a[i+j] * b[j];
> a[i]=s;
> }
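
To make the transformation concrete, here is a minimal scalar sketch of what
outer-loop vectorization does to a loop nest of this shape. It is not taken
from the patch: the names fir_scalar and fir_outer_vectorized, the VS = 4
vector size, and the separate output array (the example above stores back
into a[i]) are assumptions made for illustration. Each "vector" is modelled
as VS plain floats, one lane per outer-loop iteration.

```c
#include <assert.h>
#include <stddef.h>

enum { VS = 4 };  /* assumed vector size, in elements */

/* Scalar reference: out[i] = sum over j of in[i+j] * coeff[j].  */
void fir_scalar (const float *in, const float *coeff, float *out,
                 size_t n, size_t m)
{
  for (size_t i = 0; i < n; i++)
    {
      float s = 0.0f;
      for (size_t j = 0; j < m; j++)
        s += in[i + j] * coeff[j];
      out[i] = s;
    }
}

/* Outer-loop vectorized shape: the i-loop advances by VS and each lane
   carries one original i-iteration.  The inner j-loop keeps its scalar
   step of 1, which is why the start of the in[] access moves by one
   element per inner iteration, while coeff[j] is the same for all lanes.  */
void fir_outer_vectorized (const float *in, const float *coeff, float *out,
                           size_t n, size_t m)
{
  assert (n % VS == 0);  /* this sketch has no epilogue loop */
  for (size_t i = 0; i < n; i += VS)
    {
      float vs[VS] = { 0.0f };
      for (size_t j = 0; j < m; j++)
        for (size_t lane = 0; lane < VS; lane++)
          vs[lane] += in[i + j + lane] * coeff[j];
      for (size_t lane = 0; lane < VS; lane++)
        out[i + lane] = vs[lane];
    }
}
```

Both functions expect in[] to hold at least n + m - 1 elements. Every lane
performs the same sequence of operations as the corresponding scalar
iteration, so the two outputs should match element for element.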

> The patch includes the following changes to the vectorizer:

> - To analyze the initial-address and step of inner-loop references relative
> to the outer-loop, I used the function split_constant_offset. I basically
> take the BASE+INIT+OFFSET that was computed relative to the inner-loop and
> analyze it relative to the outer-loop (as discussed here
> http://gcc.gnu.org/ml/gcc-patches/2007-05/msg00648.html).
> - Generally the vectorizer does not support invariant accesses. One
> exception that we add here is memory-references in the inner-loop that have
> a zero step in the outer-loop. For example, the b[j] access in the loop
> example above. Because the access b[j] has no evolution in the outer-loop
> we have to duplicate the value b[j] into all entries of the vector. At the
> moment this is done by simply adding this duplication on top of the current
> scheme: i.e. we continue to generate a regular vector load, and then we
> extract the first element and duplicate it:
> vb = b[j,j+1,j+2,j+3]
> sb = BIT_FIELD_REF (vb, bitpos, bitsize) # extract b[j]
> vb = {sb, sb, sb, sb}
> (there are better ways to do this - so this will be improved later).
> In order to be able to use the function 'vect_init_vector' to create the
> vector vb above, we extend it to place the vector initialization at BSI (a
> new argument passed to the function) instead of always inserting it at the
> loop preheader.
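
The load-extract-duplicate sequence described in the item above can be
sketched in plain C as follows. This is only an illustration under stated
assumptions: the vec4f type and the helpers vec_load, vec_extract, vec_splat
and load_invariant are hypothetical stand-ins (vec_extract plays the role of
BIT_FIELD_REF), and VS = 4 is assumed.

```c
#include <assert.h>
#include <stddef.h>

enum { VS = 4 };  /* assumed vector size, in elements */

/* Hypothetical stand-in for a hardware vector of VS floats.  */
typedef struct { float lane[VS]; } vec4f;

/* Regular vector load: p[0], p[1], ..., p[VS-1].  */
vec4f vec_load (const float *p)
{
  vec4f v;
  for (size_t l = 0; l < VS; l++)
    v.lane[l] = p[l];
  return v;
}

/* Extract one element; stands in for BIT_FIELD_REF (vb, bitpos, bitsize).  */
float vec_extract (vec4f v, size_t lane)
{
  assert (lane < VS);
  return v.lane[lane];
}

/* Build {s, s, s, s}.  */
vec4f vec_splat (float s)
{
  vec4f v;
  for (size_t l = 0; l < VS; l++)
    v.lane[l] = s;
  return v;
}

/* b[j] has zero step in the outer-loop, so every lane needs the same
   value: load a full vector, take element 0, and duplicate it.  */
vec4f load_invariant (const float *b, size_t j)
{
  vec4f vb = vec_load (&b[j]);    /* vb = b[j,j+1,j+2,j+3] */
  float sb = vec_extract (vb, 0); /* sb = b[j] */
  return vec_splat (sb);          /* vb = {sb, sb, sb, sb} */
}
```

Note that in this sketch the full vector load still touches b[j+1..j+VS-1],
so the array must be readable that far; only element 0 is actually used.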

> - Support misaligned accesses. In case the misalignment remains fixed (i.e.
> the step (stride) of the accesses in the inner-loop is a multiple of the
> Vector Size (VS)), this can be vectorized using the optimized
> realignment-scheme (which used to be called the "software-pipelined"
> scheme, and is now called "optimized_explicit_realign"): the computation of
> the misalignment can be taken out of the loop, and only one additional
> vector load is generated (before the loop) instead of 2 in each iteration
> (we basically do predictive-commoning here). In case the misalignment does
> *not* remain fixed throughout the iterations of the loop (as is the case in
> the example loop above), we cannot use the optimized scheme. Instead we
> need to compute the misalignment inside the inner-loop along with the two
> vector loads (this is the newly added "explicit_realign" scheme; for more
> details see also
> http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00248.html).
> (this is also planned to be improved later).
> To support this, a couple of functions - 'vect_setup_realignment' and
> 'vectorizable_load' - that used to support only the optimized realignment
> scheme were modified to support both schemes. See detailed documentation in
> 'vect_setup_realignment' and 'vect_supportable_dr_alignment'.
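
A rough model of the unoptimized "explicit_realign" scheme described in the
item above: every misaligned vector load becomes two aligned loads plus a
select driven by the misalignment, all recomputed at each access. The helper
names (aligned_load_floor, realign_load, load_explicit_realign), the vec4f
type, and the exact select semantics are assumptions for illustration; a real
target encodes the "realignment token" as a permutation mask and may define
the aligned case differently. Addresses are modelled as element indices into
an array whose index 0 is assumed to be vector-aligned.

```c
#include <assert.h>
#include <stddef.h>

enum { VS = 4 };  /* assumed vector size, in elements */

/* Hypothetical stand-in for a hardware vector of VS floats.  */
typedef struct { float lane[VS]; } vec4f;

/* Aligned load: round the index down to a multiple of VS first.  */
vec4f aligned_load_floor (const float *base, size_t idx)
{
  size_t start = idx - idx % VS;
  vec4f v;
  for (size_t l = 0; l < VS; l++)
    v.lane[l] = base[start + l];
  return v;
}

/* Stand-in for REALIGN_LOAD <v1, v2, rt>: pick VS consecutive elements,
   starting at offset MIS, from the concatenation of v1 and v2.  */
vec4f realign_load (vec4f v1, vec4f v2, size_t mis)
{
  assert (mis < VS);
  vec4f r;
  for (size_t l = 0; l < VS; l++)
    {
      size_t k = mis + l;
      r.lane[l] = k < VS ? v1.lane[k] : v2.lane[k - VS];
    }
  return r;
}

/* Misaligned load of base[idx..idx+VS-1].  The misalignment and both
   aligned loads are computed on every call, as in the unoptimized scheme.  */
vec4f load_explicit_realign (const float *base, size_t idx)
{
  size_t mis = idx % VS;                              /* per-access token */
  vec4f v1 = aligned_load_floor (base, idx);          /* floor (p) */
  vec4f v2 = aligned_load_floor (base, idx + VS - 1); /* floor (p + VS-1) */
  return realign_load (v1, v2, mis);
}
```

Using an offset of VS-1 rather than VS for the second load means that when
idx happens to be aligned, both loads read the same vector and nothing past
the requested elements is touched.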

> - The functions that deal with creating/initializing/updating the pointer
> that is used for the vector loads/stores had to be modified a bit:
> * 'vect_create_data_ref_ptr' now needs to create an update chain both
> in the inner-loop and in the outer-loop. It also needs to consider if the
> outer-loop step is 0 (the only case where the pointer will not be bumped by
> VS (Vector Size)). See detailed documentation in this function.
> * 'vect_create_addr_base_for_vector_ref' needs to know relative to
> which loop the address-base is requested (to know whether to use the
> step/offset/init relative to the inner or outer loop).
> * 'bump_vector_ptr' is extended to support bump amounts other than VS
> (we need to bump by VS-1 for the "explicit_realign" scheme).

> Bootstrapped on powerpc64-linux,
> bootstrapped with vectorization enabled on i386-linux,
> passed full regression testing on both platforms.

> I will wait at least a week to give people a chance to review and comment.

> thanks,
> dorit

> ChangeLog:

> * tree-data-refs.c (split_constant_offset): Expose.
> * tree-data-refs.h (split_constant_offset): Add declaration.
>
> * tree-vectorizer.h (dr_alignment_support): Renamed
> dr_unaligned_software_pipeline to dr_explicit_realign_optimized.
> Added a new value dr_explicit_realign.
> (_stmt_vec_info): Added new fields: dr_base_address, dr_init,
> dr_offset, dr_step, and dr_aligned_to, along with new access
> functions for these fields: STMT_VINFO_DR_BASE_ADDRESS,
> STMT_VINFO_DR_INIT, STMT_VINFO_DR_OFFSET, STMT_VINFO_DR_STEP, and
> STMT_VINFO_DR_ALIGNED_TO.
>
> * tree-vectorizer.c (vect_supportable_dr_alignment): Add
> documentation.
> In case of outer-loop vectorization with non-fixed misalignment - use
> the dr_explicit_realign scheme instead of the optimized realignment
> scheme.
> (new_stmt_vec_info): Initialize new fields.
>
> * tree-vect-analyze.c (vect_compute_data_ref_alignment): Handle the
> 'nested_in_vect_loop' case. Change verbosity level.
> (vect_analyze_data_ref_access): Handle the 'nested_in_vect_loop' case.
> Don't fail on zero step in the outer-loop for loads.
> (vect_analyze_data_refs): Call split_constant_offset to calculate base,
> offset and init relative to the outer-loop.
>
> * tree-vect-transform.c (vect_create_data_ref_ptr): Replace the unused
> BSI function argument with a new function argument - at_loop.
> Simplify the condition that determines STEP. Takes additional argument
> INV_P. Support outer-loop vectorization (handle the nested_in_vect_loop
> case), including zero step in the outer-loop. Call
> vect_create_addr_base_for_vector_ref with additional argument.
> (vect_create_addr_base_for_vector_ref): Takes additional argument LOOP.
> Updated function documentation. Handle the 'nested_in_vect_loop' case.
> Fixed and simplified calculation of step.
> (vectorizable_store): Call vect_create_data_ref_ptr with loop instead
> of bsi, and with additional argument. Call bump_vector_ptr with
> additional argument. Fix typos. Handle the 'nested_in_vect_loop' case.
> (vect_setup_realignment): Takes additional arguments INIT_ADDR and
> DR_ALIGNMENT_SUPPORT. Returns another value AT_LOOP. Handle the case
> when the realignment setup needs to take place inside the loop. Support
> the dr_explicit_realign scheme. Allow generating the optimized
> realignment scheme for outer-loop vectorization. Added documentation.
> (vectorizable_load): Support the dr_explicit_realign scheme. Handle the
> 'nested_in_vect_loop' case, including loads that are invariant in the
> outer-loop and the realignment schemes. Handle the case when the
> realignment setup needs to take place inside the loop. Call
> vect_setup_realignment with additional arguments.  Call
> vect_create_data_ref_ptr with additional argument and with loop instead
> of bsi. Fix 80-column overflow. Fix typos. Rename PHI_STMT to PHI.
> (vect_gen_niters_for_prolog_loop): Call
> vect_create_addr_base_for_vector_ref with additional arguments.
> (vect_create_cond_for_align_checks): Likewise.
> (bump_vector_ptr): Updated to support the new dr_explicit_realign
> scheme: takes additional argument bump; argument ptr_incr is now
> optional; updated documentation.
> (vect_init_vector): Takes additional argument (bsi). Use it, if
> available, to insert the vector initialization.
> (get_initial_def_for_induction): Pass additional argument in call to
> vect_init_vector.
> (vect_get_vec_def_for_operand): Likewise.
> (vect_setup_realignment): Likewise.
> (vectorizable_load): Likewise.
>
> testsuite/ChangeLog:

> * gcc.dg/vect/vect-117.c: Change inner-loop bound to
> unknown (so that outer-loop won't get analyzed).
> * gcc.dg/vect/vect-outer-1a.c: New test.
> * gcc.dg/vect/vect-outer-1b.c: New test.
> * gcc.dg/vect/vect-outer-1.c: New test.
> * gcc.dg/vect/vect-outer-2a.c: New test.
> * gcc.dg/vect/vect-outer-2b.c: New test.
> * gcc.dg/vect/vect-outer-2c.c: New test.
> * gcc.dg/vect/vect-outer-2.c: New test.
> * gcc.dg/vect/vect-outer-3a.c: New test.
> * gcc.dg/vect/vect-outer-3b.c: New test.
> * gcc.dg/vect/vect-outer-3c.c: New test.
> * gcc.dg/vect/vect-outer-3.c: New test.
> * gcc.dg/vect/vect-outer-4a.c: New test.
> * gcc.dg/vect/vect-outer-4b.c: New test.
> * gcc.dg/vect/vect-outer-4c.c: New test.
> * gcc.dg/vect/vect-outer-4d.c: New test.
> * gcc.dg/vect/vect-outer-4e.c: New test.
> * gcc.dg/vect/vect-outer-4f.c: New test.
> * gcc.dg/vect/vect-outer-4g.c: New test.
> * gcc.dg/vect/no-section-anchors-vect-outer-4h.c: New test.
> * gcc.dg/vect/vect-outer-4i.c: New test.
> * gcc.dg/vect/vect-outer-4j.c: New test.
> * gcc.dg/vect/vect-outer-4k.c: New test.
> * gcc.dg/vect/vect-outer-4l.c: New test.
> * gcc.dg/vect/vect-outer-4m.c: New test.
> * gcc.dg/vect/vect-outer-4.c: New test.
> * gcc.dg/vect/vect-outer-5.c: New test.
> * gcc.dg/vect/vect-outer-6.c: New test.
> * gcc.dg/vect/vect-outer-fir.c: New test.
> * gcc.dg/vect/vect-outer-fir-lb.c: New test.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: New test.
>
> (See attached file: mainlineouterloopdiff23t.txt)

[-- Attachment #2: updated-outerloop-patch2.txt --]
[-- Type: text/plain, Size: 127834 bytes --]

*** tree-vect-analyze.c.patch1	2007-08-12 15:27:22.000000000 +0300
--- tree-vect-analyze.c	2007-08-12 15:31:34.000000000 +0300
*************** vect_compute_data_ref_alignment (struct 
*** 1193,1198 ****
--- 1193,1200 ----
  {
    tree stmt = DR_STMT (dr);
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);  
+   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree ref = DR_REF (dr);
    tree vectype;
    tree base, base_addr;
*************** vect_compute_data_ref_alignment (struct 
*** 1209,1221 ****
    misalign = DR_INIT (dr);
    aligned_to = DR_ALIGNED_TO (dr);
    base_addr = DR_BASE_ADDRESS (dr);
    base = build_fold_indirect_ref (base_addr);
    vectype = STMT_VINFO_VECTYPE (stmt_info);
    alignment = ssize_int (TYPE_ALIGN (vectype)/BITS_PER_UNIT);
  
!   if (tree_int_cst_compare (aligned_to, alignment) < 0)
      {
!       if (vect_print_dump_info (REPORT_DETAILS))
  	{
  	  fprintf (vect_dump, "Unknown alignment for access: ");
  	  print_generic_expr (vect_dump, base, TDF_SLIM);
--- 1211,1252 ----
    misalign = DR_INIT (dr);
    aligned_to = DR_ALIGNED_TO (dr);
    base_addr = DR_BASE_ADDRESS (dr);
+ 
+   /* In case the dataref is in an inner-loop of the loop that is being
+      vectorized (LOOP), we use the base and misalignment information
+      relative to the outer-loop (LOOP). This is ok only if the misalignment
+      stays the same throughout the execution of the inner-loop, which is why
+      we have to check that the stride of the dataref in the inner-loop evenly
+      divides by the vector size.  */
+   if (nested_in_vect_loop_p (loop, stmt))
+     {
+       tree step = DR_STEP (dr);
+       HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
+     
+       if (dr_step % UNITS_PER_SIMD_WORD == 0)
+         {
+           if (vect_print_dump_info (REPORT_ALIGNMENT))
+             fprintf (vect_dump, "inner step divides the vector-size.");
+ 	  misalign = STMT_VINFO_DR_INIT (stmt_info);
+ 	  aligned_to = STMT_VINFO_DR_ALIGNED_TO (stmt_info);
+ 	  base_addr = STMT_VINFO_DR_BASE_ADDRESS (stmt_info);
+         }
+       else
+ 	{
+ 	  if (vect_print_dump_info (REPORT_ALIGNMENT))
+ 	    fprintf (vect_dump, "inner step doesn't divide the vector-size.");
+ 	  misalign = NULL_TREE;
+ 	}
+     }
+ 
    base = build_fold_indirect_ref (base_addr);
    vectype = STMT_VINFO_VECTYPE (stmt_info);
    alignment = ssize_int (TYPE_ALIGN (vectype)/BITS_PER_UNIT);
  
!   if ((aligned_to && tree_int_cst_compare (aligned_to, alignment) < 0)
!       || !misalign)
      {
!       if (vect_print_dump_info (REPORT_ALIGNMENT))
  	{
  	  fprintf (vect_dump, "Unknown alignment for access: ");
  	  print_generic_expr (vect_dump, base, TDF_SLIM);
*************** static bool
*** 1887,1906 ****
  vect_analyze_data_ref_access (struct data_reference *dr)
  {
    tree step = DR_STEP (dr);
-   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
    tree scalar_type = TREE_TYPE (DR_REF (dr));
    HOST_WIDE_INT type_size = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (scalar_type));
    tree stmt = DR_STMT (dr);
!   /* For interleaving, STRIDE is STEP counted in elements, i.e., the size of the 
!      interleaving group (including gaps).  */
!   HOST_WIDE_INT stride = dr_step / type_size;
  
!   if (!step)
      {
!       if (vect_print_dump_info (REPORT_DETAILS))
! 	fprintf (vect_dump, "bad data-ref access");
!       return false;
      }
  
    /* Consecutive?  */
    if (!tree_int_cst_compare (step, TYPE_SIZE_UNIT (scalar_type)))
--- 1918,1956 ----
  vect_analyze_data_ref_access (struct data_reference *dr)
  {
    tree step = DR_STEP (dr);
    tree scalar_type = TREE_TYPE (DR_REF (dr));
    HOST_WIDE_INT type_size = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (scalar_type));
    tree stmt = DR_STMT (dr);
!   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
!   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
!   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
!   HOST_WIDE_INT stride;
! 
!   /* Don't allow invariant accesses.  */
!   if (dr_step == 0)
!     return false; 
  
!   if (nested_in_vect_loop_p (loop, stmt))
      {
!       /* For the rest of the analysis we use the outer-loop step.  */
!       step = STMT_VINFO_DR_STEP (stmt_info);
!       dr_step = TREE_INT_CST_LOW (step);
!       
!       if (dr_step == 0)
! 	{
! 	  if (vect_print_dump_info (REPORT_ALIGNMENT))
! 	    fprintf (vect_dump, "zero step in outer loop.");
! 	  if (DR_IS_READ (dr))
!   	    return true; 
! 	  else
! 	    return false;
! 	}
      }
+     
+   /* For interleaving, STRIDE is STEP counted in elements, i.e., the size of the 
+      interleaving group (including gaps).  */
+   stride = dr_step / type_size; 
  
    /* Consecutive?  */
    if (!tree_int_cst_compare (step, TYPE_SIZE_UNIT (scalar_type)))
*************** vect_analyze_data_ref_access (struct dat
*** 1910,1915 ****
--- 1960,1972 ----
        return true;
      }
  
+   if (nested_in_vect_loop_p (loop, stmt))
+     {
+       if (vect_print_dump_info (REPORT_ALIGNMENT))
+ 	fprintf (vect_dump, "strided access in outer loop.");
+       return false;
+     }
+ 
    /* Not consecutive access is possible only if it is a part of interleaving.  */
    if (!DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)))
      {
*************** vect_analyze_data_refs (loop_vec_info lo
*** 2138,2143 ****
--- 2195,2201 ----
        tree stmt;
        stmt_vec_info stmt_info;
        basic_block bb;
+       tree base, offset, init;	
     
        if (!dr || !DR_REF (dr))
          {
*************** vect_analyze_data_refs (loop_vec_info lo
*** 2145,2180 ****
  	    fprintf (vect_dump, "not vectorized: unhandled data-ref ");
            return false;
          }
!  
!       /* Update DR field in stmt_vec_info struct.  */
        stmt = DR_STMT (dr);
        stmt_info = vinfo_for_stmt (stmt);
  
-       /* If outer-loop vectorization: we don't yet support datarefs
- 	 in the innermost loop.  */
-       bb = bb_for_stmt (stmt);
-       if (bb->loop_father != LOOP_VINFO_LOOP (loop_vinfo))
- 	{
- 	  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
- 	    fprintf (vect_dump, "not vectorized: data-ref in nested loop");
- 	  return false;
- 	}
- 
-       if (STMT_VINFO_DATA_REF (stmt_info))
-         {
-           if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
-             {
-               fprintf (vect_dump,
-                        "not vectorized: more than one data ref in stmt: ");
-               print_generic_expr (vect_dump, stmt, TDF_SLIM);
-             }
-           return false;
-         }
-       STMT_VINFO_DATA_REF (stmt_info) = dr;
-      
        /* Check that analysis of the data-ref succeeded.  */
        if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr)
!           || !DR_STEP (dr))   
          {
            if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
              {
--- 2203,2215 ----
  	    fprintf (vect_dump, "not vectorized: unhandled data-ref ");
            return false;
          }
! 
        stmt = DR_STMT (dr);
        stmt_info = vinfo_for_stmt (stmt);
  
        /* Check that analysis of the data-ref succeeded.  */
        if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr)
!           || !DR_STEP (dr))
          {
            if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
              {
*************** vect_analyze_data_refs (loop_vec_info lo
*** 2201,2207 ****
              }
            return false;
          }
!                        
        /* Set vectype for STMT.  */
        scalar_type = TREE_TYPE (DR_REF (dr));
        STMT_VINFO_VECTYPE (stmt_info) =
--- 2236,2362 ----
              }
            return false;
          }
! 
!       base = unshare_expr (DR_BASE_ADDRESS (dr));
!       offset = unshare_expr (DR_OFFSET (dr));
!       init = unshare_expr (DR_INIT (dr));
! 	
!       /* Update DR field in stmt_vec_info struct.  */
!       bb = bb_for_stmt (stmt);
! 
!       /* If the dataref is in an inner-loop of the loop that is considered
! 	 for vectorization, we also want to analyze the access relative to
! 	 the outer-loop (DR contains information only relative to the 
! 	 inner-most enclosing loop).  We do that by building a reference to the
! 	 first location accessed by the inner-loop, and analyze it relative to
! 	 the outer-loop.  */ 	
!       if (nested_in_vect_loop_p (loop, stmt)) 
! 	{
! 	  tree outer_step, outer_base, outer_init;
! 	  HOST_WIDE_INT pbitsize, pbitpos;
! 	  tree poffset;
! 	  enum machine_mode pmode;
! 	  int punsignedp, pvolatilep;
! 	  affine_iv base_iv, offset_iv;
! 	  tree dinit;
! 
! 	  /* Build a reference to the first location accessed by the 
! 	     inner-loop: *(BASE+INIT). (The first location is actually
! 	     BASE+INIT+OFFSET, but we add OFFSET separately later.  */
! 	  tree inner_base = build_fold_indirect_ref 
! 				(fold_build2 (PLUS_EXPR, TREE_TYPE (base), base, init));
! 
! 	  if (vect_print_dump_info (REPORT_DETAILS))
! 	    {
! 	      fprintf (dump_file, "analyze in outer-loop: ");
! 	      print_generic_expr (dump_file, inner_base, TDF_SLIM);
! 	    }
! 
! 	  outer_base = get_inner_reference (inner_base, &pbitsize, &pbitpos, 
! 		          &poffset, &pmode, &punsignedp, &pvolatilep, false);
! 	  gcc_assert (outer_base != NULL_TREE);
! 
! 	  if (pbitpos % BITS_PER_UNIT != 0)
! 	    {
! 	      if (vect_print_dump_info (REPORT_DETAILS))
! 		fprintf (dump_file, "failed: bit offset alignment.\n");
! 	      return false;
! 	    }
! 
! 	  outer_base = build_fold_addr_expr (outer_base);
! 	  if (!simple_iv (loop, stmt, outer_base, &base_iv, false))
! 	    {
! 	      if (vect_print_dump_info (REPORT_DETAILS))
! 		fprintf (dump_file, "failed: evolution of base is not affine.\n");
! 	      return false;
! 	    }
! 
! 	  if (offset)
! 	    {
! 	      if (poffset)
! 		poffset = fold_build2 (PLUS_EXPR, TREE_TYPE (offset), offset, poffset);
! 	      else
! 		poffset = offset;
! 	    }
! 
! 	  if (!poffset)
! 	    {
! 	      offset_iv.base = ssize_int (0);
! 	      offset_iv.step = ssize_int (0);
! 	    }
! 	  else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
! 	    {
! 	      if (vect_print_dump_info (REPORT_DETAILS))
! 	        fprintf (dump_file, "evolution of offset is not affine.\n");
! 	      return false;
! 	    }
! 
! 	  outer_init = ssize_int (pbitpos / BITS_PER_UNIT);
! 	  split_constant_offset (base_iv.base, &base_iv.base, &dinit);
! 	  outer_init =  size_binop (PLUS_EXPR, outer_init, dinit);
! 	  split_constant_offset (offset_iv.base, &offset_iv.base, &dinit);
! 	  outer_init =  size_binop (PLUS_EXPR, outer_init, dinit);
! 
! 	  outer_step = size_binop (PLUS_EXPR,
! 				fold_convert (ssizetype, base_iv.step),
! 				fold_convert (ssizetype, offset_iv.step));
! 
! 	  STMT_VINFO_DR_STEP (stmt_info) = outer_step;
! 	  /* FIXME: Use canonicalize_base_object_address (base_iv.base); */
! 	  STMT_VINFO_DR_BASE_ADDRESS (stmt_info) = base_iv.base; 
! 	  STMT_VINFO_DR_INIT (stmt_info) = outer_init;
! 	  STMT_VINFO_DR_OFFSET (stmt_info) = 
! 				fold_convert (ssizetype, offset_iv.base);
! 	  STMT_VINFO_DR_ALIGNED_TO (stmt_info) = 
! 				size_int (highest_pow2_factor (offset_iv.base));
! 
! 	  if (dump_file && (dump_flags & TDF_DETAILS))
! 	    {
! 	      fprintf (dump_file, "\touter base_address: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_BASE_ADDRESS (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter offset from base address: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_OFFSET (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter constant offset from base address: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_INIT (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter step: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_STEP (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter aligned to: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_ALIGNED_TO (stmt_info), TDF_SLIM);
! 	    }
! 	}
! 
!       if (STMT_VINFO_DATA_REF (stmt_info))
!         {
!           if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
!             {
!               fprintf (vect_dump,
!                        "not vectorized: more than one data ref in stmt: ");
!               print_generic_expr (vect_dump, stmt, TDF_SLIM);
!             }
!           return false;
!         }
!       STMT_VINFO_DATA_REF (stmt_info) = dr;
!      
        /* Set vectype for STMT.  */
        scalar_type = TREE_TYPE (DR_REF (dr));
        STMT_VINFO_VECTYPE (stmt_info) =
*** tree-vectorizer.c.patch1	2007-08-12 15:27:30.000000000 +0300
--- tree-vectorizer.c	2007-08-12 16:12:21.000000000 +0300
*************** new_stmt_vec_info (tree stmt, loop_vec_i
*** 1345,1350 ****
--- 1345,1357 ----
    STMT_VINFO_IN_PATTERN_P (res) = false;
    STMT_VINFO_RELATED_STMT (res) = NULL;
    STMT_VINFO_DATA_REF (res) = NULL;
+ 
+   STMT_VINFO_DR_BASE_ADDRESS (res) = NULL;
+   STMT_VINFO_DR_OFFSET (res) = NULL;
+   STMT_VINFO_DR_INIT (res) = NULL;
+   STMT_VINFO_DR_STEP (res) = NULL;
+   STMT_VINFO_DR_ALIGNED_TO (res) = NULL;
+ 
    if (TREE_CODE (stmt) == PHI_NODE && is_loop_header_bb_p (bb_for_stmt (stmt)))
      STMT_VINFO_DEF_TYPE (res) = vect_unknown_def_type;
    else
*************** get_vectype_for_scalar_type (tree scalar
*** 1651,1670 ****
  enum dr_alignment_support
  vect_supportable_dr_alignment (struct data_reference *dr)
  {
!   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr)));
    enum machine_mode mode = (int) TYPE_MODE (vectype);
  
    if (aligned_access_p (dr))
      return dr_aligned;
  
    /* Possibly unaligned access.  */
    
    if (DR_IS_READ (dr))
      {
        if (optab_handler (vec_realign_load_optab, mode)->insn_code != CODE_FOR_nothing
  	  && (!targetm.vectorize.builtin_mask_for_load
  	      || targetm.vectorize.builtin_mask_for_load ()))
! 	return dr_unaligned_software_pipeline;
  
        if (optab_handler (movmisalign_optab, mode)->insn_code != CODE_FOR_nothing)
  	/* Can't software pipeline the loads, but can at least do them.  */
--- 1658,1756 ----
  enum dr_alignment_support
  vect_supportable_dr_alignment (struct data_reference *dr)
  {
!   tree stmt = DR_STMT (dr);
!   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
!   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    enum machine_mode mode = (int) TYPE_MODE (vectype);
+   struct loop *vect_loop = LOOP_VINFO_LOOP (STMT_VINFO_LOOP_VINFO (stmt_info));
+   bool nested_in_vect_loop = nested_in_vect_loop_p (vect_loop, stmt);
+   bool invariant_in_outerloop = false;
  
    if (aligned_access_p (dr))
      return dr_aligned;
  
+   if (nested_in_vect_loop)
+     {
+       tree outerloop_step = STMT_VINFO_DR_STEP (stmt_info);
+       invariant_in_outerloop =
+ 	(tree_int_cst_compare (outerloop_step, size_zero_node) == 0);
+     }
+ 
    /* Possibly unaligned access.  */
    
+   /* We can choose between using the implicit realignment scheme (generating
+      a misaligned_move stmt) and the explicit realignment scheme (generating
+      aligned loads with a REALIGN_LOAD). There are two variants to the explicit
+      realignment scheme: optimized, and unoptimized.
+      We can optimize the realignment only if the step between consecutive
+      vector loads is equal to the vector size.  Since the vector memory
+      accesses advance in steps of VS (Vector Size) in the vectorized loop, it
+      is guaranteed that the misalignment amount remains the same throughout the
+      execution of the vectorized loop.  Therefore, we can create the
+      "realignment token" (the permutation mask that is passed to REALIGN_LOAD)
+      at the loop preheader.
+ 
+      However, in the case of outer-loop vectorization, when vectorizing a
+      memory access in the inner-loop nested within the LOOP that is now being
+      vectorized, while it is guaranteed that the misalignment of the
+      vectorized memory access will remain the same in different outer-loop
+      iterations, it is *not* guaranteed that it will remain the same throughout
+      the execution of the inner-loop.  This is because the inner-loop advances
+      with the original scalar step (and not in steps of VS).  If the inner-loop
+      step happens to be a multiple of VS, then the misalignment remains fixed
+      and we can use the optimized realignment scheme.  For example:
+ 
+       for (i=0; i<N; i++)
+         for (j=0; j<M; j++)
+           s += a[i+j];
+ 
+      When vectorizing the i-loop in the above example, the step between
+      consecutive vector loads is 1, and so the misalignment does not remain
+      fixed across the execution of the inner-loop, and the realignment cannot
+      be optimized (as illustrated in the following pseudo vectorized loop):
+ 
+       for (i=0; i<N; i+=4)
+         for (j=0; j<M; j++){
+           vs += vp[i+j]; // misalignment of &vp[i+j] is {0,1,2,3,0,1,2,3,...}
+                          // when j is {0,1,2,3,4,5,6,7,...} respectively.
+                          // (assuming that we start from an aligned address).
+           }
+ 
+      We therefore have to use the unoptimized realignment scheme:
+ 
+       for (i=0; i<N; i+=4)
+           for (j=k; j<M; j+=4)
+           vs += vp[i+j]; // misalignment of &vp[i+j] is always k (assuming
+                            // that the misalignment of the initial address is
+                            // 0).
+ 
+      The loop can then be vectorized as follows:
+ 
+       for (k=0; k<4; k++){
+         rt = get_realignment_token (&vp[k]);
+         for (i=0; i<N; i+=4){
+           v1 = vp[i+k];
+           for (j=k; j<M; j+=4){
+             v2 = vp[i+j+VS-1];
+             va = REALIGN_LOAD <v1,v2,rt>;
+             vs += va;
+             v1 = v2;
+           }
+         }
+     } */
+ 
    if (DR_IS_READ (dr))
      {
        if (optab_handler (vec_realign_load_optab, mode)->insn_code != CODE_FOR_nothing
  	  && (!targetm.vectorize.builtin_mask_for_load
  	      || targetm.vectorize.builtin_mask_for_load ()))
! 	{
! 	    if (nested_in_vect_loop
! 	      && TREE_INT_CST_LOW (DR_STEP (dr)) != UNITS_PER_SIMD_WORD)
! 	    return dr_explicit_realign;
! 	  else
! 	    return dr_explicit_realign_optimized;
! 	}
  
        if (optab_handler (movmisalign_optab, mode)->insn_code != CODE_FOR_nothing)
  	/* Can't software pipeline the loads, but can at least do them.  */
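
The realignment discussion in the comment above can be sketched in plain scalar C. This is only an illustration under stated assumptions, not vectorizer code: VS is fixed at 4 elements, the base address is assumed aligned, and the helper names (misalignment, misalignment_is_fixed, emulate_realign_load) are hypothetical. The emulation stands in for REALIGN_LOAD <v1,v2,rt>, with a plain element offset playing the role of the realignment token; the real operation is target-specific. Note that the patch itself compares DR_STEP against UNITS_PER_SIMD_WORD for equality, whereas the sketch uses the "multiple of VS" wording from the comment.

```c
#include <assert.h>
#include <stddef.h>

#define VS 4  /* elements per vector; an assumption for this sketch */

/* Misalignment (in elements) of &vp[idx], assuming vp[0] is aligned.  */
static size_t
misalignment (size_t idx)
{
  return idx % VS;
}

/* Return 1 if the misalignment stays the same when the index advances
   by STEP elements per iteration, 0 otherwise.  */
static int
misalignment_is_fixed (size_t step)
{
  return step % VS == 0;
}

/* Scalar emulation of REALIGN_LOAD <v1, v2, rt>.  V1 and V2 are two
   consecutive aligned vectors; OFF (standing in for the realignment
   token) is the misalignment in elements.  OUT receives the VS elements
   that start OFF elements into V1 and spill over into V2.  */
static void
emulate_realign_load (const int *v1, const int *v2, size_t off, int *out)
{
  assert (off < VS);
  for (size_t t = 0; t < VS; t++)
    out[t] = (off + t < VS) ? v1[off + t] : v2[off + t - VS];
}
```

With a step of 1 the misalignment cycles through 0,1,2,3, so the token would have to be recomputed in the inner loop; with a step of VS it stays at its initial value and can be hoisted.
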
*** tree-vectorizer.h.patch1	2007-08-12 15:27:35.000000000 +0300
--- tree-vectorizer.h	2007-08-12 15:31:34.000000000 +0300
*************** enum operation_type {
*** 53,59 ****
  enum dr_alignment_support {
    dr_unaligned_unsupported,
    dr_unaligned_supported,
!   dr_unaligned_software_pipeline,
    dr_aligned
  };
  
--- 53,60 ----
  enum dr_alignment_support {
    dr_unaligned_unsupported,
    dr_unaligned_supported,
!   dr_explicit_realign,
!   dr_explicit_realign_optimized,
    dr_aligned
  };
  
*************** typedef struct _stmt_vec_info {
*** 244,252 ****
       data-ref (array/pointer/struct access). A GIMPLE stmt is expected to have 
       at most one such data-ref.  **/
  
!   /* Information about the data-ref (access function, etc).  */
    struct data_reference *data_ref_info;
  
    /* Stmt is part of some pattern (computation idiom)  */
    bool in_pattern_p;
  
--- 245,262 ----
       data-ref (array/pointer/struct access). A GIMPLE stmt is expected to have 
       at most one such data-ref.  **/
  
!   /* Information about the data-ref (access function, etc),
!      relative to the inner-most containing loop.  */
    struct data_reference *data_ref_info;
  
+   /* Information about the data-ref relative to this loop
+      nest (the loop that is being considered for vectorization).  */
+   tree dr_base_address;
+   tree dr_init;
+   tree dr_offset;
+   tree dr_step;
+   tree dr_aligned_to;
+ 
    /* Stmt is part of some pattern (computation idiom)  */
    bool in_pattern_p;
  
*************** typedef struct _stmt_vec_info {
*** 305,310 ****
--- 315,327 ----
  #define STMT_VINFO_VECTYPE(S)              (S)->vectype
  #define STMT_VINFO_VEC_STMT(S)             (S)->vectorized_stmt
  #define STMT_VINFO_DATA_REF(S)             (S)->data_ref_info
+ 
+ #define STMT_VINFO_DR_BASE_ADDRESS(S)      (S)->dr_base_address
+ #define STMT_VINFO_DR_INIT(S)              (S)->dr_init
+ #define STMT_VINFO_DR_OFFSET(S)            (S)->dr_offset
+ #define STMT_VINFO_DR_STEP(S)              (S)->dr_step
+ #define STMT_VINFO_DR_ALIGNED_TO(S)        (S)->dr_aligned_to
+ 
  #define STMT_VINFO_IN_PATTERN_P(S)         (S)->in_pattern_p
  #define STMT_VINFO_RELATED_STMT(S)         (S)->related_stmt
  #define STMT_VINFO_SAME_ALIGN_REFS(S)      (S)->same_align_refs
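
As a rough illustration of why the new STMT_VINFO_DR_* fields sit next to STMT_VINFO_DATA_REF: for a reference such as in[i+j] (with 'in' pointing to short) in a j-loop nested in an i-loop, the first address accessed depends on which loop the reference is described relative to. The sketch below is plain scalar C with hypothetical names (struct dr_view and the two helpers); it only mirrors the idea and is not the vectorizer's data structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified description of a data reference as seen
   from one loop.  All values are in bytes, relative to &in[0].  */
struct dr_view
{
  size_t first;  /* offset of the first address accessed in that loop */
  size_t step;   /* bytes advanced per iteration of that loop */
};

/* in[i+j] as seen from the inner j-loop, for a given outer index I:
   the first access is &in + I*ELEM_SIZE, and j advances one element.  */
static struct dr_view
view_in_inner_loop (size_t i, size_t elem_size)
{
  assert (elem_size > 0);
  struct dr_view v = { i * elem_size, elem_size };
  return v;
}

/* in[i+j] as seen from the outer i-loop: the first access is &in,
   and i advances one element per iteration.  */
static struct dr_view
view_in_outer_loop (size_t elem_size)
{
  assert (elem_size > 0);
  struct dr_view v = { 0, elem_size };
  return v;
}
```

For 2-byte elements and i == 3, the inner-loop view starts 6 bytes past &in, while the outer-loop view starts at &in itself.
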
*** tree-vect-transform.c.patch1	2007-08-12 15:27:53.000000000 +0300
--- tree-vect-transform.c	2007-08-12 16:13:27.000000000 +0300
*************** along with GCC; see the file COPYING3.  
*** 49,62 ****
  static bool vect_transform_stmt (tree, block_stmt_iterator *, bool *);
  static tree vect_create_destination_var (tree, tree);
  static tree vect_create_data_ref_ptr 
!   (tree, block_stmt_iterator *, tree, tree *, tree *, bool, tree); 
! static tree vect_create_addr_base_for_vector_ref (tree, tree *, tree);
! static tree vect_setup_realignment (tree, block_stmt_iterator *, tree *);
  static tree vect_get_new_vect_var (tree, enum vect_var_kind, const char *);
  static tree vect_get_vec_def_for_operand (tree, tree, tree *);
! static tree vect_init_vector (tree, tree, tree);
  static void vect_finish_stmt_generation 
!   (tree stmt, tree vec_stmt, block_stmt_iterator *bsi);
  static bool vect_is_simple_cond (tree, loop_vec_info); 
  static void vect_create_epilog_for_reduction (tree, tree, enum tree_code, tree);
  static tree get_initial_def_for_reduction (tree, tree, tree *);
--- 49,62 ----
  static bool vect_transform_stmt (tree, block_stmt_iterator *, bool *);
  static tree vect_create_destination_var (tree, tree);
  static tree vect_create_data_ref_ptr 
!   (tree, struct loop*, tree, tree *, tree *, bool, tree, bool *); 
! static tree vect_create_addr_base_for_vector_ref 
!   (tree, tree *, tree, struct loop *);
  static tree vect_get_new_vect_var (tree, enum vect_var_kind, const char *);
  static tree vect_get_vec_def_for_operand (tree, tree, tree *);
! static tree vect_init_vector (tree, tree, tree, block_stmt_iterator *);
  static void vect_finish_stmt_generation 
!   (tree stmt, tree vec_stmt, block_stmt_iterator *);
  static bool vect_is_simple_cond (tree, loop_vec_info); 
  static void vect_create_epilog_for_reduction (tree, tree, enum tree_code, tree);
  static tree get_initial_def_for_reduction (tree, tree, tree *);
*************** vect_model_reduction_cost (stmt_vec_info
*** 371,376 ****
--- 371,378 ----
    enum machine_mode mode;
    tree operation = GIMPLE_STMT_OPERAND (STMT_VINFO_STMT (stmt_info), 1);
    int op_type = TREE_CODE_LENGTH (TREE_CODE (operation));
+   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
  
    /* Cost of reduction op inside loop.  */
    STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info) += ncopies * TARG_VEC_STMT_COST;
*************** vect_model_reduction_cost (stmt_vec_info
*** 393,422 ****
       We have a reduction operator that will reduce the vector in one statement.
       Also requires scalar extract.  */
  
!   if (reduc_code < NUM_TREE_CODES) 
!     outer_cost += TARG_VEC_STMT_COST + TARG_VEC_TO_SCALAR_COST;
!   else 
      {
!       int vec_size_in_bits = tree_low_cst (TYPE_SIZE (vectype), 1);
!       tree bitsize =
! 	TYPE_SIZE (TREE_TYPE ( GIMPLE_STMT_OPERAND (orig_stmt, 0)));
!       int element_bitsize = tree_low_cst (bitsize, 1);
!       int nelements = vec_size_in_bits / element_bitsize;
! 
!       optab = optab_for_tree_code (code, vectype);
! 
!       /* We have a whole vector shift available.  */
!       if (VECTOR_MODE_P (mode)
! 	  && optab_handler (optab, mode)->insn_code != CODE_FOR_nothing
! 	  && optab_handler (vec_shr_optab, mode)->insn_code != CODE_FOR_nothing)
!         /* Final reduction via vector shifts and the reduction operator. Also
!            requires scalar extract.  */
! 	outer_cost += ((exact_log2(nelements) * 2) * TARG_VEC_STMT_COST
! 			+ TARG_VEC_TO_SCALAR_COST); 
!       else
! 	/* Use extracts and reduction op for final reduction.  For N elements,
!            we have N extracts and N-1 reduction ops.  */
! 	outer_cost += ((nelements + nelements - 1) * TARG_VEC_STMT_COST);
      }
  
    STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info) = outer_cost;
--- 395,427 ----
       We have a reduction operator that will reduce the vector in one statement.
       Also requires scalar extract.  */
  
!   if (!nested_in_vect_loop_p (loop, orig_stmt))
      {
!       if (reduc_code < NUM_TREE_CODES) 
! 	outer_cost += TARG_VEC_STMT_COST + TARG_VEC_TO_SCALAR_COST;
!       else 
! 	{
! 	  int vec_size_in_bits = tree_low_cst (TYPE_SIZE (vectype), 1);
! 	  tree bitsize =
! 	    TYPE_SIZE (TREE_TYPE ( GIMPLE_STMT_OPERAND (orig_stmt, 0)));
! 	  int element_bitsize = tree_low_cst (bitsize, 1);
! 	  int nelements = vec_size_in_bits / element_bitsize;
! 
! 	  optab = optab_for_tree_code (code, vectype);
! 
! 	  /* We have a whole vector shift available.  */
! 	  if (VECTOR_MODE_P (mode)
! 	      && optab_handler (optab, mode)->insn_code != CODE_FOR_nothing
! 	      && optab_handler (vec_shr_optab, mode)->insn_code != CODE_FOR_nothing)
! 	    /* Final reduction via vector shifts and the reduction operator. Also
! 	       requires scalar extract.  */
! 	    outer_cost += ((exact_log2(nelements) * 2) * TARG_VEC_STMT_COST
! 				+ TARG_VEC_TO_SCALAR_COST); 
! 	  else
! 	    /* Use extracts and reduction op for final reduction.  For N elements,
!                we have N extracts and N-1 reduction ops.  */
! 	    outer_cost += ((nelements + nelements - 1) * TARG_VEC_STMT_COST);
! 	}
      }
  
    STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info) = outer_cost;
*************** vect_model_load_cost (stmt_vec_info stmt
*** 609,615 ****
  
          break;
        }
!     case dr_unaligned_software_pipeline:
        {
          int outer_cost = 0;
  
--- 614,632 ----
  
          break;
        }
!     case dr_explicit_realign:
!       {
!         inner_cost += ncopies * (2*TARG_VEC_LOAD_COST + TARG_VEC_STMT_COST);
! 
!         /* FIXME: If the misalignment remains fixed across the iterations of
!            the containing loop, the following cost should be added to the
!            outside costs.  */
!         if (targetm.vectorize.builtin_mask_for_load)
!           inner_cost += TARG_VEC_STMT_COST;
! 
!         break;
!       }
!     case dr_explicit_realign_optimized:
        {
          int outer_cost = 0;
  
*************** vect_get_new_vect_var (tree type, enum v
*** 706,711 ****
--- 723,741 ----
     STMT: The statement containing the data reference.
     NEW_STMT_LIST: Must be initialized to NULL_TREE or a statement list.
     OFFSET: Optional. If supplied, it is be added to the initial address.
+    LOOP:    Specifies the loop-nest relative to which the address is computed.
+             For example, when the dataref is in an inner-loop nested in an
+ 	    outer-loop that is now being vectorized, LOOP can be either the
+ 	    outer-loop, or the inner-loop. The first memory location accessed
+ 	    by the following dataref ('in' points to short):
+ 
+ 		for (i=0; i<N; i++)
+ 		   for (j=0; j<M; j++)
+ 		     s += in[i+j]
+ 
+ 	    is as follows:
+ 	    if LOOP=i_loop:	&in		(relative to i_loop)
+ 	    if LOOP=j_loop: 	&in+i*2B	(relative to j_loop)
  
     Output:
     1. Return an SSA_NAME whose value is the address of the memory location of 
*************** vect_get_new_vect_var (tree type, enum v
*** 718,731 ****
  static tree
  vect_create_addr_base_for_vector_ref (tree stmt,
                                        tree *new_stmt_list,
! 				      tree offset)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
!   tree data_ref_base_expr = unshare_expr (DR_BASE_ADDRESS (dr));
!   tree base_name = build_fold_indirect_ref (data_ref_base_expr);
    tree data_ref_base_var;
-   tree data_ref_base;
    tree new_base_stmt;
    tree vec_stmt;
    tree addr_base, addr_expr;
--- 748,762 ----
  static tree
  vect_create_addr_base_for_vector_ref (tree stmt,
                                        tree *new_stmt_list,
! 				      tree offset,
! 				      struct loop *loop)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
!   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
!   tree data_ref_base = unshare_expr (DR_BASE_ADDRESS (dr));
!   tree base_name;
    tree data_ref_base_var;
    tree new_base_stmt;
    tree vec_stmt;
    tree addr_base, addr_expr;
*************** vect_create_addr_base_for_vector_ref (tr
*** 733,744 ****
    tree base_offset = unshare_expr (DR_OFFSET (dr));
    tree init = unshare_expr (DR_INIT (dr));
    tree vect_ptr_type, addr_expr2;
!   
!   
!   /* Create data_ref_base */
!   data_ref_base_var = create_tmp_var (TREE_TYPE (data_ref_base_expr), "batmp");
    add_referenced_var (data_ref_base_var);
!   data_ref_base = force_gimple_operand (data_ref_base_expr, &new_base_stmt,
  					true, data_ref_base_var);
    append_to_statement_list_force(new_base_stmt, new_stmt_list);
  
--- 764,789 ----
    tree base_offset = unshare_expr (DR_OFFSET (dr));
    tree init = unshare_expr (DR_INIT (dr));
    tree vect_ptr_type, addr_expr2;
!   tree step = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
! 
!   gcc_assert (loop);
!   if (loop != containing_loop)
!     {
!       loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
!       struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
! 
!       gcc_assert (nested_in_vect_loop_p (loop, stmt));
! 
!       data_ref_base = unshare_expr (STMT_VINFO_DR_BASE_ADDRESS (stmt_info));
!       base_offset = unshare_expr (STMT_VINFO_DR_OFFSET (stmt_info));
!       init = unshare_expr (STMT_VINFO_DR_INIT (stmt_info));
!     }
! 
!   /* Create data_ref_base */
!   base_name = build_fold_indirect_ref (data_ref_base);
!   data_ref_base_var = create_tmp_var (TREE_TYPE (data_ref_base), "batmp");
    add_referenced_var (data_ref_base_var);
!   data_ref_base = force_gimple_operand (data_ref_base, &new_base_stmt,
  					true, data_ref_base_var);
    append_to_statement_list_force(new_base_stmt, new_stmt_list);
  
*************** vect_create_addr_base_for_vector_ref (tr
*** 753,768 ****
    if (offset)
      {
        tree tmp = create_tmp_var (sizetype, "offset");
-       tree step; 
- 
-       /* For interleaved access step we divide STEP by the size of the
-         interleaving group.  */
-       if (DR_GROUP_SIZE (stmt_info))
- 	step = fold_build2 (TRUNC_DIV_EXPR, TREE_TYPE (offset), DR_STEP (dr),
- 			    build_int_cst (TREE_TYPE (offset),
- 					   DR_GROUP_SIZE (stmt_info)));
-       else
- 	step = DR_STEP (dr);
  
        add_referenced_var (tmp);
        offset = fold_build2 (MULT_EXPR, TREE_TYPE (offset), offset, step);
--- 798,803 ----
*************** vect_create_addr_base_for_vector_ref (tr
*** 811,817 ****
     1. STMT: a stmt that references memory. Expected to be of the form
           GIMPLE_MODIFY_STMT <name, data-ref> or
  	 GIMPLE_MODIFY_STMT <data-ref, name>.
!    2. BSI: block_stmt_iterator where new stmts can be added.
     3. OFFSET (optional): an offset to be added to the initial address accessed
          by the data-ref in STMT.
     4. ONLY_INIT: indicate if vp is to be updated in the loop, or remain
--- 846,852 ----
     1. STMT: a stmt that references memory. Expected to be of the form
           GIMPLE_MODIFY_STMT <name, data-ref> or
  	 GIMPLE_MODIFY_STMT <data-ref, name>.
!    2. AT_LOOP: the loop where the vector memref is to be created.
     3. OFFSET (optional): an offset to be added to the initial address accessed
          by the data-ref in STMT.
     4. ONLY_INIT: indicate if vp is to be updated in the loop, or remain
*************** vect_create_addr_base_for_vector_ref (tr
*** 838,855 ****
  
        Return the increment stmt that updates the pointer in PTR_INCR.
  
!    3. Return the pointer.  */
  
  static tree
! vect_create_data_ref_ptr (tree stmt,
! 			  block_stmt_iterator *bsi ATTRIBUTE_UNUSED,
  			  tree offset, tree *initial_address, tree *ptr_incr,
! 			  bool only_init, tree type)
  {
    tree base_name;
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree vect_ptr_type;
    tree vect_ptr;
--- 873,894 ----
  
        Return the increment stmt that updates the pointer in PTR_INCR.
  
!    3. Set INV_P to true if the access pattern of the data reference in the 
!       vectorized loop is invariant. Set it to false otherwise.
! 
!    4. Return the pointer.  */
  
  static tree
! vect_create_data_ref_ptr (tree stmt, struct loop *at_loop,
  			  tree offset, tree *initial_address, tree *ptr_incr,
! 			  bool only_init, tree type, bool *inv_p)
  {
    tree base_name;
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+   bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
+   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree vect_ptr_type;
    tree vect_ptr;
*************** vect_create_data_ref_ptr (tree stmt,
*** 857,867 ****
    tree new_temp;
    tree vec_stmt;
    tree new_stmt_list = NULL_TREE;
!   edge pe = loop_preheader_edge (loop);
    basic_block new_bb;
    tree vect_ptr_init;
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
  
    base_name =  build_fold_indirect_ref (unshare_expr (DR_BASE_ADDRESS (dr)));
  
    if (vect_print_dump_info (REPORT_DETAILS))
--- 896,926 ----
    tree new_temp;
    tree vec_stmt;
    tree new_stmt_list = NULL_TREE;
!   edge pe;
    basic_block new_bb;
    tree vect_ptr_init;
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+   tree vptr;
+   block_stmt_iterator incr_bsi;
+   bool insert_after;
+   tree indx_before_incr, indx_after_incr;
+   tree incr;
+   tree step;
+ 
+   /* Check the step (evolution) of the load in LOOP, and record
+      whether it's invariant.  */
+   if (nested_in_vect_loop)
+     step = STMT_VINFO_DR_STEP (stmt_info);
+   else
+     step = DR_STEP (STMT_VINFO_DATA_REF (stmt_info));
+     
+   if (tree_int_cst_compare (step, size_zero_node) == 0)
+     *inv_p = true;
+   else
+     *inv_p = false;
  
+   /* Create an expression for the first address accessed by this load
+      in LOOP.  */ 
    base_name =  build_fold_indirect_ref (unshare_expr (DR_BASE_ADDRESS (dr)));
  
    if (vect_print_dump_info (REPORT_DETAILS))
*************** vect_create_data_ref_ptr (tree stmt,
*** 904,915 ****
  
    var_ann (vect_ptr)->subvars = DR_SUBVARS (dr);
  
    /** (3) Calculate the initial address the vector-pointer, and set
            the vector-pointer to point to it before the loop:  **/
  
    /* Create: (&(base[init_val+offset]) in the loop preheader.  */
    new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
!                                                    offset);
    pe = loop_preheader_edge (loop);
    new_bb = bsi_insert_on_edge_immediate (pe, new_stmt_list);
    gcc_assert (!new_bb);
--- 963,1006 ----
  
    var_ann (vect_ptr)->subvars = DR_SUBVARS (dr);
  
+   /** Note: If the dataref is in an inner-loop nested in LOOP, and we are
+       vectorizing LOOP (i.e. outer-loop vectorization), we need to create two
+       def-use update cycles for the pointer: One relative to the outer-loop
+       (LOOP), which is what steps (3) and (4) below do. The other is relative
+       to the inner-loop (which is the inner-most loop containing the dataref),
+       and this is done by step (5) below.
+ 
+       When vectorizing inner-most loops, the vectorized loop (LOOP) is also the
+       inner-most loop, and so steps (3),(4) work the same, and step (5) is
+       redundant.  Steps (3),(4) create the following:
+ 
+ 	vp0 = &base_addr;
+ 	LOOP:	vp1 = phi(vp0,vp2)
+ 		...  
+ 		...
+ 		vp2 = vp1 + step
+ 		goto LOOP
+ 			
+       If there is an inner-loop nested in loop, then step (5) will also be
+       applied, and an additional update in the inner-loop will be created:
+ 
+ 	vp0 = &base_addr;
+ 	LOOP:   vp1 = phi(vp0,vp2)
+ 		...
+         inner:     vp3 = phi(vp1,vp4)
+ 	           vp4 = vp3 + inner_step
+ 	           if () goto inner
+ 		...
+ 		vp2 = vp1 + step
+ 		if () goto LOOP   */
+ 
    /** (3) Calculate the initial address the vector-pointer, and set
            the vector-pointer to point to it before the loop:  **/
  
    /* Create: (&(base[init_val+offset]) in the loop preheader.  */
+ 
    new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
!                                                    offset, loop);
    pe = loop_preheader_edge (loop);
    new_bb = bsi_insert_on_edge_immediate (pe, new_stmt_list);
    gcc_assert (!new_bb);
*************** vect_create_data_ref_ptr (tree stmt,
*** 924,948 ****
    gcc_assert (!new_bb);
  
  
!   /** (4) Handle the updating of the vector-pointer inside the loop: **/
  
!   if (only_init) /* No update in loop is required.  */
      {
        /* Copy the points-to information if it exists. */
        if (DR_PTR_INFO (dr))
          duplicate_ssa_name_ptr_info (vect_ptr_init, DR_PTR_INFO (dr));
!       return vect_ptr_init;
      }
    else
      {
!       block_stmt_iterator incr_bsi;
!       bool insert_after;
!       tree indx_before_incr, indx_after_incr;
!       tree incr;
  
        standard_iv_increment_position (loop, &incr_bsi, &insert_after);
        create_iv (vect_ptr_init,
! 		 fold_convert (vect_ptr_type, TYPE_SIZE_UNIT (vectype)),
  		 NULL_TREE, loop, &incr_bsi, insert_after,
  		 &indx_before_incr, &indx_after_incr);
        incr = bsi_stmt (incr_bsi);
--- 1015,1045 ----
    gcc_assert (!new_bb);
  
  
!   /** (4) Handle the updating of the vector-pointer inside the loop.
! 	  This is needed when ONLY_INIT is false, and also when AT_LOOP
! 	  is the inner-loop nested in LOOP (during outer-loop vectorization).
!    **/
  
!   if (only_init && at_loop == loop) /* No update in loop is required.  */
      {
        /* Copy the points-to information if it exists. */
        if (DR_PTR_INFO (dr))
          duplicate_ssa_name_ptr_info (vect_ptr_init, DR_PTR_INFO (dr));
!       vptr = vect_ptr_init;
      }
    else
      {
!       /* The step of the vector pointer is the Vector Size.  */
!       tree step = TYPE_SIZE_UNIT (vectype);
!       /* One exception to the above is when the scalar step of the load in 
! 	 LOOP is zero. In this case the step here is also zero.  */
!       if (*inv_p)
! 	step = size_zero_node;
  
        standard_iv_increment_position (loop, &incr_bsi, &insert_after);
+ 
        create_iv (vect_ptr_init,
! 		 fold_convert (vect_ptr_type, step),
  		 NULL_TREE, loop, &incr_bsi, insert_after,
  		 &indx_before_incr, &indx_after_incr);
        incr = bsi_stmt (incr_bsi);
*************** vect_create_data_ref_ptr (tree stmt,
*** 960,974 ****
        if (ptr_incr)
  	*ptr_incr = incr;
  
!       return indx_before_incr;
      }
  }
  
  
  /* Function bump_vector_ptr
  
!    Increment a pointer (to a vector type) by vector-size. Connect the new 
!    increment stmt to the existing def-use update-chain of the pointer.
  
     The pointer def-use update-chain before this function:
                          DATAREF_PTR = phi (p_0, p_2)
--- 1057,1107 ----
        if (ptr_incr)
  	*ptr_incr = incr;
  
!       vptr = indx_before_incr;
      }
+ 
+   if (!nested_in_vect_loop || only_init)
+     return vptr;
+ 
+ 
+   /** (5) Handle the updating of the vector-pointer inside the inner-loop
+ 	  nested in LOOP, if exists: **/
+ 
+   gcc_assert (nested_in_vect_loop);
+   if (!only_init)
+     {
+       standard_iv_increment_position (containing_loop, &incr_bsi, 
+ 				      &insert_after);
+       create_iv (vptr, fold_convert (vect_ptr_type, DR_STEP (dr)), NULL_TREE, 
+ 		 containing_loop, &incr_bsi, insert_after, &indx_before_incr, 
+ 		 &indx_after_incr);
+       incr = bsi_stmt (incr_bsi);
+       set_stmt_info (stmt_ann (incr), new_stmt_vec_info (incr, loop_vinfo));
+ 
+       /* Copy the points-to information if it exists. */
+       if (DR_PTR_INFO (dr))
+ 	{
+ 	  duplicate_ssa_name_ptr_info (indx_before_incr, DR_PTR_INFO (dr));
+ 	  duplicate_ssa_name_ptr_info (indx_after_incr, DR_PTR_INFO (dr));
+ 	}
+       merge_alias_info (vect_ptr_init, indx_before_incr);
+       merge_alias_info (vect_ptr_init, indx_after_incr);
+       if (ptr_incr)
+ 	*ptr_incr = incr;
+ 
+       return indx_before_incr; 
+     }
+   else
+     gcc_unreachable ();
  }
  
  
  /* Function bump_vector_ptr
  
!    Increment a pointer (to a vector type) by vector-size. If requested,
!    i.e. if PTR_INCR is given, then also connect the new increment stmt
!    to the existing def-use update-chain of the pointer, by modifying
!    the PTR_INCR as illustrated below:
  
     The pointer def-use update-chain before this function:
                          DATAREF_PTR = phi (p_0, p_2)
*************** vect_create_data_ref_ptr (tree stmt,
*** 978,995 ****
     The pointer def-use update-chain after this function:
                          DATAREF_PTR = phi (p_0, p_2)
                          ....
!                         NEW_DATAREF_PTR = DATAREF_PTR + vector_size
                          ....
          PTR_INCR:       p_2 = NEW_DATAREF_PTR + step
  
     Input:
     DATAREF_PTR - ssa_name of a pointer (to vector type) that is being updated 
                   in the loop.
!    PTR_INCR - the stmt that updates the pointer in each iteration of the loop.
!               The increment amount across iterations is also expected to be
!               vector_size.      
     BSI - location where the new update stmt is to be placed.
     STMT - the original scalar memory-access stmt that is being vectorized.
  
     Output: Return NEW_DATAREF_PTR as illustrated above.
     
--- 1111,1130 ----
     The pointer def-use update-chain after this function:
                          DATAREF_PTR = phi (p_0, p_2)
                          ....
!                         NEW_DATAREF_PTR = DATAREF_PTR + BUMP
                          ....
          PTR_INCR:       p_2 = NEW_DATAREF_PTR + step
  
     Input:
     DATAREF_PTR - ssa_name of a pointer (to vector type) that is being updated 
                   in the loop.
!    PTR_INCR - optional. The stmt that updates the pointer in each iteration of 
! 	      the loop.  The increment amount across iterations is expected
! 	      to be vector_size.      
     BSI - location where the new update stmt is to be placed.
     STMT - the original scalar memory-access stmt that is being vectorized.
+    BUMP - optional. The offset by which to bump the pointer. If not given,
+ 	  the offset is assumed to be vector_size.
  
     Output: Return NEW_DATAREF_PTR as illustrated above.
     
*************** vect_create_data_ref_ptr (tree stmt,
*** 997,1003 ****
  
  static tree
  bump_vector_ptr (tree dataref_ptr, tree ptr_incr, block_stmt_iterator *bsi,
!                  tree stmt)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
--- 1132,1138 ----
  
  static tree
  bump_vector_ptr (tree dataref_ptr, tree ptr_incr, block_stmt_iterator *bsi,
!                  tree stmt, tree bump)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
*************** bump_vector_ptr (tree dataref_ptr, tree 
*** 1010,1015 ****
--- 1145,1153 ----
    use_operand_p use_p;
    tree new_dataref_ptr;
  
+   if (bump)
+     update = bump;
+     
    incr_stmt = build_gimple_modify_stmt (ptr_var,
  					build2 (POINTER_PLUS_EXPR, vptr_type,
  						dataref_ptr, update));
*************** bump_vector_ptr (tree dataref_ptr, tree 
*** 1017,1022 ****
--- 1155,1168 ----
    GIMPLE_STMT_OPERAND (incr_stmt, 0) = new_dataref_ptr;
    vect_finish_stmt_generation (stmt, incr_stmt, bsi);
  
+   /* Copy the points-to information if it exists. */
+   if (DR_PTR_INFO (dr))
+     duplicate_ssa_name_ptr_info (new_dataref_ptr, DR_PTR_INFO (dr));
+   merge_alias_info (new_dataref_ptr, dataref_ptr);
+ 
+   if (!ptr_incr)
+     return new_dataref_ptr;
+ 
    /* Update the vector-pointer's cross-iteration increment.  */
    FOR_EACH_SSA_USE_OPERAND (use_p, ptr_incr, iter, SSA_OP_USE)
      {
*************** bump_vector_ptr (tree dataref_ptr, tree 
*** 1028,1038 ****
          gcc_assert (tree_int_cst_compare (use, update) == 0);
      }
  
-   /* Copy the points-to information if it exists. */
-   if (DR_PTR_INFO (dr))
-     duplicate_ssa_name_ptr_info (new_dataref_ptr, DR_PTR_INFO (dr));
-   merge_alias_info (new_dataref_ptr, dataref_ptr);
- 
    return new_dataref_ptr;
  }
  
--- 1174,1179 ----
*************** vect_create_destination_var (tree scalar
*** 1067,1081 ****
  /* Function vect_init_vector.
  
     Insert a new stmt (INIT_STMT) that initializes a new vector variable with
!    the vector elements of VECTOR_VAR. Return the DEF of INIT_STMT. It will be
!    used in the vectorization of STMT.  */
  
  static tree
! vect_init_vector (tree stmt, tree vector_var, tree vector_type)
  {
    stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
-   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
-   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree new_var;
    tree init_stmt;
    tree vec_oprnd;
--- 1208,1223 ----
  /* Function vect_init_vector.
  
     Insert a new stmt (INIT_STMT) that initializes a new vector variable with
!    the vector elements of VECTOR_VAR. Place the initialization at BSI if it
!    is not NULL. Otherwise, place the initialization at the loop preheader.
!    Return the DEF of INIT_STMT. 
!    It will be used in the vectorization of STMT.  */
  
  static tree
! vect_init_vector (tree stmt, tree vector_var, tree vector_type,
! 		  block_stmt_iterator *bsi)
  {
    stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
    tree new_var;
    tree init_stmt;
    tree vec_oprnd;
*************** vect_init_vector (tree stmt, tree vector
*** 1083,1101 ****
    tree new_temp;
    basic_block new_bb;
   
-   if (nested_in_vect_loop_p (loop, stmt))
-     loop = loop->inner;
- 
    new_var = vect_get_new_vect_var (vector_type, vect_simple_var, "cst_");
    add_referenced_var (new_var); 
-  
    init_stmt = build_gimple_modify_stmt (new_var, vector_var);
    new_temp = make_ssa_name (new_var, init_stmt);
    GIMPLE_STMT_OPERAND (init_stmt, 0) = new_temp;
  
!   pe = loop_preheader_edge (loop);
!   new_bb = bsi_insert_on_edge_immediate (pe, init_stmt);
!   gcc_assert (!new_bb);
  
    if (vect_print_dump_info (REPORT_DETAILS))
      {
--- 1225,1249 ----
    tree new_temp;
    basic_block new_bb;
   
    new_var = vect_get_new_vect_var (vector_type, vect_simple_var, "cst_");
    add_referenced_var (new_var); 
    init_stmt = build_gimple_modify_stmt (new_var, vector_var);
    new_temp = make_ssa_name (new_var, init_stmt);
    GIMPLE_STMT_OPERAND (init_stmt, 0) = new_temp;
  
!   if (bsi)
!     vect_finish_stmt_generation (stmt, init_stmt, bsi);
!   else
!     {
!       loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
!       struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
! 
!       if (nested_in_vect_loop_p (loop, stmt))
!         loop = loop->inner;
!       pe = loop_preheader_edge (loop);
!       new_bb = bsi_insert_on_edge_immediate (pe, init_stmt);
!       gcc_assert (!new_bb);
!     }
  
    if (vect_print_dump_info (REPORT_DETAILS))
      {
*************** get_initial_def_for_induction (tree iv_p
*** 1233,1239 ****
  	}
        /* Create a vector from [new_name_0, new_name_1, ..., new_name_nunits-1]  */
        vec = build_constructor_from_list (vectype, nreverse (t));
!       vec_init = vect_init_vector (iv_phi, vec, vectype);
      }
  
  
--- 1381,1387 ----
  	}
        /* Create a vector from [new_name_0, new_name_1, ..., new_name_nunits-1]  */
        vec = build_constructor_from_list (vectype, nreverse (t));
!       vec_init = vect_init_vector (iv_phi, vec, vectype, NULL);
      }
  
  
*************** get_initial_def_for_induction (tree iv_p
*** 1254,1260 ****
    for (i = 0; i < nunits; i++)
      t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
    vec = build_constructor_from_list (vectype, t);
!   vec_step = vect_init_vector (iv_phi, vec, vectype);
  
  
    /* Create the following def-use cycle:
--- 1402,1408 ----
    for (i = 0; i < nunits; i++)
      t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
    vec = build_constructor_from_list (vectype, t);
!   vec_step = vect_init_vector (iv_phi, vec, vectype, NULL);
  
  
    /* Create the following def-use cycle:
*************** get_initial_def_for_induction (tree iv_p
*** 1310,1316 ****
        for (i = 0; i < nunits; i++)
  	t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
        vec = build_constructor_from_list (vectype, t);
!       vec_step = vect_init_vector (iv_phi, vec, vectype);
  
        vec_def = induc_def;
        prev_stmt_vinfo = vinfo_for_stmt (induction_phi);
--- 1458,1464 ----
        for (i = 0; i < nunits; i++)
  	t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
        vec = build_constructor_from_list (vectype, t);
!       vec_step = vect_init_vector (iv_phi, vec, vectype, NULL);
  
        vec_def = induc_def;
        prev_stmt_vinfo = vinfo_for_stmt (induction_phi);
*************** vect_get_vec_def_for_operand (tree op, t
*** 1447,1453 ****
          vector_type = get_vectype_for_scalar_type (TREE_TYPE (op));
          vec_cst = build_vector (vector_type, t);
  
!         return vect_init_vector (stmt, vec_cst, vector_type);
        }
  
      /* Case 2: operand is defined outside the loop - loop invariant.  */
--- 1595,1601 ----
          vector_type = get_vectype_for_scalar_type (TREE_TYPE (op));
          vec_cst = build_vector (vector_type, t);
  
!         return vect_init_vector (stmt, vec_cst, vector_type, NULL);
        }
  
      /* Case 2: operand is defined outside the loop - loop invariant.  */
*************** vect_get_vec_def_for_operand (tree op, t
*** 1468,1475 ****
  	/* FIXME: use build_constructor directly.  */
  	vector_type = get_vectype_for_scalar_type (TREE_TYPE (def));
          vec_inv = build_constructor_from_list (vector_type, t);
! 
!         return vect_init_vector (stmt, vec_inv, vector_type);
        }
  
      /* Case 3: operand is defined inside the loop.  */
--- 1616,1622 ----
  	/* FIXME: use build_constructor directly.  */
  	vector_type = get_vectype_for_scalar_type (TREE_TYPE (def));
          vec_inv = build_constructor_from_list (vector_type, t);
!         return vect_init_vector (stmt, vec_inv, vector_type, NULL);
        }
  
      /* Case 3: operand is defined inside the loop.  */
*************** vectorizable_store (tree stmt, block_stm
*** 4112,4118 ****
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    enum machine_mode vec_mode;
    tree dummy;
!   enum dr_alignment_support alignment_support_cheme;
    tree def, def_stmt;
    enum vect_def_type dt;
    stmt_vec_info prev_stmt_info = NULL;
--- 4259,4265 ----
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    enum machine_mode vec_mode;
    tree dummy;
!   enum dr_alignment_support alignment_support_scheme;
    tree def, def_stmt;
    enum vect_def_type dt;
    stmt_vec_info prev_stmt_info = NULL;
*************** vectorizable_store (tree stmt, block_stm
*** 4124,4130 ****
--- 4271,4280 ----
    bool strided_store = false;
    unsigned int group_size, i;
    VEC(tree,heap) *dr_chain = NULL, *oprnds = NULL, *result_chain = NULL;
+   bool inv_p;
+ 
    gcc_assert (ncopies >= 1);
+ 
    /* FORNOW. This restriction should be relaxed.  */
    if (nested_in_vect_loop_p (loop, stmt) && ncopies > 1)
      {
*************** vectorizable_store (tree stmt, block_stm
*** 4198,4203 ****
--- 4348,4356 ----
  
        DR_GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))++;
  
+       /* FORNOW */
+       gcc_assert (!nested_in_vect_loop_p (loop, stmt));
+ 
        /* We vectorize all the stmts of the interleaving group when we
  	 reach the last stmt in the group.  */
        if (DR_GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt)) 
*************** vectorizable_store (tree stmt, block_stm
*** 4220,4228 ****
    dr_chain = VEC_alloc (tree, heap, group_size);
    oprnds = VEC_alloc (tree, heap, group_size);
  
!   alignment_support_cheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_cheme);
!   gcc_assert (alignment_support_cheme == dr_aligned);  /* FORNOW */
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
--- 4373,4381 ----
    dr_chain = VEC_alloc (tree, heap, group_size);
    oprnds = VEC_alloc (tree, heap, group_size);
  
!   alignment_support_scheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_scheme);
!   gcc_assert (alignment_support_scheme == dr_aligned);  /* FORNOW */
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
*************** vectorizable_store (tree stmt, block_stm
*** 4292,4300 ****
  	      VEC_quick_push(tree, oprnds, vec_oprnd); 
  	      next_stmt = DR_GROUP_NEXT_DR (vinfo_for_stmt (next_stmt));
  	    }
! 	  dataref_ptr = vect_create_data_ref_ptr (first_stmt, bsi, NULL_TREE, 
  						  &dummy, &ptr_incr, false,
! 						  TREE_TYPE (vec_oprnd));
  	}
        else 
  	{
--- 4445,4454 ----
  	      VEC_quick_push(tree, oprnds, vec_oprnd); 
  	      next_stmt = DR_GROUP_NEXT_DR (vinfo_for_stmt (next_stmt));
  	    }
! 	  dataref_ptr = vect_create_data_ref_ptr (first_stmt, NULL, NULL_TREE, 
  						  &dummy, &ptr_incr, false,
! 						  TREE_TYPE (vec_oprnd), &inv_p);
! 	  gcc_assert (!inv_p);
  	}
        else 
  	{
*************** vectorizable_store (tree stmt, block_stm
*** 4312,4318 ****
  	      VEC_replace(tree, dr_chain, i, vec_oprnd);
  	      VEC_replace(tree, oprnds, i, vec_oprnd);
  	    }
! 	  dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);
  	}
  
        if (strided_store)
--- 4466,4473 ----
  	      VEC_replace(tree, dr_chain, i, vec_oprnd);
  	      VEC_replace(tree, oprnds, i, vec_oprnd);
  	    }
! 	  dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);
  	}
  
        if (strided_store)
*************** vectorizable_store (tree stmt, block_stm
*** 4348,4354 ****
  	  if (!next_stmt)
  	    break;
  	  /* Bump the vector pointer.  */
! 	  dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);
  	}
      }
  
--- 4503,4510 ----
  	  if (!next_stmt)
  	    break;
  	  /* Bump the vector pointer.  */
! 	  dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);
  	}
      }
  
*************** vectorizable_store (tree stmt, block_stm
*** 4359,4372 ****
  /* Function vect_setup_realignment
    
     This function is called when vectorizing an unaligned load using
!    the dr_unaligned_software_pipeline scheme.
     This function generates the following code at the loop prolog:
  
        p = initial_addr;
!       msq_init = *(floor(p));   # prolog load
        realignment_token = call target_builtin; 
      loop:
!       msq = phi (msq_init, ---)
  
     The code above sets up a new (vector) pointer, pointing to the first 
     location accessed by STMT, and a "floor-aligned" load using that pointer.
--- 4515,4531 ----
  /* Function vect_setup_realignment
    
     This function is called when vectorizing an unaligned load using
!    the dr_explicit_realign[_optimized] scheme.
     This function generates the following code at the loop prolog:
  
        p = initial_addr;
!    x  msq_init = *(floor(p));   # prolog load
        realignment_token = call target_builtin; 
      loop:
!    x  msq = phi (msq_init, ---)
! 
!    The stmts marked with x are generated only for the case of 
!    dr_explicit_realign_optimized.
  
     The code above sets up a new (vector) pointer, pointing to the first 
     location accessed by STMT, and a "floor-aligned" load using that pointer.
*************** vectorizable_store (tree stmt, block_stm
*** 4375,4393 ****
     whose arguments are the result of the prolog-load (created by this
     function) and the result of a load that takes place in the loop (to be
     created by the caller to this function).
     The caller to this function uses the phi-result (msq) to create the 
     realignment code inside the loop, and sets up the missing phi argument,
     as follows:
- 
      loop: 
        msq = phi (msq_init, lsq)
        lsq = *(floor(p'));        # load in loop
        result = realign_load (msq, lsq, realignment_token);
  
     Input:
     STMT - (scalar) load stmt to be vectorized. This load accesses
            a memory location that may be unaligned.
     BSI - place where new code is to be inserted.
     
     Output:
     REALIGNMENT_TOKEN - the result of a call to the builtin_mask_for_load
--- 4534,4562 ----
     whose arguments are the result of the prolog-load (created by this
     function) and the result of a load that takes place in the loop (to be
     created by the caller to this function).
+ 
+    For the case of dr_explicit_realign_optimized:
     The caller to this function uses the phi-result (msq) to create the 
     realignment code inside the loop, and sets up the missing phi argument,
     as follows:
      loop: 
        msq = phi (msq_init, lsq)
        lsq = *(floor(p'));        # load in loop
        result = realign_load (msq, lsq, realignment_token);
  
+    For the case of dr_explicit_realign:
+     loop:
+       msq = *(floor(p)); 	# load in loop
+       p' = p + (VS-1);
+       lsq = *(floor(p'));	# load in loop
+       result = realign_load (msq, lsq, realignment_token);
+ 
     Input:
     STMT - (scalar) load stmt to be vectorized. This load accesses
            a memory location that may be unaligned.
     BSI - place where new code is to be inserted.
+    ALIGNMENT_SUPPORT_SCHEME - which of the two misalignment handling schemes
+ 			      is used.	
     
     Output:
     REALIGNMENT_TOKEN - the result of a call to the builtin_mask_for_load
*************** vectorizable_store (tree stmt, block_stm
*** 4396,4438 ****
  
  static tree
  vect_setup_realignment (tree stmt, block_stmt_iterator *bsi,
!                         tree *realignment_token)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   edge pe = loop_preheader_edge (loop);
    tree scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
    tree vec_dest;
-   tree init_addr;
    tree inc;
    tree ptr;
    tree data_ref;
    tree new_stmt;
    basic_block new_bb;
!   tree msq_init;
    tree new_temp;
    tree phi_stmt;
!   tree msq;
  
-   /* 1. Create msq_init = *(floor(p1)) in the loop preheader  */
-   vec_dest = vect_create_destination_var (scalar_dest, vectype);
-   ptr = vect_create_data_ref_ptr (stmt, bsi, NULL_TREE, &init_addr, &inc, true,
- 				  NULL_TREE);
-   data_ref = build1 (ALIGN_INDIRECT_REF, vectype, ptr);
-   new_stmt = build_gimple_modify_stmt (vec_dest, data_ref);
-   new_temp = make_ssa_name (vec_dest, new_stmt);
-   GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
-   new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
-   gcc_assert (!new_bb);
-   msq_init = GIMPLE_STMT_OPERAND (new_stmt, 0);
  
!   /* 2. Create permutation mask, if required, in loop preheader.  */
    if (targetm.vectorize.builtin_mask_for_load)
      {
        tree builtin_decl;
  
        builtin_decl = targetm.vectorize.builtin_mask_for_load ();
        new_stmt = build_call_expr (builtin_decl, 1, init_addr);
        vec_dest = vect_create_destination_var (scalar_dest, 
--- 4565,4708 ----
  
  static tree
  vect_setup_realignment (tree stmt, block_stmt_iterator *bsi,
!                         tree *realignment_token,
! 			enum dr_alignment_support alignment_support_scheme,
! 			tree init_addr,
! 			struct loop **at_loop)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   edge pe;
    tree scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
    tree vec_dest;
    tree inc;
    tree ptr;
    tree data_ref;
    tree new_stmt;
    basic_block new_bb;
!   tree msq_init = NULL_TREE;
    tree new_temp;
    tree phi_stmt;
!   tree msq = NULL_TREE;
!   tree stmts = NULL_TREE;
!   bool inv_p;
!   bool compute_in_loop = false;
!   bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
!   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
!   struct loop *loop_for_initial_load;
! 
!   gcc_assert (alignment_support_scheme == dr_explicit_realign
! 	      || alignment_support_scheme == dr_explicit_realign_optimized);
! 
!   /* We need to generate three things:
!      1. the misalignment computation
!      2. the extra vector load (for the optimized realignment scheme).
!      3. the phi node for the two vectors from which the realignment is
!       done (for the optimized realignment scheme).
!    */
! 
!   /* 1. Determine where to generate the misalignment computation.
! 
!      If INIT_ADDR is NULL_TREE, this indicates that the misalignment
!      calculation will be generated by this function, outside the loop (in the
!      preheader).  Otherwise, INIT_ADDR had already been computed for us by the
!      caller, inside the loop.
! 
!      Background: If the misalignment remains fixed throughout the iterations of
!      the loop, then both realignment schemes are applicable, and also the
!      misalignment computation can be done outside LOOP.  This is because we are
!      vectorizing LOOP, and so the memory accesses in LOOP advance in steps that
!      are a multiple of VS (the Vector Size), and therefore the misalignment in
!      different vectorized LOOP iterations is always the same.
!      The problem arises only if the memory access is in an inner-loop nested
!      inside LOOP, which is now being vectorized using outer-loop vectorization.
!      This is the only case when the misalignment of the memory access may not
!      remain fixed throughout the iterations of the inner-loop (as explained in
!      detail in vect_supportable_dr_alignment).  In this case, not only is the
!      optimized realignment scheme not applicable, but also the misalignment
!      computation (and generation of the realignment token that is passed to
!      REALIGN_LOAD) have to be done inside the loop.
! 
!      In short, INIT_ADDR indicates whether we are in a COMPUTE_IN_LOOP mode
!      or not, which in turn determines if the misalignment is computed inside
!      the inner-loop, or outside LOOP.  */
! 
!   if (init_addr != NULL_TREE)
!     {
!       compute_in_loop = true;
!       gcc_assert (alignment_support_scheme == dr_explicit_realign);
!     }
  
  
!   /* 2. Determine where to generate the extra vector load.
! 
!      For the optimized realignment scheme, instead of generating two vector
!      loads in each iteration, we generate a single extra vector load in the
!      preheader of the loop, and in each iteration reuse the result of the
!      vector load from the previous iteration.  In case the memory access is in
!      an inner-loop nested inside LOOP, which is now being vectorized using
!      outer-loop vectorization, we need to determine whether this initial vector
!      load should be generated at the preheader of the inner-loop, or can be
!      generated at the preheader of LOOP.  If the memory access has no evolution
!      in LOOP, it can be generated in the preheader of LOOP. Otherwise, it has
!      to be generated inside LOOP (in the preheader of the inner-loop).  */
! 
!   if (nested_in_vect_loop)
!     {
!       tree outerloop_step = STMT_VINFO_DR_STEP (stmt_info);
!       bool invariant_in_outerloop =
!             (tree_int_cst_compare (outerloop_step, size_zero_node) == 0);
!       loop_for_initial_load = (invariant_in_outerloop ? loop : loop->inner);
!     }
!   else
!     loop_for_initial_load = loop;
!   if (at_loop)
!     *at_loop = loop_for_initial_load;
! 
!   /* 3. For the case of the optimized realignment, create the first vector
!       load at the loop preheader.  */
! 
!   if (alignment_support_scheme == dr_explicit_realign_optimized)
!     {
!       /* Create msq_init = *(floor(p1)) in the loop preheader  */
! 
!       gcc_assert (!compute_in_loop);
!       pe = loop_preheader_edge (loop_for_initial_load);
!       vec_dest = vect_create_destination_var (scalar_dest, vectype);
!       ptr = vect_create_data_ref_ptr (stmt, loop_for_initial_load, NULL_TREE,
! 				&init_addr, &inc, true, NULL_TREE, &inv_p);
!       data_ref = build1 (ALIGN_INDIRECT_REF, vectype, ptr);
!       new_stmt = build_gimple_modify_stmt (vec_dest, data_ref);
!       new_temp = make_ssa_name (vec_dest, new_stmt);
!       GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
!       new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
!       gcc_assert (!new_bb);
!       msq_init = GIMPLE_STMT_OPERAND (new_stmt, 0);
!     }
! 
!   /* 4. Create realignment token using a target builtin, if available.
!       It is done either inside the containing loop, or before LOOP (as
!       determined above).  */
! 
    if (targetm.vectorize.builtin_mask_for_load)
      {
        tree builtin_decl;
  
+       /* Compute INIT_ADDR - the initial address accessed by this memref.  */
+       if (compute_in_loop)
+ 	gcc_assert (init_addr); /* already computed by the caller.  */
+       else
+ 	{
+ 	  /* Generate the INIT_ADDR computation outside LOOP.  */
+ 	  init_addr = vect_create_addr_base_for_vector_ref (stmt, &stmts,
+ 							NULL_TREE, loop);
+ 	  pe = loop_preheader_edge (loop);
+ 	  new_bb = bsi_insert_on_edge_immediate (pe, stmts);
+ 	  gcc_assert (!new_bb);
+ 	}
+ 
        builtin_decl = targetm.vectorize.builtin_mask_for_load ();
        new_stmt = build_call_expr (builtin_decl, 1, init_addr);
        vec_dest = vect_create_destination_var (scalar_dest, 
*************** vect_setup_realignment (tree stmt, block
*** 4440,4447 ****
        new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
        new_temp = make_ssa_name (vec_dest, new_stmt);
        GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
!       new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
!       gcc_assert (!new_bb);
        *realignment_token = GIMPLE_STMT_OPERAND (new_stmt, 0);
  
        /* The result of the CALL_EXPR to this builtin is determined from
--- 4710,4726 ----
        new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
        new_temp = make_ssa_name (vec_dest, new_stmt);
        GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
! 
!       if (compute_in_loop)
! 	bsi_insert_before (bsi, new_stmt, BSI_SAME_STMT);
!       else
! 	{
! 	  /* Generate the misalignment computation outside LOOP.  */
! 	  pe = loop_preheader_edge (loop);
! 	  new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
! 	  gcc_assert (!new_bb);
! 	}
! 
        *realignment_token = GIMPLE_STMT_OPERAND (new_stmt, 0);
  
        /* The result of the CALL_EXPR to this builtin is determined from
*************** vect_setup_realignment (tree stmt, block
*** 4452,4463 ****
        gcc_assert (TREE_READONLY (builtin_decl));
      }
  
!   /* 3. Create msq = phi <msq_init, lsq> in loop  */
    vec_dest = vect_create_destination_var (scalar_dest, vectype);
    msq = make_ssa_name (vec_dest, NULL_TREE);
!   phi_stmt = create_phi_node (msq, loop->header); 
    SSA_NAME_DEF_STMT (msq) = phi_stmt;
!   add_phi_arg (phi_stmt, msq_init, loop_preheader_edge (loop));
  
    return msq;
  }
--- 4731,4751 ----
        gcc_assert (TREE_READONLY (builtin_decl));
      }
  
!   if (alignment_support_scheme == dr_explicit_realign)
!     return msq;
! 
!   gcc_assert (!compute_in_loop);
!   gcc_assert (alignment_support_scheme == dr_explicit_realign_optimized);
! 
! 
!   /* 5. Create msq = phi <msq_init, lsq> in loop  */
! 
!   pe = loop_preheader_edge (containing_loop);
    vec_dest = vect_create_destination_var (scalar_dest, vectype);
    msq = make_ssa_name (vec_dest, NULL_TREE);
!   phi_stmt = create_phi_node (msq, containing_loop->header);
    SSA_NAME_DEF_STMT (msq) = phi_stmt;
!   add_phi_arg (phi_stmt, msq_init, pe);
  
    return msq;
  }
*************** vectorizable_load (tree stmt, block_stmt
*** 4747,4759 ****
    stmt_vec_info prev_stmt_info; 
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info), *first_dr;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree new_temp;
    int mode;
    tree new_stmt = NULL_TREE;
    tree dummy;
!   enum dr_alignment_support alignment_support_cheme;
    tree dataref_ptr = NULL_TREE;
    tree ptr_incr;
    int nunits = TYPE_VECTOR_SUBPARTS (vectype);
--- 5035,5049 ----
    stmt_vec_info prev_stmt_info; 
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
+   bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info), *first_dr;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree new_temp;
    int mode;
    tree new_stmt = NULL_TREE;
    tree dummy;
!   enum dr_alignment_support alignment_support_scheme;
    tree dataref_ptr = NULL_TREE;
    tree ptr_incr;
    int nunits = TYPE_VECTOR_SUBPARTS (vectype);
*************** vectorizable_load (tree stmt, block_stmt
*** 4762,4775 ****
    tree msq = NULL_TREE, lsq;
    tree offset = NULL_TREE;
    tree realignment_token = NULL_TREE;
!   tree phi_stmt = NULL_TREE;
    VEC(tree,heap) *dr_chain = NULL;
    bool strided_load = false;
    tree first_stmt;
  
    gcc_assert (ncopies >= 1);
    /* FORNOW. This restriction should be relaxed.  */
!   if (nested_in_vect_loop_p (loop, stmt) && ncopies > 1)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "multiple types in nested loop.");
--- 5052,5070 ----
    tree msq = NULL_TREE, lsq;
    tree offset = NULL_TREE;
    tree realignment_token = NULL_TREE;
!   tree phi = NULL_TREE;
    VEC(tree,heap) *dr_chain = NULL;
    bool strided_load = false;
    tree first_stmt;
+   tree scalar_type;
+   bool inv_p;
+   bool compute_in_loop = false;
+   struct loop *at_loop;
  
    gcc_assert (ncopies >= 1);
+ 
    /* FORNOW. This restriction should be relaxed.  */
!   if (nested_in_vect_loop && ncopies > 1)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "multiple types in nested loop.");
*************** vectorizable_load (tree stmt, block_stmt
*** 4807,4812 ****
--- 5102,5108 ----
    if (!STMT_VINFO_DATA_REF (stmt_info))
      return false;
  
+   scalar_type = TREE_TYPE (DR_REF (dr));
    mode = (int) TYPE_MODE (vectype);
  
    /* FORNOW. In some cases can vectorize even if data-type not supported
*************** vectorizable_load (tree stmt, block_stmt
*** 4822,4827 ****
--- 5118,5125 ----
    if (DR_GROUP_FIRST_DR (stmt_info))
      {
        strided_load = true;
+       /* FORNOW */
+       gcc_assert (! nested_in_vect_loop);
  
        /* Check if interleaving is supported.  */
        if (!vect_strided_load_supported (vectype))
*************** vectorizable_load (tree stmt, block_stmt
*** 4860,4868 ****
        group_size = 1;
      }
  
!   alignment_support_cheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_cheme);
! 
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
--- 5158,5165 ----
        group_size = 1;
      }
  
!   alignment_support_scheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_scheme);
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
*************** vectorizable_load (tree stmt, block_stmt
*** 4944,4950 ****
           }
  
       Otherwise, the data reference is potentially unaligned on a target that
!      does not support unaligned accesses (dr_unaligned_software_pipeline) - 
       then generate the following code, in which the data in each iteration is
       obtained by two vector loads, one from the previous iteration, and one
       from the current iteration:
--- 5241,5247 ----
           }
  
       Otherwise, the data reference is potentially unaligned on a target that
!      does not support unaligned accesses (dr_explicit_realign_optimized) - 
       then generate the following code, in which the data in each iteration is
       obtained by two vector loads, one from the previous iteration, and one
       from the current iteration:
*************** vectorizable_load (tree stmt, block_stmt
*** 4961,4987 ****
             msq = lsq;
           }   */
  
!   if (alignment_support_cheme == dr_unaligned_software_pipeline)
!     {
!       msq = vect_setup_realignment (first_stmt, bsi, &realignment_token);
!       phi_stmt = SSA_NAME_DEF_STMT (msq);
!       offset = size_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
      }
  
    prev_stmt_info = NULL;
    for (j = 0; j < ncopies; j++)
      { 
        /* 1. Create the vector pointer update chain.  */
        if (j == 0)
!         dataref_ptr = vect_create_data_ref_ptr (first_stmt, bsi, offset, &dummy,
!                                                 &ptr_incr, false, NULL_TREE);
        else
!         dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);
  
        for (i = 0; i < group_size; i++)
  	{
  	  /* 2. Create the vector-load in the loop.  */
! 	  switch (alignment_support_cheme)
  	    {
  	    case dr_aligned:
  	      gcc_assert (aligned_access_p (first_dr));
--- 5258,5309 ----
             msq = lsq;
           }   */
  
!   /* If the misalignment remains the same throughout the execution of the
!      loop, we can create the init_addr and permutation mask at the loop
!      preheader. Otherwise, it needs to be created inside the loop.
!      This can only occur when vectorizing memory accesses in the inner-loop
!      nested within an outer-loop that is being vectorized.  */
! 
!   if (nested_in_vect_loop_p (loop, stmt)
!       && (TREE_INT_CST_LOW (DR_STEP (dr)) % UNITS_PER_SIMD_WORD != 0))
!     {
!       gcc_assert (alignment_support_scheme != dr_explicit_realign_optimized);
!       compute_in_loop = true;
!     }
! 
!   if ((alignment_support_scheme == dr_explicit_realign_optimized
!        || alignment_support_scheme == dr_explicit_realign)
!       && !compute_in_loop)
!     {
!       msq = vect_setup_realignment (first_stmt, bsi, &realignment_token,
! 				    alignment_support_scheme, NULL_TREE,
! 				    &at_loop);
!       if (alignment_support_scheme == dr_explicit_realign_optimized)
! 	{
! 	  phi = SSA_NAME_DEF_STMT (msq);
! 	  offset = size_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
! 	}
      }
+   else
+     at_loop = loop;
  
    prev_stmt_info = NULL;
    for (j = 0; j < ncopies; j++)
      { 
        /* 1. Create the vector pointer update chain.  */
        if (j == 0)
!         dataref_ptr = vect_create_data_ref_ptr (first_stmt,
! 					        at_loop, offset, 
! 						&dummy, &ptr_incr, false, 
! 						NULL_TREE, &inv_p);
        else
!         dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);
  
        for (i = 0; i < group_size; i++)
  	{
  	  /* 2. Create the vector-load in the loop.  */
! 	  switch (alignment_support_scheme)
  	    {
  	    case dr_aligned:
  	      gcc_assert (aligned_access_p (first_dr));
*************** vectorizable_load (tree stmt, block_stmt
*** 4992,5005 ****
  		int mis = DR_MISALIGNMENT (first_dr);
  		tree tmis = (mis == -1 ? size_zero_node : size_int (mis));
  
- 		gcc_assert (!aligned_access_p (first_dr));
  		tmis = size_binop (MULT_EXPR, tmis, size_int(BITS_PER_UNIT));
  		data_ref =
  		  build2 (MISALIGNED_INDIRECT_REF, vectype, dataref_ptr, tmis);
  		break;
  	      }
! 	    case dr_unaligned_software_pipeline:
! 	      gcc_assert (!aligned_access_p (first_dr));
  	      data_ref = build1 (ALIGN_INDIRECT_REF, vectype, dataref_ptr);
  	      break;
  	    default:
--- 5314,5352 ----
  		int mis = DR_MISALIGNMENT (first_dr);
  		tree tmis = (mis == -1 ? size_zero_node : size_int (mis));
  
  		tmis = size_binop (MULT_EXPR, tmis, size_int(BITS_PER_UNIT));
  		data_ref =
  		  build2 (MISALIGNED_INDIRECT_REF, vectype, dataref_ptr, tmis);
  		break;
  	      }
! 	    case dr_explicit_realign:
! 	      {
! 		tree ptr, bump;
! 		tree vs_minus_1 = size_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
! 
! 		if (compute_in_loop)
! 		  msq = vect_setup_realignment (first_stmt, bsi, 
! 						&realignment_token,
! 						dr_explicit_realign, 
! 						dataref_ptr, NULL);
! 
! 		data_ref = build1 (ALIGN_INDIRECT_REF, vectype, dataref_ptr);
! 		vec_dest = vect_create_destination_var (scalar_dest, vectype);
! 		new_stmt = build_gimple_modify_stmt (vec_dest, data_ref);
! 		new_temp = make_ssa_name (vec_dest, new_stmt);
! 		GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
! 		vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 		copy_virtual_operands (new_stmt, stmt);
! 		mark_symbols_for_renaming (new_stmt);
! 		msq = new_temp;
! 
! 		bump = size_binop (MULT_EXPR, vs_minus_1,
! 				   TYPE_SIZE_UNIT (scalar_type));
! 		ptr = bump_vector_ptr (dataref_ptr, NULL_TREE, bsi, stmt, bump);
! 	        data_ref = build1 (ALIGN_INDIRECT_REF, vectype, ptr);
! 	        break;
! 	      }
! 	    case dr_explicit_realign_optimized:
  	      data_ref = build1 (ALIGN_INDIRECT_REF, vectype, dataref_ptr);
  	      break;
  	    default:
*************** vectorizable_load (tree stmt, block_stmt
*** 5012,5040 ****
  	  vect_finish_stmt_generation (stmt, new_stmt, bsi);
  	  mark_symbols_for_renaming (new_stmt);
  
! 	  /* 3. Handle explicit realignment if necessary/supported.  */
! 	  if (alignment_support_cheme == dr_unaligned_software_pipeline)
  	    {
- 	      /* Create in loop: 
- 		 <vec_dest = realign_load (msq, lsq, realignment_token)>  */
  	      lsq = GIMPLE_STMT_OPERAND (new_stmt, 0);
  	      if (!realignment_token)
  		realignment_token = dataref_ptr;
  	      vec_dest = vect_create_destination_var (scalar_dest, vectype);
! 	      new_stmt =
! 		build3 (REALIGN_LOAD_EXPR, vectype, msq, lsq, realignment_token);
  	      new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
  	      new_temp = make_ssa_name (vec_dest, new_stmt);
  	      GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
  	      vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 	      if (i == group_size - 1 && j == ncopies - 1)
! 		add_phi_arg (phi_stmt, lsq, loop_latch_edge (loop));
! 	      msq = lsq;
  	    }
  	  if (strided_load)
  	    VEC_quick_push (tree, dr_chain, new_temp);
  	  if (i < group_size - 1)
! 	    dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);	  
  	}
  
        if (strided_load)
--- 5359,5428 ----
  	  vect_finish_stmt_generation (stmt, new_stmt, bsi);
  	  mark_symbols_for_renaming (new_stmt);
  
! 	  /* 3. Handle explicit realignment if necessary/supported. Create in
! 		loop: vec_dest = realign_load (msq, lsq, realignment_token)  */
! 	  if (alignment_support_scheme == dr_explicit_realign_optimized
! 	      || alignment_support_scheme == dr_explicit_realign)
  	    {
  	      lsq = GIMPLE_STMT_OPERAND (new_stmt, 0);
  	      if (!realignment_token)
  		realignment_token = dataref_ptr;
  	      vec_dest = vect_create_destination_var (scalar_dest, vectype);
! 	      new_stmt = build3 (REALIGN_LOAD_EXPR, vectype, msq, lsq, 
! 				 realignment_token);
  	      new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
  	      new_temp = make_ssa_name (vec_dest, new_stmt);
  	      GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
  	      vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 
! 	      if (alignment_support_scheme == dr_explicit_realign_optimized)
! 		{
! 		  if (i == group_size - 1 && j == ncopies - 1)
! 		    add_phi_arg (phi, lsq, loop_latch_edge (containing_loop));
! 		  msq = lsq;
! 		}
! 	    }
! 
! 	  /* 4. Handle invariant-load.  */
! 	  if (inv_p)
! 	    {
! 	      gcc_assert (!strided_load);
! 	      gcc_assert (nested_in_vect_loop_p (loop, stmt));
! 	      if (j == 0)
! 		{
! 		  int k;
! 		  tree t = NULL_TREE;
! 		  tree vec_inv, bitpos, bitsize = TYPE_SIZE (scalar_type);
! 
! 		  /* CHECKME: bitpos depends on endianness?  */
! 		  bitpos = bitsize_zero_node;
! 		  vec_inv = build3 (BIT_FIELD_REF, scalar_type, new_temp, 
! 							    bitsize, bitpos);
! 		  BIT_FIELD_REF_UNSIGNED (vec_inv) = 
! 						 TYPE_UNSIGNED (scalar_type);
! 		  vec_dest = 
! 			vect_create_destination_var (scalar_dest, NULL_TREE);
! 		  new_stmt = build_gimple_modify_stmt (vec_dest, vec_inv);
!                   new_temp = make_ssa_name (vec_dest, new_stmt);
!                   GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
!                   vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 
! 		  for (k = nunits - 1; k >= 0; --k)
! 		    t = tree_cons (NULL_TREE, new_temp, t);
! 		  /* FIXME: use build_constructor directly.  */
! 		  vec_inv = build_constructor_from_list (vectype, t);
! 		  new_temp = vect_init_vector (stmt, vec_inv, vectype, bsi);
! 		  new_stmt = SSA_NAME_DEF_STMT (new_temp);
! 		}
! 	      else
! 		gcc_unreachable (); /* FORNOW. */
  	    }
+ 
  	  if (strided_load)
  	    VEC_quick_push (tree, dr_chain, new_temp);
  	  if (i < group_size - 1)
! 	    dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);	  
  	}
  
        if (strided_load)
*************** vect_gen_niters_for_prolog_loop (loop_ve
*** 5805,5812 ****
    else
      {
        tree new_stmts = NULL_TREE;
!       tree start_addr =
!         vect_create_addr_base_for_vector_ref (dr_stmt, &new_stmts, NULL_TREE);
        tree ptr_type = TREE_TYPE (start_addr);
        tree size = TYPE_SIZE (ptr_type);
        tree type = lang_hooks.types.type_for_size (tree_low_cst (size, 1), 1);
--- 6193,6200 ----
    else
      {
        tree new_stmts = NULL_TREE;
!       tree start_addr = vect_create_addr_base_for_vector_ref (dr_stmt, 
! 						&new_stmts, NULL_TREE, loop);
        tree ptr_type = TREE_TYPE (start_addr);
        tree size = TYPE_SIZE (ptr_type);
        tree type = lang_hooks.types.type_for_size (tree_low_cst (size, 1), 1);
*************** static tree
*** 5979,5984 ****
--- 6367,6373 ----
  vect_create_cond_for_align_checks (loop_vec_info loop_vinfo,
                                     tree *cond_expr_stmt_list)
  {
+   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    VEC(tree,heap) *may_misalign_stmts
      = LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo);
    tree ref_stmt, tmp;
*************** vect_create_cond_for_align_checks (loop_
*** 6014,6021 ****
  
        /* create: addr_tmp = (int)(address_of_first_vector) */
        addr_base = vect_create_addr_base_for_vector_ref (ref_stmt, 
! 							&new_stmt_list, 
! 							NULL_TREE);
  
        if (new_stmt_list != NULL_TREE)
          append_to_statement_list_force (new_stmt_list, cond_expr_stmt_list);
--- 6403,6409 ----
  
        /* create: addr_tmp = (int)(address_of_first_vector) */
        addr_base = vect_create_addr_base_for_vector_ref (ref_stmt, 
! 					&new_stmt_list, NULL_TREE, loop);
  
        if (new_stmt_list != NULL_TREE)
          append_to_statement_list_force (new_stmt_list, cond_expr_stmt_list);
Index: tree-data-ref.h
===================================================================
*** tree-data-ref.h	(revision 127356)
--- tree-data-ref.h	(working copy)
*************** index_in_loop_nest (int var, VEC (loop_p
*** 388,391 ****
--- 388,394 ----
  /* In lambda-code.c  */
  bool lambda_transform_legal_p (lambda_trans_matrix, int, VEC (ddr_p, heap) *);
  
+ /* In tree-data-ref.c  */
+ void split_constant_offset (tree , tree *, tree *);
+ 
  #endif  /* GCC_TREE_DATA_REF_H  */
Index: tree-data-ref.c
===================================================================
*** tree-data-ref.c	(revision 127356)
--- tree-data-ref.c	(working copy)
*************** dump_ddrs (FILE *file, VEC (ddr_p, heap)
*** 489,495 ****
  /* Expresses EXP as VAR + OFF, where off is a constant.  The type of OFF
     will be ssizetype.  */
  
! static void
  split_constant_offset (tree exp, tree *var, tree *off)
  {
    tree type = TREE_TYPE (exp), otype;
--- 489,495 ----
  /* Expresses EXP as VAR + OFF, where off is a constant.  The type of OFF
     will be ssizetype.  */
  
! void
  split_constant_offset (tree exp, tree *var, tree *off)
  {
    tree type = TREE_TYPE (exp), otype;
Index: testsuite/gcc.dg/vect/vect-outer-1a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-1a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-1a.c	(revision 0)
***************
*** 0 ****
--- 1,28 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ signed short image[N][N] __attribute__ ((__aligned__(16)));
+ signed short block[N][N] __attribute__ ((__aligned__(16)));
+ 
+ /* Can't do outer-loop vectorization because of non-consecutive access.
+    Currently fails to vectorize because the reduction pattern is not
+    recognized.  */
+ 
+ int
+ foo (){
+   int i,j;
+   int diff = 0;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j+=8) {
+       diff += (image[i][j] - block[i][j]);
+     }
+   }
+   return diff;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* FORNOW */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "unexpected pattern" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-1b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-1b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-1b.c	(revision 0)
***************
*** 0 ****
--- 1,26 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ signed short image[N][N];
+ signed short block[N][N];
+ signed short out[N];
+ 
+ /* Outer-loop cannot get vectorized because of non-consecutive access.  */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=4) {
+       diff += (image[i][j] - block[i][j]);
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-1.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-1.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-1.c	(revision 0)
***************
*** 0 ****
--- 1,26 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ signed short image[N][N] __attribute__ ((__aligned__(16)));
+ signed short block[N][N] __attribute__ ((__aligned__(16)));
+ signed short out[N] __attribute__ ((__aligned__(16)));
+ 
+ /* Can't do outer-loop vectorization because of non-consecutive access. */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=8) {
+       diff += (image[i][j] - block[i][j]);
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2a.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N][N] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[k][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       if (image[k][j][i] != j+i+k)
+ 	abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2b.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[2*N][N][N] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[k+i][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       if (image[k+i][j][i] != j+i+k)
+ 	abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2.c	(revision 0)
***************
*** 0 ****
--- 1,40 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[j][i] = j+i;
+     }
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       if (image[j][i] != j+i)
+ 	abort ();
+     }
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2c.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2c.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2c.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[2*N][2*N][N] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j+=2) {
+       image[k][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j+=2) {
+       if (image[k][j][i] != j+i+k)
+ 	abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2d.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2d.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2d.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N][N+1] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < i+1; j++) {
+       image[k][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < i+1; j++) {
+       if (image[k][j][i] != j+i+k)
+        abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 0 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3a.c	(revision 0)
***************
*** 0 ****
--- 1,53 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N+1] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization with misaligned accesses in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { scan-tree-dump-times "step doesn't divide the vector-size" 2 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3b.c	(revision 0)
***************
*** 0 ****
--- 1,53 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization with non-consecutive access. Not vectorized yet.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N/2; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][2*i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N/2; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][2*i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 2 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3.c	(revision 0)
***************
*** 0 ****
--- 1,52 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3c.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3c.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3c.c	(revision 0)
***************
*** 0 ****
--- 1,52 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N+1] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=4) {
+       diff += image[j][i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=4) {
+       diff += image[j][i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4a.c	(revision 0)
***************
*** 0 ****
--- 1,31 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ signed short in[N+M];
+ signed short coeff[M];
+ signed short out[N];
+ 
+ /* Outer-loop vectorization.
+    Currently not vectorized because of multiple-data-types in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* FORNOW. not vectorized until we support 0-stride accesses like coeff[j]. should be:
+    { scan-tree-dump-not "multiple types in nested loop." "vect" { xfail *-*-* } } } */
+ 
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1  "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4b.c	(revision 0)
***************
*** 0 ****
--- 1,31 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ signed short in[N+M];
+ signed short coeff[M];
+ int out[N];
+ 
+ /* Outer-loop vectorization.
+    Currently not vectorized because of multiple-data-types in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* FORNOW. not vectorized until we support 0-stride accesses like coeff[j]. should be:
+    { scan-tree-dump-not "multiple types in nested loop." "vect" { xfail *-*-* } } } */
+ 
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1  "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4.c	(revision 0)
***************
*** 0 ****
--- 1,55 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i]*coeff[j];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4c.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4c.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4c.c	(revision 0)
***************
*** 0 ****
--- 1,27 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned short coeff[M];
+ unsigned int out[N];
+ 
+ /* Outer-loop vectorization. */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned short diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { target vect_short_mult } } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4d.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4d.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4d.c	(revision 0)
***************
*** 0 ****
--- 1,51 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++)
+     in[i] = i;
+ 
+   foo ();
+   
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect"  } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4e.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4e.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4e.c	(revision 0)
***************
*** 0 ****
--- 1,25 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned int in[N+M];
+ unsigned short out[N];
+ 
+ /* Outer-loop vectorization. */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     out[i]=(unsigned short)diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4f.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4f.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4f.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4g.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4g.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4g.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4i.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4i.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4i.c	(revision 0)
***************
*** 0 ****
--- 1,28 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned char in[N+M];
+ unsigned short out[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned short
+ foo (){
+   int i,j;
+   unsigned short diff;
+   unsigned short s=0;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4j.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4j.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4j.c	(revision 0)
***************
*** 0 ****
--- 1,26 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned char in[N+M];
+ unsigned short out[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned short diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4k.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4k.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4k.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4l.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4l.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4l.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4m.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4m.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4m.c	(revision 0)
***************
*** 0 ****
--- 1,58 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=((unsigned short)diff>>3);
+   }
+   return s;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s += ((unsigned short)diff>>3);
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect"  { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-5.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-5.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-5.c	(revision 0)
***************
*** 0 ****
--- 1,80 ----
+ /* { dg-require-effective-target vect_int } */
+ 
+ #include <stdarg.h>
+ #include <signal.h>
+ #include "tree-vect.h"
+ 
+ #define N 64
+ #define MAX 42
+ 
+ extern void abort(void); 
+ 
+ int main1 ()
+ {  
+   float A[N] __attribute__ ((__aligned__(16)));
+   float B[N] __attribute__ ((__aligned__(16)));
+   float C[N] __attribute__ ((__aligned__(16)));
+   float D[N] __attribute__ ((__aligned__(16)));
+   float s;
+ 
+   int i, j;
+ 
+   for (i = 0; i < N; i++)
+     {
+       A[i] = i;
+       B[i] = i;
+       C[i] = i;
+       D[i] = i;
+     }
+ 
+   /* Outer-loop 1: Vectorizable with respect to dependence distance. */
+   for (i = 0; i < N-20; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+         s += C[j];
+       A[i] = A[i+20] + s;
+     }
+ 
+   /* check results:  */
+   for (i = 0; i < N-20; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+         s += C[j];
+       if (A[i] != D[i+20] + s)
+         abort ();
+     }
+ 
+   /* Outer-loop 2: Not vectorizable because of dependence distance. */
+   for (i = 0; i < 4; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+ 	s += C[j];
+       B[i] = B[i+3] + s;
+     }
+ 
+   /* check results:  */
+   for (i = 0; i < 4; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+ 	s += C[j];
+       if (B[i] != D[i+3] + s)
+ 	abort ();
+     }
+ 
+   return 0;
+ }
+ 
+ int main ()
+ {
+   check_vect ();
+   return main1();
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "not vectorized: possible dependence between data-refs" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-6.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-6.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-6.c	(revision 0)
***************
*** 0 ****
--- 1,65 ----
+ /* { dg-require-effective-target vect_int } */
+ 
+ #include <stdarg.h>
+ #include <signal.h>
+ #include "tree-vect.h"
+ 
+ #define N 64
+ #define MAX 42
+ 
+ float A[N] __attribute__ ((__aligned__(16)));
+ float B[N] __attribute__ ((__aligned__(16)));
+ float C[N] __attribute__ ((__aligned__(16)));
+ float D[N] __attribute__ ((__aligned__(16)));
+ extern void abort(void); 
+ 
+ int main1 ()
+ {  
+   float s;
+ 
+   int i, j;
+ 
+   for (i = 0; i < 8; i++)
+     {
+       s = 0;
+       for (j=0; j<8; j+=4)
+ 	s += C[j];
+       A[i] = s;
+     }
+ 
+   return 0;
+ }
+ 
+ int main ()
+ {
+   int i,j;
+   float s;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N; i++)
+     {
+       A[i] = i;
+       B[i] = i;
+       C[i] = i;
+       D[i] = i;
+     }
+ 
+   main1();
+ 
+   /* check results:  */
+   for (i = 0; i < 8; i++)
+     {
+       s = 0;
+       for (j=0; j<8; j+=4)
+         s += C[j];
+       if (A[i] != s)
+         abort ();
+     }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-fir.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-fir.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-fir.c	(revision 0)
***************
*** 0 ****
--- 1,77 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because we get too many BBs in the inner-loop,
+    because the compiler doesn't realize that the inner-loop executes at
+    least once (cause k<4), and so there's no need to create a guard code
+    to skip the inner-loop in case it doesn't execute.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = k; j < M; j+=4) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i] += diff;
+   }
+  }
+ 
+ }
+ 
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+   
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-fir-lb.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-fir-lb.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-fir-lb.c	(revision 0)
***************
*** 0 ****
--- 1,80 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 64
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because the loop-count for the inner-loop
+    has a maybe_zero component. Will be fixed when we incorporate the
+    "cond_expr in rhs" patch.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     j = k;
+ 
+     do {
+       diff += in[j+i]*coeff[j];
+       j+=4;	
+     } while (j < M);
+ 
+     out[i] += diff;
+   }
+  }
+ 
+ }
+ 
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+   
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c
===================================================================
*** testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c	(revision 0)
--- testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c	(revision 0)
***************
*** 0 ****
--- 1,152 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "../../tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because we get too many BBs in the inner-loop,
+    because the compiler doesn't realize that the inner-loop executes at
+    least once (cause k<4), and so there's no need to create a guard code
+    to skip the inner-loop in case it doesn't execute.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = k; j < M; j+=4) {
+       diff += in[j+i]*coeff[j];
+     }
+     out[i] += diff;
+   }
+  }
+ 
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+ 
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because we get too many BBs in the inner-loop,
+    because the compiler doesn't realize that the inner-loop executes at
+    least once (cause k<4), and so there's no need to create a guard code
+    to skip the inner-loop in case it doesn't execute.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = k; j < M; j+=4) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i] += diff;
+   }
+  }
+ 
+ }
+ 
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+   
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/no-section-anchors-vect-outer-4h.c
===================================================================
*** testsuite/gcc.dg/vect/no-section-anchors-vect-outer-4h.c	(revision 0)
--- testsuite/gcc.dg/vect/no-section-anchors-vect-outer-4h.c	(revision 0)
***************
*** 0 ****
--- 1,47 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ 
+ #define N 40
+ #define M 128
+ unsigned short a[M][N];
+ unsigned int out[N];
+ 
+ /* Outer-loop vectorization. */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned int diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < M; j++) {
+       a[j][i] = 4;
+     }
+     out[i]=5;
+   }
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   check_vect ();
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < M; j++) {
+       if (a[j][i] != 4)
+         abort ();
+     }
+     if (out[i] != 5)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-117.c
===================================================================
*** testsuite/gcc.dg/vect/vect-117.c	(revision 127371)
--- testsuite/gcc.dg/vect/vect-117.c	(working copy)
*************** static  int c[N][N] = {{ 1, 2, 3, 4, 5},
*** 20,26 ****

  volatile int foo;

! int main1 (int A[N][N])
  {

    int i,j;
--- 20,26 ----

  volatile int foo;

! int main1 (int A[N][N], int n)
  {

    int i,j;
*************** int main1 (int A[N][N])
*** 28,34 ****
    /* vectorizable */
    for (i = 1; i < N; i++)
    {
!     for (j = 0; j < N; j++)
      {
        A[i][j] = A[i-1][j] + A[i][j];
      }
--- 28,34 ----
    /* vectorizable */
    for (i = 1; i < N; i++)
    {
!     for (j = 0; j < n; j++)
      {
        A[i][j] = A[i-1][j] + A[i][j];
      }
*************** int main (void)
*** 42,48 ****
    int i,j;

    foo = 0;
!   main1 (a);

    /* check results: */

--- 42,48 ----
    int i,j;

    foo = 0;
!   main1 (a, N);

    /* check results: */



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [patch] [4.3 projects] outer-loop vectorization patch 2/n
       [not found] <OF8F786C61.17F48632-ONC2257331.006C2237-C2257334.003A7381@LocalDomain>
@ 2007-08-11 11:57 ` Dorit Nuzman
  0 siblings, 0 replies; 4+ messages in thread
From: Dorit Nuzman @ 2007-08-11 11:57 UTC (permalink / raw)
  To: Dorit Nuzman; +Cc: gcc-patches, Jack Howarth

Dorit Nuzman/Haifa/IBM wrote on 11/08/2007 13:38:26:

> Hi,
>
> This is the second part of http://gcc.gnu.org/ml/gcc-
> patches/2007-08/msg00461.html. It adds support for memory-references
> in the inner-loop of outer-loop-vectorization. I'll use the
> following example to describe the features that were added:
...
>         * gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: New test.

I fixed this testcase at the last minute (following a failure pointed out
to me by Jack Howarth - thanks to Jack for testing the outer-loop patch on
powerpc-apple-darwin9!). So I fixed one problem and introduced another - in
the patch I sent there's a typo in this testcase: it includes "tree-vect.h"
instead of "../../tree-vect.h". Sorry about that.

dorit


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [patch] [4.3 projects] outer-loop vectorization patch 2/n
@ 2007-08-11 10:36 Dorit Nuzman
  2007-08-12 15:02 ` Dorit Nuzman
  0 siblings, 1 reply; 4+ messages in thread
From: Dorit Nuzman @ 2007-08-11 10:36 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 9691 bytes --]


Hi,

This is the second part of
http://gcc.gnu.org/ml/gcc-patches/2007-08/msg00461.html. It adds support
for memory-references in the inner-loop of outer-loop-vectorization. I'll
use the following example to describe the features that were added:

            for (i=0; i<N; i++){
                  s=0;
                  for (j=0; j<M; j++)
                        s += a[i+j] * b[j];
                  a[i]=s;
            }
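
As a rough illustration of what vectorizing the outer loop means for this
example, here is a hand-written scalar emulation. It is only a sketch and
not what the patch generates: the names scalar_loop, outer_vectorized and
VF are hypothetical, int is used instead of a floating-point type to keep
the comparison exact, and N is assumed to be a multiple of VF. Each "lane"
handles one value of i; b[j] does not depend on i, so the same value feeds
all lanes.

```c
#define N 16   /* outer trip count; assumed to be a multiple of VF */
#define M 8    /* inner trip count */
#define VF 4   /* hypothetical vectorization factor (lanes per vector) */

/* Scalar reference: the example loop, with a[] holding N+M elements.  */
void scalar_loop(int *a, const int *b)
{
    for (int i = 0; i < N; i++) {
        int s = 0;
        for (int j = 0; j < M; j++)
            s += a[i + j] * b[j];
        a[i] = s;
    }
}

/* Outer loop vectorized by hand: lane l of each "vector" belongs to
   outer iteration i+l, and the inner loop stays a sequential loop.  */
void outer_vectorized(int *a, const int *b)
{
    for (int i = 0; i < N; i += VF) {
        int vs[VF] = {0};                  /* vector accumulator */
        for (int j = 0; j < M; j++) {
            int sb = b[j];                 /* same value for every lane */
            for (int l = 0; l < VF; l++)   /* stands for one vector op */
                vs[l] += a[i + l + j] * sb;
        }
        for (int l = 0; l < VF; l++)       /* stands for one vector store */
            a[i + l] = vs[l];
    }
}
```

Both functions can be run on identically initialized copies of a[] and
compared element by element. Note that the a[i+l+j] access starts at a
different offset for every j, which is why the misalignment of this load
changes inside the inner loop.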

The patch includes the following changes to the vectorizer:

- To analyze the initial-address and step of inner-loop references relative
to the outer-loop, I used the function split_constant_offset. I basically
take the BASE+INIT+OFFSET that was computed relative to the inner-loop and
analyze it relative to the outer-loop (as discussed here
http://gcc.gnu.org/ml/gcc-patches/2007-05/msg00648.html).
- Generally the vectorizer does not support invariant accesses. One
exception that we add here is memory-references in the inner-loop that have
a zero step in the outer-loop. For example, the b[j] access in the loop
example above. Because the access b[j] has no evolution in the outer-loop
we have to duplicate the value b[j] into all entries of the vector. At the
moment this is done by simply adding this duplication on top of the current
scheme: i.e. we continue to generate a regular vector load, and then we
extract the first element and duplicate it:
                        vb = b[j,j+1,j+2,j+3]
                        sb = BIT_FIELD_REF (vb, bitpos, bitsize) # extract b[j]
                        vb = {sb, sb, sb, sb}
(there are better ways to do this - so this will be improved later).
In order to be able to use the function 'vect_init_vector' to create the
vector vb above, we extend it to place the vector initialization at BSI (a
new argument passed to the function) instead of always inserting it at the
loop preheader.

- Support misaligned accesses. In case the misalignment remains fixed (i.e.
the step (stride) of the accesses in the inner-loop is a multiple of the
Vector Size (VS)), this can be vectorized using the optimized
realignment-scheme (which used to be called the "software-pipelined"
scheme, and is now called "optimized_explicit_realign"): the computation of
the misalignment can be taken out of the loop, and only one additional
vector load is generated (before the loop) instead of 2 in each iteration
(we basically do predictive-commoning here). In case the misalignment does
*not* remain fixed throughout the iterations of the loop (as is the case in
the example loop above), we cannot use the optimized scheme. Instead we
need to compute the misalignment inside the inner-loop along with the two
vector loads (this is the newly added "explicit_realign" scheme; for more
details see also
http://gcc.gnu.org/ml/gcc-patches/2007-06/msg00248.html).
(this is also planned to be improved later).
To support this, a couple of functions - 'vect_setup_realignment' and
'vectorizable_load' - that used to support only the optimized realignment
scheme were modified to support both schemes. See the detailed documentation
in 'vect_setup_realignment' and 'vect_supportable_dr_alignment'.

- The functions that deal with creating/initializing/updating the pointer
that is used for the vector loads/stores had to be modified a bit:
      * 'vect_create_data_ref_ptr' now needs to create an update chain both
in the inner-loop and in the outer-loop. It also needs to consider if the
outer-loop step is 0 (the only case where the pointer will not be bumped by
VS (Vector Size)). See detailed documentation in this function.
      * 'vect_create_addr_base_for_vector_ref' needs to know relative to
which loop the address-base is requested (to know whether to use the
step/offset/init relative to the inner or outer loop).
      * 'bump_vector_ptr' is extended to support bump amounts other than VS
(we need to bump by VS-1 for the "explicit_realign" scheme).
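
To make the "explicit_realign" idea and the VS-1 bump more concrete, here
is a small scalar emulation. It reflects my reading of the scheme and is
not code from the patch: aligned_load, realign_load and VF are hypothetical
names, indices are counted in elements rather than bytes, and base is
assumed to start on a vector boundary with the array padded to a multiple
of VF elements.

```c
#include <stddef.h>

#define VF 4  /* elements per vector; an assumption for this sketch */

/* Emulate one aligned vector load: fetch the whole VF-element block
   that contains element idx (the address is floored to a boundary).  */
static void aligned_load(const float *base, size_t idx, float v[VF])
{
    size_t start = idx - idx % VF;
    for (size_t k = 0; k < VF; k++)
        v[k] = base[start + k];
}

/* Emulated explicit realignment: two aligned loads plus a select that
   depends on the misalignment, all computed at the point of the access.  */
void realign_load(const float *base, size_t idx, float out[VF])
{
    float lo[VF], hi[VF];
    size_t mis = idx % VF;                 /* misalignment in elements */

    aligned_load(base, idx, lo);           /* block containing p */
    aligned_load(base, idx + VF - 1, hi);  /* block containing p + VS-1 */

    for (size_t k = 0; k < VF; k++)
        out[k] = (mis + k < VF) ? lo[mis + k] : hi[mis + k - VF];
}
```

Bumping by VF-1 elements (VS-1 bytes in the real scheme) rather than by a
full vector means that an already aligned access fetches the same block
twice instead of touching the block after it.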

Bootstrapped on powerpc64-linux,
bootstrapped with vectorization enabled on i386-linux,
passed full regression testing on both platforms.

I will wait at least a week to give people a chance to review and comment.

thanks,
dorit

ChangeLog:

        * tree-data-refs.c (split_constant_offset): Expose.
        * tree-data-refs.h (split_constant_offset): Add declaration.

        * tree-vectorizer.h (dr_alignment_support): Renamed
        dr_unaligned_software_pipeline to dr_explicit_realign_optimized.
        Added a new value dr_explicit_realign.
        (_stmt_vec_info): Added new fields: dr_base_address, dr_init,
        dr_offset, dr_step, and dr_aligned_to, along with new access
        functions for these fields: STMT_VINFO_DR_BASE_ADDRESS,
        STMT_VINFO_DR_INIT, STMT_VINFO_DR_OFFSET, STMT_VINFO_DR_STEP, and
        STMT_VINFO_DR_ALIGNED_TO.

        * tree-vectorizer.c (vect_supportable_dr_alignment): Add documentation.
        In case of outer-loop vectorization with non-fixed misalignment - use
        the dr_explicit_realign scheme instead of the optimized realignment
        scheme.
        (new_stmt_vec_info): Initialize new fields.

        * tree-vect-analyze.c (vect_compute_data_ref_alignment): Handle the
        'nested_in_vect_loop' case. Change verbosity level.
        (vect_analyze_data_ref_access): Handle the 'nested_in_vect_loop' case.
        Don't fail on zero step in the outer-loop for loads.
        (vect_analyze_data_refs): Call split_constant_offset to calculate base,
        offset and init relative to the outer-loop.

        * tree-vect-transform.c (vect_create_data_ref_ptr): Replace the unused
        BSI function argument with a new function argument - at_loop.
        Simplify the condition that determines STEP. Takes additional argument
        INV_P. Support outer-loop vectorization (handle the nested_in_vect_loop
        case), including zero step in the outer-loop. Call
        vect_create_addr_base_for_vector_ref with additional argument.
        (vect_create_addr_base_for_vector_ref): Takes additional argument LOOP.
        Updated function documentation. Handle the 'nested_in_vect_loop' case.
        Fixed and simplified calculation of step.
        (vectorizable_store): Call vect_create_data_ref_ptr with loop instead
        of bsi, and with additional argument. Call bump_vector_ptr with
        additional argument. Fix typos. Handle the 'nested_in_vect_loop' case.
        (vect_setup_realignment): Takes additional arguments INIT_ADDR and
        DR_ALIGNMENT_SUPPORT. Returns another value AT_LOOP. Handle the case
        when the realignment setup needs to take place inside the loop. Support
        the dr_explicit_realign scheme. Allow generating the optimized
        realignment scheme for outer-loop vectorization. Added documentation.
        (vectorizable_load): Support the dr_explicit_realign scheme. Handle the
        'nested_in_vect_loop' case, including loads that are invariant in the
        outer-loop and the realignment schemes. Handle the case when the
        realignment setup needs to take place inside the loop. Call
        vect_setup_realignment with additional arguments.  Call
        vect_create_data_ref_ptr with additional argument and with loop instead
        of bsi. Fix 80-column overflow. Fix typos. Rename PHI_STMT to PHI.
        (vect_gen_niters_for_prolog_loop): Call
        vect_create_addr_base_for_vector_ref with additional arguments.
        (vect_create_cond_for_align_checks): Likewise.
        (bump_vector_ptr): Updated to support the new dr_explicit_realign
        scheme: takes additional argument bump; argument ptr_incr is now
        optional; updated documentation.
        (vect_init_vector): Takes additional argument (bsi). Use it, if
        available, to insert the vector initialization.
        (get_initial_def_for_induction): Pass additional argument in call to
        vect_init_vector.
        (vect_get_vec_def_for_operand): Likewise.
        (vect_setup_realignment): Likewise.
        (vectorizable_load): Likewise.

testsuite/ChangeLog:

        * gcc.dg/vect/vect-117.c: Change inner-loop bound to
        unknown (so that outer-loop won't get analyzed).
        * gcc.dg/vect/vect-outer-1a.c: New test.
        * gcc.dg/vect/vect-outer-1b.c: New test.
        * gcc.dg/vect/vect-outer-1.c: New test.
        * gcc.dg/vect/vect-outer-2a.c: New test.
        * gcc.dg/vect/vect-outer-2b.c: New test.
        * gcc.dg/vect/vect-outer-2c.c: New test.
        * gcc.dg/vect/vect-outer-2.c: New test.
        * gcc.dg/vect/vect-outer-3a.c: New test.
        * gcc.dg/vect/vect-outer-3b.c: New test.
        * gcc.dg/vect/vect-outer-3c.c: New test.
        * gcc.dg/vect/vect-outer-3.c: New test.
        * gcc.dg/vect/vect-outer-4a.c: New test.
        * gcc.dg/vect/vect-outer-4b.c: New test.
        * gcc.dg/vect/vect-outer-4c.c: New test.
        * gcc.dg/vect/vect-outer-4d.c: New test.
        * gcc.dg/vect/vect-outer-4e.c: New test.
        * gcc.dg/vect/vect-outer-4f.c: New test.
        * gcc.dg/vect/vect-outer-4g.c: New test.
        * gcc.dg/vect/no-section-anchors-vect-outer-4h.c: New test.
        * gcc.dg/vect/vect-outer-4i.c: New test.
        * gcc.dg/vect/vect-outer-4j.c: New test.
        * gcc.dg/vect/vect-outer-4k.c: New test.
        * gcc.dg/vect/vect-outer-4l.c: New test.
        * gcc.dg/vect/vect-outer-4m.c: New test.
        * gcc.dg/vect/vect-outer-4.c: New test.
        * gcc.dg/vect/vect-outer-5.c: New test.
        * gcc.dg/vect/vect-outer-6.c: New test.
        * gcc.dg/vect/vect-outer-fir.c: New test.
        * gcc.dg/vect/vect-outer-fir-lb.c: New test.
        * gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: New test.

(See attached file: mainlineouterloopdiff23t.txt)

[-- Attachment #2: mainlineouterloopdiff23t.txt --]
[-- Type: text/plain, Size: 122367 bytes --]

*** tree-vect-transform.c.p1	2007-08-05 20:50:02.000000000 +0300
--- tree-vect-transform.c	2007-08-05 22:56:51.000000000 +0300
*************** along with GCC; see the file COPYING3.
*** 49,62 ****
  static bool vect_transform_stmt (tree, block_stmt_iterator *, bool *);
  static tree vect_create_destination_var (tree, tree);
  static tree vect_create_data_ref_ptr 
!   (tree, block_stmt_iterator *, tree, tree *, tree *, bool, tree); 
! static tree vect_create_addr_base_for_vector_ref (tree, tree *, tree);
! static tree vect_setup_realignment (tree, block_stmt_iterator *, tree *);
  static tree vect_get_new_vect_var (tree, enum vect_var_kind, const char *);
  static tree vect_get_vec_def_for_operand (tree, tree, tree *);
! static tree vect_init_vector (tree, tree, tree);
  static void vect_finish_stmt_generation 
!   (tree stmt, tree vec_stmt, block_stmt_iterator *bsi);
  static bool vect_is_simple_cond (tree, loop_vec_info); 
  static void update_vuses_to_preheader (tree, struct loop*);
  static void vect_create_epilog_for_reduction (tree, tree, enum tree_code, tree);
--- 49,62 ----
  static bool vect_transform_stmt (tree, block_stmt_iterator *, bool *);
  static tree vect_create_destination_var (tree, tree);
  static tree vect_create_data_ref_ptr 
!   (tree, struct loop*, tree, tree *, tree *, bool, tree, bool *); 
! static tree vect_create_addr_base_for_vector_ref 
!   (tree, tree *, tree, struct loop *);
  static tree vect_get_new_vect_var (tree, enum vect_var_kind, const char *);
  static tree vect_get_vec_def_for_operand (tree, tree, tree *);
! static tree vect_init_vector (tree, tree, tree, block_stmt_iterator *);
  static void vect_finish_stmt_generation 
!   (tree stmt, tree vec_stmt, block_stmt_iterator *);
  static bool vect_is_simple_cond (tree, loop_vec_info); 
  static void update_vuses_to_preheader (tree, struct loop*);
  static void vect_create_epilog_for_reduction (tree, tree, enum tree_code, tree);
*************** vect_model_load_cost (stmt_vec_info stmt
*** 610,616 ****
  
          break;
        }
!     case dr_unaligned_software_pipeline:
        {
          int outer_cost = 0;
  
--- 610,628 ----
  
          break;
        }
!     case dr_explicit_realign:
!       {
!         inner_cost += ncopies * (2*TARG_VEC_LOAD_COST + TARG_VEC_STMT_COST);
! 
!         /* FIXME: If the misalignment remains fixed across the iterations of
!            the containing loop, the following cost should be added to the
!            outside costs.  */
!         if (targetm.vectorize.builtin_mask_for_load)
!           inner_cost += TARG_VEC_STMT_COST;
! 
!         break;
!       }
!     case dr_explicit_realign_optimized:
        {
          int outer_cost = 0;
  
*************** vect_get_new_vect_var (tree type, enum v
*** 707,712 ****
--- 719,737 ----
     STMT: The statement containing the data reference.
     NEW_STMT_LIST: Must be initialized to NULL_TREE or a statement list.
     OFFSET: Optional. If supplied, it is be added to the initial address.
+    LOOP:    Specify relative to which loop-nest should the address be computed.
+             For example, when the dataref is in an inner-loop nested in an
+ 	    outer-loop that is now being vectorized, LOOP can be either the
+ 	    outer-loop, or the inner-loop. The first memory location accessed
+ 	    by the following dataref ('in' points to short):
+ 
+ 		for (i=0; i<N; i++)
+ 		   for (j=0; j<M; j++)
+ 		     s += in[i+j]
+ 
+ 	    is as follows:
+ 	    if LOOP=i_loop:	&in		(relative to i_loop)
+ 	    if LOOP=j_loop: 	&in+i*2B	(relative to j_loop)
  
     Output:
     1. Return an SSA_NAME whose value is the address of the memory location of 
*************** vect_get_new_vect_var (tree type, enum v
*** 719,732 ****
  static tree
  vect_create_addr_base_for_vector_ref (tree stmt,
                                        tree *new_stmt_list,
! 				      tree offset)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
!   tree data_ref_base_expr = unshare_expr (DR_BASE_ADDRESS (dr));
!   tree base_name = build_fold_indirect_ref (data_ref_base_expr);
    tree data_ref_base_var;
-   tree data_ref_base;
    tree new_base_stmt;
    tree vec_stmt;
    tree addr_base, addr_expr;
--- 744,758 ----
  static tree
  vect_create_addr_base_for_vector_ref (tree stmt,
                                        tree *new_stmt_list,
! 				      tree offset,
! 				      struct loop *loop)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
!   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
!   tree data_ref_base = unshare_expr (DR_BASE_ADDRESS (dr));
!   tree base_name;
    tree data_ref_base_var;
    tree new_base_stmt;
    tree vec_stmt;
    tree addr_base, addr_expr;
*************** vect_create_addr_base_for_vector_ref (tr
*** 734,745 ****
    tree base_offset = unshare_expr (DR_OFFSET (dr));
    tree init = unshare_expr (DR_INIT (dr));
    tree vect_ptr_type, addr_expr2;
!   
!   
!   /* Create data_ref_base */
!   data_ref_base_var = create_tmp_var (TREE_TYPE (data_ref_base_expr), "batmp");
    add_referenced_var (data_ref_base_var);
!   data_ref_base = force_gimple_operand (data_ref_base_expr, &new_base_stmt,
  					true, data_ref_base_var);
    append_to_statement_list_force(new_base_stmt, new_stmt_list);
  
--- 760,785 ----
    tree base_offset = unshare_expr (DR_OFFSET (dr));
    tree init = unshare_expr (DR_INIT (dr));
    tree vect_ptr_type, addr_expr2;
!   tree step = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
! 
!   gcc_assert (loop);
!   if (loop != containing_loop)
!     {
!       loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
!       struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
! 
!       gcc_assert (nested_in_vect_loop_p (loop, stmt));
! 
!       data_ref_base = unshare_expr (STMT_VINFO_DR_BASE_ADDRESS (stmt_info));
!       base_offset = unshare_expr (STMT_VINFO_DR_OFFSET (stmt_info));
!       init = unshare_expr (STMT_VINFO_DR_INIT (stmt_info));
!     }
! 
!   /* Create base_offset */
!   base_name = build_fold_indirect_ref (data_ref_base);
!   data_ref_base_var = create_tmp_var (TREE_TYPE (data_ref_base), "batmp");
    add_referenced_var (data_ref_base_var);
!   data_ref_base = force_gimple_operand (data_ref_base, &new_base_stmt,
  					true, data_ref_base_var);
    append_to_statement_list_force(new_base_stmt, new_stmt_list);
  
*************** vect_create_addr_base_for_vector_ref (tr
*** 754,769 ****
    if (offset)
      {
        tree tmp = create_tmp_var (sizetype, "offset");
-       tree step; 
- 
-       /* For interleaved access step we divide STEP by the size of the
-         interleaving group.  */
-       if (DR_GROUP_SIZE (stmt_info))
- 	step = fold_build2 (TRUNC_DIV_EXPR, TREE_TYPE (offset), DR_STEP (dr),
- 			    build_int_cst (TREE_TYPE (offset),
- 					   DR_GROUP_SIZE (stmt_info)));
-       else
- 	step = DR_STEP (dr);
  
        add_referenced_var (tmp);
        offset = fold_build2 (MULT_EXPR, TREE_TYPE (offset), offset, step);
--- 794,799 ----
*************** vect_create_addr_base_for_vector_ref (tr
*** 812,818 ****
     1. STMT: a stmt that references memory. Expected to be of the form
           GIMPLE_MODIFY_STMT <name, data-ref> or
  	 GIMPLE_MODIFY_STMT <data-ref, name>.
!    2. BSI: block_stmt_iterator where new stmts can be added.
     3. OFFSET (optional): an offset to be added to the initial address accessed
          by the data-ref in STMT.
     4. ONLY_INIT: indicate if vp is to be updated in the loop, or remain
--- 842,848 ----
     1. STMT: a stmt that references memory. Expected to be of the form
           GIMPLE_MODIFY_STMT <name, data-ref> or
  	 GIMPLE_MODIFY_STMT <data-ref, name>.
!    2. AT_LOOP: the loop where the vector memref is to be created.
     3. OFFSET (optional): an offset to be added to the initial address accessed
          by the data-ref in STMT.
     4. ONLY_INIT: indicate if vp is to be updated in the loop, or remain
*************** vect_create_addr_base_for_vector_ref (tr
*** 839,856 ****
  
        Return the increment stmt that updates the pointer in PTR_INCR.
  
!    3. Return the pointer.  */
  
  static tree
! vect_create_data_ref_ptr (tree stmt,
! 			  block_stmt_iterator *bsi ATTRIBUTE_UNUSED,
  			  tree offset, tree *initial_address, tree *ptr_incr,
! 			  bool only_init, tree type)
  {
    tree base_name;
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree vect_ptr_type;
    tree vect_ptr;
--- 869,890 ----
  
        Return the increment stmt that updates the pointer in PTR_INCR.
  
!    3. Set INV_P to true if the access pattern of the data reference in the 
!       vectorized loop is invariant. Set it to false otherwise.
! 
!    4. Return the pointer.  */
  
  static tree
! vect_create_data_ref_ptr (tree stmt, struct loop *at_loop,
  			  tree offset, tree *initial_address, tree *ptr_incr,
! 			  bool only_init, tree type, bool *inv_p)
  {
    tree base_name;
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+   bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
+   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree vect_ptr_type;
    tree vect_ptr;
*************** vect_create_data_ref_ptr (tree stmt,
*** 858,868 ****
    tree new_temp;
    tree vec_stmt;
    tree new_stmt_list = NULL_TREE;
!   edge pe = loop_preheader_edge (loop);
    basic_block new_bb;
    tree vect_ptr_init;
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
  
    base_name =  build_fold_indirect_ref (unshare_expr (DR_BASE_ADDRESS (dr)));
  
    if (vect_print_dump_info (REPORT_DETAILS))
--- 892,922 ----
    tree new_temp;
    tree vec_stmt;
    tree new_stmt_list = NULL_TREE;
!   edge pe;
    basic_block new_bb;
    tree vect_ptr_init;
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+   tree vptr;
+   block_stmt_iterator incr_bsi;
+   bool insert_after;
+   tree indx_before_incr, indx_after_incr;
+   tree incr;
+   tree step;
+ 
+   /* Check the step (evolution) of the load in LOOP, and record
+      whether it's invariant.  */
+   if (nested_in_vect_loop)
+     step = STMT_VINFO_DR_STEP (stmt_info);
+   else
+     step = DR_STEP (STMT_VINFO_DATA_REF (stmt_info));
+     
+   if (tree_int_cst_compare (step, size_zero_node) == 0)
+     *inv_p = true;
+   else
+     *inv_p = false;
  
+   /* Create an expression for the first address accessed by this load
+      in LOOP.  */ 
    base_name =  build_fold_indirect_ref (unshare_expr (DR_BASE_ADDRESS (dr)));
  
    if (vect_print_dump_info (REPORT_DETAILS))
*************** vect_create_data_ref_ptr (tree stmt,
*** 905,916 ****
  
    var_ann (vect_ptr)->subvars = DR_SUBVARS (dr);
  
    /** (3) Calculate the initial address the vector-pointer, and set
            the vector-pointer to point to it before the loop:  **/
  
    /* Create: (&(base[init_val+offset]) in the loop preheader.  */
    new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
!                                                    offset);
    pe = loop_preheader_edge (loop);
    new_bb = bsi_insert_on_edge_immediate (pe, new_stmt_list);
    gcc_assert (!new_bb);
--- 959,1002 ----
  
    var_ann (vect_ptr)->subvars = DR_SUBVARS (dr);
  
+   /** Note: If the dataref is in an inner-loop nested in LOOP, and we are
+       vectorizing LOOP (i.e. outer-loop vectorization), we need to create two
+       def-use update cycles for the pointer: One relative to the outer-loop
+       (LOOP), which is what steps (3) and (4) below do. The other is relative
+       to the inner-loop (which is the inner-most loop containing the dataref),
+       and this is done by step (5) below.
+ 
+       When vectorizing inner-most loops, the vectorized loop (LOOP) is also the
+       inner-most loop, and so steps (3),(4) work the same, and step (5) is
+       redundant.  Steps (3),(4) create the following:
+ 
+ 	vp0 = &base_addr;
+ 	LOOP:	vp1 = phi(vp0,vp2)
+ 		...  
+ 		...
+ 		vp2 = vp1 + step
+ 		goto LOOP
+ 			
+       If there is an inner-loop nested in loop, then step (5) will also be
+       applied, and an additional update in the inner-loop will be created:
+ 
+ 	vp0 = &base_addr;
+ 	LOOP:   vp1 = phi(vp0,vp2)
+ 		...
+         inner:     vp3 = phi(vp1,vp4)
+ 	           vp4 = vp3 + inner_step
+ 	           if () goto inner
+ 		...
+ 		vp2 = vp1 + step
+ 		if () goto LOOP   */
+ 
    /** (3) Calculate the initial address the vector-pointer, and set
            the vector-pointer to point to it before the loop:  **/
  
    /* Create: (&(base[init_val+offset]) in the loop preheader.  */
+ 
    new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list,
!                                                    offset, loop);
    pe = loop_preheader_edge (loop);
    new_bb = bsi_insert_on_edge_immediate (pe, new_stmt_list);
    gcc_assert (!new_bb);
*************** vect_create_data_ref_ptr (tree stmt,
*** 925,949 ****
    gcc_assert (!new_bb);
  
  
!   /** (4) Handle the updating of the vector-pointer inside the loop: **/
  
!   if (only_init) /* No update in loop is required.  */
      {
        /* Copy the points-to information if it exists. */
        if (DR_PTR_INFO (dr))
          duplicate_ssa_name_ptr_info (vect_ptr_init, DR_PTR_INFO (dr));
!       return vect_ptr_init;
      }
    else
      {
!       block_stmt_iterator incr_bsi;
!       bool insert_after;
!       tree indx_before_incr, indx_after_incr;
!       tree incr;
  
        standard_iv_increment_position (loop, &incr_bsi, &insert_after);
        create_iv (vect_ptr_init,
! 		 fold_convert (vect_ptr_type, TYPE_SIZE_UNIT (vectype)),
  		 NULL_TREE, loop, &incr_bsi, insert_after,
  		 &indx_before_incr, &indx_after_incr);
        incr = bsi_stmt (incr_bsi);
--- 1011,1041 ----
    gcc_assert (!new_bb);
  
  
!   /** (4) Handle the updating of the vector-pointer inside the loop.
! 	  This is needed when ONLY_INIT is false, and also when AT_LOOP
! 	  is the inner-loop nested in LOOP (during outer-loop vectorization).
!    **/
  
!   if (only_init && at_loop == loop) /* No update in loop is required.  */
      {
        /* Copy the points-to information if it exists. */
        if (DR_PTR_INFO (dr))
          duplicate_ssa_name_ptr_info (vect_ptr_init, DR_PTR_INFO (dr));
!       vptr = vect_ptr_init;
      }
    else
      {
!       /* The step of the vector pointer is the Vector Size.  */
!       tree step = TYPE_SIZE_UNIT (vectype);
!       /* One exception to the above is when the scalar step of the load in 
! 	 LOOP is zero. In this case the step here is also zero.  */
!       if (*inv_p)
! 	step = size_zero_node;
  
        standard_iv_increment_position (loop, &incr_bsi, &insert_after);
+ 
        create_iv (vect_ptr_init,
! 		 fold_convert (vect_ptr_type, step),
  		 NULL_TREE, loop, &incr_bsi, insert_after,
  		 &indx_before_incr, &indx_after_incr);
        incr = bsi_stmt (incr_bsi);
*************** vect_create_data_ref_ptr (tree stmt,
*** 961,975 ****
        if (ptr_incr)
  	*ptr_incr = incr;
  
!       return indx_before_incr;
      }
  }
  
  
  /* Function bump_vector_ptr
  
!    Increment a pointer (to a vector type) by vector-size. Connect the new 
!    increment stmt to the existing def-use update-chain of the pointer.
  
     The pointer def-use update-chain before this function:
                          DATAREF_PTR = phi (p_0, p_2)
--- 1053,1103 ----
        if (ptr_incr)
  	*ptr_incr = incr;
  
!       vptr = indx_before_incr;
      }
+ 
+   if (!nested_in_vect_loop || only_init)
+     return vptr;
+ 
+ 
+   /** (5) Handle the updating of the vector-pointer inside the inner-loop
+ 	  nested in LOOP, if it exists: **/
+ 
+   gcc_assert (nested_in_vect_loop);
+   if (!only_init)
+     {
+       standard_iv_increment_position (containing_loop, &incr_bsi, 
+ 				      &insert_after);
+       create_iv (vptr, fold_convert (vect_ptr_type, DR_STEP (dr)), NULL_TREE, 
+ 		 containing_loop, &incr_bsi, insert_after, &indx_before_incr, 
+ 		 &indx_after_incr);
+       incr = bsi_stmt (incr_bsi);
+       set_stmt_info (stmt_ann (incr), new_stmt_vec_info (incr, loop_vinfo));
+ 
+       /* Copy the points-to information if it exists. */
+       if (DR_PTR_INFO (dr))
+ 	{
+ 	  duplicate_ssa_name_ptr_info (indx_before_incr, DR_PTR_INFO (dr));
+ 	  duplicate_ssa_name_ptr_info (indx_after_incr, DR_PTR_INFO (dr));
+ 	}
+       merge_alias_info (vect_ptr_init, indx_before_incr);
+       merge_alias_info (vect_ptr_init, indx_after_incr);
+       if (ptr_incr)
+ 	*ptr_incr = incr;
+ 
+       return indx_before_incr; 
+     }
+   else
+     gcc_unreachable ();
  }
  
  
  /* Function bump_vector_ptr
  
!    Increment a pointer (to a vector type) by vector-size. If requested,
!    i.e. if PTR_INCR is given, then also connect the new increment stmt
!    to the existing def-use update-chain of the pointer, by modifying
!    the PTR_INCR as illustrated below:
  
     The pointer def-use update-chain before this function:
                          DATAREF_PTR = phi (p_0, p_2)
*************** vect_create_data_ref_ptr (tree stmt,
*** 979,996 ****
     The pointer def-use update-chain after this function:
                          DATAREF_PTR = phi (p_0, p_2)
                          ....
!                         NEW_DATAREF_PTR = DATAREF_PTR + vector_size
                          ....
          PTR_INCR:       p_2 = NEW_DATAREF_PTR + step
  
     Input:
     DATAREF_PTR - ssa_name of a pointer (to vector type) that is being updated 
                   in the loop.
!    PTR_INCR - the stmt that updates the pointer in each iteration of the loop.
!               The increment amount across iterations is also expected to be
!               vector_size.      
     BSI - location where the new update stmt is to be placed.
     STMT - the original scalar memory-access stmt that is being vectorized.
  
     Output: Return NEW_DATAREF_PTR as illustrated above.
     
--- 1107,1126 ----
     The pointer def-use update-chain after this function:
                          DATAREF_PTR = phi (p_0, p_2)
                          ....
!                         NEW_DATAREF_PTR = DATAREF_PTR + BUMP
                          ....
          PTR_INCR:       p_2 = NEW_DATAREF_PTR + step
  
     Input:
     DATAREF_PTR - ssa_name of a pointer (to vector type) that is being updated 
                   in the loop.
!    PTR_INCR - optional. The stmt that updates the pointer in each iteration of 
! 	      the loop.  The increment amount across iterations is expected
! 	      to be vector_size.      
     BSI - location where the new update stmt is to be placed.
     STMT - the original scalar memory-access stmt that is being vectorized.
+    BUMP - optional. The offset by which to bump the pointer. If not given,
+ 	  the offset is assumed to be vector_size.
  
     Output: Return NEW_DATAREF_PTR as illustrated above.
     
*************** vect_create_data_ref_ptr (tree stmt,
*** 998,1004 ****
  
  static tree
  bump_vector_ptr (tree dataref_ptr, tree ptr_incr, block_stmt_iterator *bsi,
!                  tree stmt)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
--- 1128,1134 ----
  
  static tree
  bump_vector_ptr (tree dataref_ptr, tree ptr_incr, block_stmt_iterator *bsi,
!                  tree stmt, tree bump)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
*************** bump_vector_ptr (tree dataref_ptr, tree 
*** 1011,1016 ****
--- 1141,1149 ----
    use_operand_p use_p;
    tree new_dataref_ptr;
  
+   if (bump)
+     update = bump;
+     
    incr_stmt = build_gimple_modify_stmt (ptr_var,
  					build2 (POINTER_PLUS_EXPR, vptr_type,
  						dataref_ptr, update));
*************** bump_vector_ptr (tree dataref_ptr, tree 
*** 1018,1023 ****
--- 1151,1164 ----
    GIMPLE_STMT_OPERAND (incr_stmt, 0) = new_dataref_ptr;
    vect_finish_stmt_generation (stmt, incr_stmt, bsi);
  
+   /* Copy the points-to information if it exists. */
+   if (DR_PTR_INFO (dr))
+     duplicate_ssa_name_ptr_info (new_dataref_ptr, DR_PTR_INFO (dr));
+   merge_alias_info (new_dataref_ptr, dataref_ptr);
+ 
+   if (!ptr_incr)
+     return new_dataref_ptr;
+ 
    /* Update the vector-pointer's cross-iteration increment.  */
    FOR_EACH_SSA_USE_OPERAND (use_p, ptr_incr, iter, SSA_OP_USE)
      {
*************** bump_vector_ptr (tree dataref_ptr, tree 
*** 1029,1039 ****
          gcc_assert (tree_int_cst_compare (use, update) == 0);
      }
  
-   /* Copy the points-to information if it exists. */
-   if (DR_PTR_INFO (dr))
-     duplicate_ssa_name_ptr_info (new_dataref_ptr, DR_PTR_INFO (dr));
-   merge_alias_info (new_dataref_ptr, dataref_ptr);
- 
    return new_dataref_ptr;
  }
  
--- 1170,1175 ----
*************** vect_create_destination_var (tree scalar
*** 1068,1082 ****
  /* Function vect_init_vector.
  
     Insert a new stmt (INIT_STMT) that initializes a new vector variable with
!    the vector elements of VECTOR_VAR. Return the DEF of INIT_STMT. It will be
!    used in the vectorization of STMT.  */
  
  static tree
! vect_init_vector (tree stmt, tree vector_var, tree vector_type)
  {
    stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
-   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
-   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree new_var;
    tree init_stmt;
    tree vec_oprnd;
--- 1204,1219 ----
  /* Function vect_init_vector.
  
     Insert a new stmt (INIT_STMT) that initializes a new vector variable with
!    the vector elements of VECTOR_VAR. Place the initialization at BSI if it
!    is not NULL. Otherwise, place the initialization at the loop preheader.
!    Return the DEF of INIT_STMT. 
!    It will be used in the vectorization of STMT.  */
  
  static tree
! vect_init_vector (tree stmt, tree vector_var, tree vector_type,
! 		  block_stmt_iterator *bsi)
  {
    stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
    tree new_var;
    tree init_stmt;
    tree vec_oprnd;
*************** vect_init_vector (tree stmt, tree vector
*** 1084,1102 ****
    tree new_temp;
    basic_block new_bb;
   
-   if (nested_in_vect_loop_p (loop, stmt))
-     loop = loop->inner;
- 
    new_var = vect_get_new_vect_var (vector_type, vect_simple_var, "cst_");
    add_referenced_var (new_var); 
-  
    init_stmt = build_gimple_modify_stmt (new_var, vector_var);
    new_temp = make_ssa_name (new_var, init_stmt);
    GIMPLE_STMT_OPERAND (init_stmt, 0) = new_temp;
  
!   pe = loop_preheader_edge (loop);
!   new_bb = bsi_insert_on_edge_immediate (pe, init_stmt);
!   gcc_assert (!new_bb);
  
    if (vect_print_dump_info (REPORT_DETAILS))
      {
--- 1221,1245 ----
    tree new_temp;
    basic_block new_bb;
   
    new_var = vect_get_new_vect_var (vector_type, vect_simple_var, "cst_");
    add_referenced_var (new_var); 
    init_stmt = build_gimple_modify_stmt (new_var, vector_var);
    new_temp = make_ssa_name (new_var, init_stmt);
    GIMPLE_STMT_OPERAND (init_stmt, 0) = new_temp;
  
!   if (bsi)
!     vect_finish_stmt_generation (stmt, init_stmt, bsi);
!   else
!     {
!       loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
!       struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
! 
!       if (nested_in_vect_loop_p (loop, stmt))
!         loop = loop->inner;
!       pe = loop_preheader_edge (loop);
!       new_bb = bsi_insert_on_edge_immediate (pe, init_stmt);
!       gcc_assert (!new_bb);
!     }
  
    if (vect_print_dump_info (REPORT_DETAILS))
      {
*************** get_initial_def_for_induction (tree iv_p
*** 1234,1240 ****
  	}
        /* Create a vector from [new_name_0, new_name_1, ..., new_name_nunits-1]  */
        vec = build_constructor_from_list (vectype, nreverse (t));
!       vec_init = vect_init_vector (iv_phi, vec, vectype);
      }
  
  
--- 1377,1383 ----
  	}
        /* Create a vector from [new_name_0, new_name_1, ..., new_name_nunits-1]  */
        vec = build_constructor_from_list (vectype, nreverse (t));
!       vec_init = vect_init_vector (iv_phi, vec, vectype, NULL);
      }
  
  
*************** get_initial_def_for_induction (tree iv_p
*** 1255,1261 ****
    for (i = 0; i < nunits; i++)
      t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
    vec = build_constructor_from_list (vectype, t);
!   vec_step = vect_init_vector (iv_phi, vec, vectype);
  
  
    /* Create the following def-use cycle:
--- 1398,1404 ----
    for (i = 0; i < nunits; i++)
      t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
    vec = build_constructor_from_list (vectype, t);
!   vec_step = vect_init_vector (iv_phi, vec, vectype, NULL);
  
  
    /* Create the following def-use cycle:
*************** get_initial_def_for_induction (tree iv_p
*** 1311,1317 ****
        for (i = 0; i < nunits; i++)
  	t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
        vec = build_constructor_from_list (vectype, t);
!       vec_step = vect_init_vector (iv_phi, vec, vectype);
  
        vec_def = induc_def;
        prev_stmt_vinfo = vinfo_for_stmt (induction_phi);
--- 1454,1460 ----
        for (i = 0; i < nunits; i++)
  	t = tree_cons (NULL_TREE, unshare_expr (new_name), t);
        vec = build_constructor_from_list (vectype, t);
!       vec_step = vect_init_vector (iv_phi, vec, vectype, NULL);
  
        vec_def = induc_def;
        prev_stmt_vinfo = vinfo_for_stmt (induction_phi);
*************** vect_get_vec_def_for_operand (tree op, t
*** 1448,1454 ****
          vector_type = get_vectype_for_scalar_type (TREE_TYPE (op));
          vec_cst = build_vector (vector_type, t);
  
!         return vect_init_vector (stmt, vec_cst, vector_type);
        }
  
      /* Case 2: operand is defined outside the loop - loop invariant.  */
--- 1591,1597 ----
          vector_type = get_vectype_for_scalar_type (TREE_TYPE (op));
          vec_cst = build_vector (vector_type, t);
  
!         return vect_init_vector (stmt, vec_cst, vector_type, NULL);
        }
  
      /* Case 2: operand is defined outside the loop - loop invariant.  */
*************** vect_get_vec_def_for_operand (tree op, t
*** 1469,1476 ****
  	/* FIXME: use build_constructor directly.  */
  	vector_type = get_vectype_for_scalar_type (TREE_TYPE (def));
          vec_inv = build_constructor_from_list (vector_type, t);
! 
!         return vect_init_vector (stmt, vec_inv, vector_type);
        }
  
      /* Case 3: operand is defined inside the loop.  */
--- 1612,1618 ----
  	/* FIXME: use build_constructor directly.  */
  	vector_type = get_vectype_for_scalar_type (TREE_TYPE (def));
          vec_inv = build_constructor_from_list (vector_type, t);
!         return vect_init_vector (stmt, vec_inv, vector_type, NULL);
        }
  
      /* Case 3: operand is defined inside the loop.  */
*************** vectorizable_store (tree stmt, block_stm
*** 4113,4119 ****
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    enum machine_mode vec_mode;
    tree dummy;
!   enum dr_alignment_support alignment_support_cheme;
    ssa_op_iter iter;
    def_operand_p def_p;
    tree def, def_stmt;
--- 4255,4261 ----
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    enum machine_mode vec_mode;
    tree dummy;
!   enum dr_alignment_support alignment_support_scheme;
    ssa_op_iter iter;
    def_operand_p def_p;
    tree def, def_stmt;
*************** vectorizable_store (tree stmt, block_stm
*** 4127,4133 ****
--- 4269,4278 ----
    bool strided_store = false;
    unsigned int group_size, i;
    VEC(tree,heap) *dr_chain = NULL, *oprnds = NULL, *result_chain = NULL;
+   bool inv_p;
+ 
    gcc_assert (ncopies >= 1);
+ 
    /* FORNOW. This restriction should be relaxed.  */
    if (nested_in_vect_loop_p (loop, stmt) && ncopies > 1)
      {
*************** vectorizable_store (tree stmt, block_stm
*** 4201,4206 ****
--- 4346,4354 ----
  
        DR_GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))++;
  
+       /* FORNOW */
+       gcc_assert (!nested_in_vect_loop_p (loop, stmt));
+ 
        /* We vectorize all the stmts of the interleaving group when we
  	 reach the last stmt in the group.  */
        if (DR_GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt)) 
*************** vectorizable_store (tree stmt, block_stm
*** 4223,4231 ****
    dr_chain = VEC_alloc (tree, heap, group_size);
    oprnds = VEC_alloc (tree, heap, group_size);
  
!   alignment_support_cheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_cheme);
!   gcc_assert (alignment_support_cheme == dr_aligned);  /* FORNOW */
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
--- 4371,4379 ----
    dr_chain = VEC_alloc (tree, heap, group_size);
    oprnds = VEC_alloc (tree, heap, group_size);
  
!   alignment_support_scheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_scheme);
!   gcc_assert (alignment_support_scheme == dr_aligned);  /* FORNOW */
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
*************** vectorizable_store (tree stmt, block_stm
*** 4295,4303 ****
  	      VEC_quick_push(tree, oprnds, vec_oprnd); 
  	      next_stmt = DR_GROUP_NEXT_DR (vinfo_for_stmt (next_stmt));
  	    }
! 	  dataref_ptr = vect_create_data_ref_ptr (first_stmt, bsi, NULL_TREE, 
  						  &dummy, &ptr_incr, false,
! 						  TREE_TYPE (vec_oprnd));
  	}
        else 
  	{
--- 4443,4452 ----
  	      VEC_quick_push(tree, oprnds, vec_oprnd); 
  	      next_stmt = DR_GROUP_NEXT_DR (vinfo_for_stmt (next_stmt));
  	    }
! 	  dataref_ptr = vect_create_data_ref_ptr (first_stmt, NULL, NULL_TREE, 
  						  &dummy, &ptr_incr, false,
! 						  TREE_TYPE (vec_oprnd), &inv_p);
! 	  gcc_assert (!inv_p);
  	}
        else 
  	{
*************** vectorizable_store (tree stmt, block_stm
*** 4315,4321 ****
  	      VEC_replace(tree, dr_chain, i, vec_oprnd);
  	      VEC_replace(tree, oprnds, i, vec_oprnd);
  	    }
! 	  dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);
  	}
  
        if (strided_store)
--- 4464,4471 ----
  	      VEC_replace(tree, dr_chain, i, vec_oprnd);
  	      VEC_replace(tree, oprnds, i, vec_oprnd);
  	    }
! 	  dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);
  	}
  
        if (strided_store)
*************** vectorizable_store (tree stmt, block_stm
*** 4375,4381 ****
  	  if (!next_stmt)
  	    break;
  	  /* Bump the vector pointer.  */
! 	  dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);
  	}
      }
  
--- 4525,4532 ----
  	  if (!next_stmt)
  	    break;
  	  /* Bump the vector pointer.  */
! 	  dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);
  	}
      }
  
*************** vectorizable_store (tree stmt, block_stm
*** 4386,4399 ****
  /* Function vect_setup_realignment
    
     This function is called when vectorizing an unaligned load using
!    the dr_unaligned_software_pipeline scheme.
     This function generates the following code at the loop prolog:
  
        p = initial_addr;
!       msq_init = *(floor(p));   # prolog load
        realignment_token = call target_builtin; 
      loop:
!       msq = phi (msq_init, ---)
  
     The code above sets up a new (vector) pointer, pointing to the first 
     location accessed by STMT, and a "floor-aligned" load using that pointer.
--- 4537,4553 ----
  /* Function vect_setup_realignment
    
     This function is called when vectorizing an unaligned load using
!    the dr_explicit_realign[_optimized] scheme.
     This function generates the following code at the loop prolog:
  
        p = initial_addr;
!    x  msq_init = *(floor(p));   # prolog load
        realignment_token = call target_builtin; 
      loop:
!    x  msq = phi (msq_init, ---)
! 
!    The stmts marked with x are generated only for the case of 
!    dr_explicit_realign_optimized.
  
     The code above sets up a new (vector) pointer, pointing to the first 
     location accessed by STMT, and a "floor-aligned" load using that pointer.
*************** vectorizable_store (tree stmt, block_stm
*** 4402,4420 ****
     whose arguments are the result of the prolog-load (created by this
     function) and the result of a load that takes place in the loop (to be
     created by the caller to this function).
     The caller to this function uses the phi-result (msq) to create the 
     realignment code inside the loop, and sets up the missing phi argument,
     as follows:
- 
      loop: 
        msq = phi (msq_init, lsq)
        lsq = *(floor(p'));        # load in loop
        result = realign_load (msq, lsq, realignment_token);
  
     Input:
     STMT - (scalar) load stmt to be vectorized. This load accesses
            a memory location that may be unaligned.
     BSI - place where new code is to be inserted.
     
     Output:
     REALIGNMENT_TOKEN - the result of a call to the builtin_mask_for_load
--- 4556,4584 ----
     whose arguments are the result of the prolog-load (created by this
     function) and the result of a load that takes place in the loop (to be
     created by the caller to this function).
+ 
+    For the case of dr_explicit_realign_optimized:
     The caller to this function uses the phi-result (msq) to create the 
     realignment code inside the loop, and sets up the missing phi argument,
     as follows:
      loop: 
        msq = phi (msq_init, lsq)
        lsq = *(floor(p'));        # load in loop
        result = realign_load (msq, lsq, realignment_token);
  
+    For the case of dr_explicit_realign:
+     loop:
+       msq = *(floor(p)); 	# load in loop
+       p' = p + (VS-1);
+       lsq = *(floor(p'));	# load in loop
+       result = realign_load (msq, lsq, realignment_token);
+ 
     Input:
     STMT - (scalar) load stmt to be vectorized. This load accesses
            a memory location that may be unaligned.
     BSI - place where new code is to be inserted.
+    ALIGNMENT_SUPPORT_SCHEME - which of the two misalignment handling schemes
+ 			      is used.	
     
     Output:
     REALIGNMENT_TOKEN - the result of a call to the builtin_mask_for_load
*************** vectorizable_store (tree stmt, block_stm
*** 4423,4467 ****
  
  static tree
  vect_setup_realignment (tree stmt, block_stmt_iterator *bsi,
!                         tree *realignment_token)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   edge pe = loop_preheader_edge (loop);
    tree scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
    tree vec_dest;
-   tree init_addr;
    tree inc;
    tree ptr;
    tree data_ref;
    tree new_stmt;
    basic_block new_bb;
!   tree msq_init;
    tree new_temp;
    tree phi_stmt;
!   tree msq;
  
-   /* 1. Create msq_init = *(floor(p1)) in the loop preheader  */
-   vec_dest = vect_create_destination_var (scalar_dest, vectype);
-   ptr = vect_create_data_ref_ptr (stmt, bsi, NULL_TREE, &init_addr, &inc, true,
- 				  NULL_TREE);
-   data_ref = build1 (ALIGN_INDIRECT_REF, vectype, ptr);
-   new_stmt = build_gimple_modify_stmt (vec_dest, data_ref);
-   new_temp = make_ssa_name (vec_dest, new_stmt);
-   GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
-   new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
-   gcc_assert (!new_bb);
-   msq_init = GIMPLE_STMT_OPERAND (new_stmt, 0);
-   copy_virtual_operands (new_stmt, stmt);
-   update_vuses_to_preheader (new_stmt, loop);
  
!   /* 2. Create permutation mask, if required, in loop preheader.  */
    if (targetm.vectorize.builtin_mask_for_load)
      {
        tree builtin_decl;
  
        builtin_decl = targetm.vectorize.builtin_mask_for_load ();
        new_stmt = build_call_expr (builtin_decl, 1, init_addr);
        vec_dest = vect_create_destination_var (scalar_dest, 
--- 4587,4733 ----
  
  static tree
  vect_setup_realignment (tree stmt, block_stmt_iterator *bsi,
!                         tree *realignment_token,
! 			enum dr_alignment_support alignment_support_scheme,
! 			tree init_addr,
! 			struct loop **at_loop)
  {
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   edge pe;
    tree scalar_dest = GIMPLE_STMT_OPERAND (stmt, 0);
    tree vec_dest;
    tree inc;
    tree ptr;
    tree data_ref;
    tree new_stmt;
    basic_block new_bb;
!   tree msq_init = NULL_TREE;
    tree new_temp;
    tree phi_stmt;
!   tree msq = NULL_TREE;
!   tree stmts = NULL_TREE;
!   bool inv_p;
!   bool compute_in_loop = false;
!   bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
!   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
!   struct loop *loop_for_initial_load;
! 
!   gcc_assert (alignment_support_scheme == dr_explicit_realign
! 	      || alignment_support_scheme == dr_explicit_realign_optimized);	
! 
!   /* We need to generate three things:
!      1. the misalignment computation
!      2. the extra vector load (for the optimized realignment scheme).
!      3. the phi node for the two vectors from which the realignment is
! 	done (for the optimized realignment scheme).
!    */
! 
!   /* 1. Determine where to generate the misalignment computation. 
! 
!      If INIT_ADDR is NULL_TREE, this indicates that the misalignment 
!      calculation will be generated by this function, outside the loop (in the 
!      preheader).  Otherwise, INIT_ADDR has already been computed for us by the
!      caller, inside the loop. 
! 
!      Background: If the misalignment remains fixed throughout the iterations of
!      the loop, then both realignment schemes are applicable, and also the
!      misalignment computation can be done outside LOOP.  This is because we are
!      vectorizing LOOP, and so the memory accesses in LOOP advance in steps that
!      are a multiple of VS (the Vector Size), and therefore the misalignment in 
!      different vectorized LOOP iterations is always the same.  
!      The problem arises only if the memory access is in an inner-loop nested
!      inside LOOP, which is now being vectorized using outer-loop vectorization.
!      This is the only case when the misalignment of the memory access may not 
!      remain fixed throughout the iterations of the inner-loop (as explained in
!      detail in vect_supportable_dr_alignment).  In this case, not only is the 
!      optimized realignment scheme not applicable, but also the misalignment 
!      computation (and generation of the realignment token that is passed to 
!      REALIGN_LOAD) have to be done inside the loop.  
! 
!      In short, INIT_ADDR indicates whether we are in a COMPUTE_IN_LOOP mode 
!      or not, which in turn determines if the misalignment is computed inside 
!      the inner-loop, or outside LOOP.  */
! 
!   if (init_addr != NULL_TREE)
!     {
!       compute_in_loop = true; 
!       gcc_assert (alignment_support_scheme == dr_explicit_realign);
!     }
  
  
!   /* 2. Determine where to generate the extra vector load.
! 
!      For the optimized realignment scheme, instead of generating two vector
!      loads in each iteration, we generate a single extra vector load in the
!      preheader of the loop, and in each iteration reuse the result of the 
!      vector load from the previous iteration.  In case the memory access is in
!      an inner-loop nested inside LOOP, which is now being vectorized using
!      outer-loop vectorization, we need to determine whether this initial vector
!      load should be generated at the preheader of the inner-loop, or can be
!      generated at the preheader of LOOP.  If the memory access has no evolution
!      in LOOP, it can be generated in the preheader of LOOP. Otherwise, it has 
!      to be generated inside LOOP (in the preheader of the inner-loop).  */
! 
!   if (nested_in_vect_loop)
!     {
!       tree outerloop_step = STMT_VINFO_DR_STEP (stmt_info);
!       bool invariant_in_outerloop =
!             (tree_int_cst_compare (outerloop_step, size_zero_node) == 0);
!       loop_for_initial_load = (invariant_in_outerloop ? loop : loop->inner);
!     }
!   else
!     loop_for_initial_load = loop;
!   if (at_loop)
!     *at_loop = loop_for_initial_load;
! 
!   /* 3. For the case of the optimized realignment, create the first vector 
! 	load at the loop preheader.  */
! 
!   if (alignment_support_scheme == dr_explicit_realign_optimized)
!     {
!       /* Create msq_init = *(floor(p1)) in the loop preheader  */
! 
!       gcc_assert (!compute_in_loop);
!       pe = loop_preheader_edge (loop_for_initial_load);
!       vec_dest = vect_create_destination_var (scalar_dest, vectype);
!       ptr = vect_create_data_ref_ptr (stmt, loop_for_initial_load, NULL_TREE,
! 				    &init_addr, &inc, true, NULL_TREE, &inv_p);
!       data_ref = build1 (ALIGN_INDIRECT_REF, vectype, ptr);
!       new_stmt = build_gimple_modify_stmt (vec_dest, data_ref);
!       new_temp = make_ssa_name (vec_dest, new_stmt);
!       GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
!       new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
!       gcc_assert (!new_bb);
!       msq_init = GIMPLE_STMT_OPERAND (new_stmt, 0);
!       copy_virtual_operands (new_stmt, stmt);
!       update_vuses_to_preheader (new_stmt, loop_for_initial_load);
!     }
! 
! 
!   /* 4. Create realignment token using a target builtin, if available.
! 	It is done either inside the containing loop, or before LOOP (as
! 	determined above).  */
! 
    if (targetm.vectorize.builtin_mask_for_load)
      {
        tree builtin_decl;
  
+       /* Compute INIT_ADDR - the initial address accessed by this memref.  */
+       if (compute_in_loop)
+ 	gcc_assert (init_addr); /* already computed by the caller.  */
+       else
+ 	{
+ 	  /* Generate the INIT_ADDR computation outside LOOP.  */
+ 	  init_addr = vect_create_addr_base_for_vector_ref (stmt, &stmts,
+ 							    NULL_TREE, loop);
+ 	  pe = loop_preheader_edge (loop);
+ 	  new_bb = bsi_insert_on_edge_immediate (pe, stmts);
+ 	  gcc_assert (!new_bb);
+ 	}
+ 
        builtin_decl = targetm.vectorize.builtin_mask_for_load ();
        new_stmt = build_call_expr (builtin_decl, 1, init_addr);
        vec_dest = vect_create_destination_var (scalar_dest, 
*************** vect_setup_realignment (tree stmt, block
*** 4469,4476 ****
        new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
        new_temp = make_ssa_name (vec_dest, new_stmt);
        GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
!       new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
!       gcc_assert (!new_bb);
        *realignment_token = GIMPLE_STMT_OPERAND (new_stmt, 0);
  
        /* The result of the CALL_EXPR to this builtin is determined from
--- 4735,4751 ----
        new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
        new_temp = make_ssa_name (vec_dest, new_stmt);
        GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
! 
!       if (compute_in_loop)
! 	bsi_insert_before (bsi, new_stmt, BSI_SAME_STMT);
!       else
! 	{
! 	  /* Generate the misalignment computation outside LOOP.  */
! 	  pe = loop_preheader_edge (loop);
! 	  new_bb = bsi_insert_on_edge_immediate (pe, new_stmt);
! 	  gcc_assert (!new_bb);
! 	}
! 
        *realignment_token = GIMPLE_STMT_OPERAND (new_stmt, 0);
  
        /* The result of the CALL_EXPR to this builtin is determined from
*************** vect_setup_realignment (tree stmt, block
*** 4481,4492 ****
        gcc_assert (TREE_READONLY (builtin_decl));
      }
  
!   /* 3. Create msq = phi <msq_init, lsq> in loop  */
    vec_dest = vect_create_destination_var (scalar_dest, vectype);
    msq = make_ssa_name (vec_dest, NULL_TREE);
!   phi_stmt = create_phi_node (msq, loop->header); 
    SSA_NAME_DEF_STMT (msq) = phi_stmt;
!   add_phi_arg (phi_stmt, msq_init, loop_preheader_edge (loop));
  
    return msq;
  }
--- 4756,4776 ----
        gcc_assert (TREE_READONLY (builtin_decl));
      }
  
!   if (alignment_support_scheme == dr_explicit_realign)
!     return msq;
! 
!   gcc_assert (!compute_in_loop);
!   gcc_assert (alignment_support_scheme == dr_explicit_realign_optimized);
! 
! 
!   /* 5. Create msq = phi <msq_init, lsq> in loop  */
! 
!   pe = loop_preheader_edge (containing_loop);
    vec_dest = vect_create_destination_var (scalar_dest, vectype);
    msq = make_ssa_name (vec_dest, NULL_TREE);
!   phi_stmt = create_phi_node (msq, containing_loop->header);
    SSA_NAME_DEF_STMT (msq) = phi_stmt;
!   add_phi_arg (phi_stmt, msq_init, pe);
  
    return msq;
  }
*************** vectorizable_load (tree stmt, block_stmt
*** 4776,4788 ****
    stmt_vec_info prev_stmt_info; 
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info), *first_dr;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree new_temp;
    int mode;
    tree new_stmt = NULL_TREE;
    tree dummy;
!   enum dr_alignment_support alignment_support_cheme;
    tree dataref_ptr = NULL_TREE;
    tree ptr_incr;
    int nunits = TYPE_VECTOR_SUBPARTS (vectype);
--- 5060,5074 ----
    stmt_vec_info prev_stmt_info; 
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+   struct loop *containing_loop = (bb_for_stmt (stmt))->loop_father;
+   bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
    struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info), *first_dr;
    tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    tree new_temp;
    int mode;
    tree new_stmt = NULL_TREE;
    tree dummy;
!   enum dr_alignment_support alignment_support_scheme;
    tree dataref_ptr = NULL_TREE;
    tree ptr_incr;
    int nunits = TYPE_VECTOR_SUBPARTS (vectype);
*************** vectorizable_load (tree stmt, block_stmt
*** 4791,4804 ****
    tree msq = NULL_TREE, lsq;
    tree offset = NULL_TREE;
    tree realignment_token = NULL_TREE;
!   tree phi_stmt = NULL_TREE;
    VEC(tree,heap) *dr_chain = NULL;
    bool strided_load = false;
    tree first_stmt;
  
    gcc_assert (ncopies >= 1);
    /* FORNOW. This restriction should be relaxed.  */
!   if (nested_in_vect_loop_p (loop, stmt) && ncopies > 1)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "multiple types in nested loop.");
--- 5077,5095 ----
    tree msq = NULL_TREE, lsq;
    tree offset = NULL_TREE;
    tree realignment_token = NULL_TREE;
!   tree phi = NULL_TREE;
    VEC(tree,heap) *dr_chain = NULL;
    bool strided_load = false;
    tree first_stmt;
+   tree scalar_type;
+   bool inv_p;
+   bool compute_in_loop = false;
+   struct loop *at_loop;
  
    gcc_assert (ncopies >= 1);
+ 
    /* FORNOW. This restriction should be relaxed.  */
!   if (nested_in_vect_loop && ncopies > 1)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "multiple types in nested loop.");
*************** vectorizable_load (tree stmt, block_stmt
*** 4836,4841 ****
--- 5127,5133 ----
    if (!STMT_VINFO_DATA_REF (stmt_info))
      return false;
  
+   scalar_type = TREE_TYPE (DR_REF (dr));
    mode = (int) TYPE_MODE (vectype);
  
    /* FORNOW. In some cases can vectorize even if data-type not supported
*************** vectorizable_load (tree stmt, block_stmt
*** 4851,4856 ****
--- 5143,5150 ----
    if (DR_GROUP_FIRST_DR (stmt_info))
      {
        strided_load = true;
+       /* FORNOW */
+       gcc_assert (! nested_in_vect_loop);
  
        /* Check if interleaving is supported.  */
        if (!vect_strided_load_supported (vectype))
*************** vectorizable_load (tree stmt, block_stmt
*** 4889,4897 ****
        group_size = 1;
      }
  
!   alignment_support_cheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_cheme);
! 
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
--- 5183,5190 ----
        group_size = 1;
      }
  
!   alignment_support_scheme = vect_supportable_dr_alignment (first_dr);
!   gcc_assert (alignment_support_scheme);
  
    /* In case the vectorization factor (VF) is bigger than the number
       of elements that we can fit in a vectype (nunits), we have to generate
*************** vectorizable_load (tree stmt, block_stmt
*** 4973,4979 ****
           }
  
       Otherwise, the data reference is potentially unaligned on a target that
!      does not support unaligned accesses (dr_unaligned_software_pipeline) - 
       then generate the following code, in which the data in each iteration is
       obtained by two vector loads, one from the previous iteration, and one
       from the current iteration:
--- 5266,5272 ----
           }
  
       Otherwise, the data reference is potentially unaligned on a target that
!      does not support unaligned accesses (dr_explicit_realign_optimized) - 
       then generate the following code, in which the data in each iteration is
       obtained by two vector loads, one from the previous iteration, and one
       from the current iteration:
*************** vectorizable_load (tree stmt, block_stmt
*** 4990,5016 ****
             msq = lsq;
           }   */
  
!   if (alignment_support_cheme == dr_unaligned_software_pipeline)
!     {
!       msq = vect_setup_realignment (first_stmt, bsi, &realignment_token);
!       phi_stmt = SSA_NAME_DEF_STMT (msq);
!       offset = size_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
      }
  
    prev_stmt_info = NULL;
    for (j = 0; j < ncopies; j++)
      { 
        /* 1. Create the vector pointer update chain.  */
        if (j == 0)
!         dataref_ptr = vect_create_data_ref_ptr (first_stmt, bsi, offset, &dummy,
!                                                 &ptr_incr, false, NULL_TREE);
        else
!         dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);
  
        for (i = 0; i < group_size; i++)
  	{
  	  /* 2. Create the vector-load in the loop.  */
! 	  switch (alignment_support_cheme)
  	    {
  	    case dr_aligned:
  	      gcc_assert (aligned_access_p (first_dr));
--- 5283,5334 ----
             msq = lsq;
           }   */
  
!   /* If the misalignment remains the same throughout the execution of the
!      loop, we can create the init_addr and permutation mask at the loop
!      preheader. Otherwise, it needs to be created inside the loop.
!      This can only occur when vectorizing memory accesses in the inner-loop
!      nested within an outer-loop that is being vectorized.  */
! 
!   if (nested_in_vect_loop_p (loop, stmt)
!       && (TREE_INT_CST_LOW (DR_STEP (dr)) % UNITS_PER_SIMD_WORD != 0))
!     {
!       gcc_assert (alignment_support_scheme != dr_explicit_realign_optimized);
!       compute_in_loop = true;
!     }
! 
!   if ((alignment_support_scheme == dr_explicit_realign_optimized
!        || alignment_support_scheme == dr_explicit_realign)
!       && !compute_in_loop)
!     {
!       msq = vect_setup_realignment (first_stmt, bsi, &realignment_token,
! 				    alignment_support_scheme, NULL_TREE,
! 				    &at_loop);
!       if (alignment_support_scheme == dr_explicit_realign_optimized)
! 	{
! 	  phi = SSA_NAME_DEF_STMT (msq);
! 	  offset = size_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
! 	}
      }
+   else
+     at_loop = loop;
  
    prev_stmt_info = NULL;
    for (j = 0; j < ncopies; j++)
      { 
        /* 1. Create the vector pointer update chain.  */
        if (j == 0)
!         dataref_ptr = vect_create_data_ref_ptr (first_stmt,
! 					        at_loop, offset, 
! 						&dummy, &ptr_incr, false, 
! 						NULL_TREE, &inv_p);
        else
!         dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);
  
        for (i = 0; i < group_size; i++)
  	{
  	  /* 2. Create the vector-load in the loop.  */
! 	  switch (alignment_support_scheme)
  	    {
  	    case dr_aligned:
  	      gcc_assert (aligned_access_p (first_dr));
*************** vectorizable_load (tree stmt, block_stmt
*** 5021,5034 ****
  		int mis = DR_MISALIGNMENT (first_dr);
  		tree tmis = (mis == -1 ? size_zero_node : size_int (mis));
  
- 		gcc_assert (!aligned_access_p (first_dr));
  		tmis = size_binop (MULT_EXPR, tmis, size_int(BITS_PER_UNIT));
  		data_ref =
  		  build2 (MISALIGNED_INDIRECT_REF, vectype, dataref_ptr, tmis);
  		break;
  	      }
! 	    case dr_unaligned_software_pipeline:
! 	      gcc_assert (!aligned_access_p (first_dr));
  	      data_ref = build1 (ALIGN_INDIRECT_REF, vectype, dataref_ptr);
  	      break;
  	    default:
--- 5339,5377 ----
  		int mis = DR_MISALIGNMENT (first_dr);
  		tree tmis = (mis == -1 ? size_zero_node : size_int (mis));
  
  		tmis = size_binop (MULT_EXPR, tmis, size_int(BITS_PER_UNIT));
  		data_ref =
  		  build2 (MISALIGNED_INDIRECT_REF, vectype, dataref_ptr, tmis);
  		break;
  	      }
! 	    case dr_explicit_realign:
! 	      {
! 		tree ptr, bump;
! 		tree vs_minus_1 = size_int (TYPE_VECTOR_SUBPARTS (vectype) - 1);
! 
! 		if (compute_in_loop)
! 		  msq = vect_setup_realignment (first_stmt, bsi, 
! 						&realignment_token,
! 						dr_explicit_realign, 
! 						dataref_ptr, NULL);
! 
! 		data_ref = build1 (ALIGN_INDIRECT_REF, vectype, dataref_ptr);
! 		vec_dest = vect_create_destination_var (scalar_dest, vectype);
! 		new_stmt = build_gimple_modify_stmt (vec_dest, data_ref);
! 		new_temp = make_ssa_name (vec_dest, new_stmt);
! 		GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
! 		vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 		copy_virtual_operands (new_stmt, stmt);
! 		mark_symbols_for_renaming (new_stmt);
! 		msq = new_temp;
! 
! 		bump = size_binop (MULT_EXPR, vs_minus_1,
! 				   TYPE_SIZE_UNIT (scalar_type));
! 		ptr = bump_vector_ptr (dataref_ptr, NULL_TREE, bsi, stmt, bump);
! 	        data_ref = build1 (ALIGN_INDIRECT_REF, vectype, ptr);
! 	        break;
! 	      }
! 	    case dr_explicit_realign_optimized:
  	      data_ref = build1 (ALIGN_INDIRECT_REF, vectype, dataref_ptr);
  	      break;
  	    default:
*************** vectorizable_load (tree stmt, block_stmt
*** 5042,5070 ****
  	  copy_virtual_operands (new_stmt, stmt);
  	  mark_symbols_for_renaming (new_stmt);
  
! 	  /* 3. Handle explicit realignment if necessary/supported.  */
! 	  if (alignment_support_cheme == dr_unaligned_software_pipeline)
  	    {
- 	      /* Create in loop: 
- 		 <vec_dest = realign_load (msq, lsq, realignment_token)>  */
  	      lsq = GIMPLE_STMT_OPERAND (new_stmt, 0);
  	      if (!realignment_token)
  		realignment_token = dataref_ptr;
  	      vec_dest = vect_create_destination_var (scalar_dest, vectype);
! 	      new_stmt =
! 		build3 (REALIGN_LOAD_EXPR, vectype, msq, lsq, realignment_token);
  	      new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
  	      new_temp = make_ssa_name (vec_dest, new_stmt);
  	      GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
  	      vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 	      if (i == group_size - 1 && j == ncopies - 1)
! 		add_phi_arg (phi_stmt, lsq, loop_latch_edge (loop));
! 	      msq = lsq;
  	    }
  	  if (strided_load)
  	    VEC_quick_push (tree, dr_chain, new_temp);
  	  if (i < group_size - 1)
! 	    dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt);	  
  	}
  
        if (strided_load)
--- 5385,5454 ----
  	  copy_virtual_operands (new_stmt, stmt);
  	  mark_symbols_for_renaming (new_stmt);
  
! 	  /* 3. Handle explicit realignment if necessary/supported. Create in
! 		loop: vec_dest = realign_load (msq, lsq, realignment_token)  */
! 	  if (alignment_support_scheme == dr_explicit_realign_optimized
! 	      || alignment_support_scheme == dr_explicit_realign)
  	    {
  	      lsq = GIMPLE_STMT_OPERAND (new_stmt, 0);
  	      if (!realignment_token)
  		realignment_token = dataref_ptr;
  	      vec_dest = vect_create_destination_var (scalar_dest, vectype);
! 	      new_stmt = build3 (REALIGN_LOAD_EXPR, vectype, msq, lsq, 
! 				 realignment_token);
  	      new_stmt = build_gimple_modify_stmt (vec_dest, new_stmt);
  	      new_temp = make_ssa_name (vec_dest, new_stmt);
  	      GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
  	      vect_finish_stmt_generation (stmt, new_stmt, bsi);
! 
! 	      if (alignment_support_scheme == dr_explicit_realign_optimized)
! 		{
! 		  if (i == group_size - 1 && j == ncopies - 1)
! 		    add_phi_arg (phi, lsq, loop_latch_edge (containing_loop));
! 		  msq = lsq;
! 		}
  	    }
+ 
+ 	  /* 4. Handle invariant-load.  */
+ 	  if (inv_p)
+ 	    {
+ 	      gcc_assert (!strided_load);
+ 	      gcc_assert (nested_in_vect_loop_p (loop, stmt));
+ 	      if (j == 0)
+ 		{
+ 		  int k;
+ 		  tree t = NULL_TREE;
+ 		  tree vec_inv, bitpos, bitsize = TYPE_SIZE (scalar_type);
+ 
+ 		  /* CHECKME: bitpos depends on endianness?  */
+ 		  bitpos = bitsize_zero_node;
+ 		  vec_inv = build3 (BIT_FIELD_REF, scalar_type, new_temp, 
+ 							    bitsize, bitpos);
+ 		  BIT_FIELD_REF_UNSIGNED (vec_inv) = 
+ 						 TYPE_UNSIGNED (scalar_type);
+ 		  vec_dest = 
+ 			vect_create_destination_var (scalar_dest, NULL_TREE);
+ 		  new_stmt = build_gimple_modify_stmt (vec_dest, vec_inv);
+                   new_temp = make_ssa_name (vec_dest, new_stmt);
+                   GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
+                   vect_finish_stmt_generation (stmt, new_stmt, bsi);
+ 
+ 		  for (k = nunits - 1; k >= 0; --k)
+ 		    t = tree_cons (NULL_TREE, new_temp, t);
+ 		  /* FIXME: use build_constructor directly.  */
+ 		  vec_inv = build_constructor_from_list (vectype, t);
+ 		  new_temp = vect_init_vector (stmt, vec_inv, vectype, bsi);
+ 		  new_stmt = SSA_NAME_DEF_STMT (new_temp);
+ 		}
+ 	      else
+ 		gcc_unreachable (); /* FORNOW. */
+ 	    }
+ 
  	  if (strided_load)
  	    VEC_quick_push (tree, dr_chain, new_temp);
  	  if (i < group_size - 1)
! 	    dataref_ptr = 
! 		bump_vector_ptr (dataref_ptr, ptr_incr, bsi, stmt, NULL_TREE);	  
  	}
  
        if (strided_load)
*************** vect_gen_niters_for_prolog_loop (loop_ve
*** 5907,5914 ****
    else
      {
        tree new_stmts = NULL_TREE;
!       tree start_addr =
!         vect_create_addr_base_for_vector_ref (dr_stmt, &new_stmts, NULL_TREE);
        tree ptr_type = TREE_TYPE (start_addr);
        tree size = TYPE_SIZE (ptr_type);
        tree type = lang_hooks.types.type_for_size (tree_low_cst (size, 1), 1);
--- 6291,6298 ----
    else
      {
        tree new_stmts = NULL_TREE;
!       tree start_addr = vect_create_addr_base_for_vector_ref (dr_stmt, 
! 						&new_stmts, NULL_TREE, loop);
        tree ptr_type = TREE_TYPE (start_addr);
        tree size = TYPE_SIZE (ptr_type);
        tree type = lang_hooks.types.type_for_size (tree_low_cst (size, 1), 1);
*************** static tree
*** 6081,6086 ****
--- 6465,6471 ----
  vect_create_cond_for_align_checks (loop_vec_info loop_vinfo,
                                     tree *cond_expr_stmt_list)
  {
+   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    VEC(tree,heap) *may_misalign_stmts
      = LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo);
    tree ref_stmt, tmp;
*************** vect_create_cond_for_align_checks (loop_
*** 6116,6123 ****
  
        /* create: addr_tmp = (int)(address_of_first_vector) */
        addr_base = vect_create_addr_base_for_vector_ref (ref_stmt, 
! 							&new_stmt_list, 
! 							NULL_TREE);
  
        if (new_stmt_list != NULL_TREE)
          append_to_statement_list_force (new_stmt_list, cond_expr_stmt_list);
--- 6501,6507 ----
  
        /* create: addr_tmp = (int)(address_of_first_vector) */
        addr_base = vect_create_addr_base_for_vector_ref (ref_stmt, 
! 					&new_stmt_list, NULL_TREE, loop);
  
        if (new_stmt_list != NULL_TREE)
          append_to_statement_list_force (new_stmt_list, cond_expr_stmt_list);
*** tree-vect-analyze.c.p1	2007-08-05 20:50:10.000000000 +0300
--- tree-vect-analyze.c	2007-08-07 07:12:07.000000000 +0300
*************** vect_compute_data_ref_alignment (struct 
*** 1193,1198 ****
--- 1193,1200 ----
  {
    tree stmt = DR_STMT (dr);
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);  
+   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    tree ref = DR_REF (dr);
    tree vectype;
    tree base, base_addr;
*************** vect_compute_data_ref_alignment (struct 
*** 1209,1221 ****
    misalign = DR_INIT (dr);
    aligned_to = DR_ALIGNED_TO (dr);
    base_addr = DR_BASE_ADDRESS (dr);
    base = build_fold_indirect_ref (base_addr);
    vectype = STMT_VINFO_VECTYPE (stmt_info);
    alignment = ssize_int (TYPE_ALIGN (vectype)/BITS_PER_UNIT);
  
!   if (tree_int_cst_compare (aligned_to, alignment) < 0)
      {
!       if (vect_print_dump_info (REPORT_DETAILS))
  	{
  	  fprintf (vect_dump, "Unknown alignment for access: ");
  	  print_generic_expr (vect_dump, base, TDF_SLIM);
--- 1211,1252 ----
    misalign = DR_INIT (dr);
    aligned_to = DR_ALIGNED_TO (dr);
    base_addr = DR_BASE_ADDRESS (dr);
+ 
+   /* In case the dataref is in an inner-loop of the loop that is being
+      vectorized (LOOP), we use the base and misalignment information
+      relative to the outer-loop (LOOP). This is ok only if the misalignment
+      stays the same throughout the execution of the inner-loop, which is why
+      we have to check that the stride of the dataref in the inner-loop is a
+      multiple of the vector size.  */
+   if (nested_in_vect_loop_p (loop, stmt))
+     {
+       tree step = DR_STEP (dr);
+       HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
+     
+       if (dr_step % UNITS_PER_SIMD_WORD == 0)
+         {
+           if (vect_print_dump_info (REPORT_ALIGNMENT))
+             fprintf (vect_dump, "inner step divides the vector-size.");
+ 	  misalign = STMT_VINFO_DR_INIT (stmt_info);
+ 	  aligned_to = STMT_VINFO_DR_ALIGNED_TO (stmt_info);
+ 	  base_addr = STMT_VINFO_DR_BASE_ADDRESS (stmt_info);
+         }
+       else
+ 	{
+ 	  if (vect_print_dump_info (REPORT_ALIGNMENT))
+ 	    fprintf (vect_dump, "inner step doesn't divide the vector-size.");
+ 	  misalign = NULL_TREE;
+ 	}
+     }
+ 
    base = build_fold_indirect_ref (base_addr);
    vectype = STMT_VINFO_VECTYPE (stmt_info);
    alignment = ssize_int (TYPE_ALIGN (vectype)/BITS_PER_UNIT);
  
!   if ((aligned_to && tree_int_cst_compare (aligned_to, alignment) < 0)
!       || !misalign)
      {
!       if (vect_print_dump_info (REPORT_ALIGNMENT))
  	{
  	  fprintf (vect_dump, "Unknown alignment for access: ");
  	  print_generic_expr (vect_dump, base, TDF_SLIM);
*************** static bool
*** 1887,1906 ****
  vect_analyze_data_ref_access (struct data_reference *dr)
  {
    tree step = DR_STEP (dr);
-   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
    tree scalar_type = TREE_TYPE (DR_REF (dr));
    HOST_WIDE_INT type_size = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (scalar_type));
    tree stmt = DR_STMT (dr);
!   /* For interleaving, STRIDE is STEP counted in elements, i.e., the size of the 
!      interleaving group (including gaps).  */
!   HOST_WIDE_INT stride = dr_step / type_size;
  
!   if (!step)
      {
!       if (vect_print_dump_info (REPORT_DETAILS))
! 	fprintf (vect_dump, "bad data-ref access");
!       return false;
      }
  
    /* Consecutive?  */
    if (!tree_int_cst_compare (step, TYPE_SIZE_UNIT (scalar_type)))
--- 1918,1956 ----
  vect_analyze_data_ref_access (struct data_reference *dr)
  {
    tree step = DR_STEP (dr);
    tree scalar_type = TREE_TYPE (DR_REF (dr));
    HOST_WIDE_INT type_size = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (scalar_type));
    tree stmt = DR_STMT (dr);
!   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
!   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
!   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
!   HOST_WIDE_INT stride;
! 
!   /* Don't allow invariant accesses.  */
!   if (dr_step == 0)
!     return false; 
  
!   if (nested_in_vect_loop_p (loop, stmt))
      {
!       /* For the rest of the analysis we use the outer-loop step.  */
!       step = STMT_VINFO_DR_STEP (stmt_info);
!       dr_step = TREE_INT_CST_LOW (step);
!       
!       if (dr_step == 0)
! 	{
! 	  if (vect_print_dump_info (REPORT_ALIGNMENT))
! 	    fprintf (vect_dump, "zero step in outer loop.");
! 	  if (DR_IS_READ (dr))
!   	    return true; 
! 	  else
! 	    return false;
! 	}
      }
+     
+   /* For interleaving, STRIDE is STEP counted in elements, i.e., the size of the 
+      interleaving group (including gaps).  */
+   stride = dr_step / type_size; 
  
    /* Consecutive?  */
    if (!tree_int_cst_compare (step, TYPE_SIZE_UNIT (scalar_type)))
*************** vect_analyze_data_ref_access (struct dat
*** 1910,1915 ****
--- 1960,1972 ----
        return true;
      }
  
+   if (nested_in_vect_loop_p (loop, stmt))
+     {
+       if (vect_print_dump_info (REPORT_ALIGNMENT))
+ 	fprintf (vect_dump, "strided access in outer loop.");
+       return false;
+     }
+ 
    /* Not consecutive access is possible only if it is a part of interleaving.  */
    if (!DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt)))
      {
*************** vect_analyze_data_refs (loop_vec_info lo
*** 2138,2143 ****
--- 2195,2201 ----
        tree stmt;
        stmt_vec_info stmt_info;
        basic_block bb;
+       tree base, offset, init;	
     
        if (!dr || !DR_REF (dr))
          {
*************** vect_analyze_data_refs (loop_vec_info lo
*** 2145,2180 ****
  	    fprintf (vect_dump, "not vectorized: unhandled data-ref ");
            return false;
          }
!  
!       /* Update DR field in stmt_vec_info struct.  */
        stmt = DR_STMT (dr);
        stmt_info = vinfo_for_stmt (stmt);
  
-       /* If outer-loop vectorization: we don't yet support datarefs
- 	 in the innermost loop.  */
-       bb = bb_for_stmt (stmt);
-       if (bb->loop_father != LOOP_VINFO_LOOP (loop_vinfo))
- 	{
- 	  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
- 	    fprintf (vect_dump, "not vectorized: data-ref in nested loop");
- 	  return false;
- 	}
- 
-       if (STMT_VINFO_DATA_REF (stmt_info))
-         {
-           if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
-             {
-               fprintf (vect_dump,
-                        "not vectorized: more than one data ref in stmt: ");
-               print_generic_expr (vect_dump, stmt, TDF_SLIM);
-             }
-           return false;
-         }
-       STMT_VINFO_DATA_REF (stmt_info) = dr;
-      
        /* Check that analysis of the data-ref succeeded.  */
        if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr)
!           || !DR_STEP (dr))   
          {
            if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
              {
--- 2203,2215 ----
  	    fprintf (vect_dump, "not vectorized: unhandled data-ref ");
            return false;
          }
! 
        stmt = DR_STMT (dr);
        stmt_info = vinfo_for_stmt (stmt);
  
        /* Check that analysis of the data-ref succeeded.  */
        if (!DR_BASE_ADDRESS (dr) || !DR_OFFSET (dr) || !DR_INIT (dr)
!           || !DR_STEP (dr))
          {
            if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
              {
*************** vect_analyze_data_refs (loop_vec_info lo
*** 2201,2207 ****
              }
            return false;
          }
!                        
        /* Set vectype for STMT.  */
        scalar_type = TREE_TYPE (DR_REF (dr));
        STMT_VINFO_VECTYPE (stmt_info) =
--- 2236,2362 ----
              }
            return false;
          }
! 
!       base = unshare_expr (DR_BASE_ADDRESS (dr));
!       offset = unshare_expr (DR_OFFSET (dr));
!       init = unshare_expr (DR_INIT (dr));
! 	
!       /* Update DR field in stmt_vec_info struct.  */
!       bb = bb_for_stmt (stmt);
! 
!       /* If the dataref is in an inner-loop of the loop that is considered for
! 	 vectorization, we also want to analyze the access relative to
! 	 the outer-loop (DR contains information only relative to the 
! 	 inner-most enclosing loop).  We do that by building a reference to the
! 	 first location accessed by the inner-loop, and analyze it relative to
! 	 the outer-loop.  */ 	
!       if (nested_in_vect_loop_p (loop, stmt)) 
! 	{
! 	  tree outer_step, outer_base, outer_init;
! 	  HOST_WIDE_INT pbitsize, pbitpos;
! 	  tree poffset;
! 	  enum machine_mode pmode;
! 	  int punsignedp, pvolatilep;
! 	  affine_iv base_iv, offset_iv;
! 	  tree dinit;
! 
! 	  /* Build a reference to the first location accessed by the 
! 	     inner-loop: *(BASE+INIT). (The first location is actually
! 	     BASE+INIT+OFFSET, but we add OFFSET separately later).  */
! 	  tree inner_base = build_fold_indirect_ref 
! 				(fold_build2 (PLUS_EXPR, TREE_TYPE (base), base, init));
! 
! 	  if (vect_print_dump_info (REPORT_DETAILS))
! 	    {
! 	      fprintf (dump_file, "analyze in outer-loop: ");
! 	      print_generic_expr (dump_file, inner_base, TDF_SLIM);
! 	    }
! 
! 	  outer_base = get_inner_reference (inner_base, &pbitsize, &pbitpos, 
! 		          &poffset, &pmode, &punsignedp, &pvolatilep, false);
! 	  gcc_assert (outer_base != NULL_TREE);
! 
! 	  if (pbitpos % BITS_PER_UNIT != 0)
! 	    {
! 	      if (vect_print_dump_info (REPORT_DETAILS))
! 		fprintf (dump_file, "failed: bit offset alignment.\n");
! 	      return false;
! 	    }
! 
! 	  outer_base = build_fold_addr_expr (outer_base);
! 	  if (!simple_iv (loop, stmt, outer_base, &base_iv, false))
! 	    {
! 	      if (vect_print_dump_info (REPORT_DETAILS))
! 		fprintf (dump_file, "failed: evolution of base is not affine.\n");
! 	      return false;
! 	    }
! 
! 	  if (offset)
! 	    {
! 	      if (poffset)
! 		poffset = fold_build2 (PLUS_EXPR, TREE_TYPE (offset), offset, poffset);
! 	      else
! 		poffset = offset;
! 	    }
! 
! 	  if (!poffset)
! 	    {
! 	      offset_iv.base = ssize_int (0);
! 	      offset_iv.step = ssize_int (0);
! 	    }
! 	  else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
! 	    {
! 	      if (vect_print_dump_info (REPORT_DETAILS))
! 	        fprintf (dump_file, "evolution of offset is not affine.\n");
! 	      return false;
! 	    }
! 
! 	  outer_init = ssize_int (pbitpos / BITS_PER_UNIT);
! 	  split_constant_offset (base_iv.base, &base_iv.base, &dinit);
! 	  outer_init =  size_binop (PLUS_EXPR, outer_init, dinit);
! 	  split_constant_offset (offset_iv.base, &offset_iv.base, &dinit);
! 	  outer_init =  size_binop (PLUS_EXPR, outer_init, dinit);
! 
! 	  outer_step = size_binop (PLUS_EXPR,
! 				fold_convert (ssizetype, base_iv.step),
! 				fold_convert (ssizetype, offset_iv.step));
! 
! 	  STMT_VINFO_DR_STEP (stmt_info) = outer_step;
! 	  /* FIXME: Use canonicalize_base_object_address (base_iv.base); */
! 	  STMT_VINFO_DR_BASE_ADDRESS (stmt_info) = base_iv.base; 
! 	  STMT_VINFO_DR_INIT (stmt_info) = outer_init;
! 	  STMT_VINFO_DR_OFFSET (stmt_info) = 
! 				fold_convert (ssizetype, offset_iv.base);
! 	  STMT_VINFO_DR_ALIGNED_TO (stmt_info) = 
! 				size_int (highest_pow2_factor (offset_iv.base));
! 
! 	  if (dump_file && (dump_flags & TDF_DETAILS))
! 	    {
! 	      fprintf (dump_file, "\touter base_address: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_BASE_ADDRESS (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter offset from base address: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_OFFSET (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter constant offset from base address: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_INIT (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter step: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_STEP (stmt_info), TDF_SLIM);
! 	      fprintf (dump_file, "\n\touter aligned to: ");
! 	      print_generic_expr (dump_file, STMT_VINFO_DR_ALIGNED_TO (stmt_info), TDF_SLIM);
! 	    }
! 	}
! 
!       if (STMT_VINFO_DATA_REF (stmt_info))
!         {
!           if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
!             {
!               fprintf (vect_dump,
!                        "not vectorized: more than one data ref in stmt: ");
!               print_generic_expr (vect_dump, stmt, TDF_SLIM);
!             }
!           return false;
!         }
!       STMT_VINFO_DATA_REF (stmt_info) = dr;
!      
        /* Set vectype for STMT.  */
        scalar_type = TREE_TYPE (DR_REF (dr));
        STMT_VINFO_VECTYPE (stmt_info) =
*** tree-vectorizer.h.p1	2007-08-05 20:50:18.000000000 +0300
--- tree-vectorizer.h	2007-08-05 22:40:50.000000000 +0300
*************** enum operation_type {
*** 53,59 ****
  enum dr_alignment_support {
    dr_unaligned_unsupported,
    dr_unaligned_supported,
!   dr_unaligned_software_pipeline,
    dr_aligned
  };
  
--- 53,60 ----
  enum dr_alignment_support {
    dr_unaligned_unsupported,
    dr_unaligned_supported,
!   dr_explicit_realign,
!   dr_explicit_realign_optimized,
    dr_aligned
  };
  
*************** typedef struct _stmt_vec_info {
*** 244,252 ****
       data-ref (array/pointer/struct access). A GIMPLE stmt is expected to have 
       at most one such data-ref.  **/
  
!   /* Information about the data-ref (access function, etc).  */
    struct data_reference *data_ref_info;
  
    /* Stmt is part of some pattern (computation idiom)  */
    bool in_pattern_p;
  
--- 245,262 ----
       data-ref (array/pointer/struct access). A GIMPLE stmt is expected to have 
       at most one such data-ref.  **/
  
!   /* Information about the data-ref (access function, etc),
!      relative to the inner-most containing loop.  */
    struct data_reference *data_ref_info;
  
+   /* Information about the data-ref relative to this loop
+      nest (the loop that is being considered for vectorization).  */
+   tree dr_base_address;
+   tree dr_init;
+   tree dr_offset;
+   tree dr_step;
+   tree dr_aligned_to;
+ 
    /* Stmt is part of some pattern (computation idiom)  */
    bool in_pattern_p;
  
*************** typedef struct _stmt_vec_info {
*** 305,310 ****
--- 315,327 ----
  #define STMT_VINFO_VECTYPE(S)              (S)->vectype
  #define STMT_VINFO_VEC_STMT(S)             (S)->vectorized_stmt
  #define STMT_VINFO_DATA_REF(S)             (S)->data_ref_info
+ 
+ #define STMT_VINFO_DR_BASE_ADDRESS(S)      (S)->dr_base_address
+ #define STMT_VINFO_DR_INIT(S)              (S)->dr_init
+ #define STMT_VINFO_DR_OFFSET(S)            (S)->dr_offset
+ #define STMT_VINFO_DR_STEP(S)              (S)->dr_step
+ #define STMT_VINFO_DR_ALIGNED_TO(S)        (S)->dr_aligned_to
+ 
  #define STMT_VINFO_IN_PATTERN_P(S)         (S)->in_pattern_p
  #define STMT_VINFO_RELATED_STMT(S)         (S)->related_stmt
  #define STMT_VINFO_SAME_ALIGN_REFS(S)      (S)->same_align_refs
*** tree-vectorizer.c.p1	2007-08-05 20:50:24.000000000 +0300
--- tree-vectorizer.c	2007-08-05 22:41:36.000000000 +0300
*************** new_stmt_vec_info (tree stmt, loop_vec_i
*** 1345,1350 ****
--- 1345,1357 ----
    STMT_VINFO_IN_PATTERN_P (res) = false;
    STMT_VINFO_RELATED_STMT (res) = NULL;
    STMT_VINFO_DATA_REF (res) = NULL;
+ 
+   STMT_VINFO_DR_BASE_ADDRESS (res) = NULL;
+   STMT_VINFO_DR_OFFSET (res) = NULL;
+   STMT_VINFO_DR_INIT (res) = NULL;
+   STMT_VINFO_DR_STEP (res) = NULL;
+   STMT_VINFO_DR_ALIGNED_TO (res) = NULL;
+ 
    if (TREE_CODE (stmt) == PHI_NODE && is_loop_header_bb_p (bb_for_stmt (stmt)))
      STMT_VINFO_DEF_TYPE (res) = vect_unknown_def_type;
    else
*************** get_vectype_for_scalar_type (tree scalar
*** 1651,1673 ****
  enum dr_alignment_support
  vect_supportable_dr_alignment (struct data_reference *dr)
  {
!   tree vectype = STMT_VINFO_VECTYPE (vinfo_for_stmt (DR_STMT (dr)));
    enum machine_mode mode = (int) TYPE_MODE (vectype);
  
    if (aligned_access_p (dr))
      return dr_aligned;
  
    /* Possibly unaligned access.  */
!   
    if (DR_IS_READ (dr))
      {
        if (vec_realign_load_optab->handlers[mode].insn_code != CODE_FOR_nothing
  	  && (!targetm.vectorize.builtin_mask_for_load
  	      || targetm.vectorize.builtin_mask_for_load ()))
! 	return dr_unaligned_software_pipeline;
! 
        if (movmisalign_optab->handlers[mode].insn_code != CODE_FOR_nothing)
- 	/* Can't software pipeline the loads, but can at least do them.  */
  	return dr_unaligned_supported;
      }
  
--- 1658,1778 ----
  enum dr_alignment_support
  vect_supportable_dr_alignment (struct data_reference *dr)
  {
!   tree stmt = DR_STMT (dr);
!   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
!   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
    enum machine_mode mode = (int) TYPE_MODE (vectype);
+   struct loop *vect_loop = LOOP_VINFO_LOOP (STMT_VINFO_LOOP_VINFO (stmt_info));
+   bool nested_in_vect_loop = nested_in_vect_loop_p (vect_loop, stmt);
+   bool invariant_in_outerloop = false;
  
    if (aligned_access_p (dr))
      return dr_aligned;
  
+   if (nested_in_vect_loop)
+     {
+       tree outerloop_step = STMT_VINFO_DR_STEP (stmt_info);
+       invariant_in_outerloop = 
+ 	(tree_int_cst_compare (outerloop_step, size_zero_node) == 0);
+     }
+ 
    /* Possibly unaligned access.  */
! 
!   /* We can choose between using the implicit realignment scheme (generating
!      a misaligned_move stmt) and the explicit realignment scheme (generating
!      aligned loads with a REALIGN_LOAD). There are two variants to the explicit
!      realignment scheme: optimized, and unoptimized.
!      We can optimize the realignment only if the step between consecutive 
!      vector loads is equal to the vector size.  Since the vector memory 
!      accesses advance in steps of VS (Vector Size) in the vectorized loop, it 
!      is guaranteed that the misalignment amount remains the same throughout the
!      execution of the vectorized loop.  Therefore, we can create the 
!      "realignment token" (the permutation mask that is passed to REALIGN_LOAD) 
!      at the loop preheader.
! 
!      However, in the case of outer-loop vectorization, when vectorizing a
!      memory access in the inner-loop nested within the LOOP that is now being
!      vectorized, while it is guaranteed that the misalignment of the
!      vectorized memory access will remain the same in different outer-loop
!      iterations, it is *not* guaranteed that it will remain the same throughout
!      the execution of the inner-loop.  This is because the inner-loop advances
!      with the original scalar step (and not in steps of VS).  If the inner-loop
!      step happens to be a multiple of VS, then the misalignment remains fixed
!      and we can use the optimized realignment scheme.  For example:
! 
! 	for (i=0; i<N; i++)
! 	  for (j=0; j<M; j++)
! 	    s += a[i+j];
! 
!      When vectorizing the i-loop in the above example, the step between 
!      consecutive vector loads is 1, and so the misalignment does not remain 
!      fixed across the execution of the inner-loop, and the realignment cannot
!      be optimized (as illustrated in the following pseudo vectorized loop):
! 
! 	for (i=0; i<N; i+=4)
! 	  for (j=0; j<M; j++){
! 	    vs += vp[i+j]; // misalignment of &vp[i+j] is {0,1,2,3,0,1,2,3,...}
! 			   // when j is {0,1,2,3,4,5,6,7,...} respectively. 
! 			   // (assuming that we start from an aligned address).
!           }
! 
!      We therefore have to use the unoptimized realignment scheme:
! 
! 	for (i=0; i<N; i+=4)
! 	  for (j=0; j<M; j++){
! 	    rt = get_realignment_token (&vp[i+j]);
! 	    v1 = vp[i+j];
! 	    v2 = vp[i+j+VS-1];
! 	    va = REALIGN_LOAD <v1,v2,rt>;
! 	    vs += va;
! 	  }
! 
!      On the other hand, when vectorizing the i-loop in the following example 
!      (that implements the same computation as above):
! 
! 	for (k=0; k<4; k++)
! 	  for (i=0; i<N; i++)
! 	    for (j=k; j<M; j+=4)
! 	      s += a[i+j];
! 
!      the step between consecutive vector loads is 4, and so (assuming that 
!      the vector size is also 4) the realignment can be optimized:
! 
! 	for (k=0; k<4; k++)
! 	  for (i=0; i<N; i+=4)
! 	    for (j=k; j<M; j+=4)
! 	      vs += vp[i+j]; // misalignment of &vp[i+j] is always k (assuming 
! 			     // that the misalignment of the initial address is
! 			     // 0).
! 
!      The loop can then be vectorized as follows:
! 
! 	for (k=0; k<4; k++){
! 	  rt = get_realignment_token (&vp[k]);
! 	  for (i=0; i<N; i+=4){
! 	    v1 = vp[i+k];
! 	    for (j=k; j<M; j+=4){
! 	      v2 = vp[i+j+VS-1];
! 	      va = REALIGN_LOAD <v1,v2,rt>;
! 	      vs += va;
! 	      v1 = v2;
! 	    }
! 	  }
! 	} */
! 
    if (DR_IS_READ (dr))
      {
        if (vec_realign_load_optab->handlers[mode].insn_code != CODE_FOR_nothing
  	  && (!targetm.vectorize.builtin_mask_for_load
  	      || targetm.vectorize.builtin_mask_for_load ()))
! 	{
!           if (nested_in_vect_loop
! 	      && TREE_INT_CST_LOW (DR_STEP (dr)) != UNITS_PER_SIMD_WORD)
! 	    return dr_explicit_realign;
! 	  else
! 	    return dr_explicit_realign_optimized;
! 	}
        if (movmisalign_optab->handlers[mode].insn_code != CODE_FOR_nothing)
  	return dr_unaligned_supported;
      }
  
Index: tree-data-ref.c
===================================================================
*** tree-data-ref.c	(revision 127202)
--- tree-data-ref.c	(working copy)
*************** dump_ddrs (FILE *file, VEC (ddr_p, heap)
*** 489,495 ****
  /* Expresses EXP as VAR + OFF, where off is a constant.  The type of OFF
     will be ssizetype.  */
  
! static void
  split_constant_offset (tree exp, tree *var, tree *off)
  {
    tree type = TREE_TYPE (exp), otype;
--- 489,495 ----
  /* Expresses EXP as VAR + OFF, where off is a constant.  The type of OFF
     will be ssizetype.  */
  
! void
  split_constant_offset (tree exp, tree *var, tree *off)
  {
    tree type = TREE_TYPE (exp), otype;
Index: tree-data-ref.h
===================================================================
*** tree-data-ref.h	(revision 127202)
--- tree-data-ref.h	(working copy)
*************** index_in_loop_nest (int var, VEC (loop_p
*** 388,391 ****
--- 388,394 ----
  /* In lambda-code.c  */
  bool lambda_transform_legal_p (lambda_trans_matrix, int, VEC (ddr_p, heap) *);
  
+ /* In tree-data-ref.c  */
+ void split_constant_offset (tree, tree *, tree *);
+ 
  #endif  /* GCC_TREE_DATA_REF_H  */
Index: testsuite/gcc.dg/vect/vect-outer-1a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-1a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-1a.c	(revision 0)
***************
*** 0 ****
--- 1,28 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ signed short image[N][N] __attribute__ ((__aligned__(16)));
+ signed short block[N][N] __attribute__ ((__aligned__(16)));
+ 
+ /* Can't do outer-loop vectorization because of non-consecutive access.
+    Currently fails to vectorize because the reduction pattern is not
+    recognized.  */
+ 
+ int
+ foo (){
+   int i,j;
+   int diff = 0;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j+=8) {
+       diff += (image[i][j] - block[i][j]);
+     }
+   }
+   return diff;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* FORNOW */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "unexpected pattern" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-1b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-1b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-1b.c	(revision 0)
***************
*** 0 ****
--- 1,26 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ signed short image[N][N];
+ signed short block[N][N];
+ signed short out[N];
+ 
+ /* Outer-loop cannot get vectorized because of non-consecutive access.  */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=4) {
+       diff += (image[i][j] - block[i][j]);
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-1.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-1.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-1.c	(revision 0)
***************
*** 0 ****
--- 1,26 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ signed short image[N][N] __attribute__ ((__aligned__(16)));
+ signed short block[N][N] __attribute__ ((__aligned__(16)));
+ signed short out[N] __attribute__ ((__aligned__(16)));
+ 
+ /* Can't do outer-loop vectorization because of non-consecutive access. */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=8) {
+       diff += (image[i][j] - block[i][j]);
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2a.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N][N] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[k][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       if (image[k][j][i] != j+i+k)
+ 	abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2b.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[2*N][N][N] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[k+i][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       if (image[k+i][j][i] != j+i+k)
+ 	abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2.c	(revision 0)
***************
*** 0 ****
--- 1,40 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[j][i] = j+i;
+     }
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       if (image[j][i] != j+i)
+ 	abort ();
+     }
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-2c.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-2c.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-2c.c	(revision 0)
***************
*** 0 ****
--- 1,41 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[2*N][2*N][N] __attribute__ ((__aligned__(16)));
+ 
+ void
+ foo (){
+   int i,j,k;
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j+=2) {
+       image[k][j][i] = j+i+k;
+     }
+   }
+  }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j, k;
+ 
+   foo ();
+ 
+  for (k=0; k<N; k++) {
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j+=2) {
+       if (image[k][j][i] != j+i+k)
+ 	abort ();
+     }
+   }
+  }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3a.c	(revision 0)
***************
*** 0 ****
--- 1,53 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N+1] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization with misaligned accesses in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { scan-tree-dump-times "step doesn't divide the vector-size" 2 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3b.c	(revision 0)
***************
*** 0 ****
--- 1,53 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization with non-consecutive access. Not vectorized yet.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N/2; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][2*i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N/2; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][2*i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "strided access in outer loop" 2 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3.c	(revision 0)
***************
*** 0 ****
--- 1,52 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j++) {
+       diff += image[j][i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-3c.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-3c.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-3c.c	(revision 0)
***************
*** 0 ****
--- 1,52 ----
+ /* { dg-require-effective-target vect_float } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ float image[N][N+1] __attribute__ ((__aligned__(16)));
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=4) {
+       diff += image[j][i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < N; j++) {
+       image[i][j]=i+j;
+     }
+   }
+ 
+   foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < N; j+=4) {
+       diff += image[j][i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4a.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4a.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4a.c	(revision 0)
***************
*** 0 ****
--- 1,31 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ signed short in[N+M];
+ signed short coeff[M];
+ signed short out[N];
+ 
+ /* Outer-loop vectorization.
+    Currently not vectorized because of multiple-data-types in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* FORNOW. not vectorized until we support 0-stride accesses like coeff[j]. should be:
+    { scan-tree-dump-not "multiple types in nested loop." "vect" { xfail *-*-* } } } */
+ 
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1  "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4b.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4b.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4b.c	(revision 0)
***************
*** 0 ****
--- 1,31 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ signed short in[N+M];
+ signed short coeff[M];
+ int out[N];
+ 
+ /* Outer-loop vectorization.
+    Currently not vectorized because of multiple-data-types in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* FORNOW. not vectorized until we support 0-stride accesses like coeff[j]. should be:
+    { scan-tree-dump-not "multiple types in nested loop." "vect" { xfail *-*-* } } } */
+ 
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1  "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4.c	(revision 0)
***************
*** 0 ****
--- 1,55 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i]*coeff[j];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4c.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4c.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4c.c	(revision 0)
***************
*** 0 ****
--- 1,27 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned short coeff[M];
+ unsigned int out[N];
+ 
+ /* Outer-loop vectorization. */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned short diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { target vect_short_mult } } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4d.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4d.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4d.c	(revision 0)
***************
*** 0 ****
--- 1,51 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float out[N];
+ 
+ /* Outer-loop vectorization.  */
+ 
+ void
+ foo (){
+   int i,j;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < N; i++)
+     in[i] = i;
+ 
+   foo ();
+   
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=4) {
+       diff += in[j+i];
+     }
+     if (out[i] != diff)
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect"  } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4e.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4e.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4e.c	(revision 0)
***************
*** 0 ****
--- 1,25 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned int in[N+M];
+ unsigned short out[N];
+ 
+ /* Outer-loop vectorization. */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned int diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     out[i]=(unsigned short)diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4f.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4f.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4f.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4g.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4g.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4g.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4i.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4i.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4i.c	(revision 0)
***************
*** 0 ****
--- 1,28 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned char in[N+M];
+ unsigned short out[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned short
+ foo (){
+   int i,j;
+   unsigned short diff;
+   unsigned short s=0;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4j.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4j.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4j.c	(revision 0)
***************
*** 0 ****
--- 1,26 ----
+ /* { dg-do compile } */
+ 
+ #define N 40
+ #define M 128
+ unsigned char in[N+M];
+ unsigned short out[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ void
+ foo (){
+   int i,j;
+   unsigned short diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     out[i]=diff;
+   }
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4k.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4k.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4k.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4l.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4l.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4l.c	(revision 0)
***************
*** 0 ****
--- 1,70 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ unsigned char arr[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=diff;
+   }
+   return s;
+ }
+ 
+ unsigned int
+ bar (int i, unsigned int diff, unsigned short *in)
+ {
+     int j;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     return diff;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     arr[i] = 3;
+     diff = 0;
+     diff = bar (i, diff, in);
+     s += diff;
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "vect_recog_widen_sum_pattern: not allowed" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-4m.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-4m.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-4m.c	(revision 0)
***************
*** 0 ****
--- 1,58 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ unsigned short in[N+M];
+ unsigned int out[N];
+ 
+ /* Outer-loop vectorization. */
+ /* Not vectorized due to multiple-types in the inner-loop.  */
+ 
+ unsigned int
+ foo (){
+   int i,j;
+   unsigned int diff;
+   unsigned int s=0;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s+=((unsigned short)diff>>3);
+   }
+   return s;
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   unsigned int diff;
+   unsigned int s=0,sum=0;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N+M; i++) {
+     in[i] = i;
+   }
+ 
+   sum=foo ();
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j+=8) {
+       diff += in[j+i];
+     }
+     s += ((unsigned short)diff>>3);
+   }
+ 
+   if (s != sum)
+     abort ();
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect"  { xfail *-*-* } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-5.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-5.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-5.c	(revision 0)
***************
*** 0 ****
--- 1,80 ----
+ /* { dg-require-effective-target vect_int } */
+ 
+ #include <stdarg.h>
+ #include <signal.h>
+ #include "tree-vect.h"
+ 
+ #define N 64
+ #define MAX 42
+ 
+ extern void abort(void); 
+ 
+ int main1 ()
+ {  
+   float A[N] __attribute__ ((__aligned__(16)));
+   float B[N] __attribute__ ((__aligned__(16)));
+   float C[N] __attribute__ ((__aligned__(16)));
+   float D[N] __attribute__ ((__aligned__(16)));
+   float s;
+ 
+   int i, j;
+ 
+   for (i = 0; i < N; i++)
+     {
+       A[i] = i;
+       B[i] = i;
+       C[i] = i;
+       D[i] = i;
+     }
+ 
+   /* Outer-loop 1: Vectorizable with respect to dependence distance. */
+   for (i = 0; i < N-20; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+         s += C[j];
+       A[i] = A[i+20] + s;
+     }
+ 
+   /* check results:  */
+   for (i = 0; i < N-20; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+         s += C[j];
+       if (A[i] != D[i+20] + s)
+         abort ();
+     }
+ 
+   /* Outer-loop 2: Not vectorizable because of dependence distance. */
+   for (i = 0; i < 4; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+ 	s += C[j];
+       B[i] = B[i+3] + s;
+     }
+ 
+   /* check results:  */
+   for (i = 0; i < 4; i++)
+     {
+       s = 0;
+       for (j=0; j<N; j+=4)
+ 	s += C[j];
+       if (B[i] != D[i+3] + s)
+ 	abort ();
+     }
+ 
+   return 0;
+ }
+ 
+ int main ()
+ {
+   check_vect ();
+   return main1();
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "not vectorized: possible dependence between data-refs" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-6.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-6.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-6.c	(revision 0)
***************
*** 0 ****
--- 1,65 ----
+ /* { dg-require-effective-target vect_int } */
+ 
+ #include <stdarg.h>
+ #include <signal.h>
+ #include "tree-vect.h"
+ 
+ #define N 64
+ #define MAX 42
+ 
+ float A[N] __attribute__ ((__aligned__(16)));
+ float B[N] __attribute__ ((__aligned__(16)));
+ float C[N] __attribute__ ((__aligned__(16)));
+ float D[N] __attribute__ ((__aligned__(16)));
+ extern void abort(void); 
+ 
+ int main1 ()
+ {  
+   float s;
+ 
+   int i, j;
+ 
+   for (i = 0; i < 8; i++)
+     {
+       s = 0;
+       for (j=0; j<8; j+=4)
+ 	s += C[j];
+       A[i] = s;
+     }
+ 
+   return 0;
+ }
+ 
+ int main ()
+ {
+   int i,j;
+   float s;
+ 
+   check_vect ();
+ 
+   for (i = 0; i < N; i++)
+     {
+       A[i] = i;
+       B[i] = i;
+       C[i] = i;
+       D[i] = i;
+     }
+ 
+   main1();
+ 
+   /* check results:  */
+   for (i = 0; i < 8; i++)
+     {
+       s = 0;
+       for (j=0; j<8; j+=4)
+         s += C[j];
+       if (A[i] != s)
+         abort ();
+     }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { scan-tree-dump-times "zero step in outer loop." 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-fir.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-fir.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-fir.c	(revision 0)
***************
*** 0 ****
--- 1,77 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because we get too many BBs in the inner-loop:
+    the compiler doesn't realize that the inner-loop executes at least
+    once (because k<4), so it creates unneeded guard code to skip the
+    inner-loop in case it doesn't execute.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = k; j < M; j+=4) {
+       diff += in[j+i]*coeff[j]; 
+     }
+     out[i] += diff;
+   }
+  }
+ 
+ }
+
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+   
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-outer-fir-lb.c
===================================================================
*** testsuite/gcc.dg/vect/vect-outer-fir-lb.c	(revision 0)
--- testsuite/gcc.dg/vect/vect-outer-fir-lb.c	(revision 0)
***************
*** 0 ****
--- 1,80 ----
+ /* { dg-require-effective-target vect_float } */
+ 
+ #include <stdarg.h>
+ #include "tree-vect.h"
+ 
+ #define N 40
+ #define M 64
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+ 
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because the loop-count for the inner-loop
+    has a maybe_zero component. Will be fixed when we incorporate the
+    "cond_expr in rhs" patch.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+ 
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+ 
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     j = k;
+ 
+     do {
+       diff += in[j+i]*coeff[j];
+       j+=4;	
+     } while (j < M);
+ 
+     out[i] += diff;
+   }
+  }
+ 
+ }
+ 
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+ 
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+ 
+ 
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+ 
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+ 
+   foo ();
+   fir ();
+   
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+ 
+   return 0;
+ }
+ 
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/no-section-anchors-vect-outer-4h.c
===================================================================
*** testsuite/gcc.dg/vect/no-section-anchors-vect-outer-4h.c	(revision 0)
--- testsuite/gcc.dg/vect/no-section-anchors-vect-outer-4h.c	(revision 0)
***************
*** 0 ****
--- 1,47 ----
+ /* { dg-require-effective-target vect_int } */
+ #include <stdarg.h>
+ #include "tree-vect.h"
+
+ 
+ #define N 40
+ #define M 128
+ unsigned short a[M][N];
+ unsigned int out[N];
+
+ /* Outer-loop vectorization. */
+
+ void
+ foo (){
+   int i,j;
+   unsigned int diff;
+ 
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < M; j++) {
+       a[j][i] = 4;
+     }
+     out[i]=5;
+   }
+ }
+ 
+ int main (void)
+ {
+   int i, j;
+   check_vect ();
+
+   foo ();
+
+   for (i = 0; i < N; i++) {
+     for (j = 0; j < M; j++) {
+       if (a[j][i] != 4)
+         abort ();
+     }
+     if (out[i] != 5)
+       abort ();
+   }
+ 
+   return 0;
+ }
+
+
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: testsuite/gcc.dg/vect/vect-117.c
===================================================================
*** testsuite/gcc.dg/vect/vect-117.c	(revision 127202)
--- testsuite/gcc.dg/vect/vect-117.c	(working copy)
*************** static  int c[N][N] = {{ 1, 2, 3, 4, 5},
*** 20,26 ****

  volatile int foo;

! int main1 (int A[N][N])
  {

    int i,j;
--- 20,26 ----

  volatile int foo;

! int main1 (int A[N][N], int n)
  {

    int i,j;
*************** int main1 (int A[N][N])
*** 28,34 ****
    /* vectorizable */
    for (i = 1; i < N; i++)
    {
!     for (j = 0; j < N; j++)
      {
        A[i][j] = A[i-1][j] + A[i][j];
      }
--- 28,34 ----
    /* vectorizable */
    for (i = 1; i < N; i++)
    {
!     for (j = 0; j < n; j++)
      {
        A[i][j] = A[i-1][j] + A[i][j];
      }
*************** int main (void)
*** 42,48 ****
    int i,j;
  
    foo = 0;
!   main1 (a);
  
    /* check results: */
  
--- 42,48 ----
    int i,j;
  
    foo = 0;
!   main1 (a, N);

    /* check results: */

Index: testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c
===================================================================
*** testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c	(revision 0)
--- testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c	(revision 0)
***************
*** 0 ****
--- 1,77 ----
+ /* { dg-require-effective-target vect_float } */
+
+ #include <stdarg.h>
+ #include "../../tree-vect.h"
+
+ #define N 40
+ #define M 128
+ float in[N+M];
+ float coeff[M];
+ float out[N];
+ float fir_out[N];
+
+ /* Should be vectorized. Fixed misalignment in the inner-loop.  */
+ /* Currently not vectorized because we get too many BBs in the inner-loop:
+    the compiler doesn't realize that the inner-loop executes at least
+    once (because k<4), so it creates unneeded guard code to skip the
+    inner-loop in case it doesn't execute.  */
+ void foo (){
+  int i,j,k;
+  float diff;
+
+  for (i = 0; i < N; i++) {
+   out[i] = 0;
+  }
+
+  for (k = 0; k < 4; k++) {
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = k; j < M; j+=4) {
+       diff += in[j+i]*coeff[j];
+     }
+     out[i] += diff;
+   }
+  }
+
+ }
+
+ /* Vectorized. Changing misalignment in the inner-loop.  */
+ void fir (){
+   int i,j,k;
+   float diff;
+
+   for (i = 0; i < N; i++) {
+     diff = 0;
+     for (j = 0; j < M; j++) {
+       diff += in[j+i]*coeff[j];
+     }
+     fir_out[i] = diff;
+   }
+ }
+
+
+ int main (void)
+ {
+   check_vect ();
+   int i, j;
+   float diff;
+
+   for (i = 0; i < M; i++)
+     coeff[i] = i;
+   for (i = 0; i < N+M; i++)
+     in[i] = i;
+
+   foo ();
+   fir ();
+
+   for (i = 0; i < N; i++) {
+     if (out[i] != fir_out[i])
+       abort ();
+   }
+
+   return 0;
+ }
+
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 2 "vect" { xfail *-*-* } } } */
+ /* { dg-final { scan-tree-dump-times "OUTER LOOP VECTORIZED" 1 "vect" { xfail vect_no_align } } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
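
For readers unfamiliar with the transformation these testcases exercise, here
is a rough scalar emulation of outer-loop vectorization applied to fir(). It
is only a sketch and not what the vectorizer emits: VF, the acc[] array and
all function names below are made up for illustration, and VF=4 is an
assumption (four floats per vector). VF consecutive outer iterations walk
through the inner loop together, one accumulator per lane. Note that the
lanes read in[j+i] .. in[j+i+VF-1], so the start address moves by one element
per inner iteration, which is the "changing misalignment" case; in foo() the
inner loop steps by 4, so the offset within a vector stays fixed.

```c
#define N 40   /* outer-loop trip count, as in vect-outer-fir.c */
#define M 128  /* inner-loop trip count, as in vect-outer-fir.c */
#define VF 4   /* assumed vectorization factor; N is a multiple of VF */

static float in[N + M];
static float coeff[M];
static float out_scalar[N];
static float out_lanes[N];

/* Same initialization as main() in the testcase.  */
static void init_data (void)
{
  int i;
  for (i = 0; i < M; i++)
    coeff[i] = i;
  for (i = 0; i < N + M; i++)
    in[i] = i;
}

/* Scalar reference: same computation as fir() in the testcase.  */
static void fir_scalar (void)
{
  int i, j;
  for (i = 0; i < N; i++)
    {
      float diff = 0;
      for (j = 0; j < M; j++)
        diff += in[j + i] * coeff[j];
      out_scalar[i] = diff;
    }
}

/* Hypothetical emulation of the outer-loop-vectorized form: the outer
   loop advances by VF, and lane l handles outer iteration i+l.  Each
   lane adds its terms in the same order as the scalar loop.  */
static void fir_outer_lanes (void)
{
  int i, j, l;
  for (i = 0; i < N; i += VF)
    {
      float acc[VF] = { 0 };
      for (j = 0; j < M; j++)
        for (l = 0; l < VF; l++)
          acc[l] += in[j + i + l] * coeff[j];
      for (l = 0; l < VF; l++)
        out_lanes[i + l] = acc[l];
    }
}

/* Returns 1 if both versions produce identical results, 0 otherwise.  */
int lanes_match_scalar (void)
{
  int i;
  init_data ();
  fir_scalar ();
  fir_outer_lanes ();
  for (i = 0; i < N; i++)
    if (out_scalar[i] != out_lanes[i])
      return 0;
  return 1;
}
```

Calling lanes_match_scalar () from a small driver is enough to try this out.
Because every lane accumulates in the same order as the scalar loop, no
reassociation of the floating-point sum is involved in this particular form.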




end of thread, other threads:[~2007-08-12 15:02 UTC | newest]

Thread overview: 4+ messages
     [not found] <OF23B76DCA.3E1B28C1-ONC2257334.0041189F-C2257334.0041E49C@LocalDomain>
2007-08-11 17:01 ` [patch] [4.3 projects] outer-loop vectorization patch 2/n Dorit Nuzman
     [not found] <OF8F786C61.17F48632-ONC2257331.006C2237-C2257334.003A7381@LocalDomain>
2007-08-11 11:57 ` Dorit Nuzman
2007-08-11 10:36 Dorit Nuzman
2007-08-12 15:02 ` Dorit Nuzman
