public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH][RFC] Add versioning for constant strides for vectorization
@ 2009-01-23 18:03 Richard Guenther
  2009-01-25 14:08 ` Ira Rosen
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Richard Guenther @ 2009-01-23 18:03 UTC (permalink / raw)
  To: gcc-patches


This patch adds the capability to the vectorizer to perform versioning
for the case of a constant (suitable) stride.  For example for

subroutine to_product_of(self,a,b,a1,a2)
  complex(kind=8) :: self (:)
  complex(kind=8), intent(in) :: a(:,:)
  complex(kind=8), intent(in) :: b(:)
  integer a1,a2
  do i = 1,a1
    do j = 1,a2
      self(i) = self(i) + a(j,i)*b(j)
    end do
  end do
end subroutine

we can only apply vectorization if the strides of the fastest dimension
of self, a and b are one (they are loaded from the passed array
descriptors and thus appear as (loop invariant) variables).

During the implementation of this I noticed that peeling for
number of iterations (we have to unroll the above loop twice, and so
for an odd number of iterations have a epilogue loop for the remaining
iteration(s)) does not play well with versioning and we end up
vectorizing the wrong loop.  So I just disabled versioning if we
apply peeling with an epilogue loop and instead attach the versioning
condition to the pre-condition of the main loop that skips directly
to the epilogue if the number of iterations is too small.  We obviously
can use the epilogue loop as the non-vectorized version.

This patch also inserts an extra copyprop and dce pass before the
vectorizer so it can recognize the reduction in the above testcase
(LIM has made that reduction non-obvious).  So I noticed that
copyprop does not preserve loop-closed SSA form and fixed that as well.

Some earlier version bootstrapped and tested ok on 
x86_64-unknown-linux-gnu, a final attempt is still running.

I didn't yet performance test this extensively, but it might need
cost-model adjustments and/or need to wait until we have profile
feedback to properly seed vectorizer analysis here.  A micro-benchmark
based on the above loop shows around 15% improvement on AMD K10.

Feedback (and ppc testing) is still welcome of course.

Thanks,
Richard.

2009-01-23  Richard Guenther  <rguenther@suse.de>

	* passes.c (init_optimization_passes): Add copy-prop and dce
	before vectorization.
	* Makefile.in (tree-ssa-copy.o): Add $(CFGLOOP_H) dependency.
	* tree-ssa-copy.c (init_copy_prop): Do not propagate through
	single-argument PHIs if we are in loop-closed SSA form.
	* tree-data-ref.c (dr_analyze_innermost): Allow affine offsets.
	* tree-vect-analyze.c (vect_check_interleaving): Check that
	DR_STEP is constant.
	(vect_enhance_data_refs_alignment): If versioning for strides
	is required do not peel.
	(vect_analyze_data_ref_access): Allow non-constant step of
	a specific form, remember them for versioning.
	* params.def (vect-max-version-for-stride-checks): New param.
	(vect-version-for-stride-value): Likewise.
	* tree-vectorizer.c (slpeel_add_loop_guard): Pass extra guards
	for the pre-condition.
	(slpeel_tree_peel_loop_to_edge): Likewise.
	(new_loop_vec_info): Allocate stride versioning data.
	(destroy_loop_vec_info): Free stride versioning data.
	* tree-vectorizer.h (struct _loop_vec_info): Add variable_strides
	field.
	(LOOP_VINFO_VARIABLE_STRIDES): Define.
	(slpeel_tree_peel_loop_to_edge): Adjust declaration.
	* tree-vect-transform.c (vect_build_loop_niters): Take an
	optional sequence to append stmts.
	(vect_generate_tmps_on_preheader): Likewise.
	(vect_do_peeling_for_loop_bound): Take extra guards for the
	pre-condition.
	(vect_do_peeling_for_alignment): Adjust.
	(vect_create_cond_for_stride_checks): New function.
	(vect_loop_versioning): Take stmt and stmt list to put pre-condition
	guards if we are going to peel.  Do not apply versioning in that
	case.
	(vect_transform_loop): If we are peeling for loop bound only
	record extra pre-conditions, do not apply loop versioning.

	* gcc.dg/vect/fast-math-vect-complex-5.c: New testcase.
	* gfortran.dg/vect/fast-math-vect-complex-1.f90: Likewise.
	* gfortran.dg/vect/fast-math-vect-stride-1.f90: Likewise.

Index: trunk/gcc/passes.c
===================================================================
*** trunk.orig/gcc/passes.c	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/passes.c	2009-01-23 16:48:50.000000000 +0100
*************** init_optimization_passes (void)
*** 659,664 ****
--- 659,666 ----
  	  NEXT_PASS (pass_graphite_transforms);
  	  NEXT_PASS (pass_iv_canon);
  	  NEXT_PASS (pass_if_conversion);
+ 	  NEXT_PASS (pass_copy_prop);
+ 	  NEXT_PASS (pass_dce_loop);
  	  NEXT_PASS (pass_vectorize);
  	    {
  	      struct opt_pass **p = &pass_vectorize.pass.sub;
Index: trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c
===================================================================
*** /dev/null	1970-01-01 00:00:00.000000000 +0000
--- trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c	2009-01-23 16:48:50.000000000 +0100
***************
*** 0 ****
--- 1,18 ----
+ /* { dg-do compile } */
+ /* { dg-require-effective-target vect_double } */
+ 
+ #define NUM 64
+ _Complex double ad[NUM], bd[NUM], cd[NUM];
+ 
+ void testd(void)
+ {
+   int i;
+   int j;
+ 
+   for (i = 0; i < NUM; i++)
+     for (j = 0; j < NUM; j++)
+       cd[i] = cd[i] + ad[j] * bd[j];
+ }
+ 
+ /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
+ /* { dg-final { cleanup-tree-dump "vect" } } */
Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90
===================================================================
*** /dev/null	1970-01-01 00:00:00.000000000 +0000
--- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90	2009-01-23 16:48:50.000000000 +0100
***************
*** 0 ****
--- 1,16 ----
+ ! { dg-do compile }
+ 
+ subroutine to_product_of(self,a,b,a1,a2)
+   complex(kind=8) :: self (:)
+   complex(kind=8), intent(in) :: a(:,:)
+   complex(kind=8), intent(in) :: b(:)
+   integer a1,a2
+   do i = 1,a1
+     do j = 1,a2
+       self(i) = self(i) + a(i,j)*b(j)
+     end do
+   end do
+ end subroutine
+ 
+ ! { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } }
+ ! { dg-final { cleanup-tree-dump "vect" } }
Index: trunk/gcc/tree-data-ref.c
===================================================================
*** trunk.orig/gcc/tree-data-ref.c	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/tree-data-ref.c	2009-01-23 16:48:50.000000000 +0100
*************** dr_analyze_innermost (struct data_refere
*** 708,714 ****
        offset_iv.base = ssize_int (0);
        offset_iv.step = ssize_int (0);
      }
!   else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
      {
        if (dump_file && (dump_flags & TDF_DETAILS))
  	fprintf (dump_file, "failed: evolution of offset is not affine.\n");
--- 708,714 ----
        offset_iv.base = ssize_int (0);
        offset_iv.step = ssize_int (0);
      }
!   else if (!simple_iv (loop, stmt, poffset, &offset_iv, true))
      {
        if (dump_file && (dump_flags & TDF_DETAILS))
  	fprintf (dump_file, "failed: evolution of offset is not affine.\n");
Index: trunk/gcc/tree-vect-analyze.c
===================================================================
*** trunk.orig/gcc/tree-vect-analyze.c	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/tree-vect-analyze.c	2009-01-23 16:48:50.000000000 +0100
*************** vect_check_interleaving (struct data_ref
*** 1109,1114 ****
--- 1109,1116 ----
    type_size_b = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (drb))));
  
    if (type_size_a != type_size_b
+       || TREE_CODE (DR_STEP (dra)) != INTEGER_CST
+       || TREE_CODE (DR_STEP (drb)) != INTEGER_CST
        || tree_int_cst_compare (DR_STEP (dra), DR_STEP (drb))
        || !types_compatible_p (TREE_TYPE (DR_REF (dra)), 
                                TREE_TYPE (DR_REF (drb))))
*************** vect_enhance_data_refs_alignment (loop_v
*** 1825,1830 ****
--- 1827,1833 ----
    gimple stmt;
    stmt_vec_info stmt_info;
    int vect_versioning_for_alias_required;
+   int vect_versioning_for_strides_required;
  
    if (vect_print_dump_info (REPORT_DETAILS))
      fprintf (vect_dump, "=== vect_enhance_data_refs_alignment ===");
*************** vect_enhance_data_refs_alignment (loop_v
*** 1892,1904 ****
  
    vect_versioning_for_alias_required =
      (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
  
    /* Temporarily, if versioning for alias is required, we disable peeling
       until we support peeling and versioning.  Often peeling for alignment
       will require peeling for loop-bound, which in turn requires that we
       know how to adjust the loop ivs after the loop.  */
    if (vect_versioning_for_alias_required
!        || !vect_can_advance_ivs_p (loop_vinfo)
        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
      do_peeling = false;
  
--- 1895,1910 ----
  
    vect_versioning_for_alias_required =
      (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
+   vect_versioning_for_strides_required =
+     !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
  
    /* Temporarily, if versioning for alias is required, we disable peeling
       until we support peeling and versioning.  Often peeling for alignment
       will require peeling for loop-bound, which in turn requires that we
       know how to adjust the loop ivs after the loop.  */
    if (vect_versioning_for_alias_required
!       || vect_versioning_for_strides_required
!       || !vect_can_advance_ivs_p (loop_vinfo)
        || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
      do_peeling = false;
  
*************** vect_analyze_data_ref_access (struct dat
*** 2349,2357 ****
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
  
!   if (!step)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
  	fprintf (vect_dump, "bad data-ref access");
--- 2355,2364 ----
    stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
!   HOST_WIDE_INT dr_step;
  
!   if (!step
!       || TREE_CODE (step) != INTEGER_CST)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
  	fprintf (vect_dump, "bad data-ref access");
*************** vect_analyze_data_ref_access (struct dat
*** 2359,2364 ****
--- 2366,2372 ----
      }
  
    /* Don't allow invariant accesses.  */
+   dr_step = TREE_INT_CST_LOW (step);
    if (dr_step == 0)
      return false; 
  
*************** vect_analyze_data_refs (loop_vec_info lo
*** 3563,3568 ****
--- 3571,3620 ----
            return false;
          }
  
+       /* If the non-constant (but loop invariant) step is of the
+ 	 form NAME or NAME * CST where CST is the element size mark
+ 	 this ddr for versioning for strides and re-set DR_STEP
+ 	 to the value we will version for.  Otherwise reject
+ 	 non-constant steps.  */
+       if (TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
+ 	{
+ 	  tree step = DR_STEP (dr);
+ 
+ 	  STRIP_NOPS (step);
+ 	  if (flag_tree_vect_loop_version
+ 	      && (TREE_CODE (step) == SSA_NAME
+ 		  || (TREE_CODE (step) == MULT_EXPR
+ 		      && TREE_CODE (TREE_OPERAND (step, 1)) == INTEGER_CST)))
+ 	    {
+ 	      tree stride;
+ 	      tree newstep;
+ 
+ 	      stride = step;
+ 	      if (TREE_CODE (step) == MULT_EXPR)
+ 		stride = TREE_OPERAND (step, 0);
+ 	      STRIP_NOPS (stride);
+ 	      if (TREE_CODE (stride) != SSA_NAME)
+ 		return false;
+ 
+ 	      bitmap_set_bit (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo),
+ 			      SSA_NAME_VERSION (stride));
+ 	      if (bitmap_count_bits (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo))
+ 		  > (unsigned)PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS))
+ 		return false;
+ 
+ 	      /* ???  Delay this change until after versioning or
+ 	         preserve the original step somewhere.  */
+ 	      newstep = build_int_cst (TREE_TYPE (step),
+ 		       PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE));
+ 	      if (TREE_CODE (step) == MULT_EXPR)
+ 		newstep = int_const_binop (MULT_EXPR, newstep,
+ 					   TREE_OPERAND (step, 1), false);
+ 	      DR_STEP (dr) = newstep;
+ 	    }
+ 	  else
+ 	    return false;
+ 	}
+ 
        if (!DR_SYMBOL_TAG (dr))
          {
            if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
Index: trunk/gcc/params.def
===================================================================
*** trunk.orig/gcc/params.def	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/params.def	2009-01-23 16:48:50.000000000 +0100
*************** DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_ALIA
*** 506,511 ****
--- 506,521 ----
           "Bound on number of runtime checks inserted by the vectorizer's loop versioning for alias check",
           10, 0, 0)
  
+ DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS,
+          "vect-max-version-for-stride-checks",
+          "Bound on number of runtime checks inserted by the vectorizer's loop versioning for stride check",
+          4, 0, 0)
+ 
+ DEFPARAM(PARAM_VECT_VERSION_FOR_STRIDE_VALUE,
+          "vect-version-for-stride-value",
+          "The constant stride in elements the vectorizer uses for loop versioning",
+          1, 0, 0)
+ 
  DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS,
  	 "max-cselib-memory-locations",
  	 "The maximum memory locations recorded by cselib",
Index: trunk/gcc/tree-vectorizer.c
===================================================================
*** trunk.orig/gcc/tree-vectorizer.c	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/tree-vectorizer.c	2009-01-23 16:48:50.000000000 +0100
*************** slpeel_tree_duplicate_loop_to_edge_cfg (
*** 927,934 ****
     Returns the skip edge.  */
  
  static edge
! slpeel_add_loop_guard (basic_block guard_bb, tree cond, basic_block exit_bb,
! 		       basic_block dom_bb)
  {
    gimple_stmt_iterator gsi;
    edge new_e, enter_e;
--- 927,935 ----
     Returns the skip edge.  */
  
  static edge
! slpeel_add_loop_guard (basic_block guard_bb, tree cond,
! 		       gimple_seq cond_expr_stmt_list,
! 		       basic_block exit_bb, basic_block dom_bb)
  {
    gimple_stmt_iterator gsi;
    edge new_e, enter_e;
*************** slpeel_add_loop_guard (basic_block guard
*** 941,951 ****
    gsi = gsi_last_bb (guard_bb);
  
    cond = force_gimple_operand (cond, &gimplify_stmt_list, true, NULL_TREE);
    cond_stmt = gimple_build_cond (NE_EXPR,
  				 cond, build_int_cst (TREE_TYPE (cond), 0),
  				 NULL_TREE, NULL_TREE);
!   if (gimplify_stmt_list)
!     gsi_insert_seq_after (&gsi, gimplify_stmt_list, GSI_NEW_STMT);
  
    gsi = gsi_last_bb (guard_bb);
    gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
--- 942,954 ----
    gsi = gsi_last_bb (guard_bb);
  
    cond = force_gimple_operand (cond, &gimplify_stmt_list, true, NULL_TREE);
+   if (gimplify_stmt_list)
+     gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
    cond_stmt = gimple_build_cond (NE_EXPR,
  				 cond, build_int_cst (TREE_TYPE (cond), 0),
  				 NULL_TREE, NULL_TREE);
!   if (cond_expr_stmt_list)
!     gsi_insert_seq_after (&gsi, cond_expr_stmt_list, GSI_NEW_STMT);
  
    gsi = gsi_last_bb (guard_bb);
    gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
*************** struct loop*
*** 1151,1157 ****
  slpeel_tree_peel_loop_to_edge (struct loop *loop, 
  			       edge e, tree first_niters, 
  			       tree niters, bool update_first_loop_count,
! 			       unsigned int th, bool check_profitability)
  {
    struct loop *new_loop = NULL, *first_loop, *second_loop;
    edge skip_e;
--- 1154,1161 ----
  slpeel_tree_peel_loop_to_edge (struct loop *loop, 
  			       edge e, tree first_niters, 
  			       tree niters, bool update_first_loop_count,
! 			       unsigned int th, bool check_profitability,
! 			       tree cond_expr, gimple_seq cond_expr_stmt_list)
  {
    struct loop *new_loop = NULL, *first_loop, *second_loop;
    edge skip_e;
*************** slpeel_tree_peel_loop_to_edge (struct lo
*** 1325,1330 ****
--- 1329,1342 ----
  	  pre_condition = fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
  				       cost_pre_condition, pre_condition);
  	}
+       if (cond_expr)
+ 	{
+ 	  pre_condition =
+ 	    fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
+ 			 pre_condition,
+ 			 fold_build1 (TRUTH_NOT_EXPR, boolean_type_node,
+ 				      cond_expr));
+ 	}
      }
  
    /* Prologue peeling.  */  
*************** slpeel_tree_peel_loop_to_edge (struct lo
*** 1340,1345 ****
--- 1352,1358 ----
      }
  
    skip_e = slpeel_add_loop_guard (bb_before_first_loop, pre_condition,
+ 				  cond_expr_stmt_list,
                                    bb_before_second_loop, bb_before_first_loop);
    slpeel_update_phi_nodes_for_guard1 (skip_e, first_loop,
  				      first_loop == new_loop,
*************** slpeel_tree_peel_loop_to_edge (struct lo
*** 1377,1383 ****
  
    pre_condition = 
  	fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
!   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition,
                                    bb_after_second_loop, bb_before_first_loop);
    slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
                                       second_loop == new_loop, &new_exit_bb);
--- 1390,1396 ----
  
    pre_condition = 
  	fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
!   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
                                    bb_after_second_loop, bb_before_first_loop);
    slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
                                       second_loop == new_loop, &new_exit_bb);
*************** new_loop_vec_info (struct loop *loop)
*** 1714,1719 ****
--- 1727,1733 ----
    LOOP_VINFO_MAY_ALIAS_DDRS (res) =
      VEC_alloc (ddr_p, heap,
  	       PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_ALIAS_CHECKS));
+   LOOP_VINFO_VARIABLE_STRIDES (res) = BITMAP_ALLOC (NULL);
    LOOP_VINFO_STRIDED_STORES (res) = VEC_alloc (gimple, heap, 10);
    LOOP_VINFO_SLP_INSTANCES (res) = VEC_alloc (slp_instance, heap, 10);
    LOOP_VINFO_SLP_UNROLLING_FACTOR (res) = 1;
*************** destroy_loop_vec_info (loop_vec_info loo
*** 1800,1805 ****
--- 1814,1820 ----
    free_dependence_relations (LOOP_VINFO_DDRS (loop_vinfo));
    VEC_free (gimple, heap, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo));
    VEC_free (ddr_p, heap, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo));
+   BITMAP_FREE (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
    slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
    for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
      vect_free_slp_instance (instance);
Index: trunk/gcc/tree-vectorizer.h
===================================================================
*** trunk.orig/gcc/tree-vectorizer.h	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/tree-vectorizer.h	2009-01-23 16:48:50.000000000 +0100
*************** typedef struct _loop_vec_info {
*** 210,215 ****
--- 210,219 ----
    /* All data dependences in the loop.  */
    VEC (ddr_p, heap) *ddrs;
  
+   /* SSA_NAMEs representing variable strides in data references.
+      Candidates for a run-time stride check.  */
+   bitmap variable_strides;
+ 
    /* Data Dependence Relations defining address ranges that are candidates
       for a run-time aliasing check.  */
    VEC (ddr_p, heap) *may_alias_ddrs;
*************** typedef struct _loop_vec_info {
*** 254,259 ****
--- 258,264 ----
  #define LOOP_VINFO_LOC(L)             (L)->loop_line_number
  #define LOOP_VINFO_MAY_ALIAS_DDRS(L)  (L)->may_alias_ddrs
  #define LOOP_VINFO_STRIDED_STORES(L)  (L)->strided_stores
+ #define LOOP_VINFO_VARIABLE_STRIDES(L) (L)->variable_strides
  #define LOOP_VINFO_SLP_INSTANCES(L)   (L)->slp_instances
  #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
  
*************** extern bitmap vect_memsyms_to_rename;
*** 707,713 ****
     divide by the vectorization factor, and to peel the first few iterations
     to force the alignment of data references in the loop.  */
  extern struct loop *slpeel_tree_peel_loop_to_edge 
!   (struct loop *, edge, tree, tree, bool, unsigned int, bool);
  extern void set_prologue_iterations (basic_block, tree,
  				     struct loop *, unsigned int);
  struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
--- 712,718 ----
     divide by the vectorization factor, and to peel the first few iterations
     to force the alignment of data references in the loop.  */
  extern struct loop *slpeel_tree_peel_loop_to_edge 
!   (struct loop *, edge, tree, tree, bool, unsigned int, bool, tree, gimple_seq);
  extern void set_prologue_iterations (basic_block, tree,
  				     struct loop *, unsigned int);
  struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
Index: trunk/gcc/tree-vect-transform.c
===================================================================
*** trunk.orig/gcc/tree-vect-transform.c	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/tree-vect-transform.c	2009-01-23 16:48:50.000000000 +0100
*************** static tree get_initial_def_for_reductio
*** 65,72 ****
  
  /* Utility function dealing with loop peeling (not peeling itself).  */
  static void vect_generate_tmps_on_preheader 
!   (loop_vec_info, tree *, tree *, tree *);
! static tree vect_build_loop_niters (loop_vec_info);
  static void vect_update_ivs_after_vectorizer (loop_vec_info, tree, edge); 
  static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
  static void vect_update_init_of_dr (struct data_reference *, tree niters);
--- 65,72 ----
  
  /* Utility function dealing with loop peeling (not peeling itself).  */
  static void vect_generate_tmps_on_preheader 
!   (loop_vec_info, tree *, tree *, tree *, gimple_seq);
! static tree vect_build_loop_niters (loop_vec_info, gimple_seq);
  static void vect_update_ivs_after_vectorizer (loop_vec_info, tree, edge); 
  static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
  static void vect_update_init_of_dr (struct data_reference *, tree niters);
*************** vect_transform_stmt (gimple stmt, gimple
*** 7199,7205 ****
     on the loop preheader.  */
  
  static tree
! vect_build_loop_niters (loop_vec_info loop_vinfo)
  {
    tree ni_name, var;
    gimple_seq stmts = NULL;
--- 7199,7205 ----
     on the loop preheader.  */
  
  static tree
! vect_build_loop_niters (loop_vec_info loop_vinfo, gimple_seq seq)
  {
    tree ni_name, var;
    gimple_seq stmts = NULL;
*************** vect_build_loop_niters (loop_vec_info lo
*** 7214,7221 ****
    pe = loop_preheader_edge (loop);
    if (stmts)
      {
!       basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
!       gcc_assert (!new_bb);
      }
        
    return ni_name;
--- 7214,7226 ----
    pe = loop_preheader_edge (loop);
    if (stmts)
      {
!       if (seq)
! 	gimple_seq_add_seq (&seq, stmts);
!       else
! 	{
! 	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
! 	  gcc_assert (!new_bb);
! 	}
      }
        
    return ni_name;
*************** static void
*** 7234,7240 ****
  vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, 
  				 tree *ni_name_ptr,
  				 tree *ratio_mult_vf_name_ptr, 
! 				 tree *ratio_name_ptr)
  {
  
    edge pe;
--- 7239,7246 ----
  vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, 
  				 tree *ni_name_ptr,
  				 tree *ratio_mult_vf_name_ptr, 
! 				 tree *ratio_name_ptr,
! 				 gimple_seq cond_expr_stmt_list)
  {
  
    edge pe;
*************** vect_generate_tmps_on_preheader (loop_ve
*** 7254,7260 ****
    /* Generate temporary variable that contains 
       number of iterations loop executes.  */
  
!   ni_name = vect_build_loop_niters (loop_vinfo);
    log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
  
    /* Create: ratio = ni >> log2(vf) */
--- 7260,7266 ----
    /* Generate temporary variable that contains 
       number of iterations loop executes.  */
  
!   ni_name = vect_build_loop_niters (loop_vinfo, cond_expr_stmt_list);
    log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
  
    /* Create: ratio = ni >> log2(vf) */
*************** vect_generate_tmps_on_preheader (loop_ve
*** 7267,7275 ****
  
        stmts = NULL;
        ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
!       pe = loop_preheader_edge (loop);
!       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
!       gcc_assert (!new_bb);
      }
         
    /* Create: ratio_mult_vf = ratio << log2 (vf).  */
--- 7273,7286 ----
  
        stmts = NULL;
        ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
!       if (cond_expr_stmt_list)
! 	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
!       else
! 	{
! 	  pe = loop_preheader_edge (loop);
! 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
! 	  gcc_assert (!new_bb);
! 	}
      }
         
    /* Create: ratio_mult_vf = ratio << log2 (vf).  */
*************** vect_generate_tmps_on_preheader (loop_ve
*** 7284,7292 ****
        stmts = NULL;
        ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
  						 true, var);
!       pe = loop_preheader_edge (loop);
!       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
!       gcc_assert (!new_bb);
      }
  
    *ni_name_ptr = ni_name;
--- 7295,7308 ----
        stmts = NULL;
        ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
  						 true, var);
!       if (cond_expr_stmt_list)
! 	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
!       else
! 	{
! 	  pe = loop_preheader_edge (loop);
! 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
! 	  gcc_assert (!new_bb);
! 	}
      }
  
    *ni_name_ptr = ni_name;
*************** conservative_cost_threshold (loop_vec_in
*** 7470,7476 ****
     NITERS / VECTORIZATION_FACTOR times (this value is placed into RATIO).  */
  
  static void 
! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio)
  {
    tree ni_name, ratio_mult_vf_name;
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
--- 7486,7493 ----
     NITERS / VECTORIZATION_FACTOR times (this value is placed into RATIO).  */
  
  static void 
! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
! 				tree cond_expr, gimple_seq cond_expr_stmt_list)
  {
    tree ni_name, ratio_mult_vf_name;
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
*************** vect_do_peeling_for_loop_bound (loop_vec
*** 7493,7499 ****
       ratio = ni_name / vf
       ratio_mult_vf_name = ratio * vf  */
    vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
! 				   &ratio_mult_vf_name, ratio);
  
    loop_num  = loop->num; 
  
--- 7510,7517 ----
       ratio = ni_name / vf
       ratio_mult_vf_name = ratio * vf  */
    vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
! 				   &ratio_mult_vf_name, ratio,
! 				   cond_expr_stmt_list);
  
    loop_num  = loop->num; 
  
*************** vect_do_peeling_for_loop_bound (loop_vec
*** 7501,7507 ****
       peeling for alignment.  */
    if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
        && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
!       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
      {
        check_profitability = true;
  
--- 7519,7526 ----
       peeling for alignment.  */
    if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
        && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
!       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo)
!       && !cond_expr)
      {
        check_profitability = true;
  
*************** vect_do_peeling_for_loop_bound (loop_vec
*** 7514,7520 ****
  
    new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
                                              ratio_mult_vf_name, ni_name, false,
!                                             th, check_profitability);
    gcc_assert (new_loop);
    gcc_assert (loop_num == loop->num);
  #ifdef ENABLE_CHECKING
--- 7533,7540 ----
  
    new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
                                              ratio_mult_vf_name, ni_name, false,
!                                             th, check_profitability,
! 					    cond_expr, cond_expr_stmt_list);
    gcc_assert (new_loop);
    gcc_assert (loop_num == loop->num);
  #ifdef ENABLE_CHECKING
*************** vect_do_peeling_for_alignment (loop_vec_
*** 7738,7744 ****
  
    initialize_original_copy_tables ();
  
!   ni_name = vect_build_loop_niters (loop_vinfo);
    niters_of_prolog_loop = vect_gen_niters_for_prolog_loop (loop_vinfo, ni_name);
    
  
--- 7758,7764 ----
  
    initialize_original_copy_tables ();
  
!   ni_name = vect_build_loop_niters (loop_vinfo, NULL);
    niters_of_prolog_loop = vect_gen_niters_for_prolog_loop (loop_vinfo, ni_name);
    
  
*************** vect_do_peeling_for_alignment (loop_vec_
*** 7759,7765 ****
    new_loop =
      slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
  				   niters_of_prolog_loop, ni_name, true,
! 				   th, check_profitability);
  
    gcc_assert (new_loop);
  #ifdef ENABLE_CHECKING
--- 7779,7785 ----
    new_loop =
      slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
  				   niters_of_prolog_loop, ni_name, true,
! 				   th, check_profitability, NULL_TREE, NULL);
  
    gcc_assert (new_loop);
  #ifdef ENABLE_CHECKING
*************** vect_create_cond_for_align_checks (loop_
*** 7909,7914 ****
--- 7929,7981 ----
      *cond_expr = part_cond_expr;
  }
  
+ /* Function vect_create_cond_for_stride_checks.
+ 
+    Create a conditional expression that represents the stride checks for
+    all of the stride SSA_NAMEs used in data references (array element
+    references) whose stride must be checked at runtime.
+ 
+    Input:
+    COND_EXPR  - input conditional expression.  New conditions will be chained
+                 with logical AND operation.
+    LOOP_VINFO - on field of the loop information is used.
+                 LOOP_VINFO_VARIABLE_STRIDES is a bitmap of SSA_NAMEs to be
+ 		checked.
+ 
+    Output:
+    COND_EXPR_STMT_LIST - statements needed to construct the conditional
+                          expression.
+    The returned value is the conditional expression to be used in the if
+    statement that controls which version of the loop gets executed at runtime.
+ 
+    The stride we do versioning for is currently specified by a compile-time
+    param.  In future the stride should be chosen by information from
+    profile-feedback.  */
+ 
+ static void
+ vect_create_cond_for_stride_checks (loop_vec_info loop_vinfo,
+ 				    tree *cond_expr)
+ {
+   bitmap_iterator bi;
+   unsigned int i;
+   HOST_WIDE_INT stride;
+ 
+   stride = PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE);
+ 
+   EXECUTE_IF_SET_IN_BITMAP (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo), 0, i, bi)
+     {
+       tree name = ssa_name (i);
+       tree cond = fold_build2 (EQ_EXPR, boolean_type_node,
+ 			       name,
+ 			       build_int_cst (TREE_TYPE (name), stride));
+       if (*cond_expr)
+ 	*cond_expr = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
+ 				  *cond_expr, cond);
+       else
+ 	*cond_expr = cond;
+     }
+ }
+ 
  /* Function vect_vfa_segment_size.
  
     Create an expression that computes the size of segment
*************** vect_create_cond_for_alias_checks (loop_
*** 8076,8087 ****
     cost model initially.  */
  
  static void
! vect_loop_versioning (loop_vec_info loop_vinfo)
  {
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    struct loop *nloop;
-   tree cond_expr = NULL_TREE;
-   gimple_seq cond_expr_stmt_list = NULL;
    basic_block condition_bb;
    gimple_stmt_iterator gsi, cond_exp_gsi;
    basic_block merge_bb;
--- 8143,8153 ----
     cost model initially.  */
  
  static void
! vect_loop_versioning (loop_vec_info loop_vinfo, bool do_versioning,
! 		      tree *cond_expr, gimple_seq *cond_expr_stmt_list)
  {
    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
    struct loop *nloop;
    basic_block condition_bb;
    gimple_stmt_iterator gsi, cond_exp_gsi;
    basic_block merge_bb;
*************** vect_loop_versioning (loop_vec_info loop
*** 8101,8129 ****
    th = conservative_cost_threshold (loop_vinfo,
  				    min_profitable_iters);
  
!   cond_expr =
!     build2 (GT_EXPR, boolean_type_node, scalar_loop_iters, 
! 	    build_int_cst (TREE_TYPE (scalar_loop_iters), th));
  
!   cond_expr = force_gimple_operand (cond_expr, &cond_expr_stmt_list,
! 				    false, NULL_TREE);
  
    if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
!       vect_create_cond_for_align_checks (loop_vinfo, &cond_expr,
! 					 &cond_expr_stmt_list);
  
    if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
!     vect_create_cond_for_alias_checks (loop_vinfo, &cond_expr, 
! 				       &cond_expr_stmt_list);
  
!   cond_expr =
!     fold_build2 (NE_EXPR, boolean_type_node, cond_expr, integer_zero_node);
!   cond_expr =
!     force_gimple_operand (cond_expr, &gimplify_stmt_list, true, NULL_TREE);
!   gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
  
    initialize_original_copy_tables ();
!   nloop = loop_version (loop, cond_expr, &condition_bb,
  			prob, prob, REG_BR_PROB_BASE - prob, true);
    free_original_copy_tables();
  
--- 8167,8200 ----
    th = conservative_cost_threshold (loop_vinfo,
  				    min_profitable_iters);
  
!   *cond_expr =
!     fold_build2 (GT_EXPR, boolean_type_node, scalar_loop_iters,
! 		 build_int_cst (TREE_TYPE (scalar_loop_iters), th));
  
!   *cond_expr = force_gimple_operand (*cond_expr, cond_expr_stmt_list,
! 				     false, NULL_TREE);
  
    if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
!       vect_create_cond_for_align_checks (loop_vinfo, cond_expr,
! 					 cond_expr_stmt_list);
  
    if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
!     vect_create_cond_for_alias_checks (loop_vinfo, cond_expr,
! 				       cond_expr_stmt_list);
  
!   if (!bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
!     vect_create_cond_for_stride_checks (loop_vinfo, cond_expr);
! 
!   *cond_expr =
!     fold_build2 (NE_EXPR, boolean_type_node, *cond_expr, integer_zero_node);
!   *cond_expr =
!     force_gimple_operand (*cond_expr, &gimplify_stmt_list, true, NULL_TREE);
!   gimple_seq_add_seq (cond_expr_stmt_list, gimplify_stmt_list);
!   if (!do_versioning)
!     return;
  
    initialize_original_copy_tables ();
!   nloop = loop_version (loop, *cond_expr, &condition_bb,
  			prob, prob, REG_BR_PROB_BASE - prob, true);
    free_original_copy_tables();
  
*************** vect_loop_versioning (loop_vec_info loop
*** 8154,8164 ****
    /* End loop-exit-fixes after versioning.  */
  
    update_ssa (TODO_update_ssa);
!   if (cond_expr_stmt_list)
      {
        cond_exp_gsi = gsi_last_bb (condition_bb);
!       gsi_insert_seq_before (&cond_exp_gsi, cond_expr_stmt_list, GSI_SAME_STMT);
      }
  }
  
  /* Remove a group of stores (for SLP or interleaving), free their 
--- 8225,8238 ----
    /* End loop-exit-fixes after versioning.  */
  
    update_ssa (TODO_update_ssa);
!   if (*cond_expr_stmt_list)
      {
        cond_exp_gsi = gsi_last_bb (condition_bb);
!       gsi_insert_seq_before (&cond_exp_gsi, *cond_expr_stmt_list,
! 			     GSI_SAME_STMT);
!       *cond_expr_stmt_list = NULL;
      }
+   *cond_expr = NULL_TREE;
  }
  
  /* Remove a group of stores (for SLP or interleaving), free their 
*************** vect_transform_loop (loop_vec_info loop_
*** 8320,8342 ****
    bool strided_store;
    bool slp_scheduled = false;
    unsigned int nunits;
  
    if (vect_print_dump_info (REPORT_DETAILS))
      fprintf (vect_dump, "=== vec_transform_loop ===");
  
-   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
-       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
-     vect_loop_versioning (loop_vinfo);
- 
-   /* CHECKME: we wouldn't need this if we called update_ssa once
-      for all loops.  */
-   bitmap_zero (vect_memsyms_to_rename);
- 
    /* Peel the loop if there are data refs with unknown alignment.
       Only one data ref with unknown store is allowed.  */
  
    if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
      vect_do_peeling_for_alignment (loop_vinfo);
    
    /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
       compile time constant), or it is a constant that doesn't divide by the
--- 8394,8427 ----
    bool strided_store;
    bool slp_scheduled = false;
    unsigned int nunits;
+   tree cond_expr = NULL_TREE;
+   gimple_seq cond_expr_stmt_list = NULL;
+   bool do_peeling_for_loop_bound;
  
    if (vect_print_dump_info (REPORT_DETAILS))
      fprintf (vect_dump, "=== vec_transform_loop ===");
  
    /* Peel the loop if there are data refs with unknown alignment.
       Only one data ref with unknown store is allowed.  */
  
    if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
      vect_do_peeling_for_alignment (loop_vinfo);
+ 
+   do_peeling_for_loop_bound
+     = (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+        || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+ 	   && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0));
+ 
+   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
+       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
+       || !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
+     vect_loop_versioning (loop_vinfo,
+ 			  !do_peeling_for_loop_bound,
+ 			  &cond_expr, &cond_expr_stmt_list);
+ 
+   /* CHECKME: we wouldn't need this if we called update_ssa once
+      for all loops.  */
+   bitmap_zero (vect_memsyms_to_rename);
    
    /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
       compile time constant), or it is a constant that doesn't divide by the
*************** vect_transform_loop (loop_vec_info loop_
*** 8346,8355 ****
       will remain scalar and will compute the remaining (n%VF) iterations.
       (VF is the vectorization factor).  */
  
!   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
!       || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
!           && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0))
!     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio);
    else
      ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
  		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
--- 8431,8439 ----
       will remain scalar and will compute the remaining (n%VF) iterations.
       (VF is the vectorization factor).  */
  
!   if (do_peeling_for_loop_bound)
!     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio,
! 				    cond_expr, cond_expr_stmt_list);
    else
      ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
  		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90
===================================================================
*** /dev/null	1970-01-01 00:00:00.000000000 +0000
--- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90	2009-01-23 16:48:50.000000000 +0100
***************
*** 0 ****
--- 1,17 ----
+ ! { dg-do compile }
+ 
+ subroutine to_product_of(self,a,b)
+   real(kind=8), dimension(:,:) :: self
+   real(kind=8), dimension(:,:), intent(in) :: a, b
+   integer(kind=kind(1)) :: dim1, dim2
+   dim1 = size(self,1)
+   dim2 = size(self,2)
+   do i = 1,dim1
+     do j = 1,dim2
+       self(i,j) = sum(a(i,:)*b(:,j))
+     end do
+   end do
+ end subroutine
+ 
+ ! { dg-final { scan-tree-dump "vectorized 1 loop" "vect" } }
+ ! { dg-final { cleanup-tree-dump "vect" } }
Index: trunk/gcc/Makefile.in
===================================================================
*** trunk.orig/gcc/Makefile.in	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/Makefile.in	2009-01-23 16:48:50.000000000 +0100
*************** tree-nrv.o : tree-nrv.c $(CONFIG_H) $(SY
*** 2135,2141 ****
  tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
     $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h $(DIAGNOSTIC_H) \
     $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
!    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $(FLAGS_H)
  tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H) \
     $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h \
     $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
--- 2135,2142 ----
  tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
     $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h $(DIAGNOSTIC_H) \
     $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
!    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $(FLAGS_H) \
!    $(CFGLOOP_H)
  tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H) \
     $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h \
     $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
Index: trunk/gcc/tree-ssa-copy.c
===================================================================
*** trunk.orig/gcc/tree-ssa-copy.c	2009-01-23 16:47:53.000000000 +0100
--- trunk/gcc/tree-ssa-copy.c	2009-01-23 16:48:50.000000000 +0100
*************** along with GCC; see the file COPYING3.
*** 37,42 ****
--- 37,43 ----
  #include "tree-pass.h"
  #include "tree-ssa-propagate.h"
  #include "langhooks.h"
+ #include "cfgloop.h"
  
  /* This file implements the copy propagation pass and provides a
     handful of interfaces for performing const/copy propagation and
*************** init_copy_prop (void)
*** 991,997 ****
            tree def;
  
  	  def = gimple_phi_result (phi);
! 	  if (!is_gimple_reg (def))
              prop_set_simulate_again (phi, false);
  	  else
              prop_set_simulate_again (phi, true);
--- 992,1004 ----
            tree def;
  
  	  def = gimple_phi_result (phi);
! 	  if (!is_gimple_reg (def)
! 	      /* In loop-closed SSA form do not copy-propagate through
! 	         PHI nodes.  Technically this is only needed for loop
! 		 exit PHIs, but this is difficult to query.  */
! 	      || (current_loops
! 		  && gimple_phi_num_args (phi) == 1
! 		  && loops_state_satisfies_p (LOOP_CLOSED_SSA)))
              prop_set_simulate_again (phi, false);
  	  else
              prop_set_simulate_again (phi, true);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for vectorization
  2009-01-23 18:03 [PATCH][RFC] Add versioning for constant strides for vectorization Richard Guenther
@ 2009-01-25 14:08 ` Ira Rosen
  2009-01-27 12:42 ` Dorit Nuzman
  2010-03-13 19:19 ` Jack Howarth
  2 siblings, 0 replies; 11+ messages in thread
From: Ira Rosen @ 2009-01-25 14:08 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches

gcc-patches-owner@gcc.gnu.org wrote on 23/01/2009 18:08:43:

>
> This patch adds the capability to the vectorizer to perform versioning
> for the case of a constant (suitable) stride.  For example for
>
> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
>
> we can only apply vectorization if the strides of the fastest dimension
> of self, a and b are one (they are loaded from the passed array
> descriptors and thus appear as (loop invariant) variables).
>
> During the implementation of this I noticed that peeling for
> number of iterations (we have to unroll the above loop twice, and so
> for an odd number of iterations have a epilogue loop for the remaining
> iteration(s)) does not play well with versioning and we end up
> vectorizing the wrong loop.  So I just disabled versioning if we
> apply peeling with an epilogue loop and instead attach the versioning
> condition to the pre-condition of the main loop that skips directly
> to the epilogue if the number of iterations is too small.  We obviously
> can use the epilogue loop as the non-vectorized version.
>
> This patch also inserts an extra copyprop and dce pass before the
> vectorizer so it can recognize the reduction in the above testcase
> (LIM has made that reduction non-obvious).  So I noticed that
> copyprop does not preserve loop-closed SSA form and fixed that as well.
>
> Some earlier version bootstrapped and tested ok on
> x86_64-unknown-linux-gnu, a final attempt is still running.
>
> I didn't yet performance test this extensively, but it might need
> cost-model adjustments and/or need to wait until we have profile
> feedback to properly seed vectorizer analysis here.  A micro-benchmark
> based on the above loop shows around 15% improvement on AMD K10.
>
> Feedback (and ppc testing) is still welcome of course.

Regtested on powerpc64-suse-linux..


>
> 2009-01-23  Richard Guenther  <rguenther@suse.de>
>
>
>    * gcc.dg/vect/fast-math-vect-complex-5.c: New testcase.
>    * gfortran.dg/vect/fast-math-vect-complex-1.f90: Likewise.
>    * gfortran.dg/vect/fast-math-vect-stride-1.f90: Likewise.

The Fortran testcases require vect_double as well.
The testcases get vectorized on powerpc64-suse-linux if the type is
_Complex float/complex(kind=4).

Thanks,
Ira



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for vectorization
  2009-01-23 18:03 [PATCH][RFC] Add versioning for constant strides for vectorization Richard Guenther
  2009-01-25 14:08 ` Ira Rosen
@ 2009-01-27 12:42 ` Dorit Nuzman
  2009-01-27 12:55   ` Richard Guenther
  2010-03-13 19:19 ` Jack Howarth
  2 siblings, 1 reply; 11+ messages in thread
From: Dorit Nuzman @ 2009-01-27 12:42 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches

>
> This patch adds the capability to the vectorizer to perform versioning
> for the case of a constant (suitable) stride.  For example for
>

awesome!

Another very useful application of the ability to work with unknown strides
is in the context of outer-loop vectorization: if you have an unknown
stride in the inner-loop and stride one in the outer-loop, you can
vectorize the outer-loop (in which case you don't care about the inner-loop
stride as long as it is invariant). So for this purpose there's no need for
versioning - just to relax some assumptions here and there in the
vectorizer about the inner-loop stride having to be a known constant, which
is what part of your patch already starts to do (I know this is not the
purpose of your patch, but it provides initial infrastructure towards
that). I think this is quite a common case, and it appears in several PRs -
e.g. PR33113 (with a bit more detailed discussion) and PR37021 (which has a
(i,j) instead of a(j,i) in the example you show below); Outer-loop
vectorization would be better cause the reduction would be done more
accurately and more efficiently than with inner-loop vectorization, and it
could be done for any inner-loop stride (not only 1).

[off topic comment:  At least a couple tweaks that are currently required
to allow outer-loop vectorization for examples like in the above mentioned
PRs:
1) somehow don't let gcc create guard code around the innermost loop to
check that it executes more than zero iterations. This creates a
complicated control flow structure within the outer-loop. For now you have
to have  constant number of iterations for the inner-loop because of that,
or insert a statement like "if (LB<=0) return;" before the loop...
2) use -fno-tree-sink cause otherwise it moves the loop iv increment to the
latch block and the vectorizer likes to have the latch block empty...
I really hope we could do something about these... maybe use a COND
expression for the LB to solve the first problem, and clean up non-empty
latch blocks as part of vectorization preprocessing for the second?
]

I'm sorry for digressing... I know this is a bit off topic, but besides
this patch adding a very good feature to have as is, it's also a beginning
of a direction that can really help outer-loop vectorization in really
common cases - so this is very good news!

> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
>
> we can only apply vectorization if the strides of the fastest dimension
> of self, a and b are one (they are loaded from the passed array
> descriptors and thus appear as (loop invariant) variables).
>
> During the implementation of this I noticed that peeling for
> number of iterations (we have to unroll the above loop twice, and so
> for an odd number of iterations have a epilogue loop for the remaining
> iteration(s)) does not play well with versioning and we end up
> vectorizing the wrong loop.  So I just disabled versioning if we
> apply peeling with an epilogue loop

yes, this is an old limitation since the introduction of loop versioning to
the vectorizer - it's either peeling or versioning... (not for any good
reason):)

>  and instead attach the versioning
> condition to the pre-condition of the main loop that skips directly
> to the epilogue if the number of iterations is too small.  We obviously
> can use the epilogue loop as the non-vectorized version.
>

maybe we should consider doing that also when versioning for
aliasing/alignment...

> This patch also inserts an extra copyprop and dce pass before the
> vectorizer so it can recognize the reduction in the above testcase
> (LIM has made that reduction non-obvious).

cool. This was also repeatedly reported problem... I wonder if it doesn't
solve a few missed-optimization PRs on the way...

> So I noticed that
> copyprop does not preserve loop-closed SSA form and fixed that as well.
>
> Some earlier version bootstrapped and tested ok on
> x86_64-unknown-linux-gnu, a final attempt is still running.
>
> I didn't yet performance test this extensively, but it might need
> cost-model adjustments and/or need to wait until we have profile
> feedback to properly seed vectorizer analysis here.  A micro-benchmark
> based on the above loop shows around 15% improvement on AMD K10.
>
> Feedback (and ppc testing) is still welcome of course.
>

I didn't review the code very closely, but one tiny thing that would be
nice to add is a printout to the dump file reporting that versioining for
stride is done.

about:

> +   /* ???  Delay this change until after versioning or
> +      preserve the original step somewhere.  */

When we take this forward to support outer-loop vectorization with unknown
inner-loop strides, we'd indeed want to
delay this change.

thanks!

dorit


> Thanks,
> Richard.
>
> 2009-01-23  Richard Guenther  <rguenther@suse.de>
>
>    * passes.c (init_optimization_passes): Add copy-prop and dce
>    before vectorization.
>    * Makefile.in (tree-ssa-copy.o): Add $(CFGLOOP_H) dependency.
>    * tree-ssa-copy.c (init_copy_prop): Do not propagate through
>    single-argument PHIs if we are in loop-closed SSA form.
>    * tree-data-ref.c (dr_analyze_innermost): Allow affine offsets.
>    * tree-vect-analyze.c (vect_check_interleaving): Check that
>    DR_STEP is constant.
>    (vect_enhance_data_refs_alignment): If versioning for strides
>    is required do not peel.
>    (vect_analyze_data_ref_access): Allow non-constant step of
>    a specific form, remember them for versioning.
>    * params.def (vect-max-version-for-stride-checks): New param.
>    (vect-version-for-stride-value): Likewise.
>    * tree-vectorizer.c (slpeel_add_loop_guard): Pass extra guards
>    for the pre-condition.
>    (slpeel_tree_peel_loop_to_edge): Likewise.
>    (new_loop_vec_info): Allocate stride versioning data.
>    (destroy_loop_vec_info): Free stride versioning data.
>    * tree-vectorizer.h (struct _loop_vec_info): Add variable_strides
>    field.
>    (LOOP_VINFO_VARIABLE_STRIDES): Define.
>    (slpeel_tree_peel_loop_to_edge): Adjust declaration.
>    * tree-vect-transform.c (vect_build_loop_niters): Take an
>    optional sequence to append stmts.
>    (vect_generate_tmps_on_preheader): Likewise.
>    (vect_do_peeling_for_loop_bound): Take extra guards for the
>    pre-condition.
>    (vect_do_peeling_for_alignment): Adjust.
>    (vect_create_cond_for_stride_checks): New function.
>    (vect_loop_versioning): Take stmt and stmt list to put pre-condition
>    guards if we are going to peel.  Do not apply versioning in that
>    case.
>    (vect_transform_loop): If we are peeling for loop bound only
>    record extra pre-conditions, do not apply loop versioning.
>
>    * gcc.dg/vect/fast-math-vect-complex-5.c: New testcase.
>    * gfortran.dg/vect/fast-math-vect-complex-1.f90: Likewise.
>    * gfortran.dg/vect/fast-math-vect-stride-1.f90: Likewise.
>
> Index: trunk/gcc/passes.c
> ===================================================================
> *** trunk.orig/gcc/passes.c   2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/passes.c   2009-01-23 16:48:50.000000000 +0100
> *************** init_optimization_passes (void)
> *** 659,664 ****
> --- 659,666 ----
>        NEXT_PASS (pass_graphite_transforms);
>        NEXT_PASS (pass_iv_canon);
>        NEXT_PASS (pass_if_conversion);
> +      NEXT_PASS (pass_copy_prop);
> +      NEXT_PASS (pass_dce_loop);
>        NEXT_PASS (pass_vectorize);
>          {
>            struct opt_pass **p = &pass_vectorize.pass.sub;
> Index: trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c
> ===================================================================
> *** /dev/null   1970-01-01 00:00:00.000000000 +0000
> --- trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c
> 2009-01-23 16:48:50.000000000 +0100
> ***************
> *** 0 ****
> --- 1,18 ----
> + /* { dg-do compile } */
> + /* { dg-require-effective-target vect_double } */
> +
> + #define NUM 64
> + _Complex double ad[NUM], bd[NUM], cd[NUM];
> +
> + void testd(void)
> + {
> +   int i;
> +   int j;
> +
> +   for (i = 0; i < NUM; i++)
> +     for (j = 0; j < NUM; j++)
> +       cd[i] = cd[i] + ad[j] * bd[j];
> + }
> +
> + /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
> + /* { dg-final { cleanup-tree-dump "vect" } } */
> Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90
> ===================================================================
> *** /dev/null   1970-01-01 00:00:00.000000000 +0000
> --- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.
> f90   2009-01-23 16:48:50.000000000 +0100
> ***************
> *** 0 ****
> --- 1,16 ----
> + ! { dg-do compile }
> +
> + subroutine to_product_of(self,a,b,a1,a2)
> +   complex(kind=8) :: self (:)
> +   complex(kind=8), intent(in) :: a(:,:)
> +   complex(kind=8), intent(in) :: b(:)
> +   integer a1,a2
> +   do i = 1,a1
> +     do j = 1,a2
> +       self(i) = self(i) + a(i,j)*b(j)
> +     end do
> +   end do
> + end subroutine
> +
> + ! { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } }
> + ! { dg-final { cleanup-tree-dump "vect" } }
> Index: trunk/gcc/tree-data-ref.c
> ===================================================================
> *** trunk.orig/gcc/tree-data-ref.c   2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-data-ref.c   2009-01-23 16:48:50.000000000 +0100
> *************** dr_analyze_innermost (struct data_refere
> *** 708,714 ****
>         offset_iv.base = ssize_int (0);
>         offset_iv.step = ssize_int (0);
>       }
> !   else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
>       {
>         if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file, "failed: evolution of offset is not affine.\n");
> --- 708,714 ----
>         offset_iv.base = ssize_int (0);
>         offset_iv.step = ssize_int (0);
>       }
> !   else if (!simple_iv (loop, stmt, poffset, &offset_iv, true))
>       {
>         if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file, "failed: evolution of offset is not affine.\n");
> Index: trunk/gcc/tree-vect-analyze.c
> ===================================================================
> *** trunk.orig/gcc/tree-vect-analyze.c   2009-01-23 16:47:53.000000000
+0100
> --- trunk/gcc/tree-vect-analyze.c   2009-01-23 16:48:50.000000000 +0100
> *************** vect_check_interleaving (struct data_ref
> *** 1109,1114 ****
> --- 1109,1116 ----
>     type_size_b = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (TREE_TYPE
> (DR_REF (drb))));
>
>     if (type_size_a != type_size_b
> +       || TREE_CODE (DR_STEP (dra)) != INTEGER_CST
> +       || TREE_CODE (DR_STEP (drb)) != INTEGER_CST
>         || tree_int_cst_compare (DR_STEP (dra), DR_STEP (drb))
>         || !types_compatible_p (TREE_TYPE (DR_REF (dra)),
>                                 TREE_TYPE (DR_REF (drb))))
> *************** vect_enhance_data_refs_alignment (loop_v
> *** 1825,1830 ****
> --- 1827,1833 ----
>     gimple stmt;
>     stmt_vec_info stmt_info;
>     int vect_versioning_for_alias_required;
> +   int vect_versioning_for_strides_required;
>
>     if (vect_print_dump_info (REPORT_DETAILS))
>       fprintf (vect_dump, "=== vect_enhance_data_refs_alignment ===");
> *************** vect_enhance_data_refs_alignment (loop_v
> *** 1892,1904 ****
>
>     vect_versioning_for_alias_required =
>       (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
>
>     /* Temporarily, if versioning for alias is required, we disable
peeling
>        until we support peeling and versioning.  Often peeling for
alignment
>        will require peeling for loop-bound, which in turn requires that
we
>        know how to adjust the loop ivs after the loop.  */
>     if (vect_versioning_for_alias_required
> !        || !vect_can_advance_ivs_p (loop_vinfo)
>         || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
>       do_peeling = false;
>
> --- 1895,1910 ----
>
>     vect_versioning_for_alias_required =
>       (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
> +   vect_versioning_for_strides_required =
> +     !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
>
>     /* Temporarily, if versioning for alias is required, we disable
peeling
>        until we support peeling and versioning.  Often peeling for
alignment
>        will require peeling for loop-bound, which in turn requires that
we
>        know how to adjust the loop ivs after the loop.  */
>     if (vect_versioning_for_alias_required
> !       || vect_versioning_for_strides_required
> !       || !vect_can_advance_ivs_p (loop_vinfo)
>         || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
>       do_peeling = false;
>
> *************** vect_analyze_data_ref_access (struct dat
> *** 2349,2357 ****
>     stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>     loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> !   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
>
> !   if (!step)
>       {
>         if (vect_print_dump_info (REPORT_DETAILS))
>      fprintf (vect_dump, "bad data-ref access");
> --- 2355,2364 ----
>     stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>     loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> !   HOST_WIDE_INT dr_step;
>
> !   if (!step
> !       || TREE_CODE (step) != INTEGER_CST)
>       {
>         if (vect_print_dump_info (REPORT_DETAILS))
>      fprintf (vect_dump, "bad data-ref access");
> *************** vect_analyze_data_ref_access (struct dat
> *** 2359,2364 ****
> --- 2366,2372 ----
>       }
>
>     /* Don't allow invariant accesses.  */
> +   dr_step = TREE_INT_CST_LOW (step);
>     if (dr_step == 0)
>       return false;
>
> *************** vect_analyze_data_refs (loop_vec_info lo
> *** 3563,3568 ****
> --- 3571,3620 ----
>             return false;
>           }
>
> +       /* If the non-constant (but loop invariant) step is of the
> +     form NAME or NAME * CST where CST is the element size mark
> +     this ddr for versioning for strides and re-set DR_STEP
> +     to the value we will version for.  Otherwise reject
> +     non-constant steps.  */
> +       if (TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
> +    {
> +      tree step = DR_STEP (dr);
> +
> +      STRIP_NOPS (step);
> +      if (flag_tree_vect_loop_version
> +          && (TREE_CODE (step) == SSA_NAME
> +         || (TREE_CODE (step) == MULT_EXPR
> +             && TREE_CODE (TREE_OPERAND (step, 1)) == INTEGER_CST)))
> +        {
> +          tree stride;
> +          tree newstep;
> +
> +          stride = step;
> +          if (TREE_CODE (step) == MULT_EXPR)
> +       stride = TREE_OPERAND (step, 0);
> +          STRIP_NOPS (stride);
> +          if (TREE_CODE (stride) != SSA_NAME)
> +       return false;
> +
> +          bitmap_set_bit (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo),
> +                SSA_NAME_VERSION (stride));
> +          if (bitmap_count_bits (LOOP_VINFO_VARIABLE_STRIDES
(loop_vinfo))
> +         > (unsigned)PARAM_VALUE
(PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS))
> +       return false;
> +
> +          /* ???  Delay this change until after versioning or
> +             preserve the original step somewhere.  */
> +          newstep = build_int_cst (TREE_TYPE (step),
> +              PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE));
> +          if (TREE_CODE (step) == MULT_EXPR)
> +       newstep = int_const_binop (MULT_EXPR, newstep,
> +                   TREE_OPERAND (step, 1), false);
> +          DR_STEP (dr) = newstep;
> +        }
> +      else
> +        return false;
> +    }
> +
>         if (!DR_SYMBOL_TAG (dr))
>           {
>             if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
> Index: trunk/gcc/params.def
> ===================================================================
> *** trunk.orig/gcc/params.def   2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/params.def   2009-01-23 16:48:50.000000000 +0100
> *************** DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_ALIA
> *** 506,511 ****
> --- 506,521 ----
>            "Bound on number of runtime checks inserted by the
> vectorizer's loop versioning for alias check",
>            10, 0, 0)
>
> + DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS,
> +          "vect-max-version-for-stride-checks",
> +          "Bound on number of runtime checks inserted by the
> vectorizer's loop versioning for stride check",
> +          4, 0, 0)
> +
> + DEFPARAM(PARAM_VECT_VERSION_FOR_STRIDE_VALUE,
> +          "vect-version-for-stride-value",
> +          "The constant stride in elements the vectorizer uses for
> loop versioning",
> +          1, 0, 0)
> +
>   DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS,
>       "max-cselib-memory-locations",
>       "The maximum memory locations recorded by cselib",
> Index: trunk/gcc/tree-vectorizer.c
> ===================================================================
> *** trunk.orig/gcc/tree-vectorizer.c   2009-01-23 16:47:53.000000000
+0100
> --- trunk/gcc/tree-vectorizer.c   2009-01-23 16:48:50.000000000 +0100
> *************** slpeel_tree_duplicate_loop_to_edge_cfg (
> *** 927,934 ****
>      Returns the skip edge.  */
>
>   static edge
> ! slpeel_add_loop_guard (basic_block guard_bb, tree cond, basic_block
exit_bb,
> !              basic_block dom_bb)
>   {
>     gimple_stmt_iterator gsi;
>     edge new_e, enter_e;
> --- 927,935 ----
>      Returns the skip edge.  */
>
>   static edge
> ! slpeel_add_loop_guard (basic_block guard_bb, tree cond,
> !              gimple_seq cond_expr_stmt_list,
> !              basic_block exit_bb, basic_block dom_bb)
>   {
>     gimple_stmt_iterator gsi;
>     edge new_e, enter_e;
> *************** slpeel_add_loop_guard (basic_block guard
> *** 941,951 ****
>     gsi = gsi_last_bb (guard_bb);
>
>     cond = force_gimple_operand (cond, &gimplify_stmt_list, true,
NULL_TREE);
>     cond_stmt = gimple_build_cond (NE_EXPR,
>                cond, build_int_cst (TREE_TYPE (cond), 0),
>                NULL_TREE, NULL_TREE);
> !   if (gimplify_stmt_list)
> !     gsi_insert_seq_after (&gsi, gimplify_stmt_list, GSI_NEW_STMT);
>
>     gsi = gsi_last_bb (guard_bb);
>     gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
> --- 942,954 ----
>     gsi = gsi_last_bb (guard_bb);
>
>     cond = force_gimple_operand (cond, &gimplify_stmt_list, true,
NULL_TREE);
> +   if (gimplify_stmt_list)
> +     gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
>     cond_stmt = gimple_build_cond (NE_EXPR,
>                cond, build_int_cst (TREE_TYPE (cond), 0),
>                NULL_TREE, NULL_TREE);
> !   if (cond_expr_stmt_list)
> !     gsi_insert_seq_after (&gsi, cond_expr_stmt_list, GSI_NEW_STMT);
>
>     gsi = gsi_last_bb (guard_bb);
>     gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
> *************** struct loop*
> *** 1151,1157 ****
>   slpeel_tree_peel_loop_to_edge (struct loop *loop,
>                   edge e, tree first_niters,
>                   tree niters, bool update_first_loop_count,
> !                 unsigned int th, bool check_profitability)
>   {
>     struct loop *new_loop = NULL, *first_loop, *second_loop;
>     edge skip_e;
> --- 1154,1161 ----
>   slpeel_tree_peel_loop_to_edge (struct loop *loop,
>                   edge e, tree first_niters,
>                   tree niters, bool update_first_loop_count,
> !                 unsigned int th, bool check_profitability,
> !                 tree cond_expr, gimple_seq cond_expr_stmt_list)
>   {
>     struct loop *new_loop = NULL, *first_loop, *second_loop;
>     edge skip_e;
> *************** slpeel_tree_peel_loop_to_edge (struct lo
> *** 1325,1330 ****
> --- 1329,1342 ----
>        pre_condition = fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
>                      cost_pre_condition, pre_condition);
>      }
> +       if (cond_expr)
> +    {
> +      pre_condition =
> +        fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
> +           pre_condition,
> +           fold_build1 (TRUTH_NOT_EXPR, boolean_type_node,
> +                   cond_expr));
> +    }
>       }
>
>     /* Prologue peeling.  */
> *************** slpeel_tree_peel_loop_to_edge (struct lo
> *** 1340,1345 ****
> --- 1352,1358 ----
>       }
>
>     skip_e = slpeel_add_loop_guard (bb_before_first_loop, pre_condition,
> +               cond_expr_stmt_list,
>                                     bb_before_second_loop,
> bb_before_first_loop);
>     slpeel_update_phi_nodes_for_guard1 (skip_e, first_loop,
>                     first_loop == new_loop,
> *************** slpeel_tree_peel_loop_to_edge (struct lo
> *** 1377,1383 ****
>
>     pre_condition =
>      fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
> !   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition,
>                                     bb_after_second_loop,
> bb_before_first_loop);
>     slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>                                        second_loop == new_loop,
&new_exit_bb);
> --- 1390,1396 ----
>
>     pre_condition =
>      fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
> !   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition,
NULL,
>                                     bb_after_second_loop,
> bb_before_first_loop);
>     slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>                                        second_loop == new_loop,
&new_exit_bb);
> *************** new_loop_vec_info (struct loop *loop)
> *** 1714,1719 ****
> --- 1727,1733 ----
>     LOOP_VINFO_MAY_ALIAS_DDRS (res) =
>       VEC_alloc (ddr_p, heap,
>             PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_ALIAS_CHECKS));
> +   LOOP_VINFO_VARIABLE_STRIDES (res) = BITMAP_ALLOC (NULL);
>     LOOP_VINFO_STRIDED_STORES (res) = VEC_alloc (gimple, heap, 10);
>     LOOP_VINFO_SLP_INSTANCES (res) = VEC_alloc (slp_instance, heap, 10);
>     LOOP_VINFO_SLP_UNROLLING_FACTOR (res) = 1;
> *************** destroy_loop_vec_info (loop_vec_info loo
> *** 1800,1805 ****
> --- 1814,1820 ----
>     free_dependence_relations (LOOP_VINFO_DDRS (loop_vinfo));
>     VEC_free (gimple, heap, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo));
>     VEC_free (ddr_p, heap, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo));
> +   BITMAP_FREE (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
>     slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
>     for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j
++)
>       vect_free_slp_instance (instance);
> Index: trunk/gcc/tree-vectorizer.h
> ===================================================================
> *** trunk.orig/gcc/tree-vectorizer.h   2009-01-23 16:47:53.000000000
+0100
> --- trunk/gcc/tree-vectorizer.h   2009-01-23 16:48:50.000000000 +0100
> *************** typedef struct _loop_vec_info {
> *** 210,215 ****
> --- 210,219 ----
>     /* All data dependences in the loop.  */
>     VEC (ddr_p, heap) *ddrs;
>
> +   /* SSA_NAMEs representing variable strides in data references.
> +      Candidates for a run-time stride check.  */
> +   bitmap variable_strides;
> +
>     /* Data Dependence Relations defining address ranges that are
candidates
>        for a run-time aliasing check.  */
>     VEC (ddr_p, heap) *may_alias_ddrs;
> *************** typedef struct _loop_vec_info {
> *** 254,259 ****
> --- 258,264 ----
>   #define LOOP_VINFO_LOC(L)             (L)->loop_line_number
>   #define LOOP_VINFO_MAY_ALIAS_DDRS(L)  (L)->may_alias_ddrs
>   #define LOOP_VINFO_STRIDED_STORES(L)  (L)->strided_stores
> + #define LOOP_VINFO_VARIABLE_STRIDES(L) (L)->variable_strides
>   #define LOOP_VINFO_SLP_INSTANCES(L)   (L)->slp_instances
>   #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
>
> *************** extern bitmap vect_memsyms_to_rename;
> *** 707,713 ****
>      divide by the vectorization factor, and to peel the first few
iterations
>      to force the alignment of data references in the loop.  */
>   extern struct loop *slpeel_tree_peel_loop_to_edge
> !   (struct loop *, edge, tree, tree, bool, unsigned int, bool);
>   extern void set_prologue_iterations (basic_block, tree,
>                    struct loop *, unsigned int);
>   struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
> --- 712,718 ----
>      divide by the vectorization factor, and to peel the first few
iterations
>      to force the alignment of data references in the loop.  */
>   extern struct loop *slpeel_tree_peel_loop_to_edge
> !   (struct loop *, edge, tree, tree, bool, unsigned int, bool,
> tree, gimple_seq);
>   extern void set_prologue_iterations (basic_block, tree,
>                    struct loop *, unsigned int);
>   struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
> Index: trunk/gcc/tree-vect-transform.c
> ===================================================================
> *** trunk.orig/gcc/tree-vect-transform.c   2009-01-23 16:47:53.000000000
+0100
> --- trunk/gcc/tree-vect-transform.c   2009-01-23 16:48:50.000000000 +0100
> *************** static tree get_initial_def_for_reductio
> *** 65,72 ****
>
>   /* Utility function dealing with loop peeling (not peeling itself).  */
>   static void vect_generate_tmps_on_preheader
> !   (loop_vec_info, tree *, tree *, tree *);
> ! static tree vect_build_loop_niters (loop_vec_info);
>   static void vect_update_ivs_after_vectorizer (loop_vec_info, tree,
edge);
>   static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
>   static void vect_update_init_of_dr (struct data_reference *, tree
niters);
> --- 65,72 ----
>
>   /* Utility function dealing with loop peeling (not peeling itself).  */
>   static void vect_generate_tmps_on_preheader
> !   (loop_vec_info, tree *, tree *, tree *, gimple_seq);
> ! static tree vect_build_loop_niters (loop_vec_info, gimple_seq);
>   static void vect_update_ivs_after_vectorizer (loop_vec_info, tree,
edge);
>   static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
>   static void vect_update_init_of_dr (struct data_reference *, tree
niters);
> *************** vect_transform_stmt (gimple stmt, gimple
> *** 7199,7205 ****
>      on the loop preheader.  */
>
>   static tree
> ! vect_build_loop_niters (loop_vec_info loop_vinfo)
>   {
>     tree ni_name, var;
>     gimple_seq stmts = NULL;
> --- 7199,7205 ----
>      on the loop preheader.  */
>
>   static tree
> ! vect_build_loop_niters (loop_vec_info loop_vinfo, gimple_seq seq)
>   {
>     tree ni_name, var;
>     gimple_seq stmts = NULL;
> *************** vect_build_loop_niters (loop_vec_info lo
> *** 7214,7221 ****
>     pe = loop_preheader_edge (loop);
>     if (stmts)
>       {
> !       basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe,
stmts);
> !       gcc_assert (!new_bb);
>       }
>
>     return ni_name;
> --- 7214,7226 ----
>     pe = loop_preheader_edge (loop);
>     if (stmts)
>       {
> !       if (seq)
> !    gimple_seq_add_seq (&seq, stmts);
> !       else
> !    {
> !      basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !      gcc_assert (!new_bb);
> !    }
>       }
>
>     return ni_name;
> *************** static void
> *** 7234,7240 ****
>   vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>                tree *ni_name_ptr,
>                tree *ratio_mult_vf_name_ptr,
> !              tree *ratio_name_ptr)
>   {
>
>     edge pe;
> --- 7239,7246 ----
>   vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo,
>                tree *ni_name_ptr,
>                tree *ratio_mult_vf_name_ptr,
> !              tree *ratio_name_ptr,
> !              gimple_seq cond_expr_stmt_list)
>   {
>
>     edge pe;
> *************** vect_generate_tmps_on_preheader (loop_ve
> *** 7254,7260 ****

>     /* Generate temporary variable that contains
>        number of iterations loop executes.  */
>
> !   ni_name = vect_build_loop_niters (loop_vinfo);
>     log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
>
>     /* Create: ratio = ni >> log2(vf) */
> --- 7260,7266 ----
>     /* Generate temporary variable that contains
>        number of iterations loop executes.  */
>
> !   ni_name = vect_build_loop_niters (loop_vinfo, cond_expr_stmt_list);
>     log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
>
>     /* Create: ratio = ni >> log2(vf) */
> *************** vect_generate_tmps_on_preheader (loop_ve
> *** 7267,7275 ****
>
>         stmts = NULL;
>         ratio_name = force_gimple_operand (ratio_name, &stmts, true,
var);
> !       pe = loop_preheader_edge (loop);
> !       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !       gcc_assert (!new_bb);
>       }
>
>     /* Create: ratio_mult_vf = ratio << log2 (vf).  */
> --- 7273,7286 ----
>
>         stmts = NULL;
>         ratio_name = force_gimple_operand (ratio_name, &stmts, true,
var);
> !       if (cond_expr_stmt_list)
> !    gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
> !       else
> !    {
> !      pe = loop_preheader_edge (loop);
> !      new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !      gcc_assert (!new_bb);
> !    }
>       }
>
>     /* Create: ratio_mult_vf = ratio << log2 (vf).  */
> *************** vect_generate_tmps_on_preheader (loop_ve
> *** 7284,7292 ****
>         stmts = NULL;
>         ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name,
&stmts,
>                      true, var);
> !       pe = loop_preheader_edge (loop);
> !       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !       gcc_assert (!new_bb);
>       }
>
>     *ni_name_ptr = ni_name;
> --- 7295,7308 ----
>         stmts = NULL;
>         ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name,
&stmts,
>                      true, var);
> !       if (cond_expr_stmt_list)
> !    gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
> !       else
> !    {
> !      pe = loop_preheader_edge (loop);
> !      new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !      gcc_assert (!new_bb);
> !    }
>       }
>
>     *ni_name_ptr = ni_name;
> *************** conservative_cost_threshold (loop_vec_in
> *** 7470,7476 ****
>      NITERS / VECTORIZATION_FACTOR times (this value is placed into
> RATIO).  */
>
>   static void
> ! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio)
>   {
>     tree ni_name, ratio_mult_vf_name;
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> --- 7486,7493 ----
>      NITERS / VECTORIZATION_FACTOR times (this value is placed into
> RATIO).  */
>
>   static void
> ! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> !             tree cond_expr, gimple_seq cond_expr_stmt_list)
>   {
>     tree ni_name, ratio_mult_vf_name;
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> *************** vect_do_peeling_for_loop_bound (loop_vec
> *** 7493,7499 ****
>        ratio = ni_name / vf
>        ratio_mult_vf_name = ratio * vf  */
>     vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
> !                &ratio_mult_vf_name, ratio);
>
>     loop_num  = loop->num;
>
> --- 7510,7517 ----
>        ratio = ni_name / vf
>        ratio_mult_vf_name = ratio * vf  */
>     vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
> !                &ratio_mult_vf_name, ratio,
> !                cond_expr_stmt_list);
>
>     loop_num  = loop->num;
>
> *************** vect_do_peeling_for_loop_bound (loop_vec
> *** 7501,7507 ****
>        peeling for alignment.  */
>     if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
>         && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> !       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
>       {
>         check_profitability = true;
>
> --- 7519,7526 ----
>        peeling for alignment.  */
>     if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
>         && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> !       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo)
> !       && !cond_expr)
>       {
>         check_profitability = true;
>
> *************** vect_do_peeling_for_loop_bound (loop_vec
> *** 7514,7520 ****
>
>     new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
>                                               ratio_mult_vf_name,
> ni_name, false,
> !                                             th, check_profitability);
>     gcc_assert (new_loop);
>     gcc_assert (loop_num == loop->num);
>   #ifdef ENABLE_CHECKING
> --- 7533,7540 ----
>
>     new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
>                                               ratio_mult_vf_name,
> ni_name, false,
> !                                             th, check_profitability,
> !                    cond_expr, cond_expr_stmt_list);
>     gcc_assert (new_loop);
>     gcc_assert (loop_num == loop->num);
>   #ifdef ENABLE_CHECKING
> *************** vect_do_peeling_for_alignment (loop_vec_
> *** 7738,7744 ****
>
>     initialize_original_copy_tables ();
>
> !   ni_name = vect_build_loop_niters (loop_vinfo);
>     niters_of_prolog_loop = vect_gen_niters_for_prolog_loop
> (loop_vinfo, ni_name);
>
>
> --- 7758,7764 ----
>
>     initialize_original_copy_tables ();
>
> !   ni_name = vect_build_loop_niters (loop_vinfo, NULL);
>     niters_of_prolog_loop = vect_gen_niters_for_prolog_loop
> (loop_vinfo, ni_name);
>
>
> *************** vect_do_peeling_for_alignment (loop_vec_
> *** 7759,7765 ****
>     new_loop =
>       slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
>                  niters_of_prolog_loop, ni_name, true,
> !                th, check_profitability);
>
>     gcc_assert (new_loop);
>   #ifdef ENABLE_CHECKING
> --- 7779,7785 ----
>     new_loop =
>       slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
>                  niters_of_prolog_loop, ni_name, true,
> !                th, check_profitability, NULL_TREE, NULL);
>
>     gcc_assert (new_loop);
>   #ifdef ENABLE_CHECKING
> *************** vect_create_cond_for_align_checks (loop_
> *** 7909,7914 ****
> --- 7929,7981 ----
>       *cond_expr = part_cond_expr;
>   }
>
> + /* Function vect_create_cond_for_stride_checks.
> +
> +    Create a conditional expression that represents the stride checks
for
> +    all of the stride SSA_NAMEs used in data references (array element
> +    references) whose stride must be checked at runtime.
> +
> +    Input:
> +    COND_EXPR  - input conditional expression.  New conditions willbe
chained
> +                 with logical AND operation.
> +    LOOP_VINFO - on field of the loop information is used.
> +                 LOOP_VINFO_VARIABLE_STRIDES is a bitmap of SSA_NAMEs to
be
> +       checked.
> +
> +    Output:
> +    COND_EXPR_STMT_LIST - statements needed to construct the conditional
> +                          expression.
> +    The returned value is the conditional expression to be used in the
if
> +    statement that controls which version of the loop gets executed
> at runtime.
> +
> +    The stride we do versioning for is currently specified by a
compile-time
> +    param.  In future the stride should be chosen by information from
> +    profile-feedback.  */
> +
> + static void
> + vect_create_cond_for_stride_checks (loop_vec_info loop_vinfo,
> +                 tree *cond_expr)
> + {
> +   bitmap_iterator bi;
> +   unsigned int i;
> +   HOST_WIDE_INT stride;
> +
> +   stride = PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE);
> +
> +   EXECUTE_IF_SET_IN_BITMAP (LOOP_VINFO_VARIABLE_STRIDES
> (loop_vinfo), 0, i, bi)
> +     {
> +       tree name = ssa_name (i);
> +       tree cond = fold_build2 (EQ_EXPR, boolean_type_node,
> +                 name,
> +                 build_int_cst (TREE_TYPE (name), stride));
> +       if (*cond_expr)
> +    *cond_expr = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
> +               *cond_expr, cond);
> +       else
> +    *cond_expr = cond;
> +     }
> + }
> +
>   /* Function vect_vfa_segment_size.
>
>      Create an expression that computes the size of segment
> *************** vect_create_cond_for_alias_checks (loop_
> *** 8076,8087 ****
>      cost model initially.  */
>
>   static void
> ! vect_loop_versioning (loop_vec_info loop_vinfo)
>   {
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>     struct loop *nloop;
> -   tree cond_expr = NULL_TREE;
> -   gimple_seq cond_expr_stmt_list = NULL;
>     basic_block condition_bb;
>     gimple_stmt_iterator gsi, cond_exp_gsi;
>     basic_block merge_bb;
> --- 8143,8153 ----
>      cost model initially.  */
>
>   static void
> ! vect_loop_versioning (loop_vec_info loop_vinfo, bool do_versioning,
> !             tree *cond_expr, gimple_seq *cond_expr_stmt_list)
>   {
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>     struct loop *nloop;
>     basic_block condition_bb;
>     gimple_stmt_iterator gsi, cond_exp_gsi;
>     basic_block merge_bb;
> *************** vect_loop_versioning (loop_vec_info loop
> *** 8101,8129 ****
>     th = conservative_cost_threshold (loop_vinfo,
>                   min_profitable_iters);
>
> !   cond_expr =
> !     build2 (GT_EXPR, boolean_type_node, scalar_loop_iters,
> !        build_int_cst (TREE_TYPE (scalar_loop_iters), th));
>
> !   cond_expr = force_gimple_operand (cond_expr, &cond_expr_stmt_list,
> !                 false, NULL_TREE);
>
>     if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
> !       vect_create_cond_for_align_checks (loop_vinfo, &cond_expr,
> !                 &cond_expr_stmt_list);
>
>     if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> !     vect_create_cond_for_alias_checks (loop_vinfo, &cond_expr,
> !                    &cond_expr_stmt_list);
>
> !   cond_expr =
> !     fold_build2 (NE_EXPR, boolean_type_node, cond_expr,
integer_zero_node);
> !   cond_expr =
> !     force_gimple_operand (cond_expr, &gimplify_stmt_list, true,
NULL_TREE);
> !   gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
>
>     initialize_original_copy_tables ();
> !   nloop = loop_version (loop, cond_expr, &condition_bb,
>            prob, prob, REG_BR_PROB_BASE - prob, true);
>     free_original_copy_tables();
>
> --- 8167,8200 ----
>     th = conservative_cost_threshold (loop_vinfo,
>                   min_profitable_iters);
>
> !   *cond_expr =
> !     fold_build2 (GT_EXPR, boolean_type_node, scalar_loop_iters,
> !        build_int_cst (TREE_TYPE (scalar_loop_iters), th));
>
> !   *cond_expr = force_gimple_operand (*cond_expr, cond_expr_stmt_list,
> !                  false, NULL_TREE);
>
>     if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
> !       vect_create_cond_for_align_checks (loop_vinfo, cond_expr,
> !                 cond_expr_stmt_list);
>
>     if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> !     vect_create_cond_for_alias_checks (loop_vinfo, cond_expr,
> !                    cond_expr_stmt_list);
>
> !   if (!bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
> !     vect_create_cond_for_stride_checks (loop_vinfo, cond_expr);
> !
> !   *cond_expr =
> !     fold_build2 (NE_EXPR, boolean_type_node, *cond_expr,
integer_zero_node);
> !   *cond_expr =
> !     force_gimple_operand (*cond_expr, &gimplify_stmt_list, true,
NULL_TREE);
> !   gimple_seq_add_seq (cond_expr_stmt_list, gimplify_stmt_list);
> !   if (!do_versioning)
> !     return;
>
>     initialize_original_copy_tables ();
> !   nloop = loop_version (loop, *cond_expr, &condition_bb,
>            prob, prob, REG_BR_PROB_BASE - prob, true);
>     free_original_copy_tables();
>
> *************** vect_loop_versioning (loop_vec_info loop
> *** 8154,8164 ****
>     /* End loop-exit-fixes after versioning.  */
>
>     update_ssa (TODO_update_ssa);
> !   if (cond_expr_stmt_list)
>       {
>         cond_exp_gsi = gsi_last_bb (condition_bb);
> !       gsi_insert_seq_before (&cond_exp_gsi, cond_expr_stmt_list,
> GSI_SAME_STMT);
>       }
>   }
>
>   /* Remove a group of stores (for SLP or interleaving), free their
> --- 8225,8238 ----
>     /* End loop-exit-fixes after versioning.  */
>
>     update_ssa (TODO_update_ssa);
> !   if (*cond_expr_stmt_list)
>       {
>         cond_exp_gsi = gsi_last_bb (condition_bb);
> !       gsi_insert_seq_before (&cond_exp_gsi, *cond_expr_stmt_list,
> !               GSI_SAME_STMT);
> !       *cond_expr_stmt_list = NULL;
>       }
> +   *cond_expr = NULL_TREE;
>   }
>
>   /* Remove a group of stores (for SLP or interleaving), free their
> *************** vect_transform_loop (loop_vec_info loop_
> *** 8320,8342 ****
>     bool strided_store;
>     bool slp_scheduled = false;
>     unsigned int nunits;
>
>     if (vect_print_dump_info (REPORT_DETAILS))
>       fprintf (vect_dump, "=== vec_transform_loop ===");
>
> -   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> -       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> -     vect_loop_versioning (loop_vinfo);
> -
> -   /* CHECKME: we wouldn't need this if we called update_ssa once
> -      for all loops.  */
> -   bitmap_zero (vect_memsyms_to_rename);
> -
>     /* Peel the loop if there are data refs with unknown alignment.
>        Only one data ref with unknown store is allowed.  */
>
>     if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
>       vect_do_peeling_for_alignment (loop_vinfo);
>
>     /* If the loop has a symbolic number of iterations 'n' (i.e. it's not
a
>        compile time constant), or it is a constant that doesn't divide by
the
> --- 8394,8427 ----
>     bool strided_store;
>     bool slp_scheduled = false;
>     unsigned int nunits;
> +   tree cond_expr = NULL_TREE;
> +   gimple_seq cond_expr_stmt_list = NULL;
> +   bool do_peeling_for_loop_bound;
>
>     if (vect_print_dump_info (REPORT_DETAILS))
>       fprintf (vect_dump, "=== vec_transform_loop ===");
>
>     /* Peel the loop if there are data refs with unknown alignment.
>        Only one data ref with unknown store is allowed.  */
>
>     if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
>       vect_do_peeling_for_alignment (loop_vinfo);
> +
> +   do_peeling_for_loop_bound
> +     = (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +        || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +       && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor !=
0));
> +
> +   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> +       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> +       || !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
> +     vect_loop_versioning (loop_vinfo,
> +            !do_peeling_for_loop_bound,
> +            &cond_expr, &cond_expr_stmt_list);
> +
> +   /* CHECKME: we wouldn't need this if we called update_ssa once
> +      for all loops.  */
> +   bitmap_zero (vect_memsyms_to_rename);
>
>     /* If the loop has a symbolic number of iterations 'n' (i.e. it's not
a
>        compile time constant), or it is a constant that doesn't divide by
the
> *************** vect_transform_loop (loop_vec_info loop_
> *** 8346,8355 ****
>        will remain scalar and will compute the remaining (n%VF)
iterations.
>        (VF is the vectorization factor).  */
>
> !   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> !       || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> !           && LOOP_VINFO_INT_NITERS (loop_vinfo) %
> vectorization_factor != 0))
> !     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio);
>     else
>       ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
>         LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
> --- 8431,8439 ----
>        will remain scalar and will compute the remaining (n%VF)
iterations.
>        (VF is the vectorization factor).  */
>
> !   if (do_peeling_for_loop_bound)
> !     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio,
> !                 cond_expr, cond_expr_stmt_list);
>     else
>       ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
>         LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
> Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90
> ===================================================================
> *** /dev/null   1970-01-01 00:00:00.000000000 +0000
> --- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90
> 2009-01-23 16:48:50.000000000 +0100
> ***************
> *** 0 ****
> --- 1,17 ----
> + ! { dg-do compile }
> +
> + subroutine to_product_of(self,a,b)
> +   real(kind=8), dimension(:,:) :: self
> +   real(kind=8), dimension(:,:), intent(in) :: a, b
> +   integer(kind=kind(1)) :: dim1, dim2
> +   dim1 = size(self,1)
> +   dim2 = size(self,2)
> +   do i = 1,dim1
> +     do j = 1,dim2
> +       self(i,j) = sum(a(i,:)*b(:,j))
> +     end do
> +   end do
> + end subroutine
> +
> + ! { dg-final { scan-tree-dump "vectorized 1 loop" "vect" } }
> + ! { dg-final { cleanup-tree-dump "vect" } }
> Index: trunk/gcc/Makefile.in
> ===================================================================
> *** trunk.orig/gcc/Makefile.in   2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/Makefile.in   2009-01-23 16:48:50.000000000 +0100
> *************** tree-nrv.o : tree-nrv.c $(CONFIG_H) $(SY
> *** 2135,2141 ****
>   tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $
(SYSTEM_H) \
>      $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h
> $(DIAGNOSTIC_H) \
>      $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
> !    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $
(FLAGS_H)
>   tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H)
\
>      $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h
\
>      $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
> --- 2135,2142 ----
>   tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $
(SYSTEM_H) \
>      $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h
> $(DIAGNOSTIC_H) \
>      $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
> !    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h
> $(FLAGS_H) \
> !    $(CFGLOOP_H)
>   tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H)
\
>      $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h
\
>      $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
> Index: trunk/gcc/tree-ssa-copy.c
> ===================================================================
> *** trunk.orig/gcc/tree-ssa-copy.c   2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-ssa-copy.c   2009-01-23 16:48:50.000000000 +0100
> *************** along with GCC; see the file COPYING3.
> *** 37,42 ****
> --- 37,43 ----
>   #include "tree-pass.h"
>   #include "tree-ssa-propagate.h"
>   #include "langhooks.h"
> + #include "cfgloop.h"
>
>   /* This file implements the copy propagation pass and provides a
>      handful of interfaces for performing const/copy propagation and
> *************** init_copy_prop (void)
> *** 991,997 ****
>             tree def;
>
>        def = gimple_phi_result (phi);
> !      if (!is_gimple_reg (def))
>               prop_set_simulate_again (phi, false);
>        else
>               prop_set_simulate_again (phi, true);
> --- 992,1004 ----
>             tree def;
>
>        def = gimple_phi_result (phi);
> !      if (!is_gimple_reg (def)
> !          /* In loop-closed SSA form do not copy-propagate through
> !             PHI nodes.  Technically this is only needed for loop
> !        exit PHIs, but this is difficult to query.  */
> !          || (current_loops
> !         && gimple_phi_num_args (phi) == 1
> !         && loops_state_satisfies_p (LOOP_CLOSED_SSA)))
>               prop_set_simulate_again (phi, false);
>        else
>               prop_set_simulate_again (phi, true);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
  2009-01-27 12:42 ` Dorit Nuzman
@ 2009-01-27 12:55   ` Richard Guenther
  2009-01-27 13:01     ` Dorit Nuzman
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Guenther @ 2009-01-27 12:55 UTC (permalink / raw)
  To: Dorit Nuzman; +Cc: gcc-patches

On Tue, 27 Jan 2009, Dorit Nuzman wrote:

> >
> > This patch adds the capability to the vectorizer to perform versioning
> > for the case of a constant (suitable) stride.  For example for
> >
> 
> awesome!
> 
> Another very useful application of the ability to work with unknown strides
> is in the context of outer-loop vectorization: if you have an unknown
> stride in the inner-loop and stride one in the outer-loop, you can
> vectorize the outer-loop (in which case you don't care about the inner-loop
> stride as long as it is invariant). So for this purpose there's no need for
> versioning - just to relax some assumptions here and there in the
> vectorizer about the inner-loop stride having to be a known constant, which
> is what part of your patch already starts to do (I know this is not the
> purpose of your patch, but it provides initial infrastructure towards
> that). I think this is quite a common case, and it appears in several PRs -
> e.g. PR33113 (with a bit more detailed discussion) and PR37021 (which has a
> (i,j) instead of a(j,i) in the example you show below); Outer-loop
> vectorization would be better cause the reduction would be done more
> accurately and more efficiently than with inner-loop vectorization, and it
> could be done for any inner-loop stride (not only 1).
> 
> [off topic comment:  At least a couple tweaks that are currently required
> to allow outer-loop vectorization for examples like in the above mentioned
> PRs:
> 1) somehow don't let gcc create guard code around the innermost loop to
> check that it executes more than zero iterations. This creates a
> complicated control flow structure within the outer-loop. For now you have
> to have  constant number of iterations for the inner-loop because of that,
> or insert a statement like "if (LB<=0) return;" before the loop...
> 2) use -fno-tree-sink cause otherwise it moves the loop iv increment to the
> latch block and the vectorizer likes to have the latch block empty...
> I really hope we could do something about these... maybe use a COND
> expression for the LB to solve the first problem, and clean up non-empty
> latch blocks as part of vectorization preprocessing for the second?
> ]
> 
> I'm sorry for digressing... I know this is a bit off topic, but besides
> this patch adding a very good feature to have as is, it's also a beginning
> of a direction that can really help outer-loop vectorization in really
> common cases - so this is very good news!
> 
> > subroutine to_product_of(self,a,b,a1,a2)
> >   complex(kind=8) :: self (:)
> >   complex(kind=8), intent(in) :: a(:,:)
> >   complex(kind=8), intent(in) :: b(:)
> >   integer a1,a2
> >   do i = 1,a1
> >     do j = 1,a2
> >       self(i) = self(i) + a(j,i)*b(j)
> >     end do
> >   end do
> > end subroutine
> >
> > we can only apply vectorization if the strides of the fastest dimension
> > of self, a and b are one (they are loaded from the passed array
> > descriptors and thus appear as (loop invariant) variables).
> >
> > During the implementation of this I noticed that peeling for
> > number of iterations (we have to unroll the above loop twice, and so
> > for an odd number of iterations have a epilogue loop for the remaining
> > iteration(s)) does not play well with versioning and we end up
> > vectorizing the wrong loop.  So I just disabled versioning if we
> > apply peeling with an epilogue loop
> 
> yes, this is an old limitation since the introduction of loop versioning to
> the vectorizer - it's either peeling or versioning... (not for any good
> reason):)
> 
> >  and instead attach the versioning
> > condition to the pre-condition of the main loop that skips directly
> > to the epilogue if the number of iterations is too small.  We obviously
> > can use the epilogue loop as the non-vectorized version.
> >
> 
> maybe we should consider doing that also when versioning for
> aliasing/alignment...

Yes, the patch does this.

> > This patch also inserts an extra copyprop and dce pass before the
> > vectorizer so it can recognize the reduction in the above testcase
> > (LIM has made that reduction non-obvious).
> 
> cool. This was also repeatedly reported problem... I wonder if it doesn't
> solve a few missed-optimization PRs on the way...
>
> > So I noticed that
> > copyprop does not preserve loop-closed SSA form and fixed that as well.
> >
> > Some earlier version bootstrapped and tested ok on
> > x86_64-unknown-linux-gnu, a final attempt is still running.
> >
> > I didn't yet performance test this extensively, but it might need
> > cost-model adjustments and/or need to wait until we have profile
> > feedback to properly seed vectorizer analysis here.  A micro-benchmark
> > based on the above loop shows around 15% improvement on AMD K10.
> >
> > Feedback (and ppc testing) is still welcome of course.
> >
> 
> I didn't review the code very closely, but one tiny thing that would be
> nice to add is a printout to the dump file reporting that versioining for
> stride is done.

Indeed.

> about:
> 
> > +   /* ???  Delay this change until after versioning or
> > +      preserve the original step somewhere.  */

Ah, this comment is a left-over.  The change here is definitely needed.

> When we take this forward to support outer-loop vectorization with unknown
> inner-loop strides, we'd indeed want to
> delay this change.

I guess outer-loop vectorization should simply deal with non-constant
DR_STEP then, even when the versioning for non-constant strides is not
done?  So, the data-ref change should already support this?

I have SPEC2006 tested the patch and it doesn't show any improvements
or regressions.

Richard.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for vectorization
  2009-01-27 12:55   ` Richard Guenther
@ 2009-01-27 13:01     ` Dorit Nuzman
  2009-01-27 13:46       ` Richard Guenther
  0 siblings, 1 reply; 11+ messages in thread
From: Dorit Nuzman @ 2009-01-27 13:01 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches

Richard Guenther <rguenther@suse.de> wrote on 27/01/2009 14:42:44:

> On Tue, 27 Jan 2009, Dorit Nuzman wrote:
>
...
> > about:
> >
> > > +   /* ???  Delay this change until after versioning or
> > > +      preserve the original step somewhere.  */
>
> Ah, this comment is a left-over.  The change here is definitely needed.
>
> > When we take this forward to support outer-loop vectorization with
unknown
> > inner-loop strides, we'd indeed want to
> > delay this change.
>
> I guess outer-loop vectorization should simply deal with non-constant
> DR_STEP then, even when the versioning for non-constant strides is not
> done?

yes, exactly

> So, the data-ref change should already support this?
>

Sebastian suggested a patch for data-ref in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33113#c5. Note that dependence
analysis does not support unknown strides, but we don't always need
dependence analysis (e.g. when we have a reduction and no stores, or when
we obviously deal with different arrays).,

> I have SPEC2006 tested the patch and it doesn't show any improvements
> or regressions.
>

(wonder how many times we actually applied this versioning in SPEC?)

dorit

> Richard.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
  2009-01-27 13:01     ` Dorit Nuzman
@ 2009-01-27 13:46       ` Richard Guenther
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Guenther @ 2009-01-27 13:46 UTC (permalink / raw)
  To: Dorit Nuzman; +Cc: gcc-patches

On Tue, 27 Jan 2009, Dorit Nuzman wrote:

> Richard Guenther <rguenther@suse.de> wrote on 27/01/2009 14:42:44:
> 
> > On Tue, 27 Jan 2009, Dorit Nuzman wrote:
> >
> ...
> > > about:
> > >
> > > > +   /* ???  Delay this change until after versioning or
> > > > +      preserve the original step somewhere.  */
> >
> > Ah, this comment is a left-over.  The change here is definitely needed.
> >
> > > When we take this forward to support outer-loop vectorization with
> unknown
> > > inner-loop strides, we'd indeed want to
> > > delay this change.
> >
> > I guess outer-loop vectorization should simply deal with non-constant
> > DR_STEP then, even when the versioning for non-constant strides is not
> > done?
> 
> yes, exactly
> 
> > So, the data-ref change should already support this?
> >
> 
> Sebastian suggested a patch for data-ref in
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33113#c5. Note that dependence
> analysis does not support unknown strides, but we don't always need
> dependence analysis (e.g. when we have a reduction and no stores, or when
> we obviously deal with different arrays).,
> 
> > I have SPEC2006 tested the patch and it doesn't show any improvements
> > or regressions.
> >
> 
> (wonder how many times we actually applied this versioning in SPEC?)

It happens quiet often in tonto, but mostly on never called
subroutines.

Richard.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
  2009-01-23 18:03 [PATCH][RFC] Add versioning for constant strides for vectorization Richard Guenther
  2009-01-25 14:08 ` Ira Rosen
  2009-01-27 12:42 ` Dorit Nuzman
@ 2010-03-13 19:19 ` Jack Howarth
  2010-03-15 10:38   ` Richard Guenther
  2 siblings, 1 reply; 11+ messages in thread
From: Jack Howarth @ 2010-03-13 19:19 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches

On Fri, Jan 23, 2009 at 05:08:43PM +0100, Richard Guenther wrote:
> 
> This patch adds the capability to the vectorizer to perform versioning
> for the case of a constant (suitable) stride.  For example for
> 
> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
> 
> we can only apply vectorization if the strides of the fastest dimension
> of self, a and b are one (they are loaded from the passed array
> descriptors and thus appear as (loop invariant) variables).
> 
> During the implementation of this I noticed that peeling for
> number of iterations (we have to unroll the above loop twice, and so
> for an odd number of iterations have a epilogue loop for the remaining
> iteration(s)) does not play well with versioning and we end up
> vectorizing the wrong loop.  So I just disabled versioning if we
> apply peeling with an epilogue loop and instead attach the versioning
> condition to the pre-condition of the main loop that skips directly
> to the epilogue if the number of iterations is too small.  We obviously
> can use the epilogue loop as the non-vectorized version.
> 
> This patch also inserts an extra copyprop and dce pass before the
> vectorizer so it can recognize the reduction in the above testcase
> (LIM has made that reduction non-obvious).  So I noticed that
> copyprop does not preserve loop-closed SSA form and fixed that as well.
> 
> Some earlier version bootstrapped and tested ok on 
> x86_64-unknown-linux-gnu, a final attempt is still running.
> 
> I didn't yet performance test this extensively, but it might need
> cost-model adjustments and/or need to wait until we have profile
> feedback to properly seed vectorizer analysis here.  A micro-benchmark
> based on the above loop shows around 15% improvement on AMD K10.
> 
> Feedback (and ppc testing) is still welcome of course.
> 
> Thanks,
> Richard.
> 
> 2009-01-23  Richard Guenther  <rguenther@suse.de>
> 
> 	* passes.c (init_optimization_passes): Add copy-prop and dce
> 	before vectorization.
> 	* Makefile.in (tree-ssa-copy.o): Add $(CFGLOOP_H) dependency.
> 	* tree-ssa-copy.c (init_copy_prop): Do not propagate through
> 	single-argument PHIs if we are in loop-closed SSA form.
> 	* tree-data-ref.c (dr_analyze_innermost): Allow affine offsets.
> 	* tree-vect-analyze.c (vect_check_interleaving): Check that
> 	DR_STEP is constant.
> 	(vect_enhance_data_refs_alignment): If versioning for strides
> 	is required do not peel.
> 	(vect_analyze_data_ref_access): Allow non-constant step of
> 	a specific form, remember them for versioning.
> 	* params.def (vect-max-version-for-stride-checks): New param.
> 	(vect-version-for-stride-value): Likewise.
> 	* tree-vectorizer.c (slpeel_add_loop_guard): Pass extra guards
> 	for the pre-condition.
> 	(slpeel_tree_peel_loop_to_edge): Likewise.
> 	(new_loop_vec_info): Allocate stride versioning data.
> 	(destroy_loop_vec_info): Free stride versioning data.
> 	* tree-vectorizer.h (struct _loop_vec_info): Add variable_strides
> 	field.
> 	(LOOP_VINFO_VARIABLE_STRIDES): Define.
> 	(slpeel_tree_peel_loop_to_edge): Adjust declaration.
> 	* tree-vect-transform.c (vect_build_loop_niters): Take an
> 	optional sequence to append stmts.
> 	(vect_generate_tmps_on_preheader): Likewise.
> 	(vect_do_peeling_for_loop_bound): Take extra guards for the
> 	pre-condition.
> 	(vect_do_peeling_for_alignment): Adjust.
> 	(vect_create_cond_for_stride_checks): New function.
> 	(vect_loop_versioning): Take stmt and stmt list to put pre-condition
> 	guards if we are going to peel.  Do not apply versioning in that
> 	case.
> 	(vect_transform_loop): If we are peeling for loop bound only
> 	record extra pre-conditions, do not apply loop versioning.
> 
> 	* gcc.dg/vect/fast-math-vect-complex-5.c: New testcase.
> 	* gfortran.dg/vect/fast-math-vect-complex-1.f90: Likewise.
> 	* gfortran.dg/vect/fast-math-vect-stride-1.f90: Likewise.
> 
> Index: trunk/gcc/passes.c
> ===================================================================
> *** trunk.orig/gcc/passes.c	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/passes.c	2009-01-23 16:48:50.000000000 +0100
> *************** init_optimization_passes (void)
> *** 659,664 ****
> --- 659,666 ----
>   	  NEXT_PASS (pass_graphite_transforms);
>   	  NEXT_PASS (pass_iv_canon);
>   	  NEXT_PASS (pass_if_conversion);
> + 	  NEXT_PASS (pass_copy_prop);
> + 	  NEXT_PASS (pass_dce_loop);
>   	  NEXT_PASS (pass_vectorize);
>   	    {
>   	      struct opt_pass **p = &pass_vectorize.pass.sub;
> Index: trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c
> ===================================================================
> *** /dev/null	1970-01-01 00:00:00.000000000 +0000
> --- trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c	2009-01-23 16:48:50.000000000 +0100
> ***************
> *** 0 ****
> --- 1,18 ----
> + /* { dg-do compile } */
> + /* { dg-require-effective-target vect_double } */
> + 
> + #define NUM 64
> + _Complex double ad[NUM], bd[NUM], cd[NUM];
> + 
> + void testd(void)
> + {
> +   int i;
> +   int j;
> + 
> +   for (i = 0; i < NUM; i++)
> +     for (j = 0; j < NUM; j++)
> +       cd[i] = cd[i] + ad[j] * bd[j];
> + }
> + 
> + /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
> + /* { dg-final { cleanup-tree-dump "vect" } } */
> Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90
> ===================================================================
> *** /dev/null	1970-01-01 00:00:00.000000000 +0000
> --- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90	2009-01-23 16:48:50.000000000 +0100
> ***************
> *** 0 ****
> --- 1,16 ----
> + ! { dg-do compile }
> + 
> + subroutine to_product_of(self,a,b,a1,a2)
> +   complex(kind=8) :: self (:)
> +   complex(kind=8), intent(in) :: a(:,:)
> +   complex(kind=8), intent(in) :: b(:)
> +   integer a1,a2
> +   do i = 1,a1
> +     do j = 1,a2
> +       self(i) = self(i) + a(i,j)*b(j)
> +     end do
> +   end do
> + end subroutine
> + 
> + ! { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } }
> + ! { dg-final { cleanup-tree-dump "vect" } }
> Index: trunk/gcc/tree-data-ref.c
> ===================================================================
> *** trunk.orig/gcc/tree-data-ref.c	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-data-ref.c	2009-01-23 16:48:50.000000000 +0100
> *************** dr_analyze_innermost (struct data_refere
> *** 708,714 ****
>         offset_iv.base = ssize_int (0);
>         offset_iv.step = ssize_int (0);
>       }
> !   else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
>       {
>         if (dump_file && (dump_flags & TDF_DETAILS))
>   	fprintf (dump_file, "failed: evolution of offset is not affine.\n");
> --- 708,714 ----
>         offset_iv.base = ssize_int (0);
>         offset_iv.step = ssize_int (0);
>       }
> !   else if (!simple_iv (loop, stmt, poffset, &offset_iv, true))
>       {
>         if (dump_file && (dump_flags & TDF_DETAILS))
>   	fprintf (dump_file, "failed: evolution of offset is not affine.\n");
> Index: trunk/gcc/tree-vect-analyze.c
> ===================================================================
> *** trunk.orig/gcc/tree-vect-analyze.c	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-vect-analyze.c	2009-01-23 16:48:50.000000000 +0100
> *************** vect_check_interleaving (struct data_ref
> *** 1109,1114 ****
> --- 1109,1116 ----
>     type_size_b = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (drb))));
>   
>     if (type_size_a != type_size_b
> +       || TREE_CODE (DR_STEP (dra)) != INTEGER_CST
> +       || TREE_CODE (DR_STEP (drb)) != INTEGER_CST
>         || tree_int_cst_compare (DR_STEP (dra), DR_STEP (drb))
>         || !types_compatible_p (TREE_TYPE (DR_REF (dra)), 
>                                 TREE_TYPE (DR_REF (drb))))
> *************** vect_enhance_data_refs_alignment (loop_v
> *** 1825,1830 ****
> --- 1827,1833 ----
>     gimple stmt;
>     stmt_vec_info stmt_info;
>     int vect_versioning_for_alias_required;
> +   int vect_versioning_for_strides_required;
>   
>     if (vect_print_dump_info (REPORT_DETAILS))
>       fprintf (vect_dump, "=== vect_enhance_data_refs_alignment ===");
> *************** vect_enhance_data_refs_alignment (loop_v
> *** 1892,1904 ****
>   
>     vect_versioning_for_alias_required =
>       (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
>   
>     /* Temporarily, if versioning for alias is required, we disable peeling
>        until we support peeling and versioning.  Often peeling for alignment
>        will require peeling for loop-bound, which in turn requires that we
>        know how to adjust the loop ivs after the loop.  */
>     if (vect_versioning_for_alias_required
> !        || !vect_can_advance_ivs_p (loop_vinfo)
>         || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
>       do_peeling = false;
>   
> --- 1895,1910 ----
>   
>     vect_versioning_for_alias_required =
>       (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
> +   vect_versioning_for_strides_required =
> +     !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
>   
>     /* Temporarily, if versioning for alias is required, we disable peeling
>        until we support peeling and versioning.  Often peeling for alignment
>        will require peeling for loop-bound, which in turn requires that we
>        know how to adjust the loop ivs after the loop.  */
>     if (vect_versioning_for_alias_required
> !       || vect_versioning_for_strides_required
> !       || !vect_can_advance_ivs_p (loop_vinfo)
>         || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
>       do_peeling = false;
>   
> *************** vect_analyze_data_ref_access (struct dat
> *** 2349,2357 ****
>     stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>     loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> !   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
>   
> !   if (!step)
>       {
>         if (vect_print_dump_info (REPORT_DETAILS))
>   	fprintf (vect_dump, "bad data-ref access");
> --- 2355,2364 ----
>     stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>     loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> !   HOST_WIDE_INT dr_step;
>   
> !   if (!step
> !       || TREE_CODE (step) != INTEGER_CST)
>       {
>         if (vect_print_dump_info (REPORT_DETAILS))
>   	fprintf (vect_dump, "bad data-ref access");
> *************** vect_analyze_data_ref_access (struct dat
> *** 2359,2364 ****
> --- 2366,2372 ----
>       }
>   
>     /* Don't allow invariant accesses.  */
> +   dr_step = TREE_INT_CST_LOW (step);
>     if (dr_step == 0)
>       return false; 
>   
> *************** vect_analyze_data_refs (loop_vec_info lo
> *** 3563,3568 ****
> --- 3571,3620 ----
>             return false;
>           }
>   
> +       /* If the non-constant (but loop invariant) step is of the
> + 	 form NAME or NAME * CST where CST is the element size mark
> + 	 this ddr for versioning for strides and re-set DR_STEP
> + 	 to the value we will version for.  Otherwise reject
> + 	 non-constant steps.  */
> +       if (TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
> + 	{
> + 	  tree step = DR_STEP (dr);
> + 
> + 	  STRIP_NOPS (step);
> + 	  if (flag_tree_vect_loop_version
> + 	      && (TREE_CODE (step) == SSA_NAME
> + 		  || (TREE_CODE (step) == MULT_EXPR
> + 		      && TREE_CODE (TREE_OPERAND (step, 1)) == INTEGER_CST)))
> + 	    {
> + 	      tree stride;
> + 	      tree newstep;
> + 
> + 	      stride = step;
> + 	      if (TREE_CODE (step) == MULT_EXPR)
> + 		stride = TREE_OPERAND (step, 0);
> + 	      STRIP_NOPS (stride);
> + 	      if (TREE_CODE (stride) != SSA_NAME)
> + 		return false;
> + 
> + 	      bitmap_set_bit (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo),
> + 			      SSA_NAME_VERSION (stride));
> + 	      if (bitmap_count_bits (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo))
> + 		  > (unsigned)PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS))
> + 		return false;
> + 
> + 	      /* ???  Delay this change until after versioning or
> + 	         preserve the original step somewhere.  */
> + 	      newstep = build_int_cst (TREE_TYPE (step),
> + 		       PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE));
> + 	      if (TREE_CODE (step) == MULT_EXPR)
> + 		newstep = int_const_binop (MULT_EXPR, newstep,
> + 					   TREE_OPERAND (step, 1), false);
> + 	      DR_STEP (dr) = newstep;
> + 	    }
> + 	  else
> + 	    return false;
> + 	}
> + 
>         if (!DR_SYMBOL_TAG (dr))
>           {
>             if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
> Index: trunk/gcc/params.def
> ===================================================================
> *** trunk.orig/gcc/params.def	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/params.def	2009-01-23 16:48:50.000000000 +0100
> *************** DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_ALIA
> *** 506,511 ****
> --- 506,521 ----
>            "Bound on number of runtime checks inserted by the vectorizer's loop versioning for alias check",
>            10, 0, 0)
>   
> + DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS,
> +          "vect-max-version-for-stride-checks",
> +          "Bound on number of runtime checks inserted by the vectorizer's loop versioning for stride check",
> +          4, 0, 0)
> + 
> + DEFPARAM(PARAM_VECT_VERSION_FOR_STRIDE_VALUE,
> +          "vect-version-for-stride-value",
> +          "The constant stride in elements the vectorizer uses for loop versioning",
> +          1, 0, 0)
> + 
>   DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS,
>   	 "max-cselib-memory-locations",
>   	 "The maximum memory locations recorded by cselib",
> Index: trunk/gcc/tree-vectorizer.c
> ===================================================================
> *** trunk.orig/gcc/tree-vectorizer.c	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-vectorizer.c	2009-01-23 16:48:50.000000000 +0100
> *************** slpeel_tree_duplicate_loop_to_edge_cfg (
> *** 927,934 ****
>      Returns the skip edge.  */
>   
>   static edge
> ! slpeel_add_loop_guard (basic_block guard_bb, tree cond, basic_block exit_bb,
> ! 		       basic_block dom_bb)
>   {
>     gimple_stmt_iterator gsi;
>     edge new_e, enter_e;
> --- 927,935 ----
>      Returns the skip edge.  */
>   
>   static edge
> ! slpeel_add_loop_guard (basic_block guard_bb, tree cond,
> ! 		       gimple_seq cond_expr_stmt_list,
> ! 		       basic_block exit_bb, basic_block dom_bb)
>   {
>     gimple_stmt_iterator gsi;
>     edge new_e, enter_e;
> *************** slpeel_add_loop_guard (basic_block guard
> *** 941,951 ****
>     gsi = gsi_last_bb (guard_bb);
>   
>     cond = force_gimple_operand (cond, &gimplify_stmt_list, true, NULL_TREE);
>     cond_stmt = gimple_build_cond (NE_EXPR,
>   				 cond, build_int_cst (TREE_TYPE (cond), 0),
>   				 NULL_TREE, NULL_TREE);
> !   if (gimplify_stmt_list)
> !     gsi_insert_seq_after (&gsi, gimplify_stmt_list, GSI_NEW_STMT);
>   
>     gsi = gsi_last_bb (guard_bb);
>     gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
> --- 942,954 ----
>     gsi = gsi_last_bb (guard_bb);
>   
>     cond = force_gimple_operand (cond, &gimplify_stmt_list, true, NULL_TREE);
> +   if (gimplify_stmt_list)
> +     gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
>     cond_stmt = gimple_build_cond (NE_EXPR,
>   				 cond, build_int_cst (TREE_TYPE (cond), 0),
>   				 NULL_TREE, NULL_TREE);
> !   if (cond_expr_stmt_list)
> !     gsi_insert_seq_after (&gsi, cond_expr_stmt_list, GSI_NEW_STMT);
>   
>     gsi = gsi_last_bb (guard_bb);
>     gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
> *************** struct loop*
> *** 1151,1157 ****
>   slpeel_tree_peel_loop_to_edge (struct loop *loop, 
>   			       edge e, tree first_niters, 
>   			       tree niters, bool update_first_loop_count,
> ! 			       unsigned int th, bool check_profitability)
>   {
>     struct loop *new_loop = NULL, *first_loop, *second_loop;
>     edge skip_e;
> --- 1154,1161 ----
>   slpeel_tree_peel_loop_to_edge (struct loop *loop, 
>   			       edge e, tree first_niters, 
>   			       tree niters, bool update_first_loop_count,
> ! 			       unsigned int th, bool check_profitability,
> ! 			       tree cond_expr, gimple_seq cond_expr_stmt_list)
>   {
>     struct loop *new_loop = NULL, *first_loop, *second_loop;
>     edge skip_e;
> *************** slpeel_tree_peel_loop_to_edge (struct lo
> *** 1325,1330 ****
> --- 1329,1342 ----
>   	  pre_condition = fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
>   				       cost_pre_condition, pre_condition);
>   	}
> +       if (cond_expr)
> + 	{
> + 	  pre_condition =
> + 	    fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
> + 			 pre_condition,
> + 			 fold_build1 (TRUTH_NOT_EXPR, boolean_type_node,
> + 				      cond_expr));
> + 	}
>       }
>   
>     /* Prologue peeling.  */  
> *************** slpeel_tree_peel_loop_to_edge (struct lo
> *** 1340,1345 ****
> --- 1352,1358 ----
>       }
>   
>     skip_e = slpeel_add_loop_guard (bb_before_first_loop, pre_condition,
> + 				  cond_expr_stmt_list,
>                                     bb_before_second_loop, bb_before_first_loop);
>     slpeel_update_phi_nodes_for_guard1 (skip_e, first_loop,
>   				      first_loop == new_loop,
> *************** slpeel_tree_peel_loop_to_edge (struct lo
> *** 1377,1383 ****
>   
>     pre_condition = 
>   	fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
> !   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition,
>                                     bb_after_second_loop, bb_before_first_loop);
>     slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>                                        second_loop == new_loop, &new_exit_bb);
> --- 1390,1396 ----
>   
>     pre_condition = 
>   	fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
> !   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
>                                     bb_after_second_loop, bb_before_first_loop);
>     slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
>                                        second_loop == new_loop, &new_exit_bb);
> *************** new_loop_vec_info (struct loop *loop)
> *** 1714,1719 ****
> --- 1727,1733 ----
>     LOOP_VINFO_MAY_ALIAS_DDRS (res) =
>       VEC_alloc (ddr_p, heap,
>   	       PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_ALIAS_CHECKS));
> +   LOOP_VINFO_VARIABLE_STRIDES (res) = BITMAP_ALLOC (NULL);
>     LOOP_VINFO_STRIDED_STORES (res) = VEC_alloc (gimple, heap, 10);
>     LOOP_VINFO_SLP_INSTANCES (res) = VEC_alloc (slp_instance, heap, 10);
>     LOOP_VINFO_SLP_UNROLLING_FACTOR (res) = 1;
> *************** destroy_loop_vec_info (loop_vec_info loo
> *** 1800,1805 ****
> --- 1814,1820 ----
>     free_dependence_relations (LOOP_VINFO_DDRS (loop_vinfo));
>     VEC_free (gimple, heap, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo));
>     VEC_free (ddr_p, heap, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo));
> +   BITMAP_FREE (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
>     slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
>     for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
>       vect_free_slp_instance (instance);
> Index: trunk/gcc/tree-vectorizer.h
> ===================================================================
> *** trunk.orig/gcc/tree-vectorizer.h	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-vectorizer.h	2009-01-23 16:48:50.000000000 +0100
> *************** typedef struct _loop_vec_info {
> *** 210,215 ****
> --- 210,219 ----
>     /* All data dependences in the loop.  */
>     VEC (ddr_p, heap) *ddrs;
>   
> +   /* SSA_NAMEs representing variable strides in data references.
> +      Candidates for a run-time stride check.  */
> +   bitmap variable_strides;
> + 
>     /* Data Dependence Relations defining address ranges that are candidates
>        for a run-time aliasing check.  */
>     VEC (ddr_p, heap) *may_alias_ddrs;
> *************** typedef struct _loop_vec_info {
> *** 254,259 ****
> --- 258,264 ----
>   #define LOOP_VINFO_LOC(L)             (L)->loop_line_number
>   #define LOOP_VINFO_MAY_ALIAS_DDRS(L)  (L)->may_alias_ddrs
>   #define LOOP_VINFO_STRIDED_STORES(L)  (L)->strided_stores
> + #define LOOP_VINFO_VARIABLE_STRIDES(L) (L)->variable_strides
>   #define LOOP_VINFO_SLP_INSTANCES(L)   (L)->slp_instances
>   #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
>   
> *************** extern bitmap vect_memsyms_to_rename;
> *** 707,713 ****
>      divide by the vectorization factor, and to peel the first few iterations
>      to force the alignment of data references in the loop.  */
>   extern struct loop *slpeel_tree_peel_loop_to_edge 
> !   (struct loop *, edge, tree, tree, bool, unsigned int, bool);
>   extern void set_prologue_iterations (basic_block, tree,
>   				     struct loop *, unsigned int);
>   struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
> --- 712,718 ----
>      divide by the vectorization factor, and to peel the first few iterations
>      to force the alignment of data references in the loop.  */
>   extern struct loop *slpeel_tree_peel_loop_to_edge 
> !   (struct loop *, edge, tree, tree, bool, unsigned int, bool, tree, gimple_seq);
>   extern void set_prologue_iterations (basic_block, tree,
>   				     struct loop *, unsigned int);
>   struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
> Index: trunk/gcc/tree-vect-transform.c
> ===================================================================
> *** trunk.orig/gcc/tree-vect-transform.c	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-vect-transform.c	2009-01-23 16:48:50.000000000 +0100
> *************** static tree get_initial_def_for_reductio
> *** 65,72 ****
>   
>   /* Utility function dealing with loop peeling (not peeling itself).  */
>   static void vect_generate_tmps_on_preheader 
> !   (loop_vec_info, tree *, tree *, tree *);
> ! static tree vect_build_loop_niters (loop_vec_info);
>   static void vect_update_ivs_after_vectorizer (loop_vec_info, tree, edge); 
>   static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
>   static void vect_update_init_of_dr (struct data_reference *, tree niters);
> --- 65,72 ----
>   
>   /* Utility function dealing with loop peeling (not peeling itself).  */
>   static void vect_generate_tmps_on_preheader 
> !   (loop_vec_info, tree *, tree *, tree *, gimple_seq);
> ! static tree vect_build_loop_niters (loop_vec_info, gimple_seq);
>   static void vect_update_ivs_after_vectorizer (loop_vec_info, tree, edge); 
>   static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
>   static void vect_update_init_of_dr (struct data_reference *, tree niters);
> *************** vect_transform_stmt (gimple stmt, gimple
> *** 7199,7205 ****
>      on the loop preheader.  */
>   
>   static tree
> ! vect_build_loop_niters (loop_vec_info loop_vinfo)
>   {
>     tree ni_name, var;
>     gimple_seq stmts = NULL;
> --- 7199,7205 ----
>      on the loop preheader.  */
>   
>   static tree
> ! vect_build_loop_niters (loop_vec_info loop_vinfo, gimple_seq seq)
>   {
>     tree ni_name, var;
>     gimple_seq stmts = NULL;
> *************** vect_build_loop_niters (loop_vec_info lo
> *** 7214,7221 ****
>     pe = loop_preheader_edge (loop);
>     if (stmts)
>       {
> !       basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !       gcc_assert (!new_bb);
>       }
>         
>     return ni_name;
> --- 7214,7226 ----
>     pe = loop_preheader_edge (loop);
>     if (stmts)
>       {
> !       if (seq)
> ! 	gimple_seq_add_seq (&seq, stmts);
> !       else
> ! 	{
> ! 	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> ! 	  gcc_assert (!new_bb);
> ! 	}
>       }
>         
>     return ni_name;
> *************** static void
> *** 7234,7240 ****
>   vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, 
>   				 tree *ni_name_ptr,
>   				 tree *ratio_mult_vf_name_ptr, 
> ! 				 tree *ratio_name_ptr)
>   {
>   
>     edge pe;
> --- 7239,7246 ----
>   vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, 
>   				 tree *ni_name_ptr,
>   				 tree *ratio_mult_vf_name_ptr, 
> ! 				 tree *ratio_name_ptr,
> ! 				 gimple_seq cond_expr_stmt_list)
>   {
>   
>     edge pe;
> *************** vect_generate_tmps_on_preheader (loop_ve
> *** 7254,7260 ****
>     /* Generate temporary variable that contains 
>        number of iterations loop executes.  */
>   
> !   ni_name = vect_build_loop_niters (loop_vinfo);
>     log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
>   
>     /* Create: ratio = ni >> log2(vf) */
> --- 7260,7266 ----
>     /* Generate temporary variable that contains 
>        number of iterations loop executes.  */
>   
> !   ni_name = vect_build_loop_niters (loop_vinfo, cond_expr_stmt_list);
>     log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
>   
>     /* Create: ratio = ni >> log2(vf) */
> *************** vect_generate_tmps_on_preheader (loop_ve
> *** 7267,7275 ****
>   
>         stmts = NULL;
>         ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
> !       pe = loop_preheader_edge (loop);
> !       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !       gcc_assert (!new_bb);
>       }
>          
>     /* Create: ratio_mult_vf = ratio << log2 (vf).  */
> --- 7273,7286 ----
>   
>         stmts = NULL;
>         ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
> !       if (cond_expr_stmt_list)
> ! 	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
> !       else
> ! 	{
> ! 	  pe = loop_preheader_edge (loop);
> ! 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> ! 	  gcc_assert (!new_bb);
> ! 	}
>       }
>          
>     /* Create: ratio_mult_vf = ratio << log2 (vf).  */
> *************** vect_generate_tmps_on_preheader (loop_ve
> *** 7284,7292 ****
>         stmts = NULL;
>         ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
>   						 true, var);
> !       pe = loop_preheader_edge (loop);
> !       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> !       gcc_assert (!new_bb);
>       }
>   
>     *ni_name_ptr = ni_name;
> --- 7295,7308 ----
>         stmts = NULL;
>         ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
>   						 true, var);
> !       if (cond_expr_stmt_list)
> ! 	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
> !       else
> ! 	{
> ! 	  pe = loop_preheader_edge (loop);
> ! 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> ! 	  gcc_assert (!new_bb);
> ! 	}
>       }
>   
>     *ni_name_ptr = ni_name;
> *************** conservative_cost_threshold (loop_vec_in
> *** 7470,7476 ****
>      NITERS / VECTORIZATION_FACTOR times (this value is placed into RATIO).  */
>   
>   static void 
> ! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio)
>   {
>     tree ni_name, ratio_mult_vf_name;
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> --- 7486,7493 ----
>      NITERS / VECTORIZATION_FACTOR times (this value is placed into RATIO).  */
>   
>   static void 
> ! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> ! 				tree cond_expr, gimple_seq cond_expr_stmt_list)
>   {
>     tree ni_name, ratio_mult_vf_name;
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> *************** vect_do_peeling_for_loop_bound (loop_vec
> *** 7493,7499 ****
>        ratio = ni_name / vf
>        ratio_mult_vf_name = ratio * vf  */
>     vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
> ! 				   &ratio_mult_vf_name, ratio);
>   
>     loop_num  = loop->num; 
>   
> --- 7510,7517 ----
>        ratio = ni_name / vf
>        ratio_mult_vf_name = ratio * vf  */
>     vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
> ! 				   &ratio_mult_vf_name, ratio,
> ! 				   cond_expr_stmt_list);
>   
>     loop_num  = loop->num; 
>   
> *************** vect_do_peeling_for_loop_bound (loop_vec
> *** 7501,7507 ****
>        peeling for alignment.  */
>     if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
>         && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> !       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
>       {
>         check_profitability = true;
>   
> --- 7519,7526 ----
>        peeling for alignment.  */
>     if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
>         && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> !       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo)
> !       && !cond_expr)
>       {
>         check_profitability = true;
>   
> *************** vect_do_peeling_for_loop_bound (loop_vec
> *** 7514,7520 ****
>   
>     new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
>                                               ratio_mult_vf_name, ni_name, false,
> !                                             th, check_profitability);
>     gcc_assert (new_loop);
>     gcc_assert (loop_num == loop->num);
>   #ifdef ENABLE_CHECKING
> --- 7533,7540 ----
>   
>     new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
>                                               ratio_mult_vf_name, ni_name, false,
> !                                             th, check_profitability,
> ! 					    cond_expr, cond_expr_stmt_list);
>     gcc_assert (new_loop);
>     gcc_assert (loop_num == loop->num);
>   #ifdef ENABLE_CHECKING
> *************** vect_do_peeling_for_alignment (loop_vec_
> *** 7738,7744 ****
>   
>     initialize_original_copy_tables ();
>   
> !   ni_name = vect_build_loop_niters (loop_vinfo);
>     niters_of_prolog_loop = vect_gen_niters_for_prolog_loop (loop_vinfo, ni_name);
>     
>   
> --- 7758,7764 ----
>   
>     initialize_original_copy_tables ();
>   
> !   ni_name = vect_build_loop_niters (loop_vinfo, NULL);
>     niters_of_prolog_loop = vect_gen_niters_for_prolog_loop (loop_vinfo, ni_name);
>     
>   
> *************** vect_do_peeling_for_alignment (loop_vec_
> *** 7759,7765 ****
>     new_loop =
>       slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
>   				   niters_of_prolog_loop, ni_name, true,
> ! 				   th, check_profitability);
>   
>     gcc_assert (new_loop);
>   #ifdef ENABLE_CHECKING
> --- 7779,7785 ----
>     new_loop =
>       slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
>   				   niters_of_prolog_loop, ni_name, true,
> ! 				   th, check_profitability, NULL_TREE, NULL);
>   
>     gcc_assert (new_loop);
>   #ifdef ENABLE_CHECKING
> *************** vect_create_cond_for_align_checks (loop_
> *** 7909,7914 ****
> --- 7929,7981 ----
>       *cond_expr = part_cond_expr;
>   }
>   
> + /* Function vect_create_cond_for_stride_checks.
> + 
> +    Create a conditional expression that represents the stride checks for
> +    all of the stride SSA_NAMEs used in data references (array element
> +    references) whose stride must be checked at runtime.
> + 
> +    Input:
> +    COND_EXPR  - input conditional expression.  New conditions will be chained
> +                 with logical AND operation.
> +    LOOP_VINFO - on field of the loop information is used.
> +                 LOOP_VINFO_VARIABLE_STRIDES is a bitmap of SSA_NAMEs to be
> + 		checked.
> + 
> +    Output:
> +    COND_EXPR_STMT_LIST - statements needed to construct the conditional
> +                          expression.
> +    The returned value is the conditional expression to be used in the if
> +    statement that controls which version of the loop gets executed at runtime.
> + 
> +    The stride we do versioning for is currently specified by a compile-time
> +    param.  In future the stride should be chosen by information from
> +    profile-feedback.  */
> + 
> + static void
> + vect_create_cond_for_stride_checks (loop_vec_info loop_vinfo,
> + 				    tree *cond_expr)
> + {
> +   bitmap_iterator bi;
> +   unsigned int i;
> +   HOST_WIDE_INT stride;
> + 
> +   stride = PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE);
> + 
> +   EXECUTE_IF_SET_IN_BITMAP (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo), 0, i, bi)
> +     {
> +       tree name = ssa_name (i);
> +       tree cond = fold_build2 (EQ_EXPR, boolean_type_node,
> + 			       name,
> + 			       build_int_cst (TREE_TYPE (name), stride));
> +       if (*cond_expr)
> + 	*cond_expr = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
> + 				  *cond_expr, cond);
> +       else
> + 	*cond_expr = cond;
> +     }
> + }
> + 
>   /* Function vect_vfa_segment_size.
>   
>      Create an expression that computes the size of segment
> *************** vect_create_cond_for_alias_checks (loop_
> *** 8076,8087 ****
>      cost model initially.  */
>   
>   static void
> ! vect_loop_versioning (loop_vec_info loop_vinfo)
>   {
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>     struct loop *nloop;
> -   tree cond_expr = NULL_TREE;
> -   gimple_seq cond_expr_stmt_list = NULL;
>     basic_block condition_bb;
>     gimple_stmt_iterator gsi, cond_exp_gsi;
>     basic_block merge_bb;
> --- 8143,8153 ----
>      cost model initially.  */
>   
>   static void
> ! vect_loop_versioning (loop_vec_info loop_vinfo, bool do_versioning,
> ! 		      tree *cond_expr, gimple_seq *cond_expr_stmt_list)
>   {
>     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>     struct loop *nloop;
>     basic_block condition_bb;
>     gimple_stmt_iterator gsi, cond_exp_gsi;
>     basic_block merge_bb;
> *************** vect_loop_versioning (loop_vec_info loop
> *** 8101,8129 ****
>     th = conservative_cost_threshold (loop_vinfo,
>   				    min_profitable_iters);
>   
> !   cond_expr =
> !     build2 (GT_EXPR, boolean_type_node, scalar_loop_iters, 
> ! 	    build_int_cst (TREE_TYPE (scalar_loop_iters), th));
>   
> !   cond_expr = force_gimple_operand (cond_expr, &cond_expr_stmt_list,
> ! 				    false, NULL_TREE);
>   
>     if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
> !       vect_create_cond_for_align_checks (loop_vinfo, &cond_expr,
> ! 					 &cond_expr_stmt_list);
>   
>     if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> !     vect_create_cond_for_alias_checks (loop_vinfo, &cond_expr, 
> ! 				       &cond_expr_stmt_list);
>   
> !   cond_expr =
> !     fold_build2 (NE_EXPR, boolean_type_node, cond_expr, integer_zero_node);
> !   cond_expr =
> !     force_gimple_operand (cond_expr, &gimplify_stmt_list, true, NULL_TREE);
> !   gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
>   
>     initialize_original_copy_tables ();
> !   nloop = loop_version (loop, cond_expr, &condition_bb,
>   			prob, prob, REG_BR_PROB_BASE - prob, true);
>     free_original_copy_tables();
>   
> --- 8167,8200 ----
>     th = conservative_cost_threshold (loop_vinfo,
>   				    min_profitable_iters);
>   
> !   *cond_expr =
> !     fold_build2 (GT_EXPR, boolean_type_node, scalar_loop_iters,
> ! 		 build_int_cst (TREE_TYPE (scalar_loop_iters), th));
>   
> !   *cond_expr = force_gimple_operand (*cond_expr, cond_expr_stmt_list,
> ! 				     false, NULL_TREE);
>   
>     if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
> !       vect_create_cond_for_align_checks (loop_vinfo, cond_expr,
> ! 					 cond_expr_stmt_list);
>   
>     if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> !     vect_create_cond_for_alias_checks (loop_vinfo, cond_expr,
> ! 				       cond_expr_stmt_list);
>   
> !   if (!bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
> !     vect_create_cond_for_stride_checks (loop_vinfo, cond_expr);
> ! 
> !   *cond_expr =
> !     fold_build2 (NE_EXPR, boolean_type_node, *cond_expr, integer_zero_node);
> !   *cond_expr =
> !     force_gimple_operand (*cond_expr, &gimplify_stmt_list, true, NULL_TREE);
> !   gimple_seq_add_seq (cond_expr_stmt_list, gimplify_stmt_list);
> !   if (!do_versioning)
> !     return;
>   
>     initialize_original_copy_tables ();
> !   nloop = loop_version (loop, *cond_expr, &condition_bb,
>   			prob, prob, REG_BR_PROB_BASE - prob, true);
>     free_original_copy_tables();
>   
> *************** vect_loop_versioning (loop_vec_info loop
> *** 8154,8164 ****
>     /* End loop-exit-fixes after versioning.  */
>   
>     update_ssa (TODO_update_ssa);
> !   if (cond_expr_stmt_list)
>       {
>         cond_exp_gsi = gsi_last_bb (condition_bb);
> !       gsi_insert_seq_before (&cond_exp_gsi, cond_expr_stmt_list, GSI_SAME_STMT);
>       }
>   }
>   
>   /* Remove a group of stores (for SLP or interleaving), free their 
> --- 8225,8238 ----
>     /* End loop-exit-fixes after versioning.  */
>   
>     update_ssa (TODO_update_ssa);
> !   if (*cond_expr_stmt_list)
>       {
>         cond_exp_gsi = gsi_last_bb (condition_bb);
> !       gsi_insert_seq_before (&cond_exp_gsi, *cond_expr_stmt_list,
> ! 			     GSI_SAME_STMT);
> !       *cond_expr_stmt_list = NULL;
>       }
> +   *cond_expr = NULL_TREE;
>   }
>   
>   /* Remove a group of stores (for SLP or interleaving), free their 
> *************** vect_transform_loop (loop_vec_info loop_
> *** 8320,8342 ****
>     bool strided_store;
>     bool slp_scheduled = false;
>     unsigned int nunits;
>   
>     if (vect_print_dump_info (REPORT_DETAILS))
>       fprintf (vect_dump, "=== vec_transform_loop ===");
>   
> -   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> -       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> -     vect_loop_versioning (loop_vinfo);
> - 
> -   /* CHECKME: we wouldn't need this if we called update_ssa once
> -      for all loops.  */
> -   bitmap_zero (vect_memsyms_to_rename);
> - 
>     /* Peel the loop if there are data refs with unknown alignment.
>        Only one data ref with unknown store is allowed.  */
>   
>     if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
>       vect_do_peeling_for_alignment (loop_vinfo);
>     
>     /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
>        compile time constant), or it is a constant that doesn't divide by the
> --- 8394,8427 ----
>     bool strided_store;
>     bool slp_scheduled = false;
>     unsigned int nunits;
> +   tree cond_expr = NULL_TREE;
> +   gimple_seq cond_expr_stmt_list = NULL;
> +   bool do_peeling_for_loop_bound;
>   
>     if (vect_print_dump_info (REPORT_DETAILS))
>       fprintf (vect_dump, "=== vec_transform_loop ===");
>   
>     /* Peel the loop if there are data refs with unknown alignment.
>        Only one data ref with unknown store is allowed.  */
>   
>     if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
>       vect_do_peeling_for_alignment (loop_vinfo);
> + 
> +   do_peeling_for_loop_bound
> +     = (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +        || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> + 	   && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0));
> + 
> +   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> +       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> +       || !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
> +     vect_loop_versioning (loop_vinfo,
> + 			  !do_peeling_for_loop_bound,
> + 			  &cond_expr, &cond_expr_stmt_list);
> + 
> +   /* CHECKME: we wouldn't need this if we called update_ssa once
> +      for all loops.  */
> +   bitmap_zero (vect_memsyms_to_rename);
>     
>     /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
>        compile time constant), or it is a constant that doesn't divide by the
> *************** vect_transform_loop (loop_vec_info loop_
> *** 8346,8355 ****
>        will remain scalar and will compute the remaining (n%VF) iterations.
>        (VF is the vectorization factor).  */
>   
> !   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> !       || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> !           && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0))
> !     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio);
>     else
>       ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
>   		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
> --- 8431,8439 ----
>        will remain scalar and will compute the remaining (n%VF) iterations.
>        (VF is the vectorization factor).  */
>   
> !   if (do_peeling_for_loop_bound)
> !     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio,
> ! 				    cond_expr, cond_expr_stmt_list);
>     else
>       ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
>   		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
> Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90
> ===================================================================
> *** /dev/null	1970-01-01 00:00:00.000000000 +0000
> --- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90	2009-01-23 16:48:50.000000000 +0100
> ***************
> *** 0 ****
> --- 1,17 ----
> + ! { dg-do compile }
> + 
> + subroutine to_product_of(self,a,b)
> +   real(kind=8), dimension(:,:) :: self
> +   real(kind=8), dimension(:,:), intent(in) :: a, b
> +   integer(kind=kind(1)) :: dim1, dim2
> +   dim1 = size(self,1)
> +   dim2 = size(self,2)
> +   do i = 1,dim1
> +     do j = 1,dim2
> +       self(i,j) = sum(a(i,:)*b(:,j))
> +     end do
> +   end do
> + end subroutine
> + 
> + ! { dg-final { scan-tree-dump "vectorized 1 loop" "vect" } }
> + ! { dg-final { cleanup-tree-dump "vect" } }
> Index: trunk/gcc/Makefile.in
> ===================================================================
> *** trunk.orig/gcc/Makefile.in	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/Makefile.in	2009-01-23 16:48:50.000000000 +0100
> *************** tree-nrv.o : tree-nrv.c $(CONFIG_H) $(SY
> *** 2135,2141 ****
>   tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
>      $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h $(DIAGNOSTIC_H) \
>      $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
> !    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $(FLAGS_H)
>   tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H) \
>      $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h \
>      $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
> --- 2135,2142 ----
>   tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
>      $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h $(DIAGNOSTIC_H) \
>      $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
> !    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $(FLAGS_H) \
> !    $(CFGLOOP_H)
>   tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H) \
>      $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h \
>      $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
> Index: trunk/gcc/tree-ssa-copy.c
> ===================================================================
> *** trunk.orig/gcc/tree-ssa-copy.c	2009-01-23 16:47:53.000000000 +0100
> --- trunk/gcc/tree-ssa-copy.c	2009-01-23 16:48:50.000000000 +0100
> *************** along with GCC; see the file COPYING3.
> *** 37,42 ****
> --- 37,43 ----
>   #include "tree-pass.h"
>   #include "tree-ssa-propagate.h"
>   #include "langhooks.h"
> + #include "cfgloop.h"
>   
>   /* This file implements the copy propagation pass and provides a
>      handful of interfaces for performing const/copy propagation and
> *************** init_copy_prop (void)
> *** 991,997 ****
>             tree def;
>   
>   	  def = gimple_phi_result (phi);
> ! 	  if (!is_gimple_reg (def))
>               prop_set_simulate_again (phi, false);
>   	  else
>               prop_set_simulate_again (phi, true);
> --- 992,1004 ----
>             tree def;
>   
>   	  def = gimple_phi_result (phi);
> ! 	  if (!is_gimple_reg (def)
> ! 	      /* In loop-closed SSA form do not copy-propagate through
> ! 	         PHI nodes.  Technically this is only needed for loop
> ! 		 exit PHIs, but this is difficult to query.  */
> ! 	      || (current_loops
> ! 		  && gimple_phi_num_args (phi) == 1
> ! 		  && loops_state_satisfies_p (LOOP_CLOSED_SSA)))
>               prop_set_simulate_again (phi, false);
>   	  else
>               prop_set_simulate_again (phi, true);

Richard,
    Do you have an updated version of this patch which would apply against
current gcc trunk? Also, did this ever go into any of the branches?
              Jack

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
  2010-03-13 19:19 ` Jack Howarth
@ 2010-03-15 10:38   ` Richard Guenther
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Guenther @ 2010-03-15 10:38 UTC (permalink / raw)
  To: Jack Howarth; +Cc: gcc-patches

On Sat, 13 Mar 2010, Jack Howarth wrote:

> On Fri, Jan 23, 2009 at 05:08:43PM +0100, Richard Guenther wrote:
> > 
> > This patch adds the capability to the vectorizer to perform versioning
> > for the case of a constant (suitable) stride.  For example for
> > 
> > subroutine to_product_of(self,a,b,a1,a2)
> >   complex(kind=8) :: self (:)
> >   complex(kind=8), intent(in) :: a(:,:)
> >   complex(kind=8), intent(in) :: b(:)
> >   integer a1,a2
> >   do i = 1,a1
> >     do j = 1,a2
> >       self(i) = self(i) + a(j,i)*b(j)
> >     end do
> >   end do
> > end subroutine
> > 
> > we can only apply vectorization if the strides of the fastest dimension
> > of self, a and b are one (they are loaded from the passed array
> > descriptors and thus appear as (loop invariant) variables).
> > 
> > During the implementation of this I noticed that peeling for
> > number of iterations (we have to unroll the above loop twice, and so
> > for an odd number of iterations have a epilogue loop for the remaining
> > iteration(s)) does not play well with versioning and we end up
> > vectorizing the wrong loop.  So I just disabled versioning if we
> > apply peeling with an epilogue loop and instead attach the versioning
> > condition to the pre-condition of the main loop that skips directly
> > to the epilogue if the number of iterations is too small.  We obviously
> > can use the epilogue loop as the non-vectorized version.
> > 
> > This patch also inserts an extra copyprop and dce pass before the
> > vectorizer so it can recognize the reduction in the above testcase
> > (LIM has made that reduction non-obvious).  So I noticed that
> > copyprop does not preserve loop-closed SSA form and fixed that as well.
> > 
> > Some earlier version bootstrapped and tested ok on 
> > x86_64-unknown-linux-gnu, a final attempt is still running.
> > 
> > I didn't yet performance test this extensively, but it might need
> > cost-model adjustments and/or need to wait until we have profile
> > feedback to properly seed vectorizer analysis here.  A micro-benchmark
> > based on the above loop shows around 15% improvement on AMD K10.
> > 
> > Feedback (and ppc testing) is still welcome of course.
> > 
> > Thanks,
> > Richard.
> > 
> > 2009-01-23  Richard Guenther  <rguenther@suse.de>
> > 
> > 	* passes.c (init_optimization_passes): Add copy-prop and dce
> > 	before vectorization.
> > 	* Makefile.in (tree-ssa-copy.o): Add $(CFGLOOP_H) dependency.
> > 	* tree-ssa-copy.c (init_copy_prop): Do not propagate through
> > 	single-argument PHIs if we are in loop-closed SSA form.
> > 	* tree-data-ref.c (dr_analyze_innermost): Allow affine offsets.
> > 	* tree-vect-analyze.c (vect_check_interleaving): Check that
> > 	DR_STEP is constant.
> > 	(vect_enhance_data_refs_alignment): If versioning for strides
> > 	is required do not peel.
> > 	(vect_analyze_data_ref_access): Allow non-constant step of
> > 	a specific form, remember them for versioning.
> > 	* params.def (vect-max-version-for-stride-checks): New param.
> > 	(vect-version-for-stride-value): Likewise.
> > 	* tree-vectorizer.c (slpeel_add_loop_guard): Pass extra guards
> > 	for the pre-condition.
> > 	(slpeel_tree_peel_loop_to_edge): Likewise.
> > 	(new_loop_vec_info): Allocate stride versioning data.
> > 	(destroy_loop_vec_info): Free stride versioning data.
> > 	* tree-vectorizer.h (struct _loop_vec_info): Add variable_strides
> > 	field.
> > 	(LOOP_VINFO_VARIABLE_STRIDES): Define.
> > 	(slpeel_tree_peel_loop_to_edge): Adjust declaration.
> > 	* tree-vect-transform.c (vect_build_loop_niters): Take an
> > 	optional sequence to append stmts.
> > 	(vect_generate_tmps_on_preheader): Likewise.
> > 	(vect_do_peeling_for_loop_bound): Take extra guards for the
> > 	pre-condition.
> > 	(vect_do_peeling_for_alignment): Adjust.
> > 	(vect_create_cond_for_stride_checks): New function.
> > 	(vect_loop_versioning): Take stmt and stmt list to put pre-condition
> > 	guards if we are going to peel.  Do not apply versioning in that
> > 	case.
> > 	(vect_transform_loop): If we are peeling for loop bound only
> > 	record extra pre-conditions, do not apply loop versioning.
> > 
> > 	* gcc.dg/vect/fast-math-vect-complex-5.c: New testcase.
> > 	* gfortran.dg/vect/fast-math-vect-complex-1.f90: Likewise.
> > 	* gfortran.dg/vect/fast-math-vect-stride-1.f90: Likewise.
> > 
> > Index: trunk/gcc/passes.c
> > ===================================================================
> > *** trunk.orig/gcc/passes.c	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/passes.c	2009-01-23 16:48:50.000000000 +0100
> > *************** init_optimization_passes (void)
> > *** 659,664 ****
> > --- 659,666 ----
> >   	  NEXT_PASS (pass_graphite_transforms);
> >   	  NEXT_PASS (pass_iv_canon);
> >   	  NEXT_PASS (pass_if_conversion);
> > + 	  NEXT_PASS (pass_copy_prop);
> > + 	  NEXT_PASS (pass_dce_loop);
> >   	  NEXT_PASS (pass_vectorize);
> >   	    {
> >   	      struct opt_pass **p = &pass_vectorize.pass.sub;
> > Index: trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c
> > ===================================================================
> > *** /dev/null	1970-01-01 00:00:00.000000000 +0000
> > --- trunk/gcc/testsuite/gcc.dg/vect/fast-math-vect-complex-5.c	2009-01-23 16:48:50.000000000 +0100
> > ***************
> > *** 0 ****
> > --- 1,18 ----
> > + /* { dg-do compile } */
> > + /* { dg-require-effective-target vect_double } */
> > + 
> > + #define NUM 64
> > + _Complex double ad[NUM], bd[NUM], cd[NUM];
> > + 
> > + void testd(void)
> > + {
> > +   int i;
> > +   int j;
> > + 
> > +   for (i = 0; i < NUM; i++)
> > +     for (j = 0; j < NUM; j++)
> > +       cd[i] = cd[i] + ad[j] * bd[j];
> > + }
> > + 
> > + /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
> > + /* { dg-final { cleanup-tree-dump "vect" } } */
> > Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90
> > ===================================================================
> > *** /dev/null	1970-01-01 00:00:00.000000000 +0000
> > --- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-complex-1.f90	2009-01-23 16:48:50.000000000 +0100
> > ***************
> > *** 0 ****
> > --- 1,16 ----
> > + ! { dg-do compile }
> > + 
> > + subroutine to_product_of(self,a,b,a1,a2)
> > +   complex(kind=8) :: self (:)
> > +   complex(kind=8), intent(in) :: a(:,:)
> > +   complex(kind=8), intent(in) :: b(:)
> > +   integer a1,a2
> > +   do i = 1,a1
> > +     do j = 1,a2
> > +       self(i) = self(i) + a(i,j)*b(j)
> > +     end do
> > +   end do
> > + end subroutine
> > + 
> > + ! { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } }
> > + ! { dg-final { cleanup-tree-dump "vect" } }
> > Index: trunk/gcc/tree-data-ref.c
> > ===================================================================
> > *** trunk.orig/gcc/tree-data-ref.c	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/tree-data-ref.c	2009-01-23 16:48:50.000000000 +0100
> > *************** dr_analyze_innermost (struct data_refere
> > *** 708,714 ****
> >         offset_iv.base = ssize_int (0);
> >         offset_iv.step = ssize_int (0);
> >       }
> > !   else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
> >       {
> >         if (dump_file && (dump_flags & TDF_DETAILS))
> >   	fprintf (dump_file, "failed: evolution of offset is not affine.\n");
> > --- 708,714 ----
> >         offset_iv.base = ssize_int (0);
> >         offset_iv.step = ssize_int (0);
> >       }
> > !   else if (!simple_iv (loop, stmt, poffset, &offset_iv, true))
> >       {
> >         if (dump_file && (dump_flags & TDF_DETAILS))
> >   	fprintf (dump_file, "failed: evolution of offset is not affine.\n");
> > Index: trunk/gcc/tree-vect-analyze.c
> > ===================================================================
> > *** trunk.orig/gcc/tree-vect-analyze.c	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/tree-vect-analyze.c	2009-01-23 16:48:50.000000000 +0100
> > *************** vect_check_interleaving (struct data_ref
> > *** 1109,1114 ****
> > --- 1109,1116 ----
> >     type_size_b = TREE_INT_CST_LOW (TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (drb))));
> >   
> >     if (type_size_a != type_size_b
> > +       || TREE_CODE (DR_STEP (dra)) != INTEGER_CST
> > +       || TREE_CODE (DR_STEP (drb)) != INTEGER_CST
> >         || tree_int_cst_compare (DR_STEP (dra), DR_STEP (drb))
> >         || !types_compatible_p (TREE_TYPE (DR_REF (dra)), 
> >                                 TREE_TYPE (DR_REF (drb))))
> > *************** vect_enhance_data_refs_alignment (loop_v
> > *** 1825,1830 ****
> > --- 1827,1833 ----
> >     gimple stmt;
> >     stmt_vec_info stmt_info;
> >     int vect_versioning_for_alias_required;
> > +   int vect_versioning_for_strides_required;
> >   
> >     if (vect_print_dump_info (REPORT_DETAILS))
> >       fprintf (vect_dump, "=== vect_enhance_data_refs_alignment ===");
> > *************** vect_enhance_data_refs_alignment (loop_v
> > *** 1892,1904 ****
> >   
> >     vect_versioning_for_alias_required =
> >       (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
> >   
> >     /* Temporarily, if versioning for alias is required, we disable peeling
> >        until we support peeling and versioning.  Often peeling for alignment
> >        will require peeling for loop-bound, which in turn requires that we
> >        know how to adjust the loop ivs after the loop.  */
> >     if (vect_versioning_for_alias_required
> > !        || !vect_can_advance_ivs_p (loop_vinfo)
> >         || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
> >       do_peeling = false;
> >   
> > --- 1895,1910 ----
> >   
> >     vect_versioning_for_alias_required =
> >       (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)) > 0);
> > +   vect_versioning_for_strides_required =
> > +     !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
> >   
> >     /* Temporarily, if versioning for alias is required, we disable peeling
> >        until we support peeling and versioning.  Often peeling for alignment
> >        will require peeling for loop-bound, which in turn requires that we
> >        know how to adjust the loop ivs after the loop.  */
> >     if (vect_versioning_for_alias_required
> > !       || vect_versioning_for_strides_required
> > !       || !vect_can_advance_ivs_p (loop_vinfo)
> >         || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
> >       do_peeling = false;
> >   
> > *************** vect_analyze_data_ref_access (struct dat
> > *** 2349,2357 ****
> >     stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> >     loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> >     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > !   HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
> >   
> > !   if (!step)
> >       {
> >         if (vect_print_dump_info (REPORT_DETAILS))
> >   	fprintf (vect_dump, "bad data-ref access");
> > --- 2355,2364 ----
> >     stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> >     loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> >     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > !   HOST_WIDE_INT dr_step;
> >   
> > !   if (!step
> > !       || TREE_CODE (step) != INTEGER_CST)
> >       {
> >         if (vect_print_dump_info (REPORT_DETAILS))
> >   	fprintf (vect_dump, "bad data-ref access");
> > *************** vect_analyze_data_ref_access (struct dat
> > *** 2359,2364 ****
> > --- 2366,2372 ----
> >       }
> >   
> >     /* Don't allow invariant accesses.  */
> > +   dr_step = TREE_INT_CST_LOW (step);
> >     if (dr_step == 0)
> >       return false; 
> >   
> > *************** vect_analyze_data_refs (loop_vec_info lo
> > *** 3563,3568 ****
> > --- 3571,3620 ----
> >             return false;
> >           }
> >   
> > +       /* If the non-constant (but loop invariant) step is of the
> > + 	 form NAME or NAME * CST where CST is the element size mark
> > + 	 this ddr for versioning for strides and re-set DR_STEP
> > + 	 to the value we will version for.  Otherwise reject
> > + 	 non-constant steps.  */
> > +       if (TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
> > + 	{
> > + 	  tree step = DR_STEP (dr);
> > + 
> > + 	  STRIP_NOPS (step);
> > + 	  if (flag_tree_vect_loop_version
> > + 	      && (TREE_CODE (step) == SSA_NAME
> > + 		  || (TREE_CODE (step) == MULT_EXPR
> > + 		      && TREE_CODE (TREE_OPERAND (step, 1)) == INTEGER_CST)))
> > + 	    {
> > + 	      tree stride;
> > + 	      tree newstep;
> > + 
> > + 	      stride = step;
> > + 	      if (TREE_CODE (step) == MULT_EXPR)
> > + 		stride = TREE_OPERAND (step, 0);
> > + 	      STRIP_NOPS (stride);
> > + 	      if (TREE_CODE (stride) != SSA_NAME)
> > + 		return false;
> > + 
> > + 	      bitmap_set_bit (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo),
> > + 			      SSA_NAME_VERSION (stride));
> > + 	      if (bitmap_count_bits (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo))
> > + 		  > (unsigned)PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS))
> > + 		return false;
> > + 
> > + 	      /* ???  Delay this change until after versioning or
> > + 	         preserve the original step somewhere.  */
> > + 	      newstep = build_int_cst (TREE_TYPE (step),
> > + 		       PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE));
> > + 	      if (TREE_CODE (step) == MULT_EXPR)
> > + 		newstep = int_const_binop (MULT_EXPR, newstep,
> > + 					   TREE_OPERAND (step, 1), false);
> > + 	      DR_STEP (dr) = newstep;
> > + 	    }
> > + 	  else
> > + 	    return false;
> > + 	}
> > + 
> >         if (!DR_SYMBOL_TAG (dr))
> >           {
> >             if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
> > Index: trunk/gcc/params.def
> > ===================================================================
> > *** trunk.orig/gcc/params.def	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/params.def	2009-01-23 16:48:50.000000000 +0100
> > *************** DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_ALIA
> > *** 506,511 ****
> > --- 506,521 ----
> >            "Bound on number of runtime checks inserted by the vectorizer's loop versioning for alias check",
> >            10, 0, 0)
> >   
> > + DEFPARAM(PARAM_VECT_MAX_VERSION_FOR_STRIDE_CHECKS,
> > +          "vect-max-version-for-stride-checks",
> > +          "Bound on number of runtime checks inserted by the vectorizer's loop versioning for stride check",
> > +          4, 0, 0)
> > + 
> > + DEFPARAM(PARAM_VECT_VERSION_FOR_STRIDE_VALUE,
> > +          "vect-version-for-stride-value",
> > +          "The constant stride in elements the vectorizer uses for loop versioning",
> > +          1, 0, 0)
> > + 
> >   DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS,
> >   	 "max-cselib-memory-locations",
> >   	 "The maximum memory locations recorded by cselib",
> > Index: trunk/gcc/tree-vectorizer.c
> > ===================================================================
> > *** trunk.orig/gcc/tree-vectorizer.c	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/tree-vectorizer.c	2009-01-23 16:48:50.000000000 +0100
> > *************** slpeel_tree_duplicate_loop_to_edge_cfg (
> > *** 927,934 ****
> >      Returns the skip edge.  */
> >   
> >   static edge
> > ! slpeel_add_loop_guard (basic_block guard_bb, tree cond, basic_block exit_bb,
> > ! 		       basic_block dom_bb)
> >   {
> >     gimple_stmt_iterator gsi;
> >     edge new_e, enter_e;
> > --- 927,935 ----
> >      Returns the skip edge.  */
> >   
> >   static edge
> > ! slpeel_add_loop_guard (basic_block guard_bb, tree cond,
> > ! 		       gimple_seq cond_expr_stmt_list,
> > ! 		       basic_block exit_bb, basic_block dom_bb)
> >   {
> >     gimple_stmt_iterator gsi;
> >     edge new_e, enter_e;
> > *************** slpeel_add_loop_guard (basic_block guard
> > *** 941,951 ****
> >     gsi = gsi_last_bb (guard_bb);
> >   
> >     cond = force_gimple_operand (cond, &gimplify_stmt_list, true, NULL_TREE);
> >     cond_stmt = gimple_build_cond (NE_EXPR,
> >   				 cond, build_int_cst (TREE_TYPE (cond), 0),
> >   				 NULL_TREE, NULL_TREE);
> > !   if (gimplify_stmt_list)
> > !     gsi_insert_seq_after (&gsi, gimplify_stmt_list, GSI_NEW_STMT);
> >   
> >     gsi = gsi_last_bb (guard_bb);
> >     gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
> > --- 942,954 ----
> >     gsi = gsi_last_bb (guard_bb);
> >   
> >     cond = force_gimple_operand (cond, &gimplify_stmt_list, true, NULL_TREE);
> > +   if (gimplify_stmt_list)
> > +     gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
> >     cond_stmt = gimple_build_cond (NE_EXPR,
> >   				 cond, build_int_cst (TREE_TYPE (cond), 0),
> >   				 NULL_TREE, NULL_TREE);
> > !   if (cond_expr_stmt_list)
> > !     gsi_insert_seq_after (&gsi, cond_expr_stmt_list, GSI_NEW_STMT);
> >   
> >     gsi = gsi_last_bb (guard_bb);
> >     gsi_insert_after (&gsi, cond_stmt, GSI_NEW_STMT);
> > *************** struct loop*
> > *** 1151,1157 ****
> >   slpeel_tree_peel_loop_to_edge (struct loop *loop, 
> >   			       edge e, tree first_niters, 
> >   			       tree niters, bool update_first_loop_count,
> > ! 			       unsigned int th, bool check_profitability)
> >   {
> >     struct loop *new_loop = NULL, *first_loop, *second_loop;
> >     edge skip_e;
> > --- 1154,1161 ----
> >   slpeel_tree_peel_loop_to_edge (struct loop *loop, 
> >   			       edge e, tree first_niters, 
> >   			       tree niters, bool update_first_loop_count,
> > ! 			       unsigned int th, bool check_profitability,
> > ! 			       tree cond_expr, gimple_seq cond_expr_stmt_list)
> >   {
> >     struct loop *new_loop = NULL, *first_loop, *second_loop;
> >     edge skip_e;
> > *************** slpeel_tree_peel_loop_to_edge (struct lo
> > *** 1325,1330 ****
> > --- 1329,1342 ----
> >   	  pre_condition = fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
> >   				       cost_pre_condition, pre_condition);
> >   	}
> > +       if (cond_expr)
> > + 	{
> > + 	  pre_condition =
> > + 	    fold_build2 (TRUTH_OR_EXPR, boolean_type_node,
> > + 			 pre_condition,
> > + 			 fold_build1 (TRUTH_NOT_EXPR, boolean_type_node,
> > + 				      cond_expr));
> > + 	}
> >       }
> >   
> >     /* Prologue peeling.  */  
> > *************** slpeel_tree_peel_loop_to_edge (struct lo
> > *** 1340,1345 ****
> > --- 1352,1358 ----
> >       }
> >   
> >     skip_e = slpeel_add_loop_guard (bb_before_first_loop, pre_condition,
> > + 				  cond_expr_stmt_list,
> >                                     bb_before_second_loop, bb_before_first_loop);
> >     slpeel_update_phi_nodes_for_guard1 (skip_e, first_loop,
> >   				      first_loop == new_loop,
> > *************** slpeel_tree_peel_loop_to_edge (struct lo
> > *** 1377,1383 ****
> >   
> >     pre_condition = 
> >   	fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
> > !   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition,
> >                                     bb_after_second_loop, bb_before_first_loop);
> >     slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
> >                                        second_loop == new_loop, &new_exit_bb);
> > --- 1390,1396 ----
> >   
> >     pre_condition = 
> >   	fold_build2 (EQ_EXPR, boolean_type_node, first_niters, niters);
> > !   skip_e = slpeel_add_loop_guard (bb_between_loops, pre_condition, NULL,
> >                                     bb_after_second_loop, bb_before_first_loop);
> >     slpeel_update_phi_nodes_for_guard2 (skip_e, second_loop,
> >                                        second_loop == new_loop, &new_exit_bb);
> > *************** new_loop_vec_info (struct loop *loop)
> > *** 1714,1719 ****
> > --- 1727,1733 ----
> >     LOOP_VINFO_MAY_ALIAS_DDRS (res) =
> >       VEC_alloc (ddr_p, heap,
> >   	       PARAM_VALUE (PARAM_VECT_MAX_VERSION_FOR_ALIAS_CHECKS));
> > +   LOOP_VINFO_VARIABLE_STRIDES (res) = BITMAP_ALLOC (NULL);
> >     LOOP_VINFO_STRIDED_STORES (res) = VEC_alloc (gimple, heap, 10);
> >     LOOP_VINFO_SLP_INSTANCES (res) = VEC_alloc (slp_instance, heap, 10);
> >     LOOP_VINFO_SLP_UNROLLING_FACTOR (res) = 1;
> > *************** destroy_loop_vec_info (loop_vec_info loo
> > *** 1800,1805 ****
> > --- 1814,1820 ----
> >     free_dependence_relations (LOOP_VINFO_DDRS (loop_vinfo));
> >     VEC_free (gimple, heap, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo));
> >     VEC_free (ddr_p, heap, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo));
> > +   BITMAP_FREE (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo));
> >     slp_instances = LOOP_VINFO_SLP_INSTANCES (loop_vinfo);
> >     for (j = 0; VEC_iterate (slp_instance, slp_instances, j, instance); j++)
> >       vect_free_slp_instance (instance);
> > Index: trunk/gcc/tree-vectorizer.h
> > ===================================================================
> > *** trunk.orig/gcc/tree-vectorizer.h	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/tree-vectorizer.h	2009-01-23 16:48:50.000000000 +0100
> > *************** typedef struct _loop_vec_info {
> > *** 210,215 ****
> > --- 210,219 ----
> >     /* All data dependences in the loop.  */
> >     VEC (ddr_p, heap) *ddrs;
> >   
> > +   /* SSA_NAMEs representing variable strides in data references.
> > +      Candidates for a run-time stride check.  */
> > +   bitmap variable_strides;
> > + 
> >     /* Data Dependence Relations defining address ranges that are candidates
> >        for a run-time aliasing check.  */
> >     VEC (ddr_p, heap) *may_alias_ddrs;
> > *************** typedef struct _loop_vec_info {
> > *** 254,259 ****
> > --- 258,264 ----
> >   #define LOOP_VINFO_LOC(L)             (L)->loop_line_number
> >   #define LOOP_VINFO_MAY_ALIAS_DDRS(L)  (L)->may_alias_ddrs
> >   #define LOOP_VINFO_STRIDED_STORES(L)  (L)->strided_stores
> > + #define LOOP_VINFO_VARIABLE_STRIDES(L) (L)->variable_strides
> >   #define LOOP_VINFO_SLP_INSTANCES(L)   (L)->slp_instances
> >   #define LOOP_VINFO_SLP_UNROLLING_FACTOR(L) (L)->slp_unrolling_factor
> >   
> > *************** extern bitmap vect_memsyms_to_rename;
> > *** 707,713 ****
> >      divide by the vectorization factor, and to peel the first few iterations
> >      to force the alignment of data references in the loop.  */
> >   extern struct loop *slpeel_tree_peel_loop_to_edge 
> > !   (struct loop *, edge, tree, tree, bool, unsigned int, bool);
> >   extern void set_prologue_iterations (basic_block, tree,
> >   				     struct loop *, unsigned int);
> >   struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
> > --- 712,718 ----
> >      divide by the vectorization factor, and to peel the first few iterations
> >      to force the alignment of data references in the loop.  */
> >   extern struct loop *slpeel_tree_peel_loop_to_edge 
> > !   (struct loop *, edge, tree, tree, bool, unsigned int, bool, tree, gimple_seq);
> >   extern void set_prologue_iterations (basic_block, tree,
> >   				     struct loop *, unsigned int);
> >   struct loop *tree_duplicate_loop_on_edge (struct loop *, edge);
> > Index: trunk/gcc/tree-vect-transform.c
> > ===================================================================
> > *** trunk.orig/gcc/tree-vect-transform.c	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/tree-vect-transform.c	2009-01-23 16:48:50.000000000 +0100
> > *************** static tree get_initial_def_for_reductio
> > *** 65,72 ****
> >   
> >   /* Utility function dealing with loop peeling (not peeling itself).  */
> >   static void vect_generate_tmps_on_preheader 
> > !   (loop_vec_info, tree *, tree *, tree *);
> > ! static tree vect_build_loop_niters (loop_vec_info);
> >   static void vect_update_ivs_after_vectorizer (loop_vec_info, tree, edge); 
> >   static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
> >   static void vect_update_init_of_dr (struct data_reference *, tree niters);
> > --- 65,72 ----
> >   
> >   /* Utility function dealing with loop peeling (not peeling itself).  */
> >   static void vect_generate_tmps_on_preheader 
> > !   (loop_vec_info, tree *, tree *, tree *, gimple_seq);
> > ! static tree vect_build_loop_niters (loop_vec_info, gimple_seq);
> >   static void vect_update_ivs_after_vectorizer (loop_vec_info, tree, edge); 
> >   static tree vect_gen_niters_for_prolog_loop (loop_vec_info, tree);
> >   static void vect_update_init_of_dr (struct data_reference *, tree niters);
> > *************** vect_transform_stmt (gimple stmt, gimple
> > *** 7199,7205 ****
> >      on the loop preheader.  */
> >   
> >   static tree
> > ! vect_build_loop_niters (loop_vec_info loop_vinfo)
> >   {
> >     tree ni_name, var;
> >     gimple_seq stmts = NULL;
> > --- 7199,7205 ----
> >      on the loop preheader.  */
> >   
> >   static tree
> > ! vect_build_loop_niters (loop_vec_info loop_vinfo, gimple_seq seq)
> >   {
> >     tree ni_name, var;
> >     gimple_seq stmts = NULL;
> > *************** vect_build_loop_niters (loop_vec_info lo
> > *** 7214,7221 ****
> >     pe = loop_preheader_edge (loop);
> >     if (stmts)
> >       {
> > !       basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> > !       gcc_assert (!new_bb);
> >       }
> >         
> >     return ni_name;
> > --- 7214,7226 ----
> >     pe = loop_preheader_edge (loop);
> >     if (stmts)
> >       {
> > !       if (seq)
> > ! 	gimple_seq_add_seq (&seq, stmts);
> > !       else
> > ! 	{
> > ! 	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> > ! 	  gcc_assert (!new_bb);
> > ! 	}
> >       }
> >         
> >     return ni_name;
> > *************** static void
> > *** 7234,7240 ****
> >   vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, 
> >   				 tree *ni_name_ptr,
> >   				 tree *ratio_mult_vf_name_ptr, 
> > ! 				 tree *ratio_name_ptr)
> >   {
> >   
> >     edge pe;
> > --- 7239,7246 ----
> >   vect_generate_tmps_on_preheader (loop_vec_info loop_vinfo, 
> >   				 tree *ni_name_ptr,
> >   				 tree *ratio_mult_vf_name_ptr, 
> > ! 				 tree *ratio_name_ptr,
> > ! 				 gimple_seq cond_expr_stmt_list)
> >   {
> >   
> >     edge pe;
> > *************** vect_generate_tmps_on_preheader (loop_ve
> > *** 7254,7260 ****
> >     /* Generate temporary variable that contains 
> >        number of iterations loop executes.  */
> >   
> > !   ni_name = vect_build_loop_niters (loop_vinfo);
> >     log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
> >   
> >     /* Create: ratio = ni >> log2(vf) */
> > --- 7260,7266 ----
> >     /* Generate temporary variable that contains 
> >        number of iterations loop executes.  */
> >   
> > !   ni_name = vect_build_loop_niters (loop_vinfo, cond_expr_stmt_list);
> >     log_vf = build_int_cst (TREE_TYPE (ni), exact_log2 (vf));
> >   
> >     /* Create: ratio = ni >> log2(vf) */
> > *************** vect_generate_tmps_on_preheader (loop_ve
> > *** 7267,7275 ****
> >   
> >         stmts = NULL;
> >         ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
> > !       pe = loop_preheader_edge (loop);
> > !       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> > !       gcc_assert (!new_bb);
> >       }
> >          
> >     /* Create: ratio_mult_vf = ratio << log2 (vf).  */
> > --- 7273,7286 ----
> >   
> >         stmts = NULL;
> >         ratio_name = force_gimple_operand (ratio_name, &stmts, true, var);
> > !       if (cond_expr_stmt_list)
> > ! 	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
> > !       else
> > ! 	{
> > ! 	  pe = loop_preheader_edge (loop);
> > ! 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> > ! 	  gcc_assert (!new_bb);
> > ! 	}
> >       }
> >          
> >     /* Create: ratio_mult_vf = ratio << log2 (vf).  */
> > *************** vect_generate_tmps_on_preheader (loop_ve
> > *** 7284,7292 ****
> >         stmts = NULL;
> >         ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
> >   						 true, var);
> > !       pe = loop_preheader_edge (loop);
> > !       new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> > !       gcc_assert (!new_bb);
> >       }
> >   
> >     *ni_name_ptr = ni_name;
> > --- 7295,7308 ----
> >         stmts = NULL;
> >         ratio_mult_vf_name = force_gimple_operand (ratio_mult_vf_name, &stmts,
> >   						 true, var);
> > !       if (cond_expr_stmt_list)
> > ! 	gimple_seq_add_seq (&cond_expr_stmt_list, stmts);
> > !       else
> > ! 	{
> > ! 	  pe = loop_preheader_edge (loop);
> > ! 	  new_bb = gsi_insert_seq_on_edge_immediate (pe, stmts);
> > ! 	  gcc_assert (!new_bb);
> > ! 	}
> >       }
> >   
> >     *ni_name_ptr = ni_name;
> > *************** conservative_cost_threshold (loop_vec_in
> > *** 7470,7476 ****
> >      NITERS / VECTORIZATION_FACTOR times (this value is placed into RATIO).  */
> >   
> >   static void 
> > ! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio)
> >   {
> >     tree ni_name, ratio_mult_vf_name;
> >     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > --- 7486,7493 ----
> >      NITERS / VECTORIZATION_FACTOR times (this value is placed into RATIO).  */
> >   
> >   static void 
> > ! vect_do_peeling_for_loop_bound (loop_vec_info loop_vinfo, tree *ratio,
> > ! 				tree cond_expr, gimple_seq cond_expr_stmt_list)
> >   {
> >     tree ni_name, ratio_mult_vf_name;
> >     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > *************** vect_do_peeling_for_loop_bound (loop_vec
> > *** 7493,7499 ****
> >        ratio = ni_name / vf
> >        ratio_mult_vf_name = ratio * vf  */
> >     vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
> > ! 				   &ratio_mult_vf_name, ratio);
> >   
> >     loop_num  = loop->num; 
> >   
> > --- 7510,7517 ----
> >        ratio = ni_name / vf
> >        ratio_mult_vf_name = ratio * vf  */
> >     vect_generate_tmps_on_preheader (loop_vinfo, &ni_name,
> > ! 				   &ratio_mult_vf_name, ratio,
> > ! 				   cond_expr_stmt_list);
> >   
> >     loop_num  = loop->num; 
> >   
> > *************** vect_do_peeling_for_loop_bound (loop_vec
> > *** 7501,7507 ****
> >        peeling for alignment.  */
> >     if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> >         && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> > !       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
> >       {
> >         check_profitability = true;
> >   
> > --- 7519,7526 ----
> >        peeling for alignment.  */
> >     if (!VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> >         && !VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> > !       && !LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > !       && !cond_expr)
> >       {
> >         check_profitability = true;
> >   
> > *************** vect_do_peeling_for_loop_bound (loop_vec
> > *** 7514,7520 ****
> >   
> >     new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
> >                                               ratio_mult_vf_name, ni_name, false,
> > !                                             th, check_profitability);
> >     gcc_assert (new_loop);
> >     gcc_assert (loop_num == loop->num);
> >   #ifdef ENABLE_CHECKING
> > --- 7533,7540 ----
> >   
> >     new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
> >                                               ratio_mult_vf_name, ni_name, false,
> > !                                             th, check_profitability,
> > ! 					    cond_expr, cond_expr_stmt_list);
> >     gcc_assert (new_loop);
> >     gcc_assert (loop_num == loop->num);
> >   #ifdef ENABLE_CHECKING
> > *************** vect_do_peeling_for_alignment (loop_vec_
> > *** 7738,7744 ****
> >   
> >     initialize_original_copy_tables ();
> >   
> > !   ni_name = vect_build_loop_niters (loop_vinfo);
> >     niters_of_prolog_loop = vect_gen_niters_for_prolog_loop (loop_vinfo, ni_name);
> >     
> >   
> > --- 7758,7764 ----
> >   
> >     initialize_original_copy_tables ();
> >   
> > !   ni_name = vect_build_loop_niters (loop_vinfo, NULL);
> >     niters_of_prolog_loop = vect_gen_niters_for_prolog_loop (loop_vinfo, ni_name);
> >     
> >   
> > *************** vect_do_peeling_for_alignment (loop_vec_
> > *** 7759,7765 ****
> >     new_loop =
> >       slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
> >   				   niters_of_prolog_loop, ni_name, true,
> > ! 				   th, check_profitability);
> >   
> >     gcc_assert (new_loop);
> >   #ifdef ENABLE_CHECKING
> > --- 7779,7785 ----
> >     new_loop =
> >       slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
> >   				   niters_of_prolog_loop, ni_name, true,
> > ! 				   th, check_profitability, NULL_TREE, NULL);
> >   
> >     gcc_assert (new_loop);
> >   #ifdef ENABLE_CHECKING
> > *************** vect_create_cond_for_align_checks (loop_
> > *** 7909,7914 ****
> > --- 7929,7981 ----
> >       *cond_expr = part_cond_expr;
> >   }
> >   
> > + /* Function vect_create_cond_for_stride_checks.
> > + 
> > +    Create a conditional expression that represents the stride checks for
> > +    all of the stride SSA_NAMEs used in data references (array element
> > +    references) whose stride must be checked at runtime.
> > + 
> > +    Input:
> > +    COND_EXPR  - input conditional expression.  New conditions will be chained
> > +                 with logical AND operation.
> > +    LOOP_VINFO - on field of the loop information is used.
> > +                 LOOP_VINFO_VARIABLE_STRIDES is a bitmap of SSA_NAMEs to be
> > + 		checked.
> > + 
> > +    Output:
> > +    COND_EXPR_STMT_LIST - statements needed to construct the conditional
> > +                          expression.
> > +    The returned value is the conditional expression to be used in the if
> > +    statement that controls which version of the loop gets executed at runtime.
> > + 
> > +    The stride we do versioning for is currently specified by a compile-time
> > +    param.  In future the stride should be chosen by information from
> > +    profile-feedback.  */
> > + 
> > + static void
> > + vect_create_cond_for_stride_checks (loop_vec_info loop_vinfo,
> > + 				    tree *cond_expr)
> > + {
> > +   bitmap_iterator bi;
> > +   unsigned int i;
> > +   HOST_WIDE_INT stride;
> > + 
> > +   stride = PARAM_VALUE (PARAM_VECT_VERSION_FOR_STRIDE_VALUE);
> > + 
> > +   EXECUTE_IF_SET_IN_BITMAP (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo), 0, i, bi)
> > +     {
> > +       tree name = ssa_name (i);
> > +       tree cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > + 			       name,
> > + 			       build_int_cst (TREE_TYPE (name), stride));
> > +       if (*cond_expr)
> > + 	*cond_expr = fold_build2 (TRUTH_AND_EXPR, boolean_type_node,
> > + 				  *cond_expr, cond);
> > +       else
> > + 	*cond_expr = cond;
> > +     }
> > + }
> > + 
> >   /* Function vect_vfa_segment_size.
> >   
> >      Create an expression that computes the size of segment
> > *************** vect_create_cond_for_alias_checks (loop_
> > *** 8076,8087 ****
> >      cost model initially.  */
> >   
> >   static void
> > ! vect_loop_versioning (loop_vec_info loop_vinfo)
> >   {
> >     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> >     struct loop *nloop;
> > -   tree cond_expr = NULL_TREE;
> > -   gimple_seq cond_expr_stmt_list = NULL;
> >     basic_block condition_bb;
> >     gimple_stmt_iterator gsi, cond_exp_gsi;
> >     basic_block merge_bb;
> > --- 8143,8153 ----
> >      cost model initially.  */
> >   
> >   static void
> > ! vect_loop_versioning (loop_vec_info loop_vinfo, bool do_versioning,
> > ! 		      tree *cond_expr, gimple_seq *cond_expr_stmt_list)
> >   {
> >     struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> >     struct loop *nloop;
> >     basic_block condition_bb;
> >     gimple_stmt_iterator gsi, cond_exp_gsi;
> >     basic_block merge_bb;
> > *************** vect_loop_versioning (loop_vec_info loop
> > *** 8101,8129 ****
> >     th = conservative_cost_threshold (loop_vinfo,
> >   				    min_profitable_iters);
> >   
> > !   cond_expr =
> > !     build2 (GT_EXPR, boolean_type_node, scalar_loop_iters, 
> > ! 	    build_int_cst (TREE_TYPE (scalar_loop_iters), th));
> >   
> > !   cond_expr = force_gimple_operand (cond_expr, &cond_expr_stmt_list,
> > ! 				    false, NULL_TREE);
> >   
> >     if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
> > !       vect_create_cond_for_align_checks (loop_vinfo, &cond_expr,
> > ! 					 &cond_expr_stmt_list);
> >   
> >     if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> > !     vect_create_cond_for_alias_checks (loop_vinfo, &cond_expr, 
> > ! 				       &cond_expr_stmt_list);
> >   
> > !   cond_expr =
> > !     fold_build2 (NE_EXPR, boolean_type_node, cond_expr, integer_zero_node);
> > !   cond_expr =
> > !     force_gimple_operand (cond_expr, &gimplify_stmt_list, true, NULL_TREE);
> > !   gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
> >   
> >     initialize_original_copy_tables ();
> > !   nloop = loop_version (loop, cond_expr, &condition_bb,
> >   			prob, prob, REG_BR_PROB_BASE - prob, true);
> >     free_original_copy_tables();
> >   
> > --- 8167,8200 ----
> >     th = conservative_cost_threshold (loop_vinfo,
> >   				    min_profitable_iters);
> >   
> > !   *cond_expr =
> > !     fold_build2 (GT_EXPR, boolean_type_node, scalar_loop_iters,
> > ! 		 build_int_cst (TREE_TYPE (scalar_loop_iters), th));
> >   
> > !   *cond_expr = force_gimple_operand (*cond_expr, cond_expr_stmt_list,
> > ! 				     false, NULL_TREE);
> >   
> >     if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo)))
> > !       vect_create_cond_for_align_checks (loop_vinfo, cond_expr,
> > ! 					 cond_expr_stmt_list);
> >   
> >     if (VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> > !     vect_create_cond_for_alias_checks (loop_vinfo, cond_expr,
> > ! 				       cond_expr_stmt_list);
> >   
> > !   if (!bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
> > !     vect_create_cond_for_stride_checks (loop_vinfo, cond_expr);
> > ! 
> > !   *cond_expr =
> > !     fold_build2 (NE_EXPR, boolean_type_node, *cond_expr, integer_zero_node);
> > !   *cond_expr =
> > !     force_gimple_operand (*cond_expr, &gimplify_stmt_list, true, NULL_TREE);
> > !   gimple_seq_add_seq (cond_expr_stmt_list, gimplify_stmt_list);
> > !   if (!do_versioning)
> > !     return;
> >   
> >     initialize_original_copy_tables ();
> > !   nloop = loop_version (loop, *cond_expr, &condition_bb,
> >   			prob, prob, REG_BR_PROB_BASE - prob, true);
> >     free_original_copy_tables();
> >   
> > *************** vect_loop_versioning (loop_vec_info loop
> > *** 8154,8164 ****
> >     /* End loop-exit-fixes after versioning.  */
> >   
> >     update_ssa (TODO_update_ssa);
> > !   if (cond_expr_stmt_list)
> >       {
> >         cond_exp_gsi = gsi_last_bb (condition_bb);
> > !       gsi_insert_seq_before (&cond_exp_gsi, cond_expr_stmt_list, GSI_SAME_STMT);
> >       }
> >   }
> >   
> >   /* Remove a group of stores (for SLP or interleaving), free their 
> > --- 8225,8238 ----
> >     /* End loop-exit-fixes after versioning.  */
> >   
> >     update_ssa (TODO_update_ssa);
> > !   if (*cond_expr_stmt_list)
> >       {
> >         cond_exp_gsi = gsi_last_bb (condition_bb);
> > !       gsi_insert_seq_before (&cond_exp_gsi, *cond_expr_stmt_list,
> > ! 			     GSI_SAME_STMT);
> > !       *cond_expr_stmt_list = NULL;
> >       }
> > +   *cond_expr = NULL_TREE;
> >   }
> >   
> >   /* Remove a group of stores (for SLP or interleaving), free their 
> > *************** vect_transform_loop (loop_vec_info loop_
> > *** 8320,8342 ****
> >     bool strided_store;
> >     bool slp_scheduled = false;
> >     unsigned int nunits;
> >   
> >     if (vect_print_dump_info (REPORT_DETAILS))
> >       fprintf (vect_dump, "=== vec_transform_loop ===");
> >   
> > -   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> > -       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo)))
> > -     vect_loop_versioning (loop_vinfo);
> > - 
> > -   /* CHECKME: we wouldn't need this if we called update_ssa once
> > -      for all loops.  */
> > -   bitmap_zero (vect_memsyms_to_rename);
> > - 
> >     /* Peel the loop if there are data refs with unknown alignment.
> >        Only one data ref with unknown store is allowed.  */
> >   
> >     if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
> >       vect_do_peeling_for_alignment (loop_vinfo);
> >     
> >     /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
> >        compile time constant), or it is a constant that doesn't divide by the
> > --- 8394,8427 ----
> >     bool strided_store;
> >     bool slp_scheduled = false;
> >     unsigned int nunits;
> > +   tree cond_expr = NULL_TREE;
> > +   gimple_seq cond_expr_stmt_list = NULL;
> > +   bool do_peeling_for_loop_bound;
> >   
> >     if (vect_print_dump_info (REPORT_DETAILS))
> >       fprintf (vect_dump, "=== vec_transform_loop ===");
> >   
> >     /* Peel the loop if there are data refs with unknown alignment.
> >        Only one data ref with unknown store is allowed.  */
> >   
> >     if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
> >       vect_do_peeling_for_alignment (loop_vinfo);
> > + 
> > +   do_peeling_for_loop_bound
> > +     = (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > +        || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > + 	   && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0));
> > + 
> > +   if (VEC_length (gimple, LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> > +       || VEC_length (ddr_p, LOOP_VINFO_MAY_ALIAS_DDRS (loop_vinfo))
> > +       || !bitmap_empty_p (LOOP_VINFO_VARIABLE_STRIDES (loop_vinfo)))
> > +     vect_loop_versioning (loop_vinfo,
> > + 			  !do_peeling_for_loop_bound,
> > + 			  &cond_expr, &cond_expr_stmt_list);
> > + 
> > +   /* CHECKME: we wouldn't need this if we called update_ssa once
> > +      for all loops.  */
> > +   bitmap_zero (vect_memsyms_to_rename);
> >     
> >     /* If the loop has a symbolic number of iterations 'n' (i.e. it's not a
> >        compile time constant), or it is a constant that doesn't divide by the
> > *************** vect_transform_loop (loop_vec_info loop_
> > *** 8346,8355 ****
> >        will remain scalar and will compute the remaining (n%VF) iterations.
> >        (VF is the vectorization factor).  */
> >   
> > !   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > !       || (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > !           && LOOP_VINFO_INT_NITERS (loop_vinfo) % vectorization_factor != 0))
> > !     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio);
> >     else
> >       ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
> >   		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
> > --- 8431,8439 ----
> >        will remain scalar and will compute the remaining (n%VF) iterations.
> >        (VF is the vectorization factor).  */
> >   
> > !   if (do_peeling_for_loop_bound)
> > !     vect_do_peeling_for_loop_bound (loop_vinfo, &ratio,
> > ! 				    cond_expr, cond_expr_stmt_list);
> >     else
> >       ratio = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
> >   		LOOP_VINFO_INT_NITERS (loop_vinfo) / vectorization_factor);
> > Index: trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90
> > ===================================================================
> > *** /dev/null	1970-01-01 00:00:00.000000000 +0000
> > --- trunk/gcc/testsuite/gfortran.dg/vect/fast-math-vect-stride-1.f90	2009-01-23 16:48:50.000000000 +0100
> > ***************
> > *** 0 ****
> > --- 1,17 ----
> > + ! { dg-do compile }
> > + 
> > + subroutine to_product_of(self,a,b)
> > +   real(kind=8), dimension(:,:) :: self
> > +   real(kind=8), dimension(:,:), intent(in) :: a, b
> > +   integer(kind=kind(1)) :: dim1, dim2
> > +   dim1 = size(self,1)
> > +   dim2 = size(self,2)
> > +   do i = 1,dim1
> > +     do j = 1,dim2
> > +       self(i,j) = sum(a(i,:)*b(:,j))
> > +     end do
> > +   end do
> > + end subroutine
> > + 
> > + ! { dg-final { scan-tree-dump "vectorized 1 loop" "vect" } }
> > + ! { dg-final { cleanup-tree-dump "vect" } }
> > Index: trunk/gcc/Makefile.in
> > ===================================================================
> > *** trunk.orig/gcc/Makefile.in	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/Makefile.in	2009-01-23 16:48:50.000000000 +0100
> > *************** tree-nrv.o : tree-nrv.c $(CONFIG_H) $(SY
> > *** 2135,2141 ****
> >   tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
> >      $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h $(DIAGNOSTIC_H) \
> >      $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
> > !    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $(FLAGS_H)
> >   tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H) \
> >      $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h \
> >      $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
> > --- 2135,2142 ----
> >   tree-ssa-copy.o : tree-ssa-copy.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
> >      $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h $(DIAGNOSTIC_H) \
> >      $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h $(TREE_DUMP_H) \
> > !    $(BASIC_BLOCK_H) tree-pass.h langhooks.h tree-ssa-propagate.h $(FLAGS_H) \
> > !    $(CFGLOOP_H)
> >   tree-ssa-propagate.o : tree-ssa-propagate.c $(TREE_FLOW_H) $(CONFIG_H) \
> >      $(SYSTEM_H) $(RTL_H) $(TREE_H) $(TM_P_H) $(EXPR_H) $(GGC_H) output.h \
> >      $(DIAGNOSTIC_H) $(FUNCTION_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
> > Index: trunk/gcc/tree-ssa-copy.c
> > ===================================================================
> > *** trunk.orig/gcc/tree-ssa-copy.c	2009-01-23 16:47:53.000000000 +0100
> > --- trunk/gcc/tree-ssa-copy.c	2009-01-23 16:48:50.000000000 +0100
> > *************** along with GCC; see the file COPYING3.
> > *** 37,42 ****
> > --- 37,43 ----
> >   #include "tree-pass.h"
> >   #include "tree-ssa-propagate.h"
> >   #include "langhooks.h"
> > + #include "cfgloop.h"
> >   
> >   /* This file implements the copy propagation pass and provides a
> >      handful of interfaces for performing const/copy propagation and
> > *************** init_copy_prop (void)
> > *** 991,997 ****
> >             tree def;
> >   
> >   	  def = gimple_phi_result (phi);
> > ! 	  if (!is_gimple_reg (def))
> >               prop_set_simulate_again (phi, false);
> >   	  else
> >               prop_set_simulate_again (phi, true);
> > --- 992,1004 ----
> >             tree def;
> >   
> >   	  def = gimple_phi_result (phi);
> > ! 	  if (!is_gimple_reg (def)
> > ! 	      /* In loop-closed SSA form do not copy-propagate through
> > ! 	         PHI nodes.  Technically this is only needed for loop
> > ! 		 exit PHIs, but this is difficult to query.  */
> > ! 	      || (current_loops
> > ! 		  && gimple_phi_num_args (phi) == 1
> > ! 		  && loops_state_satisfies_p (LOOP_CLOSED_SSA)))
> >               prop_set_simulate_again (phi, false);
> >   	  else
> >               prop_set_simulate_again (phi, true);
> 
> Richard,
>     Do you have an updated version of this patch which would apply against
> current gcc trunk?

Probably not.

> Also, did this ever go into any of the branches?

No.

The basic issue with the patch is that it unconditionally applies the
versioning for stride == 1.  This pessimizes code-size unconditionally,
so it at least needs to be guarded by a flag, optimally it should be
driven by profile-feedback.

Richard.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
  2009-01-25 13:15 ` Richard Guenther
@ 2009-01-25 17:03   ` Dominique Dhumieres
  0 siblings, 0 replies; 11+ messages in thread
From: Dominique Dhumieres @ 2009-01-25 17:03 UTC (permalink / raw)
  To: rguenther, dominiq; +Cc: gcc-patches

> Can you file a bugreport please?

It is pr38968.

Dominique

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
  2009-01-25 12:12 Dominique Dhumieres
@ 2009-01-25 13:15 ` Richard Guenther
  2009-01-25 17:03   ` Dominique Dhumieres
  0 siblings, 1 reply; 11+ messages in thread
From: Richard Guenther @ 2009-01-25 13:15 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: gcc-patches

On Sun, 25 Jan 2009, Dominique Dhumieres wrote:

> Richard,
> 
> > This patch adds the capability to the vectorizer to perform versioning
> > for the case of a constant (suitable) stride.
> 
> I have applied the patch on i686-apple-darwin9 (Core2 2.1Ghz, 4Mb cache, 
> 2Gb RAM). It regtested without regression. However the following test:
> 
> program mymatmul
>   implicit none
>   integer, parameter :: n = 2000
>   real, dimension(n,n) :: rr, ri
>   complex, dimension(n,n) :: a,b,c
>   real :: t1, t2
>   integer :: i, j, k
> 
>   call random_number (rr)
>   call random_number (ri)
>   a = cmplx (rr, ri)
>   call random_number (rr)
>   call random_number (ri)
>   b = cmplx (rr, ri)
> 
>   call cpu_time (t1)
> 
>   c = cmplx (0., 0.)
>   do j = 1, n
>      do k = 1, n
> 	do i = 1, n
> 	   c(i,j) = c(i,j) + a(i,k) * b(k,j)
> 	end do
>      end do
>   end do
> 
>   call cpu_time (t2)
>   write (*,'(F8.4)') t2-t1
> 
> end program mymatmul
> 
> did not vectorize:

We should be able to handle that I think.  Can you file a bugreport
please?  It should be the same as vectorizing simply

 j = random
 k = random
 tmp = b(k,j)
 do i = 1, n
   c(i,j) = c(i,j) + a(i,k) * tmp
 end do

it works with real data though.

> I can only report some timing with the polyhedron test suite:

Thanks.

> The timing shows a ~10% improvement for capacita.f90 compensated by a ~10% 
> degradation for fatigue.f90. All the other times are within the noise.
> 
> Thanks for the patch.
> 
> Dominique
> 
> PS Most of the time in capacita and tfft is spent in FFT subroutines that 
> are not vectorized. Anything that can be done to change that?

Not easily I guess.  From looking at SPEC 2006 tonto which I did recently
I noticed that GFortran inserts many temporaries for intrinsics, which
automatically makes vectorization harder (or at least no longer the
bottleneck).  Like for

  x = sum (a(:,i)*b(:,i))

where it puts a(:,i)*b(:,i) into a temporary array before doing the
reduction via libgfortran.  But well, this is something that should
be addressed by using middle-end arrays once I manage to spend some
time on that project again.

Richard.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH][RFC] Add versioning for constant strides for  vectorization
@ 2009-01-25 12:12 Dominique Dhumieres
  2009-01-25 13:15 ` Richard Guenther
  0 siblings, 1 reply; 11+ messages in thread
From: Dominique Dhumieres @ 2009-01-25 12:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: rguenther

Richard,

> This patch adds the capability to the vectorizer to perform versioning
> for the case of a constant (suitable) stride.

I have applied the patch on i686-apple-darwin9 (Core2 2.1Ghz, 4Mb cache, 
2Gb RAM). It regtested without regression. However the following test:

program mymatmul
  implicit none
  integer, parameter :: n = 2000
  real, dimension(n,n) :: rr, ri
  complex, dimension(n,n) :: a,b,c
  real :: t1, t2
  integer :: i, j, k

  call random_number (rr)
  call random_number (ri)
  a = cmplx (rr, ri)
  call random_number (rr)
  call random_number (ri)
  b = cmplx (rr, ri)

  call cpu_time (t1)

  c = cmplx (0., 0.)
  do j = 1, n
     do k = 1, n
	do i = 1, n
	   c(i,j) = c(i,j) + a(i,k) * b(k,j)
	end do
     end do
  end do

  call cpu_time (t2)
  write (*,'(F8.4)') t2-t1

end program mymatmul

did not vectorize:

[ibook-dhum] bug/timing% gfc -m64 -O3 -ffast-math -funroll-loops 
-fomit-frame-pointer -ftree-vectorizer-verbose=2 mymatmul_db.f90

mymatmul_db.f90:24: note: not vectorized: can't calculate alignment 
for data ref.
mymatmul_db.f90:14: note: not vectorized: complicated access pattern.
mymatmul_db.f90:14: note: not vectorized: can't calculate alignment 
for data ref.
mymatmul_db.f90:11: note: not vectorized: complicated access pattern.
mymatmul_db.f90:11: note: not vectorized: can't calculate alignment 
for data ref.
mymatmul_db.f90:1: note: vectorized 0 loops in function.

Is it expected?

> I didn't yet performance test this extensively, but it might need
> cost-model adjustments and/or need to wait until we have profile feedback
> to properly seed vectorizer analysis here.  A micro-benchmark based on
> the above loop shows around 15% improvement on AMD K10.

I can only report some timing with the polyhedron test suite:

================================================================================
Test Name       : pbharness
Compile Command : gfc %n.f90 -m64 -O3 -ffast-math -funroll-loops 
                             -ftree-loop-linear -fomit-frame-pointer 
                             -finline-limit=600 --param min-vect-loop-bound=2 
                             -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct 
                  linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

Date & Time     : 21 Jan 2009 14:06:52          24 Jan 2009  9:16:15 (patched)

  Bench.  Comp.  Exec.  Ave Run  #   Estim    Comp.    Exec. Ave Run   #  Estim
    Name (secs) (bytes)  (secs) Run  Err %   (secs)  (bytes)  (secs) Run  Err %
-------- ------ ------- ------- --- ------   ------ -------- ------- --- ------
      ac   2.33   42560   12.27   2 0.0081     2.51    42560   12.43   5 0.3163
  aermod  86.99 1270544   29.94   3 0.1371    92.59+ 1331976+  30.08   3 0.1636
     air   5.60   77336    8.40   2 0.0060     5.49    77336    8.35   2 0.0060
capacita   3.46   72760   55.41   2 0.0794     5.41+  105528+  51.79-  2 0.1690
 channel   2.11   38648    2.26   2 0.0442     2.13    38648    2.28   5 0.0683
   doduc  11.65  200024   43.07   2 0.0441    11.67   200024   42.97   2 0.0093
 fatigue   5.13   89024   10.78   5 0.3519     4.95    89024   11.87+  5 0.3516
 gas_dyn   6.45  708584   10.32   5 0.3332     6.51   708584   10.28   5 0.7988
  induct  10.03  181168   34.37   2 0.1222    10.37   181168   34.30   2 0.0087
   linpk   1.64   42536   27.63   2 0.0290     1.54    42536   27.67   2 0.0397
    mdbx   3.37   73000   14.74   2 0.0000     3.29    73000   14.80   2 0.0169
      nf  24.10  161416   31.91   2 0.0627    18.61-  140936-  32.06   2 0.0764
 protein  10.55  126424   47.05   2 0.0000    10.34   126424   46.24   3 0.1754
  rnflow  11.09  179616   36.14   2 0.0982    13.15+  191904+  36.61   2 0.1065
test_fpu  10.16  166512   12.39   2 0.0403    10.05   162416-  12.43   2 0.1006
    tfft   1.14   26432    2.82   2 0.0177     1.15    26432    2.84   3 0.1960

Geom. Mean Exec. Time =   17.01s                               17.07s

================================================================================
Polyhedron Benchmark Validator
Copyright (C) Polyhedron Software Ltd - 2004 - All rights reserved

The timing shows a ~10% improvement for capacita.f90 compensated by a ~10% 
degradation for fatigue.f90. All the other times are within the noise.

Thanks for the patch.

Dominique

PS Most of the time in capacita and tfft is spent in FFT subroutines that 
are not vectorized. Anything that can be done to change that?

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2010-03-15  9:53 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-23 18:03 [PATCH][RFC] Add versioning for constant strides for vectorization Richard Guenther
2009-01-25 14:08 ` Ira Rosen
2009-01-27 12:42 ` Dorit Nuzman
2009-01-27 12:55   ` Richard Guenther
2009-01-27 13:01     ` Dorit Nuzman
2009-01-27 13:46       ` Richard Guenther
2010-03-13 19:19 ` Jack Howarth
2010-03-15 10:38   ` Richard Guenther
2009-01-25 12:12 Dominique Dhumieres
2009-01-25 13:15 ` Richard Guenther
2009-01-25 17:03   ` Dominique Dhumieres

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).