From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-192275-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 14142 invoked by alias); 8 Apr 2007 14:37:55 -0000
Received: (qmail 14129 invoked by uid 22791); 8 Apr 2007 14:37:53 -0000
X-Spam-Check-By: sourceware.org
Received: from mtagate7.de.ibm.com (HELO mtagate7.de.ibm.com) (195.212.29.156)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Sun, 08 Apr 2007 15:37:49 +0100
Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) 	by mtagate7.de.ibm.com (8.13.8/8.13.8) with ESMTP id l38Ebkew224788 	for <gcc-patches@gcc.gnu.org>; Sun, 8 Apr 2007 14:37:46 GMT
Received: from d12av04.megacenter.de.ibm.com (d12av04.megacenter.de.ibm.com [9.149.165.229]) 	by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l38EbkT84165734 	for <gcc-patches@gcc.gnu.org>; Sun, 8 Apr 2007 16:37:46 +0200
Received: from d12av04.megacenter.de.ibm.com (loopback [127.0.0.1]) 	by d12av04.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l38Ebj78010475 	for <gcc-patches@gcc.gnu.org>; Sun, 8 Apr 2007 16:37:46 +0200
Received: from d12mc102.megacenter.de.ibm.com (d12mc102.megacenter.de.ibm.com [9.149.167.114]) 	by d12av04.megacenter.de.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l38EbjUF010472; 	Sun, 8 Apr 2007 16:37:45 +0200
In-Reply-To: <C98D5BFDB76BA04EA54EF07896B1BDD40103D576@SAUSEXMB2.amd.com>
Subject: Re: [patch] Vectorizer cost model implementation
To: "Linthicum, Tony" <tony.linthicum@amd.com>
Cc: "GCC Patches" <gcc-patches@gcc.gnu.org>,         "Jagasia, Harsha" <harsha.jagasia@amd.com>
X-Mailer: Lotus Notes Release 7.0 HF277 June 21, 2006
Message-ID: <OF9F10D891.56E7C8DD-ONC22572B7.004FACB6-C22572B7.00508084@il.ibm.com>
From: Dorit Nuzman <DORIT@il.ibm.com>
Date: Sun, 08 Apr 2007 14:37:00 -0000
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
X-IsSubscribed: yes
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
X-SW-Source: 2007-04/txt/msg00362.txt.bz2

> Hello All,
>
> This patch contains the initial implementation for a cost model for the
> vectorizer.

Thanks Tony for this work, and Harsha for taking it on!
As I told Tony off-line, I think this is an excellent start.  With this,
people can start experimenting and tuning the cost model on different
targets, and gradually develop the required target-specific APIs, and also
see how to generalize this for the use of other loop transformations.

Here are some comments to the patch:

> Index: gcc/testsuite/gcc.dg/vect/vect.exp
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect.exp  (revision 123477)
> +++ gcc/testsuite/gcc.dg/vect/vect.exp  (working copy)
> @@ -23,7 +23,7 @@
>  set DEFAULT_VECTCFLAGS ""
>
>  # These flags are used for all targets.
> -lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize"
> +lappend DEFAULT_VECTCFLAGS "-O2" "-ftree-vectorize"
"-fno-vect-cost-model"

This is discussed separately (consider not to disable the cost-model by
default, and instead xfail the tests that fail to vectorize with cost model
enabled (if there's not too many of them), duplicating them etc...).


> @@ -2216,6 +2219,7 @@
>       only over initial loops skipping newly generated ones.  */
>    FOR_EACH_LOOP (li, loop, 0)
>      {
> +      int required_iters;

What do you think about renaming this variable to something like
"min_profitable_iters", or "estimated_min_iters" or anyhow something that
would indicate that it's about estimated profitability and not something
that's really inherently required?


> @@ -2225,7 +2229,25 @@
>        if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
>     continue;
>
> +      required_iters = vect_model_required_iters (loop_vinfo);
> +      if (required_iters < 0)
> +   {
> +     if (vect_print_dump_info (REPORT_DETAILS))
> +       fprintf (vect_dump, "Vector body less profitable than
scalar.\n");
> +     continue;
> +   }
> +
> +      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) &&
> +     LOOP_VINFO_INT_NITERS (loop_vinfo) < (unsigned) required_iters)
> +       {
> +           if (vect_print_dump_info (REPORT_DETAILS))
> +           fprintf (vect_dump,
> +                  "Not vectorized: min iters: %d, actual: %lu\n",
> +                  required_iters, LOOP_VINFO_INT_NITERS (loop_vinfo));
> +         continue;
> +       }
>        vect_transform_loop (loop_vinfo);
> +
>        num_vectorized_loops++;
>      }
>    vect_loop_location = UNKNOWN_LOC;

One comment is that we use REPORT_UNVECTORIZED_LOOPS (rather then
REPORT_DETAILS) to guard the last printout that explains why a loop wasn't
vectorized. The idea is that when you use -fdump-tree-vect-stats (or
-ftree-vectorizer-verbose=2) you'll get one printout per loop explaining
why it wasn't vectorized. I think that with the current patch, if a loop
doesn't get vectorized because of the cost model, you'll see an explanation
only when using -fdump-tree-vect-details (or -ftree-vectorizer-verbose=7).
So you can either change REPORT_DETAILS to REPORT_UNVECTORIZED_LOOPS in the
code above, or leave it as is and add another, more user friedly, printout
under REPORT_UNVECTORIZED_LOOPS (e.g. "not vectorized: vectorization not
profitable").

The other comment is that this piece of code would be better placed
together with the last part of the function vect_analyze_operations, where
we already do several similar loop-bound related checks (if the loop-bound
is known, we compare it against the VF or whatever the user specified via
--param min-vect-loop-bound, and other checks). So, please move it there,
or, if you want, factor all of this out to a separate function, to be
called after vect_analyze_operations.

This also reminds me that we need to consider what to do if the user asked
to vectorize when the loop-bound is greater than X (via --param
min-vect-loop-bound), but the cost model determined that it would be
profitable when the loop-bound is greater than Y, and X != Y. The --param
min-vect-loop-bound was originally added only until such time that we have
a cost model, but maybe we want to keep it around, at least until the cost
model has been tuned/matured. In the meantime we can decide to do one of
the following:
(1) always go with the user
(2) always go with the cost model
(3) go with the user if the user is more conservative than the cost-model
(i.e. X > Y)
(4) go with the user if the user is less conservative than the cost-model
(i.e. X < Y)

The first option is equivalent to turning off the cost model with
-fno-vect-cost-model, so I'd go with one of 2,3,4.


> +
> +  /* Vectorization costs associated with statement */

According to the coding conventions comments should be treated like real
sentences, so there should be a period at the end. Also, coding conventions
ask for two spaces before end of the comment. (There are a few other
occurrences of that in the patch).


> +extern int  vect_model_required_iters (loop_vec_info);

same comment as I made above - how about
'vect_estimate_min_profitable_iters'? (or something shorter in that spirit)


> +fno-vect-cost-model
> +Common Report Var(flag_no_vect_cost_model) Optimization
> +Disable use of cost model in vectorization
> +

As others already commented - the flag should be introduced as
-fvect-cost-model, and documented in the texi files.


> +  /* Requires loop versioning tests.  */
> +  if (LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo))
> +    vec_outside_cost += TARG_COND_BRANCH_COST;

We can try to be a bit more "accurate" here and have the cost depend on the
number of stmts in the may_misalign list (the more datarefs in the list,
the heavier is the test we create).  Obviously, there are a lot of
improvements/fine-tuning like that that can be done, and these can be done
later.  For now, I would just ask that we record this in the code (add a
FIXME comment here).


> +  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    vec_outside_cost += TARG_COND_BRANCH_COST;
...
> +  /* Requires guard code for peeling.  */
> +  if (LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo))
> +    vec_outside_cost += TARG_COND_BRANCH_COST;

In the smae spirit of the above comment - we actually generate 2 guards for
each peel-loop that we create, so 2*TARG_COND_BRANCH_COST would be more
"accurate".


> +  /* Count statements in scalar loop.  Using this as scalar cost for a
single
> +     iteration for now.  */
> +  for (i = 0; i < nbbs; i++)
> +    {
> +      block_stmt_iterator si;
> +      basic_block bb = bbs[i];
> +
> +      for (si = bsi_start (bb); !bsi_end_p (si); bsi_next (&si))
> +   {
> +     tree stmt = bsi_stmt (si);
> +          stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +     if (!STMT_VINFO_RELEVANT_P (stmt_info)
> +         && !STMT_VINFO_LIVE_P (stmt_info))
> +       continue;
> +     scalar_single_iter_cost++;
> +     vec_inside_cost += STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info);
> +     vec_outside_cost += STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info);
> +   }
> +    }

This is more a note for myself - to remember to update this code when
looking at outer-loop vectorization (the stmts in the inner-loop execute
more times than the rest of the stmts). More adjustments would probably be
needed for outer-loop support.


> +  /* Add in costs of peeled instructions, if any.  */
> +  peel_iters = LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo);
> +  vec_outside_cost += peel_iters * scalar_single_iter_cost;
> +

You'll have a problem here if the number of iterations to be peeled is not
known at compile time... in this case LOOP_PEELING_FOR_ALIGNEMNT is
negative, and the actual number of iterations is computed at run time.  So
in that case you also need to build an expression that would be evaluated
at runtime. I understand that at this point this patch handles only the
case that the number of iterations in known, so until it is extended you
just want to check if LOOP_PEELING_FOR_ALIGNMENT > 0.


> +static void
> +vect_model_reduction_cost (stmt_vec_info stmt_info, enum tree_code
reduc_code,
> +                  int ncopies)
> +{

As discussed with Tony off-line - there's a close dependency between this
function and the reduction transformation functions, as is the case with
all the vect_model_*_cost functions - they all depend on how the respective
vectorizable_* functions are implemented. At some point we may want to
revisit this and see how to better design it.


> +  /* Add in cost for intial definition.  */

typo - initial


> +  /* Determine cost of epilog code.  */
> +  if (reduc_code < NUM_TREE_CODES)
> +    /* We have a reduction operator that will reduce the vector in one
> +       statement.  Also requires scalar extract.  */
> +    outer_cost += 2 * TARG_VEC_STMT_COST;

Did you want to use TARG_VEC_TO_SCALAR_COST for the scalar extract? (As
pointed out by others - it is not documented in tree-vectorizer.h).


> +static void
> +vect_model_simple_cost (stmt_vec_info stmt_info, int ncopies)
> +{
> +  STMT_VINFO_INSIDE_OF_LOOP_COST (stmt_info) = ncopies *
TARG_VEC_STMT_COST;
> +}
> +
> +/* Function vect_cost_strided_group_size
> +
> +   For a load or store, returns the group_size if it's the last store,
and 1
> +   if it is not.  */
> +static int
> +vect_cost_strided_group_size (stmt_vec_info stmt_info)
> +{

This is more of a convention within the tree-vect-* files: we keep 2 empty
lines between functions, and one empty line between the description of the
function and the function itself (you have kept this convention almost
everywhere else, but these are a couple such occurrences here and there).


> +/* Function vect_cost_strided_group_size
> +
> +   For a load or store, returns the group_size if it's the last store,
and 1
> +   if it is not.  */
> +static int
> +vect_cost_strided_group_size (stmt_vec_info stmt_info)
> +{
> +  tree first_stmt = DR_GROUP_FIRST_DR (stmt_info);
> +  stmt_vec_info first_info = vinfo_for_stmt (first_stmt);
> +  int group_size = DR_GROUP_SIZE (first_info);
> +  STMT_VINFO_MEM_COUNT_FOR_COST (first_info)++;
> +
> +  /* Is it the last in the group?  */
> +  if (STMT_VINFO_MEM_COUNT_FOR_COST (first_info) < group_size)
> +    group_size = 1;
> +
> +  return group_size;
> +}

I think there's no need to keep track of whether this is the last
load/store in a group or not. it doesn't matter for which of the
stores/loads we return the actual group_size, as long as we do it exactly
once per group. So I think you could rewrite it as follows (this way you
can also get rid of the field MEM_COUNT_FOR_COST):

   static int
   vect_cost_strided_group_size (stmt_vec_info stmt_info)
   {
     tree first_stmt = DR_GROUP_FIRST_DR (stmt_info);

     if (first_stmt == STMT_VINFO_STMT (stmt_info))
       return DR_GROUP_SIZE (stmt_info);

     return 1;
   }


> +  /* Is this the last access in a group of stores providing strided
acess?

typo - access. Appears a couple of times.


> +  if (group_size > 1)
> +    {
> +      /* Uses a high and low interleave operation for each needed
permute.  */
> +      cost = ncopies * exact_log2(group_size) * group_size *
> +   TARG_VEC_TO_SCALAR_COST;
> +    }

why are you using TARG_VEC_TO_SCALAR_COST? Again, it's not documented, so I
don't know what it's meant to represent, but permute/shuffle like
operations do not produce a scalar from a vector. Looks like
TARG_VEC_STMT_COST is more appropriate here?


> +      /* Uses a high and low interleave operation for each needed
permute.  */
> +      inner_cost = ncopies * exact_log2(group_size) * group_size *
> +   TARG_VEC_STMT_COST;

For the loads we actually use even and odd extract operations, not
interleave operations (and the cost for both should probably be the same)


> +    case dr_unaligned_software_pipeline:
> +      {
> +   int outer_cost;
> +
> +   /* Unaligned software pipeline has a load of an address, an intial

typo - initial


> +      load, and possibly a mask operation to "prime" the loop.  Inside
> +      the loop, there is a load op and a realignment op.  */
> +   outer_cost = 2 * TARG_VEC_STMT_COST;
> +   if (targetm.vectorize.builtin_mask_for_load)
> +     outer_cost += TARG_VEC_STMT_COST;
> +   STMT_VINFO_OUTSIDE_OF_LOOP_COST (stmt_info) = outer_cost;

if this is a group of loads (when vectorizing a strided access) - looks
like the outer-cost will be added by each load in the group? (it should be
added only once per group)


> +
> +   inner_cost += (2*ncopies) * TARG_VEC_LOAD_COST;

so you're using TARG_VEC_LOAD_COST also for the cost of the realignment op?
I think is should be a TARG_VEC_STMT_COST.


thanks!

dorit


> The patch is currently missing loop versioning for
> dynamically trip counted loops and some cost model specific tests.  My
> colleague, Harsha Jagasia, will be providing follow on patches for those
> things shortly.
>
> Thank you.
>
> Tony
>
>
> [attachment "patch" deleted by Dorit Nuzman/Haifa/IBM]