From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-206408-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 17379 invoked by alias); 24 Oct 2007 04:56:04 -0000
Received: (qmail 17370 invoked by uid 22791); 24 Oct 2007 04:56:02 -0000
X-Spam-Check-By: sourceware.org
Received: from mtagate4.de.ibm.com (HELO mtagate4.de.ibm.com) (195.212.29.153)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Wed, 24 Oct 2007 04:55:58 +0000
Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) 	by mtagate4.de.ibm.com (8.13.8/8.13.8) with ESMTP id l9O4ttJZ126918 	for <gcc-patches@gcc.gnu.org>; Wed, 24 Oct 2007 04:55:55 GMT
Received: from d12av04.megacenter.de.ibm.com (d12av04.megacenter.de.ibm.com [9.149.165.229]) 	by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v8.5) with ESMTP id l9O4tt2X2269188 	for <gcc-patches@gcc.gnu.org>; Wed, 24 Oct 2007 06:55:55 +0200
Received: from d12av04.megacenter.de.ibm.com (loopback [127.0.0.1]) 	by d12av04.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l9O4tseK010302 	for <gcc-patches@gcc.gnu.org>; Wed, 24 Oct 2007 06:55:54 +0200
Received: from d12mc102.megacenter.de.ibm.com (d12mc102.megacenter.de.ibm.com [9.149.167.114]) 	by d12av04.megacenter.de.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l9O4tsVZ010298; 	Wed, 24 Oct 2007 06:55:54 +0200
In-Reply-To: <D5B24B5251882048AD03DDFA431BB790012F0472@SAUSEXMB3.amd.com>
Subject: RE: [PATCH,vect] ppc cost model options
To: "Jagasia, Harsha" <harsha.jagasia@amd.com>
Cc: "David Edelsohn" <dje@watson.ibm.com>, gcc-patches@gcc.gnu.org,         janis187@us.ibm.com
X-Mailer: Lotus Notes Release 7.0 HF277 June 21, 2006
Message-ID: <OFB6E28704.A9EE00F8-ON8825737D.00781955-8825737E.001B5525@il.ibm.com>
From: Dorit Nuzman <DORIT@il.ibm.com>
Date: Wed, 24 Oct 2007 08:43:00 -0000
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
X-IsSubscribed: yes
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
X-SW-Source: 2007-10/txt/msg01393.txt.bz2

"Jagasia, Harsha" <harsha.jagasia@amd.com> wrote on 19/10/2007 16:38:45:

...
> >vectorized and such) because the vectorizer cost-model is pending some
> >fixes from Harsha
> (http://gcc.gnu.org/ml/gcc-patches/2007-09/msg00916.html
> >), to be followed by target-specific tuning of costs.)
>
> Here is the patch implementing the early cost model check.
>

great!

> The check is added as follows:
> - If we do versioning, the threshold comparison is and'ed to the guard
> that controls the versioning,
> - Otherwise, if we do peeling for alignment, we can determine the
> loop-count of the prolog loop according to the threshold test
> - Otherwise, the threshold comparison is or'ed with the guard that
> controls whether to bypass the vector code and go straight to the scalar
> epilogue.
>

The only technical issue I have is with the fact that I think we're not
taking into account that because of the presence of the runtime-test, there
are costs that should be added also to the scalar side of the equasion
(branch costs), and these costs are different in each of the above 3 cases.
(This is something we don't need to worry about when we make the decision
at compile time).  For example, in case 2, even if we choose to execute the
scalar loop we pay the cost of 2 branches that did not exist in the
original code; in case 3 we pay the cost of 1-3 branches (depending on
whether we also peel for alignment and if so, if we also enter the prolog
loop). (We also pay for these branches on the vector side of the equasion,
but this is already taken into account). So we actually have 4 slighlty
different equasions:

equasion for compile-time test:
      scalar_costs >= vector_costs

equasion for run-time test, case 1:
      scalar_costs
      + versioning_guard_cost >= vector_costs

equasion for run-time test, case 2:
      scalar_costs
      + cost_of_guard_for_prolog_peel >= vector_costs

equasion for run-time test, case 3:
      scalar_costs
      + cost_of_guard_for_prolog_peel
      + cost_of_guard_for_epilog_peel >= vector_costs

Do you think we can add these 3 different fixups somehow, depending on
where the run-time test is actually placed? (this can be important for
platforms like the SPU, where branches can be very costly, such that we may
choose to vectorize the less efficient vector version and not pay the
branch cost involved with branching to skip the vector code).

The rest are just various style issues:

>  * tree-vect-transform.c (vect_do_peeling_for_loop_bound): Computes
>  the treshold if the cost model check has not been done already.

treshold -> threshold

> Index: tree-vectorizer.c
> ===================================================================
> --- tree-vectorizer.c (revision 129356)
> +++ tree-vectorizer.c (working copy)
...
>  slpeel_tree_peel_loop_to_edge (struct loop *loop,
>            edge e, tree first_niters,
>            tree niters, bool update_first_loop_count,
> -          unsigned int th)
> +          unsigned int th, bool check)

Please document the new function argument "check" (I see that a
documentation for "th" is also missing; if you feel like it, please add
that too while you're at it...)

Actually, maybe we can come up with something more descriptive than
"check"?  e.g. "check_profitability", "add_cost_model_check"...

> -  /* 2. Add the guard that controls whether the first loop is executed.
> -        Resulting CFG would be:
> +  /* 2.a Add the guard that controls whether the first loop is executed.
> +         Resulting CFG would be:
>
> -        bb_before_first_loop:
> -        if (FIRST_NITERS == 0) GOTO bb_before_second_loop
> -                               GOTO first-loop
> +         bb_before_first_loop:
> +         if (FIRST_NITERS == 0) GOTO bb_before_second_loop
> +                                GOTO first-loop
>
> -        first_loop:
> -        do {
> -        } while ...
> +         first_loop:
> +         do {
> +         } while ...
>
> -        bb_before_second_loop:
> +         bb_before_second_loop:
>
> -        second_loop:
> -        do {
> -        } while ...
> +         second_loop:
> +         do {
> +         } while ...
>
> -        orig_exit_bb:
> -   */
> +         orig_exit_bb:
> +  */
> +
> +
> +  /* 2.b Add the cost model check that allows the prologue
> +         to iterate for the entire unchanged scalar
> +         iterations of the loop in the event that the cost
> +         model indicates that the scalar loop is more
> +         profitable than the vector one.
> +
> +         Resulting CFG after prologue peeling would be:
> +
> +         if (scalar_loop_iterations <= th)
> +           FIRST_NITERS = scalar_loop_iterations
> +
> +         bb_before_first_loop:
> +         if (FIRST_NITERS == 0) GOTO bb_before_second_loop
> +                                GOTO first-loop
> +
> +         first_loop:
> +         do {
> +         } while ...
> +
> +         bb_before_second_loop:
> +
> +         second_loop:
> +         do {
> +         } while ...
> +
> +         orig_exit_bb:
> +  */
> +
> +  /* 2.c Add the cost model check that allows the epilogue
> +         to iterate for the entire unchanged scalar
> +         iterations of the loop in the event that the cost
> +         model indicates that the scalar loop is more
> +         profitable than the vector one.
> +
> +         Resulting CFG after prologue peeling would be:
> +
> +         bb_before_first_loop:
> +         if ((scalar_loop_iterations <= th)
> +             ||
> +             FIRST_NITERS == 0) GOTO bb_before_second_loop
> +                                GOTO first-loop
> +
> +         first_loop:
> +         do {
> +         } while ...
> +
> +         bb_before_second_loop:
> +
> +         second_loop:
> +         do {
> +         } while ...
> +
> +         orig_exit_bb:
> +  */
>

Just so it's clear that this function does only one of the above per
invocation, maybe write: "2. Add the control/guard code in one of the
following ways: 2.a. ..."

Also, I was wondering if it would be clearer if the description was more
compact - i.e. instead of duplicating 3 times the pseudo-code that remains
the same in all 3 alternatives, do something along these lines:
"
2. Add the control/guard code. Resulting CFG would be:

         bb_before_first_loop:
         if (COND) GOTO bb_before_second_loop
                   GOTO first-loop

         first_loop:
         do {
         } while ...

         bb_before_second_loop:

         second_loop:
         do {
         } while ...

         orig_exit_bb:

The following 3 variants are supported:
2a. The guard controls whether the first loop is executed.
       COND: FIRST_NITERS == 0
2b. Add the cost model check that allows the prologue to iterate....
      if (scalar_loop_iterations <= th)
        FIRST_NITERS = scalar_loop_iterations
      COND: FIRST_NITERS == 0
2c. Add the cost model check that allows the epilogue to iterate...
      COND: (scalar_loop_iterations <= th) || FIRST_NITERS == 0
"

... but in the end I'm not sure which way is clearer, so I'll leave the
decision to you.

A general comment about the changes to slpeel_tree_peel_loop_to_edge: I
wonder if this function would be taken out of the vectorizer one day and
merged with the generic peeling utils. In any case, it would be desirable
if instead of manipulating the condition inside the function, the function
would get the condition as an input? (even if it ends up remaining a
vectorizer specific function - I think this will be a good thing to do). If
this looks like too much work, then please consider factoring out the new
code that builds the condition in 3 different ways, or at least just the
part that creates the if-stmt for the "Peologue peeling case:

> +  /* Epilogue peeling.  */
> +  if (!update_first_loop_count)
> +    {
...
> +    }
> +  /* Prologue peeling.  */
> +  else
> +    {
> +      if (check)
> +         first_niters = conditionally_reset_first_niters (...)
> +
...
> +    }


> Index: tree-vect-transform.c
> ===================================================================
> --- tree-vect-transform.c (revision 129356)
> +++ tree-vect-transform.c (working copy)
...
> @@ -6482,28 +6483,35 @@ vect_do_peeling_for_loop_bound (loop_vec
>
>    loop_num  = loop->num;
>
> -  /* Analyze cost to set threshhold for vectorized loop.  */

threshhold --> threshold

...
> +      /* Analyze cost to set threshhold for vectorized loop.  */
> +      min_profitable_iters = LOOP_VINFO_COST_MODEL_MIN_ITERS
(loop_vinfo);
>

about the comment - we don't "Analyze cost" here, just use the already
precalculcated threshold. (it also repeats in other places).

> -  th = (unsigned) min_scalar_loop_bound;
> -  if (min_profitable_iters
> -      && (!min_scalar_loop_bound
> -          || min_profitable_iters > min_scalar_loop_bound))
> -    th = (unsigned) min_profitable_iters;
> +      min_scalar_loop_bound = ((PARAM_VALUE (PARAM_MIN_VECT_LOOP_BOUND)
> +    * LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 1);
>
> -  if (((LOOP_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)
> -      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> -      && vect_print_dump_info (REPORT_DETAILS))
> -    fprintf (vect_dump, "vectorization may not be profitable.");
> +      /* Use the cost model only if it is more conservative than
> user specified
> +  threshold.  */
> +      th = (unsigned) min_scalar_loop_bound;
> +      if (min_profitable_iters
> +   && (!min_scalar_loop_bound
> +       || min_profitable_iters > min_scalar_loop_bound))
> + th = (unsigned) min_profitable_iters;
> +
> +      if (th && vect_print_dump_info (REPORT_DETAILS))
> + fprintf (vect_dump, "loop bound peeling: vectorization may not be
> profitable.");
> +    }

The above code repeats almost identically 3 times: here, in
vect_do_peeling_for_alignment and in vect_loop_versioning. Please it factor
out...

> @@ -7031,7 +7078,11 @@ vect_create_cond_for_alias_checks (loop_
>     data references that may or may not be aligned.  An additional
>     sequence of runtime tests is generated for each pairs of DDRs whose
>     independence was not proven.  The vectorized version of loop is
> -   executed only if both alias and alignment tests are passed.  */
> +   executed only if both alias and alignment tests are passed.
> +
> +   The test generated to check which version of loop is executed
> +   is modified to check for profitability as indicated by the

--> "is modified to *also* check for profitability", right?

thanks,
dorit


> The patch bootstrapped and passed test on amd64-linux.
>
> Ok for trunk?
>
> >
> >> Thanks, David
> >>
> >>      * gcc.dg/vect/costmodel/ppc/ppc-costmodel-vect.exp: Use
> >>    -mcpu=970 instead of 7400.
> >>
> >> Index: ppc-costmodel-vect.exp
> >> ===================================================================
> >> --- ppc-costmodel-vect.exp   (revision 129469)
> >> +++ ppc-costmodel-vect.exp   (working copy)
> >> @@ -49,7 +49,7 @@
> >>  } else {
> >>      if [is-effective-target ilp32] {
> >>          # Specify a cpu that supports VMX for compile-only tests.
> >> -        lappend DEFAULT_VECTCFLAGS "-mcpu=7400"
> >> +        lappend DEFAULT_VECTCFLAGS "-mcpu=970"
> >>      }
> >>      set dg-do-what-default compile
> >>  }
> >
> >
>
> [attachment "0008-costmodel-early-scalar.diff" deleted by Dorit
> Nuzman/Haifa/IBM]