From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-407428-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 22518 invoked by alias); 15 Sep 2015 12:09:56 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 21242 invoked by uid 89); 15 Sep 2015 12:09:55 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mail-yk0-f169.google.com
Received: from mail-yk0-f169.google.com (HELO mail-yk0-f169.google.com) (209.85.160.169) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Tue, 15 Sep 2015 12:09:53 +0000
Received: by ykdg206 with SMTP id g206so182416843ykd.1        for <gcc-patches@gcc.gnu.org>; Tue, 15 Sep 2015 05:09:51 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.170.74.85 with SMTP id q82mr20428536ykq.94.1442318991570; Tue, 15 Sep 2015 05:09:51 -0700 (PDT)
Received: by 10.37.93.136 with HTTP; Tue, 15 Sep 2015 05:09:51 -0700 (PDT)
In-Reply-To: <D217578B.7FE4%alan.hayward@arm.com>
References: <D217578B.7FE4%alan.hayward@arm.com>
Date: Tue, 15 Sep 2015 12:10:00 -0000
Message-ID: <CAFiYyc0tYw9p9_z=5at7+2V8r9LzN0iRdA=w6TQ1==ZiqXb0bA@mail.gmail.com>
Subject: Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
From: Richard Biener <richard.guenther@gmail.com>
To: Alan Hayward <alan.hayward@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset=UTF-8
X-IsSubscribed: yes
X-SW-Source: 2015-09/txt/msg01053.txt.bz2

On Thu, Sep 10, 2015 at 4:51 PM, Alan Hayward <alan.hayward@arm.com> wrote:
> Hi,
> This patch (attached) adds support for vectorizing conditional expressions
> (PR 65947), for example:
>
> int condition_reduction (int *a, int min_v)
> {
>   int last = 0;
>   for (int i = 0; i < N; i++)
>     if (a[i] < min_v)
>       last = a[i];
>   return last;
> }
>
> To do this the loop is vectorised to create a vector of data results (ie
> of matching a[i] values). Using an induction variable, an additional
> vector is added containing the indexes where the matches occured. In the
> function epilogue this is reduced to a single max value and then used to
> index into the vector of data results.
> When no values are matched in the loop, the indexes vector will contain
> all zeroes, eventually matching the first entry in the data results vector.
>
> To vectorize sucessfully, support is required for REDUC_MAX_EXPR. This is
> supported by aarch64 and arm. On X86 and powerpc, gcc will complain that
> REDUC_MAX_EXPR is not supported for the required modes, failing the
> vectorization. On mips it complains that the required vcond expression is
> not supported. It is suggested the relevant backend experts add the
> required backend support.
>
> Using a simple testcase based around a large number of N and run on an
> aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of
> it's original time.
>
> This patch caused binary differences in three spec2006 binaries on aarch64
> - 4.16.gamess, 435.gromacs and 456.hmmer. Running them on a juno board
> showed no improvement or degregation in runtime.
>
>
> In the near future I hope to submit a further patch (as PR 66558) which
> optimises the case where the result is simply the index of the loop, for
> example:
> int condition_reduction (int *a, int min_v)
> {
>   int last = 0;
>   for (int i = 0; i < N; i++)
>     if (a[i] < min_v)
>       last = i;
>   return last;
> }
> In this case a lot of the new code can be optimized away.
>
> I have run check for aarch64, arm and x86 and have seen no regressions.

Now comments on the patch itself.

+      if (code == COND_EXPR)
+       *v_reduc_type = COND_REDUCTION;

so why not simply use COND_EXPR instead of the new v_reduc_type?

+  if (check_reduction && code != COND_EXPR &&
+      vect_is_slp_reduction (loop_info, phi, def_stmt))

&&s go to the next line

+             /* Reduction of the max index and a reduction of the found
+                values.  */
+             epilogue_cost += add_stmt_cost (target_cost_data, 1,
+                                             vec_to_scalar, stmt_info, 0,
+                                             vect_epilogue);

vec_to_scalar once isn't what the comment suggests.  Instead the
comment suggests twice what a regular reduction would do
but I guess we can "hide" the vec_to_scalar cost and "merge" it
with the broadcast.  Thus make the above two vector_stmt costs?

+             /* A broadcast of the max value.  */
+             epilogue_cost += add_stmt_cost (target_cost_data, 2,
+                                             scalar_to_vec, stmt_info, 0,
+                                             vect_epilogue);

comment suggests a single broadcast.

@@ -3705,7 +3764,7 @@ get_initial_def_for_induction (gimple iv_phi)
          the final vector of induction results:  */
       exit_phi = NULL;
       FOR_EACH_IMM_USE_FAST (use_p, imm_iter, loop_arg)
-        {
+       {
          gimple use_stmt = USE_STMT (use_p);
          if (is_gimple_debug (use_stmt))
            continue;

please avoid unrelated whitespace changes.

+      case COND_EXPR:
+       if (v_reduc_type == COND_REDUCTION)
+         {
...
+       /* Fall through.  */
+
       case MIN_EXPR:
       case MAX_EXPR:
-      case COND_EXPR:

aww, so we could already handle COND_EXPR reductions?  How do they
differ from what you add?  Did you see if that path is even exercised today?

+           /* Create a vector of {init_value, 0, 0, 0...}.  */
+           vec<constructor_elt, va_gc> *v;
+           vec_alloc (v, nunits);
+           CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, init_val);
+           if (SCALAR_FLOAT_TYPE_P (scalar_type))
+             for (i = 1; i < nunits; ++i)
+               CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
+                                       build_real (scalar_type, dconst0));
+           else
+             for (i = 1; i < nunits; ++i)
+               CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
+                                       build_int_cst (scalar_type, 0));
+           init_def = build_constructor (vectype, v);

you can unify the float/int case by using build_zero_cst (scalar_type).
Note that you should build a vector constant instead of a constructor
if init_val is a constant.  The convenient way is to build the vector
elements into a tree[] array and use build_vector_stat in that case.

+      /* Find maximum value from the vector of found indexes.  */
+      tree max_index = make_temp_ssa_name (index_scalar_type, NULL, "");

just use make_ssa_name (index_scalar_type);

+      /* Convert the vector of data to the same type as the EQ.  */
+      tree vec_data_cast;
+      if ( TYPE_UNSIGNED (index_vec_type))
+       {

How come it never happens the element
sizes do not match?  (int index type and double data type?)

+      /* Where the max index occured, use the value from the data vector.  */
+      tree vec_and = make_temp_ssa_name (index_vec_type_signed, NULL, "");
+      gimple vec_and_stmt = gimple_build_assign (vec_and, BIT_AND_EXPR,
+                                                vec_compare, vec_data_cast);

that is, don't you need to do some widening/shortening on the comparison result?
(also what happens in the case "or all of the values"?)

Definitely too much VIEW_CONVERT_MAGIC here for my taste ;)

+   This function also handles reduction of condition expressions, for example:
+     for (int i = 0; i < N; i++)
+       if (a[i] < value)
+        last = a[i];
+   This is handled by vectorising the loop and creating an additional vector
+   containing the loop indexes for which "a[i] < value" was true.  In the
+   function epilogue this is reduced to a single max value and then used to
+   index into the vector of results.

I miss a comment that shows the kind of code we transform this to.
"an additional vector containing the loop indexes" can't work - the vector
will not be large enough ;)  Naiively I would have made 'last' a vector,
performing the reduction element-wise and in the epilogue reduce
'last' itself.  And it looks like we are already doing that for

int foo (int *a)
{
  int val = 0;
  for (int i = 0; i < 1024; ++i)
    if (a[i] > val)
      val = a[i];
  return val;
}

I must be missing something.  Yes, I think we can't do index reduction yet,
but pr65947-10.c is alrady handled?

Stopping here.

Thanks,
Richard.

>
> Changelog:
>
>     2015-08-28  Alan Hayward <alan.hayward@arm.com>
>
>         PR tree-optimization/65947
>         * tree-vect-loop.c
>         (vect_is_simple_reduction_1): Find condition reductions.
>         (vect_model_reduction_cost): Add condition reduction costs.
>         (get_initial_def_for_reduction): Add condition reduction initial
> var.
>         (vect_create_epilog_for_reduction): Add condition reduction epilog.
>         (vectorizable_reduction): Condition reduction support.
>         * tree-vect-stmts.c
>         (vectorizable_condition): Add vect reduction arg
>         * doc/sourcebuild.texi (Vector-specific attributes): Document
>         vect_max_reduc
>
>     testsuite/Changelog:
>
>         PR tree-optimization/65947
>         * lib/target-supports.exp
>         (check_effective_target_vect_max_reduc): Add.
>         * gcc.dg/vect/pr65947-1.c: New test.
>         * gcc.dg/vect/pr65947-2.c: New test.
>         * gcc.dg/vect/pr65947-3.c: New test.
>         * gcc.dg/vect/pr65947-4.c: New test.
>         * gcc.dg/vect/pr65947-5.c: New test.
>         * gcc.dg/vect/pr65947-6.c: New test.
>         * gcc.dg/vect/pr65947-7.c: New test.
>         * gcc.dg/vect/pr65947-8.c: New test.
>         * gcc.dg/vect/pr65947-9.c: New test.
>         * gcc.dg/vect/pr65947-10.c: New test.
>         * gcc.dg/vect/pr65947-11.c: New test.
>
>
>
> Thanks,
> Alan
>
>