From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rguenther@suse.de>
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29])
 by sourceware.org (Postfix) with ESMTPS id 4FD5C385AC19
 for <gcc-patches@gcc.gnu.org>; Thu, 15 Jul 2021 12:34:32 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4FD5C385AC19
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
 by smtp-out2.suse.de (Postfix) with ESMTP id 85B871FE2F;
 Thu, 15 Jul 2021 12:34:31 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa;
 t=1626352471; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
 mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=YsLbcrmA3IxaocpQdCiKqZ4CdtZs5ZYtm0jQcG16VJE=;
 b=YeR2X34tCRO8Vzj5/Ic6H6maxDdQmUzdAp0n2wFRe9NWjMLL38a3gbIr09d3kUnCnj8RdH
 uY3QdI6fK7ioyoY2AW44f3I4HEcgylBld8YojolvXbgGxE2oW64TUs81+j6yQy62sugVTv
 bzCXexNsbZvnhqi/GdLxbtsnBcCfFr8=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1626352471;
 h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
 mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=YsLbcrmA3IxaocpQdCiKqZ4CdtZs5ZYtm0jQcG16VJE=;
 b=nP6XKfAbxdZXATujpw51BGPYD9M2Z2B5lSjfuKG8EWhSQpIo2kkfcAyXyYwBZkg+QjGv2a
 QM4qohmactu/y2Bg==
Received: from murzim.suse.de (murzim.suse.de [10.160.4.192])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by relay2.suse.de (Postfix) with ESMTPS id 7E5E0A3BA1;
 Thu, 15 Jul 2021 12:34:31 +0000 (UTC)
Date: Thu, 15 Jul 2021 14:34:31 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: Christophe Lyon <christophe.lyon.oss@gmail.com>
cc: GCC Patches <gcc-patches@gcc.gnu.org>, 
 Richard Sandiford <richard.sandiford@arm.com>
Subject: Re: [PATCH] Support reduction def re-use for epilogue with different
 vector size
In-Reply-To: <CAKhMtSJc0+u4cMFZQjCNzFsbmeuKv1BzN2A0yiZdVnmT1zhmqQ@mail.gmail.com>
Message-ID: <nycvar.YFH.7.76.2107151434030.10711@zhemvz.fhfr.qr>
References: <4po155o0-9ps7-3r20-non0-n2snnrosorqn@fhfr.qr>
 <CAKhMtSJc0+u4cMFZQjCNzFsbmeuKv1BzN2A0yiZdVnmT1zhmqQ@mail.gmail.com>
User-Agent: Alpine 2.21 (LSU 202 2017-01-01)
MIME-Version: 1.0
X-Spam-Status: No, score=-10.8 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Jul 2021 12:34:34 -0000

On Thu, 15 Jul 2021, Christophe Lyon wrote:

> Hi,
> 
> 
> 
> On Tue, Jul 13, 2021 at 2:09 PM Richard Biener <rguenther@suse.de> wrote:
> 
> > The following adds support for re-using the vector reduction def
> > from the main loop in vectorized epilogue loops on architectures
> > which use different vector sizes for the epilogue.  That's only
> > x86 as far as I am aware.
> >
> > vect.exp tested on x86_64-unknown-linux-gnu, full bootstrap &
> > regtest in progress.
> >
> > There's costing issues on x86 which usually prevent vectorizing
> > an epilogue with a reduction, at least for loops that only
> > have a reduction - it could be mitigated by not accounting for
> > the epilogue there if we can compute that we can re-use the
> > main loops cost.
> >
> > Richard - did I figure the correct place to adjust?  I guess
> > adjusting accumulator->reduc_input in vect_transform_cycle_phi
> > for re-use by the skip code in vect_create_epilog_for_reduction
> > is a bit awkward but at least we're conciously doing
> > vect_create_epilog_for_reduction last (via vectorizing live
> > operations).
> >
> > OK in the unlikely case all testing succeeds (I also want to
> > run it through SPEC with/without -fno-vect-cost-model which
> > will take some time)?
> >
> > Thanks,
> > Richard.
> >
> > 2021-07-13  Richard Biener  <rguenther@suse.de>
> >
> >         * tree-vect-loop.c (vect_find_reusable_accumulator): Handle
> >         vector types where the old vector type has a multiple of
> >         the new vector type elements.
> >         (vect_create_partial_epilog): New function, split out from...
> >         (vect_create_epilog_for_reduction): ... here.
> >         (vect_transform_cycle_phi): Reduce the re-used accumulator
> >         to the new vector type.
> >
> >         * gcc.target/i386/vect-reduc-1.c: New testcase.
> >
> 
> This patch is causing regressions on aarch64:
>  FAIL: gcc.dg/vect/pr92324-4.c (internal compiler error)
> FAIL: gcc.dg/vect/pr92324-4.c 2 blank line(s) in output
> FAIL: gcc.dg/vect/pr92324-4.c (test for excess errors)
> Excess errors:
> /gcc/testsuite/gcc.dg/vect/pr92324-4.c:7:1: error: incompatible types in
> 'PHI' argument 1
> vector(2) unsigned int
> vector(2) int
> _91 = PHI <_90(17), _83(11)>
> during GIMPLE pass: vect
> dump file: ./pr92324-4.c.167t.vect
> /gcc/testsuite/gcc.dg/vect/pr92324-4.c:7:1: internal compiler error:
> verify_gimple failed
> 0xe6438e verify_gimple_in_cfg(function*, bool)
>         /gcc/tree-cfg.c:5535
> 0xd13902 execute_function_todo
>         /gcc/passes.c:2042
> 0xd142a5 execute_todo
>         /gcc/passes.c:2096
> 
> FAIL: gcc.target/aarch64/vect-fmaxv-fminv-compile.c scan-assembler fminnmv
> FAIL: gcc.target/aarch64/vect-fmaxv-fminv-compile.c scan-assembler fmaxnmv

What exact options do you pass to cc1 to get this?  Can you track this
in a PR please?

Thanks,
Richard.

> Thanks,
> 
> Christophe
> 
> 
> 
> > ---
> >  gcc/testsuite/gcc.target/i386/vect-reduc-1.c |  17 ++
> >  gcc/tree-vect-loop.c                         | 223 ++++++++++++-------
> >  2 files changed, 155 insertions(+), 85 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > new file mode 100644
> > index 00000000000..9ee9ba4e736
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
> > +
> > +#define N 32
> > +int foo (int *a, int n)
> > +{
> > +  int sum = 1;
> > +  for (int i = 0; i < 8*N + 4; ++i)
> > +    sum += a[i];
> > +  return sum;
> > +}
> > +
> > +/* The reduction epilog should be vectorized and the accumulator
> > +   re-used.  */
> > +/* { dg-final { scan-tree-dump "LOOP EPILOGUE VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-assembler-times "psrl" 2 } } */
> > +/* { dg-final { scan-assembler-times "padd" 5 } } */
> > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> > index 8c27d75f889..98e2a845629 100644
> > --- a/gcc/tree-vect-loop.c
> > +++ b/gcc/tree-vect-loop.c
> > @@ -4901,7 +4901,8 @@ vect_find_reusable_accumulator (loop_vec_info
> > loop_vinfo,
> >       ones as well.  */
> >    tree vectype = STMT_VINFO_VECTYPE (reduc_info);
> >    tree old_vectype = TREE_TYPE (accumulator->reduc_input);
> > -  if (!useless_type_conversion_p (old_vectype, vectype))
> > +  if (!constant_multiple_p (TYPE_VECTOR_SUBPARTS (old_vectype),
> > +                           TYPE_VECTOR_SUBPARTS (vectype)))
> >      return false;
> >
> >    /* Non-SLP reductions might apply an adjustment after the reduction
> > @@ -4935,6 +4936,101 @@ vect_find_reusable_accumulator (loop_vec_info
> > loop_vinfo,
> >    return true;
> >  }
> >
> > +/* Reduce the vector VEC_DEF down to VECTYPE with reduction operation
> > +   CODE emitting stmts before GSI.  Returns a vector def of VECTYPE.  */
> > +
> > +static tree
> > +vect_create_partial_epilog (tree vec_def, tree vectype, enum tree_code
> > code,
> > +                           gimple_seq *seq)
> > +{
> > +  unsigned nunits = TYPE_VECTOR_SUBPARTS (TREE_TYPE
> > (vec_def)).to_constant ();
> > +  unsigned nunits1 = TYPE_VECTOR_SUBPARTS (vectype).to_constant ();
> > +  tree stype = TREE_TYPE (vectype);
> > +  tree new_temp = vec_def;
> > +  while (nunits > nunits1)
> > +    {
> > +      nunits /= 2;
> > +      tree vectype1 = get_related_vectype_for_scalar_type (TYPE_MODE
> > (vectype),
> > +                                                          stype, nunits);
> > +      unsigned int bitsize = tree_to_uhwi (TYPE_SIZE (vectype1));
> > +
> > +      /* The target has to make sure we support lowpart/highpart
> > +        extraction, either via direct vector extract or through
> > +        an integer mode punning.  */
> > +      tree dst1, dst2;
> > +      gimple *epilog_stmt;
> > +      if (convert_optab_handler (vec_extract_optab,
> > +                                TYPE_MODE (TREE_TYPE (new_temp)),
> > +                                TYPE_MODE (vectype1))
> > +         != CODE_FOR_nothing)
> > +       {
> > +         /* Extract sub-vectors directly once vec_extract becomes
> > +            a conversion optab.  */
> > +         dst1 = make_ssa_name (vectype1);
> > +         epilog_stmt
> > +             = gimple_build_assign (dst1, BIT_FIELD_REF,
> > +                                    build3 (BIT_FIELD_REF, vectype1,
> > +                                            new_temp, TYPE_SIZE
> > (vectype1),
> > +                                            bitsize_int (0)));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +         dst2 =  make_ssa_name (vectype1);
> > +         epilog_stmt
> > +             = gimple_build_assign (dst2, BIT_FIELD_REF,
> > +                                    build3 (BIT_FIELD_REF, vectype1,
> > +                                            new_temp, TYPE_SIZE
> > (vectype1),
> > +                                            bitsize_int (bitsize)));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +       }
> > +      else
> > +       {
> > +         /* Extract via punning to appropriately sized integer mode
> > +            vector.  */
> > +         tree eltype = build_nonstandard_integer_type (bitsize, 1);
> > +         tree etype = build_vector_type (eltype, 2);
> > +         gcc_assert (convert_optab_handler (vec_extract_optab,
> > +                                            TYPE_MODE (etype),
> > +                                            TYPE_MODE (eltype))
> > +                     != CODE_FOR_nothing);
> > +         tree tem = make_ssa_name (etype);
> > +         epilog_stmt = gimple_build_assign (tem, VIEW_CONVERT_EXPR,
> > +                                            build1 (VIEW_CONVERT_EXPR,
> > +                                                    etype, new_temp));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +         new_temp = tem;
> > +         tem = make_ssa_name (eltype);
> > +         epilog_stmt
> > +             = gimple_build_assign (tem, BIT_FIELD_REF,
> > +                                    build3 (BIT_FIELD_REF, eltype,
> > +                                            new_temp, TYPE_SIZE (eltype),
> > +                                            bitsize_int (0)));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +         dst1 = make_ssa_name (vectype1);
> > +         epilog_stmt = gimple_build_assign (dst1, VIEW_CONVERT_EXPR,
> > +                                            build1 (VIEW_CONVERT_EXPR,
> > +                                                    vectype1, tem));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +         tem = make_ssa_name (eltype);
> > +         epilog_stmt
> > +             = gimple_build_assign (tem, BIT_FIELD_REF,
> > +                                    build3 (BIT_FIELD_REF, eltype,
> > +                                            new_temp, TYPE_SIZE (eltype),
> > +                                            bitsize_int (bitsize)));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +         dst2 =  make_ssa_name (vectype1);
> > +         epilog_stmt = gimple_build_assign (dst2, VIEW_CONVERT_EXPR,
> > +                                            build1 (VIEW_CONVERT_EXPR,
> > +                                                    vectype1, tem));
> > +         gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +       }
> > +
> > +      new_temp = make_ssa_name (vectype1);
> > +      epilog_stmt = gimple_build_assign (new_temp, code, dst1, dst2);
> > +      gimple_seq_add_stmt_without_update (seq, epilog_stmt);
> > +    }
> > +
> > +  return new_temp;
> > +}
> > +
> >  /* Function vect_create_epilog_for_reduction
> >
> >     Create code at the loop-epilog to finalize the result of a reduction
> > @@ -5684,87 +5780,11 @@ vect_create_epilog_for_reduction (loop_vec_info
> > loop_vinfo,
> >
> >        /* First reduce the vector to the desired vector size we should
> >          do shift reduction on by combining upper and lower halves.  */
> > -      new_temp = reduc_inputs[0];
> > -      while (nunits > nunits1)
> > -       {
> > -         nunits /= 2;
> > -         vectype1 = get_related_vectype_for_scalar_type (TYPE_MODE
> > (vectype),
> > -                                                         stype, nunits);
> > -         unsigned int bitsize = tree_to_uhwi (TYPE_SIZE (vectype1));
> > -
> > -         /* The target has to make sure we support lowpart/highpart
> > -            extraction, either via direct vector extract or through
> > -            an integer mode punning.  */
> > -         tree dst1, dst2;
> > -         if (convert_optab_handler (vec_extract_optab,
> > -                                    TYPE_MODE (TREE_TYPE (new_temp)),
> > -                                    TYPE_MODE (vectype1))
> > -             != CODE_FOR_nothing)
> > -           {
> > -             /* Extract sub-vectors directly once vec_extract becomes
> > -                a conversion optab.  */
> > -             dst1 = make_ssa_name (vectype1);
> > -             epilog_stmt
> > -                 = gimple_build_assign (dst1, BIT_FIELD_REF,
> > -                                        build3 (BIT_FIELD_REF, vectype1,
> > -                                                new_temp, TYPE_SIZE
> > (vectype1),
> > -                                                bitsize_int (0)));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -             dst2 =  make_ssa_name (vectype1);
> > -             epilog_stmt
> > -                 = gimple_build_assign (dst2, BIT_FIELD_REF,
> > -                                        build3 (BIT_FIELD_REF, vectype1,
> > -                                                new_temp, TYPE_SIZE
> > (vectype1),
> > -                                                bitsize_int (bitsize)));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -           }
> > -         else
> > -           {
> > -             /* Extract via punning to appropriately sized integer mode
> > -                vector.  */
> > -             tree eltype = build_nonstandard_integer_type (bitsize, 1);
> > -             tree etype = build_vector_type (eltype, 2);
> > -             gcc_assert (convert_optab_handler (vec_extract_optab,
> > -                                                TYPE_MODE (etype),
> > -                                                TYPE_MODE (eltype))
> > -                         != CODE_FOR_nothing);
> > -             tree tem = make_ssa_name (etype);
> > -             epilog_stmt = gimple_build_assign (tem, VIEW_CONVERT_EXPR,
> > -                                                build1 (VIEW_CONVERT_EXPR,
> > -                                                        etype, new_temp));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -             new_temp = tem;
> > -             tem = make_ssa_name (eltype);
> > -             epilog_stmt
> > -                 = gimple_build_assign (tem, BIT_FIELD_REF,
> > -                                        build3 (BIT_FIELD_REF, eltype,
> > -                                                new_temp, TYPE_SIZE
> > (eltype),
> > -                                                bitsize_int (0)));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -             dst1 = make_ssa_name (vectype1);
> > -             epilog_stmt = gimple_build_assign (dst1, VIEW_CONVERT_EXPR,
> > -                                                build1 (VIEW_CONVERT_EXPR,
> > -                                                        vectype1, tem));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -             tem = make_ssa_name (eltype);
> > -             epilog_stmt
> > -                 = gimple_build_assign (tem, BIT_FIELD_REF,
> > -                                        build3 (BIT_FIELD_REF, eltype,
> > -                                                new_temp, TYPE_SIZE
> > (eltype),
> > -                                                bitsize_int (bitsize)));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -             dst2 =  make_ssa_name (vectype1);
> > -             epilog_stmt = gimple_build_assign (dst2, VIEW_CONVERT_EXPR,
> > -                                                build1 (VIEW_CONVERT_EXPR,
> > -                                                        vectype1, tem));
> > -             gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -           }
> > -
> > -         new_temp = make_ssa_name (vectype1);
> > -         epilog_stmt = gimple_build_assign (new_temp, code, dst1, dst2);
> > -         gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
> > -         reduc_inputs[0] = new_temp;
> > -       }
> > +      gimple_seq stmts = NULL;
> > +      new_temp = vect_create_partial_epilog (reduc_inputs[0], vectype1,
> > +                                            code, &stmts);
> > +      gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > +      reduc_inputs[0] = new_temp;
> >
> >        if (reduce_with_shift && !slp_reduc)
> >         {
> > @@ -7681,13 +7701,46 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
> >
> >    if (auto *accumulator = reduc_info->reused_accumulator)
> >      {
> > +      tree def = accumulator->reduc_input;
> > +      unsigned int nreduc;
> > +      bool res = constant_multiple_p (TYPE_VECTOR_SUBPARTS (TREE_TYPE
> > (def)),
> > +                                     TYPE_VECTOR_SUBPARTS (vectype_out),
> > +                                     &nreduc);
> > +      gcc_assert (res);
> > +      if (nreduc != 1)
> > +       {
> > +         /* Reduce the single vector to a smaller one.  */
> > +         gimple_seq stmts = NULL;
> > +         def = vect_create_partial_epilog (def, vectype_out,
> > +                                           STMT_VINFO_REDUC_CODE
> > (reduc_info),
> > +                                           &stmts);
> > +         /* Adjust the input so we pick up the partially reduced value
> > +            for the skip edge in vect_create_epilog_for_reduction.  */
> > +         accumulator->reduc_input = def;
> > +         if (loop_vinfo->main_loop_edge)
> > +           {
> > +             /* While we'd like to insert on the edge this will split
> > +                blocks and disturb bookkeeping, we also will eventually
> > +                need this on the skip edge.  Rely on sinking to
> > +                fixup optimal placement and insert in the pred.  */
> > +             gimple_stmt_iterator gsi
> > +               = gsi_last_bb (loop_vinfo->main_loop_edge->src);
> > +             /* Insert before a cond that eventually skips the
> > +                epilogue.  */
> > +             if (!gsi_end_p (gsi) && stmt_ends_bb_p (gsi_stmt (gsi)))
> > +               gsi_prev (&gsi);
> > +             gsi_insert_seq_after (&gsi, stmts, GSI_CONTINUE_LINKING);
> > +           }
> > +         else
> > +           gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop),
> > +                                             stmts);
> > +       }
> >        if (loop_vinfo->main_loop_edge)
> >         vec_initial_defs[0]
> > -         = vect_get_main_loop_result (loop_vinfo,
> > accumulator->reduc_input,
> > +         = vect_get_main_loop_result (loop_vinfo, def,
> >                                        vec_initial_defs[0]);
> >        else
> > -       vec_initial_defs.safe_push (accumulator->reduc_input);
> > -      gcc_assert (vec_initial_defs.length () == 1);
> > +       vec_initial_defs.safe_push (def);
> >      }
> >
> >    /* Generate the reduction PHIs upfront.  */
> > --
> > 2.26.2
> >
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)