From mboxrd@z Thu Jan  1 00:00:00 1970
References: <mptpmvtrner.fsf@arm.com> <mpt35spq8mb.fsf@arm.com>
In-Reply-To: <mpt35spq8mb.fsf@arm.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Fri, 9 Jul 2021 13:58:30 +0200
Message-ID: <CAFiYyc1Ynh9a9gxSjJSun5SGD9rW6kFzdMb_qEPLKQBCoDavxA@mail.gmail.com>
Subject: Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops
To: Richard Sandiford <richard.sandiford@arm.com>,
 GCC Patches <gcc-patches@gcc.gnu.org>

On Thu, Jul 8, 2021 at 2:50 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> This patch adds support for reusing a main loop's reduction accumulator
> in an epilogue loop.  This in turn lets the loops share a single piece
> of vector->scalar reduction code.
>
> The patch has the following restrictions:
>
> (1) The epilogue reduction can only operate on a single vector
>     (e.g. ncopies must be 1 for non-SLP reductions, and the group size
>     must be <= the element count for SLP reductions).
>
> (2) Both loops must use the same vector mode for their accumulators.
>     This means that the patch is restricted to targets that support
>     --param vect-partial-vector-usage=1.
>
> (3) The reduction must be a standard “tree code” reduction.
>
> However, these restrictions could be lifted in future.  For example,
> if the main loop operates on 128-bit vectors and the epilogue loop
> operates on 64-bit vectors, we could in future reduce the 128-bit
> vector by one stage and use the 64-bit result as the starting point
> for the epilogue result.

Yeah, I hope that can be done quickly - it should make the
approach usable on x86_64.
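
For illustration, the staged narrowing described above could look like
the following in plain C++.  This is only a sketch, not vectorizer code:
V4, V2, narrow_accumulator and sum are made-up names, the 4-lane and
2-lane arrays stand in for 128-bit and 64-bit vectors of 32-bit ints,
and a plain sum is assumed as the reduction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for 128-bit and 64-bit vectors of 32-bit ints.
using V4 = std::array<int, 4>;
using V2 = std::array<int, 2>;

// One reduction stage: fold the high half of the wide accumulator onto
// the low half, giving a half-width accumulator.
V2 narrow_accumulator(const V4 &acc) {
  return {acc[0] + acc[2], acc[1] + acc[3]};
}

// Sum data[0..n) with a 4-lane main loop and a 2-lane epilogue loop that
// starts from the narrowed main-loop accumulator.
int sum(const int *data, std::size_t n) {
  V4 wide = {0, 0, 0, 0};
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4)
    for (std::size_t lane = 0; lane < 4; ++lane)
      wide[lane] += data[i + lane];

  // The epilogue reuses the main loop's partial sums instead of
  // starting again from zero.
  V2 narrow = narrow_accumulator(wide);
  for (; i + 2 <= n; i += 2)
    for (std::size_t lane = 0; lane < 2; ++lane)
      narrow[lane] += data[i + lane];

  // A single vector->scalar reduction serves both loops.
  int result = narrow[0] + narrow[1];

  // Scalar tail for a leftover odd element.
  for (; i < n; ++i)
    result += data[i];
  return result;
}
```

The point of the sketch is that only one final vector->scalar step is
emitted even though two loops with different vector widths contribute.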

> The patch tries to handle chained SLP reductions, unchained SLP
> reductions and non-SLP reductions.  It also handles cases in which
> the epilogue loop is entered directly (rather than via the main loop)
> and cases in which the epilogue loop can be skipped.
>
> vect_get_main_loop_result is a bit more general than the current
> patch needs.

I didn't see anything that would adjust the costing of the vectorization
(though I don't specifically remember how we cost vectorized epilogues
in general).

A few comments / questions inline below - I think the patch is OK
as-is though.

Thanks,
Richard.

> gcc/
>         * tree-vectorizer.h (vect_reusable_accumulator): New structure.
>         (_loop_vec_info::main_loop_edge): New field.
>         (_loop_vec_info::skip_main_loop_edge): Likewise.
>         (_loop_vec_info::skip_this_loop_edge): Likewise.
>         (_loop_vec_info::reusable_accumulators): Likewise.
>         (_stmt_vec_info::reduc_scalar_results): Likewise.
>         (_stmt_vec_info::reused_accumulator): Likewise.
>         (vect_get_main_loop_result): Declare.
>         * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
>         reduc_scalar_inputs.
>         (vec_info::free_stmt_vec_info): Free reduc_scalar_inputs.
>         * tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
>         (vect_do_peeling): Fill an epilogue loop's main_loop_edge,
>         skip_main_loop_edge and skip_this_loop_edge fields.
>         * tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
>         (vect_emit_reduction_init_stmts): New function.
>         (get_initial_def_for_reduction): Use it.
>         (get_initial_defs_for_reduction): Likewise.  Change the vinfo
>         parameter to a loop_vec_info.
>         (vect_create_epilog_for_reduction): Store the scalar results
>         in the reduc_info.  If an epilogue loop is reusing an accumulator
>         from the main loop, and if the epilogue loop can also be skipped,
>         try to place the reduction code in the join block.  Record
>         accumulators that could potentially be reused by epilogue loops.
>         (vect_transform_cycle_phi): When vectorizing epilogue loops,
>         try to reuse accumulators from the main loop.  Record the initial
>         value in reduc_info for non-SLP reductions too.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/reduc_9.c: New test.
>         * gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_10.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_11.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_12.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_12_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_13.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_13_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_14.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_14_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_15.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_15_run.c: Likewise.
> ---
>  .../gcc.target/aarch64/sve/reduc_10.c         |  77 +++++
>  .../gcc.target/aarch64/sve/reduc_10_run.c     |  49 +++
>  .../gcc.target/aarch64/sve/reduc_11.c         |  71 ++++
>  .../gcc.target/aarch64/sve/reduc_11_run.c     |  34 ++
>  .../gcc.target/aarch64/sve/reduc_12.c         |  71 ++++
>  .../gcc.target/aarch64/sve/reduc_12_run.c     |  66 ++++
>  .../gcc.target/aarch64/sve/reduc_13.c         | 101 ++++++
>  .../gcc.target/aarch64/sve/reduc_13_run.c     |  61 ++++
>  .../gcc.target/aarch64/sve/reduc_14.c         | 107 ++++++
>  .../gcc.target/aarch64/sve/reduc_14_run.c     | 187 +++++++++++
>  .../gcc.target/aarch64/sve/reduc_15.c         |  16 +
>  .../gcc.target/aarch64/sve/reduc_15_run.c     |  22 ++
>  .../gcc.target/aarch64/sve/reduc_9.c          |  77 +++++
>  .../gcc.target/aarch64/sve/reduc_9_run.c      |  29 ++
>  gcc/tree-vect-loop-manip.c                    |  29 ++
>  gcc/tree-vect-loop.c                          | 309 ++++++++++++++----
>  gcc/tree-vectorizer.c                         |   4 +
>  gcc/tree-vectorizer.h                         |  51 ++-
>  18 files changed, 1297 insertions(+), 64 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index e2fd3609fee..ed7a7738880 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -551,6 +551,18 @@ typedef auto_vec<rgroup_controls> vec_loop_lens;
>
>  typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec;
>
> +/* Information about a reduction accumulator from the main loop that could
> +   conceivably be reused as the input to a reduction in an epilogue loop.  */
> +struct vect_reusable_accumulator {
> +  /* The final value of the accumulator, which forms the input to the
> +     reduction operation.  */
> +  tree reduc_input;
> +
> +  /* The stmt_vec_info that describes the reduction (i.e. the one for
> +     which is_reduc_info is true).  */
> +  stmt_vec_info reduc_info;
> +};
> +
>  /*-----------------------------------------------------------------*/
>  /* Info on vectorized loops.                                       */
>  /*-----------------------------------------------------------------*/
> @@ -588,6 +600,23 @@ public:
>    /* Unrolling factor  */
>    poly_uint64 vectorization_factor;
>
> +  /* If this loop is an epilogue loop whose main loop can be skipped,
> +     MAIN_LOOP_EDGE is the edge from the main loop to this loop's
> +     preheader.  SKIP_MAIN_LOOP_EDGE is then the edge that skips the
> +     main loop and goes straight to this loop's preheader.
> +
> +     Both fields are null otherwise.  */
> +  edge main_loop_edge;
> +  edge skip_main_loop_edge;
> +
> +  /* If this loop is an epilogue loop that might be skipped after executing
> +     the main loop, this edge is the one that skips the epilogue.  */
> +  edge skip_this_loop_edge;
> +
> +  /* After vectorization, maps live-out SSA names to information about
> +     the reductions that generated them.  */
> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;

Is that the LC PHI node defs or the definition inside of the loop?
If the latter we could attach the info directly to its stmt-info?

> +
>    /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
>       if there is no particular limit.  */
>    unsigned HOST_WIDE_INT max_vectorization_factor;
> @@ -1186,6 +1215,21 @@ public:
>    /* The vector type for performing the actual reduction.  */
>    tree reduc_vectype;
>
> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> +     elements in parallel, this vector gives the initial values of these
> +     N elements.  */

Is that N scalar elements or N vector elements?  I suppose it's for
SLP reductions (rather than SLP reduction chains) and never non-SLP
reductions?
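
Reading the get_initial_defs_for_reduction hunk further down, the
entries look like scalar initial values that get spread over the lanes
of the initial vector(s).  A toy model of that selection, as a sketch
only: ints replace trees, initial_lanes is a made-up name, and
total_lanes standing for nunits * number_of_vectors is an assumption
about the enclosing loop, which the quoted hunk does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy model of the element selection in the patched
// get_initial_defs_for_reduction.  Values are plain ints rather than
// trees; an empty neutral_op models a null NEUTRAL_OP.
std::vector<int> initial_lanes(const std::vector<int> &initial_values,
                               std::size_t group_size,
                               std::size_t total_lanes,
                               std::optional<int> neutral_op) {
  // Mirrors the gcc_assert in the patch.
  assert(group_size == initial_values.size() || neutral_op);

  std::vector<int> lanes;
  for (std::size_t j = 0; j < total_lanes; ++j) {
    std::size_t i = j % group_size;
    // Same condition as the patch: use the neutral value when there is
    // no initial value for this group element, or when this element's
    // initial value has already been placed once.
    if (i >= initial_values.size() || (j > i && neutral_op))
      lanes.push_back(*neutral_op);
    else
      lanes.push_back(initial_values[i]);
  }
  return lanes;
}
```

Under these assumptions a reduction chain with one initial value 10 and
group size 2 yields {10, 0, 0, 0}, an unchained group {3, 5} with a
neutral value yields {3, 5, 0, 0}, and the same group without a neutral
value repeats as {3, 5, 3, 5}.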

> +  vec<tree> reduc_initial_values;
> +
> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> +     elements in parallel, this vector gives the scalar result of each
> +     reduction.  */
> +  vec<tree> reduc_scalar_results;
> +
> +  /* Only meaningful if IS_REDUC_INFO.  If non-null, the reduction is
> +     being performed by an epilogue loop and we have decided to reuse
> +     this accumulator from the main loop.  */
> +  vect_reusable_accumulator *reused_accumulator;
> +
>    /* Whether we force a single cycle PHI during reduction vectorization.  */
>    bool force_single_cycle;
>
> @@ -1382,12 +1426,6 @@ vect_phi_initial_value (gphi *phi)
>    return PHI_ARG_DEF_FROM_EDGE (phi, pe);
>  }
>
> -static inline tree
> -vect_phi_initial_value (stmt_vec_info stmt_info)
> -{
> -  return vect_phi_initial_value (as_a <gphi *> (stmt_info->stmt));
> -}
> -
>  /* Return true if STMT_INFO should produce a vector mask type rather than
>     a normal nonmask type.  */
>
> @@ -1818,6 +1856,7 @@ class loop *vect_loop_versioning (loop_vec_info, gimple *);
>  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
>                                     tree *, tree *, tree *, int, bool, bool,
>                                     tree *);
> +extern tree vect_get_main_loop_result (loop_vec_info, tree, tree = NULL_TREE);
>  extern void vect_prepare_for_masked_peels (loop_vec_info);
>  extern dump_user_location_t find_loop_location (class loop *);
>  extern bool vect_can_advance_ivs_p (loop_vec_info);
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 9748043f3ee..f1035a83826 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -694,6 +694,8 @@ vec_info::new_stmt_vec_info (gimple *stmt)
>    STMT_VINFO_SLP_VECT_ONLY (res) = false;
>    STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) = false;
>    STMT_VINFO_VEC_STMTS (res) = vNULL;
> +  res->reduc_initial_values = vNULL;
> +  res->reduc_scalar_results = vNULL;
>
>    if (is_a <loop_vec_info> (this)
>        && gimple_code (stmt) == GIMPLE_PHI
> @@ -755,6 +757,8 @@ vec_info::free_stmt_vec_info (stmt_vec_info stmt_info)
>         release_ssa_name (lhs);
>      }
>
> +  stmt_info->reduc_initial_values.release ();
> +  stmt_info->reduc_scalar_results.release ();
>    STMT_VINFO_SIMD_CLONE_INFO (stmt_info).release ();
>    STMT_VINFO_VEC_STMTS (stmt_info).release ();
>    free (stmt_info);
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 2909e8a0fc3..b7b0523e3c8 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epilogue_vinfo,
>    return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, true);
>  }
>
> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on exit
> +   from the corresponding main loop.  Return a value that is available in
> +   LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipped.
> +   Passing a null SKIP_VALUE is equivalent to passing zero.  */
> +
> +tree
> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_value,
> +                          tree skip_value)
> +{
> +  if (!loop_vinfo->main_loop_edge)
> +    return main_loop_value;
> +
> +  if (!skip_value)
> +    skip_value = build_zero_cst (TREE_TYPE (main_loop_value));

shouldn't that be the initial value?
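
To spell out what the question is about, here is a toy model of the
PHI that the function builds.  It is a sketch only: preheader_value is
a made-up name, ints replace trees, and whether a zero default is right
depends on what the callers pass, which this part of the patch does not
show.

```cpp
#include <cassert>

// Toy model of the PHI built by vect_get_main_loop_result: the value
// seen in the epilogue preheader is main_loop_value when control
// arrives from the main loop and skip_value when the main loop was
// skipped.  The default of 0 mirrors the documented behaviour of a null
// SKIP_VALUE.
int preheader_value(bool came_from_main_loop, int main_loop_value,
                    int skip_value = 0) {
  return came_from_main_loop ? main_loop_value : skip_value;
}
```

With the default, a skipped main loop hands 0 to the epilogue; a caller
that needs the reduction's initial value on that path (say 10) has to
pass it explicitly as the third argument.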

> +
> +  tree phi_result = make_ssa_name (TREE_TYPE (main_loop_value));
> +  basic_block bb = loop_vinfo->main_loop_edge->dest;
> +  gphi *new_phi = create_phi_node (phi_result, bb);
> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
> +              UNKNOWN_LOCATION);
> +  add_phi_arg (new_phi, skip_value,
> +              loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
> +  return phi_result;
> +}
> +
>  /* Function vect_do_peeling.
>
>     Input:
> @@ -2986,6 +3011,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>                                            skip_vector ? anchor : guard_bb,
>                                            prob_epilog.invert (),
>                                            irred_flag);
> +         if (vect_epilogues)
> +           epilogue_vinfo->skip_this_loop_edge = guard_e;
>           slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
>                                               single_exit (epilog));
>           /* Only need to handle basic block before epilog loop if it's not
> @@ -3057,6 +3084,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>           add_phi_arg (new_phi, build_zero_cst (TREE_TYPE (niters)), skip_e,
>                        UNKNOWN_LOCATION);
>           niters = PHI_RESULT (new_phi);
> +         epilogue_vinfo->main_loop_edge = update_e;
> +         epilogue_vinfo->skip_main_loop_edge = skip_e;
>         }
>
>        /* Set ADVANCE to the number of iterations performed by the previous
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index fe7e73f655f..5e6c9b7c38a 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -19,6 +19,7 @@ You should have received a copy of the GNU General Public License
>  along with GCC; see the file COPYING3.  If not see
>  <http://www.gnu.org/licenses/>.  */
>
> +#define INCLUDE_ALGORITHM
>  #include "config.h"
>  #include "system.h"
>  #include "coretypes.h"
> @@ -823,6 +824,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      th (0),
>      versioning_threshold (0),
>      vectorization_factor (0),
> +    main_loop_edge (nullptr),
> +    skip_main_loop_edge (nullptr),
> +    skip_this_loop_edge (nullptr),
> +    reusable_accumulators (),
>      max_vectorization_factor (0),
>      mask_skip_niters (NULL_TREE),
>      rgroup_compare_type (NULL_TREE),
> @@ -4607,7 +4612,32 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>                   prologue_cost, epilogue_cost);
>  }
>
> +/* SEQ is a sequence of instructions that initialize the reduction
> +   described by REDUC_INFO.  Emit them in the appropriate place.  */
>
> +static void
> +vect_emit_reduction_init_stmts (loop_vec_info loop_vinfo,
> +                               stmt_vec_info reduc_info, gimple *seq)
> +{
> +  if (reduc_info->reused_accumulator)
> +    {
> +      /* When reusing an accumulator from the main loop, we only need
> +        initialization instructions if the main loop can be skipped.
> +        In that case, emit the initialization instructions at the end
> +        of the guard block that does the skip.  */
> +      edge skip_edge = loop_vinfo->skip_main_loop_edge;
> +      gcc_assert (skip_edge);
> +      gimple_stmt_iterator gsi = gsi_last_bb (skip_edge->src);
> +      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
> +    }
> +  else
> +    {
> +      /* The normal case: emit the initialization instructions on the
> +        preheader edge.  */
> +      class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +      gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), seq);
> +    }
> +}
>
>  /* Function get_initial_def_for_reduction
>
> @@ -4675,36 +4705,30 @@ get_initial_def_for_reduction (loop_vec_info loop_vinfo,
>      }
>
>    if (stmts)
> -    gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);
> +    vect_emit_reduction_init_stmts (loop_vinfo, reduc_info, stmts);
>    return init_def;
>  }
>
> -/* Get at the initial defs for the reduction PHIs for REDUC_INFO, whose
> -   associated SLP node is SLP_NODE.  NUMBER_OF_VECTORS is the number of vector
> -   defs to create.  If NEUTRAL_OP is nonnull, introducing extra elements of
> -   that value will not change the result.  */
> +/* Get at the initial defs for the reduction PHIs for REDUC_INFO,
> +   which performs a reduction involving GROUP_SIZE scalar statements.
> +   NUMBER_OF_VECTORS is the number of vector defs to create.  If NEUTRAL_OP
> +   is nonnull, introducing extra elements of that value will not change the
> +   result.  */
>
>  static void
> -get_initial_defs_for_reduction (vec_info *vinfo,
> +get_initial_defs_for_reduction (loop_vec_info loop_vinfo,
>                                 stmt_vec_info reduc_info,
> -                               slp_tree slp_node,
>                                 vec<tree> *vec_oprnds,
>                                 unsigned int number_of_vectors,
> -                               bool reduc_chain, tree neutral_op)
> +                               unsigned int group_size, tree neutral_op)
>  {
> -  vec<stmt_vec_info> stmts = SLP_TREE_SCALAR_STMTS (slp_node);
> +  vec<tree> &initial_values = reduc_info->reduc_initial_values;
>    unsigned HOST_WIDE_INT nunits;
>    unsigned j, number_of_places_left_in_vector;
>    tree vector_type = STMT_VINFO_VECTYPE (reduc_info);
> -  unsigned int group_size = stmts.length ();
>    unsigned int i;
> -  class loop *loop;
> -
> -  loop = (gimple_bb (reduc_info->stmt))->loop_father;
> -  gcc_assert (loop);
> -  edge pe = loop_preheader_edge (loop);
>
> -  gcc_assert (!reduc_chain || neutral_op);
> +  gcc_assert (group_size == initial_values.length () || neutral_op);
>
>    /* NUMBER_OF_COPIES is the number of times we need to use the same values in
>       created vectors. It is greater than 1 if unrolling is performed.
> @@ -4734,18 +4758,13 @@ get_initial_defs_for_reduction (vec_info *vinfo,
>      {
>        tree op;
>        i = j % group_size;
> -      stmt_vec_info stmt_vinfo = stmts[i];
>
>        /* Get the def before the loop.  In reduction chain we have only
>          one initial value.  Else we have as many as PHIs in the group.  */
> -      if (reduc_chain)
> -       op = j != 0 ? neutral_op : vect_phi_initial_value (stmt_vinfo);
> -      else if (((vec_oprnds->length () + 1) * nunits
> -               - number_of_places_left_in_vector >= group_size)
> -              && neutral_op)
> +      if (i >= initial_values.length () || (j > i && neutral_op))
>         op = neutral_op;
>        else
> -       op = vect_phi_initial_value (stmt_vinfo);
> +       op = initial_values[i];
>
>        /* Create 'vect_ = {op0,op1,...,opn}'.  */
>        number_of_places_left_in_vector--;
> @@ -4781,8 +4800,8 @@ get_initial_defs_for_reduction (vec_info *vinfo,
>             {
>               /* First time round, duplicate ELTS to fill the
>                  required number of vectors.  */
> -             duplicate_and_interleave (vinfo, &ctor_seq, vector_type, elts,
> -                                       number_of_vectors, *vec_oprnds);
> +             duplicate_and_interleave (loop_vinfo, &ctor_seq, vector_type,
> +                                       elts, number_of_vectors, *vec_oprnds);
>               break;
>             }
>           vec_oprnds->quick_push (init);
> @@ -4794,7 +4813,7 @@ get_initial_defs_for_reduction (vec_info *vinfo,
>         }
>      }
>    if (ctor_seq != NULL)
> -    gsi_insert_seq_on_edge_immediate (pe, ctor_seq);
> +    vect_emit_reduction_init_stmts (loop_vinfo, reduc_info, ctor_seq);
>  }
>
>  /* For a statement STMT_INFO taking part in a reduction operation return
> @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_info stmt_info)
>    return stmt_info;
>  }
>
> +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize using vector
> +   type VECTYPE.  See if LOOP_VINFO is an epilogue loop whose main loop had a
> +   matching reduction that we can build on.  Adjust REDUC_INFO and return true
> +   if so, otherwise return false.  */
> +
> +static bool
> +vect_find_reusable_accumulator (loop_vec_info loop_vinfo,
> +                               stmt_vec_info reduc_info)
> +{
> +  loop_vec_info main_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
> +  if (!main_loop_vinfo)
> +    return false;
> +
> +  if (STMT_VINFO_REDUC_TYPE (reduc_info) != TREE_CODE_REDUCTION)
> +    return false;
> +
> +  unsigned int num_phis = reduc_info->reduc_initial_values.length ();
> +  auto_vec<tree, 16> main_loop_results (num_phis);
> +  auto_vec<tree, 16> initial_values (num_phis);
> +  if (edge main_loop_edge = loop_vinfo->main_loop_edge)
> +    {
> +      /* The epilogue loop can be entered either from the main loop or
> +        from an earlier guard block.  */
> +      edge skip_edge = loop_vinfo->skip_main_loop_edge;
> +      for (tree incoming_value : reduc_info->reduc_initial_values)
> +       {
> +         /* Look for:
> +
> +              INCOMING_VALUE = phi<MAIN_LOOP_RESULT(main loop),
> +                                   INITIAL_VALUE(guard block)>.  */
> +         gcc_assert (TREE_CODE (incoming_value) == SSA_NAME);
> +
> +         gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (incoming_value));
> +         gcc_assert (gimple_bb (phi) == main_loop_edge->dest);
> +
> +         tree from_main_loop = PHI_ARG_DEF_FROM_EDGE (phi, main_loop_edge);
> +         tree from_skip = PHI_ARG_DEF_FROM_EDGE (phi, skip_edge);
> +
> +         main_loop_results.quick_push (from_main_loop);
> +         initial_values.quick_push (from_skip);
> +       }
> +    }
> +  else
> +    /* The main loop dominates the epilogue loop.  */
> +    main_loop_results.splice (reduc_info->reduc_initial_values);
> +
> +  /* See if the main loop has the kind of accumulator we need.  */
> +  vect_reusable_accumulator *accumulator
> +    = main_loop_vinfo->reusable_accumulators.get (main_loop_results[0]);
> +  if (!accumulator
> +      || num_phis != accumulator->reduc_info->reduc_scalar_results.length ()
> +      || !std::equal (main_loop_results.begin (), main_loop_results.end (),
> +                     accumulator->reduc_info->reduc_scalar_results.begin ()))
> +    return false;
> +
> +  /* For now, only handle the case in which both loops are operating on the
> +     same vector types.  In future we could reduce wider vectors to narrower
> +     ones as well.  */
> +  tree vectype = STMT_VINFO_VECTYPE (reduc_info);
> +  tree old_vectype = TREE_TYPE (accumulator->reduc_input);
> +  if (!useless_type_conversion_p (old_vectype, vectype))

It should indeed be quite trivial to handle, likewise the case where we
have multiple PHIs - just reduce to a single input vector and have the
possibly multiple input vectors in the epilogue filled with neutral
elements.  I'll see if I can cook up stuff for this next week.
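
Roughly what I have in mind, as a standalone sketch rather than
vectorizer code.  Vec, add, seed_epilogue and horizontal_sum are
made-up names, and a plain sum with neutral value 0 is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::array<int, 4>;  // hypothetical 4-lane vector

// Lane-wise addition of two vectors.
Vec add(const Vec &a, const Vec &b) {
  Vec r{};
  for (std::size_t lane = 0; lane < 4; ++lane)
    r[lane] = a[lane] + b[lane];
  return r;
}

// Fold all main-loop accumulators into a single vector, then seed the
// epilogue's accumulators with that vector followed by neutral
// (all-zero) vectors.
std::vector<Vec> seed_epilogue(const std::vector<Vec> &main_accs,
                               std::size_t epilogue_ncopies) {
  Vec folded = {0, 0, 0, 0};
  for (const Vec &acc : main_accs)
    folded = add(folded, acc);

  std::vector<Vec> seeds(epilogue_ncopies, Vec{0, 0, 0, 0});
  if (!seeds.empty())
    seeds[0] = folded;
  return seeds;
}

// Final vector->scalar reduction over every accumulator.
int horizontal_sum(const std::vector<Vec> &accs) {
  int total = 0;
  for (const Vec &acc : accs)
    for (int lane : acc)
      total += lane;
  return total;
}
```

Since the extra epilogue inputs only hold neutral elements, the final
reduction sees the same total as reducing the main-loop accumulators
directly.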

> +    return false;
> +
> +  /* Non-SLP reductions might apply an adjustment after the reduction
> +     operation, in order to simplify the initialization of the accumulator.
> +     If the epilogue loop carries on from where the main loop left off,
> +     it should apply the same adjustment to the final reduction result.
> +
> +     If the epilogue loop can also be entered directly (rather than via
> +     the main loop), we need to be able to handle that case in the same way,
> +     with the same adjustment.  (In principle we could add a PHI node
> +     to select the correct adjustment, but in practice that shouldn't be
> +     necessary.)  */
> +  tree main_adjustment
> +    = STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (accumulator->reduc_info);
> +  if (loop_vinfo->main_loop_edge && main_adjustment)
> +    {
> +      gcc_assert (num_phis == 1);
> +      tree initial_value = initial_values[0];
> +      /* Check that we can use INITIAL_VALUE as the adjustment and
> +        initialize the accumulator with a neutral value instead.  */
> +      if (!operand_equal_p (initial_value, main_adjustment))
> +       return false;
> +      tree_code code = STMT_VINFO_REDUC_CODE (reduc_info);
> +      initial_values[0] = neutral_op_for_reduction (TREE_TYPE (initial_value),
> +                                                   code, initial_value);
> +    }
> +  STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (reduc_info) = main_adjustment;
> +  reduc_info->reduc_initial_values.truncate (0);
> +  reduc_info->reduc_initial_values.splice (initial_values);
> +  reduc_info->reused_accumulator =3D accumulator;
> +  return true;
> +}
> +
>  /* Function vect_create_epilog_for_reduction
>
>     Create code at the loop-epilog to finalize the result of a reduction
> @@ -4915,7 +5028,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>    gimple *use_stmt;
>    auto_vec<tree> reduc_inputs;
>    int j, i;
> -  auto_vec<tree> scalar_results;
> +  vec<tree> &scalar_results = reduc_info->reduc_scalar_results;
>    unsigned int group_size = 1, k;
>    auto_vec<gimple *> phis;
>    /* SLP reduction without reduction chain, e.g.,
> @@ -4941,16 +5054,12 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>    gcc_assert (vectype);
>    mode = TYPE_MODE (vectype);
>
> -  tree initial_def = NULL;
>    tree induc_val = NULL_TREE;
>    tree adjustment_def = NULL;
>    if (slp_node)
>      ;
>    else
>      {
> -      /* Get at the scalar def before the loop, that defines the initial value
> -        of the reduction variable.  */
> -      initial_def = vect_phi_initial_value (reduc_def_stmt);
>        /* Optimize: for induction condition reduction, if we can't use zero
>           for induc_val, use initial_def.  */
>        if (STMT_VINFO_REDUC_TYPE (reduc_info) == INTEGER_INDUC_COND_REDUCTION)
> @@ -5196,6 +5305,37 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>        reduc_inputs.safe_push (single_input);
>      }
>
> +  tree orig_reduc_input = reduc_inputs[0];
> +
> +  /* If this loop is an epilogue loop that can be skipped after the
> +     main loop, we can only share a reduction operation between the
> +     main loop and the epilogue if we put it at the target of the
> +     skip edge.

Do you have a testcase where we cannot do this?
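
For reference, my reading of the join-block placement as a toy model.
This is a sketch under assumptions: Vec, reduce and join_block_result
are made-up names, and a plain sum stands in for the reduction.

```cpp
#include <array>
#include <cassert>

using Vec = std::array<int, 4>;  // hypothetical 4-lane vector

// Vector->scalar reduction (a sum in this sketch).
int reduce(const Vec &v) { return v[0] + v[1] + v[2] + v[3]; }

// Toy model of the join block: a PHI selects the epilogue loop's
// accumulator or, if the epilogue was skipped, the main loop's
// accumulator, and one reduction then serves both paths.
int join_block_result(bool epilogue_skipped, const Vec &main_acc,
                      const Vec &epilogue_acc) {
  const Vec &phi = epilogue_skipped ? main_acc : epilogue_acc;
  return reduce(phi);
}
```

The reduction is emitted once, after the PHI, instead of once per loop.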

> +     We can still reuse accumulators if this check fails.  Doing so has
> +     the minor(?) benefit of making the epilogue loop's scalar result
> +     independent of the main loop's scalar result.  */
> +  bool unify_with_main_loop_p = false;
> +  if (reduc_info->reused_accumulator
> +      && loop_vinfo->skip_this_loop_edge
> +      && single_succ_p (exit_bb)
> +      && single_succ (exit_bb) == loop_vinfo->skip_this_loop_edge->dest)
> +    {
> +      unify_with_main_loop_p = true;
> +
> +      basic_block reduc_block = loop_vinfo->skip_this_loop_edge->dest;
> +      reduc_inputs[0] = make_ssa_name (vectype);
> +      gphi *new_phi = create_phi_node (reduc_inputs[0], reduc_block);
> +      add_phi_arg (new_phi, orig_reduc_input, single_succ_edge (exit_bb),
> +                  UNKNOWN_LOCATION);
> +      add_phi_arg (new_phi, reduc_info->reused_accumulator->reduc_input,
> +                  loop_vinfo->skip_this_loop_edge, UNKNOWN_LOCATION);
> +      exit_gsi = gsi_after_labels (reduc_block);
> +    }
> +
> +  /* Shouldn't be used beyond this point.  */
> +  exit_bb = nullptr;
> +
>    if (STMT_VINFO_REDUC_TYPE (reduc_info) == COND_REDUCTION
>        && reduc_fn != IFN_LAST)
>      {
> @@ -5405,6 +5545,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>              the same as initial_def already.  */
>           tree zcompare = build2 (EQ_EXPR, boolean_type_node, new_temp,
>                                   induc_val);
> +         tree initial_def = reduc_info->reduc_initial_values[0];
>
>           tmp = make_ssa_name (new_scalar_dest);
>           epilog_stmt = gimple_build_assign (tmp, COND_EXPR, zcompare,
> @@ -5425,9 +5566,6 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>        gcc_assert (reduc_inputs.length () == 1);
>        gcc_assert (pow2p_hwi (group_size));
>
> -      slp_tree orig_phis_slp_node = slp_node_instance->reduc_phis;
> -      vec<stmt_vec_info> orig_phis
> -       = SLP_TREE_SCALAR_STMTS (orig_phis_slp_node);
>        gimple_seq seq = NULL;
>
>        /* Build a vector {0, 1, 2, ...}, with the same number of elements
> @@ -5452,7 +5590,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo=
p_vinfo,
>         {
>           tree initial_value =3D NULL_TREE;
>           if (REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> -           initial_value =3D vect_phi_initial_value (orig_phis[0]);
> +           initial_value =3D reduc_info->reduc_initial_values[0];
>           neutral_op =3D neutral_op_for_reduction (TREE_TYPE (vectype), c=
ode,
>                                                  initial_value);
>         }
> @@ -5466,7 +5604,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo=
p_vinfo,
>              for MIN and MAX reduction, for example.  */
>           if (!neutral_op)
>             {
> -             tree scalar_value =3D vect_phi_initial_value (orig_phis[i])=
;
> +             tree scalar_value =3D reduc_info->reduc_initial_values[i];
>               scalar_value =3D gimple_convert (&seq, TREE_TYPE (vectype),
>                                              scalar_value);
>               vector_identity =3D gimple_build_vector_from_val (&seq, vec=
type,
> @@ -5780,6 +5918,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo=
p_vinfo,
>              the same as initial_def already.  */
>           tree zcompare =3D build2 (EQ_EXPR, boolean_type_node, new_temp,
>                                   induc_val);
> +         tree initial_def =3D reduc_info->reduc_initial_values[0];
>
>           tree tmp =3D make_ssa_name (new_scalar_dest);
>           epilog_stmt =3D gimple_build_assign (tmp, COND_EXPR, zcompare,
> @@ -5819,6 +5958,12 @@ vect_create_epilog_for_reduction (loop_vec_info lo=
op_vinfo,
>        scalar_results[0] =3D new_temp;
>      }
>
> +  /* Record this operation if it could be reused by the epilogue loop.  =
*/
> +  if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D TREE_CODE_REDUCTION
> +      && !double_reduc)

What's the issue with double_reduc?

> +    loop_vinfo->reusable_accumulators.put (scalar_results[0],
> +                                          { orig_reduc_input, reduc_info });
> +
>    if (double_reduc)
>      loop = outer_loop;
>
> @@ -5886,6 +6031,17 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>          {
>            /* Replace the uses:  */
>            orig_name = PHI_RESULT (exit_phi);
> +
> +         /* Look for a single use at the target of the skip edge.  */
> +         if (unify_with_main_loop_p)
> +           {
> +             use_operand_p use_p;
> +             gimple *user;
> +             if (!single_imm_use (orig_name, &use_p, &user))
> +               gcc_unreachable ();
> +             orig_name = gimple_get_lhs (user);
> +           }
> +
>            scalar_result = scalar_results[k];
>            FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, orig_name)
>             {
> @@ -7421,16 +7577,32 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>        else
>         {
>           gcc_assert (slp_node == slp_node_instance->reduc_phis);
> -         tree initial_value = NULL_TREE;
> +         vec<tree> &initial_values = reduc_info->reduc_initial_values;
> +         vec<stmt_vec_info> &stmts = SLP_TREE_SCALAR_STMTS (slp_node);
> +
> +         unsigned int num_phis = stmts.length ();
>           if (REDUC_GROUP_FIRST_ELEMENT (reduc_stmt_info))
> -           initial_value = vect_phi_initial_value (phi);
> -         tree_code code = STMT_VINFO_REDUC_CODE (reduc_info);
> -         tree neutral_op = neutral_op_for_reduction (TREE_TYPE (vectype_out),
> -                                                     code, initial_value);
> -         get_initial_defs_for_reduction (loop_vinfo, reduc_info,
> -                                         slp_node_instance->reduc_phis,
> -                                         &vec_initial_defs, vec_num,
> -                                         initial_value != NULL, neutral_op);
> +           num_phis = 1;
> +         initial_values.reserve (num_phis);
> +         for (unsigned int i = 0; i < num_phis; ++i)
> +           {
> +             gphi *this_phi = as_a<gphi *> (stmts[i]->stmt);
> +             initial_values.quick_push (vect_phi_initial_value (this_phi));
> +           }
> +         if (vec_num == 1)
> +           vect_find_reusable_accumulator (loop_vinfo, reduc_info);
> +         if (!initial_values.is_empty ())
> +           {
> +             tree initial_value
> +               = (num_phis == 1 ? initial_values[0] : NULL_TREE);
> +             tree_code code = STMT_VINFO_REDUC_CODE (reduc_info);
> +             tree neutral_op
> +               = neutral_op_for_reduction (TREE_TYPE (vectype_out),
> +                                           code, initial_value);
> +             get_initial_defs_for_reduction (loop_vinfo, reduc_info,
> +                                             &vec_initial_defs, vec_num,
> +                                             stmts.length (), neutral_op);
> +           }
>         }
>      }
>    else
> @@ -7438,6 +7610,7 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>        /* Get at the scalar def before the loop, that defines the initial
>          value of the reduction variable.  */
>        tree initial_def = vect_phi_initial_value (phi);
> +      reduc_info->reduc_initial_values.safe_push (initial_def);
>        /* Optimize: if initial_def is for REDUC_MAX smaller than the base
>          and we can't use zero for induc_val, use initial_def.  Similarly
>          for REDUC_MIN and initial_def larger than the base.  */
> @@ -7474,21 +7647,30 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>                                            initial_def, initial_def);
>        else
>         {
> -         enum tree_code code = STMT_VINFO_REDUC_CODE (reduc_info);
> -         tree neutral_op = neutral_op_for_reduction (TREE_TYPE (initial_def),
> -                                                     code, initial_def);
> -         gcc_assert (neutral_op);
> -         /* Try to simplify the vector initialization by applying an
> -            adjustment after the reduction has been performed.  */
> -         if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
> -             && !operand_equal_p (neutral_op, initial_def))
> +         if (ncopies == 1)
> +           vect_find_reusable_accumulator (loop_vinfo, reduc_info);
> +         if (!reduc_info->reduc_initial_values.is_empty ())
>             {
> -             STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (reduc_info) = initial_def;
> -             initial_def = neutral_op;
> +             initial_def = reduc_info->reduc_initial_values[0];
> +             enum tree_code code = STMT_VINFO_REDUC_CODE (reduc_info);
> +             tree neutral_op
> +               = neutral_op_for_reduction (TREE_TYPE (initial_def),
> +                                           code, initial_def);
> +             gcc_assert (neutral_op);
> +             /* Try to simplify the vector initialization by applying an
> +                adjustment after the reduction has been performed.  */
> +             if (!reduc_info->reused_accumulator
> +                 && STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
> +                 && !operand_equal_p (neutral_op, initial_def))
> +               {
> +                 STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (reduc_info)
> +                   = initial_def;
> +                 initial_def = neutral_op;
> +               }
> +             vec_initial_def
> +               = get_initial_def_for_reduction (loop_vinfo, reduc_info,
> +                                                initial_def, neutral_op);
>             }
> -         vec_initial_def
> -           = get_initial_def_for_reduction (loop_vinfo, reduc_info,
> -                                            initial_def, neutral_op);
>         }
>      }
>
> @@ -7499,6 +7681,17 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>         vec_initial_defs.quick_push (vec_initial_def);
>      }
>
> +  if (auto *accumulator = reduc_info->reused_accumulator)
> +    {
> +      if (loop_vinfo->main_loop_edge)
> +       vec_initial_defs[0]
> +         = vect_get_main_loop_result (loop_vinfo, accumulator->reduc_input,
> +                                      vec_initial_defs[0]);
> +      else
> +       vec_initial_defs.safe_push (accumulator->reduc_input);
> +      gcc_assert (vec_initial_defs.length () == 1);
> +    }
> +
>    /* Generate the reduction PHIs upfront.  */
>    for (i = 0; i < vec_num; i++)
>      {
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c
> new file mode 100644
> index 00000000000..fb817b73d77
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c
> @@ -0,0 +1,77 @@
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +unsigned short __attribute__((noipa))
> +add_loop (unsigned short *x, int n)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < n; ++i)
> +    res += x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +min_loop (unsigned short *x, int n)
> +{
> +  unsigned short res = ~0;
> +  for (int i = 0; i < n; ++i)
> +    res = res < x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +max_loop (unsigned short *x, int n)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < n; ++i)
> +    res = res > x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +and_loop (unsigned short *x, int n)
> +{
> +  unsigned short res = ~0;
> +  for (int i = 0; i < n; ++i)
> +    res &= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +or_loop (unsigned short *x, int n)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < n; ++i)
> +    res |= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +eor_loop (unsigned short *x, int n)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < n; ++i)
> +    res ^= x[i];
> +  return res;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c
> new file mode 100644
> index 00000000000..1dd579be701
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c
> @@ -0,0 +1,49 @@
> +/* { dg-do run { target aarch64_sve_hw } } */
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_10.c"
> +
> +int
> +main (void)
> +{
> +  unsigned short x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = (i + 1) * (i + 2);
> +
> +  if (add_loop (x, 0) != 0
> +      || add_loop (x, 11) != 572
> +      || add_loop (x, 0x100) != 22016
> +      || add_loop (x, 0xfff) != 20480
> +      || max_loop (x, 0) != 0
> +      || max_loop (x, 11) != 132
> +      || max_loop (x, 0x100) != 65280
> +      || max_loop (x, 0xfff) != 65504
> +      || or_loop (x, 0) != 0
> +      || or_loop (x, 11) != 0xfe
> +      || or_loop (x, 0x80) != 0x7ffe
> +      || or_loop (x, 0xb4) != 0x7ffe
> +      || or_loop (x, 0xb5) != 0xfffe
> +      || eor_loop (x, 0) != 0
> +      || eor_loop (x, 11) != 0xe8
> +      || eor_loop (x, 0x100) != 0xcf00
> +      || eor_loop (x, 0xfff) != 0xa000)
> +    __builtin_abort ();
> +
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ~x[i];
> +
> +  if (min_loop (x, 0) != 65535
> +      || min_loop (x, 11) != 65403
> +      || min_loop (x, 0x100) != 255
> +      || min_loop (x, 0xfff) != 31
> +      || and_loop (x, 0) != 0xffff
> +      || and_loop (x, 11) != 0xff01
> +      || and_loop (x, 0x80) != 0x8001
> +      || and_loop (x, 0xb4) != 0x8001
> +      || and_loop (x, 0xb5) != 1)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c
> new file mode 100644
> index 00000000000..f99ef4aa865
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c
> @@ -0,0 +1,71 @@
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +unsigned short __attribute__((noipa))
> +add_loop (unsigned short *x, unsigned short res)
> +{
> +  for (int i = 0; i < 0xfff; ++i)
> +    res += x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +min_loop (unsigned short *x, unsigned short res)
> +{
> +  for (int i = 0; i < 0xfff; ++i)
> +    res = res < x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +max_loop (unsigned short *x, unsigned short res)
> +{
> +  for (int i = 0; i < 0xfff; ++i)
> +    res = res > x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +and_loop (unsigned short *x, unsigned short res)
> +{
> +  for (int i = 0; i < 0xfff; ++i)
> +    res &= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +or_loop (unsigned short *x, unsigned short res)
> +{
> +  for (int i = 0; i < 0xfff; ++i)
> +    res |= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +eor_loop (unsigned short *x, unsigned short res)
> +{
> +  for (int i = 0; i < 0xfff; ++i)
> +    res ^= x[i];
> +  return res;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c
> new file mode 100644
> index 00000000000..5b41560d2ef
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c
> @@ -0,0 +1,34 @@
> +/* { dg-do run { target aarch64_sve256_hw } } */
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_11.c"
> +
> +int
> +main (void)
> +{
> +  unsigned short x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = (i + 1) * (i + 2);
> +
> +  if (add_loop (x, 42) != 20522
> +      || max_loop (x, 65503) != 65504
> +      || max_loop (x, 65505) != 65505
> +      || or_loop (x, 0) != 0xfffe
> +      || or_loop (x, 1) != 0xffff
> +      || eor_loop (x, 0) != 0xa000
> +      || eor_loop (x, 0xbfff) != 0x1fff)
> +    __builtin_abort ();
> +
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ~x[i];
> +
> +  if (min_loop (x, 32) != 31
> +      || min_loop (x, 30) != 30
> +      || and_loop (x, 0xff) != 1
> +      || and_loop (x, 0) != 0)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c
> new file mode 100644
> index 00000000000..d32b81a61bc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c
> @@ -0,0 +1,71 @@
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +unsigned short __attribute__((noipa))
> +add_loop (unsigned short *x, int n, unsigned short res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res += x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +min_loop (unsigned short *x, int n, unsigned short res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res = res < x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +max_loop (unsigned short *x, int n, unsigned short res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res = res > x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +and_loop (unsigned short *x, int n, unsigned short res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res &= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +or_loop (unsigned short *x, int n, unsigned short res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res |= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +eor_loop (unsigned short *x, int n, unsigned short res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res ^= x[i];
> +  return res;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c
> new file mode 100644
> index 00000000000..929b81a9705
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c
> @@ -0,0 +1,66 @@
> +/* { dg-do run { target aarch64_sve_hw } } */
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_12.c"
> +
> +int
> +main (void)
> +{
> +  unsigned short x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = (i + 1) * (i + 2);
> +
> +  if (add_loop (x, 0, 10) != 10
> +      || add_loop (x, 11, 42) != 614
> +      || add_loop (x, 0x100, 84) != 22100
> +      || add_loop (x, 0xfff, 20) != 20500
> +      || max_loop (x, 0, 10) != 10
> +      || max_loop (x, 11, 131) != 132
> +      || max_loop (x, 11, 133) != 133
> +      || max_loop (x, 0x100, 65279) != 65280
> +      || max_loop (x, 0x100, 65281) != 65281
> +      || max_loop (x, 0xfff, 65503) != 65504
> +      || max_loop (x, 0xfff, 65505) != 65505
> +      || or_loop (x, 0, 0x71) != 0x71
> +      || or_loop (x, 11, 0) != 0xfe
> +      || or_loop (x, 11, 0xb3c) != 0xbfe
> +      || or_loop (x, 0x80, 0) != 0x7ffe
> +      || or_loop (x, 0x80, 1) != 0x7fff
> +      || or_loop (x, 0xb4, 0) != 0x7ffe
> +      || or_loop (x, 0xb4, 1) != 0x7fff
> +      || or_loop (x, 0xb5, 0) != 0xfffe
> +      || or_loop (x, 0xb5, 1) != 0xffff
> +      || eor_loop (x, 0, 0x3e) != 0x3e
> +      || eor_loop (x, 11, 0) != 0xe8
> +      || eor_loop (x, 11, 0x1ff) != 0x117
> +      || eor_loop (x, 0x100, 0) != 0xcf00
> +      || eor_loop (x, 0x100, 0xeee) != 0xc1ee
> +      || eor_loop (x, 0xfff, 0) != 0xa000
> +      || eor_loop (x, 0xfff, 0x8888) != 0x2888)
> +    __builtin_abort ();
> +
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ~x[i];
> +
> +  if (min_loop (x, 0, 10000) != 10000
> +      || min_loop (x, 11, 65404) != 65403
> +      || min_loop (x, 11, 65402) != 65402
> +      || min_loop (x, 0x100, 256) != 255
> +      || min_loop (x, 0x100, 254) != 254
> +      || min_loop (x, 0xfff, 32) != 31
> +      || min_loop (x, 0xfff, 30) != 30
> +      || and_loop (x, 0, 0x1234) != 0x1234
> +      || and_loop (x, 11, 0xffff) != 0xff01
> +      || and_loop (x, 11, 0xcdef) != 0xcd01
> +      || and_loop (x, 0x80, 0xffff) != 0x8001
> +      || and_loop (x, 0x80, 0xfffe) != 0x8000
> +      || and_loop (x, 0xb4, 0xffff) != 0x8001
> +      || and_loop (x, 0xb4, 0xfffe) != 0x8000
> +      || and_loop (x, 0xb5, 0xffff) != 1
> +      || and_loop (x, 0xb5, 0xfffe) != 0)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c
> new file mode 100644
> index 00000000000..ce2b8f2fcdc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c
> @@ -0,0 +1,101 @@
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +void __attribute__((noipa))
> +add_loop (unsigned int *x, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < 0x7ff; ++i)
> +    {
> +      res0 += x[i * 2];
> +      res1 += x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +min_loop (unsigned int *x, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < 0x7ff; ++i)
> +    {
> +      res0 = res0 < x[i * 2] ? res0 : x[i * 2];
> +      res1 = res1 < x[i * 2 + 1] ? res1 : x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +max_loop (unsigned int *x, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < 0x7ff; ++i)
> +    {
> +      res0 = res0 > x[i * 2] ? res0 : x[i * 2];
> +      res1 = res1 > x[i * 2 + 1] ? res1 : x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +and_loop (unsigned int *x, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < 0x7ff; ++i)
> +    {
> +      res0 &= x[i * 2];
> +      res1 &= x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +or_loop (unsigned int *x, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < 0x7ff; ++i)
> +    {
> +      res0 |= x[i * 2];
> +      res1 |= x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +eor_loop (unsigned int *x, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < 0x7ff; ++i)
> +    {
> +      res0 ^= x[i * 2];
> +      res1 ^= x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c
> new file mode 100644
> index 00000000000..5514d8d6b3b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c
> @@ -0,0 +1,61 @@
> +/* { dg-do run { target aarch64_sve256_hw } } */
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_13.c"
> +
> +int
> +main (void)
> +{
> +  unsigned int x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ((i + 1) * (i + 2)) & 0xfffff;
> +
> +  unsigned int add_res[2] = { 42, 1111 };
> +  add_loop (x, add_res);
> +  if (add_res[0] != 968538154
> +      || add_res[1] != 964340823)
> +    __builtin_abort ();
> +
> +  unsigned int max_res1[2] = { 0, 0 };
> +  max_loop (x, max_res1);
> +  if (max_res1[0] != 1048150
> +      || max_res1[1] != 1045506)
> +    __builtin_abort ();
> +
> +  unsigned int max_res2[2] = { 1048151, 1045507 };
> +  max_loop (x, max_res2);
> +  if (max_res2[0] != 1048151
> +      || max_res2[1] != 1045507)
> +    __builtin_abort ();
> +
> +  unsigned int or_res[2] = { 0x1000000, 0x2000000 };
> +  or_loop (x, or_res);
> +  if (or_res[0] != 0x10ffffe
> +      || or_res[1] != 0x20ffffe)
> +    __builtin_abort ();
> +
> +  unsigned int eor_res[2] = { 0x1000000, 0x2000000 };
> +  eor_loop (x, eor_res);
> +  if (eor_res[0] != 0x1010000
> +      || eor_res[1] != 0x20b5000)
> +    __builtin_abort ();
> +
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ~x[i] & 0xfffff;
> +
> +  unsigned int min_res1[2] = { 500, 4000 };
> +  min_loop (x, min_res1);
> +  if (min_res1[0] != 425
> +      || min_res1[1] != 3069)
> +    __builtin_abort ();
> +
> +  unsigned int min_res2[2] = { 424, 3068 };
> +  min_loop (x, min_res2);
> +  if (min_res2[0] != 424
> +      || min_res2[1] != 3068)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c
> new file mode 100644
> index 00000000000..3be611e4b37
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c
> @@ -0,0 +1,107 @@
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +void __attribute__((noipa))
> +add_loop (unsigned int *x, int n, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res0 += x[i * 2];
> +      res1 += x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +min_loop (unsigned int *x, int n, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res0 = res0 < x[i * 2] ? res0 : x[i * 2];
> +      res1 = res1 < x[i * 2 + 1] ? res1 : x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +max_loop (unsigned int *x, int n, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res0 = res0 > x[i * 2] ? res0 : x[i * 2];
> +      res1 = res1 > x[i * 2 + 1] ? res1 : x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +and_loop (unsigned int *x, int n, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res0 &= x[i * 2];
> +      res1 &= x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +or_loop (unsigned int *x, int n, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res0 |= x[i * 2];
> +      res1 |= x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +void __attribute__((noipa))
> +eor_loop (unsigned int *x, int n, unsigned int *res)
> +{
> +  unsigned int res0 = res[0];
> +  unsigned int res1 = res[1];
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res0 ^= x[i * 2];
> +      res1 ^= x[i * 2 + 1];
> +    }
> +  res[0] = res0;
> +  res[1] = res1;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tuaddv\t} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tuminv\t} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tumaxv\t} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tandv\t} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torv\t} 2 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teorv\t} 2 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c
> new file mode 100644
> index 00000000000..ccaa770e9b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c
> @@ -0,0 +1,187 @@
> +/* { dg-do run { target aarch64_sve256_hw } } */
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_14.c"
> +
> +int
> +main (void)
> +{
> +  unsigned int x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ((i + 1) * (i + 2)) & 0xfffff;
> +
> +  unsigned int add_res1[2] = { 11, 22 };
> +  add_loop (x, 0, add_res1);
> +  if (add_res1[0] != 11
> +      || add_res1[1] != 22)
> +    __builtin_abort ();
> +
> +  unsigned int add_res2[2] = { 10, 20 };
> +  add_loop (x, 11, add_res2);
> +  if (add_res2[0] != 1902
> +      || add_res2[1] != 2176)
> +    __builtin_abort ();
> +
> +  unsigned int add_res3[2] = { 15, 30 };
> +  add_loop (x, 0x100, add_res3);
> +  if (add_res3[0] != 22435087
> +      || add_res3[1] != 22566686)
> +    __builtin_abort ();
> +
> +  unsigned int add_res4[2] = { 100, 200 };
> +  add_loop (x, 0x11f, add_res4);
> +  if (add_res4[0] != 31602244
> +      || add_res4[1] != 31767656)
> +    __builtin_abort ();
> +
> +  unsigned int max_res1[2] = { 461, 500 };
> +  max_loop (x, 11, max_res1);
> +  if (max_res1[0] != 462
> +      || max_res1[1] != 506)
> +    __builtin_abort ();
> +
> +  unsigned int max_res2[2] = { 463, 507 };
> +  max_loop (x, 11, max_res2);
> +  if (max_res2[0] != 463
> +      || max_res2[1] != 507)
> +    __builtin_abort ();
> +
> +  unsigned int max_res3[2] = { 1000000, 1000000 };
> +  max_loop (x, 0x200, max_res3);
> +  if (max_res3[0] != 1047552
> +      || max_res3[1] != 1045506)
> +    __builtin_abort ();
> +
> +  unsigned int max_res4[2] = { 1047553, 1045507 };
> +  max_loop (x, 0x200, max_res4);
> +  if (max_res4[0] != 1047553
> +      || max_res4[1] != 1045507)
> +    __builtin_abort ();
> +
> +  unsigned int max_res5[2] = { 300000, 30000 };
> +  max_loop (x, 0x11f, max_res5);
> +  if (max_res5[0] != 328902
> +      || max_res5[1] != 330050)
> +    __builtin_abort ();
> +
> +  unsigned int max_res6[2] = { 328903, 330051 };
> +  max_loop (x, 0x11f, max_res6);
> +  if (max_res6[0] != 328903
> +      || max_res6[1] != 330051)
> +    __builtin_abort ();
> +
> +  unsigned int or_res1[2] = { 11, 22 };
> +  or_loop (x, 0, or_res1);
> +  if (or_res1[0] != 11
> +      || or_res1[1] != 22)
> +    __builtin_abort ();
> +
> +  unsigned int or_res2[2] = { 0x200000, 0xe00000 };
> +  or_loop (x, 11, or_res2);
> +  if (or_res2[0] != 0x2001fe
> +      || or_res2[1] != 0xe001fe)
> +    __builtin_abort ();
> +
> +  unsigned int or_res3[2] = { 0x800000, 0x700000 };
> +  or_loop (x, 0x40, or_res3);
> +  if (or_res3[0] != 0x803ffe
> +      || or_res3[1] != 0x707ffe)
> +    __builtin_abort ();
> +
> +  unsigned int or_res4[2] = { 0x100001, 0x300000 };
> +  or_loop (x, 0x4f, or_res4);
> +  if (or_res4[0] != 0x107fff
> +      || or_res4[1] != 0x307ffe)
> +    __builtin_abort ();
> +
> +  unsigned int eor_res1[2] = { 11, 22 };
> +  eor_loop (x, 0, eor_res1);
> +  if (eor_res1[0] != 11
> +      || eor_res1[1] != 22)
> +    __builtin_abort ();
> +
> +  unsigned int eor_res2[2] = { 0x2000ff, 0xe000ff };
> +  eor_loop (x, 11, eor_res2);
> +  if (eor_res2[0] != 0x2001cf
> +      || eor_res2[1] != 0xe000b7)
> +    __builtin_abort ();
> +
> +  unsigned int eor_res3[2] = { 0x805000, 0x70f000 };
> +  eor_loop (x, 0x100, eor_res3);
> +  if (eor_res3[0] != 0x824200
> +      || eor_res3[1] != 0x77dc00)
> +    __builtin_abort ();
> +
> +  unsigned int eor_res4[2] = { 0x101201, 0x300f00 };
> +  eor_loop (x, 0x11f, eor_res4);
> +  if (eor_res4[0] != 0x178801
> +      || eor_res4[1] != 0x337240)
> +    __builtin_abort ();
> +
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ~x[i] & 0xfffff;
> +
> +  unsigned int min_res1[2] = { 1048200, 1048100 };
> +  min_loop (x, 11, min_res1);
> +  if (min_res1[0] != 1048113
> +      || min_res1[1] != 1048069)
> +    __builtin_abort ();
> +
> +  unsigned int min_res2[2] = { 1048112, 1048068 };
> +  min_loop (x, 11, min_res2);
> +  if (min_res2[0] != 1048112
> +      || min_res2[1] != 1048068)
> +    __builtin_abort ();
> +
> +  unsigned int min_res3[2] = { 10000, 10000 };
> +  min_loop (x, 0x200, min_res3);
> +  if (min_res3[0] != 1023
> +      || min_res3[1] != 3069)
> +    __builtin_abort ();
> +
> +  unsigned int min_res4[2] = { 1022, 3068 };
> +  min_loop (x, 0x200, min_res4);
> +  if (min_res4[0] != 1022
> +      || min_res4[1] != 3068)
> +    __builtin_abort ();
> +
> +  unsigned int min_res5[2] = { 719680, 718530 };
> +  min_loop (x, 0x11f, min_res5);
> +  if (min_res5[0] != 719673
> +      || min_res5[1] != 718525)
> +    __builtin_abort ();
> +
> +  unsigned int min_res6[2] = { 719672, 718524 };
> +  min_loop (x, 0x11f, min_res6);
> +  if (min_res6[0] != 719672
> +      || min_res6[1] != 718524)
> +    __builtin_abort ();
> +
> +  unsigned int and_res1[2] = { 11, 22 };
> +  and_loop (x, 0, and_res1);
> +  if (and_res1[0] != 11
> +      || and_res1[1] != 22)
> +    __builtin_abort ();
> +
> +  unsigned int and_res2[2] = { 0xf5cff, 0xf78ff };
> +  and_loop (x, 11, and_res2);
> +  if (and_res2[0] != 0xf5c01
> +      || and_res2[1] != 0xf7801)
> +    __builtin_abort ();
> +
> +  unsigned int and_res3[2] = { 0x7efff, 0xecfff };
> +  and_loop (x, 0x40, and_res3);
> +  if (and_res3[0] != 0x7c001
> +      || and_res3[1] != 0xe8001)
> +    __builtin_abort ();
> +
> +  unsigned int and_res4[2] = { 0xffffff, 0xffffff };
> +  and_loop (x, 0x4f, and_res4);
> +  if (and_res4[0] != 0xf8001
> +      || and_res4[1] != 0xf8001)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c
> new file mode 100644
> index 00000000000..15b1ade30e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c
> @@ -0,0 +1,16 @@
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +int __attribute__((noipa))
> +add_loop (int *x, int n, int res)
> +{
> +  for (int i = 0; i < n; ++i)
> +    {
> +      res += x[i * 2];
> +      res += x[i * 2 + 1];
> +    }
> +  return res;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c
> new file mode 100644
> index 00000000000..3207fce5be3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c
> @@ -0,0 +1,22 @@
> +/* { dg-do run { target aarch64_sve256_hw } } */
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_15.c"
> +
> +int
> +main (void)
> +{
> +  int x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ((i + 1) * (i + 2)) & 0xfffff;
> +
> +  if (add_loop (x, 0, 33) != 33
> +      || add_loop (x, 11, 30) != 4078
> +      || add_loop (x, 0x100, 45) != 45001773
> +      || add_loop (x, 0x11f, 300) != 63369900)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c
> new file mode 100644
> index 00000000000..b839821d6bb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c
> @@ -0,0 +1,77 @@
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +unsigned short __attribute__((noipa))
> +add_loop (unsigned short *x)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < 0xfff; ++i)
> +    res += x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +min_loop (unsigned short *x)
> +{
> +  unsigned short res = ~0;
> +  for (int i = 0; i < 0xfff; ++i)
> +    res = res < x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +max_loop (unsigned short *x)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < 0xfff; ++i)
> +    res = res > x[i] ? res : x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +and_loop (unsigned short *x)
> +{
> +  unsigned short res = ~0;
> +  for (int i = 0; i < 0xfff; ++i)
> +    res &= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +or_loop (unsigned short *x)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < 0xfff; ++i)
> +    res |= x[i];
> +  return res;
> +}
> +
> +unsigned short __attribute__((noipa))
> +eor_loop (unsigned short *x)
> +{
> +  unsigned short res = 0;
> +  for (int i = 0; i < 0xfff; ++i)
> +    res ^= x[i];
> +  return res;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c
> new file mode 100644
> index 00000000000..aa248f53eaa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run { target aarch64_sve256_hw } } */
> +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */
> +
> +#define N 0x1100
> +
> +#include "reduc_9.c"
> +
> +int
> +main (void)
> +{
> +  unsigned short x[N];
> +  for (int i = 0; i < N; ++i)
> +    x[i] = (i + 1) * (i + 2);
> +
> +  if (add_loop (x) != 20480
> +      || max_loop (x) != 65504
> +      || or_loop (x) != 0xfffe
> +      || eor_loop (x) != 0xa000)
> +    __builtin_abort ();
> +
> +  for (int i = 0; i < N; ++i)
> +    x[i] = ~x[i];
> +
> +  if (min_loop (x) != 31
> +      || and_loop (x) != 1)
> +    __builtin_abort ();
> +
> +  return 0;
> +}