From: Richard Biener
Date: Fri, 9 Jul 2021 13:58:30 +0200
Subject: Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops
To: Richard Sandiford, GCC Patches

On Thu, Jul 8, 2021 at 2:50 PM Richard Sandiford via Gcc-patches wrote:
>
> This patch adds support for reusing a main loop's reduction accumulator
> in an epilogue loop. This in turn lets the loops share a single piece
> of vector->scalar reduction code.
>
> The patch has the following restrictions:
>
> (1) The epilogue reduction can only operate on a single vector
>     (e.g. ncopies must be 1 for non-SLP reductions, and the group size
>     must be <= the element count for SLP reductions).
>
> (2) Both loops must use the same vector mode for their accumulators.
>     This means that the patch is restricted to targets that support
>     --param vect-partial-vector-usage=1.
>
> (3) The reduction must be a standard “tree code” reduction.
>
> However, these restrictions could be lifted in future. For example,
> if the main loop operates on 128-bit vectors and the epilogue loop
> operates on 64-bit vectors, we could in future reduce the 128-bit
> vector by one stage and use the 64-bit result as the starting point
> for the epilogue result.

Yeah, I hope that can be done quickly - it should make the approach
usable on x86_64.

> The patch tries to handle chained SLP reductions, unchained SLP
> reductions and non-SLP reductions. It also handles cases in which
> the epilogue loop is entered directly (rather than via the main loop)
> and cases in which the epilogue loop can be skipped.
>
> vect_get_main_loop_result is a bit more general than the current
> patch needs.

I didn't see anything that would adjust the costing of the vectorization
(though I don't specifically remember how we cost vectorized epilogues
in general).
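For readers following along, the accumulator reuse described in the quoted summary can be pictured with a small scalar model. This is only a sketch under stated assumptions, not code from the patch: kLanes, Accumulator, main_loop, epilogue_loop, reduce_lanes and add_loop_model are made-up names, and a fixed four-lane array stands in for a vector register. The main loop accumulates full groups of lanes, the epilogue carries on with the same accumulator for the leftover elements (modelling a partial-vector epilogue that leaves inactive lanes untouched), and a single lane-to-scalar reduction runs at the end.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count standing in for the hardware vector length.
constexpr std::size_t kLanes = 4;
using Accumulator = std::array<std::uint16_t, kLanes>;

// Main loop: consume full groups of kLanes elements, one element per lane.
// Returns the index of the first element that was not processed.
std::size_t main_loop(const std::uint16_t *x, std::size_t n, Accumulator &acc) {
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes)
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      acc[lane] = static_cast<std::uint16_t>(acc[lane] + x[i + lane]);
  return i;
}

// Epilogue: handle the fewer-than-kLanes leftover elements, continuing from
// the accumulator the main loop produced. Lanes without an element are left
// untouched, which models a predicated (partial-vector) add.
void epilogue_loop(const std::uint16_t *x, std::size_t start, std::size_t n,
                   Accumulator &acc) {
  assert(start <= n && n - start < kLanes);
  for (std::size_t lane = 0; start + lane < n; ++lane)
    acc[lane] = static_cast<std::uint16_t>(acc[lane] + x[start + lane]);
}

// The one vector->scalar reduction shared by both loops.
std::uint16_t reduce_lanes(const Accumulator &acc) {
  std::uint16_t result = 0;
  for (std::uint16_t lane_value : acc)
    result = static_cast<std::uint16_t>(result + lane_value);
  return result;
}

// Scalar model of an unsigned short sum reduction with a reused accumulator.
std::uint16_t add_loop_model(const std::uint16_t *x, std::size_t n) {
  Accumulator acc{};  // all lanes start at 0, the neutral value for addition
  const std::size_t done = main_loop(x, n, acc);
  epilogue_loop(x, done, n, acc);  // reuses acc instead of starting afresh
  return reduce_lanes(acc);        // reduction code appears only once
}
```

With the seven inputs 1..7, the main loop handles the first four elements, the epilogue adds the remaining three into the same accumulator, and add_loop_model returns 28. If the main loop is skipped (n < kLanes), the epilogue simply starts from the neutral accumulator.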
Few comments / questions inline below - I think the patch is OK as-is though.

Thanks,
Richard.

> gcc/
>         * tree-vectorizer.h (vect_reusable_accumulator): New structure.
>         (_loop_vec_info::main_loop_edge): New field.
>         (_loop_vec_info::skip_main_loop_edge): Likewise.
>         (_loop_vec_info::skip_this_loop_edge): Likewise.
>         (_loop_vec_info::reusable_accumulators): Likewise.
>         (_stmt_vec_info::reduc_scalar_results): Likewise.
>         (_stmt_vec_info::reused_accumulator): Likewise.
>         (vect_get_main_loop_result): Declare.
>         * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
>         reduc_scalar_inputs.
>         (vec_info::free_stmt_vec_info): Free reduc_scalar_inputs.
>         * tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
>         (vect_do_peeling): Fill an epilogue loop's main_loop_edge,
>         skip_main_loop_edge and skip_this_loop_edge fields.
>         * tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
>         (vect_emit_reduction_init_stmts): New function.
>         (get_initial_def_for_reduction): Use it.
>         (get_initial_defs_for_reduction): Likewise. Change the vinfo
>         parameter to a loop_vec_info.
>         (vect_create_epilog_for_reduction): Store the scalar results
>         in the reduc_info. If an epilogue loop is reusing an accumulator
>         from the main loop, and if the epilogue loop can also be skipped,
>         try to place the reduction code in the join block. Record
>         accumulators that could potentially be reused by epilogue loops.
>         (vect_transform_cycle_phi): When vectorizing epilogue loops,
>         try to reuse accumulators from the main loop. Record the initial
>         value in reduc_info for non-SLP reductions too.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/reduc_9.c: New test.
>         * gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_10.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_11.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
>         * gcc.target/aarch64/sve/reduc_12.c: Likewise.
> * gcc.target/aarch64/sve/reduc_12_run.c: Likewise. > * gcc.target/aarch64/sve/reduc_13.c: Likewise. > * gcc.target/aarch64/sve/reduc_13_run.c: Likewise. > * gcc.target/aarch64/sve/reduc_14.c: Likewise. > * gcc.target/aarch64/sve/reduc_14_run.c: Likewise. > * gcc.target/aarch64/sve/reduc_15.c: Likewise. > * gcc.target/aarch64/sve/reduc_15_run.c: Likewise. > --- > .../gcc.target/aarch64/sve/reduc_10.c | 77 +++++ > .../gcc.target/aarch64/sve/reduc_10_run.c | 49 +++ > .../gcc.target/aarch64/sve/reduc_11.c | 71 ++++ > .../gcc.target/aarch64/sve/reduc_11_run.c | 34 ++ > .../gcc.target/aarch64/sve/reduc_12.c | 71 ++++ > .../gcc.target/aarch64/sve/reduc_12_run.c | 66 ++++ > .../gcc.target/aarch64/sve/reduc_13.c | 101 ++++++ > .../gcc.target/aarch64/sve/reduc_13_run.c | 61 ++++ > .../gcc.target/aarch64/sve/reduc_14.c | 107 ++++++ > .../gcc.target/aarch64/sve/reduc_14_run.c | 187 +++++++++++ > .../gcc.target/aarch64/sve/reduc_15.c | 16 + > .../gcc.target/aarch64/sve/reduc_15_run.c | 22 ++ > .../gcc.target/aarch64/sve/reduc_9.c | 77 +++++ > .../gcc.target/aarch64/sve/reduc_9_run.c | 29 ++ > gcc/tree-vect-loop-manip.c | 29 ++ > gcc/tree-vect-loop.c | 309 ++++++++++++++---- > gcc/tree-vectorizer.c | 4 + > gcc/tree-vectorizer.h | 51 ++- > 18 files changed, 1297 insertions(+), 64 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c > create mode 100644 
gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c > create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index e2fd3609fee..ed7a7738880 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -551,6 +551,18 @@ typedef auto_vec vec_loop_lens; > > typedef auto_vec > drs_init_vec; > > +/* Information about a reduction accumulator from the main loop that cou= ld > + conceivably be reused as the input to a reduction in an epilogue loop= . */ > +struct vect_reusable_accumulator { > + /* The final value of the accumulator, which forms the input to the > + reduction operation. */ > + tree reduc_input; > + > + /* The stmt_vec_info that describes the reduction (i.e. the one for > + which is_reduc_info is true). */ > + stmt_vec_info reduc_info; > +}; > + > /*-----------------------------------------------------------------*/ > /* Info on vectorized loops. */ > /*-----------------------------------------------------------------*/ > @@ -588,6 +600,23 @@ public: > /* Unrolling factor */ > poly_uint64 vectorization_factor; > > + /* If this loop is an epilogue loop whose main loop can be skipped, > + MAIN_LOOP_EDGE is the edge from the main loop to this loop's > + preheader. SKIP_MAIN_LOOP_EDGE is then the edge that skips the > + main loop and goes straight to this loop's preheader. > + > + Both fields are null otherwise. */ > + edge main_loop_edge; > + edge skip_main_loop_edge; > + > + /* If this loop is an epilogue loop that might be skipped after execut= ing > + the main loop, this edge is the one that skips the epilogue. */ > + edge skip_this_loop_edge; > + > + /* After vectorization, maps live-out SSA names to information about > + the reductions that generated them. 
*/
> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;

Is that the LC PHI node defs or the definition inside of the loop?
If the latter we could attach the info directly to its stmt-info?

> +
>    /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
>       if there is no particular limit. */
>    unsigned HOST_WIDE_INT max_vectorization_factor;
> @@ -1186,6 +1215,21 @@ public:
>    /* The vector type for performing the actual reduction. */
>    tree reduc_vectype;
>
> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> +     elements in parallel, this vector gives the initial values of these
> +     N elements. */

That's N scalar elements or N vector elements? I suppose it's for
SLP reductions (rather than SLP reduction chains) and never non-SLP
reductions?

> +  vec<tree> reduc_initial_values;
> +
> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> +     elements in parallel, this vector gives the scalar result of each
> +     reduction. */
> +  vec<tree> reduc_scalar_results;
> +
> +  /* Only meaningful if IS_REDUC_INFO. If non-null, the reduction is
> +     being performed by an epilogue loop and we have decided to reuse
> +     this accumulator from the main loop. */
> +  vect_reusable_accumulator *reused_accumulator;
> +
>    /* Whether we force a single cycle PHI during reduction vectorization. */
>    bool force_single_cycle;
>
> @@ -1382,12 +1426,6 @@ vect_phi_initial_value (gphi *phi)
>    return PHI_ARG_DEF_FROM_EDGE (phi, pe);
>  }
>
> -static inline tree
> -vect_phi_initial_value (stmt_vec_info stmt_info)
> -{
> -  return vect_phi_initial_value (as_a <gphi *> (stmt_info->stmt));
> -}
> -
>  /* Return true if STMT_INFO should produce a vector mask type rather than
>     a normal nonmask type.
*/ > > @@ -1818,6 +1856,7 @@ class loop *vect_loop_versioning (loop_vec_info, gi= mple *); > extern class loop *vect_do_peeling (loop_vec_info, tree, tree, > tree *, tree *, tree *, int, bool, bo= ol, > tree *); > +extern tree vect_get_main_loop_result (loop_vec_info, tree, tree =3D NUL= L_TREE); > extern void vect_prepare_for_masked_peels (loop_vec_info); > extern dump_user_location_t find_loop_location (class loop *); > extern bool vect_can_advance_ivs_p (loop_vec_info); > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c > index 9748043f3ee..f1035a83826 100644 > --- a/gcc/tree-vectorizer.c > +++ b/gcc/tree-vectorizer.c > @@ -694,6 +694,8 @@ vec_info::new_stmt_vec_info (gimple *stmt) > STMT_VINFO_SLP_VECT_ONLY (res) =3D false; > STMT_VINFO_SLP_VECT_ONLY_PATTERN (res) =3D false; > STMT_VINFO_VEC_STMTS (res) =3D vNULL; > + res->reduc_initial_values =3D vNULL; > + res->reduc_scalar_results =3D vNULL; > > if (is_a (this) > && gimple_code (stmt) =3D=3D GIMPLE_PHI > @@ -755,6 +757,8 @@ vec_info::free_stmt_vec_info (stmt_vec_info stmt_info= ) > release_ssa_name (lhs); > } > > + stmt_info->reduc_initial_values.release (); > + stmt_info->reduc_scalar_results.release (); > STMT_VINFO_SIMD_CLONE_INFO (stmt_info).release (); > STMT_VINFO_VEC_STMTS (stmt_info).release (); > free (stmt_info); > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > index 2909e8a0fc3..b7b0523e3c8 100644 > --- a/gcc/tree-vect-loop-manip.c > +++ b/gcc/tree-vect-loop-manip.c > @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epilogu= e_vinfo, > return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, tru= e); > } > > +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on ex= it > + from the corresponding main loop. Return a value that is available i= n > + LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipped. > + Passing a null SKIP_VALUE is equivalent to passing zero. 
*/
> +
> +tree
> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_value,
> +                           tree skip_value)
> +{
> +  if (!loop_vinfo->main_loop_edge)
> +    return main_loop_value;
> +
> +  if (!skip_value)
> +    skip_value = build_zero_cst (TREE_TYPE (main_loop_value));

shouldn't that be the initial value?

> +
> +  tree phi_result = make_ssa_name (TREE_TYPE (main_loop_value));
> +  basic_block bb = loop_vinfo->main_loop_edge->dest;
> +  gphi *new_phi = create_phi_node (phi_result, bb);
> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
> +               UNKNOWN_LOCATION);
> +  add_phi_arg (new_phi, skip_value,
> +               loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
> +  return phi_result;
> +}
> +
>  /* Function vect_do_peeling.
>
>     Input:
> @@ -2986,6 +3011,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>                                            skip_vector ? anchor : guard_bb,
>                                            prob_epilog.invert (),
>                                            irred_flag);
> +         if (vect_epilogues)
> +           epilogue_vinfo->skip_this_loop_edge = guard_e;
>           slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
>                                               single_exit (epilog));
>           /* Only need to handle basic block before epilog loop if it's not
> @@ -3057,6 +3084,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>           add_phi_arg (new_phi, build_zero_cst (TREE_TYPE (niters)), skip_e,
>                        UNKNOWN_LOCATION);
>           niters = PHI_RESULT (new_phi);
> +         epilogue_vinfo->main_loop_edge = update_e;
> +         epilogue_vinfo->skip_main_loop_edge = skip_e;
>         }
>
>        /* Set ADVANCE to the number of iterations performed by the previous
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index fe7e73f655f..5e6c9b7c38a 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -19,6 +19,7 @@ You should have received a copy of the GNU General Public License
>  along with GCC; see the file COPYING3. If not see
>  .
*/ > > +#define INCLUDE_ALGORITHM > #include "config.h" > #include "system.h" > #include "coretypes.h" > @@ -823,6 +824,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,= vec_info_shared *shared) > th (0), > versioning_threshold (0), > vectorization_factor (0), > + main_loop_edge (nullptr), > + skip_main_loop_edge (nullptr), > + skip_this_loop_edge (nullptr), > + reusable_accumulators (), > max_vectorization_factor (0), > mask_skip_niters (NULL_TREE), > rgroup_compare_type (NULL_TREE), > @@ -4607,7 +4612,32 @@ vect_model_reduction_cost (loop_vec_info loop_vinf= o, > prologue_cost, epilogue_cost); > } > > +/* SEQ is a sequence of instructions that initialize the reduction > + described by REDUC_INFO. Emit them in the appropriate place. */ > > +static void > +vect_emit_reduction_init_stmts (loop_vec_info loop_vinfo, > + stmt_vec_info reduc_info, gimple *seq) > +{ > + if (reduc_info->reused_accumulator) > + { > + /* When reusing an accumulator from the main loop, we only need > + initialization instructions if the main loop can be skipped. > + In that case, emit the initialization instructions at the end > + of the guard block that does the skip. */ > + edge skip_edge =3D loop_vinfo->skip_main_loop_edge; > + gcc_assert (skip_edge); > + gimple_stmt_iterator gsi =3D gsi_last_bb (skip_edge->src); > + gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT); > + } > + else > + { > + /* The normal case: emit the initialization instructions on the > + preheader edge. 
*/ > + class loop *loop =3D LOOP_VINFO_LOOP (loop_vinfo); > + gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), seq)= ; > + } > +} > > /* Function get_initial_def_for_reduction > > @@ -4675,36 +4705,30 @@ get_initial_def_for_reduction (loop_vec_info loop= _vinfo, > } > > if (stmts) > - gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts)= ; > + vect_emit_reduction_init_stmts (loop_vinfo, reduc_info, stmts); > return init_def; > } > > -/* Get at the initial defs for the reduction PHIs for REDUC_INFO, whose > - associated SLP node is SLP_NODE. NUMBER_OF_VECTORS is the number of = vector > - defs to create. If NEUTRAL_OP is nonnull, introducing extra elements= of > - that value will not change the result. */ > +/* Get at the initial defs for the reduction PHIs for REDUC_INFO, > + which performs a reduction involving GROUP_SIZE scalar statements. > + NUMBER_OF_VECTORS is the number of vector defs to create. If NEUTRAL= _OP > + is nonnull, introducing extra elements of that value will not change = the > + result. 
*/ > > static void > -get_initial_defs_for_reduction (vec_info *vinfo, > +get_initial_defs_for_reduction (loop_vec_info loop_vinfo, > stmt_vec_info reduc_info, > - slp_tree slp_node, > vec *vec_oprnds, > unsigned int number_of_vectors, > - bool reduc_chain, tree neutral_op) > + unsigned int group_size, tree neutral_op) > { > - vec stmts =3D SLP_TREE_SCALAR_STMTS (slp_node); > + vec &initial_values =3D reduc_info->reduc_initial_values; > unsigned HOST_WIDE_INT nunits; > unsigned j, number_of_places_left_in_vector; > tree vector_type =3D STMT_VINFO_VECTYPE (reduc_info); > - unsigned int group_size =3D stmts.length (); > unsigned int i; > - class loop *loop; > - > - loop =3D (gimple_bb (reduc_info->stmt))->loop_father; > - gcc_assert (loop); > - edge pe =3D loop_preheader_edge (loop); > > - gcc_assert (!reduc_chain || neutral_op); > + gcc_assert (group_size =3D=3D initial_values.length () || neutral_op); > > /* NUMBER_OF_COPIES is the number of times we need to use the same val= ues in > created vectors. It is greater than 1 if unrolling is performed. > @@ -4734,18 +4758,13 @@ get_initial_defs_for_reduction (vec_info *vinfo, > { > tree op; > i =3D j % group_size; > - stmt_vec_info stmt_vinfo =3D stmts[i]; > > /* Get the def before the loop. In reduction chain we have only > one initial value. Else we have as many as PHIs in the group. = */ > - if (reduc_chain) > - op =3D j !=3D 0 ? neutral_op : vect_phi_initial_value (stmt_vinfo= ); > - else if (((vec_oprnds->length () + 1) * nunits > - - number_of_places_left_in_vector >=3D group_size) > - && neutral_op) > + if (i >=3D initial_values.length () || (j > i && neutral_op)) > op =3D neutral_op; > else > - op =3D vect_phi_initial_value (stmt_vinfo); > + op =3D initial_values[i]; > > /* Create 'vect_ =3D {op0,op1,...,opn}'. */ > number_of_places_left_in_vector--; > @@ -4781,8 +4800,8 @@ get_initial_defs_for_reduction (vec_info *vinfo, > { > /* First time round, duplicate ELTS to fill the > required number of vectors. 
*/ > - duplicate_and_interleave (vinfo, &ctor_seq, vector_type, el= ts, > - number_of_vectors, *vec_oprnds); > + duplicate_and_interleave (loop_vinfo, &ctor_seq, vector_typ= e, > + elts, number_of_vectors, *vec_opr= nds); > break; > } > vec_oprnds->quick_push (init); > @@ -4794,7 +4813,7 @@ get_initial_defs_for_reduction (vec_info *vinfo, > } > } > if (ctor_seq !=3D NULL) > - gsi_insert_seq_on_edge_immediate (pe, ctor_seq); > + vect_emit_reduction_init_stmts (loop_vinfo, reduc_info, ctor_seq); > } > > /* For a statement STMT_INFO taking part in a reduction operation return > @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_inf= o stmt_info) > return stmt_info; > } > > +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize using= vector > + type VECTYPE. See if LOOP_VINFO is an epilogue loop whose main loop = had a > + matching reduction that we can build on. Adjust REDUC_INFO and retur= n true > + if so, otherwise return false. */ > + > +static bool > +vect_find_reusable_accumulator (loop_vec_info loop_vinfo, > + stmt_vec_info reduc_info) > +{ > + loop_vec_info main_loop_vinfo =3D LOOP_VINFO_ORIG_LOOP_INFO (loop_vinf= o); > + if (!main_loop_vinfo) > + return false; > + > + if (STMT_VINFO_REDUC_TYPE (reduc_info) !=3D TREE_CODE_REDUCTION) > + return false; > + > + unsigned int num_phis =3D reduc_info->reduc_initial_values.length (); > + auto_vec main_loop_results (num_phis); > + auto_vec initial_values (num_phis); > + if (edge main_loop_edge =3D loop_vinfo->main_loop_edge) > + { > + /* The epilogue loop can be entered either from the main loop or > + from an earlier guard block. */ > + edge skip_edge =3D loop_vinfo->skip_main_loop_edge; > + for (tree incoming_value : reduc_info->reduc_initial_values) > + { > + /* Look for: > + > + INCOMING_VALUE =3D phi + INITIAL_VALUE(guard block)>. 
*/
> +          gcc_assert (TREE_CODE (incoming_value) == SSA_NAME);
> +
> +          gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (incoming_value));
> +          gcc_assert (gimple_bb (phi) == main_loop_edge->dest);
> +
> +          tree from_main_loop = PHI_ARG_DEF_FROM_EDGE (phi, main_loop_edge);
> +          tree from_skip = PHI_ARG_DEF_FROM_EDGE (phi, skip_edge);
> +
> +          main_loop_results.quick_push (from_main_loop);
> +          initial_values.quick_push (from_skip);
> +        }
> +    }
> +  else
> +    /* The main loop dominates the epilogue loop. */
> +    main_loop_results.splice (reduc_info->reduc_initial_values);
> +
> +  /* See if the main loop has the kind of accumulator we need. */
> +  vect_reusable_accumulator *accumulator
> +    = main_loop_vinfo->reusable_accumulators.get (main_loop_results[0]);
> +  if (!accumulator
> +      || num_phis != accumulator->reduc_info->reduc_scalar_results.length ()
> +      || !std::equal (main_loop_results.begin (), main_loop_results.end (),
> +                      accumulator->reduc_info->reduc_scalar_results.begin ()))
> +    return false;
> +
> +  /* For now, only handle the case in which both loops are operating on the
> +     same vector types. In future we could reduce wider vectors to narrower
> +     ones as well. */
> +  tree vectype = STMT_VINFO_VECTYPE (reduc_info);
> +  tree old_vectype = TREE_TYPE (accumulator->reduc_input);
> +  if (!useless_type_conversion_p (old_vectype, vectype))

It should be indeed quite trivial to handle, likewise the case where we
have multiple PHIs - just reduce to a single input vector and have the
possibly multiple input vectors in the epilogue filled with neutral
elements. I'll see if I can cook up stuff for this next week.

> +    return false;
> +
> +  /* Non-SLP reductions might apply an adjustment after the reduction
> +     operation, in order to simplify the initialization of the accumulator.
> +     If the epilogue loop carries on from where the main loop left off,
> +     it should apply the same adjustment to the final reduction result.
> + > + If the epilogue loop can also be entered directly (rather than via > + the main loop), we need to be able to handle that case in the same = way, > + with the same adjustment. (In principle we could add a PHI node > + to select the correct adjustment, but in practice that shouldn't be > + necessary.) */ > + tree main_adjustment > + =3D STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (accumulator->reduc_info); > + if (loop_vinfo->main_loop_edge && main_adjustment) > + { > + gcc_assert (num_phis =3D=3D 1); > + tree initial_value =3D initial_values[0]; > + /* Check that we can use INITIAL_VALUE as the adjustment and > + initialize the accumulator with a neutral value instead. */ > + if (!operand_equal_p (initial_value, main_adjustment)) > + return false; > + tree_code code =3D STMT_VINFO_REDUC_CODE (reduc_info); > + initial_values[0] =3D neutral_op_for_reduction (TREE_TYPE (initial= _value), > + code, initial_value); > + } > + STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (reduc_info) =3D main_adjustment; > + reduc_info->reduc_initial_values.truncate (0); > + reduc_info->reduc_initial_values.splice (initial_values); > + reduc_info->reused_accumulator =3D accumulator; > + return true; > +} > + > /* Function vect_create_epilog_for_reduction > > Create code at the loop-epilog to finalize the result of a reduction > @@ -4915,7 +5028,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo= p_vinfo, > gimple *use_stmt; > auto_vec reduc_inputs; > int j, i; > - auto_vec scalar_results; > + vec &scalar_results =3D reduc_info->reduc_scalar_results; > unsigned int group_size =3D 1, k; > auto_vec phis; > /* SLP reduction without reduction chain, e.g., > @@ -4941,16 +5054,12 @@ vect_create_epilog_for_reduction (loop_vec_info l= oop_vinfo, > gcc_assert (vectype); > mode =3D TYPE_MODE (vectype); > > - tree initial_def =3D NULL; > tree induc_val =3D NULL_TREE; > tree adjustment_def =3D NULL; > if (slp_node) > ; > else > { > - /* Get at the scalar def before the loop, that defines the initial= 
value > - of the reduction variable. */ > - initial_def =3D vect_phi_initial_value (reduc_def_stmt); > /* Optimize: for induction condition reduction, if we can't use ze= ro > for induc_val, use initial_def. */ > if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D INTEGER_INDUC_COND_R= EDUCTION) > @@ -5196,6 +5305,37 @@ vect_create_epilog_for_reduction (loop_vec_info lo= op_vinfo, > reduc_inputs.safe_push (single_input); > } > > + tree orig_reduc_input =3D reduc_inputs[0]; > + > + /* If this loop is an epilogue loop that can be skipped after the > + main loop, we can only share a reduction operation between the > + main loop and the epilogue if we put it at the target of the > + skip edge. Do you have a testcase where we cannot do this? > + We can still reuse accumulators if this check fails. Doing so has > + the minor(?) benefit of making the epilogue loop's scalar result > + independent of the main loop's scalar result. */ > + bool unify_with_main_loop_p =3D false; > + if (reduc_info->reused_accumulator > + && loop_vinfo->skip_this_loop_edge > + && single_succ_p (exit_bb) > + && single_succ (exit_bb) =3D=3D loop_vinfo->skip_this_loop_edge->d= est) > + { > + unify_with_main_loop_p =3D true; > + > + basic_block reduc_block =3D loop_vinfo->skip_this_loop_edge->dest; > + reduc_inputs[0] =3D make_ssa_name (vectype); > + gphi *new_phi =3D create_phi_node (reduc_inputs[0], reduc_block); > + add_phi_arg (new_phi, orig_reduc_input, single_succ_edge (exit_bb)= , > + UNKNOWN_LOCATION); > + add_phi_arg (new_phi, reduc_info->reused_accumulator->reduc_input, > + loop_vinfo->skip_this_loop_edge, UNKNOWN_LOCATION); > + exit_gsi =3D gsi_after_labels (reduc_block); > + } > + > + /* Shouldn't be used beyond this point. */ > + exit_bb =3D nullptr; > + > if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D COND_REDUCTION > && reduc_fn !=3D IFN_LAST) > { > @@ -5405,6 +5545,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo= p_vinfo, > the same as initial_def already. 
*/ > tree zcompare =3D build2 (EQ_EXPR, boolean_type_node, new_temp, > induc_val); > + tree initial_def =3D reduc_info->reduc_initial_values[0]; > > tmp =3D make_ssa_name (new_scalar_dest); > epilog_stmt =3D gimple_build_assign (tmp, COND_EXPR, zcompare, > @@ -5425,9 +5566,6 @@ vect_create_epilog_for_reduction (loop_vec_info loo= p_vinfo, > gcc_assert (reduc_inputs.length () =3D=3D 1); > gcc_assert (pow2p_hwi (group_size)); > > - slp_tree orig_phis_slp_node =3D slp_node_instance->reduc_phis; > - vec orig_phis > - =3D SLP_TREE_SCALAR_STMTS (orig_phis_slp_node); > gimple_seq seq =3D NULL; > > /* Build a vector {0, 1, 2, ...}, with the same number of elements > @@ -5452,7 +5590,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo= p_vinfo, > { > tree initial_value =3D NULL_TREE; > if (REDUC_GROUP_FIRST_ELEMENT (stmt_info)) > - initial_value =3D vect_phi_initial_value (orig_phis[0]); > + initial_value =3D reduc_info->reduc_initial_values[0]; > neutral_op =3D neutral_op_for_reduction (TREE_TYPE (vectype), c= ode, > initial_value); > } > @@ -5466,7 +5604,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo= p_vinfo, > for MIN and MAX reduction, for example. */ > if (!neutral_op) > { > - tree scalar_value =3D vect_phi_initial_value (orig_phis[i])= ; > + tree scalar_value =3D reduc_info->reduc_initial_values[i]; > scalar_value =3D gimple_convert (&seq, TREE_TYPE (vectype), > scalar_value); > vector_identity =3D gimple_build_vector_from_val (&seq, vec= type, > @@ -5780,6 +5918,7 @@ vect_create_epilog_for_reduction (loop_vec_info loo= p_vinfo, > the same as initial_def already. 
*/
>           tree zcompare = build2 (EQ_EXPR, boolean_type_node, new_temp,
>                                   induc_val);
> +         tree initial_def = reduc_info->reduc_initial_values[0];
>
>           tree tmp = make_ssa_name (new_scalar_dest);
>           epilog_stmt = gimple_build_assign (tmp, COND_EXPR, zcompare,
> @@ -5819,6 +5958,12 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>        scalar_results[0] = new_temp;
>      }
>
> +  /* Record this operation if it could be reused by the epilogue loop. */
> +  if (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION
> +      && !double_reduc)

what's the issue with double_reduc?

> +    loop_vinfo->reusable_accumulators.put (scalar_results[0],
> +                                           { orig_reduc_input, reduc_info });
> +
>    if (double_reduc)
>      loop = outer_loop;
>
> @@ -5886,6 +6031,17 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>          {
>            /* Replace the uses: */
>            orig_name = PHI_RESULT (exit_phi);
> +
> +          /* Look for a single use at the target of the skip edge. */
> +          if (unify_with_main_loop_p)
> +            {
> +              use_operand_p use_p;
> +              gimple *user;
> +              if (!single_imm_use (orig_name, &use_p, &user))
> +                gcc_unreachable ();
> +              orig_name = gimple_get_lhs (user);
> +            }
> +
>            scalar_result = scalar_results[k];
>            FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, orig_name)
>              {
> @@ -7421,16 +7577,32 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>        else
>          {
>            gcc_assert (slp_node == slp_node_instance->reduc_phis);
> -          tree initial_value = NULL_TREE;
> +          vec<tree> &initial_values = reduc_info->reduc_initial_values;
> +          vec<stmt_vec_info> &stmts = SLP_TREE_SCALAR_STMTS (slp_node);
> +
> +          unsigned int num_phis = stmts.length ();
>            if (REDUC_GROUP_FIRST_ELEMENT (reduc_stmt_info))
> -            initial_value = vect_phi_initial_value (phi);
> -          tree_code code = STMT_VINFO_REDUC_CODE (reduc_info);
> -          tree neutral_op = neutral_op_for_reduction (TREE_TYPE (vectype_out),
> -                                                      code, initial_value);
> -          get_initial_defs_for_reduction (loop_vinfo, reduc_info,
> -                                          slp_node_instance->reduc_phis,
> -
&vec_initial_defs, vec_num, > - initial_value !=3D NULL, neutra= l_op); > + num_phis =3D 1; > + initial_values.reserve (num_phis); > + for (unsigned int i =3D 0; i < num_phis; ++i) > + { > + gphi *this_phi =3D as_a (stmts[i]->stmt); > + initial_values.quick_push (vect_phi_initial_value (this_phi= )); > + } > + if (vec_num =3D=3D 1) > + vect_find_reusable_accumulator (loop_vinfo, reduc_info); > + if (!initial_values.is_empty ()) > + { > + tree initial_value > + =3D (num_phis =3D=3D 1 ? initial_values[0] : NULL_TREE); > + tree_code code =3D STMT_VINFO_REDUC_CODE (reduc_info); > + tree neutral_op > + =3D neutral_op_for_reduction (TREE_TYPE (vectype_out), > + code, initial_value); > + get_initial_defs_for_reduction (loop_vinfo, reduc_info, > + &vec_initial_defs, vec_num, > + stmts.length (), neutral_op= ); > + } > } > } > else > @@ -7438,6 +7610,7 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo, > /* Get at the scalar def before the loop, that defines the initial > value of the reduction variable. */ > tree initial_def =3D vect_phi_initial_value (phi); > + reduc_info->reduc_initial_values.safe_push (initial_def); > /* Optimize: if initial_def is for REDUC_MAX smaller than the base > and we can't use zero for induc_val, use initial_def. Similarly > for REDUC_MIN and initial_def larger than the base. */ > @@ -7474,21 +7647,30 @@ vect_transform_cycle_phi (loop_vec_info loop_vinf= o, > initial_def, initial_def); > else > { > - enum tree_code code =3D STMT_VINFO_REDUC_CODE (reduc_info); > - tree neutral_op =3D neutral_op_for_reduction (TREE_TYPE (initia= l_def), > - code, initial_def); > - gcc_assert (neutral_op); > - /* Try to simplify the vector initialization by applying an > - adjustment after the reduction has been performed. 
*/ > - if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def > - && !operand_equal_p (neutral_op, initial_def)) > + if (ncopies == 1) > + vect_find_reusable_accumulator (loop_vinfo, reduc_info); > + if (!reduc_info->reduc_initial_values.is_empty ()) > { > - STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (reduc_info) = initial_def; > - initial_def = neutral_op; > + initial_def = reduc_info->reduc_initial_values[0]; > + enum tree_code code = STMT_VINFO_REDUC_CODE (reduc_info); > + tree neutral_op > + = neutral_op_for_reduction (TREE_TYPE (initial_def), > + code, initial_def); > + gcc_assert (neutral_op); > + /* Try to simplify the vector initialization by applying an > + adjustment after the reduction has been performed. */ > + if (!reduc_info->reused_accumulator > + && STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def > + && !operand_equal_p (neutral_op, initial_def)) > + { > + STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT (reduc_info) > + = initial_def; > + initial_def = neutral_op; > + } > + vec_initial_def > + = get_initial_def_for_reduction (loop_vinfo, reduc_info, > + initial_def, neutral_op); > } > - vec_initial_def > - = get_initial_def_for_reduction (loop_vinfo, reduc_info, > - initial_def, neutral_op); > } > } > > @@ -7499,6 +7681,17 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo, > vec_initial_defs.quick_push (vec_initial_def); > } > > + if (auto *accumulator = reduc_info->reused_accumulator) > + { > + if (loop_vinfo->main_loop_edge) > + vec_initial_defs[0] > + = vect_get_main_loop_result (loop_vinfo, accumulator->reduc_input, > + vec_initial_defs[0]); > + else > + vec_initial_defs.safe_push (accumulator->reduc_input); > + gcc_assert (vec_initial_defs.length () == 1); > + } > + > /* Generate the reduction PHIs upfront.
*/ > for (i = 0; i < vec_num; i++) > { > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c > new file mode 100644 > index 00000000000..fb817b73d77 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c > @@ -0,0 +1,77 @@ > +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */ > + > +unsigned short __attribute__((noipa)) > +add_loop (unsigned short *x, int n) > +{ > + unsigned short res = 0; > + for (int i = 0; i < n; ++i) > + res += x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +min_loop (unsigned short *x, int n) > +{ > + unsigned short res = ~0; > + for (int i = 0; i < n; ++i) > + res = res < x[i] ? res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +max_loop (unsigned short *x, int n) > +{ > + unsigned short res = 0; > + for (int i = 0; i < n; ++i) > + res = res > x[i] ? res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +and_loop (unsigned short *x, int n) > +{ > + unsigned short res = ~0; > + for (int i = 0; i < n; ++i) > + res &= x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +or_loop (unsigned short *x, int n) > +{ > + unsigned short res = 0; > + for (int i = 0; i < n; ++i) > + res |= x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +eor_loop (unsigned short *x, int n) > +{ > + unsigned short res = 0; > + for (int i = 0; i < n; ++i) > + res ^= x[i]; > + return res; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tuminv\t}
1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c > new file mode 100644 > index 00000000000..1dd579be701 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run.c > @@ -0,0 +1,49 @@ > +/* { dg-do run { target aarch64_sve_hw } } */ > +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */ > + > +#define N 0x1100 > + > +#include "reduc_10.c" > + > +int > +main (void) > +{ > + unsigned short x[N]; > + for (int i = 0; i < N; ++i) > + x[i] = (i + 1) * (i + 2); > + > + if (add_loop (x, 0) != 0 > + || add_loop (x, 11) != 572 > + || add_loop (x, 0x100) != 22016 > + || add_loop (x, 0xfff) != 20480 > + || max_loop (x, 0) != 0 > + || max_loop (x, 11) != 132 > + || max_loop (x, 0x100) != 65280 > + || max_loop (x, 0xfff) != 65504 > + || or_loop (x, 0) != 0 > + || or_loop (x, 11) != 0xfe > + || or_loop (x, 0x80) != 0x7ffe > + || or_loop (x, 0xb4) != 0x7ffe > + || or_loop (x, 0xb5) !=
0xfffe > + || eor_loop (x, 0) != 0 > + || eor_loop (x, 11) != 0xe8 > + || eor_loop (x, 0x100) != 0xcf00 > + || eor_loop (x, 0xfff) != 0xa000) > + __builtin_abort (); > + > + for (int i = 0; i < N; ++i) > + x[i] = ~x[i]; > + > + if (min_loop (x, 0) != 65535 > + || min_loop (x, 11) != 65403 > + || min_loop (x, 0x100) != 255 > + || min_loop (x, 0xfff) != 31 > + || and_loop (x, 0) != 0xffff > + || and_loop (x, 11) != 0xff01 > + || and_loop (x, 0x80) != 0x8001 > + || and_loop (x, 0xb4) != 0x8001 > + || and_loop (x, 0xb5) != 1) > + __builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c > new file mode 100644 > index 00000000000..f99ef4aa865 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11.c > @@ -0,0 +1,71 @@ > +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */ > + > +unsigned short __attribute__((noipa)) > +add_loop (unsigned short *x, unsigned short res) > +{ > + for (int i = 0; i < 0xfff; ++i) > + res += x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +min_loop (unsigned short *x, unsigned short res) > +{ > + for (int i = 0; i < 0xfff; ++i) > + res = res < x[i] ? res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +max_loop (unsigned short *x, unsigned short res) > +{ > + for (int i = 0; i < 0xfff; ++i) > + res = res > x[i] ?
res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +and_loop (unsigned short *x, unsigned short res) > +{ > + for (int i = 0; i < 0xfff; ++i) > + res &= x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +or_loop (unsigned short *x, unsigned short res) > +{ > + for (int i = 0; i < 0xfff; ++i) > + res |= x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +eor_loop (unsigned short *x, unsigned short res) > +{ > + for (int i = 0; i < 0xfff; ++i) > + res ^= x[i]; > + return res; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */ > diff
--git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c > new file mode 100644 > index 00000000000..5b41560d2ef > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_11_run.c > @@ -0,0 +1,34 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */ > + > +#define N 0x1100 > + > +#include "reduc_11.c" > + > +int > +main (void) > +{ > + unsigned short x[N]; > + for (int i = 0; i < N; ++i) > + x[i] = (i + 1) * (i + 2); > + > + if (add_loop (x, 42) != 20522 > + || max_loop (x, 65503) != 65504 > + || max_loop (x, 65505) != 65505 > + || or_loop (x, 0) != 0xfffe > + || or_loop (x, 1) != 0xffff > + || eor_loop (x, 0) != 0xa000 > + || eor_loop (x, 0xbfff) != 0x1fff) > + __builtin_abort (); > + > + for (int i = 0; i < N; ++i) > + x[i] = ~x[i]; > + > + if (min_loop (x, 32) != 31 > + || min_loop (x, 30) != 30 > + || and_loop (x, 0xff) != 1 > + || and_loop (x, 0) != 0) > + __builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c > new file mode 100644 > index 00000000000..d32b81a61bc > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12.c > @@ -0,0 +1,71 @@ > +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */ > + > +unsigned short __attribute__((noipa)) > +add_loop (unsigned short *x, int n, unsigned short res) > +{ > + for (int i = 0; i < n; ++i) > + res += x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +min_loop (unsigned short *x, int n, unsigned short res) > +{ > + for (int i = 0; i < n; ++i) > + res = res < x[i] ? res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +max_loop (unsigned short *x, int n, unsigned short res) > +{ > + for (int i = 0; i < n; ++i) > + res = res > x[i] ?
res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +and_loop (unsigned short *x, int n, unsigned short res) > +{ > + for (int i = 0; i < n; ++i) > + res &= x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +or_loop (unsigned short *x, int n, unsigned short res) > +{ > + for (int i = 0; i < n; ++i) > + res |= x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +eor_loop (unsigned short *x, int n, unsigned short res) > +{ > + for (int i = 0; i < n; ++i) > + res ^= x[i]; > + return res; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c > new file mode 100644 > index 00000000000..929b81a9705 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_12_run.c > @@ -0,0 +1,66 @@ > +/* { dg-do run { target aarch64_sve_hw } } */ > +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */ > + > +#define N 0x1100 > + > +#include "reduc_12.c" > + > +int > +main (void) > +{ > + unsigned short x[N]; > + for (int i = 0; i < N; ++i) > + x[i] = (i + 1) * (i + 2); > + > + if (add_loop (x, 0, 10) != 10 > + || add_loop (x, 11, 42) != 614 > + || add_loop (x, 0x100, 84) != 22100 > + || add_loop (x, 0xfff, 20) != 20500 > + || max_loop (x, 0, 10) != 10 > + || max_loop (x, 11, 131) != 132 > + || max_loop (x, 11, 133) != 133 > + || max_loop (x, 0x100, 65279) != 65280 > + || max_loop (x, 0x100, 65281) != 65281 > + || max_loop (x, 0xfff, 65503) != 65504 > + || max_loop (x, 0xfff, 65505) != 65505 > + || or_loop (x, 0, 0x71) != 0x71 > + || or_loop (x, 11, 0) != 0xfe > + || or_loop (x, 11, 0xb3c) != 0xbfe > + || or_loop (x, 0x80, 0) != 0x7ffe > + || or_loop (x, 0x80, 1) != 0x7fff > + || or_loop (x, 0xb4, 0) != 0x7ffe > + || or_loop (x, 0xb4, 1) != 0x7fff > + || or_loop (x, 0xb5, 0) != 0xfffe > + || or_loop (x, 0xb5, 1) != 0xffff > + || eor_loop (x, 0, 0x3e) != 0x3e > + || eor_loop (x, 11, 0) != 0xe8 > + || eor_loop (x, 11, 0x1ff) != 0x117 > + || eor_loop (x, 0x100, 0) != 0xcf00 > + || eor_loop (x, 0x100, 0xeee) != 0xc1ee > + || eor_loop (x, 0xfff, 0) != 0xa000 > + || eor_loop (x, 0xfff, 0x8888) != 0x2888) > + __builtin_abort (); > + > + for (int i = 0; i < N; ++i) > + x[i] = ~x[i]; > + > + if (min_loop (x, 0, 10000) != 10000 > + || min_loop (x, 11, 65404) != 65403 > + || min_loop (x, 11, 65402) != 65402 > + || min_loop (x, 0x100, 256) != 255 > + || min_loop (x, 0x100, 254) != 254 > + || min_loop (x, 0xfff, 32) !=
31 > + || min_loop (x, 0xfff, 30) != 30 > + || and_loop (x, 0, 0x1234) != 0x1234 > + || and_loop (x, 11, 0xffff) != 0xff01 > + || and_loop (x, 11, 0xcdef) != 0xcd01 > + || and_loop (x, 0x80, 0xffff) != 0x8001 > + || and_loop (x, 0x80, 0xfffe) != 0x8000 > + || and_loop (x, 0xb4, 0xffff) != 0x8001 > + || and_loop (x, 0xb4, 0xfffe) != 0x8000 > + || and_loop (x, 0xb5, 0xffff) != 1 > + || and_loop (x, 0xb5, 0xfffe) != 0) > + __builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c > new file mode 100644 > index 00000000000..ce2b8f2fcdc > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13.c > @@ -0,0 +1,101 @@ > +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */ > + > +void __attribute__((noipa)) > +add_loop (unsigned int *x, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < 0x7ff; ++i) > + { > + res0 += x[i * 2]; > + res1 += x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +min_loop (unsigned int *x, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < 0x7ff; ++i) > + { > + res0 = res0 < x[i * 2] ? res0 : x[i * 2]; > + res1 = res1 < x[i * 2 + 1] ? res1 : x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +max_loop (unsigned int *x, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < 0x7ff; ++i) > + { > + res0 = res0 > x[i * 2] ? res0 : x[i * 2]; > + res1 = res1 > x[i * 2 + 1] ?
res1 : x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +and_loop (unsigned int *x, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < 0x7ff; ++i) > + { > + res0 &= x[i * 2]; > + res1 &= x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +or_loop (unsigned int *x, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < 0x7ff; ++i) > + { > + res0 |= x[i * 2]; > + res1 |= x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +eor_loop (unsigned int *x, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < 0x7ff; ++i) > + { > + res0 ^= x[i * 2]; > + res1 ^= x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s,
p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c > new file mode 100644 > index 00000000000..5514d8d6b3b > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_13_run.c > @@ -0,0 +1,61 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */ > + > +#define N 0x1100 > + > +#include "reduc_13.c" > + > +int > +main (void) > +{ > + unsigned int x[N]; > + for (int i = 0; i < N; ++i) > + x[i] = ((i + 1) * (i + 2)) & 0xfffff; > + > + unsigned int add_res[2] = { 42, 1111 }; > + add_loop (x, add_res); > + if (add_res[0] != 968538154 > + || add_res[1] != 964340823) > + __builtin_abort (); > + > + unsigned int max_res1[2] = { 0, 0 }; > + max_loop (x, max_res1); > + if (max_res1[0] != 1048150 > + || max_res1[1] != 1045506) > + __builtin_abort (); > + > + unsigned int max_res2[2] = { 1048151, 1045507 }; > + max_loop (x, max_res2); > + if (max_res2[0] != 1048151 > + || max_res2[1] != 1045507) > + __builtin_abort (); > + > + unsigned int or_res[2] = { 0x1000000, 0x2000000 }; > + or_loop (x, or_res); > + if (or_res[0] != 0x10ffffe > + || or_res[1] != 0x20ffffe) > + __builtin_abort (); > + > + unsigned int eor_res[2] = { 0x1000000, 0x2000000 }; > + eor_loop (x, eor_res); > + if (eor_res[0] != 0x1010000 > + || eor_res[1] != 0x20b5000) > + __builtin_abort (); > + > + for (int i = 0; i < N; ++i) > + x[i] = ~x[i] & 0xfffff; > + > + unsigned int min_res1[2] = { 500, 4000 }; > + min_loop (x, min_res1); > + if (min_res1[0] != 425 > + || min_res1[1] != 3069) > + __builtin_abort (); > + > + unsigned int min_res2[2] = { 424, 3068 }; > + min_loop (x, min_res2); > + if (min_res2[0] != 424 > + || min_res2[1] != 3068) > + __builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c
b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c > new file mode 100644 > index 00000000000..3be611e4b37 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14.c > @@ -0,0 +1,107 @@ > +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */ > + > +void __attribute__((noipa)) > +add_loop (unsigned int *x, int n, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < n; ++i) > + { > + res0 += x[i * 2]; > + res1 += x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +min_loop (unsigned int *x, int n, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < n; ++i) > + { > + res0 = res0 < x[i * 2] ? res0 : x[i * 2]; > + res1 = res1 < x[i * 2 + 1] ? res1 : x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +max_loop (unsigned int *x, int n, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < n; ++i) > + { > + res0 = res0 > x[i * 2] ? res0 : x[i * 2]; > + res1 = res1 > x[i * 2 + 1] ?
res1 : x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +and_loop (unsigned int *x, int n, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < n; ++i) > + { > + res0 &= x[i * 2]; > + res1 &= x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +or_loop (unsigned int *x, int n, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < n; ++i) > + { > + res0 |= x[i * 2]; > + res1 |= x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +void __attribute__((noipa)) > +eor_loop (unsigned int *x, int n, unsigned int *res) > +{ > + unsigned int res0 = res[0]; > + unsigned int res1 = res[1]; > + for (int i = 0; i < n; ++i) > + { > + res0 ^= x[i * 2]; > + res1 ^= x[i * 2 + 1]; > + } > + res[0] = res0; > + res[1] = res1; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tuaddv\t} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tuminv\t} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tumaxv\t} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tandv\t} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* {
dg-final { scan-assembler-times {\torr\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torv\t} 2 } } */ > + > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teorv\t} 2 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c > new file mode 100644 > index 00000000000..ccaa770e9b2 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_14_run.c > @@ -0,0 +1,187 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */ > + > +#define N 0x1100 > + > +#include "reduc_14.c" > + > +int > +main (void) > +{ > + unsigned int x[N]; > + for (int i = 0; i < N; ++i) > + x[i] = ((i + 1) * (i + 2)) & 0xfffff; > + > + unsigned int add_res1[2] = { 11, 22 }; > + add_loop (x, 0, add_res1); > + if (add_res1[0] != 11 > + || add_res1[1] != 22) > + __builtin_abort (); > + > + unsigned int add_res2[2] = { 10, 20 }; > + add_loop (x, 11, add_res2); > + if (add_res2[0] != 1902 > + || add_res2[1] != 2176) > + __builtin_abort (); > + > + unsigned int add_res3[2] = { 15, 30 }; > + add_loop (x, 0x100, add_res3); > + if (add_res3[0] != 22435087 > + || add_res3[1] != 22566686) > + __builtin_abort (); > + > + unsigned int add_res4[2] = { 100, 200 }; > + add_loop (x, 0x11f, add_res4); > + if (add_res4[0] != 31602244 > + || add_res4[1] != 31767656) > + __builtin_abort (); > + > + unsigned int max_res1[2] = { 461, 500 }; > + max_loop (x, 11, max_res1); > + if (max_res1[0] != 462 > + || max_res1[1] != 506) > + __builtin_abort (); > + > + unsigned int max_res2[2] = { 463, 507 }; > + max_loop (x, 11, max_res2); > + if (max_res2[0] != 463 > + ||
max_res2[1] != 507) > + __builtin_abort (); > + > + unsigned int max_res3[2] = { 1000000, 1000000 }; > + max_loop (x, 0x200, max_res3); > + if (max_res3[0] != 1047552 > + || max_res3[1] != 1045506) > + __builtin_abort (); > + > + unsigned int max_res4[2] = { 1047553, 1045507 }; > + max_loop (x, 0x200, max_res4); > + if (max_res4[0] != 1047553 > + || max_res4[1] != 1045507) > + __builtin_abort (); > + > + unsigned int max_res5[2] = { 300000, 30000 }; > + max_loop (x, 0x11f, max_res5); > + if (max_res5[0] != 328902 > + || max_res5[1] != 330050) > + __builtin_abort (); > + > + unsigned int max_res6[2] = { 328903, 330051 }; > + max_loop (x, 0x11f, max_res6); > + if (max_res6[0] != 328903 > + || max_res6[1] != 330051) > + __builtin_abort (); > + > + unsigned int or_res1[2] = { 11, 22 }; > + or_loop (x, 0, or_res1); > + if (or_res1[0] != 11 > + || or_res1[1] != 22) > + __builtin_abort (); > + > + unsigned int or_res2[2] = { 0x200000, 0xe00000 }; > + or_loop (x, 11, or_res2); > + if (or_res2[0] != 0x2001fe > + || or_res2[1] != 0xe001fe) > + __builtin_abort (); > + > + unsigned int or_res3[2] = { 0x800000, 0x700000 }; > + or_loop (x, 0x40, or_res3); > + if (or_res3[0] != 0x803ffe > + || or_res3[1] != 0x707ffe) > + __builtin_abort (); > + > + unsigned int or_res4[2] = { 0x100001, 0x300000 }; > + or_loop (x, 0x4f, or_res4); > + if (or_res4[0] != 0x107fff > + || or_res4[1] != 0x307ffe) > + __builtin_abort (); > + > + unsigned int eor_res1[2] = { 11, 22 }; > + eor_loop (x, 0, eor_res1); > + if (eor_res1[0] != 11 > + || eor_res1[1] != 22) > + __builtin_abort (); > + > + unsigned int eor_res2[2] = { 0x2000ff, 0xe000ff }; > + eor_loop (x, 11, eor_res2); > + if (eor_res2[0] != 0x2001cf > + || eor_res2[1] != 0xe000b7) > + __builtin_abort (); > + > + unsigned int eor_res3[2] = { 0x805000, 0x70f000 }; > + eor_loop (x, 0x100, eor_res3); > + if (eor_res3[0] != 0x824200 > + || eor_res3[1] != 0x77dc00) > +
__builtin_abort (); > + > + unsigned int eor_res4[2] = { 0x101201, 0x300f00 }; > + eor_loop (x, 0x11f, eor_res4); > + if (eor_res4[0] != 0x178801 > + || eor_res4[1] != 0x337240) > + __builtin_abort (); > + > + for (int i = 0; i < N; ++i) > + x[i] = ~x[i] & 0xfffff; > + > + unsigned int min_res1[2] = { 1048200, 1048100 }; > + min_loop (x, 11, min_res1); > + if (min_res1[0] != 1048113 > + || min_res1[1] != 1048069) > + __builtin_abort (); > + > + unsigned int min_res2[2] = { 1048112, 1048068 }; > + min_loop (x, 11, min_res2); > + if (min_res2[0] != 1048112 > + || min_res2[1] != 1048068) > + __builtin_abort (); > + > + unsigned int min_res3[2] = { 10000, 10000 }; > + min_loop (x, 0x200, min_res3); > + if (min_res3[0] != 1023 > + || min_res3[1] != 3069) > + __builtin_abort (); > + > + unsigned int min_res4[2] = { 1022, 3068 }; > + min_loop (x, 0x200, min_res4); > + if (min_res4[0] != 1022 > + || min_res4[1] != 3068) > + __builtin_abort (); > + > + unsigned int min_res5[2] = { 719680, 718530 }; > + min_loop (x, 0x11f, min_res5); > + if (min_res5[0] != 719673 > + || min_res5[1] != 718525) > + __builtin_abort (); > + > + unsigned int min_res6[2] = { 719672, 718524 }; > + min_loop (x, 0x11f, min_res6); > + if (min_res6[0] != 719672 > + || min_res6[1] != 718524) > + __builtin_abort (); > + > + unsigned int and_res1[2] = { 11, 22 }; > + and_loop (x, 0, and_res1); > + if (and_res1[0] != 11 > + || and_res1[1] != 22) > + __builtin_abort (); > + > + unsigned int and_res2[2] = { 0xf5cff, 0xf78ff }; > + and_loop (x, 11, and_res2); > + if (and_res2[0] != 0xf5c01 > + || and_res2[1] != 0xf7801) > + __builtin_abort (); > + > + unsigned int and_res3[2] = { 0x7efff, 0xecfff }; > + and_loop (x, 0x40, and_res3); > + if (and_res3[0] != 0x7c001 > + || and_res3[1] != 0xe8001) > + __builtin_abort (); > + > + unsigned int and_res4[2] = { 0xffffff, 0xffffff }; > + and_loop (x, 0x4f, and_res4); > + if (and_res4[0] != 0xf8001 > +
|| and_res4[1] != 0xf8001) > + __builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c > new file mode 100644 > index 00000000000..15b1ade30e2 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15.c > @@ -0,0 +1,16 @@ > +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */ > + > +int __attribute__((noipa)) > +add_loop (int *x, int n, int res) > +{ > + for (int i = 0; i < n; ++i) > + { > + res += x[i * 2]; > + res += x[i * 2 + 1]; > + } > + return res; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c > new file mode 100644 > index 00000000000..3207fce5be3 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_15_run.c > @@ -0,0 +1,22 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=256 --param vect-partial-vector-usage=1" } */ > + > +#define N 0x1100 > + > +#include "reduc_15.c" > + > +int > +main (void) > +{ > + int x[N]; > + for (int i = 0; i < N; ++i) > + x[i] = ((i + 1) * (i + 2)) & 0xfffff; > + > + if (add_loop (x, 0, 33) != 33 > + || add_loop (x, 11, 30) != 4078 > + || add_loop (x, 0x100, 45) != 45001773 > + || add_loop (x, 0x11f, 300) != 63369900) > + __builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c > new file mode 100644 > index 00000000000..b839821d6bb > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9.c > @@ -0,0 +1,77 @@ > +/* { dg-options "-O3 -msve-vector-bits=256 --param
vect-partial-vector= -usage=3D1" } */ > + > +unsigned short __attribute__((noipa)) > +add_loop (unsigned short *x) > +{ > + unsigned short res =3D 0; > + for (int i =3D 0; i < 0xfff; ++i) > + res +=3D x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +min_loop (unsigned short *x) > +{ > + unsigned short res =3D ~0; > + for (int i =3D 0; i < 0xfff; ++i) > + res =3D res < x[i] ? res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +max_loop (unsigned short *x) > +{ > + unsigned short res =3D 0; > + for (int i =3D 0; i < 0xfff; ++i) > + res =3D res > x[i] ? res : x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +and_loop (unsigned short *x) > +{ > + unsigned short res =3D ~0; > + for (int i =3D 0; i < 0xfff; ++i) > + res &=3D x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +or_loop (unsigned short *x) > +{ > + unsigned short res =3D 0; > + for (int i =3D 0; i < 0xfff; ++i) > + res |=3D x[i]; > + return res; > +} > + > +unsigned short __attribute__((noipa)) > +eor_loop (unsigned short *x) > +{ > + unsigned short res =3D 0; > + for (int i =3D 0; i < 0xfff; ++i) > + res ^=3D x[i]; > + return res; > +} > + > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, z[0-9]+\.h, z[0= -9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tadd\tz[0-9]+\.h, p[0-7]/m, z[0-9= ]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tuaddv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumin\tz[0-9]+\.h, p[0-7]/m, z[0-= 9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tuminv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tumax\tz[0-9]+\.h, p[0-7]/m, z[0-= 9]+\.h, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tumaxv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d, z[0= -9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, 
p[0-7]/m, z[0-9= ]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tandv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.d, z[0-9]+\.d, z[0= -9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torr\tz[0-9]+\.h, p[0-7]/m, z[0-9= ]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\torv\t} 1 } } */ > + > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, z[0-9]+\.d, z[0= -9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m, z[0-9= ]+\.h, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\teorv\t} 1 } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c b/gcc/tes= tsuite/gcc.target/aarch64/sve/reduc_9_run.c > new file mode 100644 > index 00000000000..aa248f53eaa > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/sve/reduc_9_run.c > @@ -0,0 +1,29 @@ > +/* { dg-do run { target aarch64_sve256_hw } } */ > +/* { dg-options "-O3 -msve-vector-bits=3D256 --param vect-partial-vector= -usage=3D1" } */ > + > +#define N 0x1100 > + > +#include "reduc_9.c" > + > +int > +main (void) > +{ > + unsigned short x[N]; > + for (int i =3D 0; i < N; ++i) > + x[i] =3D (i + 1) * (i + 2); > + > + if (add_loop (x) !=3D 20480 > + || max_loop (x) !=3D 65504 > + || or_loop (x) !=3D 0xfffe > + || eor_loop (x) !=3D 0xa000) > + __builtin_abort (); > + > + for (int i =3D 0; i < N; ++i) > + x[i] =3D ~x[i]; > + > + if (min_loop (x) !=3D 31 > + || and_loop (x) !=3D 1) > + __builtin_abort (); > + > + return 0; > +}