From: Richard Biener
Date: Mon, 24 Jun 2024 14:58:05 +0200
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org

On Sun, Jun 23, 2024 at 5:10 PM Feng
Xue OS wrote: > > >> - if (slp_node) > >> + if (slp_node && SLP_TREE_LANES (slp_node) > 1) > > > > Hmm, that looks wrong. It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off > > instead, which is bad. > > > >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > >> else > >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in); > >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, > >> } > >> } > >> > >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in > >> + the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC. > >> + Now there are three such kinds of operations: dot-prod/widen-sum/sad > >> + (sum-of-absolute-differences). > >> + > >> + For a lane-reducing operation, the loop reduction path that it lies in, > >> + may contain normal operation, or other lane-reducing operation of different > >> + input type size, an example as: > >> + > >> + int sum = 0; > >> + for (i) > >> + { > >> + ... > >> + sum += d0[i] * d1[i]; // dot-prod > >> + sum += w[i]; // widen-sum > >> + sum += abs(s0[i] - s1[i]); // sad > >> + sum += n[i]; // normal > >> + ... > >> + } > >> + > >> + Vectorization factor is essentially determined by operation whose input > >> + vectype has the most lanes ("vector(16) char" in the example), while we > >> + need to choose input vectype with the least lanes ("vector(4) int" in the > >> + example) for the reduction PHI statement.
*/ > >> + > >> +bool > >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, > >> + slp_tree slp_node, stmt_vector_for_cost *cost_vec) > >> +{ > >> + gimple *stmt = stmt_info->stmt; > >> + > >> + if (!lane_reducing_stmt_p (stmt)) > >> + return false; > >> + > >> + tree type = TREE_TYPE (gimple_assign_lhs (stmt)); > >> + > >> + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) > >> + return false; > >> + > >> + /* Do not try to vectorize bit-precision reductions. */ > >> + if (!type_has_mode_precision_p (type)) > >> + return false; > >> + > >> + if (!slp_node) > >> + return false; > >> + > >> + for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++) > >> + { > >> + stmt_vec_info def_stmt_info; > >> + slp_tree slp_op; > >> + tree op; > >> + tree vectype; > >> + enum vect_def_type dt; > >> + > >> + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op, > >> + &slp_op, &dt, &vectype, &def_stmt_info)) > >> + { > >> + if (dump_enabled_p ()) > >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> + "use not simple.\n"); > >> + return false; > >> + } > >> + > >> + if (!vectype) > >> + { > >> + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op), > >> + slp_op); > >> + if (!vectype) > >> + return false; > >> + } > >> + > >> + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype)) > >> + { > >> + if (dump_enabled_p ()) > >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> + "incompatible vector types for invariants\n"); > >> + return false; > >> + } > >> + > >> + if (i == STMT_VINFO_REDUC_IDX (stmt_info)) > >> + continue; > >> + > >> + /* There should be at most one cycle def in the stmt.
*/ > >> + if (VECTORIZABLE_CYCLE_DEF (dt)) > >> + return false; > >> + } > >> + > >> + stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); > >> + > >> + /* TODO: Support lane-reducing operation that does not directly participate > >> + in loop reduction. */ > >> + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0) > >> + return false; > >> + > >> + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not > >> + recoginized. */ > >> + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def); > >> + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION); > >> + > >> + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); > >> + int ncopies_for_cost; > >> + > >> + if (SLP_TREE_LANES (slp_node) > 1) > >> + { > >> + /* Now lane-reducing operations in a non-single-lane slp node should only > >> + come from the same loop reduction path. */ > >> + gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info)); > >> + ncopies_for_cost = 1; > >> + } > >> + else > >> + { > >> + ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in); > > > > OK, so the fact that the ops are lane-reducing means they effectively > > change the VF for the result. That's only possible as we tightly control > > code generation and "adjust" to the expected VF (by inserting the copies > > you mentioned above), but only up to the highest number of outputs > > created in the reduction chain. In that sense instead of talking and recording > > "input vector types" wouldn't it make more sense to record the effective > > vectorization factor for the reduction instance? That VF would be at most > > the loops VF but could be as low as 1. Once we have a non-lane-reducing > > operation in the reduction chain it would be always equal to the loops VF. > > > > ncopies would then be always determined by that reduction instance VF and > > the accumulator vector type (STMT_VINFO_VECTYPE).
This reduction > > instance VF would also trivially indicate the force-single-def-use-cycle > > case, possibly simplifying code? > > I tried to add such an effective VF, while the vectype_in is still needed in some > scenarios, such as when checking whether a dot-prod stmt is emulated or not. > The former could be deduced from the latter, so recording both things seems > to be redundant. Another consideration is that for normal op, ncopies > is determined from type (STMT_VINFO_VECTYPE), but for lane-reducing op, > it is from VF. So, a better means to make them unified? AFAICS reductions are special in that they, for the accumulation SSA cycle, do not adhere to the loops VF but as optimization can choose a smaller one. OTOH STMT_VINFO_VECTYPE is for the vector type used for individual operations which even for lane-reducing ops is adhered to - those just may use a smaller VF, that of the reduction SSA cycle. So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather it's not fully redundant but needlessly replicated over all stmts participating in the reduction instead of recording the reduction VF in the reduc_info and using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies for stmts in the reduction cycle. At least that was my idea ... > >> + gcc_assert (ncopies_for_cost >= 1); > >> + } > >> + > >> + if (vect_is_emulated_mixed_dot_prod (stmt_info)) > >> + { > >> + /* We need extra two invariants: one that contains the minimum signed > >> + value and one that contains half of its negative. */ > >> + int prologue_stmts = 2; > >> + unsigned cost = record_stmt_cost (cost_vec, prologue_stmts, > >> + scalar_to_vec, stmt_info, 0, > >> + vect_prologue); > >> + if (dump_enabled_p ()) > >> + dump_printf (MSG_NOTE, "vectorizable_lane_reducing: " > >> + "extra prologue_cost = %d .\n", cost); > >> + > >> + /* Three dot-products and a subtraction.
*/ > >> + ncopies_for_cost *= 4; > >> + } > >> + > >> + record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0, > >> + vect_body); > >> + > >> + if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > >> + { > >> + enum tree_code code = gimple_assign_rhs_code (stmt); > >> + vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info, > >> + slp_node, code, type, > >> + vectype_in); > >> + } > >> + > > > > Add a comment: > > > > /* Transform via vect_transform_reduction. */ > > > >> + STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type; > >> + return true; > >> +} > >> + > >> /* Function vectorizable_reduction. > >> > >> Check if STMT_INFO performs a reduction operation that can be vectorized. > >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> if (!type_has_mode_precision_p (op.type)) > >> return false; > >> > >> - /* For lane-reducing ops we're reducing the number of reduction PHIs > >> - which means the only use of that may be in the lane-reducing operation. */ > >> - if (lane_reducing > >> - && reduc_chain_length != 1 > >> - && !only_slp_reduc_chain) > >> - { > >> - if (dump_enabled_p ()) > >> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> - "lane-reducing reduction with extra stmts.\n"); > >> - return false; > >> - } > >> - > >> /* Lane-reducing ops also never can be used in a SLP reduction group > >> since we'll mix lanes belonging to different reductions. But it's > >> OK to use them in a reduction chain or when the reduction group > >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> && loop_vinfo->suggested_unroll_factor == 1) > >> single_defuse_cycle = true; > >> > >> - if (single_defuse_cycle || lane_reducing) > >> + if (single_defuse_cycle && !lane_reducing) > > > > If there's also a non-lane-reducing plus in the chain don't we have to > > check for that reduction op? So shouldn't it be > > single_defuse_cycle && ...
fact that we don't record > > (non-lane-reducing op there) ... > > Quite not understand this point. For a non-lane-reducing op in the chain, > it should be handled in its own vectorizable_xxx function? The below check > is only for the first statement (vect_reduction_def) in the reduction. Hmm. So we have vectorizable_lane_reducing_* for the check on the lane-reducing stmts, vectorizable_* for !single-def-use stmts. And the following is then just for the case there's a single def that's not lane-reducing and we're forcing a single-def-use and thus go via vect_transform_reduction? > > > >> { > >> gcc_assert (op.code != COND_EXPR); > >> > >> - /* 4. Supportable by target? */ > >> - bool ok = true; > >> - > >> - /* 4.1. check support for the operation in the loop > >> + /* 4. check support for the operation in the loop > >> > >> This isn't necessary for the lane reduction codes, since they > >> can only be produced by pattern matching, and it's up to the > >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> mixed-sign dot-products can be implemented using signed > >> dot-products. */ > >> machine_mode vec_mode = TYPE_MODE (vectype_in); > >> - if (!lane_reducing > >> - && !directly_supported_p (op.code, vectype_in, optab_vector)) > >> + if (!directly_supported_p (op.code, vectype_in, optab_vector)) > >> { > >> if (dump_enabled_p ()) > >> dump_printf (MSG_NOTE, "op not supported by target.\n"); > >> if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD) > >> || !vect_can_vectorize_without_simd_p (op.code)) > >> - ok = false; > >> + single_defuse_cycle = false; > >> else > >> if (dump_enabled_p ()) > >> dump_printf (MSG_NOTE, "proceeding using word mode.\n"); > >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> dump_printf (MSG_NOTE, "using word mode not possible.\n"); > >> return false; > >> } > >> - > >> - /* lane-reducing operations have to go through vect_transform_reduction.
> >> - For the other cases try without the single cycle optimization. */ > >> - if (!ok) > >> - { > >> - if (lane_reducing) > >> - return false; > >> - else > >> - single_defuse_cycle = false; > >> - } > >> } > >> if (dump_enabled_p () && single_defuse_cycle) > >> dump_printf_loc (MSG_NOTE, vect_location, > >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> "multiple vectors to one in the loop body\n"); > >> STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle; > >> > >> - /* If the reduction stmt is one of the patterns that have lane > >> - reduction embedded we cannot handle the case of ! single_defuse_cycle. */ > >> - if ((ncopies > 1 && ! single_defuse_cycle) > >> - && lane_reducing) > >> - { > >> - if (dump_enabled_p ()) > >> - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > >> - "multi def-use cycle not possible for lane-reducing " > >> - "reduction operation\n"); > >> - return false; > >> - } > >> + /* For lane-reducing operation, the below processing related to single > >> + defuse-cycle will be done in its own vectorizable function. One more > >> + thing to note is that the operation must not be involved in fold-left > >> + reduction. */ > >> + single_defuse_cycle &= !lane_reducing; > >> > >> if (slp_node > >> - && !(!single_defuse_cycle > >> - && !lane_reducing > >> - && reduction_type != FOLD_LEFT_REDUCTION)) > >> + && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION)) > >> for (i = 0; i < (int) op.num_ops; i++) > >> if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i])) > >> { > >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > >> vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn, > >> reduction_type, ncopies, cost_vec); > >> /* Cost the reduction op inside the loop if transformed via > >> - vect_transform_reduction. Otherwise this is costed by the > >> - separate vectorizable_* routines.
*/ > >> - if (single_defuse_cycle || lane_reducing) > >> - { > >> - int factor = 1; > >> - if (vect_is_emulated_mixed_dot_prod (stmt_info)) > >> - /* Three dot-products and a subtraction. */ > >> - factor = 4; > >> - record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, > >> - stmt_info, 0, vect_body); > >> - } > >> + vect_transform_reduction for non-lane-reducing operation. Otherwise > >> + this is costed by the separate vectorizable_* routines. */ > >> + if (single_defuse_cycle) > >> + record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body); > >> > >> if (dump_enabled_p () > >> && reduction_type == FOLD_LEFT_REDUCTION) > >> dump_printf_loc (MSG_NOTE, vect_location, > >> "using an in-order (fold-left) reduction.\n"); > >> STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type; > >> - /* All but single defuse-cycle optimized, lane-reducing and fold-left > >> - reductions go through their own vectorizable_* routines. */ > >> - if (!single_defuse_cycle > >> - && !lane_reducing > >> - && reduction_type != FOLD_LEFT_REDUCTION) > >> + > >> + /* All but single defuse-cycle optimized and fold-left reductions go > >> + through their own vectorizable_* routines. */ > >> + if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION) > >> { > >> stmt_vec_info tem > >> = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info)); > >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> bool lane_reducing = lane_reducing_op_p (code); > >> gcc_assert (single_defuse_cycle || lane_reducing); > >> > >> + if (lane_reducing) > >> + { > >> + /* The last operand of lane-reducing op is for reduction. */ > >> + gcc_assert (reduc_index == (int) op.num_ops - 1); > >> + > >> + /* Now all lane-reducing ops are covered by some slp node.
*/ > >> + gcc_assert (slp_node); > >> + } > >> + > >> /* Create the destination vector */ > >> tree scalar_dest = gimple_get_lhs (stmt_info->stmt); > >> tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); > >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> reduc_index == 2 ? op.ops[2] : NULL_TREE, > >> &vec_oprnds[2]); > >> } > >> + else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1 > >> + && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ()) > >> + { > >> + /* For lane-reducing op covered by single-lane slp node, the input > >> + vectype of the reduction PHI determines copies of vectorized def-use > >> + cycles, which might be more than effective copies of vectorized lane- > >> + reducing reduction statements. This could be complemented by > >> + generating extra trivial pass-through copies. For example: > >> + > >> + int sum = 0; > >> + for (i) > >> + { > >> + sum += d0[i] * d1[i]; // dot-prod > >> + sum += abs(s0[i] - s1[i]); // sad > >> + sum += n[i]; // normal > >> + } > >> + > >> + The vector size is 128-bit, vectorization factor is 16. Reduction > >> + statements would be transformed as: > >> + > >> + vector<4> int sum_v0 = { 0, 0, 0, 0 }; > >> + vector<4> int sum_v1 = { 0, 0, 0, 0 }; > >> + vector<4> int sum_v2 = { 0, 0, 0, 0 }; > >> + vector<4> int sum_v3 = { 0, 0, 0, 0 }; > >> + > >> + for (i / 16) > >> + { > >> + sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > >> + sum_v1 = sum_v1; // copy > >> + sum_v2 = sum_v2; // copy > >> + sum_v3 = sum_v3; // copy > >> + > >> + sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); > >> + sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); > >> + sum_v2 = sum_v2; // copy > >> + sum_v3 = sum_v3; // copy > >> + > >> + sum_v0 += n_v0[i: 0 ~ 3 ]; > >> + sum_v1 += n_v1[i: 4 ~ 7 ]; > >> + sum_v2 += n_v2[i: 8 ~ 11]; > >> + sum_v3 += n_v3[i: 12 ~ 15]; > >> + } > >> + */ > >> + unsigned using_ncopies = vec_oprnds[0].length (); > >> + unsigned reduc_ncopies = vec_oprnds[reduc_index].length (); > >> + > > > > assert reduc_ncopies >= using_ncopies? Maybe assert > > reduc_index == op.num_ops - 1 given you use one above > > and the other below? Or simply iterate till op.num_ops > > and skip i == reduc_index. > > > >> + for (unsigned i = 0; i < op.num_ops - 1; i++) > >> + { > >> + gcc_assert (vec_oprnds[i].length () == using_ncopies); > >> + vec_oprnds[i].safe_grow_cleared (reduc_ncopies); > >> + } > >> + } > >> > >> bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); > >> unsigned num = vec_oprnds[reduc_index == 0 ?
1 : 0].length (); > >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> { > >> gimple *new_stmt; > >> tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; > >> - if (masked_loop_p && !mask_by_cond_expr) > >> + > >> + if (!vop[0] || !vop[1]) > >> + { > >> + tree reduc_vop = vec_oprnds[reduc_index][i]; > >> + > >> + /* Insert trivial copy if no need to generate vectorized > >> + statement. */ > >> + gcc_assert (reduc_vop); > >> + > >> + new_stmt = gimple_build_assign (vec_dest, reduc_vop); > >> + new_temp = make_ssa_name (vec_dest, new_stmt); > >> + gimple_set_lhs (new_stmt, new_temp); > >> + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > > > > I think you could simply do > > > > slp_node->push_vec_def (reduc_vop); > > continue; > > > > without any code generation. > > > > OK, that would be easy. Here comes another question, this patch assumes > lane-reducing op would always be contained in a slp node, since single-lane > slp node feature has been enabled. But I got some regression if I enforced > such constraint on lane-reducing op check. Those cases are founded to > be unvectorizable with single-lane slp, so this should not be what we want? > and need to be fixed? Yes, in the end we need to chase down all unsupported cases and fix them (there's known issues with load permutes, I'm working on that - hopefully when finding a continuous stretch of time...). > > >> + } > >> + else if (masked_loop_p && !mask_by_cond_expr) > >> { > >> /* No conditional ifns have been defined for lane-reducing op > >> yet.
*/ > >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >> > >> if (masked_loop_p && mask_by_cond_expr) > >> { > >> + tree stmt_vectype_in = vectype_in; > >> + unsigned nvectors = vec_num * ncopies; > >> + > >> + if (lane_reducing && SLP_TREE_LANES (slp_node) == 1) > >> + { > >> + /* Input vectype of the reduction PHI may be defferent from > > > > different > > > >> + that of lane-reducing operation. */ > >> + stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); > >> + nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in); > > > > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS. > > To partially vectorizing a dot_prod<16 * char> with 128-bit vector width, > we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>) > to vect_get_loop_mask? Probably - it depends on the vectorization factor. What I wanted to point out is that vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong. The place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted, or we should forgo with it (but that's possibly a post-only-SLP cleanup to be done). See vect_slp_analyze_node_operations_1 where that's computed. For reductions it's probably not quite right (and we might have latent issues like those you are "fixing" with code like above). The order we analyze stmts might also be not optimal for reductions with SLP - in fact given that stmt analysis relies on a fixed VF it would probably make sense to determine the reduction VF in advance as well. But again this sounds like post-only-SLP cleanup opportunities. In the end I might suggest to always use reduct-VF and vectype to determine the number of vector stmts rather than computing ncopies/vec_num separately. Richard.
> Thanks, > Feng > > > ________________________________________ > From: Richard Biener > Sent: Thursday, June 20, 2024 8:26 PM > To: Feng Xue OS > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440] > > On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS wrote: > > > > For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current > > vectorizer could only handle the pattern if the reduction chain does not > > contain other operation, no matter the other is normal or lane-reducing. > > > > Actually, to allow multiple arbitrary lane-reducing operations, we need to > > support vectorization of loop reduction chain with mixed input vectypes. Since > > lanes of vectype may vary with operation, the effective ncopies of vectorized > > statements for operation also may not be same to each other, this causes > > mismatch on vectorized def-use cycles. A simple way is to align all operations > > with the one that has the most ncopies, the gap could be complemented by > > generating extra trivial pass-through copies. For example: > > > > int sum = 0; > > for (i) > > { > > sum += d0[i] * d1[i]; // dot-prod > > sum += w[i]; // widen-sum > > sum += abs(s0[i] - s1[i]); // sad > > sum += n[i]; // normal > > } > > > > The vector size is 128-bit, vectorization factor is 16.
Reduction statements > > would be transformed as: > > > > vector<4> int sum_v0 = { 0, 0, 0, 0 }; > > vector<4> int sum_v1 = { 0, 0, 0, 0 }; > > vector<4> int sum_v2 = { 0, 0, 0, 0 }; > > vector<4> int sum_v3 = { 0, 0, 0, 0 }; > > > > for (i / 16) > > { > > sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > > sum_v1 = sum_v1; // copy > > sum_v2 = sum_v2; // copy > > sum_v3 = sum_v3; // copy > > > > sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); > > sum_v1 = sum_v1; // copy > > sum_v2 = sum_v2; // copy > > sum_v3 = sum_v3; // copy > > > > sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); > > sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); > > sum_v2 = sum_v2; // copy > > sum_v3 = sum_v3; // copy > > > > sum_v0 += n_v0[i: 0 ~ 3 ]; > > sum_v1 += n_v1[i: 4 ~ 7 ]; > > sum_v2 += n_v2[i: 8 ~ 11]; > > sum_v3 += n_v3[i: 12 ~ 15]; > > } > > > > Thanks, > > Feng > > > > --- > > gcc/ > > PR tree-optimization/114440 > > * tree-vectorizer.h (vectorizable_lane_reducing): New function > > declaration. > > * tree-vect-stmts.cc (vect_analyze_stmt): Call new function > > vectorizable_lane_reducing to analyze lane-reducing operation. > > * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation > > code related to emulated_mixed_dot_prod. > > (vect_reduction_update_partial_vector_usage): Compute ncopies as the > > original means for single-lane slp node. > > (vectorizable_lane_reducing): New function. > > (vectorizable_reduction): Allow multiple lane-reducing operations in > > loop reduction. Move some original lane-reducing related code to > > vectorizable_lane_reducing. > > (vect_transform_reduction): Extend transformation to support reduction > > statements with mixed input vectypes.
> > > > gcc/testsuite/ > > PR tree-optimization/114440 > > * gcc.dg/vect/vect-reduc-chain-1.c > > * gcc.dg/vect/vect-reduc-chain-2.c > > * gcc.dg/vect/vect-reduc-chain-3.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > * gcc.dg/vect/vect-reduc-dot-slp-1.c > > --- > > .../gcc.dg/vect/vect-reduc-chain-1.c | 62 ++++ > > .../gcc.dg/vect/vect-reduc-chain-2.c | 77 +++++ > > .../gcc.dg/vect/vect-reduc-chain-3.c | 66 ++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 +++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 ++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 +++++ > > .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 ++++ > > .../gcc.dg/vect/vect-reduc-dot-slp-1.c | 35 ++ > > gcc/tree-vect-loop.cc | 324 ++++++++++++++---- > > gcc/tree-vect-stmts.cc | 2 + > > gcc/tree-vectorizer.h | 2 + > > 11 files changed, 802 insertions(+), 70 deletions(-) > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > > new file mode 100644 > > index 00000000000..04bfc419dbd > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c > > @@ -0,0 +1,62 @@ > > +/* Disabling epilogues until we find a better way to deal with scans.
*/ > > +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#define N 50 > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *restrict a, > > + SIGNEDNESS_2 char *restrict b, > > + SIGNEDNESS_2 char *restrict c, > > + SIGNEDNESS_2 char *restrict d, > > + SIGNEDNESS_1 int *restrict e) > > +{ > > + for (int i = 0; i < N; ++i) > > + { > > + res += a[i] * b[i]; > > + res += c[i] * d[i]; > > + res += e[i]; > > + } > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[N], b[N]; > > + SIGNEDNESS_2 char c[N], d[N]; > > + SIGNEDNESS_1 int e[N]; > > + int expected = 0x12345; > > + for (int i = 0; i < N; ++i) > > + { > > + a[i] = BASE + i * 5; > > + b[i] = BASE + OFFSET + i * 4; > > + c[i] = BASE + i * 2; > > + d[i] = BASE + OFFSET + i * 3; > > + e[i] = i; > > + asm volatile ("" ::: "memory"); > > + expected += a[i] * b[i]; > > + expected += c[i] * d[i]; > > + expected += e[i]; > > + } > > + if (f (0x12345, a, b, c, d, e) != expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ = DOT_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > > new file mode 100644 > > index 00000000000..6c803b80120 >
> --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c > > @@ -0,0 +1,77 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#define N 50 > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 unsigned > > +#define SIGNEDNESS_3 signed > > +#define SIGNEDNESS_4 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +fn (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *restrict a, > > + SIGNEDNESS_2 char *restrict b, > > + SIGNEDNESS_3 char *restrict c, > > + SIGNEDNESS_3 char *restrict d, > > + SIGNEDNESS_4 short *restrict e, > > + SIGNEDNESS_4 short *restrict f, > > + SIGNEDNESS_1 int *restrict g) > > +{ > > + for (int i =3D 0; i < N; ++i) > > + { > > + res +=3D a[i] * b[i]; > > + res +=3D i + 1; > > + res +=3D c[i] * d[i]; > > + res +=3D e[i] * f[i]; > > + res +=3D g[i]; > > + } > > + return res; > > +} > > + > > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4) > > +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? 
-1026 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[N], b[N]; > > + SIGNEDNESS_3 char c[N], d[N]; > > + SIGNEDNESS_4 short e[N], f[N]; > > + SIGNEDNESS_1 int g[N]; > > + int expected =3D 0x12345; > > + for (int i =3D 0; i < N; ++i) > > + { > > + a[i] =3D BASE2 + i * 5; > > + b[i] =3D BASE2 + OFFSET + i * 4; > > + c[i] =3D BASE3 + i * 2; > > + d[i] =3D BASE3 + OFFSET + i * 3; > > + e[i] =3D BASE4 + i * 6; > > + f[i] =3D BASE4 + OFFSET + i * 5; > > + g[i] =3D i; > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[i] * b[i]; > > + expected +=3D i + 1; > > + expected +=3D c[i] * d[i]; > > + expected +=3D e[i] * f[i]; > > + expected +=3D g[i]; > > + } > > + if (fn (0x12345, a, b, c, d, e, f, g) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target { vect_sdot_qi } } } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target { vect_udot_qi } } } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target { vect_sdot_hi } } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/tests= uite/gcc.dg/vect/vect-reduc-chain-3.c > > new file mode 100644 > > index 00000000000..a41e4b176c4 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c > > @@ -0,0 +1,66 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. 
= */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +#include "tree-vect.h" > > + > > +#define N 50 > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 unsigned > > +#define SIGNEDNESS_3 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *restrict a, > > + SIGNEDNESS_2 char *restrict b, > > + SIGNEDNESS_3 short *restrict c, > > + SIGNEDNESS_3 short *restrict d, > > + SIGNEDNESS_1 int *restrict e) > > +{ > > + for (int i =3D 0; i < N; ++i) > > + { > > + short diff =3D a[i] - b[i]; > > + SIGNEDNESS_2 short abs =3D diff < 0 ? -diff : diff; > > + res +=3D abs; > > + res +=3D c[i] * d[i]; > > + res +=3D e[i]; > > + } > > + return res; > > +} > > + > > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[N], b[N]; > > + SIGNEDNESS_3 short c[N], d[N]; > > + SIGNEDNESS_1 int e[N]; > > + int expected =3D 0x12345; > > + for (int i =3D 0; i < N; ++i) > > + { > > + a[i] =3D BASE2 + i * 5; > > + b[i] =3D BASE2 - i * 4; > > + c[i] =3D BASE3 + i * 2; > > + d[i] =3D BASE3 + OFFSET + i * 3; > > + e[i] =3D i; > > + asm volatile ("" ::: "memory"); > > + short diff =3D a[i] - b[i]; > > + SIGNEDNESS_2 short abs =3D diff < 0 ? 
-diff : diff; > > + expected +=3D abs; > > + expected +=3D c[i] * d[i]; > > + expected +=3D e[i]; > > + } > > + if (f (0x12345, a, b, c, d, e) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D SAD_EX= PR" "vect" { target vect_udot_qi } } } */ > > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PR= OD_EXPR" "vect" { target vect_sdot_hi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > new file mode 100644 > > index 00000000000..c2831fbcc8e > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c > > @@ -0,0 +1,95 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *a, > > + SIGNEDNESS_2 char *b, > > + int step, int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[0] * b[0]; > > + res +=3D a[1] * b[1]; > > + res +=3D a[2] * b[2]; > > + res +=3D a[3] * b[3]; > > + res +=3D a[4] * b[4]; > > + res +=3D a[5] * b[5]; > > + res +=3D a[6] * b[6]; > > + res +=3D a[7] * b[7]; > > + res +=3D a[8] * b[8]; > > + res +=3D a[9] * b[9]; > > + res +=3D a[10] * b[10]; > > + res +=3D a[11] * b[11]; > > + res +=3D a[12] * b[12]; > > + res +=3D a[13] * b[13]; > > + res +=3D a[14] * b[14]; > > + res +=3D a[15] * b[15]; > > + > > + a +=3D step; > > + b +=3D step; > > + } > > + > > + return res; > > 
+} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[100], b[100]; > > + int expected =3D 0x12345; > > + int step =3D 16; > > + int n =3D 2; > > + int t =3D 0; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[t + 0] * b[t + 0]; > > + expected +=3D a[t + 1] * b[t + 1]; > > + expected +=3D a[t + 2] * b[t + 2]; > > + expected +=3D a[t + 3] * b[t + 3]; > > + expected +=3D a[t + 4] * b[t + 4]; > > + expected +=3D a[t + 5] * b[t + 5]; > > + expected +=3D a[t + 6] * b[t + 6]; > > + expected +=3D a[t + 7] * b[t + 7]; > > + expected +=3D a[t + 8] * b[t + 8]; > > + expected +=3D a[t + 9] * b[t + 9]; > > + expected +=3D a[t + 10] * b[t + 10]; > > + expected +=3D a[t + 11] * b[t + 11]; > > + expected +=3D a[t + 12] * b[t + 12]; > > + expected +=3D a[t + 13] * b[t + 13]; > > + expected +=3D a[t + 14] * b[t + 14]; > > + expected +=3D a[t + 15] * b[t + 15]; > > + t +=3D step; > > + } > > + > > + if (f (0x12345, a, b, step, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 16 "vect" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > new file mode 100644 > > index 00000000000..4114264a364 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c > > @@ -0,0 +1,67 @@ > > +/* Disabling epilogues until we find a better way to deal 
with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 char *a, > > + SIGNEDNESS_2 char *b, > > + int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[5 * i + 0] * b[5 * i + 0]; > > + res +=3D a[5 * i + 1] * b[5 * i + 1]; > > + res +=3D a[5 * i + 2] * b[5 * i + 2]; > > + res +=3D a[5 * i + 3] * b[5 * i + 3]; > > + res +=3D a[5 * i + 4] * b[5 * i + 4]; > > + } > > + > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 char a[100], b[100]; > > + int expected =3D 0x12345; > > + int n =3D 18; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[5 * i + 0] * b[5 * i + 0]; > > + expected +=3D a[5 * i + 1] * b[5 * i + 1]; > > + expected +=3D a[5 * i + 2] * b[5 * i + 2]; > > + expected +=3D a[5 * i + 3] * b[5 * i + 3]; > > + expected +=3D a[5 * i + 4] * b[5 * i + 4]; > > + } > > + > > + if (f (0x12345, a, b, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times 
"vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 5 "vect" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > new file mode 100644 > > index 00000000000..2cdecc36d16 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c > > @@ -0,0 +1,79 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 short *a, > > + SIGNEDNESS_2 short *b, > > + int step, int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[0] * b[0]; > > + res +=3D a[1] * b[1]; > > + res +=3D a[2] * b[2]; > > + res +=3D a[3] * b[3]; > > + res +=3D a[4] * b[4]; > > + res +=3D a[5] * b[5]; > > + res +=3D a[6] * b[6]; > > + res +=3D a[7] * b[7]; > > + > > + a +=3D step; > > + b +=3D step; > > + } > > + > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-1026 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 short a[100], b[100]; > > + int expected =3D 0x12345; > > + int step =3D 8; > > + int n =3D 2; > > + int t =3D 0; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[t + 0] * b[t + 0]; > > + expected +=3D a[t + 1] * b[t + 1]; > > + expected +=3D a[t + 2] * b[t + 2]; > > + expected +=3D a[t + 3] * b[t + 3]; > > + expected +=3D a[t + 4] * b[t + 4]; > > + expected +=3D a[t + 5] * b[t + 5]; > > + expected +=3D a[t + 6] * b[t + 6]; > > + expected +=3D a[t + 7] * b[t + 7]; > > + t +=3D step; > > + } > > + > > + if (f (0x12345, a, b, step, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 8 "vect" { target vect_sdot_hi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/g= cc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > new file mode 100644 > > index 00000000000..32c0f30c77b > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c > > @@ -0,0 +1,63 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. 
= */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res, > > + SIGNEDNESS_2 short *a, > > + SIGNEDNESS_2 short *b, > > + int n) > > +{ > > + for (int i =3D 0; i < n; i++) > > + { > > + res +=3D a[3 * i + 0] * b[3 * i + 0]; > > + res +=3D a[3 * i + 1] * b[3 * i + 1]; > > + res +=3D a[3 * i + 2] * b[3 * i + 2]; > > + } > > + > > + return res; > > +} > > + > > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373) > > +#define OFFSET 20 > > + > > +int > > +main (void) > > +{ > > + check_vect (); > > + > > + SIGNEDNESS_2 short a[100], b[100]; > > + int expected =3D 0x12345; > > + int n =3D 18; > > + > > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i) > > + { > > + a[i] =3D BASE + i * 5; > > + b[i] =3D BASE + OFFSET + i * 4; > > + asm volatile ("" ::: "memory"); > > + } > > + > > + for (int i =3D 0; i < n; i++) > > + { > > + asm volatile ("" ::: "memory"); > > + expected +=3D a[3 * i + 0] * b[3 * i + 0]; > > + expected +=3D a[3 * i + 1] * b[3 * i + 1]; > > + expected +=3D a[3 * i + 2] * b[3 * i + 2]; > > + } > > + > > + if (f (0x12345, a, b, n) !=3D expected) > > + __builtin_abort (); > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } = } */ > > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D = DOT_PROD_EXPR" 3 "vect" { target vect_sdot_hi } } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/tes= 
tsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > > new file mode 100644 > > index 00000000000..e17d6291f75 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c > > @@ -0,0 +1,35 @@ > > +/* Disabling epilogues until we find a better way to deal with scans. = */ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { = aarch64*-*-* || arm*-*-* } } } */ > > +/* { dg-add-options arm_v8_2a_dotprod_neon } */ > > + > > +#include "tree-vect.h" > > + > > +#ifndef SIGNEDNESS_1 > > +#define SIGNEDNESS_1 signed > > +#define SIGNEDNESS_2 signed > > +#endif > > + > > +SIGNEDNESS_1 int __attribute__ ((noipa)) > > +f (SIGNEDNESS_1 int res0, > > + SIGNEDNESS_1 int res1, > > + SIGNEDNESS_1 int res2, > > + SIGNEDNESS_1 int res3, > > + SIGNEDNESS_2 short *a, > > + SIGNEDNESS_2 short *b) > > +{ > > + for (int i =3D 0; i < 64; i +=3D 4) > > + { > > + res0 +=3D a[i + 0] * b[i + 0]; > > + res1 +=3D a[i + 1] * b[i + 1]; > > + res2 +=3D a[i + 2] * b[i + 2]; > > + res3 +=3D a[i + 3] * b[i + 3]; > > + } > > + > > + return res0 ^ res1 ^ res2 ^ res3; > > +} > > + > > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected"= "vect" } } */ > > +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect= " } } */ > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index e0561feddce..6d91665a341 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vin= fo, > > if (!gimple_extract_op (orig_stmt_info->stmt, &op)) > > gcc_unreachable (); > > > > - bool emulated_mixed_dot_prod =3D vect_is_emulated_mixed_dot_prod (st= mt_info); > > - > > if (reduction_type =3D=3D EXTRACT_LAST_REDUCTION) > > /* No extra instructions are needed in the prologue. 
The loop bod= y > > operations are costed in vectorizable_condition. */ > > @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vi= nfo, > > initial result of the data reduction, initial value of the i= ndex > > reduction. */ > > prologue_stmts =3D 4; > > - else if (emulated_mixed_dot_prod) > > - /* We need the initial reduction value and two invariants: > > - one that contains the minimum signed value and one that > > - contains half of its negative. */ > > - prologue_stmts =3D 3; > > else > > + /* We need the initial reduction value. */ > > prologue_stmts =3D 1; > > prologue_cost +=3D record_stmt_cost (cost_vec, prologue_stmts, > > scalar_to_vec, stmt_info, 0, > > @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_= vec_info loop_vinfo, > > vec_loop_lens *lens =3D &LOOP_VINFO_LENS (loop_vinfo); > > unsigned nvectors; > > > > - if (slp_node) > > + if (slp_node && SLP_TREE_LANES (slp_node) > 1) > > Hmm, that looks wrong. It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off > instead, which is bad. > > > nvectors =3D SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > > else > > nvectors =3D vect_get_num_copies (loop_vinfo, vectype_in); > > @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loo= p_vec_info loop_vinfo, > > } > > } > > > > +/* Check if STMT_INFO is a lane-reducing operation that can be vectori= zed in > > + the context of LOOP_VINFO, and vector cost will be recorded in COST= _VEC. > > + Now there are three such kinds of operations: dot-prod/widen-sum/sa= d > > + (sum-of-absolute-differences). > > + > > + For a lane-reducing operation, the loop reduction path that it lies= in, > > + may contain normal operation, or other lane-reducing operation of d= ifferent > > + input type size, an example as: > > + > > + int sum =3D 0; > > + for (i) > > + { > > + ... > > + sum +=3D d0[i] * d1[i]; // dot-prod > > + sum +=3D w[i]; // widen-sum > > + sum +=3D abs(s0[i] - s1[i]); // sad > > + sum +=3D n[i]; // normal > > + ... 
> > + } > > + > > + Vectorization factor is essentially determined by operation whose i= nput > > + vectype has the most lanes ("vector(16) char" in the example), whil= e we > > + need to choose input vectype with the least lanes ("vector(4) int" = in the > > + example) for the reduction PHI statement. */ > > + > > +bool > > +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info st= mt_info, > > + slp_tree slp_node, stmt_vector_for_cost *co= st_vec) > > +{ > > + gimple *stmt =3D stmt_info->stmt; > > + > > + if (!lane_reducing_stmt_p (stmt)) > > + return false; > > + > > + tree type =3D TREE_TYPE (gimple_assign_lhs (stmt)); > > + > > + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) > > + return false; > > + > > + /* Do not try to vectorize bit-precision reductions. */ > > + if (!type_has_mode_precision_p (type)) > > + return false; > > + > > + if (!slp_node) > > + return false; > > + > > + for (int i =3D 0; i < (int) gimple_num_ops (stmt) - 1; i++) > > + { > > + stmt_vec_info def_stmt_info; > > + slp_tree slp_op; > > + tree op; > > + tree vectype; > > + enum vect_def_type dt; > > + > > + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op= , > > + &slp_op, &dt, &vectype, &def_stmt_info)) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > + "use not simple.\n"); > > + return false; > > + } > > + > > + if (!vectype) > > + { > > + vectype =3D get_vectype_for_scalar_type (loop_vinfo, TREE_TYP= E (op), > > + slp_op); > > + if (!vectype) > > + return false; > > + } > > + > > + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype)) > > + { > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > + "incompatible vector types for invariants\= n"); > > + return false; > > + } > > + > > + if (i =3D=3D STMT_VINFO_REDUC_IDX (stmt_info)) > > + continue; > > + > > + /* There should be at most one cycle def in the stmt. 
*/
> > +      if (VECTORIZABLE_CYCLE_DEF (dt))
> > +       return false;
> > +    }
> > +
> > +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> > +
> > +  /* TODO: Support lane-reducing operation that does not directly participate
> > +     in loop reduction. */
> > +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> > +    return false;
> > +
> > +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> > +     recognized. */
> > +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> > +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> > +
> > +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +  int ncopies_for_cost;
> > +
> > +  if (SLP_TREE_LANES (slp_node) > 1)
> > +    {
> > +      /* Now lane-reducing operations in a non-single-lane slp node should only
> > +        come from the same loop reduction path. */
> > +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> > +      ncopies_for_cost = 1;
> > +    }
> > +  else
> > +    {
> > +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
>
> OK, so the fact that the ops are lane-reducing means they effectively
> change the VF for the result.  That's only possible because we tightly
> control code generation and "adjust" to the expected VF (by inserting the
> copies you mentioned above), but only up to the highest number of outputs
> created in the reduction chain.  In that sense, instead of talking about and
> recording "input vector types", wouldn't it make more sense to record the
> effective vectorization factor for the reduction instance?  That VF would
> be at most the loop's VF but could be as low as 1.  Once we have a
> non-lane-reducing operation in the reduction chain it would always be
> equal to the loop's VF.
>
> ncopies would then always be determined by that reduction-instance VF and
> the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction-instance
> VF would also trivially indicate the force-single-def-use-cycle case,
> possibly simplifying the code?
>
> > +      gcc_assert (ncopies_for_cost >= 1);
> > +    }
> > +
> > +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> > +    {
> > +      /* We need extra two invariants: one that contains the minimum signed
> > +        value and one that contains half of its negative. */
> > +      int prologue_stmts = 2;
> > +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> > +                                       scalar_to_vec, stmt_info, 0,
> > +                                       vect_prologue);
> > +      if (dump_enabled_p ())
> > +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> > +                    "extra prologue_cost = %d .\n", cost);
> > +
> > +      /* Three dot-products and a subtraction. */
> > +      ncopies_for_cost *= 4;
> > +    }
> > +
> > +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> > +                   vect_body);
> > +
> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> > +    {
> > +      enum tree_code code = gimple_assign_rhs_code (stmt);
> > +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> > +                                                 slp_node, code, type,
> > +                                                 vectype_in);
> > +    }
> > +
>
> Add a comment:
>
>   /* Transform via vect_transform_reduction.  */
>
> > +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> > +  return true;
> > +}
> > +
> >  /* Function vectorizable_reduction.
> >
> >     Check if STMT_INFO performs a reduction operation that can be vectorized.
> > @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    if (!type_has_mode_precision_p (op.type))
> >      return false;
> >
> > -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> > -     which means the only use of that may be in the lane-reducing operation.
*/ > > - if (lane_reducing > > - && reduc_chain_length !=3D 1 > > - && !only_slp_reduc_chain) > > - { > > - if (dump_enabled_p ()) > > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > - "lane-reducing reduction with extra stmts.\n")= ; > > - return false; > > - } > > - > > /* Lane-reducing ops also never can be used in a SLP reduction group > > since we'll mix lanes belonging to different reductions. But it'= s > > OK to use them in a reduction chain or when the reduction group > > @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinf= o, > > && loop_vinfo->suggested_unroll_factor =3D=3D 1) > > single_defuse_cycle =3D true; > > > > - if (single_defuse_cycle || lane_reducing) > > + if (single_defuse_cycle && !lane_reducing) > > If there's also a non-lane-reducing plus in the chain don't we have to > check for that reduction op? So shouldn't it be > single_defuse_cycle && ... fact that we don't record > (non-lane-reducing op there) ... > > > { > > gcc_assert (op.code !=3D COND_EXPR); > > > > - /* 4. Supportable by target? */ > > - bool ok =3D true; > > - > > - /* 4.1. check support for the operation in the loop > > + /* 4. check support for the operation in the loop > > > > This isn't necessary for the lane reduction codes, since they > > can only be produced by pattern matching, and it's up to the > > @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinf= o, > > mixed-sign dot-products can be implemented using signed > > dot-products. 
*/ > > machine_mode vec_mode =3D TYPE_MODE (vectype_in); > > - if (!lane_reducing > > - && !directly_supported_p (op.code, vectype_in, optab_vector)) > > + if (!directly_supported_p (op.code, vectype_in, optab_vector)) > > { > > if (dump_enabled_p ()) > > dump_printf (MSG_NOTE, "op not supported by target.\n"); > > if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD) > > || !vect_can_vectorize_without_simd_p (op.code)) > > - ok =3D false; > > + single_defuse_cycle =3D false; > > else > > if (dump_enabled_p ()) > > dump_printf (MSG_NOTE, "proceeding using word mode.\n"); > > @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo= , > > dump_printf (MSG_NOTE, "using word mode not possible.\n"); > > return false; > > } > > - > > - /* lane-reducing operations have to go through vect_transform_re= duction. > > - For the other cases try without the single cycle optimization= . */ > > - if (!ok) > > - { > > - if (lane_reducing) > > - return false; > > - else > > - single_defuse_cycle =3D false; > > - } > > } > > if (dump_enabled_p () && single_defuse_cycle) > > dump_printf_loc (MSG_NOTE, vect_location, > > @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinf= o, > > "multiple vectors to one in the loop body\n"); > > STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) =3D single_defuse_cycle; > > > > - /* If the reduction stmt is one of the patterns that have lane > > - reduction embedded we cannot handle the case of ! single_defuse_c= ycle. */ > > - if ((ncopies > 1 && ! single_defuse_cycle) > > - && lane_reducing) > > - { > > - if (dump_enabled_p ()) > > - dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > - "multi def-use cycle not possible for lane-red= ucing " > > - "reduction operation\n"); > > - return false; > > - } > > + /* For lane-reducing operation, the below processing related to sing= le > > + defuse-cycle will be done in its own vectorizable function. 
One more
> > +     thing to note is that the operation must not be involved in fold-left
> > +     reduction.  */
> >    single_defuse_cycle &= !lane_reducing;
> >
> >    if (slp_node
> > -      && !(!single_defuse_cycle
> > -          && !lane_reducing
> > -          && reduction_type != FOLD_LEFT_REDUCTION))
> > +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >      for (i = 0; i < (int) op.num_ops; i++)
> >        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >         {
> > @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >                              reduction_type, ncopies, cost_vec);
> >    /* Cost the reduction op inside the loop if transformed via
> > -     vect_transform_reduction.  Otherwise this is costed by the
> > -     separate vectorizable_* routines.  */
> > -  if (single_defuse_cycle || lane_reducing)
> > -    {
> > -      int factor = 1;
> > -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> > -       /* Three dot-products and a subtraction.  */
> > -       factor = 4;
> > -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> > -                       stmt_info, 0, vect_body);
> > -    }
> > +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> > +     this is costed by the separate vectorizable_* routines.  */
> > +  if (single_defuse_cycle)
> > +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >
> >    if (dump_enabled_p ()
> >        && reduction_type == FOLD_LEFT_REDUCTION)
> >      dump_printf_loc (MSG_NOTE, vect_location,
> >                      "using an in-order (fold-left) reduction.\n");
> >    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> > -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> > -     reductions go through their own vectorizable_* routines.  */
> > -  if (!single_defuse_cycle
> > -      && !lane_reducing
> > -      && reduction_type != FOLD_LEFT_REDUCTION)
> > +
> > +  /* All but single defuse-cycle optimized and fold-left reductions go
> > +     through their own vectorizable_* routines.  */
> > +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >      {
> >        stmt_vec_info tem
> >         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> > @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >    bool lane_reducing = lane_reducing_op_p (code);
> >    gcc_assert (single_defuse_cycle || lane_reducing);
> >
> > +  if (lane_reducing)
> > +    {
> > +      /* The last operand of lane-reducing op is for reduction.  */
> > +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> > +
> > +      /* Now all lane-reducing ops are covered by some slp node.  */
> > +      gcc_assert (slp_node);
> > +    }
> > +
> >    /* Create the destination vector  */
> >    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> > @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >                              reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >                              &vec_oprnds[2]);
> >      }
> > +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> > +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> > +    {
> > +      /* For lane-reducing op covered by single-lane slp node, the input
> > +        vectype of the reduction PHI determines copies of vectorized def-use
> > +        cycles, which might be more than effective copies of vectorized lane-
> > +        reducing reduction statements.  This could be complemented by
> > +        generating extra trivial pass-through copies.  For example:
> > +
> > +          int sum = 0;
> > +          for (i)
> > +            {
> > +              sum += d0[i] * d1[i];       // dot-prod
> > +              sum += abs(s0[i] - s1[i]);  // sad
> > +              sum += n[i];                // normal
> > +            }
> > +
> > +        The vector size is 128-bit, vectorization factor is 16.  Reduction
> > +        statements would be transformed as:
> > +
> > +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> > +
> > +          for (i / 16)
> > +            {
> > +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> > +              sum_v1 = sum_v1;  // copy
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +
> > +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> > +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +
> > +              sum_v0 += n_v0[i: 0 ~ 3 ];
> > +              sum_v1 += n_v1[i: 4 ~ 7 ];
> > +              sum_v2 += n_v2[i: 8 ~ 11];
> > +              sum_v3 += n_v3[i: 12 ~ 15];
> > +            }
> > +       */
> > +      unsigned using_ncopies = vec_oprnds[0].length ();
> > +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
>
> assert reduc_ncopies >= using_ncopies?  Maybe assert
> reduc_index == op.num_ops - 1 given you use one above
> and the other below?  Or simply iterate till op.num_ops
> and skip i == reduc_index.
>
> > +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> > +       {
> > +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> > +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> > +       }
> > +    }
> >
> >    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> > @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >      {
> >        gimple *new_stmt;
> >        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> > -      if (masked_loop_p && !mask_by_cond_expr)
> > +
> > +      if (!vop[0] || !vop[1])
> > +       {
> > +         tree reduc_vop = vec_oprnds[reduc_index][i];
> > +
> > +         /* Insert trivial copy if no need to generate vectorized
> > +            statement.  */
> > +         gcc_assert (reduc_vop);
> > +
> > +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> > +         new_temp = make_ssa_name (vec_dest, new_stmt);
> > +         gimple_set_lhs (new_stmt, new_temp);
> > +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
>
> I think you could simply do
>
>   slp_node->push_vec_def (reduc_vop);
>   continue;
>
> without any code generation.
>
> > +       }
> > +      else if (masked_loop_p && !mask_by_cond_expr)
> >         {
> >           /* No conditional ifns have been defined for lane-reducing op
> >              yet.  */
> > @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >
> >        if (masked_loop_p && mask_by_cond_expr)
> >         {
> > +         tree stmt_vectype_in = vectype_in;
> > +         unsigned nvectors = vec_num * ncopies;
> > +
> > +         if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> > +           {
> > +             /* Input vectype of the reduction PHI may be defferent from
>
> different
>
> > +                that of lane-reducing operation.  */
> > +             stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +             nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
>
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
>
> Otherwise the patch looks good to me.
>
> Richard.
>
> > +           }
> > +
> >           tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> > -                                         vec_num * ncopies, vectype_in, i);
> > +                                         nvectors, stmt_vectype_in, i);
> >           build_vect_cond_expr (code, vop, mask, gsi);
> >         }
> >
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index ca6052662a3..1b73ef01ade 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
> >                                   NULL, NULL, node, cost_vec)
> >           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> >           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> > +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> > +                                        stmt_info, node, cost_vec)
> >           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
> >                                      node, node_instance, cost_vec)
> >           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 60224f4e284..94736736dcc 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
> >  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
> >                                          slp_tree, slp_instance, int,
> >                                          bool, stmt_vector_for_cost *);
> > +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> > +                                        slp_tree, stmt_vector_for_cost *);
> >  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
> >                                     slp_tree, slp_instance,
> >                                     stmt_vector_for_cost *);
> > --
> > 2.17.1