From: Richard Biener
Date: Tue, 25 Jun 2024 12:26:03 +0200
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
To: Feng Xue OS
Cc: "gcc-patches@gcc.gnu.org"

On Tue, Jun 25, 2024 at 11:32 AM Feng Xue OS
wrote:
>
> >>
> >> >> -  if (slp_node)
> >> >> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> >> > instead, which is bad.
> >> >
> >> >>      nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >>    else
> >> >>      nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >>      }
> >> >>  }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> +   (sum-of-absolute-differences).
> >> >> +
> >> >> +   For a lane-reducing operation, the loop reduction path that it lies in,
> >> >> +   may contain normal operation, or other lane-reducing operation of different
> >> >> +   input type size, an example as:
> >> >> +
> >> >> +     int sum = 0;
> >> >> +     for (i)
> >> >> +       {
> >> >> +         ...
> >> >> +         sum += d0[i] * d1[i];      // dot-prod
> >> >> +         sum += w[i];               // widen-sum
> >> >> +         sum += abs(s0[i] - s1[i]); // sad
> >> >> +         sum += n[i];               // normal
> >> >> +         ...
> >> >> +       }
> >> >> +
> >> >> +   Vectorization factor is essentially determined by operation whose input
> >> >> +   vectype has the most lanes ("vector(16) char" in the example), while we
> >> >> +   need to choose input vectype with the least lanes ("vector(4) int" in the
> >> >> +   example) for the reduction PHI statement.  */
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> +                            slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> +  gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> +  if (!lane_reducing_stmt_p (stmt))
> >> >> +    return false;
> >> >> +
> >> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> +    return false;
> >> >> +
> >> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> >> +  if (!type_has_mode_precision_p (type))
> >> >> +    return false;
> >> >> +
> >> >> +  if (!slp_node)
> >> >> +    return false;
> >> >> +
> >> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> +    {
> >> >> +      stmt_vec_info def_stmt_info;
> >> >> +      slp_tree slp_op;
> >> >> +      tree op;
> >> >> +      tree vectype;
> >> >> +      enum vect_def_type dt;
> >> >> +
> >> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> +                               &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "use not simple.\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vectype)
> >> >> +        {
> >> >> +          vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> +                                                 slp_op);
> >> >> +          if (!vectype)
> >> >> +            return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "incompatible vector types for invariants\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> +        continue;
> >> >> +
> >> >> +      /* There should be at most one cycle def in the stmt.  */
> >> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> +        return false;
> >> >> +    }
> >> >> +
> >> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> +  /* TODO: Support lane-reducing operation that does not directly participate
> >> >> +     in loop reduction.  */
> >> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> +    return false;
> >> >> +
> >> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> +     recoginized.  */
> >> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +  int ncopies_for_cost;
> >> >> +
> >> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> >> +    {
> >> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> +         come from the same loop reduction path.  */
> >> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> +      ncopies_for_cost = 1;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result.  That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain.  In that sense instead of talking and recording
> >> > "input vector types" wouldn't it make more sense to record the effective
> >> > vectorization factor for the reduction instance?  That VF would be at most
> >> > the loops VF but could be as low as 1.  Once we have a non-lane-reducing
> >> > operation in the reduction chain it would be always equal to the loops VF.
> >> >
> >> > ncopies would then be always determined by that reduction instance VF and
> >> > the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> >> > instance VF would also trivially indicate the force-single-def-use-cycle
> >> > case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, while the vectype_in is still needed in some
> >> scenarios, such as when checking whether a dot-prod stmt is emulated or not.
> >> The former could be deduced from the later, so recording both things seems
> >> to be redundant. Another consideration is that for normal op, ncopies
> >> is determined from type (STMT_VINFO_VECTYPE), but for lane-reducing op,
> >> it is from VF.  So, a better means to make them unified?
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loops VF but as optimization can chose a smaller one.
> > OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations which even for lane-reducing ops is adhered to - those just
> > may use a smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather
> > it's not fully redundant but needlessly replicated over all stmts participating
> > in the reduction instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies
> > for stmts in the reduction cycle.
> >
> > At least that was my idea ...
> >
>
> For lane-reducing ops and single-defuse-cycle optimization, we could assume
> no lane would be reduced, and always generate vectorization statements
> according to the normal VF, if placeholder is needed, just insert some trivial
> statement like zero-initialization, or pass-through copy. And define a"effective VF or
> ncopies" to control lane-reducing related aspects in analysis and codegen (such
> as the below vect_get_loop_mask). Since all things will become SLP-based finally,
> I think a suitable place to add such a field might be in slp_node, as a supplement to
> "vect_stmts_size", and it is expected to be adjusted in vectorizable_reduction. So
> could we do the refinement as separate patches when non-slp code path is to be
> removed?

I suppose so.

Thanks,
Richard.

> >> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> >> +    }
> >> >> +
> >> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> +    {
> >> >> +      /* We need extra two invariants: one that contains the minimum signed
> >> >> +         value and one that contains half of its negative.  */
> >> >> +      int prologue_stmts = 2;
> >> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> +                                        scalar_to_vec, stmt_info, 0,
> >> >> +                                        vect_prologue);
> >> >> +      if (dump_enabled_p ())
> >> >> +        dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> +                     "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> +      /* Three dot-products and a subtraction.  */
> >> >> +      ncopies_for_cost *= 4;
> >> >> +    }
> >> >> +
> >> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> +                    vect_body);
> >> >> +
> >> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> +    {
> >> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> +                                                  slp_node, code, type,
> >> >> +                                                  vectype_in);
> >> >> +    }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> >   /* Transform via vect_transform_reduction.  */
> >> >
> >> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> +  return true;
> >> >> +}
> >> >> +
> >> >>  /* Function vectorizable_reduction.
> >> >>
> >> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    if (!type_has_mode_precision_p (op.type))
> >> >>      return false;
> >> >>
> >> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> >> -  if (lane_reducing
> >> >> -      && reduc_chain_length != 1
> >> >> -      && !only_slp_reduc_chain)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                         "lane-reducing reduction with extra stmts.\n");
> >> >> -      return false;
> >> >> -    }
> >> >> -
> >> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >>       since we'll mix lanes belonging to different reductions.  But it's
> >> >>       OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >> >>      single_defuse_cycle = true;
> >> >>
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> +  if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op?  So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> Quite not understand this point. For a non-lane-reducing op in the chain,
> >> it should be handled in its own vectorizable_xxx function? The below check
> >> is only for the first statement (vect_reduction_def) in the reduction.
> >
> > Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> > following is then just for the case there's a single def that's not
> > lane-reducing
> > and we're forcing a single-def-use and thus go via vect_transform_reduction?
>
> Yes. Non-lane-reducing with single-defuse-cycle is handled in the function.
> This logic is same as the original.
>
> >>
> >> >>      {
> >> >>        gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> -      /* 4. Supportable by target?  */
> >> >> -      bool ok = true;
> >> >> -
> >> >> -      /* 4.1. check support for the operation in the loop
> >> >> +      /* 4. check support for the operation in the loop
> >> >>
> >> >>          This isn't necessary for the lane reduction codes, since they
> >> >>          can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>          mixed-sign dot-products can be implemented using signed
> >> >>          dot-products.  */
> >> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> -      if (!lane_reducing
> >> >> -          && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >>          {
> >> >>            if (dump_enabled_p ())
> >> >>              dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >>            if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >>                || !vect_can_vectorize_without_simd_p (op.code))
> >> >> -            ok = false;
> >> >> +            single_defuse_cycle = false;
> >> >>            else
> >> >>              if (dump_enabled_p ())
> >> >>                dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>            dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >>            return false;
> >> >>          }
> >> >> -
> >> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> -         For the other cases try without the single cycle optimization.  */
> >> >> -      if (!ok)
> >> >> -        {
> >> >> -          if (lane_reducing)
> >> >> -            return false;
> >> >> -          else
> >> >> -            single_defuse_cycle = false;
> >> >> -        }
> >> >>      }
> >> >>    if (dump_enabled_p () && single_defuse_cycle)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>                       "multiple vectors to one in the loop body\n");
> >> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> -      && lane_reducing)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                         "multi def-use cycle not possible for lane-reducing "
> >> >> -                         "reduction operation\n");
> >> >> -      return false;
> >> >> -    }
> >> >> +  /* For lane-reducing operation, the below processing related to single
> >> >> +     defuse-cycle will be done in its own vectorizable function.  One more
> >> >> +     thing to note is that the operation must not be involved in fold-left
> >> >> +     reduction.  */
> >> >> +  single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >>    if (slp_node
> >> >> -      && !(!single_defuse_cycle
> >> >> -           && !lane_reducing
> >> >> -           && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >>      for (i = 0; i < (int) op.num_ops; i++)
> >> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >>          {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >>                               reduction_type, ncopies, cost_vec);
> >> >>    /* Cost the reduction op inside the loop if transformed via
> >> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> >> -     separate vectorizable_* routines.  */
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> -    {
> >> >> -      int factor = 1;
> >> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> -        /* Three dot-products and a subtraction.  */
> >> >> -        factor = 4;
> >> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> -                        stmt_info, 0, vect_body);
> >> >> -    }
> >> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> >> +     this is costed by the separate vectorizable_* routines.  */
> >> >> +  if (single_defuse_cycle)
> >> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >>    if (dump_enabled_p ()
> >> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >>                       "using an in-order (fold-left) reduction.\n");
> >> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> -     reductions go through their own vectorizable_* routines.  */
> >> >> -  if (!single_defuse_cycle
> >> >> -      && !lane_reducing
> >> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> +     through their own vectorizable_* routines.  */
> >> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >>      {
> >> >>        stmt_vec_info tem
> >> >>          = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>    bool lane_reducing = lane_reducing_op_p (code);
> >> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> +  if (lane_reducing)
> >> >> +    {
> >> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> >> +      gcc_assert (slp_node);
> >> >> +    }
> >> >> +
> >> >>    /* Create the destination vector  */
> >> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>                           reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >>                           &vec_oprnds[2]);
> >> >>      }
> >> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> +           && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> +    {
> >> >> +      /* For lane-reducing op covered by single-lane slp node, the input
> >> >> +         vectype of the reduction PHI determines copies of vectorized def-use
> >> >> +         cycles, which might be more than effective copies of vectorized lane-
> >> >> +         reducing reduction statements.  This could be complemented by
> >> >> +         generating extra trivial pass-through copies.  For example:
> >> >> +
> >> >> +           int sum = 0;
> >> >> +           for (i)
> >> >> +             {
> >> >> +               sum += d0[i] * d1[i];      // dot-prod
> >> >> +               sum += abs(s0[i] - s1[i]); // sad
> >> >> +               sum += n[i];               // normal
> >> >> +             }
> >> >> +
> >> >> +         The vector size is 128-bit,vectorization factor is 16.  Reduction
> >> >> +         statements would be transformed as:
> >> >> +
> >> >> +           vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> +           vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> +           vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> +           vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> +           for (i / 16)
> >> >> +             {
> >> >> +               sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> +               sum_v1 = sum_v1;  // copy
> >> >> +               sum_v2 = sum_v2;  // copy
> >> >> +               sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +               sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> +               sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> +               sum_v2 = sum_v2;  // copy
> >> >> +               sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +               sum_v0 += n_v0[i: 0  ~ 3 ];
> >> >> +               sum_v1 += n_v1[i: 4  ~ 7 ];
> >> >> +               sum_v2 += n_v2[i: 8  ~ 11];
> >> >> +               sum_v3 += n_v3[i: 12 ~ 15];
> >> >> +             }
> >> >> +       */
> >> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below?  Or simply iterate till op.num_ops
> >> > and sip i == reduc_index.
> >> >
> >> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> +        {
> >> >> +          gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> +          vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> +        }
> >> >> +    }
> >> >>
> >> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>      {
> >> >>        gimple *new_stmt;
> >> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> +
> >> >> +      if (!vop[0] || !vop[1])
> >> >> +        {
> >> >> +          tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> +          /* Insert trivial copy if no need to generate vectorized
> >> >> +             statement.  */
> >> >> +          gcc_assert (reduc_vop);
> >> >> +
> >> >> +          new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> +          new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> +          gimple_set_lhs (new_stmt, new_temp);
> >> >> +          vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> >   slp_node->push_vec_def (reduc_vop);
> >> >   continue;
> >> >
> >> > without any code generation.
> >> >
> >>
> >> OK, that would be easy. Here comes another question, this patch assumes
> >> lane-reducing op would always be contained in a slp node, since single-lane
> >> slp node feature has been enabled. But I got some regression if I enforced
> >> such constraint on lane-reducing op check. Those cases are founded to
> >> be unvectorizable with single-lane slp, so this should not be what we want?
> >> and need to be fixed?
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there's known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> +        }
> >> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >> >>          {
> >> >>            /* No conditional ifns have been defined for lane-reducing op
> >> >>               yet.  */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >>            if (masked_loop_p && mask_by_cond_expr)
> >> >>              {
> >> >> +              tree stmt_vectype_in = vectype_in;
> >> >> +              unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> +              if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> +                {
> >> >> +                  /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> +                     that of lane-reducing operation.  */
> >> >> +                  stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +                  nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorizing a dot_prod<16 * char> with 128-bit vector width,
> >> we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>)
> >> to vect_get_loop_mask?
> >
> > Probably - it depends on the vectorization factor.  What I wanted to
> > point out is that
> > vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.  The
> > place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted,
> > or we should forgo with it (but that's possibly a post-only-SLP
> > cleanup to be done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed.  For reductions
> > it's probably not quite right (and we might have latent issues like
> > those you are
> > "fixing" with code like above).  The order we analyze stmts might also be not
> > optimal for reductions with SLP - in fact given that stmt analysis
> > relies on a fixed VF
> > it would probably make sense to determine the reduction VF in advance as well.
> > But again this sounds like post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest to always use reduct-VF and vectype to determine
> > the number of vector stmts rather than computing ncopies/vec_num separately.
>
> Thanks,
> Feng