From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Sandiford
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Handle TYPE_OVERFLOW_UNDEFINED vectorized BB reductions
References: <28e11ecf-ede0-44d0-b35f-fbc635adbd78@DBAEUR03FT020.eop-EUR03.prod.protection.outlook.com>
Date: Tue, 15 Aug 2023 16:17:26 +0100
In-Reply-To: <28e11ecf-ede0-44d0-b35f-fbc635adbd78@DBAEUR03FT020.eop-EUR03.prod.protection.outlook.com>
 (Richard Biener's message of "Tue, 15 Aug 2023 13:36:40 +0000 (UTC)")

Richard Biener writes:
> The following changes the gate to perform vectorization of BB reductions
> to use needs_fold_left_reduction_p which in turn requires handling
> TYPE_OVERFLOW_UNDEFINED types in the epilogue code generation by
> promoting any operations generated there to use unsigned arithmetic.
>
> The following does this, there's currently only v16qi where x86
> supports a .REDUC_PLUS reduction for integral modes so I had to
> add a x86 specific testcase using GIMPLE IL.
>
> Bootstrap and regtest ongoing on x86_64-unknown-linux-gnu.

LGTM FWIW.

> The next plan is to remove the restriction to .REDUC_PLUS, factoring
> out some of the general non-ifn way of doing a reduction epilog
> from loop reduction handling.  I had a stab at doing in-order
> reductions already but then those are really too similar to
> having general SLP discovery from N scalar defs (and then replacing
> those with extracts), at least since there's no
> fold_left_plus that doesn't add to an existing scalar I can't
> seem to easily just handle that case, possibly discovering
> { x_0, x_1, ..., x_n-1 }, extracting x_0, shifting the vector
> to { x_1, ..., x_n-1, } and using mask_fold_left_plus
> with accumulating to x_0 and the element masked would do.
> But I'm not sure that's worth the trouble?

Yeah, I doubt it.  I don't think SVE's FADDA is expected to be an
optimisation in its own right.  It's more of an enabler.  Another
reason to use it in loops is that it's VLA-friendly.  But that
wouldn't be an issue here.

Thanks,
Richard

> In principle with generic N scalar defs we could do a forward
> discovery from grouped loads and see where that goes (and of
> course handle in-order reductions that way).
>
> 	* tree-vect-slp.cc (vect_slp_check_for_roots): Use
> 	!needs_fold_left_reduction_p to decide whether we can
> 	handle the reduction with association.
> 	(vectorize_slp_instance_root_stmt): For TYPE_OVERFLOW_UNDEFINED
> 	reductions perform all arithmetic in an unsigned type.
>
> 	* gcc.target/i386/vect-reduc-2.c: New testcase.
> ---
>  gcc/testsuite/gcc.target/i386/vect-reduc-2.c | 77 ++++++++++++++++++++
>  gcc/tree-vect-slp.cc                         | 44 +++++++----
>  2 files changed, 107 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-reduc-2.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-2.c b/gcc/testsuite/gcc.target/i386/vect-reduc-2.c
> new file mode 100644
> index 00000000000..62559ef8e7b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-2.c
> @@ -0,0 +1,77 @@
> +/* { dg-do compile } */
> +/* { dg-options "-fgimple -O2 -msse2 -fdump-tree-slp2-optimized" } */
> +
> +signed char x[16];
> +
> +signed char __GIMPLE (ssa,guessed_local(1073741824))
> +foo ()
> +{
> +  signed char _1;
> +  signed char _3;
> +  signed char _5;
> +  signed char _6;
> +  signed char _8;
> +  signed char _9;
> +  signed char _11;
> +  signed char _12;
> +  signed char _14;
> +  signed char _15;
> +  signed char _17;
> +  signed char _18;
> +  signed char _20;
> +  signed char _21;
> +  signed char _23;
> +  signed char _24;
> +  signed char _26;
> +  signed char _27;
> +  signed char _29;
> +  signed char _30;
> +  signed char _32;
> +  signed char _33;
> +  signed char _35;
> +  signed char _36;
> +  signed char _38;
> +  signed char _39;
> +  signed char _41;
> +  signed char _42;
> +  signed char _44;
> +  signed char _45;
> +  signed char _47;
> +
> +  __BB(2,guessed_local(1073741824)):
> +  _1 = x[15];
> +  _3 = x[1];
> +  _5 = _1 + _3;
> +  _6 = x[2];
> +  _8 = _5 + _6;
> +  _9 = x[3];
> +  _11 = _8 + _9;
> +  _12 = x[4];
> +  _14 = _11 + _12;
> +  _15 = x[5];
> +  _17 = _14 + _15;
> +  _18 = x[6];
> +  _20 = _17 + _18;
> +  _21 = x[7];
> +  _23 = _20 + _21;
> +  _24 = x[8];
> +  _26 = _23 + _24;
> +  _27 = x[9];
> +  _29 = _26 + _27;
> +  _30 = x[10];
> +  _32 = _29 + _30;
> +  _33 = x[11];
> +  _35 = _32 + _33;
> +  _36 = x[12];
> +  _38 = _35 + _36;
> +  _39 = x[13];
> +  _41 = _38 + _39;
> +  _42 = x[14];
> +  _44 = _41 + _42;
> +  _45 = x[0];
> +  _47 = _44 + _45;
> +  return _47;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump "optimized: basic block part vectorized" "slp2" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 7020bd9fa0e..07d68f2052b 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -7217,13 +7217,10 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
>  	}
>        else if (!VECTOR_TYPE_P (TREE_TYPE (rhs))
>  	       && (associative_tree_code (code) || code == MINUS_EXPR)
> -	       /* ??? The flag_associative_math and TYPE_OVERFLOW_WRAPS
> -		  checks pessimize a two-element reduction.  PR54400.
> +	       /* ??? This pessimizes a two-element reduction.  PR54400.
>  		  ??? In-order reduction could be handled if we only
>  		  traverse one operand chain in vect_slp_linearize_chain.  */
> -	       && ((FLOAT_TYPE_P (TREE_TYPE (rhs)) && flag_associative_math)
> -		   || (INTEGRAL_TYPE_P (TREE_TYPE (rhs))
> -		       && TYPE_OVERFLOW_WRAPS (TREE_TYPE (rhs))))
> +	       && !needs_fold_left_reduction_p (TREE_TYPE (rhs), code)
>  	       /* Ops with constants at the tail can be stripped here.  */
>  	       && TREE_CODE (rhs) == SSA_NAME
>  	       && TREE_CODE (gimple_assign_rhs2 (assign)) == SSA_NAME
> @@ -9161,9 +9158,23 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance)
>  	  /* We may end up with more than one vector result, reduce them
>  	     to one vector.  */
>  	  tree vec_def = vec_defs[0];
> +	  tree vectype = TREE_TYPE (vec_def);
> +	  tree compute_vectype = vectype;
> +	  if (TYPE_OVERFLOW_UNDEFINED (vectype))
> +	    {
> +	      compute_vectype = unsigned_type_for (vectype);
> +	      vec_def = gimple_build (&epilogue, VIEW_CONVERT_EXPR,
> +				      compute_vectype, vec_def);
> +	    }
>  	  for (unsigned i = 1; i < vec_defs.length (); ++i)
> -	    vec_def = gimple_build (&epilogue, reduc_code, TREE_TYPE (vec_def),
> -				    vec_def, vec_defs[i]);
> +	    {
> +	      tree def = vec_defs[i];
> +	      if (TYPE_OVERFLOW_UNDEFINED (vectype))
> +		def = gimple_build (&epilogue, VIEW_CONVERT_EXPR,
> +				    compute_vectype, def);
> +	      vec_def = gimple_build (&epilogue, reduc_code, compute_vectype,
> +				      vec_def, def);
> +	    }
>  	  vec_defs.release ();
>  	  /* ??? Support other schemes than direct internal fn.  */
>  	  internal_fn reduc_fn;
> @@ -9171,21 +9182,26 @@ vectorize_slp_instance_root_stmt (slp_tree node, slp_instance instance)
>  	      || reduc_fn == IFN_LAST)
>  	    gcc_unreachable ();
>  	  tree scalar_def = gimple_build (&epilogue, as_combined_fn (reduc_fn),
> -					  TREE_TYPE (TREE_TYPE (vec_def)), vec_def);
> +					  TREE_TYPE (compute_vectype), vec_def);
>  	  if (!SLP_INSTANCE_REMAIN_DEFS (instance).is_empty ())
>  	    {
>  	      tree rem_def = NULL_TREE;
>  	      for (auto def : SLP_INSTANCE_REMAIN_DEFS (instance))
> -		if (!rem_def)
> -		  rem_def = def;
> -		else
> -		  rem_def = gimple_build (&epilogue, reduc_code,
> -					  TREE_TYPE (scalar_def),
> -					  rem_def, def);
> +		{
> +		  def = gimple_convert (&epilogue, TREE_TYPE (scalar_def), def);
> +		  if (!rem_def)
> +		    rem_def = def;
> +		  else
> +		    rem_def = gimple_build (&epilogue, reduc_code,
> +					    TREE_TYPE (scalar_def),
> +					    rem_def, def);
> +		}
>  	      scalar_def = gimple_build (&epilogue, reduc_code,
>  					 TREE_TYPE (scalar_def),
>  					 scalar_def, rem_def);
>  	    }
> +	  scalar_def = gimple_convert (&epilogue,
> +				       TREE_TYPE (vectype), scalar_def);
>  	  gimple_stmt_iterator rgsi = gsi_for_stmt (instance->root_stmts[0]->stmt);
>  	  gsi_insert_seq_before (&rgsi, epilogue, GSI_SAME_STMT);
>  	  gimple_assign_set_rhs_from_tree (&rgsi, scalar_def);