From: Christophe Lyon
Date: Thu, 5 Aug 2021 11:42:14 +0200
Subject: Re: [PATCH 1/2] Add emulated gather capability to the vectorizer
To: Richard Biener
Cc: Richard Sandiford, hongtao.liu@intel.com, GCC Patches

On Wed, Aug 4, 2021 at 2:08 PM Richard Biener wrote:

> On Wed, 4 Aug 2021, Richard Sandiford wrote:
>
> > Richard Biener writes:
> > > This adds a gather vectorization capability to the vectorizer
> > > without target support by decomposing the offset vector, doing
> > > scalar loads and then building a vector from the result.  This
> > > is aimed mainly at cases where vectorizing the rest of the loop
> > > offsets the cost of vectorizing the gather.
> > >
> > > Note it's difficult to avoid vectorizing the offset load, but in
> > > some cases later passes can turn the vector load + extract into
> > > scalar loads, see the followup patch.
> > >
> > > On SPEC CPU 2017 510.parest_r this improves runtime from 250s
> > > to 219s on a Zen2 CPU which has its native gather instructions
> > > disabled (using those the runtime instead increases to 254s)
> > > using -Ofast -march=znver2 [-flto].  It turns out the critical
> > > loops in this benchmark all perform gather operations.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
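In source terms, the emulated gather described above amounts to something
like the C below, written here with GCC generic vectors and a hard-coded
lane count of 2 purely as an illustration; the vectorizer emits the
equivalent GIMPLE (BIT_FIELD_REF extracts, scalar loads, a CONSTRUCTOR),
not this source:

typedef double v2df __attribute__ ((vector_size (16)));
typedef int v2si __attribute__ ((vector_size (8)));

/* Illustrative sketch only: an emulated gather of two doubles via int
   offsets -- extract each offset lane, do a scalar load, then build the
   result vector again.  */
static v2df
emulated_gather (const double *base, v2si offsets)
{
  double lane0 = base[offsets[0]];
  double lane1 = base[offsets[1]];
  return (v2df) { lane0, lane1 };
}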
> > >
> > > 2021-07-30  Richard Biener
> > >
> > > 	* tree-vect-data-refs.c (vect_check_gather_scatter):
> > > 	Include widening conversions only when the result is
> > > 	still handled by native gather or the current offset
> > > 	size does not already match the data size.
> > > 	Also succeed analysis in case there's no native support,
> > > 	noted by an IFN_LAST ifn and a NULL decl.
> > > 	(vect_analyze_data_refs): Always consider gathers.
> > > 	* tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
> > > 	Test for no IFN gather rather than decl gather.
> > > 	* tree-vect-stmts.c (vect_model_load_cost): Pass in the
> > > 	gather-scatter info and cost emulated gathers accordingly.
> > > 	(vect_truncate_gather_scatter_offset): Properly test for
> > > 	no IFN gather.
> > > 	(vect_use_strided_gather_scatters_p): Likewise.
> > > 	(get_load_store_type): Handle emulated gathers and their
> > > 	restrictions.
> > > 	(vectorizable_load): Likewise.  Emulate them by extracting
> > > 	scalar offsets, doing scalar loads and a vector construct.
> > >
> > > 	* gcc.target/i386/vect-gather-1.c: New testcase.
> > > 	* gfortran.dg/vect/vect-8.f90: Adjust.
>

Hi,

The adjusted testcase now fails on aarch64:
FAIL: gfortran.dg/vect/vect-8.f90 -O scan-tree-dump-times vect "vectorized 23 loops" 1

Christophe

> > > ---
> > >  gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 ++++
> > >  gcc/testsuite/gfortran.dg/vect/vect-8.f90     |   2 +-
> > >  gcc/tree-vect-data-refs.c                     |  34 ++++--
> > >  gcc/tree-vect-patterns.c                      |   2 +-
> > >  gcc/tree-vect-stmts.c                         | 100 ++++++++++++++++--
> > >  5 files changed, 138 insertions(+), 18 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > > new file mode 100644
> > > index 00000000000..134aef39666
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
> > > +
> > > +#ifndef INDEXTYPE
> > > +#define INDEXTYPE int
> > > +#endif
> > > +double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
> > > +	    double *luval, double *dst)
> > > +{
> > > +  double res = 0;
> > > +  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
> > > +    res += *luval * dst[*col];
> > > +  return res;
> > > +}
> > > +
> > > +/* With gather emulation this should be profitable to vectorize
> > > +   even with plain SSE2.  */
> > > +/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
> > > diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > > index 9994805d77f..cc1aebfbd84 100644
> > > --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > > +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> > > @@ -706,5 +706,5 @@ END SUBROUTINE kernel
> > >
> > >  ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { target aarch64_sve } } }
> > >  ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { target { aarch64*-*-* && { ! aarch64_sve } } } } }
> > > -! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> > > +! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
> > >  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
> > > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> > > index 6995efba899..3c29ff04fd8 100644
> > > --- a/gcc/tree-vect-data-refs.c
> > > +++ b/gcc/tree-vect-data-refs.c
> > > @@ -4007,8 +4007,27 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
> > >  	  continue;
> > >  	}
> > >
> > > -      if (TYPE_PRECISION (TREE_TYPE (op0))
> > > -	  < TYPE_PRECISION (TREE_TYPE (off)))
> > > +      /* Include the conversion if it is widening and we're using
> > > +	 the IFN path or the target can handle the converted from
> > > +	 offset or the current size is not already the same as the
> > > +	 data vector element size.  */
> > > +      if ((TYPE_PRECISION (TREE_TYPE (op0))
> > > +	   < TYPE_PRECISION (TREE_TYPE (off)))
> > > +	  && ((!use_ifn_p
> > > +	       && (DR_IS_READ (dr)
> > > +		   ? (targetm.vectorize.builtin_gather
> > > +		      && targetm.vectorize.builtin_gather (vectype,
> > > +							   TREE_TYPE (op0),
> > > +							   scale))
> > > +		   : (targetm.vectorize.builtin_scatter
> > > +		      && targetm.vectorize.builtin_scatter (vectype,
> > > +							    TREE_TYPE (op0),
> > > +							    scale))))
> > > +	      || (!use_ifn_p
> > > +		  && !operand_equal_p (TYPE_SIZE (TREE_TYPE (off)),
> > > +				       TYPE_SIZE (TREE_TYPE (vectype)),
> > > +				       0))
> > > +	      || use_ifn_p))
> >
> > Looks like this would be slightly simpler to read/understand as:
> >
> >     && (use_ifn_p
> >         || (DR_IS_READ …)
> >         || !operand_equal_p …)
> >
> > (Sorry for the unsubstantive comment.)
>
> Ah, yes.  Will adjust before committing.
>
> Thanks,
> Richard.
>
> > Richard
> >
> > >  	{
> > >  	  off = op0;
> > >  	  offtype = TREE_TYPE (off);
> > > @@ -4036,7 +4055,8 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
> > >        if (!vect_gather_scatter_fn_p (loop_vinfo, DR_IS_READ (dr), masked_p,
> > >  				     vectype, memory_type, offtype, scale,
> > >  				     &ifn, &offset_vectype))
> > > -	return false;
> > > +	ifn = IFN_LAST;
> > > +      decl = NULL_TREE;
> > >      }
> > >    else
> > >      {
> > > @@ -4050,10 +4070,6 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
> > >        if (targetm.vectorize.builtin_scatter)
> > >  	decl = targetm.vectorize.builtin_scatter (vectype, offtype, scale);
> > > -
> > > -      if (!decl)
> > > -	return false;
> > > -
> > >        ifn = IFN_LAST;
> > >        /* The offset vector type will be read from DECL when needed.  */
> > >        offset_vectype = NULL_TREE;
> > > @@ -4283,9 +4299,7 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal)
> > >  	{
> > >  	  bool maybe_gather
> > >  	    = DR_IS_READ (dr)
> > > -	      && !TREE_THIS_VOLATILE (DR_REF (dr))
> > > -	      && (targetm.vectorize.builtin_gather != NULL
> > > -		  || supports_vec_gather_load_p ());
> > > +	      && !TREE_THIS_VOLATILE (DR_REF (dr));
> > >  	  bool maybe_scatter
> > >  	    = DR_IS_WRITE (dr)
> > >  	      && !TREE_THIS_VOLATILE (DR_REF (dr))
> > > diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
> > > index 743fd3f5414..25de97bd9b0 100644
> > > --- a/gcc/tree-vect-patterns.c
> > > +++ b/gcc/tree-vect-patterns.c
> > > @@ -4811,7 +4811,7 @@ vect_recog_gather_scatter_pattern (vec_info *vinfo,
> > >       function for the gather/scatter operation.  */
> > >    gather_scatter_info gs_info;
> > >    if (!vect_check_gather_scatter (stmt_info, loop_vinfo, &gs_info)
> > > -      || gs_info.decl)
> > > +      || gs_info.ifn == IFN_LAST)
> > >      return NULL;
> > >
> > >    /* Convert the mask to the right form.  */
> > > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > > index 05085e1b110..9d51b476db6 100644
> > > --- a/gcc/tree-vect-stmts.c
> > > +++ b/gcc/tree-vect-stmts.c
> > > @@ -1084,6 +1084,7 @@ static void
> > >  vect_model_load_cost (vec_info *vinfo,
> > >  		      stmt_vec_info stmt_info, unsigned ncopies, poly_uint64 vf,
> > >  		      vect_memory_access_type memory_access_type,
> > > +		      gather_scatter_info *gs_info,
> > >  		      slp_tree slp_node,
> > >  		      stmt_vector_for_cost *cost_vec)
> > >  {
> > > @@ -1172,9 +1173,17 @@ vect_model_load_cost (vec_info *vinfo,
> > >    if (memory_access_type == VMAT_ELEMENTWISE
> > >        || memory_access_type == VMAT_GATHER_SCATTER)
> > >      {
> > > -      /* N scalar loads plus gathering them into a vector.  */
> > >        tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> > >        unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> > > +      if (memory_access_type == VMAT_GATHER_SCATTER
> > > +	  && gs_info->ifn == IFN_LAST && !gs_info->decl)
> > > +	/* For emulated gathers N offset vector element extracts
> > > +	   (we assume the scalar scaling and ptr + offset add is consumed by
> > > +	   the load).  */
> > > +	inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
> > > +					 vec_to_scalar, stmt_info, 0,
> > > +					 vect_body);
> > > +      /* N scalar loads plus gathering them into a vector.  */
> > >        inside_cost += record_stmt_cost (cost_vec,
> > >  				       ncopies * assumed_nunits,
> > >  				       scalar_load, stmt_info, 0, vect_body);
> > > @@ -1184,7 +1193,9 @@ vect_model_load_cost (vec_info *vinfo,
> > >  			&inside_cost, &prologue_cost,
> > >  			cost_vec, cost_vec, true);
> > >    if (memory_access_type == VMAT_ELEMENTWISE
> > > -      || memory_access_type == VMAT_STRIDED_SLP)
> > > +      || memory_access_type == VMAT_STRIDED_SLP
> > > +      || (memory_access_type == VMAT_GATHER_SCATTER
> > > +	  && gs_info->ifn == IFN_LAST && !gs_info->decl))
> > >      inside_cost += record_stmt_cost (cost_vec, ncopies, vec_construct,
> > >  				     stmt_info, 0, vect_body);
> > >
> > > @@ -1866,7 +1877,8 @@ vect_truncate_gather_scatter_offset (stmt_vec_info stmt_info,
> > >        tree memory_type = TREE_TYPE (DR_REF (dr));
> > >        if (!vect_gather_scatter_fn_p (loop_vinfo, DR_IS_READ (dr), masked_p,
> > >  				     vectype, memory_type, offset_type, scale,
> > > -				     &gs_info->ifn, &gs_info->offset_vectype))
> > > +				     &gs_info->ifn, &gs_info->offset_vectype)
> > > +	  || gs_info->ifn == IFN_LAST)
> > >  	continue;
> > >
> > >        gs_info->decl = NULL_TREE;
> > > @@ -1901,7 +1913,7 @@ vect_use_strided_gather_scatters_p (stmt_vec_info stmt_info,
> > >  				    gather_scatter_info *gs_info)
> > >  {
> > >    if (!vect_check_gather_scatter (stmt_info, loop_vinfo, gs_info)
> > > -      || gs_info->decl)
> > > +      || gs_info->ifn == IFN_LAST)
> > >      return vect_truncate_gather_scatter_offset (stmt_info, loop_vinfo,
> > >  						masked_p, gs_info);
> > >
> > > @@ -2355,6 +2367,27 @@ get_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >  			   vls_type == VLS_LOAD ? "gather" : "scatter");
"gather" : "scatt= er"); > > > return false; > > > } > > > + else if (gs_info->ifn =3D=3D IFN_LAST && !gs_info->decl) > > > + { > > > + if (vls_type !=3D VLS_LOAD) > > > + { > > > + if (dump_enabled_p ()) > > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > + "unsupported emulated scatter.\n"); > > > + return false; > > > + } > > > + else if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant () > > > + || !known_eq (TYPE_VECTOR_SUBPARTS (vectype), > > > + TYPE_VECTOR_SUBPARTS > > > + (gs_info->offset_vectype))) > > > + { > > > + if (dump_enabled_p ()) > > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > + "unsupported vector types for emulated " > > > + "gather.\n"); > > > + return false; > > > + } > > > + } > > > /* Gather-scatter accesses perform only component accesses, > alignment > > > is irrelevant for them. */ > > > *alignment_support_scheme =3D dr_unaligned_supported; > > > @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo, > > > "unsupported access type for masked load.\n"= ); > > > return false; > > > } > > > + else if (memory_access_type =3D=3D VMAT_GATHER_SCATTER > > > + && gs_info.ifn =3D=3D IFN_LAST > > > + && !gs_info.decl) > > > + { > > > + if (dump_enabled_p ()) > > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > > + "unsupported masked emulated gather.\n"); > > > + return false; > > > + } > > > } > > > > > > if (!vec_stmt) /* transformation not required. */ > > > @@ -8725,7 +8767,7 @@ vectorizable_load (vec_info *vinfo, > > > > > > STMT_VINFO_TYPE (orig_stmt_info) =3D load_vec_info_type; > > > vect_model_load_cost (vinfo, stmt_info, ncopies, vf, > memory_access_type, > > > - slp_node, cost_vec); > > > + &gs_info, slp_node, cost_vec); > > > return true; > > > } > > > > > > @@ -9438,7 +9480,8 @@ vectorizable_load (vec_info *vinfo, > > > unsigned int misalign; > > > unsigned HOST_WIDE_INT align; > > > > > > - if (memory_access_type =3D=3D VMAT_GATHER_SCATTER) > > > + if (memory_access_type =3D=3D VMAT_GATHER_SCATTER > > > + && gs_info.ifn !=3D IFN_LAST) > > > { > > > tree zero =3D build_zero_cst (vectype); > > > tree scale =3D size_int (gs_info.scale); > > > @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo, > > > data_ref =3D NULL_TREE; > > > break; > > > } > > > + else if (memory_access_type =3D=3D VMAT_GATHER_SCATTE= R) > > > + { > > > + /* Emulated gather-scatter. 
> > > +		gcc_assert (!final_mask);
> > > +		unsigned HOST_WIDE_INT const_nunits
> > > +		  = nunits.to_constant ();
> > > +		vec<constructor_elt, va_gc> *ctor_elts;
> > > +		vec_alloc (ctor_elts, const_nunits);
> > > +		gimple_seq stmts = NULL;
> > > +		tree idx_type = TREE_TYPE (TREE_TYPE (vec_offset));
> > > +		tree scale = size_int (gs_info.scale);
> > > +		align
> > > +		  = get_object_alignment (DR_REF (first_dr_info->dr));
> > > +		tree ltype = build_aligned_type (TREE_TYPE (vectype),
> > > +						 align);
> > > +		for (unsigned k = 0; k < const_nunits; ++k)
> > > +		  {
> > > +		    tree boff = size_binop (MULT_EXPR,
> > > +					    TYPE_SIZE (idx_type),
> > > +					    bitsize_int (k));
> > > +		    tree idx = gimple_build (&stmts, BIT_FIELD_REF,
> > > +					     idx_type, vec_offset,
> > > +					     TYPE_SIZE (idx_type),
> > > +					     boff);
> > > +		    idx = gimple_convert (&stmts, sizetype, idx);
> > > +		    idx = gimple_build (&stmts, MULT_EXPR,
> > > +					sizetype, idx, scale);
> > > +		    tree ptr = gimple_build (&stmts, PLUS_EXPR,
> > > +					     TREE_TYPE (dataref_ptr),
> > > +					     dataref_ptr, idx);
> > > +		    ptr = gimple_convert (&stmts, ptr_type_node, ptr);
> > > +		    tree elt = make_ssa_name (TREE_TYPE (vectype));
> > > +		    tree ref = build2 (MEM_REF, ltype, ptr,
> > > +				       build_int_cst (ref_type, 0));
> > > +		    new_stmt = gimple_build_assign (elt, ref);
> > > +		    gimple_seq_add_stmt (&stmts, new_stmt);
> > > +		    CONSTRUCTOR_APPEND_ELT (ctor_elts, NULL_TREE, elt);
> > > +		  }
> > > +		gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
> > > +		new_stmt = gimple_build_assign (NULL_TREE,
> > > +						build_constructor
> > > +						  (vectype, ctor_elts));
> > > +		data_ref = NULL_TREE;
> > > +		break;
> > > +	      }
> > >
> > >  	    align =
> > >  	      known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
> >
>
> --
> Richard Biener
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
>