From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH] Add emulated gather capability to the vectorizer
To: Richard Biener
Cc: richard.sandiford@arm.com, gcc-patches@gcc.gnu.org
References: <3n590sn-6584-r9q-oqp7-953p556qp515@fhfr.qr>
 <2f12c7e9-7098-7380-c943-3b554e992d58@linux.ibm.com>
From: "Kewen.Lin"
Date: Mon, 2 Aug 2021 14:07:49 +0800
In-Reply-To: <2f12c7e9-7098-7380-c943-3b554e992d58@linux.ibm.com>

on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Richi,
> 
> on 2021/7/30 7:34 PM, Richard Biener wrote:
>> This adds a gather vectorization capability to the vectorizer
>> without target support by decomposing the offset vector, doing
>> scalar loads and then building a vector from the result.  This
>> is aimed mainly at cases where vectorizing the rest of the loop
>> offsets the cost of vectorizing the gather.
>>
>> Note it's difficult to avoid vectorizing the offset load, but in
>> some cases later passes can turn the vector load + extract into
>> scalar loads, see the followup patch.
>>
>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
>> to 219s on a Zen2 CPU which has its native gather instructions
>> disabled (using those, the runtime instead increases to 254s),
>> using -Ofast -march=znver2 [-flto].  It turns out the critical
>> loops in this benchmark all perform gather operations.
>>
> 
> Wow, it sounds promising!
> 
>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>
>> Any comments?  I still plan to run this over full SPEC and
>> I have to apply TLC to the followup patch before I can post it.
>>
>> I think neither power nor z has gather, so I'm curious whether the
>> patch helps 510.parest_r there; I'm unsure about neon/advsimd.
> 
> Yes, Power (latest Power10) doesn't support gather load.
> I just measured 510.parest_r with this patch on Power9 at option
> -Ofast -mcpu=power9 {,-funroll-loops}; both are neutral.
> 
> It fails to vectorize the loop in vect-gather-1.c:
> 
>   vect-gather.c:12:28: missed: failed: evolution of base is not affine.
>   vect-gather.c:11:46: missed: not vectorized: data ref analysis failed _6 = *_5;
>   vect-gather.c:12:28: missed: not vectorized: data ref analysis failed: _6 = *_5;
>   vect-gather.c:11:46: missed: bad data references.
>   vect-gather.c:11:46: missed: couldn't vectorize loop
> 

By further investigation, it's because rs6000 fails to make maybe_gather
true in:

  bool maybe_gather
    = DR_IS_READ (dr)
      && !TREE_THIS_VOLATILE (DR_REF (dr))
      && (targetm.vectorize.builtin_gather != NULL
          || supports_vec_gather_load_p ());

With a hack defining TARGET_VECTORIZE_BUILTIN_GATHER (as well as
TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the case gets vectorized
as expected.

But re-evaluating 510.parest_r with this extra hack, the runtime
performance doesn't change.

BR,
Kewen
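
A minimal sketch of what such a hack could look like -- this is an
assumption for illustration, not necessarily Kewen's actual change, and
the name rs6000_builtin_gather is hypothetical.  Merely defining the hook
makes "targetm.vectorize.builtin_gather != NULL" hold in the maybe_gather
test above, while returning NULL_TREE keeps gs_info.decl NULL, so the
emulated-gather path added by the patch below is still the one taken.
An analogous stub would cover TARGET_VECTORIZE_BUILTIN_SCATTER.

  /* Hypothetical stub in rs6000.c: claim the hook exists so that
     maybe_gather becomes true, but provide no real gather builtin.  */
  static tree
  rs6000_builtin_gather (const_tree mem_vectype ATTRIBUTE_UNUSED,
                         const_tree index_type ATTRIBUTE_UNUSED,
                         int scale ATTRIBUTE_UNUSED)
  {
    return NULL_TREE;
  }

  #undef TARGET_VECTORIZE_BUILTIN_GATHER
  #define TARGET_VECTORIZE_BUILTIN_GATHER rs6000_builtin_gather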
>> Both might need the followup patch - I was surprised about
>> the speedup without it on Zen (the followup improves runtime
>> to 198s there).
>>
>> Thanks,
>> Richard.
>>
>> 2021-07-30  Richard Biener
>>
>>      * tree-vect-data-refs.c (vect_check_gather_scatter):
>>      Include widening conversions only when the result is
>>      still handled by native gather or the current offset
>>      size does not already match the data size.
>>      Also succeed analysis in case there's no native support,
>>      noted by an IFN_LAST ifn and a NULL decl.
>>      * tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
>>      Test for no IFN gather rather than decl gather.
>>      * tree-vect-stmts.c (vect_model_load_cost): Pass in the
>>      gather-scatter info and cost emulated gathers accordingly.
>>      (vect_truncate_gather_scatter_offset): Properly test for
>>      no IFN gather.
>>      (vect_use_strided_gather_scatters_p): Likewise.
>>      (get_load_store_type): Handle emulated gathers and their
>>      restrictions.
>>      (vectorizable_load): Likewise.  Emulate them by extracting
>>      scalar offsets, doing scalar loads and a vector construct.
>>
>>      * gcc.target/i386/vect-gather-1.c: New testcase.
>>      * gfortran.dg/vect/vect-8.f90: Adjust.
>> ---
>>  gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 ++++
>>  gcc/testsuite/gfortran.dg/vect/vect-8.f90     |   2 +-
>>  gcc/tree-vect-data-refs.c                     |  29 +++--
>>  gcc/tree-vect-patterns.c                      |   2 +-
>>  gcc/tree-vect-stmts.c                         | 100 ++++++++++++++++--
>>  5 files changed, 136 insertions(+), 15 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
>> new file mode 100644
>> index 00000000000..134aef39666
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
>> @@ -0,0 +1,18 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
>> +
>> +#ifndef INDEXTYPE
>> +#define INDEXTYPE int
>> +#endif
>> +double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
>> +            double *luval, double *dst)
>> +{
>> +  double res = 0;
>> +  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
>> +    res += *luval * dst[*col];
>> +  return res;
>> +}
>> +
>> +/* With gather emulation this should be profitable to vectorize
>> +   even with plain SSE2.  */
>> +/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
>> diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
>> index 9994805d77f..cc1aebfbd84 100644
>> --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
>> +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
>> @@ -706,5 +706,5 @@ END SUBROUTINE kernel
>>
>>  ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { target aarch64_sve } } }
>>  ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { target { aarch64*-*-* && { ! aarch64_sve } } } } }
>> -! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>> +! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
>> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
>> index 6995efba899..0279e75fa8e 100644
>> --- a/gcc/tree-vect-data-refs.c
>> +++ b/gcc/tree-vect-data-refs.c
>> @@ -4007,8 +4007,26 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>>            continue;
>>          }
>>
>> -      if (TYPE_PRECISION (TREE_TYPE (op0))
>> -          < TYPE_PRECISION (TREE_TYPE (off)))
>> +      /* Include the conversion if it is widening and we're using
>> +         the IFN path or the target can handle the converted from
>> +         offset or the current size is not already the same as the
>> +         data vector element size.  */
>> +      if ((TYPE_PRECISION (TREE_TYPE (op0))
>> +           < TYPE_PRECISION (TREE_TYPE (off)))
>> +          && ((!use_ifn_p
>> +               && (DR_IS_READ (dr)
>> +                   ? (targetm.vectorize.builtin_gather
>> +                      && targetm.vectorize.builtin_gather (vectype,
>> +                                                           TREE_TYPE (op0),
>> +                                                           scale))
>> +                   : (targetm.vectorize.builtin_scatter
>> +                      && targetm.vectorize.builtin_scatter (vectype,
>> +                                                            TREE_TYPE (op0),
>> +                                                            scale))))
>> +              || (!use_ifn_p
>> +                  && !operand_equal_p (TYPE_SIZE (TREE_TYPE (off)),
>> +                                       TYPE_SIZE (TREE_TYPE (vectype)),
>> +                                       0))))
>>          {
>>            off = op0;
>>            offtype = TREE_TYPE (off);
>> @@ -4036,7 +4054,8 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>>        if (!vect_gather_scatter_fn_p (loop_vinfo, DR_IS_READ (dr), masked_p,
>>                                       vectype, memory_type, offtype, scale,
>>                                       &ifn, &offset_vectype))
>> -        return false;
>> +        ifn = IFN_LAST;
>> +      decl = NULL_TREE;
>>      }
>>    else
>>      {
>> @@ -4050,10 +4069,6 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>>        if (targetm.vectorize.builtin_scatter)
>>          decl = targetm.vectorize.builtin_scatter (vectype, offtype, scale);
>> -
>> -      if (!decl)
>> -        return false;
>> -
>>        ifn = IFN_LAST;
>>        /* The offset vector type will be read from DECL when needed.  */
>>        offset_vectype = NULL_TREE;
>> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
>> index 743fd3f5414..25de97bd9b0 100644
>> --- a/gcc/tree-vect-patterns.c
>> +++ b/gcc/tree-vect-patterns.c
>> @@ -4811,7 +4811,7 @@ vect_recog_gather_scatter_pattern (vec_info *vinfo,
>>       function for the gather/scatter operation.  */
>>    gather_scatter_info gs_info;
>>    if (!vect_check_gather_scatter (stmt_info, loop_vinfo, &gs_info)
>> -      || gs_info.decl)
>> +      || gs_info.ifn == IFN_LAST)
>>      return NULL;
>>
>>    /* Convert the mask to the right form.  */
>> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> index 05085e1b110..9d51b476db6 100644
>> --- a/gcc/tree-vect-stmts.c
>> +++ b/gcc/tree-vect-stmts.c
>> @@ -1084,6 +1084,7 @@ static void
>>  vect_model_load_cost (vec_info *vinfo,
>>                        stmt_vec_info stmt_info, unsigned ncopies, poly_uint64 vf,
>>                        vect_memory_access_type memory_access_type,
>> +                      gather_scatter_info *gs_info,
>>                        slp_tree slp_node,
>>                        stmt_vector_for_cost *cost_vec)
>>  {
>> @@ -1172,9 +1173,17 @@ vect_model_load_cost (vec_info *vinfo,
>>    if (memory_access_type == VMAT_ELEMENTWISE
>>        || memory_access_type == VMAT_GATHER_SCATTER)
>>      {
>> -      /* N scalar loads plus gathering them into a vector.  */
>>        tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>        unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
>> +      if (memory_access_type == VMAT_GATHER_SCATTER
>> +          && gs_info->ifn == IFN_LAST && !gs_info->decl)
>> +        /* For emulated gathers N offset vector element extracts
>> +           (we assume the scalar scaling and ptr + offset add is consumed by
>> +           the load).  */
>> +        inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
>> +                                         vec_to_scalar, stmt_info, 0,
>> +                                         vect_body);
>> +      /* N scalar loads plus gathering them into a vector.  */
>>        inside_cost += record_stmt_cost (cost_vec,
>>                                         ncopies * assumed_nunits,
>>                                         scalar_load, stmt_info, 0, vect_body);
>> @@ -1184,7 +1193,9 @@ vect_model_load_cost (vec_info *vinfo,
>>                                        &inside_cost, &prologue_cost,
>>                                        cost_vec, cost_vec, true);
>>    if (memory_access_type == VMAT_ELEMENTWISE
>> -      || memory_access_type == VMAT_STRIDED_SLP)
>> +      || memory_access_type == VMAT_STRIDED_SLP
>> +      || (memory_access_type == VMAT_GATHER_SCATTER
>> +          && gs_info->ifn == IFN_LAST && !gs_info->decl))
>>      inside_cost += record_stmt_cost (cost_vec, ncopies, vec_construct,
>>                                       stmt_info, 0, vect_body);
>>
>> @@ -1866,7 +1877,8 @@ vect_truncate_gather_scatter_offset (stmt_vec_info stmt_info,
>>        tree memory_type = TREE_TYPE (DR_REF (dr));
>>        if (!vect_gather_scatter_fn_p (loop_vinfo, DR_IS_READ (dr), masked_p,
>>                                       vectype, memory_type, offset_type, scale,
>> -                                     &gs_info->ifn, &gs_info->offset_vectype))
>> +                                     &gs_info->ifn, &gs_info->offset_vectype)
>> +          || gs_info->ifn == IFN_LAST)
>>          continue;
>>
>>        gs_info->decl = NULL_TREE;
>> @@ -1901,7 +1913,7 @@ vect_use_strided_gather_scatters_p (stmt_vec_info stmt_info,
>>                                      gather_scatter_info *gs_info)
>>  {
>>    if (!vect_check_gather_scatter (stmt_info, loop_vinfo, gs_info)
>> -      || gs_info->decl)
>> +      || gs_info->ifn == IFN_LAST)
>>      return vect_truncate_gather_scatter_offset (stmt_info, loop_vinfo,
>>                                                  masked_p, gs_info);
>>
>> @@ -2355,6 +2367,27 @@ get_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>>                               vls_type == VLS_LOAD ? "gather" : "scatter");
>>            return false;
>>          }
>> +      else if (gs_info->ifn == IFN_LAST && !gs_info->decl)
>> +        {
>> +          if (vls_type != VLS_LOAD)
>> +            {
>> +              if (dump_enabled_p ())
>> +                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                                 "unsupported emulated scatter.\n");
>> +              return false;
>> +            }
>> +          else if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant ()
>> +                   || !known_eq (TYPE_VECTOR_SUBPARTS (vectype),
>> +                                 TYPE_VECTOR_SUBPARTS
>> +                                   (gs_info->offset_vectype)))
>> +            {
>> +              if (dump_enabled_p ())
>> +                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                                 "unsupported vector types for emulated "
>> +                                 "gather.\n");
>> +              return false;
>> +            }
>> +        }
>>        /* Gather-scatter accesses perform only component accesses, alignment
>>           is irrelevant for them.  */
>>        *alignment_support_scheme = dr_unaligned_supported;
>> @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo,
>>                                 "unsupported access type for masked load.\n");
>>                return false;
>>              }
>> +          else if (memory_access_type == VMAT_GATHER_SCATTER
>> +                   && gs_info.ifn == IFN_LAST
>> +                   && !gs_info.decl)
>> +            {
>> +              if (dump_enabled_p ())
>> +                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                                 "unsupported masked emulated gather.\n");
>> +              return false;
>> +            }
>>          }
>>
>>    if (!vec_stmt) /* transformation not required.  */
>> @@ -8725,7 +8767,7 @@
>>
>>        STMT_VINFO_TYPE (orig_stmt_info) = load_vec_info_type;
>>        vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
>> -                            slp_node, cost_vec);
>> +                            &gs_info, slp_node, cost_vec);
>>        return true;
>>      }
>>
>> @@ -9438,7 +9480,8 @@ vectorizable_load (vec_info *vinfo,
>>              unsigned int misalign;
>>              unsigned HOST_WIDE_INT align;
>>
>> -            if (memory_access_type == VMAT_GATHER_SCATTER)
>> +            if (memory_access_type == VMAT_GATHER_SCATTER
>> +                && gs_info.ifn != IFN_LAST)
>>                {
>>                  tree zero = build_zero_cst (vectype);
>>                  tree scale = size_int (gs_info.scale);
>> @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo,
>>                  data_ref = NULL_TREE;
>>                  break;
>>                }
>> +            else if (memory_access_type == VMAT_GATHER_SCATTER)
>> +              {
>> +                /* Emulated gather-scatter.  */
>> +                gcc_assert (!final_mask);
>> +                unsigned HOST_WIDE_INT const_nunits
>> +                  = nunits.to_constant ();
>> +                vec<constructor_elt, va_gc> *ctor_elts;
>> +                vec_alloc (ctor_elts, const_nunits);
>> +                gimple_seq stmts = NULL;
>> +                tree idx_type = TREE_TYPE (TREE_TYPE (vec_offset));
>> +                tree scale = size_int (gs_info.scale);
>> +                align
>> +                  = get_object_alignment (DR_REF (first_dr_info->dr));
>> +                tree ltype = build_aligned_type (TREE_TYPE (vectype),
>> +                                                 align);
>> +                for (unsigned k = 0; k < const_nunits; ++k)
>> +                  {
>> +                    tree boff = size_binop (MULT_EXPR,
>> +                                            TYPE_SIZE (idx_type),
>> +                                            bitsize_int (k));
>> +                    tree idx = gimple_build (&stmts, BIT_FIELD_REF,
>> +                                             idx_type, vec_offset,
>> +                                             TYPE_SIZE (idx_type),
>> +                                             boff);
>> +                    idx = gimple_convert (&stmts, sizetype, idx);
>> +                    idx = gimple_build (&stmts, MULT_EXPR,
>> +                                        sizetype, idx, scale);
>> +                    tree ptr = gimple_build (&stmts, PLUS_EXPR,
>> +                                             TREE_TYPE (dataref_ptr),
>> +                                             dataref_ptr, idx);
>> +                    ptr = gimple_convert (&stmts, ptr_type_node, ptr);
>> +                    tree elt = make_ssa_name (TREE_TYPE (vectype));
>> +                    tree ref = build2 (MEM_REF, ltype, ptr,
>> +                                       build_int_cst (ref_type, 0));
>> +                    new_stmt = gimple_build_assign (elt, ref);
>> +                    gimple_seq_add_stmt (&stmts, new_stmt);
>> +                    CONSTRUCTOR_APPEND_ELT (ctor_elts, NULL_TREE, elt);
>> +                  }
>> +                gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
>> +                new_stmt = gimple_build_assign (NULL_TREE,
>> +                                                build_constructor
>> +                                                  (vectype, ctor_elts));
>> +                data_ref = NULL_TREE;
>> +                break;
>> +              }
>>
>>              align =
>>                known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
>>
> 
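
To make the new vectorizable_load path concrete: below is a plain-C
analogue of the code the emulated gather builds per vector, assuming a
2-element double vector (const_nunits == 2) with int offsets -- the type
v2df, the function name gather_v2df, and reading the offsets from a plain
array rather than via the patch's BIT_FIELD_REF extracts of vec_offset
are illustrative assumptions, not part of the patch.

  /* Sketch only: per vector, const_nunits offset extracts, const_nunits
     scalar loads, and one vector CONSTRUCTOR -- matching what
     vect_model_load_cost now charges (N vec_to_scalar, N scalar_load,
     and one vec_construct per copy).  */
  typedef double v2df __attribute__ ((vector_size (16)));

  static v2df
  gather_v2df (const char *base, const int off[2], long scale)
  {
    double elt0 = *(const double *) (base + (long) off[0] * scale);
    double elt1 = *(const double *) (base + (long) off[1] * scale);
    return (v2df) { elt0, elt1 };
  }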