From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kewen.Lin"
Subject: Re: [PATCH] Add emulated gather capability to the vectorizer
To: Richard Biener
Cc: richard.sandiford@arm.com, gcc-patches@gcc.gnu.org
Date: Mon, 2 Aug 2021 16:35:35 +0800
Message-ID: <169f58db-4e9d-df26-309f-8e8c0c4b691f@linux.ibm.com>
References: <3n590sn-6584-r9q-oqp7-953p556qp515@fhfr.qr>
 <2f12c7e9-7098-7380-c943-3b554e992d58@linux.ibm.com>
List-Id: Gcc-patches mailing list
on 2021/8/2 3:09 PM, Richard Biener wrote:
> On Mon, 2 Aug 2021, Kewen.Lin wrote:
> 
>> on 2021/7/30 10:04 PM, Kewen.Lin via Gcc-patches wrote:
>>> Hi Richi,
>>>
>>> on 2021/7/30 7:34 PM, Richard Biener wrote:
>>>> This adds a gather vectorization capability to the vectorizer
>>>> without target support by decomposing the offset vector, doing
>>>> scalar loads and then building a vector from the result.  This
>>>> is aimed mainly at cases where vectorizing the rest of the loop
>>>> offsets the cost of vectorizing the gather.
>>>>
>>>> Note it's difficult to avoid vectorizing the offset load, but in
>>>> some cases later passes can turn the vector load + extract into
>>>> scalar loads, see the followup patch.
>>>>
>>>> On SPEC CPU 2017 510.parest_r this improves runtime from 250s
>>>> to 219s on a Zen2 CPU which has its native gather instructions
>>>> disabled (using those the runtime instead increases to 254s)
>>>> using -Ofast -march=znver2 [-flto].  It turns out the critical
>>>> loops in this benchmark all perform gather operations.
>>>>
>>>
>>> Wow, it sounds promising!
>>>
>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>>>
>>>> Any comments?  I still plan to run this over full SPEC and
>>>> I have to apply TLC to the followup patch before I can post it.
>>>>
>>>> I think neither power nor z has gather so I'm curious if the
>>>> patch helps 510.parest_r there, I'm unsure about neon/advsimd.
>>>
>>> Yes, Power (latest Power10) doesn't support gather load.
>>> I just measured 510.parest_r with this patch on Power9 with the
>>> options -Ofast -mcpu=power9 {,-funroll-loops}; both are neutral.
>>>
>>> It fails to vectorize the loop in vect-gather-1.c:
>>>
>>> vect-gather.c:12:28: missed: failed: evolution of base is not affine.
>>> vect-gather.c:11:46: missed: not vectorized: data ref analysis failed _6 = *_5;
>>> vect-gather.c:12:28: missed: not vectorized: data ref analysis failed: _6 = *_5;
>>> vect-gather.c:11:46: missed: bad data references.
>>> vect-gather.c:11:46: missed: couldn't vectorize loop
>>>
>>
>> On further investigation, it's because rs6000 fails to make
>> maybe_gather true in:
>>
>>   bool maybe_gather
>>     = DR_IS_READ (dr)
>>       && !TREE_THIS_VOLATILE (DR_REF (dr))
>>       && (targetm.vectorize.builtin_gather != NULL
>>           || supports_vec_gather_load_p ());
>>
>> With a hack defining TARGET_VECTORIZE_BUILTIN_GATHER (as
>> well as TARGET_VECTORIZE_BUILTIN_SCATTER) for rs6000, the
>> case gets vectorized as expected.
> 
> Ah, yeah - this check needs to be elided.  Checking with a cross
> that's indeed what is missing.
> 
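For the record, the hack was roughly along these lines.  Only the hook
pointer being non-NULL matters for the maybe_gather check, so even a
stub returning NULL_TREE is enough; with your patch a NULL decl then
falls through to the emulated path (a sketch, not the exact diff used):

  /* Hypothetical stub in gcc/config/rs6000/rs6000.c.  Returning
     NULL_TREE means "no native gather builtin", which with this patch
     ends up as ifn == IFN_LAST and decl == NULL, i.e. emulation.  */

  static tree
  rs6000_builtin_gather (const_tree mem_vectype ATTRIBUTE_UNUSED,
                         const_tree index_type ATTRIBUTE_UNUSED,
                         int scale ATTRIBUTE_UNUSED)
  {
    return NULL_TREE;
  }

  #undef TARGET_VECTORIZE_BUILTIN_GATHER
  #define TARGET_VECTORIZE_BUILTIN_GATHER rs6000_builtin_gather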
>> But re-evaluating 510.parest_r with this extra hack, the
>> runtime performance doesn't change at all.
> 
> OK, it likely needs the followup as well then.  For the added
> testcase I end up with
> 
> .L4:
>         lwz %r10,8(%r9)
>         lwz %r31,12(%r9)
>         addi %r8,%r8,32
>         addi %r9,%r9,16
>         lwz %r7,-16(%r9)
>         lwz %r0,-12(%r9)
>         lxv %vs10,-16(%r8)
>         lxv %vs12,-32(%r8)
>         extswsli %r10,%r10,3
>         extswsli %r31,%r31,3
>         extswsli %r7,%r7,3
>         extswsli %r0,%r0,3
>         ldx %r10,%r6,%r10
>         ldx %r31,%r6,%r31
>         ldx %r7,%r6,%r7
>         ldx %r12,%r6,%r0
>         mtvsrdd %vs0,%r31,%r10
>         mtvsrdd %vs11,%r12,%r7
>         xvmuldp %vs0,%vs0,%vs10
>         xvmaddadp %vs0,%vs12,%vs11
>         xvadddp %vs32,%vs32,%vs0
>         bdnz .L4
> 
> as the inner vectorized loop.  Looks like power doesn't have
> extending SI->DImode loads?  Otherwise the code looks

You meant one instruction for both the left shifting and the load?
It doesn't, if so.  btw, the above code sequence looks nice!

> straight-forward (I've used -mcpu=power10), but eventually
> the whole gather, which I think boils down to
> 
>         lwz %r10,8(%r9)
>         lwz %r31,12(%r9)
>         addi %r9,%r9,16
>         lwz %r7,-16(%r9)
>         lwz %r0,-12(%r9)
>         extswsli %r10,%r10,3
>         extswsli %r31,%r31,3
>         extswsli %r7,%r7,3
>         extswsli %r0,%r0,3
>         ldx %r10,%r6,%r10
>         ldx %r31,%r6,%r31
>         ldx %r7,%r6,%r7
>         ldx %r12,%r6,%r0
>         mtvsrdd %vs0,%r31,%r10
>         mtvsrdd %vs11,%r12,%r7
> 
> is too difficult for the integer/load side of the pipeline.  It's
> of course the same in the scalar loop but there the FP ops only
> need to wait for one lane to complete.  Note the above is with
> the followup patch.
> 

Good point!  I will check the actual effect once this and its
follow-up have landed; it may need some cost tweaking.
BR,
Kewen

> Richard.
> 
>> BR,
>> Kewen
>>
>>>> Both might need the followup patch - I was surprised about
>>>> the speedup without it on Zen (the followup improves runtime
>>>> to 198s there).
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>> 2021-07-30  Richard Biener
>>>>
>>>> 	* tree-vect-data-refs.c (vect_check_gather_scatter):
>>>> 	Include widening conversions only when the result is
>>>> 	still handled by native gather or the current offset
>>>> 	size does not already match the data size.
>>>> 	Also succeed analysis in case there's no native support,
>>>> 	noted by an IFN_LAST ifn and a NULL decl.
>>>> 	* tree-vect-patterns.c (vect_recog_gather_scatter_pattern):
>>>> 	Test for no IFN gather rather than decl gather.
>>>> 	* tree-vect-stmts.c (vect_model_load_cost): Pass in the
>>>> 	gather-scatter info and cost emulated gathers accordingly.
>>>> 	(vect_truncate_gather_scatter_offset): Properly test for
>>>> 	no IFN gather.
>>>> 	(vect_use_strided_gather_scatters_p): Likewise.
>>>> 	(get_load_store_type): Handle emulated gathers and their
>>>> 	restrictions.
>>>> 	(vectorizable_load): Likewise.  Emulate them by extracting
>>>> 	scalar offsets, doing scalar loads and a vector construct.
>>>>
>>>> 	* gcc.target/i386/vect-gather-1.c: New testcase.
>>>> 	* gfortran.dg/vect/vect-8.f90: Adjust.
>>>> ---
>>>>  gcc/testsuite/gcc.target/i386/vect-gather-1.c |  18 ++++
>>>>  gcc/testsuite/gfortran.dg/vect/vect-8.f90     |   2 +-
>>>>  gcc/tree-vect-data-refs.c                     |  29 +++--
>>>>  gcc/tree-vect-patterns.c                      |   2 +-
>>>>  gcc/tree-vect-stmts.c                         | 100 ++++++++++++++++--
>>>>  5 files changed, 136 insertions(+), 15 deletions(-)
>>>>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-gather-1.c
>>>>
>>>> diff --git a/gcc/testsuite/gcc.target/i386/vect-gather-1.c b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
>>>> new file mode 100644
>>>> index 00000000000..134aef39666
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/i386/vect-gather-1.c
>>>> @@ -0,0 +1,18 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-Ofast -msse2 -fdump-tree-vect-details" } */
>>>> +
>>>> +#ifndef INDEXTYPE
>>>> +#define INDEXTYPE int
>>>> +#endif
>>>> +double vmul(INDEXTYPE *rowstart, INDEXTYPE *rowend,
>>>> +	    double *luval, double *dst)
>>>> +{
>>>> +  double res = 0;
>>>> +  for (const INDEXTYPE * col = rowstart; col != rowend; ++col, ++luval)
>>>> +        res += *luval * dst[*col];
>>>> +  return res;
>>>> +}
>>>> +
>>>> +/* With gather emulation this should be profitable to vectorize
>>>> +   even with plain SSE2.  */
>>>> +/* { dg-final { scan-tree-dump "loop vectorized" "vect" } } */
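As an aside, to illustrate what the emulation buys for this loop: with
plain SSE2 and V2DF, each vector iteration extracts the two offsets,
does two scalar loads and builds the vector.  A rough scalar-equivalent
sketch, with the two-lane assumption and the naming mine, not actual
GCC output:

  /* What one emulated-gather vector iteration of vmul amounts to.  */
  static void
  emulated_gather_v2df (const int *col, const double *dst, double out[2])
  {
    out[0] = dst[col[0]];  /* lane 0: extract offset, scalar load */
    out[1] = dst[col[1]];  /* lane 1: extract offset, scalar load */
    /* out[0]/out[1] then feed a CONSTRUCTOR forming one V2DF vector
       which is multiplied with the luval vector.  */
  }

Compiling the testcase with the dg-options above and looking for
"loop vectorized" in the vect dump is the quickest way to check it.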
>>>> diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
>>>> index 9994805d77f..cc1aebfbd84 100644
>>>> --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
>>>> +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
>>>> @@ -706,5 +706,5 @@ END SUBROUTINE kernel
>>>>  
>>>>  ! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { target aarch64_sve } } }
>>>>  ! { dg-final { scan-tree-dump-times "vectorized 23 loops" 1 "vect" { target { aarch64*-*-* && { ! aarch64_sve } } } } }
>>>> -! { dg-final { scan-tree-dump-times "vectorized 2\[23\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>>>> +! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>>>>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }
>>>> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
>>>> index 6995efba899..0279e75fa8e 100644
>>>> --- a/gcc/tree-vect-data-refs.c
>>>> +++ b/gcc/tree-vect-data-refs.c
>>>> @@ -4007,8 +4007,26 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>>>>  	      continue;
>>>>  	    }
>>>>  
>>>> -	  if (TYPE_PRECISION (TREE_TYPE (op0))
>>>> -	      < TYPE_PRECISION (TREE_TYPE (off)))
>>>> +	  /* Include the conversion if it is widening and we're using
>>>> +	     the IFN path or the target can handle the converted from
>>>> +	     offset or the current size is not already the same as the
>>>> +	     data vector element size.  */
>>>> +	  if ((TYPE_PRECISION (TREE_TYPE (op0))
>>>> +	       < TYPE_PRECISION (TREE_TYPE (off)))
>>>> +	      && ((!use_ifn_p
>>>> +		   && (DR_IS_READ (dr)
>>>> +		       ? (targetm.vectorize.builtin_gather
>>>> +			  && targetm.vectorize.builtin_gather (vectype,
>>>> +							       TREE_TYPE (op0),
>>>> +							       scale))
>>>> +		       : (targetm.vectorize.builtin_scatter
>>>> +			  && targetm.vectorize.builtin_scatter (vectype,
>>>> +								TREE_TYPE (op0),
>>>> +								scale))))
>>>> +		  || (!use_ifn_p
>>>> +		      && !operand_equal_p (TYPE_SIZE (TREE_TYPE (off)),
>>>> +					   TYPE_SIZE (TREE_TYPE (vectype)),
>>>> +					   0))))
>>>>  	    {
>>>>  	      off = op0;
>>>>  	      offtype = TREE_TYPE (off);
>>>> @@ -4036,7 +4054,8 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>>>>        if (!vect_gather_scatter_fn_p (loop_vinfo, DR_IS_READ (dr), masked_p,
>>>>  				     vectype, memory_type, offtype, scale,
>>>>  				     &ifn, &offset_vectype))
>>>> -	return false;
>>>> +	ifn = IFN_LAST;
>>>> +      decl = NULL_TREE;
>>>>      }
>>>>    else
>>>>      {
>>>> @@ -4050,10 +4069,6 @@ vect_check_gather_scatter (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>>>>        if (targetm.vectorize.builtin_scatter)
>>>>  	decl = targetm.vectorize.builtin_scatter (vectype, offtype, scale);
>>>> -
>>>> -      if (!decl)
>>>> -	return false;
>>>> -
>>>>        ifn = IFN_LAST;
>>>>        /* The offset vector type will be read from DECL when needed.  */
>>>>        offset_vectype = NULL_TREE;
>>>> diff --git a/gcc/tree-vect-patterns.c b/gcc/tree-vect-patterns.c
>>>> index 743fd3f5414..25de97bd9b0 100644
>>>> --- a/gcc/tree-vect-patterns.c
>>>> +++ b/gcc/tree-vect-patterns.c
>>>> @@ -4811,7 +4811,7 @@ vect_recog_gather_scatter_pattern (vec_info *vinfo,
>>>>       function for the gather/scatter operation.  */
>>>>    gather_scatter_info gs_info;
>>>>    if (!vect_check_gather_scatter (stmt_info, loop_vinfo, &gs_info)
>>>> -      || gs_info.decl)
>>>> +      || gs_info.ifn == IFN_LAST)
>>>>      return NULL;
>>>>  
>>>>    /* Convert the mask to the right form.  */
>>>> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>>>> index 05085e1b110..9d51b476db6 100644
>>>> --- a/gcc/tree-vect-stmts.c
>>>> +++ b/gcc/tree-vect-stmts.c
>>>> @@ -1084,6 +1084,7 @@ static void
>>>>  vect_model_load_cost (vec_info *vinfo,
>>>>  		      stmt_vec_info stmt_info, unsigned ncopies, poly_uint64 vf,
>>>>  		      vect_memory_access_type memory_access_type,
>>>> +		      gather_scatter_info *gs_info,
>>>>  		      slp_tree slp_node,
>>>>  		      stmt_vector_for_cost *cost_vec)
>>>>  {
>>>> @@ -1172,9 +1173,17 @@ vect_model_load_cost (vec_info *vinfo,
>>>>    if (memory_access_type == VMAT_ELEMENTWISE
>>>>        || memory_access_type == VMAT_GATHER_SCATTER)
>>>>      {
>>>> -      /* N scalar loads plus gathering them into a vector.  */
>>>>        tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>>>        unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
>>>> +      if (memory_access_type == VMAT_GATHER_SCATTER
>>>> +	  && gs_info->ifn == IFN_LAST && !gs_info->decl)
>>>> +	/* For emulated gathers N offset vector element extracts
>>>> +	   (we assume the scalar scaling and ptr + offset add is consumed by
>>>> +	   the load).  */
>>>> +	inside_cost += record_stmt_cost (cost_vec, ncopies * assumed_nunits,
>>>> +					 vec_to_scalar, stmt_info, 0,
>>>> +					 vect_body);
>>>> +      /* N scalar loads plus gathering them into a vector.  */
>>>>        inside_cost += record_stmt_cost (cost_vec,
>>>>  				       ncopies * assumed_nunits,
>>>>  				       scalar_load, stmt_info, 0, vect_body);
>>>> @@ -1184,7 +1193,9 @@ vect_model_load_cost (vec_info *vinfo,
>>>>  				      &inside_cost, &prologue_cost,
>>>>  				      cost_vec, cost_vec, true);
>>>>    if (memory_access_type == VMAT_ELEMENTWISE
>>>> -      || memory_access_type == VMAT_STRIDED_SLP)
>>>> +      || memory_access_type == VMAT_STRIDED_SLP
>>>> +      || (memory_access_type == VMAT_GATHER_SCATTER
>>>> +	  && gs_info->ifn == IFN_LAST && !gs_info->decl))
>>>>      inside_cost += record_stmt_cost (cost_vec, ncopies, vec_construct,
>>>>  				     stmt_info, 0, vect_body);
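To make the new accounting concrete: per copy an emulated gather is
costed as assumed_nunits vec_to_scalar extracts plus assumed_nunits
scalar loads plus one vec_construct, e.g. 2 + 2 + 1 statements for
V2DF.  A rough model with made-up weight parameters standing in for
the target's record_stmt_cost values:

  /* Illustrative-only model of the emulated-gather load cost above.  */
  static unsigned
  emulated_gather_load_cost (unsigned ncopies, unsigned assumed_nunits,
                             unsigned w_vec_to_scalar,
                             unsigned w_scalar_load,
                             unsigned w_vec_construct)
  {
    unsigned inside_cost = 0;
    inside_cost += ncopies * assumed_nunits * w_vec_to_scalar; /* extracts */
    inside_cost += ncopies * assumed_nunits * w_scalar_load;   /* loads */
    inside_cost += ncopies * w_vec_construct;  /* build the vector */
    return inside_cost;
  }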
"gather" : "scatter"); >>>> return false; >>>> } >>>> + else if (gs_info->ifn == IFN_LAST && !gs_info->decl) >>>> + { >>>> + if (vls_type != VLS_LOAD) >>>> + { >>>> + if (dump_enabled_p ()) >>>> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >>>> + "unsupported emulated scatter.\n"); >>>> + return false; >>>> + } >>>> + else if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant () >>>> + || !known_eq (TYPE_VECTOR_SUBPARTS (vectype), >>>> + TYPE_VECTOR_SUBPARTS >>>> + (gs_info->offset_vectype))) >>>> + { >>>> + if (dump_enabled_p ()) >>>> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >>>> + "unsupported vector types for emulated " >>>> + "gather.\n"); >>>> + return false; >>>> + } >>>> + } >>>> /* Gather-scatter accesses perform only component accesses, alignment >>>> is irrelevant for them. */ >>>> *alignment_support_scheme = dr_unaligned_supported; >>>> @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo, >>>> "unsupported access type for masked load.\n"); >>>> return false; >>>> } >>>> + else if (memory_access_type == VMAT_GATHER_SCATTER >>>> + && gs_info.ifn == IFN_LAST >>>> + && !gs_info.decl) >>>> + { >>>> + if (dump_enabled_p ()) >>>> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >>>> + "unsupported masked emulated gather.\n"); >>>> + return false; >>>> + } >>>> } >>>> >>>> if (!vec_stmt) /* transformation not required. */ >>>> @@ -8725,7 +8767,7 @@ vectorizable_load (vec_info *vinfo, >>>> >>>> STMT_VINFO_TYPE (orig_stmt_info) = load_vec_info_type; >>>> vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type, >>>> - slp_node, cost_vec); >>>> + &gs_info, slp_node, cost_vec); >>>> return true; >>>> } >>>> >>>> @@ -9438,7 +9480,8 @@ vectorizable_load (vec_info *vinfo, >>>> unsigned int misalign; >>>> unsigned HOST_WIDE_INT align; >>>> >>>> - if (memory_access_type == VMAT_GATHER_SCATTER) >>>> + if (memory_access_type == VMAT_GATHER_SCATTER >>>> + && gs_info.ifn != IFN_LAST) >>>> { >>>> tree zero = build_zero_cst (vectype); >>>> tree scale = size_int (gs_info.scale); >>>> @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo, >>>> data_ref = NULL_TREE; >>>> break; >>>> } >>>> + else if (memory_access_type == VMAT_GATHER_SCATTER) >>>> + { >>>> + /* Emulated gather-scatter. 
>>>> @@ -8692,6 +8725,15 @@ vectorizable_load (vec_info *vinfo,
>>>>  			     "unsupported access type for masked load.\n");
>>>>  	  return false;
>>>>  	}
>>>> +      else if (memory_access_type == VMAT_GATHER_SCATTER
>>>> +	       && gs_info.ifn == IFN_LAST
>>>> +	       && !gs_info.decl)
>>>> +	{
>>>> +	  if (dump_enabled_p ())
>>>> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>> +			     "unsupported masked emulated gather.\n");
>>>> +	  return false;
>>>> +	}
>>>>      }
>>>>  
>>>>    if (!vec_stmt) /* transformation not required.  */
>>>> @@ -8725,7 +8767,7 @@ vectorizable_load (vec_info *vinfo,
>>>>  
>>>>        STMT_VINFO_TYPE (orig_stmt_info) = load_vec_info_type;
>>>>        vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type,
>>>> -			    slp_node, cost_vec);
>>>> +			    &gs_info, slp_node, cost_vec);
>>>>        return true;
>>>>      }
>>>>  
>>>> @@ -9438,7 +9480,8 @@ vectorizable_load (vec_info *vinfo,
>>>>  	    unsigned int misalign;
>>>>  	    unsigned HOST_WIDE_INT align;
>>>>  
>>>> -	    if (memory_access_type == VMAT_GATHER_SCATTER)
>>>> +	    if (memory_access_type == VMAT_GATHER_SCATTER
>>>> +		&& gs_info.ifn != IFN_LAST)
>>>>  	      {
>>>>  		tree zero = build_zero_cst (vectype);
>>>>  		tree scale = size_int (gs_info.scale);
>>>> @@ -9456,6 +9499,51 @@ vectorizable_load (vec_info *vinfo,
>>>>  		data_ref = NULL_TREE;
>>>>  		break;
>>>>  	      }
>>>> +	    else if (memory_access_type == VMAT_GATHER_SCATTER)
>>>> +	      {
>>>> +		/* Emulated gather-scatter.  */
>>>> +		gcc_assert (!final_mask);
>>>> +		unsigned HOST_WIDE_INT const_nunits
>>>> +		  = nunits.to_constant ();
>>>> +		vec<constructor_elt, va_gc> *ctor_elts;
>>>> +		vec_alloc (ctor_elts, const_nunits);
>>>> +		gimple_seq stmts = NULL;
>>>> +		tree idx_type = TREE_TYPE (TREE_TYPE (vec_offset));
>>>> +		tree scale = size_int (gs_info.scale);
>>>> +		align
>>>> +		  = get_object_alignment (DR_REF (first_dr_info->dr));
>>>> +		tree ltype = build_aligned_type (TREE_TYPE (vectype),
>>>> +						 align);
>>>> +		for (unsigned k = 0; k < const_nunits; ++k)
>>>> +		  {
>>>> +		    tree boff = size_binop (MULT_EXPR,
>>>> +					    TYPE_SIZE (idx_type),
>>>> +					    bitsize_int (k));
>>>> +		    tree idx = gimple_build (&stmts, BIT_FIELD_REF,
>>>> +					     idx_type, vec_offset,
>>>> +					     TYPE_SIZE (idx_type),
>>>> +					     boff);
>>>> +		    idx = gimple_convert (&stmts, sizetype, idx);
>>>> +		    idx = gimple_build (&stmts, MULT_EXPR,
>>>> +					sizetype, idx, scale);
>>>> +		    tree ptr = gimple_build (&stmts, PLUS_EXPR,
>>>> +					     TREE_TYPE (dataref_ptr),
>>>> +					     dataref_ptr, idx);
>>>> +		    ptr = gimple_convert (&stmts, ptr_type_node, ptr);
>>>> +		    tree elt = make_ssa_name (TREE_TYPE (vectype));
>>>> +		    tree ref = build2 (MEM_REF, ltype, ptr,
>>>> +				       build_int_cst (ref_type, 0));
>>>> +		    new_stmt = gimple_build_assign (elt, ref);
>>>> +		    gimple_seq_add_stmt (&stmts, new_stmt);
>>>> +		    CONSTRUCTOR_APPEND_ELT (ctor_elts, NULL_TREE, elt);
>>>> +		  }
>>>> +		gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
>>>> +		new_stmt = gimple_build_assign (NULL_TREE,
>>>> +						build_constructor
>>>> +						  (vectype, ctor_elts));
>>>> +		data_ref = NULL_TREE;
>>>> +		break;
>>>> +	      }
>>>>  
>>>>  	    align =
>>>>  	      known_alignment (DR_TARGET_ALIGNMENT (first_dr_info));
>>>>  
>>> 
>> 
> 
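P.S. A rough C model of the per-vector code the emulated path above
emits (my own sketch: two lanes and scale 8 assumed, GNU C vector
extensions for illustration, not actual generated GIMPLE).  Per lane
it extracts the offset (BIT_FIELD_REF), widens it to sizetype,
multiplies by gs_info.scale, adds it to the data pointer and loads a
scalar (MEM_REF); finally the lanes feed a CONSTRUCTOR:

  #include <stdint.h>

  typedef double  v2df __attribute__ ((vector_size (16)));
  typedef int64_t v2di __attribute__ ((vector_size (16)));

  static v2df
  emulated_gather (const char *dataref_ptr, v2di vec_offset)
  {
    double elt[2];
    for (unsigned k = 0; k < 2; ++k)
      {
        size_t idx = (size_t) vec_offset[k];  /* BIT_FIELD_REF + convert */
        idx *= 8;                             /* MULT_EXPR by scale */
        elt[k] = *(const double *) (dataref_ptr + idx);  /* scalar load */
      }
    return (v2df) { elt[0], elt[1] };         /* CONSTRUCTOR */
  }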