From: Richard Biener
Date: Wed, 16 Mar 2022 10:54:01 +0100
Subject: Re: [PATCH] [i386] Add extra cost for unsigned_load which may have stall forward issue.
To: liuhongt
Cc: GCC Patches
References: <20220316021934.106345-1-hongtao.liu@intel.com>
In-Reply-To: <20220316021934.106345-1-hongtao.liu@intel.com>

On Wed, Mar 16, 2022 at 3:19 AM liuhongt wrote:
>
> This patch only handle pure-slp for by-value passed parameter which
> has nothing to do with IPA but psABI. For by-reference passed
> parameter IPA is required.
>
> The patch is aggressive in determining STLF failure, any
> unaligned_load for parm_decl passed by stack is thought to have STLF
> stall issue. It could lose some perf where there's no such issue(1
> vector_load vs n scalar_load + CTOR).
>
> According to microbenchmark in PR, cost of STLF failure is generally
> between 8 scalar_loads and 16 scalar loads on most latest Intel/AMD
> processors.
>
> gcc/ChangeLog:
>
>         PR target/101908
>         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
>         (ix86_vector_costs::add_stmt_cost): Add extra cost for
>         unsigned_load which may have store forwarding stall issue.
>         * config/i386/i386.h (processor_costs): Add new member
>         stfs.
>         * config/i386/x86-tune-costs.h (i386_size_cost): Initialize
>         stfs.
>         (i386_cost, i486_cost, pentium_cost, lakemont_cost,
>         pentiumpro_cost, geode_cost, k6_cost, athlon_cost, k8_cost,
>         amdfam10_cost, bdver_cost, znver1_cost, znver2_cost,
>         znver3_cost, skylake_cost, icelake_cost, alderlake_cost,
>         btver1_cost, btver2_cost, pentium4_cost, nocano_cost,
>         atom_cost, slm_cost, tremont_cost, intel_cost, generic_cost,
>         core_cost): Ditto.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr101908-1.c: New test.
>         * gcc.target/i386/pr101908-2.c: New test.
>         * gcc.target/i386/pr101908-3.c: New test.
>         * gcc.target/i386/pr101908-v16hi.c: New test.
>         * gcc.target/i386/pr101908-v16qi.c: New test.
>         * gcc.target/i386/pr101908-v16sf.c: New test.
>         * gcc.target/i386/pr101908-v16si.c: New test.
>         * gcc.target/i386/pr101908-v2df.c: New test.
>         * gcc.target/i386/pr101908-v2di.c: New test.
>         * gcc.target/i386/pr101908-v2hi.c: New test.
>         * gcc.target/i386/pr101908-v2qi.c: New test.
>         * gcc.target/i386/pr101908-v2sf.c: New test.
>         * gcc.target/i386/pr101908-v2si.c: New test.
>         * gcc.target/i386/pr101908-v4df.c: New test.
>         * gcc.target/i386/pr101908-v4di.c: New test.
>         * gcc.target/i386/pr101908-v4hi.c: New test.
>         * gcc.target/i386/pr101908-v4qi.c: New test.
>         * gcc.target/i386/pr101908-v4sf.c: New test.
>         * gcc.target/i386/pr101908-v4si.c: New test.
>         * gcc.target/i386/pr101908-v8df-adl.c: New test.
>         * gcc.target/i386/pr101908-v8df.c: New test.
>         * gcc.target/i386/pr101908-v8di-adl.c: New test.
>         * gcc.target/i386/pr101908-v8di.c: New test.
>         * gcc.target/i386/pr101908-v8hi-adl.c: New test.
>         * gcc.target/i386/pr101908-v8hi.c: New test.
>         * gcc.target/i386/pr101908-v8qi-adl.c: New test.
>         * gcc.target/i386/pr101908-v8qi.c: New test.
>         * gcc.target/i386/pr101908-v8sf-adl.c: New test.
>         * gcc.target/i386/pr101908-v8sf.c: New test.
>         * gcc.target/i386/pr101908-v8si-adl.c: New test.
>         * gcc.target/i386/pr101908-v8si.c: New test.
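
For readers who want the pattern from the commit message spelled out, a minimal C sketch follows. It is an illustration under stated assumptions, not code from the patch: the names struct X, sum_by_value, sum_by_ref and stlf_demo are made up here, and it assumes the x86-64 SysV psABI, under which a 32-byte struct of doubles is expected to be passed in memory rather than in registers. The by-value variant is the shape the patch tries to penalize; the by-reference variant is the shape it leaves alone.

```c
#include <assert.h>

/* Hypothetical 32-byte aggregate.  On x86-64 SysV this is expected to be
   classified MEMORY, so a by-value copy is passed on the stack.  */
struct X { double a[4]; };

/* By value: the caller has just stored the copies of x and y to the
   outgoing argument area.  If the two adjacent scalar loads from each
   argument are merged into one 16-byte vector load, that load may span
   narrower stores that are still in flight, which is the store-forwarding
   stall the commit message talks about.  */
__attribute__((noinline)) void
sum_by_value (struct X x, struct X y, double *out)
{
  out[0] = x.a[0] + y.a[0];
  out[1] = x.a[1] + y.a[1];
}

/* By reference: the callee cannot tell from the ABI alone when the
   pointed-to data was last written, so the patch does not treat this
   case (the commit message says IPA would be required).  */
__attribute__((noinline)) void
sum_by_ref (const struct X *x, const struct X *y, double *out)
{
  out[0] = x->a[0] + y->a[0];
  out[1] = x->a[1] + y->a[1];
}

/* Returns 1 when both variants compute the same sums.  */
int
stlf_demo (void)
{
  struct X x = { { 1.0, 2.0, 3.0, 4.0 } };
  struct X y = { { 10.0, 20.0, 30.0, 40.0 } };
  double v[2], r[2];

  sum_by_value (x, y, v);
  sum_by_ref (&x, &y, r);
  assert (v[0] == 11.0 && v[1] == 22.0);
  return v[0] == r[0] && v[1] == r[1];
}
```

Both functions compute the same result; only the way the arguments reach the callee differs. Compiling them with -O2 -fdump-tree-slp-details, as the new tests do, is one way to check whether the by-value loads were turned into a vector load.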
> ---
>  gcc/config/i386/i386.cc                       | 51 +++++++++++
>  gcc/config/i386/i386.h                        |  1 +
>  gcc/config/i386/x86-tune-costs.h              | 28 ++++++
>  gcc/testsuite/gcc.target/i386/pr101908-1.c    | 12 +++
>  gcc/testsuite/gcc.target/i386/pr101908-2.c    | 12 +++
>  gcc/testsuite/gcc.target/i386/pr101908-3.c    | 90 +++++++++++++++++++
>  .../gcc.target/i386/pr101908-v16hi.c          |  6 ++
>  .../gcc.target/i386/pr101908-v16qi.c          | 30 +++++++
>  .../gcc.target/i386/pr101908-v16sf.c          |  6 ++
>  .../gcc.target/i386/pr101908-v16si.c          |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2df.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2di.c |  7 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2hi.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2qi.c | 16 ++++
>  gcc/testsuite/gcc.target/i386/pr101908-v2sf.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2si.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4df.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4di.c |  7 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4hi.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4qi.c | 18 ++++
>  gcc/testsuite/gcc.target/i386/pr101908-v4sf.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4si.c |  6 ++
>  .../gcc.target/i386/pr101908-v8df-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8df.c |  6 ++
>  .../gcc.target/i386/pr101908-v8di-adl.c       |  7 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8di.c |  7 ++
>  .../gcc.target/i386/pr101908-v8hi-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8hi.c |  6 ++
>  .../gcc.target/i386/pr101908-v8qi-adl.c       | 22 +++++
>  gcc/testsuite/gcc.target/i386/pr101908-v8qi.c | 22 +++++
>  .../gcc.target/i386/pr101908-v8sf-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8sf.c |  6 ++
>  .../gcc.target/i386/pr101908-v8si-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8si.c |  6 ++
>  34 files changed, 444 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16si.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2df.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2di.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2si.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4df.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4di.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4si.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d77ad83e437..c01809cc3da 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22988,6 +22988,46 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
>    return default_noce_conversion_profitable_p (seq, if_info);
>  }
>
> +/* Return true if REF may have STF issue, otherwise false.
> +   Any unaligned_load from parm_decl which is passed by stack
> +   is considered to have STLF stall issue.  */
> +static bool
> +ix86_load_maybe_stfs_p (data_reference* dr)
> +{
> +  tree addr = DR_BASE_ADDRESS (dr);
> +  if (TREE_CODE (addr) != ADDR_EXPR)
> +    return false;
> +  addr = get_base_address (TREE_OPERAND (addr, 0));
> +
> +  if (TREE_CODE (addr) != PARM_DECL)
> +    return false;
> +  tree type = TREE_TYPE (addr);
> +  if (!type)

type should never be NULL

> +    return false;
> +
> +  machine_mode mode = TYPE_MODE (type);
> +
> +  /* There could be false positive in determine parameter passed by stack.
> +     .i.e. parameter can be put in registers but finally passed by stack
> +     because registers are ran out.  */
> +  if (TARGET_64BIT)
> +    {
> +      /* From function_arg_64.  */
> +      enum x86_64_reg_class regclass[MAX_CLASSES];
> +      int zero_width_bitfields = 0;
> +      return !classify_argument (mode, type, regclass, 0, zero_width_bitfields);
> +    }
> +  else
> +    {
> +      /* From function_arg_32.  */
> +      return (mode == E_BLKmode
> +              || (AGGREGATE_TYPE_P (type)
> +                  && (VECTOR_MODE_P (mode) || mode == TImode)));
> +    }
> +
> +  return false;

that stmt is unreachable.

> +}
> +
>  /* x86-specific vector costs.  */
>  class ix86_vector_costs : public vector_costs
>  {
> @@ -23218,6 +23258,17 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> +     Performance may lose when there's no STF issue(1 vector_load vs n
> +     scalar_load + CTOR).
> +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine

cost(2000) is no longer there

> +     tuned.  */
> +  if (kind == unaligned_load && stmt_info
> +      && stmt_info->slp_type == pure_slp

You want to restrict this to BB vectorization?  pure_slp isn't exactly
that, instead you can do

   && is_a (m_vinfo)

> +      && STMT_VINFO_DATA_REF (stmt_info)
> +      && ix86_load_maybe_stfs_p (STMT_VINFO_DATA_REF (stmt_info)))
> +    stmt_cost += COSTS_N_INSNS (ix86_cost->stfs / 2);

I wonder why we divide stfs by two?

I'd suggest an additional check, that the DR is close to function start.
One possible check that occurs to me is to check

   STMT_VINFO_DR_INFO (stmt_info)->group == 0

that will for example avoid the penalty for

struct Y y;
void foo (struct X x)
{
   bar();
   y.a = x.a;
   y.b = x.b;
}

but also (maybe not wanted) when the access happens after control flow
transfer like with

struct Y y;
void foo (struct X x, int flag)
{
   if (flag)
     {
       y.a = x.a;
       y.b = x.b;
     }
}

I think we should be conservative with what we pessimize until we have
evidence that we need to include more cases, also since this
after-the-fact handling of the issue in costing is sub-optimal.  Ideally
the vectorizer itself would decide to vectorize the load in a way to
avoid STLF fails, but that's nothing we can easily arrange for at this
stage.

Another option could be to split such loads during md-reorg where we
could somehow "count" the latency from function entry, only scanning
paths from there up to a point where the store buffer is likely not
drained (with a different target cost parameter?) and only scanning
not optimize_for_size BBs.  That might be a better place to do
after-the-fact adjustments (the cost adjustment won't avoid the STLF
fail if the rest of the vectorization compensates the penalty).

Richard.

> +
>    /* Penalize DFmode vector operations for Bonnell.
*/ > if (TARGET_CPU_P (BONNELL) && kind == vector_stmt > && vectype && GET_MODE_INNER (TYPE_MODE (vectype)) == DFmode) > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > index 0d28e57f8f2..341f1c47981 100644 > --- a/gcc/config/i386/i386.h > +++ b/gcc/config/i386/i386.h > @@ -168,6 +168,7 @@ struct processor_costs { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > const int sse_unaligned_load[5];/* cost of unaligned load. */ > const int sse_unaligned_store[5];/* cost of unaligned store. */ > + const int stfs; /* cost of store forward stalls. */ > const int xmm_move, ymm_move, /* cost of moving XMM and YMM register. */ > zmm_move; > const int sse_to_integer; /* cost of moving SSE register to integer. */ > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h > index 017ffa69958..3a5fcdeefdd 100644 > --- a/gcc/config/i386/x86-tune-costs.h > +++ b/gcc/config/i386/x86-tune-costs.h > @@ -100,6 +100,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */ > in 128bit, 256bit and 512bit */ > {3, 3, 3, 3, 3}, /* cost of unaligned SSE store > in 128bit, 256bit and 512bit */ > + 6, /* cost of store forward stall. */ > 3, 3, 3, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 5, 0, /* Gather load static, per_elt. */ > @@ -209,6 +210,7 @@ struct processor_costs i386_cost = { /* 386 specific costs */ > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 8, 16, 32, 64}, /* cost of unaligned loads. */ > {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ > + 8, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -317,6 +319,7 @@ struct processor_costs i486_cost = { /* 486 specific costs */ > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 8, 16, 32, 64}, /* cost of unaligned loads. */ > {4, 8, 16, 32, 64}, /* cost of unaligned stores. 
*/ > + 8, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -427,6 +430,7 @@ struct processor_costs pentium_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 8, 16, 32, 64}, /* cost of unaligned loads. */ > {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ > + 8, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -528,6 +532,7 @@ struct processor_costs lakemont_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 8, 16, 32, 64}, /* cost of unaligned loads. */ > {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ > + 8, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -644,6 +649,7 @@ struct processor_costs pentiumpro_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 8, 16, 32, 64}, /* cost of unaligned loads. */ > {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ > + 24, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -751,6 +757,7 @@ struct processor_costs geode_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {2, 2, 8, 16, 32}, /* cost of unaligned loads. */ > {2, 2, 8, 16, 32}, /* cost of unaligned stores. */ > + 14, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 2, 2, /* Gather load static, per_elt. */ > @@ -858,6 +865,7 @@ struct processor_costs k6_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {2, 2, 8, 16, 32}, /* cost of unaligned loads. 
*/ > {2, 2, 8, 16, 32}, /* cost of unaligned stores. */ > + 24, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 2, 2, /* Gather load static, per_elt. */ > @@ -971,6 +979,7 @@ struct processor_costs athlon_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 4, 12, 12, 24}, /* cost of unaligned loads. */ > {4, 4, 10, 10, 20}, /* cost of unaligned stores. */ > + 14, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 5, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -1086,6 +1095,7 @@ struct processor_costs k8_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 3, 12, 12, 24}, /* cost of unaligned loads. */ > {4, 4, 10, 10, 20}, /* cost of unaligned stores. */ > + 14, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 5, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -1214,6 +1224,7 @@ struct processor_costs amdfam10_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {4, 4, 3, 7, 12}, /* cost of unaligned loads. */ > {4, 4, 5, 10, 20}, /* cost of unaligned stores. */ > + 21, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 3, /* cost of moving SSE register to integer. */ > 4, 4, /* Gather load static, per_elt. */ > @@ -1334,6 +1345,7 @@ const struct processor_costs bdver_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {12, 12, 10, 40, 60}, /* cost of unaligned loads. */ > {10, 10, 10, 40, 60}, /* cost of unaligned stores. */ > + 54, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 16, /* cost of moving SSE register to integer. */ > 12, 12, /* Gather load static, per_elt. 
*/ > @@ -1475,6 +1487,7 @@ struct processor_costs znver1_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 12, 24}, /* cost of unaligned loads. */ > {8, 8, 8, 16, 32}, /* cost of unaligned stores. */ > + 42, /* cost of store forward stall. */ > 2, 3, 6, /* cost of moving XMM,YMM,ZMM register. */ > 6, /* cost of moving SSE register to integer. */ > /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops, > @@ -1630,6 +1643,7 @@ struct processor_costs znver2_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 6, 12}, /* cost of unaligned loads. */ > {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ > + 42, /* cost of store forward stall. */ > 2, 2, 3, /* cost of moving XMM,YMM,ZMM > register. */ > 6, /* cost of moving SSE register to integer. */ > @@ -1762,6 +1776,7 @@ struct processor_costs znver3_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 6, 12}, /* cost of unaligned loads. */ > {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ > + 42, /* cost of store forward stall. */ > 2, 2, 3, /* cost of moving XMM,YMM,ZMM > register. */ > 6, /* cost of moving SSE register to integer. */ > @@ -1907,6 +1922,7 @@ struct processor_costs skylake_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 10, 20}, /* cost of unaligned loads. */ > {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ > + 26, /* cost of store forward stall. */ > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 20, 8, /* Gather load static, per_elt. */ > @@ -2033,6 +2049,7 @@ struct processor_costs icelake_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 10, 20}, /* cost of unaligned loads. */ > {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ > + 26, /* cost of store forward stall. */ > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 20, 8, /* Gather load static, per_elt. 
*/ > @@ -2153,6 +2170,7 @@ struct processor_costs alderlake_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 10, 15}, /* cost of unaligned loads. */ > {6, 6, 6, 10, 15}, /* cost of unaligned storess. */ > + 90, /* cost of store forward stall. */ > 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 18, 6, /* Gather load static, per_elt. */ > @@ -2266,6 +2284,7 @@ const struct processor_costs btver1_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {10, 10, 12, 48, 96}, /* cost of unaligned loads. */ > {10, 10, 12, 48, 96}, /* cost of unaligned stores. */ > + 36, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 14, /* cost of moving SSE register to integer. */ > 10, 10, /* Gather load static, per_elt. */ > @@ -2376,6 +2395,7 @@ const struct processor_costs btver2_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {10, 10, 12, 48, 96}, /* cost of unaligned loads. */ > {10, 10, 12, 48, 96}, /* cost of unaligned stores. */ > + 36, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 14, /* cost of moving SSE register to integer. */ > 10, 10, /* Gather load static, per_elt. */ > @@ -2485,6 +2505,7 @@ struct processor_costs pentium4_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {32, 32, 32, 64, 128}, /* cost of unaligned loads. */ > {32, 32, 32, 64, 128}, /* cost of unaligned stores. */ > + 10, /* cost of store forward stall. */ > 12, 24, 48, /* cost of moving XMM,YMM,ZMM register */ > 20, /* cost of moving SSE register to integer. */ > 16, 16, /* Gather load static, per_elt. */ > @@ -2597,6 +2618,7 @@ struct processor_costs nocona_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {24, 24, 24, 48, 96}, /* cost of unaligned loads. */ > {24, 24, 24, 48, 96}, /* cost of unaligned stores. */ > + 8, /* cost of store forward stall. 
*/ > 6, 12, 24, /* cost of moving XMM,YMM,ZMM register */ > 20, /* cost of moving SSE register to integer. */ > 12, 12, /* Gather load static, per_elt. */ > @@ -2707,6 +2729,7 @@ struct processor_costs atom_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {16, 16, 16, 32, 64}, /* cost of unaligned loads. */ > {16, 16, 16, 32, 64}, /* cost of unaligned stores. */ > + 32, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 8, /* cost of moving SSE register to integer. */ > 8, 8, /* Gather load static, per_elt. */ > @@ -2817,6 +2840,7 @@ struct processor_costs slm_cost = { > in SImode, DImode and TImode. */ > {16, 16, 16, 32, 64}, /* cost of unaligned loads. */ > {16, 16, 16, 32, 64}, /* cost of unaligned stores. */ > + 48, /* cost of store forward stall. */ > 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ > 8, /* cost of moving SSE register to integer. */ > 8, 8, /* Gather load static, per_elt. */ > @@ -2939,6 +2963,7 @@ struct processor_costs tremont_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 10, 15}, /* cost of unaligned loads. */ > {6, 6, 6, 10, 15}, /* cost of unaligned storess. */ > + 42, /* cost of store forward stall. */ > 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 18, 6, /* Gather load static, per_elt. */ > @@ -3051,6 +3076,7 @@ struct processor_costs intel_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {10, 10, 10, 10, 10}, /* cost of unaligned loads. */ > {10, 10, 10, 10, 10}, /* cost of unaligned loads. */ > + 22, /* cost of store forward stall. */ > 2, 2, 2, /* cost of moving XMM,YMM,ZMM register */ > 4, /* cost of moving SSE register to integer. */ > 6, 6, /* Gather load static, per_elt. */ > @@ -3168,6 +3194,7 @@ struct processor_costs generic_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 10, 15}, /* cost of unaligned loads. */ > {6, 6, 6, 10, 15}, /* cost of unaligned storess. 
*/ > + 54, /* cost of store forward stall. */ > 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */ > 6, /* cost of moving SSE register to integer. */ > 18, 6, /* Gather load static, per_elt. */ > @@ -3291,6 +3318,7 @@ struct processor_costs core_cost = { > in 32bit, 64bit, 128bit, 256bit and 512bit */ > {6, 6, 6, 6, 12}, /* cost of unaligned loads. */ > {6, 6, 6, 6, 12}, /* cost of unaligned stores. */ > + 26, /* cost of store forward stall. */ > 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ > 2, /* cost of moving SSE register to integer. */ > /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops, > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c > new file mode 100644 > index 00000000000..f8e0f2e26bb > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \} "slp2" } } */ > + > +struct X { double x[2]; }; > +typedef double v2df __attribute__((vector_size(16))); > + > +v2df __attribute__((noipa)) > +foo (struct X* x, struct X* y) > +{ > + return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] }; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c > new file mode 100644 > index 00000000000..f4ff7a83c82 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */ > + > +struct X { double x[4]; }; > +typedef double v2df __attribute__((vector_size(16))); > + > +v2df __attribute__((noipa)) > +foo (struct X x, struct X y) > +{ > + return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] }; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-3.c 
b/gcc/testsuite/gcc.target/i386/pr101908-3.c > new file mode 100644 > index 00000000000..6f853aa7750 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr101908-3.c > @@ -0,0 +1,90 @@ > +/* PR target/101908. */ > +/* { dg-do compile } */ > +/* { dg-options "-march=x86-64 -O2 -mtune=generic -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump-not "add new stmt:.*MEM \.*ray + 24B" "slp2" } } */ > +/* This testcase is used to avoid STLF stall. */ > + > +#define sqrt __builtin_sqrt > +#define SQ(x) ((x) * (x)) > +struct vec3 { > + double x, y, z; > +}; > + > +struct ray { > + struct vec3 orig, dir; > +}; > + > +struct material { > + struct vec3 col; /* color */ > + double spow; /* specular power */ > + double refl; /* reflection intensity */ > +}; > + > +struct sphere { > + struct vec3 pos; > + double rad; > + struct material mat; > + struct sphere *next; > +}; > + > +struct spoint { > + struct vec3 pos, normal, vref; /* position, normal and view reflection */ > + double dist; /* parametric distance of intersection along the ray */ > +}; > + > +#define ERR_MARGIN 1e-6 > + > +#define DOT(a, b) ((a).x * (b).x + (a).y * (b).y + (a).z * (b).z) > +#define NORMALIZE(a) do { \ > + double len = sqrt(DOT(a, a)); \ > + (a).x /= len; (a).y /= len; (a).z /= len; \ > + } while(0); > + > +static struct vec3 > +reflect(struct vec3 v, struct vec3 n) { > + struct vec3 res; > + double dot = v.x * n.x + v.y * n.y + v.z * n.z; > + res.x = -(2.0 * dot * n.x - v.x); > + res.y = -(2.0 * dot * n.y - v.y); > + res.z = -(2.0 * dot * n.z - v.z); > + return res; > +} > + > +int ray_sphere(const struct sphere *sph, > + struct ray ray, struct spoint *sp) { > + double a, b, c, d, sqrt_d, t1, t2; > + > + a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z); > + b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) + > + 2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) + > + 2.0 * ray.dir.z * (ray.orig.z - sph->pos.z); > + c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) + > + SQ(ray.orig.x) + 
SQ(ray.orig.y) + SQ(ray.orig.z) +
> +      2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad);
> +
> +  if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
> +
> +  sqrt_d = sqrt(d);
> +  t1 = (-b + sqrt_d) / (2.0 * a);
> +  t2 = (-b - sqrt_d) / (2.0 * a);
> +
> +  if((t1 < ERR_MARGIN && t2 < ERR_MARGIN) || (t1 > 1.0 && t2 > 1.0)) return 0;
> +
> +  if(sp) {
> +    if(t1 < ERR_MARGIN) t1 = t2;
> +    if(t2 < ERR_MARGIN) t2 = t1;
> +    sp->dist = t1 < t2 ? t1 : t2;
> +
> +    sp->pos.x = ray.orig.x + ray.dir.x * sp->dist;
> +    sp->pos.y = ray.orig.y + ray.dir.y * sp->dist;
> +    sp->pos.z = ray.orig.z + ray.dir.z * sp->dist;
> +
> +    sp->normal.x = (sp->pos.x - sph->pos.x) / sph->rad;
> +    sp->normal.y = (sp->pos.y - sph->pos.y) / sph->rad;
> +    sp->normal.z = (sp->pos.z - sph->pos.z) / sph->rad;
> +
> +    sp->vref = reflect(ray.dir, sp->normal);
> +    NORMALIZE(sp->vref);
> +  }
> +  return 1;
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> new file mode 100644
> index 00000000000..fcd3ee8122f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v16qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> new file mode 100644
> index 00000000000..6d43788600e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[0] = x.a[1] + y.a[1];
> +  p[1] = x.a[2] + y.a[2];
> +  p[2] = x.a[3] + y.a[3];
> +  p[3] = x.a[4] + y.a[4];
> +  p[4] = x.a[5] + y.a[5];
> +  p[5] = x.a[6] + y.a[6];
> +  p[6] = x.a[7] + y.a[7];
> +  p[7] = x.a[8] + y.a[8];
> +  p[8] = x.a[9] + y.a[9];
> +  p[9] = x.a[10] + y.a[10];
> +  p[10] = x.a[11] + y.a[11];
> +  p[11] = x.a[12] + y.a[12];
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> new file mode 100644
> index 00000000000..f95b85abbc6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v16qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16si.c b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> new file mode 100644
> index 00000000000..5c48aa5da69
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v16qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2df.c b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> new file mode 100644
> index 00000000000..9d3f157718c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2di.c b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> new file mode 100644
> index 00000000000..c7cf9a71f21
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> new file mode 100644
> index 00000000000..e6024d70780
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> new file mode 100644
> index 00000000000..cf876cc70d4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> new file mode 100644
> index 00000000000..eb6349b957e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2si.c b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> new file mode 100644
> index 00000000000..ae5fa0749c6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4df.c b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> new file mode 100644
> index 00000000000..94497422704
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4di.c b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> new file mode 100644
> index 00000000000..71407aa9fc7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> new file mode 100644
> index 00000000000..4b207b91225
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> new file mode 100644
> index 00000000000..5292d3442ec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> new file mode 100644
> index 00000000000..a2c6273120d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4si.c b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> new file mode 100644
> index 00000000000..c6824285c74
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> new file mode 100644
> index 00000000000..248c6d0fb91
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> new file mode 100644
> index 00000000000..05eb2dd51d0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> new file mode 100644
> index 00000000000..b0055d7d2c0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> new file mode 100644
> index 00000000000..76a393bcc6c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> new file mode 100644
> index 00000000000..28977adae28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v8qi-adl.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> new file mode 100644
> index 00000000000..89b50885366
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> new file mode 100644
> index 00000000000..be668e5d006
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[8] = x.a[9] + y.a[9];
> +  p[9] = x.a[10] + y.a[10];
> +  p[10] = x.a[11] + y.a[11];
> +  p[11] = x.a[12] + y.a[12];
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> new file mode 100644
> index 00000000000..842c88c8952
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[8] = x.a[9] + y.a[9];
> +  p[9] = x.a[10] + y.a[10];
> +  p[10] = x.a[11] + y.a[11];
> +  p[11] = x.a[12] + y.a[12];
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> new file mode 100644
> index 00000000000..89d33566a40
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> new file mode 100644
> index 00000000000..81557c7b9b7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> new file mode 100644
> index 00000000000..883956a0d49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v8qi-adl.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> new file mode 100644
> index 00000000000..142f46012d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v8qi.c"
> --
> 2.18.1
>