On Mon, Jun 24, 2019 at 9:16 AM H.J. Lu wrote:
>
> On Mon, Jun 24, 2019 at 6:37 AM Richard Biener wrote:
> >
> > On Thu, 20 Jun 2019, Jan Hubicka wrote:
> >
> > > > > Currently, costs of moves are also used for costs of RTL
> > > > > expressions.  This patch:
> > > > >
> > > > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > > > >
> > > > > includes:
> > > > >
> > > > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > > > index e943d13..8409a5f 100644
> > > > > --- a/gcc/config/i386/x86-tune-costs.h
> > > > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > > > >    {4, 4, 4},            /* cost of loading integer registers
> > > > >                             in QImode, HImode and SImode.
> > > > >                             Relative to reg-reg move (2).  */
> > > > > -  {6, 6, 6},            /* cost of storing integer registers */
> > > > > +  {6, 6, 3},            /* cost of storing integer registers */
> > > > >    2,                    /* cost of reg,reg fld/fst */
> > > > >    {6, 6, 8},            /* cost of loading fp registers
> > > > >                             in SFmode, DFmode and XFmode */
> > >
> > > Well, it seems that the patch was fixing things in the wrong spot - the
> > > tables are intended to be mostly latency based.  I think we ought to
> > > document divergences from them, including the benchmarks where a change
> > > helped.  Otherwise it is very hard to figure out why an entry does not
> > > match reality.
> > >
> > > > > It lowered the cost of the SImode store and made it cheaper than an
> > > > > SSE<->integer register move.  It caused a regression:
> > > > >
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > > > >
> > > > > Since the cost of the SImode store is also used to compute
> > > > > scalar_store in ix86_builtin_vectorization_cost, it changed the loop
> > > > > costs in
> > > > >
> > > > > void
> > > > > foo (long p2, long *diag, long d, long i)
> > > > > {
> > > > >   long k;
> > > > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > > > >   while (i < k)
> > > > >     diag[i++] = d;
> > > > > }
> > > > >
> > > > > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > > > instead of 3.
> > > > >
> > > > > My patch separates the costs of moves from the costs of RTL
> > > > > expressions.  We have a follow-up patch which restores the cost of
> > > > > the SImode store back to 6 and leaves the cost of scalar_store
> > > > > unchanged.  It keeps loop unrolling unchanged and improves powf
> > > > > performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > > > data now.
> > >
> > > I have seen the problem with scalar_store with AMD tuning as well.
> > > It seems to make the SLP vectorizer happy about the idea of turning a
> > > sequence of, say, integer stores into code which moves all the values
> > > into an AVX register and then does one vector store.
> > >
> > > The cost comparison is basically N scalar stores versus 1 vector store
> > > plus the vector construction, where the vector construction is costed
> > > as N*sse_op + addss.
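For concreteness, that comparison reduces to something like the following
standalone sketch; every unit cost here is a made-up illustrative value,
not a real skylake_cost entry:

    /* Standalone sketch of the SLP store-vectorization costing described
       above.  All unit costs are illustrative assumptions.  */
    #include <stdio.h>

    int
    main (void)
    {
      int n = 8;             /* number of scalar stores being replaced */
      int scalar_store = 3;  /* 6 before the patch quoted above */
      int vector_store = 6;  /* assumed cost of one vector store */
      int sse_op = 1;        /* assumed cost of a simple SSE insert */
      int addss = 4;         /* assumed cost charged for vinserti128 */

      int scalar_cost = n * scalar_store;
      /* 1 vector store + vec_construct (= n * sse_op + addss).  */
      int vector_cost = vector_store + n * sse_op + addss;

      printf ("scalar: %d, vector: %d -> %s\n", scalar_cost, vector_cost,
              vector_cost < scalar_cost ? "SLP-vectorize" : "keep scalar");
      return 0;
    }

Note that the same SImode store entry feeds scalar_store, which in turn
enters the vectorizer's loop-cost model - that is how the {6, 6, 3} change
altered the unroll factor of foo above.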
> > >
> > > With testcase:
> > >
> > > short array[8];
> > > void
> > > test (short a, short b, short c, short d,
> > >       short e, short f, short g, short h)
> > > {
> > >   array[0] = a;
> > >   array[1] = b;
> > >   array[2] = c;
> > >   array[3] = d;
> > >   array[4] = e;
> > >   array[5] = f;
> > >   array[6] = g;
> > >   array[7] = h;
> > > }
> > >
> > > int iarray[8];
> > > void
> > > test2 (int a, int b, int c, int d,
> > >        int e, int f, int g, int h)
> > > {
> > >   iarray[0] = a;
> > >   iarray[1] = b;
> > >   iarray[2] = c;
> > >   iarray[3] = d;
> > >   iarray[4] = e;
> > >   iarray[5] = f;
> > >   iarray[6] = g;
> > >   iarray[7] = h;
> > > }
> > >
> > > I get the following codegen:
> > >
> > > test:
> > >         vmovd   %edi, %xmm0
> > >         vmovd   %edx, %xmm2
> > >         vmovd   %r8d, %xmm1
> > >         vmovd   8(%rsp), %xmm3
> > >         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
> > >         vpinsrw $1, %esi, %xmm0, %xmm0
> > >         vpinsrw $1, %ecx, %xmm2, %xmm2
> > >         vpinsrw $1, %r9d, %xmm1, %xmm1
> > >         vpunpckldq      %xmm2, %xmm0, %xmm0
> > >         vpunpckldq      %xmm3, %xmm1, %xmm1
> > >         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> > >         vmovaps %xmm0, array(%rip)
> > >         ret
> > >
> > > test2:
> > >         vmovd   %r8d, %xmm5
> > >         vmovd   %edx, %xmm6
> > >         vmovd   %edi, %xmm7
> > >         vpinsrd $1, %r9d, %xmm5, %xmm1
> > >         vpinsrd $1, %ecx, %xmm6, %xmm3
> > >         vpinsrd $1, %esi, %xmm7, %xmm0
> > >         vpunpcklqdq     %xmm3, %xmm0, %xmm0
> > >         vmovd   16(%rbp), %xmm4
> > >         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
> > >         vpunpcklqdq     %xmm2, %xmm1, %xmm1
> > >         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
> > >         vmovdqu %ymm0, iarray(%rip)
> > >         vzeroupper
> > >         ret
> > >
> > > which is about 20% slower on my Skylake notebook than the
> > > non-SLP-vectorized variant.
> > >
> > > I wonder if the vec_construct costs should be made more realistic.
> > > They are computed as:
> > >
> > >       case vec_construct:
> > >         {
> > >           /* N element inserts into SSE vectors.  */
> > >           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > >           /* One vinserti128 for combining two SSE vectors
> > >              for AVX256.  */
> > >           if (GET_MODE_BITSIZE (mode) == 256)
> > >             cost += ix86_vec_cost (mode, ix86_cost->addss);
> > >           /* One vinserti64x4 and two vinserti128 for combining SSE
> > >              and AVX256 vectors to AVX512.  */
> > >           else if (GET_MODE_BITSIZE (mode) == 512)
> > >             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
> > >           return cost;
> > >
> > > So it expects 8 simple SSE operations plus one SSE FP arithmetic
> > > operation, while the code above has 8 inter-unit moves plus 3 SSE
> > > integer operations to shuffle things around - not to mention the
> > > increased register pressure.
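The gap Honza points out can be made concrete with a toy model; the unit
costs below are made-up assumptions, and the "observed" counts follow his
tally for the test2 codegen above (8 inter-unit moves, 3 SSE integer
shuffles):

    /* Toy model of the mismatch: what vec_construct charges for an
       8-element AVX256 build versus the instruction mix in the test2
       codegen above.  All unit costs are illustrative assumptions.  */
    #include <stdio.h>

    int
    main (void)
    {
      int nelts = 8;
      int sse_op = 1;      /* simple SSE operation (assumed) */
      int addss = 4;       /* cost charged for vinserti128 (assumed) */
      int inter_unit = 3;  /* GPR<->XMM move, usually costlier (assumed) */

      /* Current model: N inserts + one combine for AVX256.  */
      int modeled = nelts * sse_op + addss;

      /* The test2 tally above: 8 inter-unit moves plus 3 SSE integer
         shuffles (two vpunpcklqdq and one vinserti128).  */
      int observed = nelts * inter_unit + 3 * sse_op;

      printf ("modeled: %d, observed mix: %d\n", modeled, observed);
      return 0;
    }

Any refinement along these lines would need the costing hook to know where
the elements come from, which, as noted below, the current interface barely
exposes.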
> >
> > But aren't the inter-unit moves a red herring?  Your testcase places
> > the sources in integer registers, but usually for the case of
> > vectorization we arrive here from strided loads, for which we could
> > load the first value into an %xmm reg directly and let the later
> > vpinsr instructions take a memory source?
> >
> > Yes, the vec_construct cost isn't the full story in this case, which
> > is why add_stmt special-cases strided loads/stores, adding some
> > pessimization.
> >
> > > I would say that for integer constructs it is a common case that
> > > things need to be moved from the integer unit to SSE.
> >
> > Is it?  For SLP vectorization probably yes.  The costing interface
> > unfortunately is not giving much information here (well, add_stmt
> > has access to the stmt_info ...).
> >
> > > Overall the problem is deeper, since the vectorizer really needs a
> > > better idea of latencies and throughputs to estimate loop times more
> > > realistically.
> >
> > Indeed, but I hardly see how we can handle this in a sensible way: we
> > don't even understand performance corner-cases when we analyze them
> > with this info at hand, and the HW still behaves in unexpected ways :/
> >
> > > One may also want to account for the fact that stores are often not
> > > part of the hot path, so their latency is not too critical - and, on
> > > the other hand, for the fact that vector stores prevent later
> > > partial memory stalls...
>
> Costs of moves are closely related to latency and should only be used
> for the register allocator.  We shouldn't use costs of moves for RTL
> costs.  For the register allocator, register <-> register moves are
> preferred over load and store unless they are slower than
> register -> memory -> register.  For RTL costs, we may want to make
> load and store cheap to improve RTL expansion, but we don't want to
> change load and store costs for the register allocator.  We need to
> separate costs of moves from costs of RTL expressions first.

Here is the updated patch to improve the register allocator and RTL
expressions independently.  Any comments?

Thanks.

--
H.J.
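As a sketch of what such a separation could look like - the layout and
field names below are illustrative assumptions, not the contents of the
actual patch - the move costs consulted by the register allocator would
live apart from the costs consulted by rtx_costs:

    /* Illustrative sketch only: field names and layout are assumptions,
       not the actual patch.  */
    struct processor_costs_sketch
    {
      /* Latency-derived move costs, consulted only by the register
         allocator via TARGET_REGISTER_MOVE_COST and
         TARGET_MEMORY_MOVE_COST.  */
      struct
      {
        int int_load[3];     /* QImode, HImode, SImode loads */
        int int_store[3];    /* QImode, HImode, SImode stores */
        int sse_to_integer;  /* XMM <-> GPR moves */
      } hard_register;

      /* Costs of RTL expressions, consulted only by TARGET_RTX_COSTS;
         these can be tuned (e.g. cheap stores for better expansion)
         without perturbing register allocation.  */
      int int_load[3];
      int int_store[3];
    };

Under a split like this, the follow-up change described above can restore
the register-allocator cost of an SImode store to 6 while the entry feeding
scalar_store stays at 3.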