On Mon, Jun 24, 2019 at 9:16 AM H.J. Lu wrote:
>
> On Mon, Jun 24, 2019 at 6:37 AM Richard Biener wrote:
> >
> > On Thu, 20 Jun 2019, Jan Hubicka wrote:
> >
> > > > > Currently, costs of moves are also used for costs of RTL
> > > > > expressions.  This patch:
> > > > >
> > > > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > > > >
> > > > > includes:
> > > > >
> > > > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > > > index e943d13..8409a5f 100644
> > > > > --- a/gcc/config/i386/x86-tune-costs.h
> > > > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > > > >    {4, 4, 4},            /* cost of loading integer registers
> > > > >                             in QImode, HImode and SImode.
> > > > >                             Relative to reg-reg move (2).  */
> > > > > -  {6, 6, 6},            /* cost of storing integer registers */
> > > > > +  {6, 6, 3},            /* cost of storing integer registers */
> > > > >    2,                    /* cost of reg,reg fld/fst */
> > > > >    {6, 6, 8},            /* cost of loading fp registers
> > > > >                             in SFmode, DFmode and XFmode */
> > >
> > > Well, it seems that the patch was fixing things in the wrong spot - the
> > > tables are intended to be mostly latency based.  I think we ought to
> > > document divergences from them, including the benchmarks where a change
> > > helped.  Otherwise it is very hard to figure out why an entry does not
> > > match reality.
> > >
> > > > > It lowered the cost of the SImode store and made it cheaper than an
> > > > > SSE<->integer register move.  It caused a regression:
> > > > >
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > > > >
> > > > > Since the cost of the SImode store is also used to compute
> > > > > scalar_store in ix86_builtin_vectorization_cost, it changed the loop
> > > > > costs in
> > > > >
> > > > > void
> > > > > foo (long p2, long *diag, long d, long i)
> > > > > {
> > > > >   long k;
> > > > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > > > >   while (i < k)
> > > > >     diag[i++] = d;
> > > > > }
> > > > >
> > > > > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > > > instead of 3.
> > > > >
> > > > > My patch separates the costs of moves from the costs of RTL
> > > > > expressions.  We have a follow-up patch which restores the cost of
> > > > > the SImode store back to 6 and leaves the cost of scalar_store
> > > > > unchanged.  It keeps loop unrolling unchanged and improves powf
> > > > > performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > > > data now.
> > >
> > > I have seen the problem with scalar_store with AMD tuning as well.
> > > It seems to make the SLP vectorizer happy about the idea of turning a
> > > sequence of, say, integer stores into code which moves all the values
> > > into an AVX register and then does one vector store.
> > >
> > > The cost comparison is basically N scalar stores versus 1 vector store
> > > plus the vector construction, where the vector construction is costed
> > > as N*sse_op + addss.
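For concreteness, that comparison reduces to something like the following
standalone sketch; every unit cost here is a made-up illustrative value,
not a real skylake_cost entry:

    /* Standalone sketch of the SLP store-vectorization costing described
       above.  All unit costs are illustrative assumptions.  */
    #include <stdio.h>

    int
    main (void)
    {
      int n = 8;             /* number of scalar stores being replaced */
      int scalar_store = 3;  /* 6 before the patch quoted above */
      int vector_store = 6;  /* assumed cost of one vector store */
      int sse_op = 1;        /* assumed cost of a simple SSE insert */
      int addss = 4;         /* assumed cost charged for vinserti128 */

      int scalar_cost = n * scalar_store;
      /* 1 vector store + vec_construct (= n * sse_op + addss).  */
      int vector_cost = vector_store + n * sse_op + addss;

      printf ("scalar: %d, vector: %d -> %s\n", scalar_cost, vector_cost,
              vector_cost < scalar_cost ? "SLP-vectorize" : "keep scalar");
      return 0;
    }

Note that the same SImode store entry feeds scalar_store, which in turn
enters the vectorizer's loop-cost model - that is how the {6, 6, 3} change
altered the unroll factor of foo above.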
> > >
> > > With testcase:
> > >
> > > short array[8];
> > > void
> > > test (short a, short b, short c, short d,
> > >       short e, short f, short g, short h)
> > > {
> > >   array[0] = a;
> > >   array[1] = b;
> > >   array[2] = c;
> > >   array[3] = d;
> > >   array[4] = e;
> > >   array[5] = f;
> > >   array[6] = g;
> > >   array[7] = h;
> > > }
> > >
> > > int iarray[8];
> > > void
> > > test2 (int a, int b, int c, int d,
> > >        int e, int f, int g, int h)
> > > {
> > >   iarray[0] = a;
> > >   iarray[1] = b;
> > >   iarray[2] = c;
> > >   iarray[3] = d;
> > >   iarray[4] = e;
> > >   iarray[5] = f;
> > >   iarray[6] = g;
> > >   iarray[7] = h;
> > > }
> > >
> > > I get the following codegen:
> > >
> > > test:
> > >         vmovd   %edi, %xmm0
> > >         vmovd   %edx, %xmm2
> > >         vmovd   %r8d, %xmm1
> > >         vmovd   8(%rsp), %xmm3
> > >         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
> > >         vpinsrw $1, %esi, %xmm0, %xmm0
> > >         vpinsrw $1, %ecx, %xmm2, %xmm2
> > >         vpinsrw $1, %r9d, %xmm1, %xmm1
> > >         vpunpckldq      %xmm2, %xmm0, %xmm0
> > >         vpunpckldq      %xmm3, %xmm1, %xmm1
> > >         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> > >         vmovaps %xmm0, array(%rip)
> > >         ret
> > >
> > > test2:
> > >         vmovd   %r8d, %xmm5
> > >         vmovd   %edx, %xmm6
> > >         vmovd   %edi, %xmm7
> > >         vpinsrd $1, %r9d, %xmm5, %xmm1
> > >         vpinsrd $1, %ecx, %xmm6, %xmm3
> > >         vpinsrd $1, %esi, %xmm7, %xmm0
> > >         vpunpcklqdq     %xmm3, %xmm0, %xmm0
> > >         vmovd   16(%rbp), %xmm4
> > >         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
> > >         vpunpcklqdq     %xmm2, %xmm1, %xmm1
> > >         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
> > >         vmovdqu %ymm0, iarray(%rip)
> > >         vzeroupper
> > >         ret
> > >
> > > which is about 20% slower on my Skylake notebook than the
> > > non-SLP-vectorized variant.
> > >
> > > I wonder if the vec_construct costs should be made more realistic.
> > > They are computed as:
> > >
> > >       case vec_construct:
> > >         {
> > >           /* N element inserts into SSE vectors.  */
> > >           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> > >           /* One vinserti128 for combining two SSE vectors
> > >              for AVX256.  */
> > >           if (GET_MODE_BITSIZE (mode) == 256)
> > >             cost += ix86_vec_cost (mode, ix86_cost->addss);
> > >           /* One vinserti64x4 and two vinserti128 for combining SSE
> > >              and AVX256 vectors to AVX512.  */
> > >           else if (GET_MODE_BITSIZE (mode) == 512)
> > >             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
> > >           return cost;
> > >
> > > So it expects 8 simple SSE operations plus one SSE FP arithmetic
> > > operation, while the code above has 8 inter-unit moves plus 3 SSE
> > > integer operations to shuffle things around - not to mention the
> > > increased register pressure.
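The gap Honza points out can be made concrete with a toy model; the unit
costs below are made-up assumptions, and the "observed" counts follow his
tally for the test2 codegen above (8 inter-unit moves, 3 SSE integer
shuffles):

    /* Toy model of the mismatch: what vec_construct charges for an
       8-element AVX256 build versus the instruction mix in the test2
       codegen above.  All unit costs are illustrative assumptions.  */
    #include <stdio.h>

    int
    main (void)
    {
      int nelts = 8;
      int sse_op = 1;      /* simple SSE operation (assumed) */
      int addss = 4;       /* cost charged for vinserti128 (assumed) */
      int inter_unit = 3;  /* GPR<->XMM move, usually costlier (assumed) */

      /* Current model: N inserts + one combine for AVX256.  */
      int modeled = nelts * sse_op + addss;

      /* The test2 tally above: 8 inter-unit moves plus 3 SSE integer
         shuffles (two vpunpcklqdq and one vinserti128).  */
      int observed = nelts * inter_unit + 3 * sse_op;

      printf ("modeled: %d, observed mix: %d\n", modeled, observed);
      return 0;
    }

Any refinement along these lines would need the costing hook to know where
the elements come from, which, as noted below, the current interface barely
exposes.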
> >
> > But aren't the inter-unit moves a red herring?  Your testcase places
> > the sources in integer registers, but usually for the case of
> > vectorization we arrive here from strided loads, for which we could
> > load the first value into an %xmm reg directly and let the later
> > vpinsr instructions take a memory source?
> >
> > Yes, the vec_construct cost isn't the full story in this case, which
> > is why add_stmt special-cases strided loads/stores, adding some
> > pessimization.
> >
> > > I would say that for integer constructs it is a common case that
> > > things need to be moved from the integer unit to SSE.
> >
> > Is it?  For SLP vectorization probably yes.  The costing interface
> > unfortunately is not giving much information here (well, add_stmt
> > has access to the stmt_info ...).
> >
> > > Overall the problem is deeper, since the vectorizer really needs a
> > > better idea of latencies and throughputs to estimate loop times more
> > > realistically.
> >
> > Indeed, but I hardly see how we can handle this in a sensible way: we
> > don't even understand performance corner-cases when we analyze them
> > with this info at hand, and the HW still behaves in unexpected ways :/
> >
> > > One may also want to account for the fact that stores are often not
> > > part of the hot path, so their latency is not too critical - and, on
> > > the other hand, for the fact that vector stores prevent later
> > > partial memory stalls...
>
> Costs of moves are closely related to latency and should only be used
> for the register allocator.  We shouldn't use costs of moves for RTL
> costs.  For the register allocator, register <-> register moves are
> preferred over load and store unless they are slower than
> register -> memory -> register.  For RTL costs, we may want to make
> load and store cheap to improve RTL expansion, but we don't want to
> change load and store costs for the register allocator.  We need to
> separate costs of moves from costs of RTL expressions first.

Here is the updated patch to improve the register allocator and RTL
expressions independently.  Any comments?

Thanks.

--
H.J.
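As a sketch of what such a separation could look like - the layout and
field names below are illustrative assumptions, not the contents of the
actual patch - the move costs consulted by the register allocator would
live apart from the costs consulted by rtx_costs:

    /* Illustrative sketch only: field names and layout are assumptions,
       not the actual patch.  */
    struct processor_costs_sketch
    {
      /* Latency-derived move costs, consulted only by the register
         allocator via TARGET_REGISTER_MOVE_COST and
         TARGET_MEMORY_MOVE_COST.  */
      struct
      {
        int int_load[3];     /* QImode, HImode, SImode loads */
        int int_store[3];    /* QImode, HImode, SImode stores */
        int sse_to_integer;  /* XMM <-> GPR moves */
      } hard_register;

      /* Costs of RTL expressions, consulted only by TARGET_RTX_COSTS;
         these can be tuned (e.g. cheap stores for better expansion)
         without perturbing register allocation.  */
      int int_load[3];
      int int_store[3];
    };

Under a split like this, the follow-up change described above can restore
the register-allocator cost of an SImode store to 6 while the entry feeding
scalar_store stays at 3.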