public inbox for gcc-patches@gcc.gnu.org
From: "H.J. Lu" <hjl.tools@gmail.com>
To: Jan Hubicka <hubicka@ucw.cz>
Cc: Uros Bizjak <ubizjak@gmail.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
	skpgkp1@gmail.com, 	Jeffrey Law <law@redhat.com>,
	Jakub Jelinek <jakub@redhat.com>,
	Richard Guenther <rguenther@suse.de>
Subject: Re: [PATCH] i386: Separate costs of RTL expressions from costs of moves
Date: Thu, 20 Jun 2019 21:43:00 -0000	[thread overview]
Message-ID: <CAMe9rOoZu=VXh=u-9jVS8imOdepBoPS300=d73O1hPLRDt7Aag@mail.gmail.com> (raw)
In-Reply-To: <20190620211050.whgnitxuudbha6u2@kam.mff.cuni.cz>

On Thu, Jun 20, 2019 at 2:10 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > Currently, costs of moves are also used for costs of RTL expressions.   This
> > > patch:
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html
> > >
> > > includes:
> > >
> > > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > > index e943d13..8409a5f 100644
> > > --- a/gcc/config/i386/x86-tune-costs.h
> > > +++ b/gcc/config/i386/x86-tune-costs.h
> > > @@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
> > >    {4, 4, 4}, /* cost of loading integer registers
> > >      in QImode, HImode and SImode.
> > >      Relative to reg-reg move (2).  */
> > > -  {6, 6, 6}, /* cost of storing integer registers */
> > > +  {6, 6, 3}, /* cost of storing integer registers */
> > >    2, /* cost of reg,reg fld/fst */
> > >    {6, 6, 8}, /* cost of loading fp registers
> > >      in SFmode, DFmode and XFmode */
>
> Well, it seems that the patch was fixing things in the wrong spot - the
> tables are intended to be mostly latency based. I think we ought to
> document divergences from them, including the benchmarks where the change
> helped. Otherwise it is very hard to figure out why an entry does not
> match reality.
> > >
> > > It lowered the cost for SImode store and made it cheaper than SSE<->integer
> > > register move.  It caused a regression:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90878
> > >
> > > Since the cost for SImode store is also used to compute scalar_store
> > > in ix86_builtin_vectorization_cost, it changed loop costs in
> > >
> > > void
> > > foo (long p2, long *diag, long d, long i)
> > > {
> > >   long k;
> > >   k = p2 < 3 ? p2 + p2 : p2 + 3;
> > >   while (i < k)
> > >     diag[i++] = d;
> > > }
> > >
> > > As a result, the loop is unrolled 4 times with -O3 -march=skylake,
> > > instead of 3.
> > >
> > > My patch separates costs of moves from costs of RTL expressions.  We have
> > > a follow-up patch which restores the cost of the SImode store back to 6 and leaves
> > > the cost of scalar_store unchanged.  It keeps loop unrolling unchanged and
> > > improves powf performance in glibc by 30%.  We are collecting SPEC CPU 2017
> > > data now.
>
> I have seen the problem with scalar_store with AMD tuning as well.
> It seems to make the SLP vectorizer happy about the idea of turning a
> sequence of, say, integer stores into code which moves all the values into
> an AVX register and then does one vector store.
>
> The cost model basically compares the cost of N scalar stores to one
> vector store plus vector construction, where vector construction costs
> N * sse_op + addss.
>
> With testcase:
>
> short array[8];
> test (short a,short b,short c,short d,short e,short f,short g,short h)
> {
>   array[0]=a;
>   array[1]=b;
>   array[2]=c;
>   array[3]=d;
>   array[4]=e;
>   array[5]=f;
>   array[6]=g;
>   array[7]=h;
> }
> int iarray[8];
> test2 (int a,int b,int c,int d,int e,int f,int g,int h)
> {
>   iarray[0]=a;
>   iarray[1]=b;
>   iarray[2]=c;
>   iarray[3]=d;
>   iarray[4]=e;
>   iarray[5]=f;
>   iarray[6]=g;
>   iarray[7]=h;
> }
>
> I get the following codegen:
>
>
> test:
>         vmovd   %edi, %xmm0
>         vmovd   %edx, %xmm2
>         vmovd   %r8d, %xmm1
>         vmovd   8(%rsp), %xmm3
>         vpinsrw $1, 16(%rsp), %xmm3, %xmm3
>         vpinsrw $1, %esi, %xmm0, %xmm0
>         vpinsrw $1, %ecx, %xmm2, %xmm2
>         vpinsrw $1, %r9d, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm0, %xmm0
>         vpunpckldq      %xmm3, %xmm1, %xmm1
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
>         vmovaps %xmm0, array(%rip)
>         ret
>
> test2:
>         vmovd   %r8d, %xmm5
>         vmovd   %edx, %xmm6
>         vmovd   %edi, %xmm7
>         vpinsrd $1, %r9d, %xmm5, %xmm1
>         vpinsrd $1, %ecx, %xmm6, %xmm3
>         vpinsrd $1, %esi, %xmm7, %xmm0
>         vpunpcklqdq     %xmm3, %xmm0, %xmm0
>         vmovd   16(%rbp), %xmm4
>         vpinsrd $1, 24(%rbp), %xmm4, %xmm2
>         vpunpcklqdq     %xmm2, %xmm1, %xmm1
>         vinserti128     $0x1, %xmm1, %ymm0, %ymm0
>         vmovdqu %ymm0, iarray(%rip)
>         vzeroupper
>         ret
>
> which is about 20% slower on my skylake notebook than the
> non-SLP-vectorized variant.
>
> I wonder if the vec_construct cost should be made more realistic.
> It is computed as:
>
>       case vec_construct:
>         {
>           /* N element inserts into SSE vectors.  */
>           int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           if (GET_MODE_BITSIZE (mode) == 256)
>             cost += ix86_vec_cost (mode, ix86_cost->addss);
>           /* One vinserti64x4 and two vinserti128 for combining SSE
>              and AVX256 vectors to AVX512.  */
>           else if (GET_MODE_BITSIZE (mode) == 512)
>             cost += 3 * ix86_vec_cost (mode, ix86_cost->addss);
>           return cost;
>
> So it expects 8 simple SSE operations + one SSE FP arithmetic
> operation.  The code above, however, has 8 inter-unit moves + 3 SSE integer
> operations to shuffle things around, not to mention the increased
> register pressure.
>
> I would say that for integer constructs it is a common case that values
> need to be moved from the integer unit to SSE.
>
> Overall the problem is deeper, since the vectorizer really may need a
> better idea of latencies and throughputs to estimate loop times more
> realistically.
>
> One may also want to account for the fact that stores are often not part
> of the hot path, so their latency is not too critical, while on the other
> hand vector stores prevent later partial memory stalls...
>

I opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90952

We shouldn't use the costs of moves as the costs of RTL expressions.  We can
experiment with different RTL expression cost formulas, but we need to separate
the costs of RTL expressions from the costs of moves first.  What is the best way
to partition processor_costs to avoid confusion between costs of moves and
costs of RTL expressions?
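One purely illustrative shape for such a split (struct and field names
here are hypothetical, not taken from any actual patch):

```c
#include <assert.h>

/* Hypothetical sketch: keep the move-cost tables used for register
   allocation and spilling separate from the costs used to price RTL
   expressions, so tuning one no longer perturbs the other.  */
struct processor_move_costs {
  int int_load[3];    /* QI/HI/SImode integer loads */
  int int_store[3];   /* QI/HI/SImode integer stores */
  int sse_to_integer; /* XMM -> GPR move */
  int integer_to_sse; /* GPR -> XMM move */
};

struct processor_insn_costs {
  int scalar_store;   /* used by ix86_builtin_vectorization_cost */
  int sse_op;         /* simple SSE operation */
  int addss;          /* SSE FP arithmetic operation */
};

struct processor_costs {
  struct processor_move_costs move;  /* register-move decisions */
  struct processor_insn_costs insn;  /* rtx_cost / vectorizer costs */
};
```

With a split like this, the skylake change that lowered the SImode store
move cost would not silently change scalar_store in the vectorizer.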

-- 
H.J.


Thread overview: 14+ messages
2019-06-17 16:27 H.J. Lu
2019-06-20  7:40 ` Uros Bizjak
2019-06-20  7:43   ` Uros Bizjak
2019-06-20 15:19     ` H.J. Lu
2019-06-20 20:33       ` Uros Bizjak
2019-06-20 21:10         ` Jan Hubicka
2019-06-20 21:43           ` H.J. Lu [this message]
2019-06-23 11:18             ` Jan Hubicka
2019-06-24 13:37           ` Richard Biener
2019-06-24 16:16             ` H.J. Lu
2019-07-23 22:11               ` [PATCH] i386: Separate costs of pseudo registers from hard registers H.J. Lu
2019-08-05 21:21                 ` PING^1 " H.J. Lu
2019-08-09 22:14                 ` Jeff Law
2019-08-10  0:47                   ` H.J. Lu
