From: Richard Biener <richard.guenther@gmail.com>
To: Jan Hubicka <hubicka@ucw.cz>
Cc: gcc-patches@gcc.gnu.org, mjambor@suse.cz,
Alexander Monakov <amonakov@ispras.ru>,
"Kumar, Venkataramanan" <Venkataramanan.Kumar@amd.com>,
Tejas Sanjay <TejasSanjay.Joshi@amd.com>
Subject: Re: Zen4 tuning part 1 - cost tables
Date: Tue, 6 Dec 2022 11:20:32 +0100 [thread overview]
Message-ID: <CAFiYyc0SZ6aRW-UPhdq16FxU94NxRkkD+WvfN1+raw=N3yqpow@mail.gmail.com> (raw)
In-Reply-To: <Y48S1d7kqcbRhfJ3@kam.mff.cuni.cz>
On Tue, Dec 6, 2022 at 11:01 AM Jan Hubicka via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi
> this patch updates the costs of znver4, mostly based on data measured by Agner Fog.
> Compared to previous generations, x87 became a bit slower, which is probably not
> a big deal (and we have minimal benchmarking coverage for it). One interesting
> improvement is the reduction of FMA cost. I also updated the costs of AVX256
> loads/stores based on latencies (not on throughput, which is twice that of AVX256).
> Overall, AVX512 vectorization seems to noticeably improve some TSVC
> benchmarks, but since 512-bit vectors are internally split into 256-bit ones it is
> somewhat risky and does not win in SPEC scores (mostly by regressing benchmarks
> with loops that have small trip counts, like x264 and exchange), so for now I am
> going to set the AVX256_OPTIMAL tune, but I am still playing with it. We have improved
> since ZNVER1 on choosing the vectorization size and also have vectorized
> prologues/epilogues, so it may be possible to make AVX512 a small win overall.
>
> In general I would like to keep the cost tables latency based unless we have
> a good reason not to. There are some interesting differences in the
> znver3 tables that I also patched; those seem performance neutral. I will
> send that separately.
>
> Bootstrapped/regtested x86_64-linux, also benchmarked on SPEC2017 along
> with AVX512 tuning. I plan to commit it tomorrow unless there are some
> comments.
>
> Honza
>
> * x86-tune-costs.h (znver4_cost): Update costs of FP and SSE moves,
> division, multiplication, gathers, L2 cache size, and more complex
> FP instructions.
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index f01b8ee9eef..3a6ce02f093 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1867,9 +1868,9 @@ struct processor_costs znver4_cost = {
> {8, 8, 8}, /* cost of storing integer
> registers. */
> 2, /* cost of reg,reg fld/fst. */
> - {6, 6, 16}, /* cost of loading fp registers
> + {14, 14, 17}, /* cost of loading fp registers
> in SFmode, DFmode and XFmode. */
> - {8, 8, 16}, /* cost of storing fp registers
> + {12, 12, 16}, /* cost of storing fp registers
> in SFmode, DFmode and XFmode. */
> 2, /* cost of moving MMX register. */
> {6, 6}, /* cost of loading MMX registers
> @@ -1878,13 +1879,13 @@ struct processor_costs znver4_cost = {
> in SImode and DImode. */
> 2, 2, 3, /* cost of moving XMM,YMM,ZMM
> register. */
> - {6, 6, 6, 6, 12}, /* cost of loading SSE registers
> + {6, 6, 10, 10, 12}, /* cost of loading SSE registers
> in 32,64,128,256 and 512-bit. */
> - {8, 8, 8, 8, 16}, /* cost of storing SSE registers
> + {8, 8, 8, 12, 12}, /* cost of storing SSE registers
> in 32,64,128,256 and 512-bit. */
> - 6, 6, /* SSE->integer and integer->SSE
> + 6, 8, /* SSE->integer and integer->SSE
> moves. */
> - 8, 8, /* mask->integer and integer->mask moves */
> + 8, 8, /* mask->integer and integer->mask moves */
> {6, 6, 6}, /* cost of loading mask register
> in QImode, HImode, SImode. */
> {8, 8, 8}, /* cost if storing mask register
> @@ -1894,6 +1895,7 @@ struct processor_costs znver4_cost = {
> },
>
> COSTS_N_INSNS (1), /* cost of an add instruction. */
> + /* TODO: Lea with 3 components has cost 2. */
> COSTS_N_INSNS (1), /* cost of a lea instruction. */
> COSTS_N_INSNS (1), /* variable shift costs. */
> COSTS_N_INSNS (1), /* constant shift costs. */
> @@ -1904,11 +1906,11 @@ struct processor_costs znver4_cost = {
> COSTS_N_INSNS (3)}, /* other. */
> 0, /* cost of multiply per each bit
> set. */
> - {COSTS_N_INSNS (9), /* cost of a divide/mod for QI. */
> - COSTS_N_INSNS (10), /* HI. */
> - COSTS_N_INSNS (12), /* SI. */
> - COSTS_N_INSNS (17), /* DI. */
> - COSTS_N_INSNS (17)}, /* other. */
> + {COSTS_N_INSNS (12), /* cost of a divide/mod for QI. */
> + COSTS_N_INSNS (13), /* HI. */
> + COSTS_N_INSNS (13), /* SI. */
> + COSTS_N_INSNS (18), /* DI. */
> + COSTS_N_INSNS (18)}, /* other. */
> COSTS_N_INSNS (1), /* cost of movsx. */
> COSTS_N_INSNS (1), /* cost of movzx. */
> 8, /* "large" insn. */
> @@ -1919,22 +1921,22 @@ struct processor_costs znver4_cost = {
> Relative to reg-reg move (2). */
> {8, 8, 8}, /* cost of storing integer
> registers. */
> - {6, 6, 6, 6, 12}, /* cost of loading SSE registers
> + {6, 6, 10, 10, 12}, /* cost of loading SSE registers
> in 32bit, 64bit, 128bit, 256bit and 512bit */
> - {8, 8, 8, 8, 16}, /* cost of storing SSE register
> + {8, 8, 8, 12, 12}, /* cost of storing SSE register
> in 32bit, 64bit, 128bit, 256bit and 512bit */
> - {6, 6, 6, 6, 12}, /* cost of unaligned loads. */
> - {8, 8, 8, 8, 16}, /* cost of unaligned stores. */
> - 2, 2, 3, /* cost of moving XMM,YMM,ZMM
> + {6, 6, 6, 6, 6}, /* cost of unaligned loads. */
> + {8, 8, 8, 8, 8}, /* cost of unaligned stores. */
> + 2, 2, 2, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> - /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
> - throughput 9. Approx 7 uops do not depend on vector size and every load
> - is 4 uops. */
> - 14, 8, /* Gather load static, per_elt. */
> - 14, 10, /* Gather store static, per_elt. */
> + /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
> + throughput 5. Approx 7 uops do not depend on vector size and every load
> + is 5 uops. */
> + 14, 10, /* Gather load static, per_elt. */
> + 14, 20, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> - 512, /* size of l2 cache. */
> + 1024, /* size of l2 cache. */
> 64, /* size of prefetch block. */
> /* New AMD processors never drop prefetches; if they cannot be performed
> immediately, they are queued. We set number of simultaneous prefetches
> @@ -1943,26 +1945,26 @@ struct processor_costs znver4_cost = {
> time). */
> 100, /* number of parallel prefetches. */
> 3, /* Branch cost. */
> - COSTS_N_INSNS (5), /* cost of FADD and FSUB insns. */
> - COSTS_N_INSNS (5), /* cost of FMUL instruction. */
> + COSTS_N_INSNS (7), /* cost of FADD and FSUB insns. */
> + COSTS_N_INSNS (7), /* cost of FMUL instruction. */
> /* Latency of fdiv is 8-15. */
> COSTS_N_INSNS (15), /* cost of FDIV instruction. */
> COSTS_N_INSNS (1), /* cost of FABS instruction. */
> COSTS_N_INSNS (1), /* cost of FCHS instruction. */
> /* Latency of fsqrt is 4-10. */
The above comment looks like it needs updating as well.
> - COSTS_N_INSNS (10), /* cost of FSQRT instruction. */
> + COSTS_N_INSNS (25), /* cost of FSQRT instruction. */
>
> COSTS_N_INSNS (1), /* cost of cheap SSE instruction. */
> COSTS_N_INSNS (3), /* cost of ADDSS/SD SUBSS/SD insns. */
> COSTS_N_INSNS (3), /* cost of MULSS instruction. */
> COSTS_N_INSNS (3), /* cost of MULSD instruction. */
> - COSTS_N_INSNS (5), /* cost of FMA SS instruction. */
> - COSTS_N_INSNS (5), /* cost of FMA SD instruction. */
> - COSTS_N_INSNS (10), /* cost of DIVSS instruction. */
> + COSTS_N_INSNS (4), /* cost of FMA SS instruction. */
> + COSTS_N_INSNS (4), /* cost of FMA SD instruction. */
> + COSTS_N_INSNS (13), /* cost of DIVSS instruction. */
> /* 9-13. */
> COSTS_N_INSNS (13), /* cost of DIVSD instruction. */
> - COSTS_N_INSNS (10), /* cost of SQRTSS instruction. */
> - COSTS_N_INSNS (15), /* cost of SQRTSD instruction. */
> + COSTS_N_INSNS (15), /* cost of SQRTSS instruction. */
> + COSTS_N_INSNS (21), /* cost of SQRTSD instruction. */
> /* Zen can execute 4 integer operations per cycle. FP operations
> take 3 cycles and it can execute 2 integer additions and 2
> multiplications thus reassociation may make sense up to with of 6.
Thread overview: 4+ messages
2022-12-06 10:00 Jan Hubicka
2022-12-06 10:20 ` Richard Biener [this message]
2022-12-06 11:24 ` Jan Hubicka
2022-12-08 9:43 ` Kumar, Venkataramanan