From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=zIVP=4E=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lj1-x233.google.com (mail-lj1-x233.google.com [IPv6:2a00:1450:4864:20::233])
	by sourceware.org (Postfix) with ESMTPS id 3A461383B6C8
	for <gcc-patches@gcc.gnu.org>; Tue,  6 Dec 2022 10:20:47 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3A461383B6C8
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x233.google.com with SMTP id f16so5019301ljc.8
        for <gcc-patches@gcc.gnu.org>; Tue, 06 Dec 2022 02:20:47 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=UTcXtqoAoiQOnhRmqcnx4dacACnFf6MWVhtf/zoDx2A=;
        b=p0P12m/sPiJhQvO52Cgu9UWhq1nh3njRb9hzsflS3ZKQEk+n/ayecErupxUcfMaMF4
         prclAtL3YLntHWpiTQ2hTNIUeDYcc70XWiWOqwpaIGAzB1cgCjQiYELHig4J+wgZQLxc
         PIMEK4MHHt67zlAaZZ6mw/3gKUKETN+vGn3niOzItC55+GtssdQnav0wuAB5QNWfJqQH
         7PvUf7N5sUdOL9JDAejQ1csqE4Lql1Ff6aXiH5NtCw54Dx+9EGe6sWPwyrHypOxuc6c0
         uVFCF7615os/C/I1YdcZ+MzmL+4DmRgI/UDbbpLqfvtXmErlWlxwnmCFnbLqV9mmNWmI
         UDbQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=UTcXtqoAoiQOnhRmqcnx4dacACnFf6MWVhtf/zoDx2A=;
        b=s6uKgoH77yHoTokpO6LbDtw7chA3p0phHauzXgK/pYLbw/n48zGAZO8IdBcOdVN4vd
         9NKdyAAUkWI4PTiVZdcIFiIhOzZzPOReOjjT03MZ1c4zcwKAqItenJcbjiVepdyFfjW3
         Gxxt4S0uqt/Qt+JEOxZRRRKBmtM5BzLQA3v6RpFqT125O3qW7evC+LiBGymcATQ/CyZs
         YrUS/sLXwgQPLR6jbEQTzauwhRPEC076vsDyFt2RtNQ0aQYU7A7Wn02jKOGFj8uDcgC1
         MRfqC+MEtl83rofS8+/t8WFV0zxabIUFMaopIwvbMERaZCmFDgocmtAi8G2KEuwyW/5z
         vKWA==
X-Gm-Message-State: ANoB5pnOsSUdvl+vhLuhU31AADiRqj4xTtrtVTxFwQ9Yrxd32GVZu3X1
	paoTWpYDxrcVcD947qUamEJ2bfwECTnCqGmk4Cc=
X-Google-Smtp-Source: AA0mqf6qHxu7XipyYUR0xeE8E7cSdKfy2fXweInykBvEfyy8ClP7p04WXgfkPiXL44LLDe8qHxhklAT6h1rOhQqbijo=
X-Received: by 2002:a2e:968b:0:b0:27a:76e:748f with SMTP id
 q11-20020a2e968b000000b0027a076e748fmr1524633lji.467.1670322045287; Tue, 06
 Dec 2022 02:20:45 -0800 (PST)
MIME-Version: 1.0
References: <Y48S1d7kqcbRhfJ3@kam.mff.cuni.cz>
In-Reply-To: <Y48S1d7kqcbRhfJ3@kam.mff.cuni.cz>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 6 Dec 2022 11:20:32 +0100
Message-ID: <CAFiYyc0SZ6aRW-UPhdq16FxU94NxRkkD+WvfN1+raw=N3yqpow@mail.gmail.com>
Subject: Re: Zen4 tuning part 1 - cost tables
To: Jan Hubicka <hubicka@ucw.cz>
Cc: gcc-patches@gcc.gnu.org, mjambor@suse.cz, 
	Alexander Monakov <amonakov@ispras.ru>, "Kumar, Venkataramanan" <Venkataramanan.Kumar@amd.com>, 
	Tejas Sanjay <TejasSanjay.Joshi@amd.com>
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-7.3 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, Dec 6, 2022 at 11:01 AM Jan Hubicka via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi
> this patch updates cost of znver4 mostly based on data measued by Agner Fog.
> Compared to previous generations x87 became bit slower which is probably not
> big deal (and we have minimal benchmarking coverage for it).  One interesting
> improvement is reducation of FMA cost.  I also updated costs of AVX256
> loads/stores  based on latencies (not throughput which is twice of avx256).
> Overall AVX512 vectorization seems to improve noticeably some of TSVC
> benchmarks but since internally 512 vectors are split to 256 vectors it is
> somewhat risky and does not win in SPEC scores (mostly by regressing benchmarks
> with loop that have small trip count like x264 and exchange), so for now I am
> going to set AVX256_OPTIMAL tune but I am still playing with it.  We improved
> since ZNVER1 on choosing vectorization size and also have vectorized
> prologues/epilogues so it may be possible to make avx512 small win overall.
>
> In general I would like to keep cost tables latency based unless we have
> a good reason to not do so.  There are some interesting diferences in
> znver3 tables that I also patched and seems performance neutral.  I will
> send that separately.
>
> Bootstrapped/regtested x86_64-linux, also benchmarked on SPEC2017 along
> with AVX512 tuning.  I plan to commit it tomorrow unless there are some
> comments.
>
> Honza
>
>         * x86-tune-costs.h (znver4_cost): Upate costs of FP and SSE moves,
>         division multiplication, gathers, L2 cache size, and more complex
>         FP instrutions.
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index f01b8ee9eef..3a6ce02f093 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1867,9 +1868,9 @@ struct processor_costs znver4_cost = {
>    {8, 8, 8},                           /* cost of storing integer
>                                            registers.  */
>    2,                                   /* cost of reg,reg fld/fst.  */
> -  {6, 6, 16},                          /* cost of loading fp registers
> +  {14, 14, 17},                                /* cost of loading fp registers
>                                            in SFmode, DFmode and XFmode.  */
> -  {8, 8, 16},                          /* cost of storing fp registers
> +  {12, 12, 16},                                /* cost of storing fp registers
>                                            in SFmode, DFmode and XFmode.  */
>    2,                                   /* cost of moving MMX register.  */
>    {6, 6},                              /* cost of loading MMX registers
> @@ -1878,13 +1879,13 @@ struct processor_costs znver4_cost = {
>                                            in SImode and DImode.  */
>    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
> -  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
> +  {6, 6, 10, 10, 12},                  /* cost of loading SSE registers
>                                            in 32,64,128,256 and 512-bit.  */
> -  {8, 8, 8, 8, 16},                    /* cost of storing SSE registers
> +  {8, 8, 8, 12, 12},                   /* cost of storing SSE registers
>                                            in 32,64,128,256 and 512-bit.  */
> -  6, 6,                                        /* SSE->integer and integer->SSE
> +  6, 8,                                        /* SSE->integer and integer->SSE
>                                            moves.  */
> -  8, 8,                                /* mask->integer and integer->mask moves */
> +  8, 8,                                        /* mask->integer and integer->mask moves */
>    {6, 6, 6},                           /* cost of loading mask register
>                                            in QImode, HImode, SImode.  */
>    {8, 8, 8},                           /* cost if storing mask register
> @@ -1894,6 +1895,7 @@ struct processor_costs znver4_cost = {
>    },
>
>    COSTS_N_INSNS (1),                   /* cost of an add instruction.  */
> +  /* TODO: Lea with 3 components has cost 2.  */
>    COSTS_N_INSNS (1),                   /* cost of a lea instruction.  */
>    COSTS_N_INSNS (1),                   /* variable shift costs.  */
>    COSTS_N_INSNS (1),                   /* constant shift costs.  */
> @@ -1904,11 +1906,11 @@ struct processor_costs znver4_cost = {
>     COSTS_N_INSNS (3)},                 /*                      other.  */
>    0,                                   /* cost of multiply per each bit
>                                            set.  */
> -  {COSTS_N_INSNS (9),                  /* cost of a divide/mod for QI.  */
> -   COSTS_N_INSNS (10),                 /*                          HI.  */
> -   COSTS_N_INSNS (12),                 /*                          SI.  */
> -   COSTS_N_INSNS (17),                 /*                          DI.  */
> -   COSTS_N_INSNS (17)},                        /*                          other.  */
> +  {COSTS_N_INSNS (12),                 /* cost of a divide/mod for QI.  */
> +   COSTS_N_INSNS (13),                 /*                          HI.  */
> +   COSTS_N_INSNS (13),                 /*                          SI.  */
> +   COSTS_N_INSNS (18),                 /*                          DI.  */
> +   COSTS_N_INSNS (18)},                        /*                          other.  */
>    COSTS_N_INSNS (1),                   /* cost of movsx.  */
>    COSTS_N_INSNS (1),                   /* cost of movzx.  */
>    8,                                   /* "large" insn.  */
> @@ -1919,22 +1921,22 @@ struct processor_costs znver4_cost = {
>                                            Relative to reg-reg move (2).  */
>    {8, 8, 8},                           /* cost of storing integer
>                                            registers.  */
> -  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
> +  {6, 6, 10, 10, 12},                  /* cost of loading SSE registers
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> -  {8, 8, 8, 8, 16},                    /* cost of storing SSE register
> +  {8, 8, 8, 12, 12},                   /* cost of storing SSE register
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> -  {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
> -  {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> -  2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
> +  {6, 6, 6, 6, 6},                     /* cost of unaligned loads.  */
> +  {8, 8, 8, 8, 8},                     /* cost of unaligned stores.  */
> +  2, 2, 2,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
>    6,                                   /* cost of moving SSE register to integer.  */
> -  /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
> -     throughput 9.  Approx 7 uops do not depend on vector size and every load
> -     is 4 uops.  */
> -  14, 8,                               /* Gather load static, per_elt.  */
> -  14, 10,                              /* Gather store static, per_elt.  */
> +  /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
> +     throughput 5.  Approx 7 uops do not depend on vector size and every load
> +     is 5 uops.  */
> +  14, 10,                              /* Gather load static, per_elt.  */
> +  14, 20,                              /* Gather store static, per_elt.  */
>    32,                                  /* size of l1 cache.  */
> -  512,                                 /* size of l2 cache.  */
> +  1024,                                        /* size of l2 cache.  */
>    64,                                  /* size of prefetch block.  */
>    /* New AMD processors never drop prefetches; if they cannot be performed
>       immediately, they are queued.  We set number of simultaneous prefetches
> @@ -1943,26 +1945,26 @@ struct processor_costs znver4_cost = {
>       time).  */
>    100,                                 /* number of parallel prefetches.  */
>    3,                                   /* Branch cost.  */
> -  COSTS_N_INSNS (5),                   /* cost of FADD and FSUB insns.  */
> -  COSTS_N_INSNS (5),                   /* cost of FMUL instruction.  */
> +  COSTS_N_INSNS (7),                   /* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (7),                   /* cost of FMUL instruction.  */
>    /* Latency of fdiv is 8-15.  */
>    COSTS_N_INSNS (15),                  /* cost of FDIV instruction.  */
>    COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
>    COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
>    /* Latency of fsqrt is 4-10.  */

the above comment looks like it needs updating as well

> -  COSTS_N_INSNS (10),                  /* cost of FSQRT instruction.  */
> +  COSTS_N_INSNS (25),                  /* cost of FSQRT instruction.  */
>
>    COSTS_N_INSNS (1),                   /* cost of cheap SSE instruction.  */
>    COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
>    COSTS_N_INSNS (3),                   /* cost of MULSS instruction.  */
>    COSTS_N_INSNS (3),                   /* cost of MULSD instruction.  */
> -  COSTS_N_INSNS (5),                   /* cost of FMA SS instruction.  */
> -  COSTS_N_INSNS (5),                   /* cost of FMA SD instruction.  */
> -  COSTS_N_INSNS (10),                  /* cost of DIVSS instruction.  */
> +  COSTS_N_INSNS (4),                   /* cost of FMA SS instruction.  */
> +  COSTS_N_INSNS (4),                   /* cost of FMA SD instruction.  */
> +  COSTS_N_INSNS (13),                  /* cost of DIVSS instruction.  */
>    /* 9-13.  */
>    COSTS_N_INSNS (13),                  /* cost of DIVSD instruction.  */
> -  COSTS_N_INSNS (10),                  /* cost of SQRTSS instruction.  */
> -  COSTS_N_INSNS (15),                  /* cost of SQRTSD instruction.  */
> +  COSTS_N_INSNS (15),                  /* cost of SQRTSS instruction.  */
> +  COSTS_N_INSNS (21),                  /* cost of SQRTSD instruction.  */
>    /* Zen can execute 4 integer operations per cycle.  FP operations
>       take 3 cycles and it can execute 2 integer additions and 2
>       multiplications thus reassociation may make sense up to with of 6.