From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kumar, Venkataramanan"
To: Jan Hubicka, "gcc-patches@gcc.gnu.org", "mjambor@suse.cz", Alexander Monakov, "Joshi, Tejas Sanjay"
Subject: RE: Zen4 tuning part 1 - cost tables
Date: Thu, 8 Dec 2022 09:43:16 +0000

[AMD Official Use Only - General]

Hi Honza,

Thank you for posting the tuning patch.

> -----Original Message-----
> From: Jan Hubicka
> Sent: Tuesday, December 6, 2022 3:31 PM
> To: gcc-patches@gcc.gnu.org; mjambor@suse.cz; Alexander Monakov;
> Kumar, Venkataramanan; Joshi, Tejas Sanjay
> Subject: Zen4 tuning part 1 - cost tables
>
> Hi
> this patch updates the costs of znver4, mostly based on data measured by
> Agner Fog.
> Compared to previous generations, x87 became a bit slower, which is probably
> not a big deal (and we have minimal benchmarking coverage for it). One
> interesting improvement is the reduction of FMA cost. I also updated costs of
> AVX256 loads/stores based on latencies (not throughput which is twice of
> avx256).
> Overall, AVX512 vectorization seems to noticeably improve some of the TSVC
> benchmarks, but since internally 512 vectors are split to 256 vectors it is
> somewhat risky and does not win in SPEC scores (mostly by regressing
> benchmarks with loops that have a small trip count, like x264 and exchange),
> so for now I am going to set the AVX256_OPTIMAL tune, but I am still playing
> with it.
> We improved since ZNVER1 on choosing vectorization size and also have
> vectorized prologues/epilogues, so it may be possible to make avx512 a small
> win overall.

I also noted improvements to TSVC benchmarks when we enable AVX512
vectorization. I think we should allow full 512-bit AVX512 vectorization for
znver4. Even if the 512-bit vectors are broken into two 256-bit vectors, we can
pipeline the higher half immediately in the next cycle. We also have fewer
instructions to decode with AVX512 instructions. Overall, AVX512 operations
should be better.

>
> In general I would like to keep cost tables latency based unless we have a
> good reason to not do so. There are some interesting differences in
> znver3 tables that I also patched and seems performance neutral. I will send
> that separately.
>
> Bootstrapped/regtested x86_64-linux, also benchmarked on SPEC2017 along
> with AVX512 tuning. I plan to commit it tomorrow unless there are some
> comments.
>
> Honza
>
>         * x86-tune-costs.h (znver4_cost): Update costs of FP and SSE moves,
>         division multiplication, gathers, L2 cache size, and more complex
>         FP instructions.
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index f01b8ee9eef..3a6ce02f093 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1867,9 +1868,9 @@ struct processor_costs znver4_cost = {
>    {8, 8, 8},                           /* cost of storing integer
>                                            registers.  */
>    2,                                   /* cost of reg,reg fld/fst.  */
> -  {6, 6, 16},                          /* cost of loading fp registers
> +  {14, 14, 17},                        /* cost of loading fp registers
>                                            in SFmode, DFmode and XFmode.  */
> -  {8, 8, 16},                          /* cost of storing fp registers
> +  {12, 12, 16},                        /* cost of storing fp registers
>                                            in SFmode, DFmode and XFmode.  */
>    2,                                   /* cost of moving MMX register.  */
>    {6, 6},                              /* cost of loading MMX registers
> @@ -1878,13 +1879,13 @@ struct processor_costs znver4_cost = {
>                                            in SImode and DImode.  */
>    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
> -  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
> +  {6, 6, 10, 10, 12},                  /* cost of loading SSE registers
>                                            in 32,64,128,256 and 512-bit.  */
> -  {8, 8, 8, 8, 16},                    /* cost of storing SSE registers
> +  {8, 8, 8, 12, 12},                   /* cost of storing SSE registers
>                                            in 32,64,128,256 and 512-bit.  */
> -  6, 6,                                /* SSE->integer and integer->SSE
> +  6, 8,                                /* SSE->integer and integer->SSE
>                                            moves.  */
> -  8, 8,                                /* mask->integer and integer->mask moves */
> +  8, 8,                                /* mask->integer and integer->mask moves */
>    {6, 6, 6},                           /* cost of loading mask register
>                                            in QImode, HImode, SImode.  */
>    {8, 8, 8},                           /* cost if storing mask register
> @@ -1894,6 +1895,7 @@ struct processor_costs znver4_cost = {
>    },
>
>    COSTS_N_INSNS (1),                   /* cost of an add instruction.  */
> +  /* TODO: Lea with 3 components has cost 2.  */
>    COSTS_N_INSNS (1),                   /* cost of a lea instruction.  */
>    COSTS_N_INSNS (1),                   /* variable shift costs.  */
>    COSTS_N_INSNS (1),                   /* constant shift costs.  */
> @@ -1904,11 +1906,11 @@ struct processor_costs znver4_cost = {
>     COSTS_N_INSNS (3)},                 /* other.  */
>    0,                                   /* cost of multiply per each bit
>                                            set.  */
> -  {COSTS_N_INSNS (9),                  /* cost of a divide/mod for QI.  */
> -   COSTS_N_INSNS (10),                 /* HI.  */
> -   COSTS_N_INSNS (12),                 /* SI.  */
> -   COSTS_N_INSNS (17),                 /* DI.  */
> -   COSTS_N_INSNS (17)},                /* other.  */
> +  {COSTS_N_INSNS (12),                 /* cost of a divide/mod for QI.  */
> +   COSTS_N_INSNS (13),                 /* HI.  */
> +   COSTS_N_INSNS (13),                 /* SI.  */
> +   COSTS_N_INSNS (18),                 /* DI.  */
> +   COSTS_N_INSNS (18)},                /* other.  */
>    COSTS_N_INSNS (1),                   /* cost of movsx.  */
>    COSTS_N_INSNS (1),                   /* cost of movzx.  */
>    8,                                   /* "large" insn.  */
> @@ -1919,22 +1921,22 @@ struct processor_costs znver4_cost = {
>                                            Relative to reg-reg move (2).  */
>    {8, 8, 8},                           /* cost of storing integer
>                                            registers.  */
> -  {6, 6, 6, 6, 12},                    /* cost of loading SSE registers
> +  {6, 6, 10, 10, 12},                  /* cost of loading SSE registers
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> -  {8, 8, 8, 8, 16},                    /* cost of storing SSE register
> +  {8, 8, 8, 12, 12},                   /* cost of storing SSE register
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> -  {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
> -  {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> -  2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
> +  {6, 6, 6, 6, 6},                     /* cost of unaligned loads.  */
> +  {8, 8, 8, 8, 8},                     /* cost of unaligned stores.  */
> +  2, 2, 2,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
>    6,                                   /* cost of moving SSE register to integer.  */
> -  /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
> -     throughput 9.  Approx 7 uops do not depend on vector size and every load
> -     is 4 uops.  */
> -  14, 8,                               /* Gather load static, per_elt.  */
> -  14, 10,                              /* Gather store static, per_elt.  */
> +  /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
> +     throughput 5.  Approx 7 uops do not depend on vector size and every load
> +     is 5 uops.  */
> +  14, 10,                              /* Gather load static, per_elt.  */
> +  14, 20,                              /* Gather store static, per_elt.  */
>    32,                                  /* size of l1 cache.  */
> -  512,                                 /* size of l2 cache.  */
> +  1024,                                /* size of l2 cache.  */
>    64,                                  /* size of prefetch block.  */
>    /* New AMD processors never drop prefetches; if they cannot be performed
>       immediately, they are queued.  We set number of simultaneous prefetches
> @@ -1943,26 +1945,26 @@ struct processor_costs znver4_cost = {
>       time).  */
>    100,                                 /* number of parallel prefetches.  */
>    3,                                   /* Branch cost.  */
> -  COSTS_N_INSNS (5),                   /* cost of FADD and FSUB insns.  */
> -  COSTS_N_INSNS (5),                   /* cost of FMUL instruction.  */
> +  COSTS_N_INSNS (7),                   /* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (7),                   /* cost of FMUL instruction.  */
>    /* Latency of fdiv is 8-15.  */
>    COSTS_N_INSNS (15),                  /* cost of FDIV instruction.  */
>    COSTS_N_INSNS (1),                   /* cost of FABS instruction.  */
>    COSTS_N_INSNS (1),                   /* cost of FCHS instruction.  */
>    /* Latency of fsqrt is 4-10.  */
> -  COSTS_N_INSNS (10),                  /* cost of FSQRT instruction.  */
> +  COSTS_N_INSNS (25),                  /* cost of FSQRT instruction.  */
>
>    COSTS_N_INSNS (1),                   /* cost of cheap SSE instruction.  */
>    COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns.  */
>    COSTS_N_INSNS (3),                   /* cost of MULSS instruction.  */
>    COSTS_N_INSNS (3),                   /* cost of MULSD instruction.  */
> -  COSTS_N_INSNS (5),                   /* cost of FMA SS instruction.  */
> -  COSTS_N_INSNS (5),                   /* cost of FMA SD instruction.  */
> -  COSTS_N_INSNS (10),                  /* cost of DIVSS instruction.  */
> +  COSTS_N_INSNS (4),                   /* cost of FMA SS instruction.  */
> +  COSTS_N_INSNS (4),                   /* cost of FMA SD instruction.  */
> +  COSTS_N_INSNS (13),                  /* cost of DIVSS instruction.  */
>    /* 9-13.  */
>    COSTS_N_INSNS (13),                  /* cost of DIVSD instruction.  */
> -  COSTS_N_INSNS (10),                  /* cost of SQRTSS instruction.  */
> -  COSTS_N_INSNS (15),                  /* cost of SQRTSD instruction.  */
> +  COSTS_N_INSNS (15),                  /* cost of SQRTSS instruction.  */
> +  COSTS_N_INSNS (21),                  /* cost of SQRTSD instruction.  */
>    /* Zen can execute 4 integer operations per cycle.  FP operations
>       take 3 cycles and it can execute 2 integer additions and 2
>       multiplications thus reassociation may make sense up to with of 6.

The cost changes look fine.

Regards,
Venkat.
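
For readers less familiar with the table, here is a rough sketch of how the gather entries in the patch might combine into a single cost. It assumes a gather of N elements is charged as the static part plus N times the per_elt part; the helper names gather_cost and print_gather_load_costs are hypothetical and are not GCC functions. COSTS_N_INSNS is reproduced as N * 4, following its definition in GCC's rtl.h.

```c
#include <stdio.h>

/* GCC's rtl.h defines COSTS_N_INSNS (N) as N * 4; reproduced here only
   so the sketch is self-contained.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper, not part of GCC: assumed cost of gathering NELTS
   elements, modelled as a fixed part plus a per-element part.  */
int
gather_cost (int static_cost, int per_elt, int nelts)
{
  return static_cost + per_elt * nelts;
}

/* Print the old (14, 8) and new (14, 10) znver4 gather-load entries
   evaluated for a few element counts.  */
void
print_gather_load_costs (void)
{
  static const int nelts[] = { 4, 8, 16 };
  for (unsigned i = 0; i < sizeof nelts / sizeof nelts[0]; i++)
    printf ("%2d elts: old %3d, new %3d\n", nelts[i],
            gather_cost (14, 8, nelts[i]),
            gather_cost (14, 10, nelts[i]));
}
```

Under that assumption, an 8-element gather load goes from 14 + 8 * 8 = 78 to 14 + 10 * 8 = 94, so the per-element term dominates as vectors get wider. The FMA change from COSTS_N_INSNS (5) to COSTS_N_INSNS (4) is a drop from 20 to 16 in the same units.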