From: Tamar Christina
To: James Greenhalgh
CC: Richard Sandiford, GCC Patches, nd, Marcus Shawcroft, Richard Earnshaw
Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (1/4) - infrastructure.
Date: Mon, 03 Jul 2017 06:12:00 -0000
Ping

________________________________________
From: gcc-patches-owner@gcc.gnu.org on behalf of Tamar Christina
Sent: Monday, June 26, 2017 11:49:42 AM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (1/4) - infrastructure.

Hi All,

I've updated the patch accordingly. This mostly involves removing the loop
that created the ival and removing the *2 code; instead we default to 64-bit
and switch to 128-bit when needed (a sketch of the resulting shape follows
the ChangeLog).

Regression tested on aarch64-none-linux-gnu with no regressions.

OK for trunk?

Thanks,
Tamar

gcc/
2017-06-26  Tamar Christina

	* config/aarch64/aarch64.c (aarch64_simd_container_mode): Add prototype.
	(aarch64_expand_mov_immediate): Add HI support.
	(aarch64_reinterpret_float_as_int, aarch64_float_const_rtx_p): New.
	(aarch64_can_const_movi_rtx_p): New.
	(aarch64_preferred_reload_class): Remove restrictions of using FP
	registers for certain SIMD operations.
	(aarch64_rtx_costs): Add new cost for CONST_DOUBLE moves.
	(aarch64_valid_floating_const): Add integer move validation.
	(aarch64_simd_imm_scalar_p): Remove.
	(aarch64_output_scalar_simd_mov_immediate): Generalize function.
	(aarch64_legitimate_constant_p): Expand list of supported cases.
	* config/aarch64/aarch64-protos.h (aarch64_float_const_rtx_p,
	aarch64_can_const_movi_rtx_p): New.
	(aarch64_reinterpret_float_as_int): New.
	(aarch64_simd_imm_scalar_p): Remove.
	* config/aarch64/predicates.md (aarch64_reg_or_fp_float): New.
	* config/aarch64/constraints.md (Uvi): New.
	(Dd): Split into Ds and new Dd.
	* config/aarch64/aarch64.md (*movsi_aarch64): Add SIMD mov case.
	(*movdi_aarch64): Add SIMD mov case.
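For concreteness, a minimal standalone sketch of the simplified ival
construction mentioned above: read the bit pattern of the constant directly
as a 64-bit integer instead of assembling it 32 bits at a time in a loop.
This is illustrative plain C, not the patch itself; the function name and
the use of memcpy are mine, whereas the patch works on GCC's internal
REAL_VALUE_TYPE representation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer in one step,
   with no per-32-bit-word loop.  HF/SF constants widen into the same
   64-bit default; only a 128-bit float would need a second word.  */
static uint64_t
float_bits_as_int (double d)
{
  uint64_t ival;
  memcpy (&ival, &d, sizeof ival);	/* well-defined type pun in C */
  return ival;
}

int
main (void)
{
  /* 1.0 in IEEE 754 binary64 is 0x3ff0000000000000.  */
  printf ("%#llx\n", (unsigned long long) float_bits_as_int (1.0));
  return 0;
}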
________________________________________
From: Tamar Christina
Sent: Thursday, June 15, 2017 1:50:19 PM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: RE: [PATCH][GCC][AArch64] optimize float immediate moves (1/4) - infrastructure.

> This patch is pretty huge, are there any opportunities to further split it
> to aid review?

Unfortunately, because I'm also changing some constraints, it introduced a
bit of a dependency cycle. If I were to break it up more, the individual
patches wouldn't work on their own anymore. If this is acceptable I can
break it up more.

> > +  ival = zext_hwi (res[needed - 1], 32);
> > +  for (int i = needed - 2; i >= 0; i--)
> > +    {
> > +      ival <<= 32;
> > +      ival |= zext_hwi (res[i], 32);
> > +    }
> > +
> > +  *intval = ival;
>
> ???
>
> Two cases here, needed is either 2 if GET_MODE_BITSIZE (mode) == 64, or it
> is 1 otherwise. So i starts at either -1 or 0. So this for loop either runs
> 0 or 1 times. What am I missing? I'm sure this is all an indirect way of
> writing:
>

Yes, the code was set up to be easily extended to support 128-bit floats as
well, which was deprioritized. I'll just remove the loop.

> > +
> > +  /* Determine whether it's cheaper to write float constants as
> > +     mov/movk pairs over ldr/adrp pairs.  */
> > +  unsigned HOST_WIDE_INT ival;
> > +
> > +  if (GET_CODE (x) == CONST_DOUBLE
> > +      && SCALAR_FLOAT_MODE_P (mode)
> > +      && aarch64_reinterpret_float_as_int (x, &ival))
> > +    {
> > +      machine_mode imode = mode == HFmode ? SImode
> > +					    : int_mode_for_mode (mode);
> > +      int num_instr = aarch64_internal_mov_immediate
> > +			(NULL_RTX, gen_int_mode (ival, imode), false, imode);
> > +      return num_instr < 3;
>
> Should this cost model be static on a magic number? Is it not the case that
> the decision should be based on the relative speeds of a memory access
> compared with mov/movk/fmov?

As far as I'm aware, the cost model is too simplistic to be able to express
the actual costs of mov/movk and movk/movk pairs. E.g. it doesn't take into
account the latency and throughput difference when the instructions occur in
sequence/pairs. This leads to it allowing a smaller subset through here than
what would be beneficial.

> > +/* Return TRUE if rtx X is an immediate constant that fits in a single
> > +   MOVI immediate operation.  */
> > +bool
> > +aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode)
> > +{
> > +  if (!TARGET_SIMD)
> > +    return false;
> > +
> > +  machine_mode vmode, imode;
> > +  unsigned HOST_WIDE_INT ival;
> > +
> > +  /* Don't write float constants out to memory.  */
> > +  if (GET_CODE (x) == CONST_DOUBLE
> > +      && SCALAR_FLOAT_MODE_P (mode))
> > +    {
> > +      if (!aarch64_reinterpret_float_as_int (x, &ival))
> > +	return false;
> > +
> > +      imode = int_mode_for_mode (mode);
> > +    }
> > +  else if (GET_CODE (x) == CONST_INT
> > +	   && SCALAR_INT_MODE_P (mode))
> > +    {
> > +      imode = mode;
> > +      ival = INTVAL (x);
> > +    }
> > +  else
> > +    return false;
> > +
> > +  unsigned width = GET_MODE_BITSIZE (mode) * 2;
>
> Why * 2? It isn't obvious to me from my understanding of movi why that
> would be better than just clamping to 64-bit?

The idea is to get the smallest vector mode for the given mode. For SF that's
V2SF and for DF it's V2DF, which is why the *2. Clamping to 64 bits, there
would be no 64-bit DF vector mode as far as I'm aware.

The clamping is done for modes smaller than SF, e.g. HF, which is mapped to
the smallest option, V4HF, thanks to the clamping. Forcing everything to
128-bit vectors would work, but I don't see the advantage of that.

For this particular function it doesn't matter much, as no code is generated.
So clamping to 128 bits would work, but when generating the code I don't see
why V4SF and V8HF would be better than V2SF and V4HF.

Alternatively, instead of reusing aarch64_simd_container_mode I could just
create my own mapping function which does only the mapping I expect. Would
that be a better option?
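To make the mapping concrete, here is a minimal standalone sketch of the
width rule under discussion. The helper name is hypothetical and this is
plain C rather than the patch's machine_mode plumbing; it only illustrates
the *2-then-clamp arithmetic.

#include <stdio.h>

/* Double the scalar width to get the smallest vector mode MOVI can
   target, then clamp up to 64 bits for narrow modes:
     HF (16) * 2 = 32  -> clamped to 64  (V4HF)
     SF (32) * 2 = 64                    (V2SF)
     DF (64) * 2 = 128                   (V2DF)  */
static int
movi_vector_width (int scalar_bits)
{
  int width = scalar_bits * 2;
  if (width < 64)
    width = 64;
  return width;
}

int
main (void)
{
  printf ("HF -> %d-bit vector\n", movi_vector_width (16));   /* 64  */
  printf ("SF -> %d-bit vector\n", movi_vector_width (32));   /* 64  */
  printf ("DF -> %d-bit vector\n", movi_vector_width (64));   /* 128 */
  return 0;
}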
> > +  if (width < GET_MODE_BITSIZE (DFmode))
> > +    width = GET_MODE_BITSIZE (DFmode);
> > +
> > +  vmode = aarch64_simd_container_mode (imode, width);
> > +  rtx v_op = aarch64_simd_gen_const_vector_dup (vmode, ival);
> > +
> > +  return aarch64_simd_valid_immediate (v_op, vmode, false, NULL);
> > +}
> > +
> > +
> >  /* Return the fixed registers used for condition codes.  */
> >
> >  static bool
> > @@ -5758,12 +5860,6 @@ aarch64_preferred_reload_class (rtx x, reg_class_t regclass)
> >        return NO_REGS;
> >      }
> >
> > -  /* If it's an integer immediate that MOVI can't handle, then
> > -     FP_REGS is not an option, so we return NO_REGS instead.  */
> > -  if (CONST_INT_P (x) && reg_class_subset_p (regclass, FP_REGS)
> > -      && !aarch64_simd_imm_scalar_p (x, GET_MODE (x)))
> > -    return NO_REGS;
> > -
> >    /* Register eliminiation can result in a request for
> >       SP+constant->FP_REGS.  We cannot support such operations which
> >       use SP as source and an FP_REG as destination, so reject out
> > @@ -6674,26 +6770,44 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED,
> >        return true;
> >
> >      case CONST_DOUBLE:
> > +
> > +      /* First determine number of instructions to do the move
> > +	 as an integer constant.  */
> > +      if (!aarch64_float_const_representable_p (x)
> > +	  && !aarch64_can_const_movi_rtx_p (x, mode)
> > +	  && aarch64_float_const_rtx_p (x))
> > +	{
> > +	  unsigned HOST_WIDE_INT ival;
> > +	  bool succeed = aarch64_reinterpret_float_as_int (x, &ival);
> > +	  gcc_assert (succeed);
>
> Just:
>
>   gcc_assert (aarch64_reinterpret_float_as_int (x, &ival));
>
> There's not much extra information in the name "succeed", so no extra value
> in the variable assignment.

But if asserts are disabled, won't the code be optimized out? This is why I
only assert on the result (as an extra sanity check), since the call itself
has a side effect.
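The concern here is the standard one about asserting on an expression with a
side effect: when assertion checking is compiled out, the whole argument
expression can be dropped with it. A minimal sketch using the standard C
assert (the compute helper is hypothetical; gcc_assert behaves analogously
when internal checking is disabled):

#include <assert.h>
#include <stdio.h>

/* Hypothetical side-effecting call: writes through OUT, returns
   nonzero on success.  */
static int
compute (int in, int *out)
{
  *out = in * 2;
  return 1;
}

int
main (void)
{
  int result;

  /* Risky: with NDEBUG defined, the call inside the assert is never
     evaluated, so RESULT would stay uninitialized.  */
  /* assert (compute (21, &result)); */

  /* Safe: the side-effecting call always runs; only the check on the
     captured flag compiles away.  */
  int ok = compute (21, &result);
  assert (ok);
  (void) ok;			/* avoid an unused-variable warning under NDEBUG */

  printf ("%d\n", result);	/* prints 42 */
  return 0;
}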
> >  char*
> > -aarch64_output_scalar_simd_mov_immediate (rtx immediate,
> > -					   machine_mode mode)
> > +aarch64_output_scalar_simd_mov_immediate (rtx immediate, machine_mode mode)
> >  {
> > +
> > +  /* If a floating point number was passed and we desire to use it in an
> > +     integer mode, do the conversion to integer.  */
> > +  if (CONST_DOUBLE_P (immediate) && GET_MODE_CLASS (mode) == MODE_INT)
> > +    {
> > +      unsigned HOST_WIDE_INT ival;
> > +      if (!aarch64_reinterpret_float_as_int (immediate, &ival))
> > +	gcc_unreachable ();
> > +      immediate = gen_int_mode (ival, mode);
> > +    }
> > +
> >    machine_mode vmode;
> > +  int width = GET_MODE_BITSIZE (mode) * 2;
>
> Dubious * 2 again!

Same reason as above; however, here code is actually generated.

> > +  if (width < 64)
> > +    width = 64;
> >
> >    gcc_assert (!VECTOR_MODE_P (mode));
> > -  vmode = aarch64_simd_container_mode (mode, 64);
> > +  vmode = aarch64_simd_container_mode (mode, width);
> >    rtx v_op = aarch64_simd_gen_const_vector_dup (vmode, INTVAL (immediate));
> > -  return aarch64_output_simd_mov_immediate (v_op, vmode, 64);
> > +  return aarch64_output_simd_mov_immediate (v_op, vmode, width);
> >  }
> >
> >  /* Split operands into moves from op[1] + op[2] into op[0].  */
>
> Thanks,
> James