From: Tamar Christina
To: James Greenhalgh
CC: Richard Sandiford, GCC Patches, nd, Marcus Shawcroft, Richard Earnshaw
Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (1/4) - infrastructure.
Date: Mon, 03 Jul 2017 06:12:00 -0000
Ping

________________________________________
From: gcc-patches-owner@gcc.gnu.org on behalf of Tamar Christina
Sent: Monday, June 26, 2017 11:49:42 AM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (1/4) - infrastructure.

Hi All,

I've updated the patch accordingly. This mostly involves removing the loop
that created the ival and removing the *2 code; instead we default to 64-bit
and switch to 128-bit when needed (a sketch of the resulting shape follows
the ChangeLog).

Regression tested on aarch64-none-linux-gnu with no regressions.

OK for trunk?

Thanks,
Tamar

gcc/
2017-06-26  Tamar Christina

	* config/aarch64/aarch64.c (aarch64_simd_container_mode): Add prototype.
	(aarch64_expand_mov_immediate): Add HI support.
	(aarch64_reinterpret_float_as_int, aarch64_float_const_rtx_p): New.
	(aarch64_can_const_movi_rtx_p): New.
	(aarch64_preferred_reload_class): Remove restrictions of using FP
	registers for certain SIMD operations.
	(aarch64_rtx_costs): Add new cost for CONST_DOUBLE moves.
	(aarch64_valid_floating_const): Add integer move validation.
	(aarch64_simd_imm_scalar_p): Remove.
	(aarch64_output_scalar_simd_mov_immediate): Generalize function.
	(aarch64_legitimate_constant_p): Expand list of supported cases.
	* config/aarch64/aarch64-protos.h (aarch64_float_const_rtx_p,
	aarch64_can_const_movi_rtx_p): New.
	(aarch64_reinterpret_float_as_int): New.
	(aarch64_simd_imm_scalar_p): Remove.
	* config/aarch64/predicates.md (aarch64_reg_or_fp_float): New.
	* config/aarch64/constraints.md (Uvi): New.
	(Dd): Split into Ds and new Dd.
	* config/aarch64/aarch64.md (*movsi_aarch64): Add SIMD mov case.
	(*movdi_aarch64): Add SIMD mov case.
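For concreteness, a minimal standalone sketch of the simplified ival
construction mentioned above: read the bit pattern of the constant directly
as a 64-bit integer instead of assembling it 32 bits at a time in a loop.
This is illustrative plain C, not the patch itself; the function name and
the use of memcpy are mine, whereas the patch works on GCC's internal
REAL_VALUE_TYPE representation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reinterpret the bits of a double as a 64-bit integer in one step,
   with no per-32-bit-word loop.  HF/SF constants widen into the same
   64-bit default; only a 128-bit float would need a second word.  */
static uint64_t
float_bits_as_int (double d)
{
  uint64_t ival;
  memcpy (&ival, &d, sizeof ival);	/* well-defined type pun in C */
  return ival;
}

int
main (void)
{
  /* 1.0 in IEEE 754 binary64 is 0x3ff0000000000000.  */
  printf ("%#llx\n", (unsigned long long) float_bits_as_int (1.0));
  return 0;
}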
________________________________________
From: Tamar Christina
Sent: Thursday, June 15, 2017 1:50:19 PM
To: James Greenhalgh
Cc: Richard Sandiford; GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
Subject: RE: [PATCH][GCC][AArch64] optimize float immediate moves (1/4) - infrastructure.

> This patch is pretty huge, are there any opportunities to further split it
> to aid review?

Unfortunately, because I'm also changing some constraints, it introduced a
bit of a dependency cycle. If I were to break it up more, the individual
patches wouldn't work on their own anymore. If this is acceptable I can
break it up more.

> > +  ival = zext_hwi (res[needed - 1], 32);
> > +  for (int i = needed - 2; i >= 0; i--)
> > +    {
> > +      ival <<= 32;
> > +      ival |= zext_hwi (res[i], 32);
> > +    }
> > +
> > +  *intval = ival;
>
> ???
>
> Two cases here, needed is either 2 if GET_MODE_BITSIZE (mode) == 64, or it
> is 1 otherwise. So i starts at either -1 or 0. So this for loop either runs
> 0 or 1 times. What am I missing? I'm sure this is all an indirect way of
> writing:
>

Yes, the code was set up to be easily extended to support 128-bit floats as
well, which was deprioritized. I'll just remove the loop.

> > +
> > +  /* Determine whether it's cheaper to write float constants as
> > +     mov/movk pairs over ldr/adrp pairs.  */
> > +  unsigned HOST_WIDE_INT ival;
> > +
> > +  if (GET_CODE (x) == CONST_DOUBLE
> > +      && SCALAR_FLOAT_MODE_P (mode)
> > +      && aarch64_reinterpret_float_as_int (x, &ival))
> > +    {
> > +      machine_mode imode = mode == HFmode ? SImode
> > +					    : int_mode_for_mode (mode);
> > +      int num_instr = aarch64_internal_mov_immediate
> > +			(NULL_RTX, gen_int_mode (ival, imode), false, imode);
> > +      return num_instr < 3;
>
> Should this cost model be static on a magic number? Is it not the case that
> the decision should be based on the relative speeds of a memory access
> compared with mov/movk/fmov?

As far as I'm aware, the cost model is too simplistic to be able to express
the actual costs of mov/movk and movk/movk pairs. E.g. it doesn't take into
account the latency and throughput difference when the instructions occur in
sequence/pairs. This leads to it allowing a smaller subset through here than
what would be beneficial.

> > +/* Return TRUE if rtx X is an immediate constant that fits in a single
> > +   MOVI immediate operation.  */
> > +bool
> > +aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode)
> > +{
> > +  if (!TARGET_SIMD)
> > +    return false;
> > +
> > +  machine_mode vmode, imode;
> > +  unsigned HOST_WIDE_INT ival;
> > +
> > +  /* Don't write float constants out to memory.  */
> > +  if (GET_CODE (x) == CONST_DOUBLE
> > +      && SCALAR_FLOAT_MODE_P (mode))
> > +    {
> > +      if (!aarch64_reinterpret_float_as_int (x, &ival))
> > +	return false;
> > +
> > +      imode = int_mode_for_mode (mode);
> > +    }
> > +  else if (GET_CODE (x) == CONST_INT
> > +	   && SCALAR_INT_MODE_P (mode))
> > +    {
> > +      imode = mode;
> > +      ival = INTVAL (x);
> > +    }
> > +  else
> > +    return false;
> > +
> > +  unsigned width = GET_MODE_BITSIZE (mode) * 2;
>
> Why * 2? It isn't obvious to me from my understanding of movi why that
> would be better than just clamping to 64-bit?

The idea is to get the smallest vector mode for the given mode. For SF that's
V2SF and for DF it's V2DF, which is why the *2. Clamping to 64 bits, there
would be no 64-bit DF vector mode as far as I'm aware.

The clamping is done for modes smaller than SF, e.g. HF, which is mapped to
the smallest option, V4HF, thanks to the clamping. Forcing everything to
128-bit vectors would work, but I don't see the advantage of that.

For this particular function it doesn't matter much, as no code is generated.
So clamping to 128 bits would work, but when generating the code I don't see
why V4SF and V8HF would be better than V2SF and V4HF.

Alternatively, instead of reusing aarch64_simd_container_mode I could just
create my own mapping function which does only the mapping I expect. Would
that be a better option?
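To make the mapping concrete, here is a minimal standalone sketch of the
width rule under discussion. The helper name is hypothetical and this is
plain C rather than the patch's machine_mode plumbing; it only illustrates
the *2-then-clamp arithmetic.

#include <stdio.h>

/* Double the scalar width to get the smallest vector mode MOVI can
   target, then clamp up to 64 bits for narrow modes:
     HF (16) * 2 = 32  -> clamped to 64  (V4HF)
     SF (32) * 2 = 64                    (V2SF)
     DF (64) * 2 = 128                   (V2DF)  */
static int
movi_vector_width (int scalar_bits)
{
  int width = scalar_bits * 2;
  if (width < 64)
    width = 64;
  return width;
}

int
main (void)
{
  printf ("HF -> %d-bit vector\n", movi_vector_width (16));   /* 64  */
  printf ("SF -> %d-bit vector\n", movi_vector_width (32));   /* 64  */
  printf ("DF -> %d-bit vector\n", movi_vector_width (64));   /* 128 */
  return 0;
}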
> > +  if (width < GET_MODE_BITSIZE (DFmode))
> > +    width = GET_MODE_BITSIZE (DFmode);
> > +
> > +  vmode = aarch64_simd_container_mode (imode, width);
> > +  rtx v_op = aarch64_simd_gen_const_vector_dup (vmode, ival);
> > +
> > +  return aarch64_simd_valid_immediate (v_op, vmode, false, NULL);
> > +}
> > +
> > +
> >  /* Return the fixed registers used for condition codes.  */
> >
> >  static bool
> > @@ -5758,12 +5860,6 @@ aarch64_preferred_reload_class (rtx x, reg_class_t regclass)
> >        return NO_REGS;
> >      }
> >
> > -  /* If it's an integer immediate that MOVI can't handle, then
> > -     FP_REGS is not an option, so we return NO_REGS instead.  */
> > -  if (CONST_INT_P (x) && reg_class_subset_p (regclass, FP_REGS)
> > -      && !aarch64_simd_imm_scalar_p (x, GET_MODE (x)))
> > -    return NO_REGS;
> > -
> >    /* Register eliminiation can result in a request for
> >       SP+constant->FP_REGS.  We cannot support such operations which
> >       use SP as source and an FP_REG as destination, so reject out
> > @@ -6674,26 +6770,44 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED,
> >        return true;
> >
> >      case CONST_DOUBLE:
> > +
> > +      /* First determine number of instructions to do the move
> > +	 as an integer constant.  */
> > +      if (!aarch64_float_const_representable_p (x)
> > +	  && !aarch64_can_const_movi_rtx_p (x, mode)
> > +	  && aarch64_float_const_rtx_p (x))
> > +	{
> > +	  unsigned HOST_WIDE_INT ival;
> > +	  bool succeed = aarch64_reinterpret_float_as_int (x, &ival);
> > +	  gcc_assert (succeed);
>
> Just:
>
>   gcc_assert (aarch64_reinterpret_float_as_int (x, &ival));
>
> There's not much extra information in the name "succeed", so no extra value
> in the variable assignment.

But if asserts are disabled, won't the code be optimized out? This is why I
only assert on the result (as an extra sanity check), since the call itself
has a side effect.
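The concern here is the standard one about asserting on an expression with a
side effect: when assertion checking is compiled out, the whole argument
expression can be dropped with it. A minimal sketch using the standard C
assert (the compute helper is hypothetical; gcc_assert behaves analogously
when internal checking is disabled):

#include <assert.h>
#include <stdio.h>

/* Hypothetical side-effecting call: writes through OUT, returns
   nonzero on success.  */
static int
compute (int in, int *out)
{
  *out = in * 2;
  return 1;
}

int
main (void)
{
  int result;

  /* Risky: with NDEBUG defined, the call inside the assert is never
     evaluated, so RESULT would stay uninitialized.  */
  /* assert (compute (21, &result)); */

  /* Safe: the side-effecting call always runs; only the check on the
     captured flag compiles away.  */
  int ok = compute (21, &result);
  assert (ok);
  (void) ok;			/* avoid an unused-variable warning under NDEBUG */

  printf ("%d\n", result);	/* prints 42 */
  return 0;
}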
> >  char*
> > -aarch64_output_scalar_simd_mov_immediate (rtx immediate,
> > -					   machine_mode mode)
> > +aarch64_output_scalar_simd_mov_immediate (rtx immediate, machine_mode mode)
> >  {
> > +
> > +  /* If a floating point number was passed and we desire to use it in an
> > +     integer mode, do the conversion to integer.  */
> > +  if (CONST_DOUBLE_P (immediate) && GET_MODE_CLASS (mode) == MODE_INT)
> > +    {
> > +      unsigned HOST_WIDE_INT ival;
> > +      if (!aarch64_reinterpret_float_as_int (immediate, &ival))
> > +	gcc_unreachable ();
> > +      immediate = gen_int_mode (ival, mode);
> > +    }
> > +
> >    machine_mode vmode;
> > +  int width = GET_MODE_BITSIZE (mode) * 2;
>
> Dubious * 2 again!

Same reason as above; however, here code is actually generated.

> > +  if (width < 64)
> > +    width = 64;
> >
> >    gcc_assert (!VECTOR_MODE_P (mode));
> > -  vmode = aarch64_simd_container_mode (mode, 64);
> > +  vmode = aarch64_simd_container_mode (mode, width);
> >    rtx v_op = aarch64_simd_gen_const_vector_dup (vmode, INTVAL (immediate));
> > -  return aarch64_output_simd_mov_immediate (v_op, vmode, 64);
> > +  return aarch64_output_simd_mov_immediate (v_op, vmode, width);
> >  }
> >
> >  /* Split operands into moves from op[1] + op[2] into op[0].  */
>
> Thanks,
> James