From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 2B9D93858D3C for ; Fri, 7 Oct 2022 13:17:33 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2B9D93858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id E0F4F106F; Fri, 7 Oct 2022 06:17:38 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 31B893F73B; Fri, 7 Oct 2022 06:17:32 -0700 (PDT) From: Richard Sandiford To: Wilco Dijkstra via Gcc-patches Mail-Followup-To: Wilco Dijkstra via Gcc-patches ,Wilco Dijkstra , richard.sandiford@arm.com Cc: Wilco Dijkstra Subject: Re: [PATCH][AArch64] Improve immediate expansion [PR106583] References: Date: Fri, 07 Oct 2022 14:17:30 +0100 In-Reply-To: (Wilco Dijkstra via Gcc-patches's message of "Thu, 6 Oct 2022 17:29:03 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-44.7 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Wilco Dijkstra via Gcc-patches writes: > Hi Richard, > >> Did you consider handling the case where the movks aren't for >> consecutive bitranges?=C2=A0 E.g. the patch handles: > >> but it looks like it would be fairly easy to extend it to: >> >>=C2=A0 0x1234cccc5678cccc > > Yes, with a more general search loop we can get that case too - > it doesn't trigger much though. The code that checks for this is > now refactored into a new function. Given there are now many > more calls to aarch64_bitmask_imm, I added a streamlined internal > entry point that assumes the input is 64-bit. Sounds good, but could you put it before the mode version, to avoid the forward declaration? OK with that change, thanks. Richard > Cheers, > Wilco > > [PATCH v2][AArch64] Improve immediate expansion [PR106583] > > Improve immediate expansion of immediates which can be created from a > bitmask immediate and 2 MOVKs. Simplify, refactor and improve=20 > efficiency of bitmask checks. This reduces the number of 4-instruction > immediates in SPECINT/FP by 10-15%. > > Passes regress, OK for commit? > > gcc/ChangeLog: > > PR target/106583 > * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate) > Add support for a bitmask immediate with 2 MOVKs. > (aarch64_check_bitmask): New function after refactorization. > (aarch64_replicate_bitmask_imm): Remove function, merge into... > (aarch64_bitmask_imm): Simplify replication of small modes. > Split function into 64-bit only version for efficiency.=20 > > gcc/testsuite: > PR target/106583 > * gcc.target/aarch64/pr106583.c: Add new test. > > --- > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 926e81f028c82aac9a5fecc18f921f84399c24ae..b2d9c7380975028131d0fe731= a97b3909874b87b 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -306,6 +306,7 @@ static machine_mode aarch64_simd_container_mode (scal= ar_mode, poly_int64); > static bool aarch64_print_address_internal (FILE*, machine_mode, rtx, > aarch64_addr_query_type); > static HOST_WIDE_INT aarch64_clamp_to_uimm12_shift (HOST_WIDE_INT val); > +static bool aarch64_bitmask_imm (unsigned HOST_WIDE_INT); >=20=20 > /* The processor for which instructions should be scheduled. */ > enum aarch64_processor aarch64_tune =3D cortexa53; > @@ -5502,6 +5503,30 @@ aarch64_output_sve_vector_inc_dec (const char *ope= rands, rtx x) > factor, nelts_per_vq); > } >=20=20 > +/* Return true if the immediate VAL can be a bitfield immediate > + by changing the given MASK bits in VAL to zeroes, ones or bits > + from the other half of VAL. Return the new immediate in VAL2. */ > +static inline bool > +aarch64_check_bitmask (unsigned HOST_WIDE_INT val, > + unsigned HOST_WIDE_INT &val2, > + unsigned HOST_WIDE_INT mask) > +{ > + val2 =3D val & ~mask; > + if (val2 !=3D val && aarch64_bitmask_imm (val2)) > + return true; > + val2 =3D val | mask; > + if (val2 !=3D val && aarch64_bitmask_imm (val2)) > + return true; > + val =3D val & ~mask; > + val2 =3D val | (((val >> 32) | (val << 32)) & mask); > + if (val2 !=3D val && aarch64_bitmask_imm (val2)) > + return true; > + val2 =3D val | (((val >> 16) | (val << 48)) & mask); > + if (val2 !=3D val && aarch64_bitmask_imm (val2)) > + return true; > + return false; > +} > + > static int > aarch64_internal_mov_immediate (rtx dest, rtx imm, bool generate, > scalar_int_mode mode) > @@ -5568,36 +5593,43 @@ aarch64_internal_mov_immediate (rtx dest, rtx imm= , bool generate, > one_match =3D ((~val & mask) =3D=3D 0) + ((~val & (mask << 16)) =3D=3D= 0) + > ((~val & (mask << 32)) =3D=3D 0) + ((~val & (mask << 48)) =3D=3D 0); >=20=20 > - if (zero_match !=3D 2 && one_match !=3D 2) > + if (zero_match < 2 && one_match < 2) > { > /* Try emitting a bitmask immediate with a movk replacing 16 bits. > For a 64-bit bitmask try whether changing 16 bits to all ones or > zeroes creates a valid bitmask. To check any repeated bitmask, > try using 16 bits from the other 32-bit half of val. */ >=20=20 > - for (i =3D 0; i < 64; i +=3D 16, mask <<=3D 16) > - { > - val2 =3D val & ~mask; > - if (val2 !=3D val && aarch64_bitmask_imm (val2, mode)) > - break; > - val2 =3D val | mask; > - if (val2 !=3D val && aarch64_bitmask_imm (val2, mode)) > - break; > - val2 =3D val2 & ~mask; > - val2 =3D val2 | (((val2 >> 32) | (val2 << 32)) & mask); > - if (val2 !=3D val && aarch64_bitmask_imm (val2, mode)) > - break; > - } > - if (i !=3D 64) > - { > - if (generate) > + for (i =3D 0; i < 64; i +=3D 16) > + if (aarch64_check_bitmask (val, val2, mask << i)) > + { > + if (generate) > + { > + emit_insn (gen_rtx_SET (dest, GEN_INT (val2))); > + emit_insn (gen_insv_immdi (dest, GEN_INT (i), > + GEN_INT ((val >> i) & 0xffff))); > + } > + return 2; > + } > + } > + > + /* Try a bitmask plus 2 movk to generate the immediate in 3 instructio= ns. */ > + if (zero_match + one_match =3D=3D 0) > + { > + for (i =3D 0; i < 48; i +=3D 16) > + for (int j =3D i + 16; j < 64; j +=3D 16) > + if (aarch64_check_bitmask (val, val2, (mask << i) | (mask << j))) > { > - emit_insn (gen_rtx_SET (dest, GEN_INT (val2))); > - emit_insn (gen_insv_immdi (dest, GEN_INT (i), > - GEN_INT ((val >> i) & 0xffff))); > + if (generate) > + { > + emit_insn (gen_rtx_SET (dest, GEN_INT (val2))); > + emit_insn (gen_insv_immdi (dest, GEN_INT (i), > + GEN_INT ((val >> i) & 0xffff))); > + emit_insn (gen_insv_immdi (dest, GEN_INT (j), > + GEN_INT ((val >> j) & 0xffff))); > + } > + return 3; > } > - return 2; > - } > } >=20=20 > /* Generate 2-4 instructions, skipping 16 bits of all zeroes or ones w= hich > @@ -10168,22 +10200,6 @@ aarch64_movk_shift (const wide_int_ref &and_val, > return -1; > } >=20=20 > -/* VAL is a value with the inner mode of MODE. Replicate it to fill a > - 64-bit (DImode) integer. */ > - > -static unsigned HOST_WIDE_INT > -aarch64_replicate_bitmask_imm (unsigned HOST_WIDE_INT val, machine_mode = mode) > -{ > - unsigned int size =3D GET_MODE_UNIT_PRECISION (mode); > - while (size < 64) > - { > - val &=3D (HOST_WIDE_INT_1U << size) - 1; > - val |=3D val << size; > - size *=3D 2; > - } > - return val; > -} > - > /* Multipliers for repeating bitmasks of width 32, 16, 8, 4, and 2. */ >=20=20 > static const unsigned HOST_WIDE_INT bitmask_imm_mul[] =3D > @@ -10196,26 +10212,42 @@ static const unsigned HOST_WIDE_INT bitmask_imm= _mul[] =3D > }; >=20=20 >=20=20 > -/* Return true if val is a valid bitmask immediate. */ > - > +/* Return true if val is a valid bitmask immediate for any mode. */ > bool > aarch64_bitmask_imm (HOST_WIDE_INT val_in, machine_mode mode) > { > - unsigned HOST_WIDE_INT val, tmp, mask, first_one, next_one; > + if (mode =3D=3D DImode) > + return aarch64_bitmask_imm (val_in); > + > + unsigned HOST_WIDE_INT val =3D val_in; > + > + if (mode =3D=3D SImode) > + return aarch64_bitmask_imm ((val & 0xffffffff) | (val << 32)); > + > + /* Replicate small immediates to fit 64 bits. */ > + int size =3D GET_MODE_UNIT_PRECISION (mode); > + val &=3D (HOST_WIDE_INT_1U << size) - 1; > + val *=3D bitmask_imm_mul[__builtin_clz (size) - 26]; > + > + return aarch64_bitmask_imm (val); > +} > + > + > +/* Return true if 64-bit val is a valid bitmask immediate. */ > + > +static bool > +aarch64_bitmask_imm (unsigned HOST_WIDE_INT val) > +{ > + unsigned HOST_WIDE_INT tmp, mask, first_one, next_one; > int bits; >=20=20 > /* Check for a single sequence of one bits and return quickly if so. > The special cases of all ones and all zeroes returns false. */ > - val =3D aarch64_replicate_bitmask_imm (val_in, mode); > tmp =3D val + (val & -val); >=20=20 > if (tmp =3D=3D (tmp & -tmp)) > return (val + 1) > 1; >=20=20 > - /* Replicate 32-bit immediates so we can treat them as 64-bit. */ > - if (mode =3D=3D SImode) > - val =3D (val << 32) | (val & 0xffffffff); > - > /* Invert if the immediate doesn't start with a zero bit - this means = we > only need to search for sequences of one bits. */ > if (val & 1) > diff --git a/gcc/testsuite/gcc.target/aarch64/pr106583.c b/gcc/testsuite/= gcc.target/aarch64/pr106583.c > new file mode 100644 > index 0000000000000000000000000000000000000000..0f931580817d78dc1cc58f03b= 251bd21bec71f59 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/pr106583.c > @@ -0,0 +1,41 @@ > +/* { dg-do assemble } */ > +/* { dg-options "-O2 --save-temps" } */ > + > +long f1 (void) > +{ > + return 0x7efefefefefefeff; > +} > + > +long f2 (void) > +{ > + return 0x12345678aaaaaaaa; > +} > + > +long f3 (void) > +{ > + return 0x1234cccccccc5678; > +} > + > +long f4 (void) > +{ > + return 0x7777123456787777; > +} > + > +long f5 (void) > +{ > + return 0x5555555512345678; > +} > + > +long f6 (void) > +{ > + return 0x1234bbbb5678bbbb; > +} > + > +long f7 (void) > +{ > + return 0x4444123444445678; > +} > + > + > +/* { dg-final { scan-assembler-times {\tmovk\t} 14 } } */ > +/* { dg-final { scan-assembler-times {\tmov\t} 7 } } */