From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 26 Feb 2016 15:50:00 -0000
From: James Greenhalgh
To: Kyrill Tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw
Subject: Re: [PATCH][AArch64] PR target/69613: Return zero TARGET_SHIFT_TRUNCATION_MASK when SHIFT_COUNT_TRUNCATED is false
Message-ID: <20160226155034.GD40219@arm.com>
References: <56CEC899.5020701@foss.arm.com>
In-Reply-To: <56CEC899.5020701@foss.arm.com>

On Thu, Feb 25, 2016 at 09:25:45AM +0000, Kyrill Tkachov wrote:
> Hi all,
>
> In this wrong-code PR we get bad code when synthesising a TImode right shift
> by variable amount using DImode shifts during expand.
>
> The expand_double_word_shift function expands two paths: one where the
> variable amount is greater than GET_MODE_BITSIZE (DImode) (word_mode for
> aarch64) at runtime and one where it's less than that, and performs a
> conditional select between the two at the end.
>
> The second path is the one that goes wrong here.
> The algorithm it uses for calculating a TImode shift when the variable
> shift amount is < 64 is:
>
>     ------DImode-----   ------DImode-----       ------DImode-----   ------DImode-----
>   { |__ target-hi ___|  |___ target-lo ___| } = { |___ source-hi ___| |___ source-lo ___| } >> var_amnt
>
> 1) carry = source-hi << 1
> 2) tmp = ~var_amnt  // either that or BITS_PER_WORD - 1 - var_amnt,
>                     // depending on shift_truncation_mask
> 3) carry = carry << tmp  // net effect is that carry is source-hi shifted
>                          // left by BITS_PER_WORD - var_amnt
> 4) target-lo = source-lo >>u var_amnt  // unsigned shift
> 5) target-lo = target-lo | carry
> 6) target-hi = source-hi >> var_amnt
>
> where the two DImode words source-hi and source-lo are the two words of the
> source TImode value and var_amnt is the register holding the variable shift
> amount.  This is performed by the expand_subword_shift function in optabs.c.
>
> Step 2) is valid only if the target truncates the shift amount by the width
> of the type it's shifting, that is, SHIFT_COUNT_TRUNCATED is true and
> TARGET_SHIFT_TRUNCATION_MASK is 63 in this case.
>
> Step 3) is the one that goes wrong.  On aarch64 a left shift is usually
> implemented using the LSL instruction, but we also have alternatives that
> can use the SIMD registers and the USHL instruction.  In this case we end
> up using the USHL instruction.  However, although the USHL instruction
> does a DImode shift, it doesn't truncate the shift amount by 64, but
> rather by 255.  Furthermore, the shift amount is interpreted as a two's
> complement signed integer, and if it's negative the instruction performs
> a right shift.  This is in contrast with the normal LSL instruction, which
> just performs an unsigned shift by an amount truncated by 64.
>
> Now, on aarch64 SHIFT_COUNT_TRUNCATED is defined as !TARGET_SIMD, so we
> don't assume shift amounts are truncated unless we know we can only ever
> use the LSL instructions for variable shifts.
>
> However, we still return 63 as the TARGET_SHIFT_TRUNCATION_MASK for
> DImode, so expand_subword_shift assumes that since the mask is non-zero
> it's a valid shift truncation mask.  The documentation for
> TARGET_SHIFT_TRUNCATION_MASK says:
> "The default implementation of this function returns
> @code{GET_MODE_BITSIZE (@var{mode}) - 1} if @code{SHIFT_COUNT_TRUNCATED}
> and 0 otherwise."
>
> So since for TARGET_SIMD we cannot guarantee that all shifts truncate
> their amount, we should be returning 0 in TARGET_SHIFT_TRUNCATION_MASK
> when SHIFT_COUNT_TRUNCATED is false.  This is what the patch does, and
> with it step 2) becomes:
>   2) tmp = BITS_PER_WORD - 1 - var_amnt
> which behaves correctly on aarch64.
>
> Unfortunately, the testcase from the PR has very recently gone latent on
> trunk because it depends on register allocation picking a particular
> alternative from the *aarch64_ashl_sisd_or_int_3 pattern in aarch64.md.
> I tried some inline asm tricks to force the correct constraints but was
> unsuccessful.  Nevertheless, I've included the testcase in the patch;
> I suppose it's better than nothing.

Thanks for the great write-up.  This level of detail is very helpful for
review.

OK for trunk.

Thanks,
James

>
> Bootstrapped and tested on aarch64.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
>
> 2016-02-25  Kyrylo Tkachov
>
>     PR target/69613
>     * config/aarch64/aarch64.c (aarch64_shift_truncation_mask):
>     Return 0 if !SHIFT_COUNT_TRUNCATED.
>
> 2016-02-25  Kyrylo Tkachov
>
>     PR target/69613
>     * gcc.dg/torture/pr69613.c: New test.
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index d0d15b4feee70a5ca6af8dd16c7667cbcd844dbf..2e69e345545e591d1de76c2d175aac476e6e1107 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11171,7 +11171,8 @@
>  static unsigned HOST_WIDE_INT
>  aarch64_shift_truncation_mask (machine_mode mode)
>  {
>    return
> -    (aarch64_vector_mode_supported_p (mode)
> +    (!SHIFT_COUNT_TRUNCATED
> +     || aarch64_vector_mode_supported_p (mode)
>       || aarch64_vect_struct_mode_p (mode)) ? 0 : (GET_MODE_BITSIZE (mode) - 1);
>  }
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr69613.c b/gcc/testsuite/gcc.dg/torture/pr69613.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..44f2b0cc91ac4b12c4d255b608f95bc8bf016923
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr69613.c
> @@ -0,0 +1,40 @@
> +/* PR target/69613.  */
> +/* { dg-do run { target int128 } } */
> +/* { dg-additional-options "-mavx" { target { i?86-*-* x86_64-*-* } } } */
> +
> +typedef unsigned short u16;
> +typedef unsigned short v32u16 __attribute__ ((vector_size (32)));
> +typedef unsigned int u32;
> +typedef unsigned int v32u32 __attribute__ ((vector_size (32)));
> +typedef unsigned long long u64;
> +typedef unsigned long long v32u64 __attribute__ ((vector_size (32)));
> +typedef unsigned __int128 u128;
> +typedef unsigned __int128 v32u128 __attribute__ ((vector_size (32)));
> +
> +u128 __attribute__ ((noinline, noclone))
> +foo (u32 u32_0, u64 u64_1, u128 u128_1, v32u16 v32u16_0, v32u128 v32u128_0,
> +     v32u16 v32u16_1, v32u32 v32u32_1, v32u64 v32u64_1, v32u128 v32u128_1)
> +{
> +  u128 temp = (v32u128_1[0] << ((-u32_0) & 127));
> +  u32 t2 = (u32_0 & 127);
> +  v32u128_1[0] = (v32u128_1[0] >> t2);
> +
> +  v32u128_1[0] ^= temp;
> +  v32u128_1 |= (v32u128){ v32u128_0[1] };
> +
> +  return u64_1 + u128_1 + v32u16_0[0] + v32u16_0[1] + v32u16_1[11]
> +	 + v32u16_1[12] + v32u16_1[13] + v32u32_1[0] + v32u32_1[1]
> +	 + v32u32_1[2] + v32u64_1[1] + v32u64_1[2] + v32u64_1[3]
> +	 + v32u128_1[0] + v32u128_1[1];
> +}
> +
> +int
> +main ()
> +{
> +  u128 x
> +    = foo (1, 1, 1, (v32u16){ 1, 1, 1 }, (v32u128){ 1 }, (v32u16){ 1, 1, 1 },
> +	   (v32u32){ 1 }, (v32u64){ 1, 1, 1 }, (v32u128){ -1 });
> +  if (x != 6)
> +    __builtin_abort ();
> +  return 0;
> +}