From: Richard Sandiford
To: Jakub Jelinek
Cc: Richard Biener, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]
Date: Mon, 27 Feb 2023 15:34:11 +0000
In-Reply-To: (Jakub Jelinek's message of "Fri, 17 Feb 2023 11:14:57 +0100")

Jakub Jelinek writes:
> Hi!
>
> The following testcase is miscompiled on aarch64.  The problem is that
> aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
> targetm.shift_truncation_mask (DImode) == 0 and has HAVE_conditional_move
> true.  If a doubleword shift (in this case TImode) doesn't have its own
> expander (as is the case of e.g. x86) and is handled in generic code,
> there are 3 possible expansions.  One is when the shift count is
> constant; the code in that case computes superword_op1 as
> op1 - BITS_PER_WORD and chooses at compile time which of
> expand_superword_shift or expand_subword_shift to call, which ensures
> that whatever is used actually has its shift count (superword_op1 or
> op1) in the [0, BITS_PER_WORD - 1] range.  If !HAVE_conditional_move
> or that expansion fails, the function handles non-constant op1
> similarly (there are some special cases for
> shift_mask >= BITS_PER_WORD - 1, but let's talk here just about
> shift_mask < BITS_PER_WORD - 1), except that the selection is done at
> runtime, with branches around the two sequences.  While superword_op1
> can be in [-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 in
> [0, 2 * BITS_PER_WORD - 1], the runtime selection ensures that the
> instructions executed at runtime have their corresponding shift count
> again in the right range of [0, BITS_PER_WORD - 1].
> The problematic case is HAVE_conditional_move, which emits both
> sequences into the actually executed code, so necessarily one of them
> has an out-of-range shift count, and then 2 conditional moves pick up
> the right result.
> Now, in the testcase, because -Og doesn't perform VRP/EVRP, the shift
> count isn't optimized to a constant during the GIMPLE passes, but is
> determined to be constant during/after expansion into RTL.  The shift
> count is actually const0_rtx later, so superword_op1 is (const_int -64)
> and we end up with a miscompilation during combine because of that.
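For reference, the condmove strategy computes roughly the following
straight-line code (a C sketch with illustrative names, not the actual
GCC code; the "& 63" masking of the counts is what the patch adds):

/* Hypothetical model of the HAVE_conditional_move expansion of a
   TImode left shift on a 64-bit target.  Both arms are evaluated
   unconditionally; only the final selects depend on the count.  */
#include <stdint.h>

void
shl128_condmove (uint64_t lo, uint64_t hi, unsigned op1,
                 uint64_t *out_lo, uint64_t *out_hi)
{
  unsigned superword_op1 = op1 - 64;	/* wraps when op1 < 64 */

  /* Superword arm, correct only for op1 in [64, 127]; the masking
     keeps its shift count in range even when the arm is discarded.  */
  uint64_t super_hi = lo << (superword_op1 & 63);
  uint64_t super_lo = 0;

  /* Subword arm, correct only for op1 in [0, 63];
     (lo >> 1) >> (~op1 & 63) computes lo >> (64 - op1) without an
     out-of-range shift, even for op1 == 0.  */
  uint64_t carries = (lo >> 1) >> (~op1 & 63);
  uint64_t sub_hi = (hi << (op1 & 63)) | carries;
  uint64_t sub_lo = lo << (op1 & 63);

  /* Conditional moves select the arm whose count was in range.  */
  *out_hi = op1 >= 64 ? super_hi : sub_hi;
  *out_lo = op1 >= 64 ? super_lo : sub_lo;
}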
I haven't worked through the testcase yet, but how does that happen?
Having shifts with out-of-range shift counts shouldn't be a problem
in itself, provided that the conditional move selects the one with
the in-range shift.

Thanks,
Richard

> I'm afraid on targetm.shift_truncation_mask (DImode) == 0 targets we
> have to mask the shift counts when the doubleword shift count is in
> range but one of subword_op1/superword_op1 is not, which is what the
> following patch does and what fixes the wrong code.  Now, targets like
> x86 or aarch64, while they are !SHIFT_COUNT_TRUNCATED, actually have
> patterns to catch shifts with a masked count, so the ANDs can be
> optimized out.  On the other hand, when we know the result will be
> masked this way, we can use the simpler ~op1 instead of
> (BITS_PER_WORD - 1) - op1 in expand_subword_shift.
> So, say on
> __attribute__((noipa)) __int128
> foo (__int128 a, unsigned k)
> {
>   return a << k;
> }
> on aarch64 at -Og the difference is:
> foo:
> 	subs	w5, w2, #64
> -	lsl	x6, x0, x5
> +	lsl	x4, x0, x2
> 	lsr	x3, x0, 1
> -	mov	w4, 63
> -	sub	w4, w4, w2
> -	lsr	x3, x3, x4
> +	mvn	w6, w2
> +	and	w2, w2, 63
> +	lsr	x3, x3, x6
> 	lsl	x1, x1, x2
> 	orr	x1, x3, x1
> 	lsl	x0, x0, x2
> 	csel	x0, xzr, x0, pl
> -	csel	x1, x6, x1, pl
> +	csel	x1, x4, x1, pl
> 	ret
> We could do even better and optimize the and w2, w2, 63 instruction
> out, but that is a matter of costs and so IMHO should be handled
> incrementally.
> For that case consider say:
> void corge (int, int, int);
>
> void
> qux (int x, int y, int z, int n)
> {
>   n &= 31;
>   corge (x << n, y << n, z >> n);
> }
> with -O2 -fno-tree-vectorize, on x86_64 one gets
> 	sarl	%cl, %edx
> 	sall	%cl, %esi
> 	sall	%cl, %edi
> 	jmp	corge
> but on aarch64
> 	and	w3, w3, 31
> 	lsl	w0, w0, w3
> 	lsl	w1, w1, w3
> 	asr	w2, w2, w3
> 	b	corge
> The reason it works on x86 is that its rtx_costs hook recognizes
> that the AND in shift+mask patterns is free.
> Trying 9 -> 11:
>     9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
>       REG_UNUSED flags:CC
>    11: {r91:SI=r87:SI<
>       REG_DEAD r87:SI
>       REG_UNUSED flags:CC
> Failed to match this instruction:
> ...
> Failed to match this instruction:
> ...
> Successfully matched this instruction:
> (set (reg/v:SI 85 [ n ])
>     (and:SI (reg:SI 96)
>         (const_int 31 [0x1f])))
> Successfully matched this instruction:
> (set (reg:SI 91)
>     (ashift:SI (reg/v:SI 87 [ y ])
>         (subreg:QI (and:SI (reg:SI 96)
>             (const_int 31 [0x1f])) 0)))
> allowing combination of insns 9 and 11
> original costs 4 + 4 = 8
> replacement costs 4 + 4 = 8
> Compare that to the aarch64 case:
> Trying 9 -> 11:
>     9: r95:SI=r106:SI&0x1f
>       REG_DEAD r106:SI
>    11: r101:SI=r104:SI<
>       REG_DEAD r104:SI
> Failed to match this instruction:
> ...
> Failed to match this instruction:
> ...
> Successfully matched this instruction:
> (set (reg/v:SI 95 [ n ])
>     (and:SI (reg:SI 106)
>         (const_int 31 [0x1f])))
> Successfully matched this instruction:
> (set (reg:SI 101)
>     (ashift:SI (reg:SI 104)
>         (subreg:QI (and:SI (reg:SI 106)
>             (const_int 31 [0x1f])) 0)))
> rejecting combination of insns 9 and 11
> original costs 4 + 4 = 8
> replacement costs 4 + 8 = 12
> Now, if the rtx costs on *aarch64_ashl_reg_si3_mask1 and similar were
> the same as for a normal shift without the masking, this would succeed
> and we could get rid of the and.
>
> Bootstrapped/regtested on aarch64-linux, ok for trunk?
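As a sanity check on the ~op1 simplification: ~op1 == -op1 - 1, which
is congruent to (BITS_PER_WORD - 1) - op1 modulo BITS_PER_WORD, so the
two forms agree whenever the count is masked anyway.  A standalone
check (illustrative only, not part of the patch):

/* For BITS_PER_WORD == 64: ~op1 and 63 - op1 agree in the low 6 bits,
   so either can feed a shift whose count is masked with & 63.  */
#include <assert.h>

int
main (void)
{
  for (unsigned op1 = 0; op1 < 128; op1++)
    assert ((~op1 & 63) == ((63 - op1) & 63));
  return 0;
}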
>
> 2023-02-17  Jakub Jelinek
>
> 	PR middle-end/108803
> 	* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
> 	if true, use ~op1 & (BITS_PER_WORD - 1) instead of
> 	(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
> 	instead of op1 on the other two shifts.
> 	(expand_doubleword_shift_condmove): Use
> 	superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1.  Pass
> 	shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to expand_subword_shift.
> 	(expand_doubleword_shift): Pass false as MASK_COUNT to
> 	expand_subword_shift.
>
> 	* gcc.dg/pr108803.c: New test.
>
> --- gcc/optabs.cc.jj	2023-01-02 09:32:53.309838465 +0100
> +++ gcc/optabs.cc	2023-02-16 20:44:21.862031535 +0100
> @@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
>  		      rtx outof_input, rtx into_input, rtx op1,
>  		      rtx outof_target, rtx into_target,
>  		      int unsignedp, enum optab_methods methods,
> -		      unsigned HOST_WIDE_INT shift_mask)
> +		      unsigned HOST_WIDE_INT shift_mask, bool mask_count)
>  {
>    optab reverse_unsigned_shift, unsigned_shift;
>    rtx tmp, carries;
> @@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
>  	 are truncated to the mode size.  */
>        carries = expand_binop (word_mode, reverse_unsigned_shift,
>  			      outof_input, const1_rtx, 0, unsignedp, methods);
> -      if (shift_mask == BITS_PER_WORD - 1)
> +      if (shift_mask == BITS_PER_WORD - 1 || mask_count)
>  	{
>  	  tmp = immed_wide_int_const
>  	    (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
> @@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
>  	  tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
>  				       0, true, methods);
>  	}
> +      if (mask_count)
> +	{
> +	  rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
> +						     op1_mode), op1_mode);
> +	  tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
> +				       true, methods);
> +	  op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
> +				       true, methods);
> +	}
>      }
>    if (tmp == 0 || carries == 0)
>      return false;
> @@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
>  {
>    rtx outof_superword, into_superword;
>  
> +  if (shift_mask < BITS_PER_WORD - 1)
> +    {
> +      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
> +				      op1_mode);
> +      superword_op1
> +	= simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
> +				 0, true, methods);
> +    }
> +
>    /* Put the superword version of the output into OUTOF_SUPERWORD and
>       INTO_SUPERWORD.  */
>    outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
> @@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
>    if (!expand_subword_shift (op1_mode, binoptab,
>  			     outof_input, into_input, subword_op1,
>  			     outof_target, into_target,
> -			     unsignedp, methods, shift_mask))
> +			     unsignedp, methods, shift_mask,
> +			     shift_mask < BITS_PER_WORD - 1))
>      return false;
>  
>    /* Select between them.  Do the INTO half first because INTO_SUPERWORD
> @@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
>        return expand_subword_shift (op1_mode, binoptab,
>  				   outof_input, into_input, op1,
>  				   outof_target, into_target,
> -				   unsignedp, methods, shift_mask);
> +				   unsignedp, methods, shift_mask, false);
>      }
>  
>    /* Try using conditional moves to generate straight-line code.  */
> @@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
>    if (!expand_subword_shift (op1_mode, binoptab,
>  			     outof_input, into_input, op1,
>  			     outof_target, into_target,
> -			     unsignedp, methods, shift_mask))
> +			     unsignedp, methods, shift_mask, false))
>      return false;
>  
>    emit_label (done_label);
> --- gcc/testsuite/gcc.dg/pr108803.c.jj	2023-02-16 20:48:53.517032422 +0100
> +++ gcc/testsuite/gcc.dg/pr108803.c	2023-02-16 20:48:45.971143498 +0100
> @@ -0,0 +1,24 @@
> +/* PR middle-end/108803 */
> +/* { dg-do run { target int128 } } */
> +/* { dg-options "-Og" } */
> +
> +unsigned i, j;
> +
> +__int128
> +foo (__int128 a)
> +{
> +  a &= 9;
> +  unsigned k = __builtin_add_overflow_p (i, j, a);
> +  __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
> +  __int128 c = a + b;
> +  return c;
> +}
> +
> +int
> +main (void)
> +{
> +  __int128 x = foo (0);
> +  if (x != 63)
> +    __builtin_abort ();
> +  return 0;
> +}
>
> 	Jakub