From: Richard Sandiford <richard.sandiford@arm.com>
To: Jakub Jelinek <jakub@redhat.com>
Cc: Richard Biener <rguenther@suse.de>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]
Date: Mon, 27 Feb 2023 15:34:11 +0000 [thread overview]
Message-ID: <mpth6v7ruks.fsf@arm.com> (raw)
In-Reply-To: <Y+9TodOHE09e9Vwq@tucnak> (Jakub Jelinek's message of "Fri, 17 Feb 2023 11:14:57 +0100")
Jakub Jelinek <jakub@redhat.com> writes:
> Hi!
>
> The following testcase is miscompiled on aarch64. The problem is that
> aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
> targetm.shift_truncation_mask (DImode) == 0 which has HAVE_conditional_move
> true. If a doubleword shift (in this case TImode) doesn't have its own
> expander (as is the case of e.g. x86) and is handled in generic code,
> there are 3 possible expansions. One is used when the shift count is
> constant: the code computes superword_op1 as op1 - BITS_PER_WORD,
> and chooses at compile time which of expand_superword_shift or
> expand_subword_shift to call, which ensures that whatever is used
> actually has its shift count (superword_op1 or op1) in [0, BITS_PER_WORD - 1]
> range. If !HAVE_conditional_move or that expansion fails, the function
> handles non-constant op1 similarly (there are some special cases for
> shift_mask >= BITS_PER_WORD - 1 but let's talk here just about
> shift_mask < BITS_PER_WORD - 1), except that the selection is done at
> runtime, with branches around the two sequences. While superword_op1 can be
> [-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 [0, 2 * BITS_PER_WORD - 1],
> the runtime selection ensures that the instructions executed at runtime
> have their corresponding shift count again in the right range of
> [0, BITS_PER_WORD - 1].
> The problematic case is the HAVE_conditional_move one, which emits both
> sequences into the actually executed code, so necessarily one of them
> has an out-of-range shift count, and a pair of conditional moves then
> picks the correct result.
> Now, in the testcase because -Og doesn't perform VRP/EVRP the shift
> count isn't optimized to constant during GIMPLE passes, but is determined
> to be constant during/after expansion into RTL. The shift count is
> actually const0_rtx later, so superword_op1 is (const_int -64) and we end
> up with miscompilation during combine because of that.
I haven't worked through the testcase yet, but how does that happen?
Having shifts with out-of-range shift counts shouldn't be a problem
in itself, provided that the conditional move selects the one with
the in-range shift.
Thanks,
Richard
> I'm afraid on targetm.shift_truncation_mask (DImode) == 0 targets we have
> to mask the shift counts when the doubleword shift count is in range but
> one of subword_op1/superword_op1 is not, which is what the following
> patch does and what fixes the wrong code. Now, targets like x86 or aarch64,
> while they are !SHIFT_COUNT_TRUNCATED, actually have patterns to match
> shifts with masked counts, so the ANDs can be optimized out. On the other
> side, when we know the result will be masked this way we can use the
> simpler ~op1 instead of (BITS_PER_WORD - 1) - op1 in expand_subword_shift.
> So, say on
> __attribute__((noipa)) __int128
> foo (__int128 a, unsigned k)
> {
> return a << k;
> }
> on aarch64 at -Og the difference is:
> foo:
> subs w5, w2, #64
> - lsl x6, x0, x5
> + lsl x4, x0, x2
> lsr x3, x0, 1
> - mov w4, 63
> - sub w4, w4, w2
> - lsr x3, x3, x4
> + mvn w6, w2
> + and w2, w2, 63
> + lsr x3, x3, x6
> lsl x1, x1, x2
> orr x1, x3, x1
> lsl x0, x0, x2
> csel x0, xzr, x0, pl
> - csel x1, x6, x1, pl
> + csel x1, x4, x1, pl
> ret
> We could do even better and optimize the and w2, w2, 63 instruction out,
> but it is a matter of costs and so IMHO should be handled incrementally.
> For that case consider say:
> void corge (int, int, int);
>
> void
> qux (int x, int y, int z, int n)
> {
> n &= 31;
> corge (x << n, y << n, z >> n);
> }
> with -O2 -fno-tree-vectorize, on x86_64 one gets
> sarl %cl, %edx
> sall %cl, %esi
> sall %cl, %edi
> jmp corge
> but on aarch64
> and w3, w3, 31
> lsl w0, w0, w3
> lsl w1, w1, w3
> asr w2, w2, w3
> b corge
> The reason it works on x86 is that its rtx_costs hook recognizes
> that the AND in shift+mask patterns is free.
> Trying 9 -> 11:
> 9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
> REG_UNUSED flags:CC
> 11: {r91:SI=r87:SI<<r85:SI#0;clobber flags:CC;}
> REG_DEAD r87:SI
> REG_UNUSED flags:CC
> Failed to match this instruction:
> ...
> Failed to match this instruction:
> ...
> Successfully matched this instruction:
> (set (reg/v:SI 85 [ n ])
> (and:SI (reg:SI 96)
> (const_int 31 [0x1f])))
> Successfully matched this instruction:
> (set (reg:SI 91)
> (ashift:SI (reg/v:SI 87 [ y ])
> (subreg:QI (and:SI (reg:SI 96)
> (const_int 31 [0x1f])) 0)))
> allowing combination of insns 9 and 11
> original costs 4 + 4 = 8
> replacement costs 4 + 4 = 8
> Compare that to the aarch64 case:
> Trying 9 -> 11:
> 9: r95:SI=r106:SI&0x1f
> REG_DEAD r106:SI
> 11: r101:SI=r104:SI<<r95:SI#0
> REG_DEAD r104:SI
> Failed to match this instruction:
> ...
> Failed to match this instruction:
> ...
> Successfully matched this instruction:
> (set (reg/v:SI 95 [ n ])
> (and:SI (reg:SI 106)
> (const_int 31 [0x1f])))
> Successfully matched this instruction:
> (set (reg:SI 101)
> (ashift:SI (reg:SI 104)
> (subreg:QI (and:SI (reg:SI 106)
> (const_int 31 [0x1f])) 0)))
> rejecting combination of insns 9 and 11
> original costs 4 + 4 = 8
> replacement costs 4 + 8 = 12
> Now, if the rtx costs for the *aarch64_ashl_reg_si3_mask1
> pattern and similar were the same as for a normal shift without the
> masking, this would succeed and we could get rid of the AND.
>
> Bootstrapped/regtested on aarch64-linux, ok for trunk?
>
> 2023-02-17 Jakub Jelinek <jakub@redhat.com>
>
> PR middle-end/108803
> * optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
> if true, use ~op1 & (BITS_PER_WORD - 1) instead of
> (BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
> instead of op1 on the other two shifts.
> (expand_doubleword_shift_condmove): Use
> superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1. Pass
> shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to expand_subword_shift.
> (expand_doubleword_shift): Pass false as MASK_COUNT to
> expand_subword_shift.
>
> * gcc.dg/pr108803.c: New test.
>
> --- gcc/optabs.cc.jj 2023-01-02 09:32:53.309838465 +0100
> +++ gcc/optabs.cc 2023-02-16 20:44:21.862031535 +0100
> @@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
> rtx outof_input, rtx into_input, rtx op1,
> rtx outof_target, rtx into_target,
> int unsignedp, enum optab_methods methods,
> - unsigned HOST_WIDE_INT shift_mask)
> + unsigned HOST_WIDE_INT shift_mask, bool mask_count)
> {
> optab reverse_unsigned_shift, unsigned_shift;
> rtx tmp, carries;
> @@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
> are truncated to the mode size. */
> carries = expand_binop (word_mode, reverse_unsigned_shift,
> outof_input, const1_rtx, 0, unsignedp, methods);
> - if (shift_mask == BITS_PER_WORD - 1)
> + if (shift_mask == BITS_PER_WORD - 1 || mask_count)
> {
> tmp = immed_wide_int_const
> (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
> @@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
> tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
> 0, true, methods);
> }
> + if (mask_count)
> + {
> + rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
> + op1_mode), op1_mode);
> + tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
> + true, methods);
> + op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
> + true, methods);
> + }
> }
> if (tmp == 0 || carries == 0)
> return false;
> @@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
> {
> rtx outof_superword, into_superword;
>
> + if (shift_mask < BITS_PER_WORD - 1)
> + {
> + rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
> + op1_mode);
> + superword_op1
> + = simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
> + 0, true, methods);
> + }
> +
> /* Put the superword version of the output into OUTOF_SUPERWORD and
> INTO_SUPERWORD. */
> outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
> @@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
> if (!expand_subword_shift (op1_mode, binoptab,
> outof_input, into_input, subword_op1,
> outof_target, into_target,
> - unsignedp, methods, shift_mask))
> + unsignedp, methods, shift_mask,
> + shift_mask < BITS_PER_WORD - 1))
> return false;
>
> /* Select between them. Do the INTO half first because INTO_SUPERWORD
> @@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
> return expand_subword_shift (op1_mode, binoptab,
> outof_input, into_input, op1,
> outof_target, into_target,
> - unsignedp, methods, shift_mask);
> + unsignedp, methods, shift_mask, false);
> }
>
> /* Try using conditional moves to generate straight-line code. */
> @@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
> if (!expand_subword_shift (op1_mode, binoptab,
> outof_input, into_input, op1,
> outof_target, into_target,
> - unsignedp, methods, shift_mask))
> + unsignedp, methods, shift_mask, false))
> return false;
>
> emit_label (done_label);
> --- gcc/testsuite/gcc.dg/pr108803.c.jj 2023-02-16 20:48:53.517032422 +0100
> +++ gcc/testsuite/gcc.dg/pr108803.c 2023-02-16 20:48:45.971143498 +0100
> @@ -0,0 +1,24 @@
> +/* PR middle-end/108803 */
> +/* { dg-do run { target int128 } } */
> +/* { dg-options "-Og" } */
> +
> +unsigned i, j;
> +
> +__int128
> +foo (__int128 a)
> +{
> + a &= 9;
> + unsigned k = __builtin_add_overflow_p (i, j, a);
> + __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
> + __int128 c = a + b;
> + return c;
> +}
> +
> +int
> +main (void)
> +{
> + __int128 x = foo (0);
> + if (x != 63)
> + __builtin_abort ();
> + return 0;
> +}
>
> Jakub