From: Jakub Jelinek <jakub@redhat.com>
To: Richard Sandiford <richard.sandiford@arm.com>,
Richard Biener <rguenther@suse.de>
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]
Date: Fri, 17 Feb 2023 11:14:57 +0100 [thread overview]
Message-ID: <Y+9TodOHE09e9Vwq@tucnak>
Hi!
The following testcase is miscompiled on aarch64. The problem is that
aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
targetm.shift_truncation_mask (DImode) == 0 and with HAVE_conditional_move
true. If a doubleword shift (in this case TImode) doesn't have its own
expander (as is the case of e.g. x86) and is handled in generic code,
there are 3 possible expansions. One is when the shift count is constant,
the code computes in that case superword_op1 as op1 - BITS_PER_WORD,
and chooses at compile time which of expand_superword_shift or
expand_subword_shift to call, which ensures that whatever is used
actually has its shift count (superword_op1 or op1) in [0, BITS_PER_WORD - 1]
range. If !HAVE_conditional_move or that expansion fails, the function
handles non-constant op1 similarly (there are some special cases for
shift_mask >= BITS_PER_WORD - 1, but let's talk here just about
shift_mask < BITS_PER_WORD - 1), except that the selection is done at
runtime, with branches around the two sequences. While superword_op1 can be
[-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 [0, 2 * BITS_PER_WORD - 1],
the runtime selection ensures that the instructions executed at runtime
have their corresponding shift count again in the right range of
[0, BITS_PER_WORD - 1].
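For concreteness, the constant-count expansion can be sketched in plain C
for a 128-bit left shift held as two 64-bit words (the helper name and
layout are my own illustration, not code from optabs.cc):

```c
#include <stdint.h>

/* Sketch of the constant-count expansion of a doubleword left shift:
   when COUNT is known at compile time, exactly one branch is chosen,
   so every hardware shift actually emitted has its count in [0, 63].  */
static void
doubleword_shl (uint64_t lo, uint64_t hi, unsigned count,
		uint64_t *out_lo, uint64_t *out_hi)
{
  if (count >= 64)
    {
      /* expand_superword_shift: the shift count is superword_op1
	 = count - 64, which is in [0, 63] here.  */
      *out_hi = lo << (count - 64);
      *out_lo = 0;
    }
  else
    {
      /* expand_subword_shift: the bits carried across the word boundary
	 are computed as (lo >> 1) >> (63 - count) so that COUNT == 0
	 still uses an in-range shift count.  */
      *out_hi = (hi << count) | ((lo >> 1) >> (63 - count));
      *out_lo = lo << count;
    }
}
```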
The problematic case is the HAVE_conditional_move one, which emits both
sequences into the actually executed code, so necessarily one of them
has an out-of-range shift count, and 2 conditional moves then pick up
the result of the right one.
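In C terms, that conditional-move expansion looks roughly like the
following sketch (again my own illustration; the & 63 masks correspond
to the fix in this patch -- without them, which models the pre-patch
expansion, the arm that is not selected would execute a shift with a
count outside [0, 63]):

```c
#include <stdint.h>

/* Branchless sketch of expand_doubleword_shift_condmove for a 128-bit
   left shift: BOTH arms are computed, then conditional moves pick the
   right one.  The & 63 masks keep the not-selected arm's shift counts
   in range; without them, e.g. COUNT == 0 gives
   superword_op1 == (unsigned) -64 and an out-of-range shift.  */
static void
doubleword_shl_condmove (uint64_t lo, uint64_t hi, unsigned count,
			 uint64_t *out_lo, uint64_t *out_hi)
{
  unsigned superword_op1 = count - 64;	/* wraps around for count < 64 */
  uint64_t super_hi = lo << (superword_op1 & 63);
  uint64_t sub_lo = lo << (count & 63);
  uint64_t sub_hi = (hi << (count & 63)) | ((lo >> 1) >> (~count & 63));
  /* Two conditional moves select between the arms.  */
  *out_hi = count >= 64 ? super_hi : sub_hi;
  *out_lo = count >= 64 ? 0 : sub_lo;
}
```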
Now, in the testcase, because -Og doesn't perform VRP/EVRP, the shift
count isn't optimized to a constant during the GIMPLE passes, but is only
determined to be constant during/after expansion into RTL. The shift count
is actually const0_rtx later, so superword_op1 is (const_int -64) and we
end up with a miscompilation during combine because of that.
I'm afraid that on targetm.shift_truncation_mask (DImode) == 0 targets we
have to mask the shift counts when the doubleword shift count is in range
but one of subword_op1/superword_op1 is not, which is what the following
patch does and what fixes the wrong-code. Now, targets like x86 or aarch64,
while they are !SHIFT_COUNT_TRUNCATED, actually have patterns to match
shifts with masked counts, so the ANDs can be optimized out. On the other
hand, when we know the result will be masked this way, we can use the
simpler ~op1 instead of (BITS_PER_WORD - 1) - op1 in expand_subword_shift.
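The ~op1 simplification is valid because, once both counts end up masked
with BITS_PER_WORD - 1, ~op1 and (BITS_PER_WORD - 1) - op1 agree modulo
the word size; a quick self-check (word size 64 assumed, helper names my
own):

```c
/* ~op1 == -op1 - 1, and modulo 64 that equals 63 - op1, so after
   masking with 63 the simpler one's-complement form yields the same
   shift count as the subtraction.  */
static unsigned
tmp_via_not (unsigned op1)
{
  return ~op1 & 63;
}

static unsigned
tmp_via_sub (unsigned op1)
{
  return (63 - op1) & 63;
}
```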
For example, for
__attribute__((noipa)) __int128
foo (__int128 a, unsigned k)
{
return a << k;
}
on aarch64 at -Og the difference is:
foo:
subs w5, w2, #64
- lsl x6, x0, x5
+ lsl x4, x0, x2
lsr x3, x0, 1
- mov w4, 63
- sub w4, w4, w2
- lsr x3, x3, x4
+ mvn w6, w2
+ and w2, w2, 63
+ lsr x3, x3, x6
lsl x1, x1, x2
orr x1, x3, x1
lsl x0, x0, x2
csel x0, xzr, x0, pl
- csel x1, x6, x1, pl
+ csel x1, x4, x1, pl
ret
We could do even better and optimize the and w2, w2, 63 instruction out,
but that is a matter of costs and so IMHO should be handled incrementally.
For that case, consider say:
void corge (int, int, int);
void
qux (int x, int y, int z, int n)
{
n &= 31;
corge (x << n, y << n, z >> n);
}
with -O2 -fno-tree-vectorize, on x86_64 one gets
sarl %cl, %edx
sall %cl, %esi
sall %cl, %edi
jmp corge
but on aarch64
and w3, w3, 31
lsl w0, w0, w3
lsl w1, w1, w3
asr w2, w2, w3
b corge
The reason it works on x86 is that its rtx_costs hook recognizes
that the AND in shift+mask patterns is for free.
Trying 9 -> 11:
9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
REG_UNUSED flags:CC
11: {r91:SI=r87:SI<<r85:SI#0;clobber flags:CC;}
REG_DEAD r87:SI
REG_UNUSED flags:CC
Failed to match this instruction:
...
Failed to match this instruction:
...
Successfully matched this instruction:
(set (reg/v:SI 85 [ n ])
(and:SI (reg:SI 96)
(const_int 31 [0x1f])))
Successfully matched this instruction:
(set (reg:SI 91)
(ashift:SI (reg/v:SI 87 [ y ])
(subreg:QI (and:SI (reg:SI 96)
(const_int 31 [0x1f])) 0)))
allowing combination of insns 9 and 11
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
Compare that to the aarch64 case:
Trying 9 -> 11:
9: r95:SI=r106:SI&0x1f
REG_DEAD r106:SI
11: r101:SI=r104:SI<<r95:SI#0
REG_DEAD r104:SI
Failed to match this instruction:
...
Failed to match this instruction:
...
Successfully matched this instruction:
(set (reg/v:SI 95 [ n ])
(and:SI (reg:SI 106)
(const_int 31 [0x1f])))
Successfully matched this instruction:
(set (reg:SI 101)
(ashift:SI (reg:SI 104)
(subreg:QI (and:SI (reg:SI 106)
(const_int 31 [0x1f])) 0)))
rejecting combination of insns 9 and 11
original costs 4 + 4 = 8
replacement costs 4 + 8 = 12
Now, if the rtx costs for the *aarch64_ashl_reg_si3_mask1 pattern
and similar were the same as for a normal shift without the
masking, this combination would succeed and we could get rid of the and.
Bootstrapped/regtested on aarch64-linux, ok for trunk?
2023-02-17 Jakub Jelinek <jakub@redhat.com>
PR middle-end/108803
* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
if true, use ~op1 & (BITS_PER_WORD - 1) instead of
(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
instead of op1 on the other two shifts.
(expand_doubleword_shift_condmove): Use
superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1. Pass
shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to expand_subword_shift.
(expand_doubleword_shift): Pass false as MASK_COUNT to
expand_subword_shift.
* gcc.dg/pr108803.c: New test.
--- gcc/optabs.cc.jj 2023-01-02 09:32:53.309838465 +0100
+++ gcc/optabs.cc 2023-02-16 20:44:21.862031535 +0100
@@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
rtx outof_input, rtx into_input, rtx op1,
rtx outof_target, rtx into_target,
int unsignedp, enum optab_methods methods,
- unsigned HOST_WIDE_INT shift_mask)
+ unsigned HOST_WIDE_INT shift_mask, bool mask_count)
{
optab reverse_unsigned_shift, unsigned_shift;
rtx tmp, carries;
@@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
are truncated to the mode size. */
carries = expand_binop (word_mode, reverse_unsigned_shift,
outof_input, const1_rtx, 0, unsignedp, methods);
- if (shift_mask == BITS_PER_WORD - 1)
+ if (shift_mask == BITS_PER_WORD - 1 || mask_count)
{
tmp = immed_wide_int_const
(wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
@@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
0, true, methods);
}
+ if (mask_count)
+ {
+ rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
+ op1_mode), op1_mode);
+ tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
+ true, methods);
+ op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
+ true, methods);
+ }
}
if (tmp == 0 || carries == 0)
return false;
@@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
{
rtx outof_superword, into_superword;
+ if (shift_mask < BITS_PER_WORD - 1)
+ {
+ rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
+ op1_mode);
+ superword_op1
+ = simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
+ 0, true, methods);
+ }
+
/* Put the superword version of the output into OUTOF_SUPERWORD and
INTO_SUPERWORD. */
outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
@@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
if (!expand_subword_shift (op1_mode, binoptab,
outof_input, into_input, subword_op1,
outof_target, into_target,
- unsignedp, methods, shift_mask))
+ unsignedp, methods, shift_mask,
+ shift_mask < BITS_PER_WORD - 1))
return false;
/* Select between them. Do the INTO half first because INTO_SUPERWORD
@@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
return expand_subword_shift (op1_mode, binoptab,
outof_input, into_input, op1,
outof_target, into_target,
- unsignedp, methods, shift_mask);
+ unsignedp, methods, shift_mask, false);
}
/* Try using conditional moves to generate straight-line code. */
@@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
if (!expand_subword_shift (op1_mode, binoptab,
outof_input, into_input, op1,
outof_target, into_target,
- unsignedp, methods, shift_mask))
+ unsignedp, methods, shift_mask, false))
return false;
emit_label (done_label);
--- gcc/testsuite/gcc.dg/pr108803.c.jj 2023-02-16 20:48:53.517032422 +0100
+++ gcc/testsuite/gcc.dg/pr108803.c 2023-02-16 20:48:45.971143498 +0100
@@ -0,0 +1,24 @@
+/* PR middle-end/108803 */
+/* { dg-do run { target int128 } } */
+/* { dg-options "-Og" } */
+
+unsigned i, j;
+
+__int128
+foo (__int128 a)
+{
+ a &= 9;
+ unsigned k = __builtin_add_overflow_p (i, j, a);
+ __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
+ __int128 c = a + b;
+ return c;
+}
+
+int
+main (void)
+{
+ __int128 x = foo (0);
+ if (x != 63)
+ __builtin_abort ();
+ return 0;
+}
Jakub