Date: Fri, 17 Feb 2023 11:14:57 +0100
From: Jakub Jelinek
To: Richard Sandiford, Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]

Hi!

The following testcase is miscompiled on aarch64.  The problem is that
aarch64 with TARGET_SIMD is a !SHIFT_COUNT_TRUNCATED target with
targetm.shift_truncation_mask (DImode) == 0 and with HAVE_conditional_move
true.  If a doubleword shift (TImode in this case) doesn't have its own
expander (unlike e.g. x86, which has one) and is handled in generic code,
there are three possible expansions.
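For orientation, here is a minimal C sketch of the identity all three
expansions implement.  It is my own illustration (64-bit words, left shift
only), not code from optabs.cc:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } dw;

/* (hi:lo) << n split into word-sized shifts; every count that actually
   reaches a hardware shift must end up in [0, 63].  */
static dw
shl128 (dw x, unsigned n)  /* n in [0, 127] */
{
  dw r;
  if (n >= 64)
    {
      /* Superword case: the count is superword_op1 = n - 64.  */
      r.hi = x.lo << (n - 64);
      r.lo = 0;
    }
  else
    {
      /* Subword case: lo >> (64 - n) is done as (lo >> 1) >> (63 - n)
         so that the count is valid even for n == 0.  */
      r.hi = (x.hi << n) | ((x.lo >> 1) >> (63 - n));
      r.lo = x.lo << n;
    }
  return r;
}

The constant-count expansion picks one of the two branches at compile time,
the branchy expansion performs the selection at runtime, and the
conditional-move expansion emits both branches straight-line and selects the
results afterwards; the three variants are described next.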
One is when the shift count is constant; in that case the code computes
superword_op1 as op1 - BITS_PER_WORD and chooses at compile time which of
expand_superword_shift or expand_subword_shift to call, which ensures that
whichever of the two is used actually has its shift count (superword_op1 or
op1) in the [0, BITS_PER_WORD - 1] range.
If !HAVE_conditional_move, or if that expansion fails, the function handles
non-constant op1 similarly (there are some special cases for
shift_mask >= BITS_PER_WORD - 1, but let's talk here just about
shift_mask < BITS_PER_WORD - 1), except that the selection is done at
runtime, with branches around the two sequences.  While superword_op1 can be
in [-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 in
[0, 2 * BITS_PER_WORD - 1], the runtime selection ensures that the
instructions actually executed have their shift counts again in the right
[0, BITS_PER_WORD - 1] range.
The problematic expansion is the HAVE_conditional_move one, which emits both
sequences into the straight-line executed code, so necessarily one of them
is executed with an out-of-range shift count, and two conditional moves then
pick up the correct result.

Now, in the testcase, because -Og doesn't perform VRP/EVRP, the shift count
isn't optimized to a constant during the GIMPLE passes, but is only
determined to be constant during/after expansion into RTL.  The shift count
is actually const0_rtx later on, so superword_op1 is (const_int -64) and we
end up with a miscompilation during combine because of that.

I'm afraid that on targetm.shift_truncation_mask (DImode) == 0 targets we
have to mask the shift counts when the doubleword shift count is in range
but one of subword_op1/superword_op1 is not; that is what the following
patch does, and it is what fixes the wrong-code (a plain C model of the
fixed expansion follows the combine dumps below).  Now, targets like x86 or
aarch64, while they are !SHIFT_COUNT_TRUNCATED, do actually have patterns
that match a shift with a masked count, so the ANDs can be optimized out.
On the other hand, when we know the count will be masked this way, we can
use the simpler ~op1 instead of (BITS_PER_WORD - 1) - op1 in
expand_subword_shift.

So, say on
__attribute__((noipa)) __int128
foo (__int128 a, unsigned k)
{
  return a << k;
}
on aarch64 at -Og the difference is:
foo:
        subs    w5, w2, #64
-       lsl     x6, x0, x5
+       lsl     x4, x0, x2
        lsr     x3, x0, 1
-       mov     w4, 63
-       sub     w4, w4, w2
-       lsr     x3, x3, x4
+       mvn     w6, w2
+       and     w2, w2, 63
+       lsr     x3, x3, x6
        lsl     x1, x1, x2
        orr     x1, x3, x1
        lsl     x0, x0, x2
        csel    x0, xzr, x0, pl
-       csel    x1, x6, x1, pl
+       csel    x1, x4, x1, pl
        ret

We could do even better and optimize the and w2, w2, 63 instruction out,
but that is a matter of costs and so IMHO should be handled incrementally.
For that case consider say:

void corge (int, int, int);

void
qux (int x, int y, int z, int n)
{
  n &= 31;
  corge (x << n, y << n, z >> n);
}

With -O2 -fno-tree-vectorize, on x86_64 one gets:
        sarl    %cl, %edx
        sall    %cl, %esi
        sall    %cl, %edi
        jmp     corge
but on aarch64:
        and     w3, w3, 31
        lsl     w0, w0, w3
        lsl     w1, w1, w3
        asr     w2, w2, w3
        b       corge

The reason it works on x86 is that its rtx_costs hook recognizes that the
AND in a shift+mask pattern is for free; combine there merges the AND into
the shift:
Trying 9 -> 11:
    9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
      REG_UNUSED flags:CC
   11: {r91:SI=r87:SI<<r85:SI;clobber flags:CC;}
while the same attempt on aarch64 does not succeed:
Trying 9 -> 11:
    9: r95:SI=r106:SI&0x1f
      REG_DEAD r106:SI
   11: r101:SI=r104:SI<<r95:SI
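As mentioned above, here is a plain C model of the fixed conditional-move
expansion.  This is a hedged sketch with made-up names (dw,
shl128_condmove), not GCC code: it shows why masking every word-sized shift
count with BITS_PER_WORD - 1 == 63 makes both straight-line arms safe to
execute even for a count of 0 (where the unmasked superword count would be
-64), and it uses the ~op1 form for the carries count.  The harness checks
all 128 counts against __int128:

#include <stdint.h>

typedef struct { uint64_t lo, hi; } dw;

static dw
shl128_condmove (dw x, unsigned n)  /* n in [0, 127] */
{
  /* Superword arm: without the masking the count would be n - 64,
     i.e. -64 for n == 0; the & 63 keeps it in [0, 63].  */
  uint64_t sup_hi = x.lo << ((n - 64) & 63);

  /* Subword arm: both counts are masked as well, and the carries
     count (63 - n) & 63 simplifies to ~n & 63.  */
  uint64_t sub_hi = (x.hi << (n & 63)) | ((x.lo >> 1) >> (~n & 63));
  uint64_t sub_lo = x.lo << (n & 63);

  /* Both arms were computed unconditionally; two conditional moves
     select the valid results.  */
  dw r;
  r.hi = n >= 64 ? sup_hi : sub_hi;
  r.lo = n >= 64 ? 0 : sub_lo;
  return r;
}

int
main (void)
{
  dw x = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
  for (unsigned n = 0; n < 128; n++)
    {
      unsigned __int128 ref = (((unsigned __int128) x.hi << 64) | x.lo) << n;
      dw got = shl128_condmove (x, n);
      if (got.lo != (uint64_t) ref || got.hi != (uint64_t) (ref >> 64))
        __builtin_abort ();
    }
  return 0;
}

Dropping the & 63 from the superword count reproduces the bug pattern: for
n == 0 that arm would shift by -64, and it is exactly such an out-of-range
count, folded to (const_int -64) at expansion time, that combine
miscompiled in the PR.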
	PR middle-end/108803
	* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
	if true, use ~op1 & (BITS_PER_WORD - 1) instead of
	(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
	instead of op1 on the other two shifts.
	(expand_doubleword_shift_condmove): Use
	superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1.
	Pass shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to
	expand_subword_shift.
	(expand_doubleword_shift): Pass false as MASK_COUNT to
	expand_subword_shift.

	* gcc.dg/pr108803.c: New test.

--- gcc/optabs.cc.jj	2023-01-02 09:32:53.309838465 +0100
+++ gcc/optabs.cc	2023-02-16 20:44:21.862031535 +0100
@@ -507,7 +507,7 @@ expand_subword_shift (scalar_int_mode op
 		      rtx outof_input, rtx into_input, rtx op1,
 		      rtx outof_target, rtx into_target,
 		      int unsignedp, enum optab_methods methods,
-		      unsigned HOST_WIDE_INT shift_mask)
+		      unsigned HOST_WIDE_INT shift_mask, bool mask_count)
 {
   optab reverse_unsigned_shift, unsigned_shift;
   rtx tmp, carries;
@@ -535,7 +535,7 @@ expand_subword_shift (scalar_int_mode op
 	 are truncated to the mode size.  */
       carries = expand_binop (word_mode, reverse_unsigned_shift,
 			      outof_input, const1_rtx, 0, unsignedp, methods);
-      if (shift_mask == BITS_PER_WORD - 1)
+      if (shift_mask == BITS_PER_WORD - 1 || mask_count)
 	{
 	  tmp = immed_wide_int_const
 	    (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
@@ -549,6 +549,15 @@ expand_subword_shift (scalar_int_mode op
 	  tmp = simplify_expand_binop (op1_mode, sub_optab, tmp, op1,
 				       0, true, methods);
 	}
+      if (mask_count)
+	{
+	  rtx tmp2 = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
+						     op1_mode), op1_mode);
+	  tmp = simplify_expand_binop (op1_mode, and_optab, tmp, tmp2, 0,
+				       true, methods);
+	  op1 = simplify_expand_binop (op1_mode, and_optab, op1, tmp2, 0,
+				       true, methods);
+	}
     }
   if (tmp == 0 || carries == 0)
     return false;
@@ -596,6 +605,15 @@ expand_doubleword_shift_condmove (scalar
 {
   rtx outof_superword, into_superword;
 
+  if (shift_mask < BITS_PER_WORD - 1)
+    {
+      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, op1_mode),
+				      op1_mode);
+      superword_op1
+	= simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
+				 0, true, methods);
+    }
+
   /* Put the superword version of the output into OUTOF_SUPERWORD
      and INTO_SUPERWORD.  */
   outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
@@ -621,7 +639,8 @@ expand_doubleword_shift_condmove (scalar
   if (!expand_subword_shift (op1_mode, binoptab,
 			     outof_input, into_input, subword_op1,
 			     outof_target, into_target,
-			     unsignedp, methods, shift_mask))
+			     unsignedp, methods, shift_mask,
+			     shift_mask < BITS_PER_WORD - 1))
     return false;
 
   /* Select between them.  Do the INTO half first because INTO_SUPERWORD
@@ -742,7 +761,7 @@ expand_doubleword_shift (scalar_int_mode
       return expand_subword_shift (op1_mode, binoptab,
 				   outof_input, into_input, op1,
 				   outof_target, into_target,
-				   unsignedp, methods, shift_mask);
+				   unsignedp, methods, shift_mask, false);
     }
 
   /* Try using conditional moves to generate straight-line code.  */
@@ -781,7 +800,7 @@ expand_doubleword_shift (scalar_int_mode
   if (!expand_subword_shift (op1_mode, binoptab,
 			     outof_input, into_input, op1,
 			     outof_target, into_target,
-			     unsignedp, methods, shift_mask))
+			     unsignedp, methods, shift_mask, false))
     return false;
 
   emit_label (done_label);
--- gcc/testsuite/gcc.dg/pr108803.c.jj	2023-02-16 20:48:53.517032422 +0100
+++ gcc/testsuite/gcc.dg/pr108803.c	2023-02-16 20:48:45.971143498 +0100
@@ -0,0 +1,24 @@
+/* PR middle-end/108803 */
+/* { dg-do run { target int128 } } */
+/* { dg-options "-Og" } */
+
+unsigned i, j;
+
+__int128
+foo (__int128 a)
+{
+  a &= 9;
+  unsigned k = __builtin_add_overflow_p (i, j, a);
+  __int128 b = (63 + a) << k | ((63 + a) >> (63 & k));
+  __int128 c = a + b;
+  return c;
+}
+
+int
+main (void)
+{
+  __int128 x = foo (0);
+  if (x != 63)
+    __builtin_abort ();
+  return 0;
+}

	Jakub