From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Cc: vmakarov@redhat.com
Subject: Avoid adding impossible copies in ira-conflicts.c:process_reg_shuffles
Date: Tue, 17 Sep 2019 16:50:00 -0000

If an insn requires two operands to be tied, and the input operand dies
in the insn, IRA acts as though there were a copy from the input to the
output with the same execution frequency as the insn.  Allocating the
same register to the input and the output then saves the cost of a move.

If there is no such tie, but an input operand nevertheless dies in the
insn, IRA creates a similar move, but with an eighth of the frequency.
This helps to ensure that chains of instructions reuse registers in a
natural way, rather than using arbitrarily different registers for no
reason.  This heuristic seems to work well in the vast majority of cases.

However, for SVE, the handling of untied operands ends up creating
copies between dying predicate registers and vector outputs, even though
vector and predicate registers are distinct classes and can never be
tied.  This is a particular problem because the dying predicate tends to
be the loop control predicate, which is used by most instructions in a
vector loop and so (rightly) has a very high allocation priority.  Any
copies involving the loop predicate therefore tend to get processed
before copies involving only vector registers.  The end result is that
we tend to allocate the output of the last vector instruction in a loop
ahead of its natural place in the allocation order and don't benefit
from chains created between vector registers.

This patch tries to avoid the problem by not adding register shuffle
copies if there appears to be no chance that the two operands could be
allocated to the same register.

Tested on aarch64-linux-gnu and x86_64-linux-gnu (which needs the
pr82361-*.c tweak I just posted).  Also tested by comparing the
gcc.c-torture, gcc.dg and g++.dg output for one target per CPU directory
at -O2 -ftree-vectorize.
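To make the failure mode concrete, a loop of roughly the following
shape (illustrative only; in the same spirit as the cond_unary_2.c test
below rather than copied from it) vectorizes for SVE into a sequence of
predicated instructions governed by the loop-control predicate, which
dies in the last instruction that uses it:

  void
  f (double *restrict r, double *restrict a, double *restrict b, int n)
  {
    for (int i = 0; i < n; ++i)
      r[i] = b[i] > 0 ? -a[i] : a[i];
  }

Without the patch, the untied-operand heuristic records shuffle copies
between that dying predicate and the vector outputs, even though no
alternative could ever tie the two.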
The table below summarises the tests that had more or fewer assembly
lines after the patch (always a bad metric, but it's just to get a
flavour):

Target                 Tests  Delta  Best  Worst  Median
======                 =====  =====  ====  =====  ======
aarch64-linux-gnu         11    -16   -15      9      -1
aarch64_be-linux-gnu      10    -16   -15      9      -1
alpha-linux-gnu           10     12    -5     10       2
amdgcn-amdhsa             51     48  -138    174      -1
arc-elf                    4     -3    -2      2      -2
bfin-elf                  53     19   -13      5       1
cris-elf                   1     -1    -1     -1      -1
frv-linux-gnu            139     22   -16     12      -1
h8300-elf                  2      2     1      1       1
hppa64-hp-hpux11.23       20      5    -3      3       1
i686-apple-darwin         42   -305   -87     11      -1
i686-pc-linux-gnu         32  -3205  -402     15     -15
ia64-linux-gnu           774   -636  -274    261      -1
m32r-elf                 179  -1386  -161     31      -2
m68k-linux-gnu            21    -64   -18      4      -2
mipsel-linux-gnu          61      8   -25     34       1
mipsisa64-linux-gnu       40    -11   -10     13      -1
mmix                       4      1    -2      6      -2
mn10300-elf                9      0    -6      5       1
pdp11                     13     -6    -5      2       1
powerpc-ibm-aix7.0       152   1736  -142    481      -1
powerpc64-linux-gnu      111  -1366  -370     39      -1
powerpc64le-linux-gnu    136  -1183  -409    241      -1
pru-elf                   45     24    -6     17       1
riscv32-elf               22    -83   -62     10      -2
riscv64-elf               26    -78   -19      4      -1
s390-linux-gnu            43   -145   -32     16      -1
s390x-linux-gnu           40   -285   -96     16      -1
visium-elf               125   -317   -50     10      -1
x86_64-darwin             40    -83   -42     13      -1
x86_64-linux-gnu          39   -577  -164     23      -1
xstormy16-elf             26    -12    -4      4      -1

Richard


2019-09-17  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* ira-conflicts.c (can_use_same_reg_p): New function.
	(process_reg_shuffles): Take an insn parameter.  Ignore cases
	in which input operand op_num could seemingly never be allocated
	to the same register as the destination.
	(add_insn_allocno_copies): Update call to process_reg_shuffles.

gcc/testsuite/
	* gcc.target/aarch64/sve/cond_convert_1.c: Remove XFAILs.
	* gcc.target/aarch64/sve/cond_convert_4.c: Likewise.
	* gcc.target/aarch64/sve/cond_unary_2.c: Likewise.

Index: gcc/ira-conflicts.c
===================================================================
--- gcc/ira-conflicts.c	2019-09-12 10:52:55.000000000 +0100
+++ gcc/ira-conflicts.c	2019-09-17 17:43:06.426031052 +0100
@@ -325,12 +325,37 @@ process_regs_for_copy (rtx reg1, rtx reg
     return true;
 }
 
-/* Process all of the output registers of the current insn which are
-   not bound (BOUND_P) and the input register REG (its operand number
+/* Return true if output operand OUTPUT and input operand INPUT of
+   INSN can use the same register class for at least one alternative.
+   INSN is already described in recog_data and recog_op_alt.  */
+static bool
+can_use_same_reg_p (rtx_insn *insn, int output, int input)
+{
+  alternative_mask preferred = get_preferred_alternatives (insn);
+  for (int nalt = 0; nalt < recog_data.n_alternatives; nalt++)
+    {
+      if (!TEST_BIT (preferred, nalt))
+	continue;
+
+      const operand_alternative *op_alt
+	= &recog_op_alt[nalt * recog_data.n_operands];
+      if (op_alt[input].matches == output)
+	return true;
+
+      if (ira_reg_class_intersect[op_alt[input].cl][op_alt[output].cl]
+	  != NO_REGS)
+	return true;
+    }
+  return false;
+}
+
+/* Process all of the output registers of the current insn (INSN) which
+   are not bound (BOUND_P) and the input register REG (its operand number
    OP_NUM) which dies in the insn as if there were a move insn between
    them with frequency FREQ.  */
 static void
-process_reg_shuffles (rtx reg, int op_num, int freq, bool *bound_p)
+process_reg_shuffles (rtx_insn *insn, rtx reg, int op_num, int freq,
+		      bool *bound_p)
 {
   int i;
   rtx another_reg;
@@ -342,7 +367,13 @@ process_reg_shuffles (rtx reg, int op_nu
       if (!REG_SUBREG_P (another_reg)
 	  || op_num == i
 	  || recog_data.operand_type[i] != OP_OUT
-	  || bound_p[i])
+	  || bound_p[i]
+	  || (!can_use_same_reg_p (insn, i, op_num)
+	      && (recog_data.constraints[op_num][0] != '%'
+		  || !can_use_same_reg_p (insn, i, op_num + 1))
+	      && (op_num == 0
+		  || recog_data.constraints[op_num - 1][0] != '%'
+		  || !can_use_same_reg_p (insn, i, op_num - 1))))
 	continue;
 
       process_regs_for_copy (reg, another_reg, false, NULL, freq);
@@ -412,7 +443,8 @@ add_insn_allocno_copies (rtx_insn *insn)
 	     the corresponding allocno copies.  The cost will not
 	     correspond to a real move insn cost, so make the frequency
 	     smaller.  */
-	  process_reg_shuffles (operand, i, freq < 8 ? 1 : freq / 8, bound_p);
+	  process_reg_shuffles (insn, operand, i, freq < 8 ? 1 : freq / 8,
+				bound_p);
 	}
     }
 }
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-08-14 11:56:54.223221654 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_1.c	2019-09-17 17:43:06.426031052 +0100
@@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-08-14 11:56:54.223221654 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_convert_4.c	2019-09-17 17:43:06.430031021 +0100
@@ -32,6 +32,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-08-14 11:53:04.636898923 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/cond_unary_2.c	2019-09-17 17:43:06.430031021 +0100
@@ -54,8 +54,5 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz} } } */
-/* At the moment we don't manage to avoid using MOVPRFX for the
-   floating-point functions.  */
-/* { dg-final { scan-assembler-not {\tmovprfx\t} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\t} 6 } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
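As a footnote for anyone who wants to experiment with the shuffle-copy
condition outside GCC: below is a minimal standalone model of the new
test in process_reg_shuffles.  It is a sketch under simplified
assumptions, not GCC code; all of the names (toy_class, toy_operand,
classes_intersect_p, shuffle_copy_useful_p) are made up, and
classes_intersect_p collapses the op_alt[input].matches check and the
ira_reg_class_intersect lookup into a single toy test.

#include <stdbool.h>
#include <stdio.h>

/* Toy register classes: SVE vector and predicate registers never
   overlap, which is what makes the shuffle copies impossible.  */
enum toy_class { VECTOR_REGS, PRED_REGS };

/* Per-operand description.  commutative_p models a leading '%' in the
   operand's constraint string, which makes the operand commutative
   with the following one (so a following operand always exists when
   the flag is set, as in GCC).  */
struct toy_operand
{
  enum toy_class cl;
  bool commutative_p;
};

/* Toy stand-in for can_use_same_reg_p: true if the two operands'
   classes could ever supply the same register.  */
static bool
classes_intersect_p (enum toy_class a, enum toy_class b)
{
  return a == b;
}

/* Negated mirror of the condition the patch adds: a shuffle copy
   between output OUT and dying input IN is only worth recording if IN
   could be allocated to the same register as OUT, either directly or
   after a commutative swap with IN + 1 or IN - 1.  */
static bool
shuffle_copy_useful_p (const struct toy_operand *ops, int out, int in)
{
  if (classes_intersect_p (ops[out].cl, ops[in].cl))
    return true;
  if (ops[in].commutative_p
      && classes_intersect_p (ops[out].cl, ops[in + 1].cl))
    return true;
  if (in > 0
      && ops[in - 1].commutative_p
      && classes_intersect_p (ops[out].cl, ops[in - 1].cl))
    return true;
  return false;
}

int
main (void)
{
  /* Model a predicated SVE insn: operand 0 is the vector output,
     operand 1 the governing predicate, operand 2 a vector input.  */
  struct toy_operand ops[] = {
    { VECTOR_REGS, false },
    { PRED_REGS, false },
    { VECTOR_REGS, false }
  };
  /* Dying predicate: prints 0, so no shuffle copy would be added.  */
  printf ("dying predicate -> %d\n", shuffle_copy_useful_p (ops, 0, 1));
  /* Dying vector input: prints 1, the heuristic still applies.  */
  printf ("dying vector    -> %d\n", shuffle_copy_useful_p (ops, 0, 2));
  return 0;
}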