Date: Wed, 28 Jul 2021 10:35:54 +0200
From: Jakub Jelinek
To: Uros Bizjak
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH] i386: Improve extensions of __builtin_clz and constant - __builtin_clz for -mno-lzcnt [PR78103]
Message-ID: <20210728083554.GR2380545@tucnak>

Hi!

This patch improves the code emitted for the non-TARGET_LZCNT case.
As __builtin_clz* is UB on a 0 argument, and for !TARGET_LZCNT
CLZ_VALUE_DEFINED_AT_ZERO is 0, it is UB even at RTL time, so we can
take advantage of that and assume the result is in the range 0 to 31
(or 0 to 63).  Given that, sign and zero extension of the result are
equivalent, and both are already performed implicitly by the 32-bit
bsrl or xorl instructions (writing a 32-bit register zeroes the upper
half of the 64-bit register).  And constant - __builtin_clz* can be
simplified into bsr + (constant - bitmask), where bitmask is 31 or 63.

For TARGET_LZCNT, much of this is already fine as is (e.g. the sign or
zero extensions), and the other optimizations are IMHO not possible:
once we have lzcnt, we've lost the information on whether it is UB at
zero or not, so we can't transform it into bsr even when that would be
1-2 insns shorter.

For -m64, the changes on the 3 testcases between unpatched and patched
gcc are:

pr78103-1.s:
  bsrq    %rdi, %rax
- xorq    $63, %rax
- cltq
+ xorl    $63, %eax
...
  bsrq    %rdi, %rax
- xorq    $63, %rax
- cltq
+ xorl    $63, %eax
...
  bsrl    %edi, %eax
  xorl    $31, %eax
- cltq
...
  bsrl    %edi, %eax
  xorl    $31, %eax
- cltq

pr78103-2.s:
  bsrl    %edi, %edi
- movl    $32, %eax
- xorl    $31, %edi
- subl    %edi, %eax
+ leal    1(%rdi), %eax
...
- bsrl    %edi, %edi
- movl    $31, %eax
- xorl    $31, %edi
- subl    %edi, %eax
+ bsrl    %edi, %eax
...
  bsrq    %rdi, %rdi
- movl    $64, %eax
- xorq    $63, %rdi
- subl    %edi, %eax
+ leal    1(%rdi), %eax
...
- bsrq    %rdi, %rdi
- movl    $63, %eax
- xorq    $63, %rdi
- subl    %edi, %eax
+ bsrq    %rdi, %rax

pr78103-3.s:
  bsrl    %edi, %edi
- movl    $32, %eax
- xorl    $31, %edi
- movslq  %edi, %rdi
- subq    %rdi, %rax
+ leaq    1(%rdi), %rax
...
- bsrl    %edi, %edi
- movl    $31, %eax
- xorl    $31, %edi
- movslq  %edi, %rdi
- subq    %rdi, %rax
+ bsrl    %edi, %eax
...
  bsrq    %rdi, %rdi
- movl    $64, %eax
- xorq    $63, %rdi
- movslq  %edi, %rdi
- subq    %rdi, %rax
+ leaq    1(%rdi), %rax
...
- bsrq    %rdi, %rdi
- movl    $63, %eax
- xorq    $63, %rdi
- movslq  %edi, %rdi
- subq    %rdi, %rax
+ bsrq    %rdi, %rax

Most of the changes are done with combine splitters, but for
*bsr_rex64_2 and *bsr_2 I had to use define_insn_and_split.  As
mentioned in the PR, the combiner unfortunately doesn't create
LOG_LINKS between the two insns produced by a combine splitter, so
they can't be combined further with the following instructions.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2021-07-28  Jakub Jelinek

        PR target/78103
        * config/i386/i386.md (*bsr_rex64_1, *bsr_1, *bsr_zext_1): New
        define_insn patterns.
        (*bsr_rex64_2, *bsr_2): New define_insn_and_split patterns.
        Add combine splitters for constant - clz.
        (clz2): Use a temporary pseudo for bsr result.

        * gcc.target/i386/pr78103-1.c: New test.
        * gcc.target/i386/pr78103-2.c: New test.
        * gcc.target/i386/pr78103-3.c: New test.
--- gcc/config/i386/i386.md.jj	2021-07-27 09:47:30.311970004 +0200
+++ gcc/config/i386/i386.md	2021-07-27 15:37:59.011394624 +0200
@@ -14761,6 +14761,18 @@ (define_insn "bsr_rex64"
    (set_attr "znver1_decode" "vector")
    (set_attr "mode" "DI")])
 
+(define_insn "*bsr_rex64_1"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (minus:DI (const_int 63)
+                  (clz:DI (match_operand:DI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_LZCNT && TARGET_64BIT"
+  "bsr{q}\t{%1, %0|%0, %1}"
+  [(set_attr "type" "alu1")
+   (set_attr "prefix_0f" "1")
+   (set_attr "znver1_decode" "vector")
+   (set_attr "mode" "DI")])
+
 (define_insn "bsr"
   [(set (reg:CCZ FLAGS_REG)
         (compare:CCZ (match_operand:SI 1 "nonimmediate_operand" "rm")
@@ -14775,17 +14787,210 @@ (define_insn "bsr"
    (set_attr "znver1_decode" "vector")
    (set_attr "mode" "SI")])
 
+(define_insn "*bsr_1"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+        (minus:SI (const_int 31)
+                  (clz:SI (match_operand:SI 1 "nonimmediate_operand" "rm"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_LZCNT"
+  "bsr{l}\t{%1, %0|%0, %1}"
+  [(set_attr "type" "alu1")
+   (set_attr "prefix_0f" "1")
+   (set_attr "znver1_decode" "vector")
+   (set_attr "mode" "SI")])
+
+(define_insn "*bsr_zext_1"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (zero_extend:DI
+          (minus:SI
+            (const_int 31)
+            (clz:SI (match_operand:SI 1 "nonimmediate_operand" "rm")))))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_LZCNT && TARGET_64BIT"
+  "bsr{l}\t{%1, %k0|%k0, %1}"
+  [(set_attr "type" "alu1")
+   (set_attr "prefix_0f" "1")
+   (set_attr "znver1_decode" "vector")
+   (set_attr "mode" "SI")])
+
+; As bsr is undefined behavior on zero and for other input
+; values it is in range 0 to 63, we can optimize away sign-extends.
+(define_insn_and_split "*bsr_rex64_2"
+  [(set (match_operand:DI 0 "register_operand")
+        (xor:DI
+          (sign_extend:DI
+            (minus:SI
+              (const_int 63)
+              (subreg:SI (clz:DI (match_operand:DI 1 "nonimmediate_operand"))
+                         0)))
+          (const_int 63)))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_LZCNT && TARGET_64BIT && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(parallel [(set (reg:CCZ FLAGS_REG)
+                   (compare:CCZ (match_dup 1) (const_int 0)))
+              (set (match_dup 2)
+                   (minus:DI (const_int 63) (clz:DI (match_dup 1))))])
+   (parallel [(set (match_dup 0)
+                   (zero_extend:DI (xor:SI (match_dup 3) (const_int 63))))
+              (clobber (reg:CC FLAGS_REG))])]
+{
+  operands[2] = gen_reg_rtx (DImode);
+  operands[3] = lowpart_subreg (SImode, operands[2], DImode);
+})
+
+(define_insn_and_split "*bsr_2"
+  [(set (match_operand:DI 0 "register_operand")
+        (sign_extend:DI
+          (xor:SI
+            (minus:SI
+              (const_int 31)
+              (clz:SI (match_operand:SI 1 "nonimmediate_operand")))
+            (const_int 31))))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_LZCNT && TARGET_64BIT && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(parallel [(set (reg:CCZ FLAGS_REG)
+                   (compare:CCZ (match_dup 1) (const_int 0)))
+              (set (match_dup 2)
+                   (minus:SI (const_int 31) (clz:SI (match_dup 1))))])
+   (parallel [(set (match_dup 0)
+                   (zero_extend:DI (xor:SI (match_dup 2) (const_int 31))))
+              (clobber (reg:CC FLAGS_REG))])]
+  "operands[2] = gen_reg_rtx (SImode);")
+
+; Splitters to optimize 64 - __builtin_clzl (x) or 32 - __builtin_clz (x).
+; Again, as for !TARGET_LZCNT CLZ is UB at zero, CLZ is guaranteed to be
+; in [0, 63] or [0, 31] range.
+(define_split
+  [(set (match_operand:SI 0 "register_operand")
+        (minus:SI
+          (match_operand:SI 2 "const_int_operand")
+          (xor:SI
+            (minus:SI (const_int 63)
+                      (subreg:SI
+                        (clz:DI (match_operand:DI 1 "nonimmediate_operand"))
+                        0))
+            (const_int 63))))]
+  "!TARGET_LZCNT && TARGET_64BIT && ix86_pre_reload_split ()"
+  [(set (match_dup 3) (minus:DI (const_int 63) (clz:DI (match_dup 1))))
+   (set (match_dup 0) (plus:SI (match_dup 5) (match_dup 4)))]
+{
+  operands[3] = gen_reg_rtx (DImode);
+  operands[5] = lowpart_subreg (SImode, operands[3], DImode);
+  if (INTVAL (operands[2]) == 63)
+    {
+      rtx tem = gen_rtx_CLZ (DImode, operands[1]);
+      tem = gen_rtx_MINUS (DImode, operands[2], tem);
+      tem = gen_rtx_SET (operands[3], tem);
+      emit_insn (tem);
+      emit_move_insn (operands[0], operands[5]);
+      DONE;
+    }
+  operands[4] = gen_int_mode (UINTVAL (operands[2]) - 63, SImode);
+})
+
+(define_split
+  [(set (match_operand:SI 0 "register_operand")
+        (minus:SI
+          (match_operand:SI 2 "const_int_operand")
+          (xor:SI
+            (minus:SI (const_int 31)
+                      (clz:SI (match_operand:SI 1 "nonimmediate_operand")))
+            (const_int 31))))]
+  "!TARGET_LZCNT && ix86_pre_reload_split ()"
+  [(set (match_dup 3) (minus:SI (const_int 31) (clz:SI (match_dup 1))))
+   (set (match_dup 0) (plus:SI (match_dup 3) (match_dup 4)))]
+{
+  if (INTVAL (operands[2]) == 31)
+    {
+      rtx tem = gen_rtx_CLZ (SImode, operands[1]);
+      tem = gen_rtx_MINUS (SImode, operands[2], tem);
+      tem = gen_rtx_SET (operands[0], tem);
+      emit_insn (tem);
+      DONE;
+    }
+  operands[3] = gen_reg_rtx (SImode);
+  operands[4] = gen_int_mode (UINTVAL (operands[2]) - 31, SImode);
+})
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+        (minus:DI
+          (match_operand:DI 2 "const_int_operand")
+          (xor:DI
+            (sign_extend:DI
+              (minus:SI (const_int 63)
+                        (subreg:SI
+                          (clz:DI (match_operand:DI 1 "nonimmediate_operand"))
+                          0)))
+            (const_int 63))))]
+  "!TARGET_LZCNT
+   && TARGET_64BIT
+   && ix86_pre_reload_split ()
+   && ((unsigned HOST_WIDE_INT)
+       trunc_int_for_mode (UINTVAL (operands[2]) - 63, SImode)
+       == UINTVAL (operands[2]) - 63)"
+  [(set (match_dup 3) (minus:DI (const_int 63) (clz:DI (match_dup 1))))
+   (set (match_dup 0) (plus:DI (match_dup 3) (match_dup 4)))]
+{
+  if (INTVAL (operands[2]) == 63)
+    {
+      rtx tem = gen_rtx_CLZ (DImode, operands[1]);
+      tem = gen_rtx_MINUS (DImode, operands[2], tem);
+      tem = gen_rtx_SET (operands[0], tem);
+      emit_insn (tem);
+      DONE;
+    }
+  operands[3] = gen_reg_rtx (DImode);
+  operands[4] = GEN_INT (UINTVAL (operands[2]) - 63);
+})
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+        (minus:DI
+          (match_operand:DI 2 "const_int_operand")
+          (sign_extend:DI
+            (xor:SI
+              (minus:SI (const_int 31)
+                        (clz:SI (match_operand:SI 1 "nonimmediate_operand")))
+              (const_int 31)))))]
+  "!TARGET_LZCNT
+   && TARGET_64BIT
+   && ix86_pre_reload_split ()
+   && ((unsigned HOST_WIDE_INT)
+       trunc_int_for_mode (UINTVAL (operands[2]) - 31, SImode)
+       == UINTVAL (operands[2]) - 31)"
+  [(set (match_dup 3)
+        (zero_extend:DI (minus:SI (const_int 31) (clz:SI (match_dup 1)))))
+   (set (match_dup 0) (plus:DI (match_dup 3) (match_dup 4)))]
+{
+  if (INTVAL (operands[2]) == 31)
+    {
+      rtx tem = gen_rtx_CLZ (SImode, operands[1]);
+      tem = gen_rtx_MINUS (SImode, operands[2], tem);
+      tem = gen_rtx_ZERO_EXTEND (DImode, tem);
+      tem = gen_rtx_SET (operands[0], tem);
+      emit_insn (tem);
+      DONE;
+    }
+  operands[3] = gen_reg_rtx (DImode);
+  operands[4] = GEN_INT (UINTVAL (operands[2]) - 31);
+})
+
 (define_expand "clz2"
   [(parallel
      [(set (reg:CCZ FLAGS_REG)
         (compare:CCZ (match_operand:SWI48 1 "nonimmediate_operand" "rm")
                      (const_int 0)))
-      (set (match_operand:SWI48 0 "register_operand")
-           (minus:SWI48
-             (match_dup 2)
-             (clz:SWI48 (match_dup 1))))])
+      (set (match_dup 3) (minus:SWI48
+                           (match_dup 2)
+                           (clz:SWI48 (match_dup 1))))])
    (parallel
-     [(set (match_dup 0) (xor:SWI48 (match_dup 0) (match_dup 2)))
+     [(set (match_operand:SWI48 0 "register_operand")
+           (xor:SWI48 (match_dup 3) (match_dup 2)))
       (clobber (reg:CC FLAGS_REG))])]
   ""
 {
@@ -14795,6 +15000,7 @@ (define_expand "clz2"
       DONE;
     }
   operands[2] = GEN_INT (GET_MODE_BITSIZE (mode)-1);
+  operands[3] = gen_reg_rtx (mode);
 })
 
 (define_insn_and_split "clz2_lzcnt"
--- gcc/testsuite/gcc.target/i386/pr78103-1.c.jj	2021-07-27 10:29:13.278547362 +0200
+++ gcc/testsuite/gcc.target/i386/pr78103-1.c	2021-07-27 10:29:13.278547362 +0200
@@ -0,0 +1,28 @@
+/* PR target/78103 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-lzcnt" } */
+/* { dg-final { scan-assembler-not {\mcltq\M} } } */
+
+long long
+foo (long long x)
+{
+  return __builtin_clzll (x);
+}
+
+long long
+bar (long long x)
+{
+  return (unsigned int) __builtin_clzll (x);
+}
+
+long long
+baz (int x)
+{
+  return __builtin_clz (x);
+}
+
+long long
+qux (int x)
+{
+  return (unsigned int) __builtin_clz (x);
+}
--- gcc/testsuite/gcc.target/i386/pr78103-2.c.jj	2021-07-27 10:29:13.278547362 +0200
+++ gcc/testsuite/gcc.target/i386/pr78103-2.c	2021-07-27 10:29:13.278547362 +0200
@@ -0,0 +1,33 @@
+/* PR target/78103 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mno-lzcnt" } */
+/* { dg-final { scan-assembler-not {\mmovl\M} } } */
+/* { dg-final { scan-assembler-not {\mxor[lq]\M} } } */
+/* { dg-final { scan-assembler-not {\msubl\M} } } */
+/* { dg-final { scan-assembler {\m(leal|addl)\M} } } */
+
+unsigned int
+foo (unsigned int x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned int) - __builtin_clz (x);
+}
+
+unsigned int
+bar (unsigned int x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned int) - 1 - __builtin_clz (x);
+}
+
+#ifdef __x86_64__
+unsigned int
+baz (unsigned long long x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned long long) - __builtin_clzll (x);
+}
+
+unsigned int
+qux (unsigned long long x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned long long) - 1 - __builtin_clzll (x);
+}
+#endif
--- gcc/testsuite/gcc.target/i386/pr78103-3.c.jj	2021-07-27 15:45:51.421113690 +0200
+++ gcc/testsuite/gcc.target/i386/pr78103-3.c	2021-07-27 15:45:35.678323393 +0200
@@ -0,0 +1,32 @@
+/* PR target/78103 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-lzcnt" } */
+/* { dg-final { scan-assembler-not {\mmovl\M} } } */
+/* { dg-final { scan-assembler-not {\mmovslq\M} } } */
+/* { dg-final { scan-assembler-not {\mxor[lq]\M} } } */
+/* { dg-final { scan-assembler-not {\msubq\M} } } */
+/* { dg-final { scan-assembler {\m(leaq|addq)\M} } } */
+
+unsigned long long
+foo (unsigned int x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned int) - __builtin_clz (x);
+}
+
+unsigned long long
+bar (unsigned int x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned int) - 1 - __builtin_clz (x);
+}
+
+unsigned long long
+baz (unsigned long long x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned long long) - __builtin_clzll (x);
+}
+
+unsigned long long
+qux (unsigned long long x)
+{
+  return __CHAR_BIT__ * sizeof (unsigned long long) - 1 - __builtin_clzll (x);
+}

	Jakub