From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 30931 invoked by alias); 8 Jun 2016 16:40:47 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 30920 invoked by uid 89); 8 Jun 2016 16:40:46 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.3 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY,RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy= X-HELO: foss.arm.com Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 08 Jun 2016 16:40:34 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 3D45634; Wed, 8 Jun 2016 09:41:08 -0700 (PDT) Received: from [10.2.206.43] (e100706-lin.cambridge.arm.com [10.2.206.43]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 64EE53F21A; Wed, 8 Jun 2016 09:40:32 -0700 (PDT) Message-ID: <57584A7E.5080600@foss.arm.com> Date: Wed, 08 Jun 2016 16:40:00 -0000 From: Kyrill Tkachov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Christophe Lyon CC: GCC Patches Subject: Re: [PATCH][3/3][RTL ifcvt] PR middle-end/37780: Conditional expression with __builtin_clz() should be optimized out References: <5746C78B.3030006@foss.arm.com> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-SW-Source: 2016-06/txt/msg00626.txt.bz2 On 07/06/16 20:34, Christophe Lyon wrote: > On 26 May 2016 at 11:53, Kyrill Tkachov wrote: >> Hi all, >> >> In this PR we want to optimise: >> int foo (int i) >> { >> return (i == 0) ? N : __builtin_clz (i); >> } >> >> on targets where CLZ is defined at zero to the constant 'N'. >> This is determined at the RTL level through the CLZ_DEFINED_VALUE_AT_ZERO >> macro. >> The obvious place to implement this would be in combine through simplify-rtx >> where we'd >> recognise an IF_THEN_ELSE of the form: >> (set (reg:SI r1) >> (if_then_else:SI (ne (reg:SI r2) >> (const_int 0 [0])) >> (clz:SI (reg:SI r2)) >> (const_int 32))) >> >> and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd simplify >> it into >> just (clz:SI (reg:SI r2)). >> However, I found this doesn't quite happen for a couple of reasons: >> 1) This depends on ifcvt or some other pass to have created a conditional >> move of the >> two branches that provide the IF_THEN_ELSE to propagate the const_int and >> clz operation into. >> >> 2) Combine will refuse to propagate r2 from the above example into both the >> condition and the >> CLZ at the same time, so the most we see is: >> (set (reg:SI r1) >> (if_then_else:SI (ne (reg:CC cc) >> (const_int 0)) >> (clz:SI (reg:SI r2)) >> (const_int 32))) >> >> which is not enough information to perform the simplification. >> >> This patch implements the optimisation in ce1 using the noce ifcvt >> framework. >> During ifcvt noce_process_if_block can see that we're trying to optimise >> something >> of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of all >> the information >> needed to perform the transformation. >> >> The transformation is performed by adding a new noce_try* function that >> tries to put the >> condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and try to >> simplify that >> using the simplify-rtx machinery. That way, we can implement the >> simplification logic in >> simplify-rtx.c where it belongs. >> >> A similar transformation for CTZ is implemented as well. >> So for code: >> int foo (int i) >> { >> return (i == 0) ? 32 : __builtin_clz (i); >> } >> >> On aarch64 we now emit: >> foo: >> clz w0, w0 >> ret >> >> instead of: >> foo: >> mov w1, 32 >> clz w2, w0 >> cmp w0, 0 >> csel w0, w2, w1, ne >> ret >> >> and for arm similarly we generate: >> foo: >> clz r0, r0 >> bx lr >> >> instead of: >> foo: >> cmp r0, #0 >> clzne r0, r0 >> moveq r0, #32 >> bx lr >> >> >> and for x86_64 with -O2 -mlzcnt we generate: >> foo: >> xorl %eax, %eax >> lzcntl %edi, %eax >> ret >> >> instead of: >> foo: >> xorl %eax, %eax >> movl $32, %edx >> lzcntl %edi, %eax >> testl %edi, %edi >> cmove %edx, %eax >> ret >> >> >> I tried getting this to work on other targets as well, but encountered >> difficulties. >> For example on powerpc the two arms of the condition seen during ifcvt are: >> >> (insn 4 22 11 4 (set (reg:DI 156 [ ]) >> (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64} >> (nil)) >> and >> (insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ ]) 0) >> (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132 {clzsi2} >> (expr_list:REG_DEAD (reg/v:DI 157 [ i ]) >> (nil))) >> >> So the setup code in noce_process_if_block sees that the set destination is >> not the same >> ((reg:DI 156 [ ]) and (subreg/s/u:SI (reg:DI 156 [ ]) 0)) >> so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b)) check. >> I suppose that's a consequence of how SImode operations are represented in >> early RTL >> on powerpc, I don't know what to do there. Perhaps that part of ivcvt can be >> taught to handle >> destinations that are subregs of one another, but that would be a separate >> patch. >> >> Anyway, is this patch ok for trunk? >> >> Bootstrapped and tested on arm-none-linux-gnueabihf, aarch64-none-linux-gnu, >> x86_64-pc-linux-gnu. >> >> Thanks, >> Kyrill >> >> 2016-05-26 Kyrylo Tkachov >> >> PR middle-end/37780 >> * ifcvt.c (noce_try_ifelse_collapse): New function. >> Declare prototype. >> (noce_process_if_block): Call noce_try_ifelse_collapse. >> * simplify-rtx.c (simplify_cond_clz_ctz): New function. >> (simplify_ternary_operation): Use the above to simplify >> conditional CLZ/CTZ expressions. >> >> 2016-05-26 Kyrylo Tkachov >> >> PR middle-end/37780 >> * gcc.c-torture/execute/pr37780.c: New test. >> * gcc.target/aarch64/pr37780_1.c: Likewise. >> * gcc.target/arm/pr37780_1.c: Likewise. > Hi Kyrylo, Hi Christophe, > I've noticed that gcc.target/arm/pr37780_1.c fails on > arm if arch < v6. > I first tried to fix the effective-target guard (IIRC, the doc > says clz is available starting with v5t), but that isn't sufficient. > > When compiling for armv5-t, the scan-assembler directives > fail. It seems to work with v6t2, so I am wondering whether > it's just a matter of increasing the effective-target arch version, > or if you really intended to make the test pass on these old > architectures? I've dug into it a bit. I think the problem is that CLZ is available with ARMv5T but only in arm mode. In Thumb mode it's only available with ARMv6T2. So if your armv5t gcc compiles for Thumb by default you'll get the failure. I think just bumping the effective to armv6t2 here is appropriate as I intended the test to just check the ifcvt transformation when the instruction is available rather than test that the CLZ instruction is available at one architecture level or another. A patch to bump the effective target check is pre-approved if you want to do it. Otherwise I'll get to it in a few days. Thanks for spotting this, Kyrill > Thanks, > > Christophe.