From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Subject: Re: using scratchpads to enhance RTL-level if-conversion: revised patch
To: Jeff Law, Abe
Cc: Sebastian Pop, Kyrill Tkachov, "gcc-patches@gcc.gnu.org"
From: Bernd Schmidt
Date: Wed, 14 Oct 2015 19:15:00 -0000
Message-ID: <561EA9D4.2070101@redhat.com>
In-Reply-To: <561E9458.5090701@redhat.com>
References: <5615AADE.4030306@yahoo.com> <56166E68.2040004@redhat.com> <561D5CC4.8030502@yahoo.com> <561D66AB.9090003@redhat.com> <561E9458.5090701@redhat.com>

On 10/14/2015 07:43 PM, Jeff Law wrote:
> Obviously some pessimization relative to current code is
> necessary to fix some of the problems WRT thread safety and avoiding
> things like introducing faults in code which did not previously fault.

Huh? This patch is purely an (attempt at) optimization, not something
that fixes any problems.

> However, pessimization of safe code is, err, um, bad and needs to be
> avoided.

Here's an example:

                                >  subq    $16, %rsp
  [...]
                                >  leaq    8(%rsp), %r8
                                >  leaq    256(%rax), %rdx
  cmpq    256(%rax), %rcx       |  cmpq    256(%rax), %rsi
  jne     .L97                  <
  movq    $0, 256(%rax)         <
  .L97:                         <
                                >  movq    %rdx, %rax
                                >  cmovne  %r8, %rax
                                >  movq    $0, (%rax)
  [...]
                                >  addq    $16, %rsp

In the worst case that executes six more instructions, and always causes
unnecessary stack frame bloat. This on x86 where AFAIK it's doubtful
whether cmov is a win at all anyway. I think this shows the approach is
just bad, even ignoring problems such as allocating multiple scratchpads
when one would suffice, or allocating one and then not using it because
the transformation fails.

I can't test valgrind right now (it fails to run on my machine), but I
guess it could adapt to allow stores slightly below the stack (maybe
warning once)? It seems like a bit of an edge case to worry about, but
if supporting it is critical and it can't be changed to adapt to new
optimizations, then I think we're probably better off without this
scratchpad transformation entirely.

Alternatively I can think of a few other possible approaches which
wouldn't require this kind of bloat:
* add support for allocating space in the stack redzone. That could be
  interesting for the register allocator as well. Would help only
  x86_64, but that's a large fraction of gcc's userbase.
* add support for opportunistically finding unused alignment padding
  in the existing stack frame. Less likely to work but would produce
  better results when it does.
* on embedded targets we probably don't have to worry about valgrind,
  so do the optimal (sp - x) thing there
* allocate a single global as the dummy target.
  Might be more expensive to load the address on some targets though.
* at least find a way to express costs for this transformation.
  Difficult since you don't yet necessarily know if the function is
  going to have a stack frame. Hence, IMO this approach is flawed.
  (You'll still want cost estimates even when not allocating stuff in
  the normal stack frame, because generated code will still execute
  between two and four extra instructions).


Bernd
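
For readers less familiar with the transformation under discussion, the following is a rough C-level sketch of what the assembly listing in this message corresponds to. It is an illustration only, not code from the patch: the pass works on RTL, and the names `struct obj`, `slot`, `key`, `dummy_target`, and the three function names are all made up here. The struct layout is an assumption chosen so that the field sits at offset 256, matching the `256(%rax)` operand in the listing.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Hypothetical layout: padding places `slot` at offset 256. */
struct obj {
    char pad[256];
    long slot;
};

static_assert(offsetof(struct obj, slot) == 256,
              "slot is expected at offset 256");

/* Branching form (left column of the listing): the store only
   happens when the comparison succeeds. */
void clear_if_match_branch(struct obj *p, long key)
{
    if (p->slot == key)
        p->slot = 0;
}

/* Scratchpad form (right column): the store is unconditional and
   only its address is selected. The local `scratch` stands in for
   the extra stack slot that causes the frame growth. */
void clear_if_match_scratchpad(struct obj *p, long key)
{
    long scratch;
    long *dst = (p->slot == key) ? &p->slot : &scratch;
    *dst = 0;
}

/* Variant with a single global dummy target, as suggested in the
   list of alternatives. No stack slot is needed, but the address
   of the global has to be materialized instead. */
long dummy_target;

void clear_if_match_global(struct obj *p, long key)
{
    long *dst = (p->slot == key) ? &p->slot : &dummy_target;
    *dst = 0;
}
```

All three functions leave `p->slot` in the same state; they differ only in where the store goes when the comparison fails. A compiler is free to turn any of them back into a branch or a conditional move, so this shows the shape of the transformation rather than the code generated for it.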