Subject: Re: BLKmode parameters are stored in unaligned stack slot when passed via registers.
To: Richard Biener
Cc: "gcc@gcc.gnu.org"
References: <2f9b6580-4c7d-d29d-157c-24fe6dd8f781@arm.com>
From: Renlin Li
Message-ID: <122112c3-91cd-bc68-4a28-adf4fea42ee4@foss.arm.com>
Date: Tue, 06 Mar 2018 20:03:00 -0000

Hi Richard,

On 06/03/18 16:04, Richard Biener wrote:
> On Tue, Mar 6, 2018 at 4:21 PM, Renlin Li wrote:
>> Hi all,
>>
>> The problem described here probably only affects targets whose ABI allows
>> passing structured arguments of a certain size via registers.
>>
>> If the mode of the parameter type is BLKmode, then in the callee, during RTL
>> expansion, a stack slot is reserved for this parameter, and the incoming
>> value is copied into the stack slot.
>>
>> However, the stack slot for the parameter will not be aligned if the
>> alignment of the parameter type exceeds MAX_SUPPORTED_STACK_ALIGNMENT.
>> Unaligned memory accesses may then cause run-time errors.
>>
>> For a local variable on the stack, the alignment of the data type is honored,
>> although the documentation states that this is not guaranteed.
>>
>> For example:
>>
>> #include <stdint.h>
>> union U {
>>   uint32_t M0;
>>   uint32_t M1;
>>   uint32_t M2;
>>   uint32_t M3;
>> } __attribute((aligned(16)));
>>
>> void tmp (union U *);
>> void foo (union U P0)
>> {
>>   union U P1 = P0;
>>   tmp (&P1);
>> }
>>
>> The code-gen for armv7-a is like this:
>>
>> foo:
>>         @ args = 0, pretend = 0, frame = 48
>>         @ frame_needed = 0, uses_anonymous_args = 0
>>         str     lr, [sp, #-4]!
>>         sub     sp, sp, #52
>>         mov     ip, sp
>>         stm     ip, {r0, r1, r2, r3}  --> ip is not 128-bit aligned
>>         add     lr, sp, #39
>>         bic     lr, lr, #15
>>         ldm     ip, {r0, r1, r2, r3}
>>         stm     lr, {r0, r1, r2, r3}  --> lr is 128-bit aligned
>>         mov     r0, lr
>>         bl      tmp
>>         add     sp, sp, #52
>>         @ sp needed
>>         ldr     pc, [sp], #4
>>
>> There are other obvious missed optimizations in the code generation above.
>> The stack slots for parameter P0 and local variable P1 could be merged,
>> so that some of the load/store instructions could be removed.
>> I think this is a known missed-optimization case.
>>
>> To summarize, there are two issues here:
>> 1, (wrong code) an unaligned stack slot is allocated for parameters during
>> function expansion.
>> 2, (missed optimization) the stack slot for a parameter is sometimes not
>> necessary.
>> In certain scenarios, the argument register could be used directly.
>> Currently, this is only possible when the parameter mode is not BLKmode.
>>
>> For issue 1, we can do something similar to expand_used_vars.
>> Dynamically align the stack slot address for parameters whose alignment
>> exceeds PREFERRED_STACK_BOUNDARY. Other parameters could be stored in the gap
>> between the aligned address and fp when possible.
>>
>> For issue 2, I checked the behavior of LLVM. It seems the stack slot
>> allocation for parameters is explicitly exposed by the alloca IR instruction
>> at the very beginning.
>> Later, there are optimization/transformation passes like mem2reg, reg2mem,
>> sroa etc. to remove unnecessary alloca instructions.
>>
>> In gcc, the stack allocation for parameters and local variables is done
>> implicitly during the expand pass,
>> and RTL passes are not able to remove the unnecessary stack allocation and
>> load/store operations.
>>
>> For example:
>>
>> uint32_t bar(union U P0)
>> {
>>   return P0.M0;
>> }
>>
>> Currently, the code-gen is different on different targets.
>> There are various backend hooks which make the code-gen sub-optimal.
>> For example, the aarch64 target could directly return with w0 while the
>> armv7-a target generates an unnecessary store and load.
>>
>> However, this optimization should be target independent, unrelated to the
>> target alignment configuration.
>> Both issues 1 and 2 could be resolved if gcc had a similar approach. But I
>> assume the change is big.
>>
>> Are there any suggestions for solving issue 1 and improving issue 2 in a
>> generic way?
>> I can create a bugzilla ticket to record the issue.
>
> What does the ABI say for passing such over-aligned data types?
>
> For solving 1) you could copy the argument as passed by the ABI
> to a properly aligned stack location in the callee.
>
> Generally it sounds like either the ABI doesn't specify anything
> or the ABI specifies something that violates user expectation.
>
> For 2) again, it is the ABI which specifies whether an argument
> is passed via the stack or via registers. So - what does the ABI say?

The compiler is doing the right thing here by passing the argument via registers.
To be specific, there are the following clauses in the arm PCS:

> B.5 If the argument is an alignment adjusted type its value is passed as a copy of the actual value. The
> copy will have an alignment defined as follows.
> ...
> For a Composite Type, the alignment of the copy will have 4-byte alignment if its natural alignment is
> <= 4 and 8-byte alignment if its natural alignment is >= 8
> C.3 If the argument requires double-word alignment (8-byte), the NCRN is rounded up to the next even
> register number.
> C.4 If the size in words of the argument is not more than r4 minus NCRN, the argument is copied into
> core registers, starting at the NCRN. The NCRN is incremented by the number of registers used.
> Successive registers hold the parts of the argument they would hold if its value were loaded into
> those registers from memory using an LDM instruction. The argument has now been allocated.

This is quite similar for other RISC machines.

Here, the problem is how arguments/parameters are received in the callee.
Storing the incoming parameters on the stack seems to be an implementation decision.

Even in the following case without over-alignment, the callee will save r0-r3 to the
local stack first and then load M3 from the local copy.

struct U {
  uint32_t M0;
  uint32_t M1;
  uint32_t M2;
  uint32_t M3;
};

int x (struct U p)
{
  return p.M3;
}

Regards,
Renlin

>
> Richard.
>
>> Regards,
>> Renlin
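
As a rough illustration of the PCS clauses quoted above (B.5, C.3, C.4) and of the add/bic rounding in the armv7-a code-gen, here is a minimal C sketch. The names pcs_copy_alignment, pcs_alloc_core, struct core_alloc and align_up are hypothetical helpers invented for this note; they are not part of gcc or of the PCS. Only the quoted clauses are modelled, so an argument that does not fit in r0-r3 is simply reported as "not in registers".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: B.5, alignment of the copy of a composite type.
   Natural alignment <= 4 gives 4 bytes, >= 8 gives 8 bytes. */
static inline unsigned pcs_copy_alignment(unsigned natural_alignment)
{
    return natural_alignment >= 8 ? 8 : 4;
}

/* Hypothetical result type: where one argument landed. */
struct core_alloc {
    bool in_registers;   /* true if C.4 placed it in core registers */
    unsigned first_reg;  /* first register number, valid if in_registers */
    unsigned next_ncrn;  /* NCRN after applying C.3 (and C.4 if it fit) */
};

/* Hypothetical helper: apply C.3 then C.4 to one composite argument.
   Later PCS rules (splitting, stack allocation) are NOT modelled. */
static inline struct core_alloc pcs_alloc_core(unsigned ncrn,
                                               unsigned size_bytes,
                                               unsigned natural_alignment)
{
    struct core_alloc r = { false, 0, ncrn };
    unsigned words = (size_bytes + 3) / 4;

    /* C.3: double-word alignment rounds NCRN up to an even register. */
    if (pcs_copy_alignment(natural_alignment) == 8)
        ncrn = (ncrn + 1) & ~1u;
    r.next_ncrn = ncrn;

    /* C.4: fits if size in words is not more than r4 minus NCRN. */
    if (ncrn <= 4 && words <= 4 - ncrn) {
        r.in_registers = true;
        r.first_reg = ncrn;
        r.next_ncrn = ncrn + words;
    }
    return r;
}

/* Round addr up to a power-of-two alignment. With alignment 16 this is
   the same arithmetic as "add lr, sp, #39; bic lr, lr, #15", which
   computes align_up(sp + 24, 16). */
static inline uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

For union U from the first example (16 bytes, natural alignment 16) with NCRN = 0, pcs_alloc_core(0, 16, 16) reports r0 as the first register and leaves NCRN at 4, which matches the stm ip, {r0, r1, r2, r3} in the callee. The copy only has 8-byte alignment under B.5, so a 16-byte aligned slot has to come from something like align_up in the callee.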