From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 97D933858039 for ; Mon, 2 Nov 2020 14:28:48 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 97D933858039 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 539CF30E; Mon, 2 Nov 2020 06:28:48 -0800 (PST) Received: from [192.168.1.19] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8DC633F66E; Mon, 2 Nov 2020 06:28:47 -0800 (PST) Subject: Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code To: Christophe Lyon Cc: gcc Patches References: <1601409053-17310-1-git-send-email-christophe.lyon@linaro.org> <11736e3f-6e4b-7a42-e034-10e77473cfcc@foss.arm.com> <643448ae-7797-5883-ec7c-75eff98d0db0@foss.arm.com> <4604bc17-3ceb-6324-4286-402d2d9da174@foss.arm.com> <9cf6f7e4-2ba6-51fd-115a-8c9b6d9ef772@foss.arm.com> <1813e83b-c3a0-00ed-2224-c5fc6a9c399c@foss.arm.com> From: Richard Earnshaw Message-ID: <07c87679-dda6-a0e0-d082-6a29d09457d3@foss.arm.com> Date: Mon, 2 Nov 2020 14:28:45 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-3498.5 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_ASCII_DIVIDERS, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, NICE_REPLY_A, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 02 Nov 2020 14:28:51 -0000 On 02/11/2020 10:24, Christophe Lyon via Gcc-patches wrote: > Hi, > > On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw > wrote: >> >> On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote: >>> On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote: >>>> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw >>>> wrote: >>>>> >>>>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote: >>>>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: >>>>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw >>>>>>> wrote: >>>>>>>> >>>>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: >>>>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: >>>>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta >>>>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of >>>>>>>>>>>>>>>>>>>>>> movs/adds/lsls. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta >>>>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them >>>>>>>>>>>>>>>>>>>>>> with -mpure-code. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by >>>>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> gcc/ >>>>>>>>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and >>>>>>>>>>>>>>>>>>>>>> do not emit function address and delta when -mpure-code is used. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Richard, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for your comments. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better >>>>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. >>>>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed >>>>>>>>>>>>>>>>>>>> before, so I can write something similar indeed. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split >>>>>>>>>>>>>>>>>>>> when -mpure-code is NOT used. >>>>>>>>>>>>>>>>>>>> Consider: >>>>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; } >>>>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; } >>>>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: >>>>>>>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>>>>>>> ldr r0, .L4 >>>>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by >>>>>>>>>>>>>>>>>>>> not having to load the constant? >>>>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into >>>>>>>>>>>>>>>>>>>> account when comparing code size, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account. >>>>>>>>>>>>>>>>>>> It might happen that the entry could be shared with another use of that >>>>>>>>>>>>>>>>>>> literal, but in general that's rare. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three >>>>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading >>>>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not >>>>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due >>>>>>>>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance we probably >>>>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> This is also true for >>>>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all >>>>>>>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as: >>>>>>>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>>>>>>> movs r0, #1 >>>>>>>>>>>>>>>>>>>> lsls r0, r0, #8 >>>>>>>>>>>>>>>>>>>> adds r0, r0, #255 >>>>>>>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be >>>>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL. The examples >>>>>>>>>>>>>>>>>>> you show here are not thunks. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> OK thanks for the clarification. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make >>>>>>>>>>>>>>>>>> review easier. >>>>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770: >>>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier, >>>>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or >>>>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the >>>>>>>>>>>>>>>>>> same place, >>>>>>>>>>>>>>>>>> in case we need to improve/fix it later. We now generate shorter sequences in >>>>>>>>>>>>>>>>>> several cases matching your comments. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and >>>>>>>>>>>>>>>>>> calls thumb1_gen_const_int. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls >>>>>>>>>>>>>>>>>> thumb1_gen_const_int. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use >>>>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that function >>>>>>>>>>>>>>>>> a template. For example (and this is just a sketch): >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of >>>>>>>>>>>>>>>> templates in arm.c yet. >>>>>>>>>>>>>>>> I'll send an update soon. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Other than that, does the approach look OK to you? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes, I think this is heading in the right direction. Bringing the two >>>>>>>>>>>>>>> immediate generating operations into a single function can only be a >>>>>>>>>>>>>>> good thing. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Looking again at your example constant sequences, I see: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 0x1000010: >>>>>>>>>>>>>>> movs r3, #16 >>>>>>>>>>>>>>> lsls r3, #16 >>>>>>>>>>>>>>> adds r3, #1 >>>>>>>>>>>>>>> lsls r3, #4 >>>>>>>>>>>>>>> 0x1000011: >>>>>>>>>>>>>>> movs r3, #1 >>>>>>>>>>>>>>> lsls r3, #24 >>>>>>>>>>>>>>> adds r3, #17 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The first of these looks odd, given the second sequence. Why doesn't >>>>>>>>>>>>>>> the first expand to >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 0x1000010: >>>>>>>>>>>>>>> movs r3, #16 >>>>>>>>>>>>>>> lsls r3, #16 >>>>>>>>>>>>>>> adds r3, #16 >>>>>>>>>>>>>>> >>>>>>>>>>>>>> Err, I mean to: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> 0x1000010: >>>>>>>>>>>>>> movs r3, #1 >>>>>>>>>>>>>> lsls r3, #24 >>>>>>>>>>>>>> adds r3, #16 >>>>>>>>>>>>>> >>>>>>>>>>>>>> ? >>>>>>>>>>>>> >>>>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its >>>>>>>>>>>>> range and need less instructions to build the higher part, then >>>>>>>>>>>>> left-shift back. >>>>>>>>>>>>> >>>>>>>>>>>>> In this particular case, we'd need to realize that there are many >>>>>>>>>>>>> zeros "inside" the constant. >>>>>>>>>>>>> >>>>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that >>>>>>>>>>>>> sequence, but for 764 I now generate >>>>>>>>>>>>> movs r3, #2 >>>>>>>>>>>>> lsls r3, #8 >>>>>>>>>>>>> adds r3, #252 >>>>>>>>>>>>> instead of >>>>>>>>>>>>> movs r3, #191 >>>>>>>>>>>>> lsls r3, #2 >>>>>>>>>>>>> >>>>>>>>>>>>> A possibility would be to try both approaches and keep the shortest one. >>>>>>>>>>>> >>>>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue; >>>>>>>>>>>> but we should remember we need to come back to it at some point. >>>>>>>>>>>> >>>>>>>>>>> Thanks, that's what I was thinking too. >>>>>>>>>>> >>>>>>>>>>>> There are other tricks as well, such as >>>>>>>>>>>> >>>>>>>>>>>> 0xffffff >>>>>>>>>>>> >>>>>>>>>>>> can be done as >>>>>>>>>>>> >>>>>>>>>>>> 0x1000000 - 1 >>>>>>>>>>>> >>>>>>>>>>>> and >>>>>>>>>>>> >>>>>>>>>>>> 0xfffffd >>>>>>>>>>>> >>>>>>>>>>>> as >>>>>>>>>>>> >>>>>>>>>>>> 0x1000000 - 3 >>>>>>>>>>>> >>>>>>>>>>>> but these can wait as well. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this >>>>>>>>>>> wasn't needed earlier. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I don't think we ever worried about them. Most of them need at least 3 >>>>>>>>>> instructions so aren't a code size saving over using a literal pool entry. >>>>>>>>>> >>>>>>>>> OK, this will also help when using -mslow-flash-data. >>>>>>>>> >>>>>>>>> Here are updated patches, now using a template as you suggested. >>>>>>>> >>>>>>>> Looking better, but when I try to apply this to my local tree patch 2 >>>>>>>> fails (I'm not exactly sure why, what was your baseline for these >>>>>>>> patches?) >>>>>>> I have the tree patches in this thread on top of these other two: >>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html >>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html >>>>>>> >>>>>>> They have gradual improvements to thumb1_movsi_insn. >>>>>>> >>>>>>>> -- that patch looks suspicious anyway, you're replacing code >>>>>>>> that prints out assembly with code that generates RTL. >>>>>>> Right! I took me a while to understand how I could miss this, sorry. >>>>>>> That was caused by improper testing, as this part of the code >>>>>>> isn't used when targetting cortex-m0. I have added a testcase >>>>>>> for cortex-m23 which crashes with the previous version of patch 2, >>>>>>> and succeeds now. >>>>>>> >>>>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and >>>>>>>> thumb1_const_rtl. I think the names as they stand are likely to be too >>>>>>>> generic. >>>>>>> OK, done. >>>>>>> >>>>>>> How about this new version? >>>>>>> I'm not yet sure about the most appropriate naming for: >>>>>>> thumb1_gen_const_int >>>>>>> thumb1_gen_const_int_asm >>>>>>> should they be >>>>>>> thumb1_gen_const_int_rtl >>>>>>> thumb1_gen_const_int_print >>>>>>> to be consistent with the new classes? >>>>>> >>>>>> It would probably be better, yes. >>>>>> >>>>>> More detailed comments below. >>>>>> >>>>>> R. >>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Christophe >>>>>>> >>>>>>>> R. >>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Christophe >>>>>>>>> >>>>>>>>>> R. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> R. >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> class t1_rtl >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> public: >>>>>>>>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } >>>>>>>>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } >>>>>>>>>>>>>>>>> }; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> class t1_print >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> public: >>>>>>>>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} >>>>>>>>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >>>>>>>>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >>>>>>>>>>>>>>>>> private: >>>>>>>>>>>>>>>>> FILE *t_file; >>>>>>>>>>>>>>>>> }; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> template >>>>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f) >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> // Expansion of thumb1_gen_const_int ... >>>>>>>>>>>>>>>>> t.ashift(f); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> // Usage... >>>>>>>>>>>>>>>>> void f1() >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> // Use the RTL expander >>>>>>>>>>>>>>>>> t1_rtl g; >>>>>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> void f2() >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> // Use the printf expander writing to stdout >>>>>>>>>>>>>>>>> t1_print g(stdout); >>>>>>>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry >>>>>>>>>>>>>>>>> about which expander is being used in each instance and the template >>>>>>>>>>>>>>>>> expansion will use the right version. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I also suspect (but haven't check) that the base adjustment will >>>>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). If that is >>>>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an >>>>>>>>>>>>>>>>>>>>> even greater range of literals. >>>>>>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case, >>>>>>>>>>>>>>>>>>>> I'll update my patch. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> More generally, any sequence of up to >>>>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the >>>>>>>>>>>>>>>>>>>>> existing literal pool fall back. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will >>>>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could >>>>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at >>>>>>>>>>>>>>>>>>>>> all possible. >>>>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase >>>>>>>>>>>>>>>>>>>> that showed the problem. There are also: >>>>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C >>>>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C >>>>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C >>>>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add >>>>>>>>>>>>>>>>>>>> -mpure-code in dg-options? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated >>>>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other >>>>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that). If >>>>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs >>>>>>>>>>>>>>>>>>> with it enabled and that should be sufficient. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a >>>>>>>>>>>>>>>>>> modified simulator. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++--------------- >>>>>>>>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 >>>>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>>>> if (mi_delta > 255) >>>>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>>>>>>> - fputs ("+4\n", file); >>>>>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the constant >>>>>>>>>>>>>>>>>>>>>> + pool: we build it explicitly. */ >>>>>>>>>>>>>>>>>>>>>> + if (target_pure_code) >>>>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>>>> + bool mov_done_p = false; >>>>>>>>>>>>>>>>>>>>>> + int i; >>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>>>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) >>>>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; >>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>> + if (byte) >>>>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", byte); >>>>>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", byte); >>>>>>>>>>>>>>>>>>>>>> + mov_done_p = true; >>>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ >>>>>>>>>>>>>>>>>>>>>> + if (!mov_done_p) >>>>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff); >>>>>>>>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) >>>>>>>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff); >>>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>>>>>>> + fputs ("+4\n", file); >>>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>>>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); >>>>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>>>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>>>>>>> - fputs (":\n", file); >>>>>>>>>>>>>>>>>>>>>> - if (flag_pic) >>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals for the >>>>>>>>>>>>>>>>>>>>>> + function address and delta since we emitted code to build >>>>>>>>>>>>>>>>>>>>>> + them. */ >>>>>>>>>>>>>>>>>>>>>> + if (!target_pure_code) >>>>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC >>>>>>>>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. Adjust the offset >>>>>>>>>>>>>>>>>>>>>> - accordingly. */ >>>>>>>>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>>>>>>> - tem, >>>>>>>>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>>>>>>> - ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>>>>>>>>> - else >>>>>>>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>>>>>>> + fputs (":\n", file); >>>>>>>>>>>>>>>>>>>>>> + if (flag_pic) >>>>>>>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC >>>>>>>>>>>>>>>>>>>>>> + pipeline offset is four rather than eight. Adjust the offset >>>>>>>>>>>>>>>>>>>>>> + accordingly. */ >>>>>>>>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>>>>>>> + tem, >>>>>>>>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>>>>>>> + ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>>>> +class thumb1_const_rtl >>>>>> ... >>>>>> + void mov (int val) >>>>>> >>>>>> This should take a HOST_WIDE_INT. Similarly for add and shift. The >>>>>> same applies to the asm version as well. >>>>>> >>>>>> + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); >>>>>> >>>>>> Should be using reg_names[dst_regno] in all cases. In fact, you might >>>>>> want to move that lookup to the constructor and just save a pointer to >>>>>> the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED >>>>> >>>>> Correction, for a (signed) HOST_WIDE_INT, this should be >>>>> HOST_WIDE_INT_PRINT_DEC. >>>>> >>>> >>>> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the >>>> two classes >>>> can have unsigned HOST_WIDE_INT parameters, and thus use >>>> HOST_WIDE_INT_PRINT_UNSIGNED? >>>> >>> >>> It's generally safer to print it as a signed value. We probably won't >>> have cases where the top bit of a 32-bit word are set here; but if you >>> do, and the value is unsigned, you end up with 16 digit hex numbers >>> rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT >>> is at least 64 bits in size. >>> >> >> I don't mean hex numbers, of course, just very large decimal numbers. >> But the point is still the same. >> > > Here is an updated version, which hopefully addresses your comments, > and adds testcases for cortex-m0 and cortex-m23. > This is OK. Thanks, R. > Christophe > >> R. >> >>> R. >>> >>>> >>>>>> rather than "%d" for the immediate. >>>>>> >>>>>> +template >>>>>> +void >>>>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) >>>>>> +{ >>>>>> + bool mov_done_p = false; >>>>>> + int val = op1; >>>>>> >>>>>> This potentially silently loses precision. In fact, I think you really >>>>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so >>>>>> that the right shifts aren't undefined if dealing with negative numbers. >>>>>> >>>>>> For safety, you should also have an assertion in here that >>>>>> >>>>>> op1 == trunc_int_for_mode (op1, SImode) >>>>>> >>>>>> + int shift = 0; >>>>>> + int i; >>>>>> + >>>>>> + if (val == 0) >>>>>> >>>>>> You can short-circuit 0..255 here for a quick exit. >>>>>> >>>>>> + { >>>>>> + dst.mov (val); >>>>>> + return; >>>>>> + } >>>>>> >>>>>> Another trick: if the top nine bits of the 32-bit value are all set, >>>>>> you're probably going to be better off (and certainly not worse off) by >>>>>> generating -op1 and then negating the result in a final step - you can >>>>>> do that via recursion. >>>>>> >>>>>> + >>>>>> + /* In the general case, we need 7 instructions to build >>>>>> + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can >>>>>> + do better if VAL is small enough, or >>>>>> + right-shiftable by a suitable amount. If the >>>>>> + right-shift enables to encode at least one less byte, >>>>>> + it's worth it: we save a adds and a lsls at the >>>>>> + expense of a final lsls. */ >>>>>> + int final_shift = number_of_first_bit_set (val); >>>>>> + >>>>>> + int leading_zeroes = clz_hwi (val); >>>>>> + int number_of_bytes_needed >>>>>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) >>>>>> + / BITS_PER_UNIT) + 1; >>>>>> + int number_of_bytes_needed2 >>>>>> + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) >>>>>> + / BITS_PER_UNIT) + 1; >>>>>> + >>>>>> + if (number_of_bytes_needed2 < number_of_bytes_needed) >>>>>> + val >>= final_shift; >>>>>> + else >>>>>> + final_shift = 0; >>>>>> + >>>>>> + /* If we are in a very small range, we can use either a single movs >>>>>> + or movs+adds. */ >>>>>> + if ((val >= 0) && (val <= 510)) >>>>>> >>>>>> if val is made unsigned HWI as I suggest, the lower bounds test is not >>>>>> needed. >>>>>> >>>>>> + { >>>>>> + if (val > 255) >>>>>> + { >>>>>> + int high = val - 255; >>>>>> >>>>>> Again, watch your types. >>>>>> >>>>>> + >>>>>> + dst.mov (high); >>>>>> + dst.add (255); >>>>>> + } >>>>>> + else >>>>>> + dst.mov (val); >>>>>> + >>>>>> + if (final_shift > 0) >>>>>> + dst.ashift (final_shift); >>>>>> + } >>>>>> + else >>>>>> + { >>>>>> + /* General case, emit upper 3 bytes as needed. */ >>>>>> + for (i = 0; i < 3; i++) >>>>>> + { >>>>>> + int byte = (val >> (8 * (3 - i))) & 0xff; >>>>>> >>>>>> and here. >>>>>> >>>>>> + >>>>>> + if (byte) >>>>>> + { >>>>>> + /* We are about to emit new bits, stop accumulating a >>>>>> + shift amount, and left-shift only if we have already >>>>>> + emitted some upper bits. */ >>>>>> + if (mov_done_p) >>>>>> + { >>>>>> + dst.ashift (shift); >>>>>> + dst.add (byte); >>>>>> + } >>>>>> + else >>>>>> + dst.mov (byte); >>>>>> + >>>>>> + /* Stop accumulating shift amount since we've just >>>>>> + emitted some bits. */ >>>>>> + shift = 0; >>>>>> + >>>>>> + mov_done_p = true; >>>>>> + } >>>>>> + >>>>>> + if (mov_done_p) >>>>>> + shift += 8; >>>>>> + } >>>>>> + >>>>>> + /* Emit lower byte. */ >>>>>> + if (!mov_done_p) >>>>>> + dst.mov (val & 0xff); >>>>>> + else >>>>>> + { >>>>>> + dst.ashift (shift); >>>>>> + if (val & 0xff) >>>>>> + dst.add (val & 0xff); >>>>>> + } >>>>>> + >>>>>> + if (final_shift > 0) >>>>>> + dst.ashift (final_shift); >>>>>> + } >>>>>> +} >>>>>> + >>>>>> >>>>> >>> >>