From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 869CD3972006 for ; Wed, 28 Oct 2020 17:44:28 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 869CD3972006 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 16C58176C; Wed, 28 Oct 2020 10:44:28 -0700 (PDT) Received: from [192.168.1.19] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 5DB8F3F719; Wed, 28 Oct 2020 10:44:27 -0700 (PDT) Subject: Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code To: Christophe Lyon Cc: gcc Patches References: <1601409053-17310-1-git-send-email-christophe.lyon@linaro.org> <106645a8-8542-6095-6dd6-4454a878d98f@foss.arm.com> <11736e3f-6e4b-7a42-e034-10e77473cfcc@foss.arm.com> <643448ae-7797-5883-ec7c-75eff98d0db0@foss.arm.com> <4604bc17-3ceb-6324-4286-402d2d9da174@foss.arm.com> <9cf6f7e4-2ba6-51fd-115a-8c9b6d9ef772@foss.arm.com> From: Richard Earnshaw Message-ID: <1813e83b-c3a0-00ed-2224-c5fc6a9c399c@foss.arm.com> Date: Wed, 28 Oct 2020 17:44:26 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-3499.4 required=5.0 tests=BAYES_00, GIT_PATCH_0, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, NICE_REPLY_A, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 28 Oct 2020 17:44:31 -0000 On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote: > On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote: >> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw >> wrote: >>> >>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote: >>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw >>>> wrote: >>>>> >>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote: >>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw >>>>>> wrote: >>>>>>> >>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote: >>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote: >>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote: >>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta >>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of >>>>>>>>>>>>>>>>> movs/adds/lsls. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta >>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them >>>>>>>>>>>>>>>>> with -mpure-code. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by >>>>>>>>>>>>>>>>> eg. pr46287-3.C >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2020-09-29 Christophe Lyon >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> gcc/ >>>>>>>>>>>>>>>>> * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and >>>>>>>>>>>>>>>>> do not emit function address and delta when -mpure-code is used. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Richard, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks for your comments. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There are some optimizations you can make to this code. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better >>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add. >>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed >>>>>>>>>>>>>>> before, so I can write something similar indeed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split >>>>>>>>>>>>>>> when -mpure-code is NOT used. >>>>>>>>>>>>>>> Consider: >>>>>>>>>>>>>>> int f3_1 (void) { return 510; } >>>>>>>>>>>>>>> int f3_2 (void) { return 511; } >>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0: >>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>> ldr r0, .L4 >>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by >>>>>>>>>>>>>>> not having to load the constant? >>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into >>>>>>>>>>>>>>> account when comparing code size, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account. >>>>>>>>>>>>>> It might happen that the entry could be shared with another use of that >>>>>>>>>>>>>> literal, but in general that's rare. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three >>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading >>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended? >>>>>>>>>>>>>> >>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not >>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due >>>>>>>>>>>>>> to the limited pc offsets we can use. I think on balance we probably >>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is also true for >>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all >>>>>>>>>>>>>>>> this. >>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as: >>>>>>>>>>>>>>> f3_1: >>>>>>>>>>>>>>> movs r0, #255 >>>>>>>>>>>>>>> lsls r0, r0, #1 >>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>> f3_2: >>>>>>>>>>>>>>> movs r0, #1 >>>>>>>>>>>>>>> lsls r0, r0, #8 >>>>>>>>>>>>>>> adds r0, r0, #255 >>>>>>>>>>>>>>> bx lr >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be >>>>>>>>>>>>>> emitting instructions directly rather than generating RTL. The examples >>>>>>>>>>>>>> you show here are not thunks. >>>>>>>>>>>>>> >>>>>>>>>>>>> OK thanks for the clarification. >>>>>>>>>>>>> >>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make >>>>>>>>>>>>> review easier. >>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770: >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html >>>>>>>>>>>>> >>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand. >>>>>>>>>>>>> >>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier, >>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or >>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the >>>>>>>>>>>>> same place, >>>>>>>>>>>>> in case we need to improve/fix it later. We now generate shorter sequences in >>>>>>>>>>>>> several cases matching your comments. >>>>>>>>>>>>> >>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and >>>>>>>>>>>>> calls thumb1_gen_const_int. >>>>>>>>>>>>> >>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls >>>>>>>>>>>>> thumb1_gen_const_int. >>>>>>>>>>>> >>>>>>>>>>>> Yuk! Those changes to thumb1_gen_const_int are horrible. >>>>>>>>>>>> >>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use >>>>>>>>>>>> C++ now to do much better than that, for example by making that function >>>>>>>>>>>> a template. For example (and this is just a sketch): >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Indeed! I didn't think about it since there is no other use of >>>>>>>>>>> templates in arm.c yet. >>>>>>>>>>> I'll send an update soon. >>>>>>>>>>> >>>>>>>>>>> Other than that, does the approach look OK to you? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Yes, I think this is heading in the right direction. Bringing the two >>>>>>>>>> immediate generating operations into a single function can only be a >>>>>>>>>> good thing. >>>>>>>>>> >>>>>>>>>> Looking again at your example constant sequences, I see: >>>>>>>>>> >>>>>>>>>> 0x1000010: >>>>>>>>>> movs r3, #16 >>>>>>>>>> lsls r3, #16 >>>>>>>>>> adds r3, #1 >>>>>>>>>> lsls r3, #4 >>>>>>>>>> 0x1000011: >>>>>>>>>> movs r3, #1 >>>>>>>>>> lsls r3, #24 >>>>>>>>>> adds r3, #17 >>>>>>>>>> >>>>>>>>>> The first of these looks odd, given the second sequence. Why doesn't >>>>>>>>>> the first expand to >>>>>>>>>> >>>>>>>>>> 0x1000010: >>>>>>>>>> movs r3, #16 >>>>>>>>>> lsls r3, #16 >>>>>>>>>> adds r3, #16 >>>>>>>>>> >>>>>>>>> Err, I mean to: >>>>>>>>> >>>>>>>>> >>>>>>>>> 0x1000010: >>>>>>>>> movs r3, #1 >>>>>>>>> lsls r3, #24 >>>>>>>>> adds r3, #16 >>>>>>>>> >>>>>>>>> ? >>>>>>>> >>>>>>>> Because I first try to right-shift the constant, hoping to reduce its >>>>>>>> range and need less instructions to build the higher part, then >>>>>>>> left-shift back. >>>>>>>> >>>>>>>> In this particular case, we'd need to realize that there are many >>>>>>>> zeros "inside" the constant. >>>>>>>> >>>>>>>> If I remove the part that tries to reduce the range, I do get that >>>>>>>> sequence, but for 764 I now generate >>>>>>>> movs r3, #2 >>>>>>>> lsls r3, #8 >>>>>>>> adds r3, #252 >>>>>>>> instead of >>>>>>>> movs r3, #191 >>>>>>>> lsls r3, #2 >>>>>>>> >>>>>>>> A possibility would be to try both approaches and keep the shortest one. >>>>>>> >>>>>>> Lets leave that for now, it's not important to fixing the main issue; >>>>>>> but we should remember we need to come back to it at some point. >>>>>>> >>>>>> Thanks, that's what I was thinking too. >>>>>> >>>>>>> There are other tricks as well, such as >>>>>>> >>>>>>> 0xffffff >>>>>>> >>>>>>> can be done as >>>>>>> >>>>>>> 0x1000000 - 1 >>>>>>> >>>>>>> and >>>>>>> >>>>>>> 0xfffffd >>>>>>> >>>>>>> as >>>>>>> >>>>>>> 0x1000000 - 3 >>>>>>> >>>>>>> but these can wait as well. >>>>>>> >>>>>> >>>>>> Didn't we already need to handle such tricks? I'm surprised this >>>>>> wasn't needed earlier. >>>>>> >>>>> >>>>> I don't think we ever worried about them. Most of them need at least 3 >>>>> instructions so aren't a code size saving over using a literal pool entry. >>>>> >>>> OK, this will also help when using -mslow-flash-data. >>>> >>>> Here are updated patches, now using a template as you suggested. >>> >>> Looking better, but when I try to apply this to my local tree patch 2 >>> fails (I'm not exactly sure why, what was your baseline for these >>> patches?) >> I have the tree patches in this thread on top of these other two: >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html >> >> They have gradual improvements to thumb1_movsi_insn. >> >>> -- that patch looks suspicious anyway, you're replacing code >>> that prints out assembly with code that generates RTL. >> Right! I took me a while to understand how I could miss this, sorry. >> That was caused by improper testing, as this part of the code >> isn't used when targetting cortex-m0. I have added a testcase >> for cortex-m23 which crashes with the previous version of patch 2, >> and succeeds now. >> >>> Could you also rename t1_print and t1_rtl to thumb1_const_print and >>> thumb1_const_rtl. I think the names as they stand are likely to be too >>> generic. >> OK, done. >> >> How about this new version? >> I'm not yet sure about the most appropriate naming for: >> thumb1_gen_const_int >> thumb1_gen_const_int_asm >> should they be >> thumb1_gen_const_int_rtl >> thumb1_gen_const_int_print >> to be consistent with the new classes? > > It would probably be better, yes. > > More detailed comments below. > > R. > >> >> Thanks, >> >> Christophe >> >>> R. >>> >>>> >>>> Thanks, >>>> >>>> Christophe >>>> >>>>> R. >>>>> >>>>>> >>>>>>> >>>>>>> R. >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> >>>>>>>>>> R. >>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Christophe >>>>>>>>>>> >>>>>>>>>>>> class t1_rtl >>>>>>>>>>>> { >>>>>>>>>>>> public: >>>>>>>>>>>> void ashift(int a) { gen_rtx_ASHIFT(a); } >>>>>>>>>>>> void rshift(int b) { gen_rtx_SHIFTRT(b); } >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> class t1_print >>>>>>>>>>>> { >>>>>>>>>>>> public: >>>>>>>>>>>> t1_print (FILE *f) : t_file(f) {} >>>>>>>>>>>> void ashift (int a) { fprintf (t_file, "a shift %d\n", a); } >>>>>>>>>>>> void rshift (int b) { fprintf (t_file, "r shift %d\n", b); } >>>>>>>>>>>> private: >>>>>>>>>>>> FILE *t_file; >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> template >>>>>>>>>>>> void thumb1_gen_const_int(T t, int f) >>>>>>>>>>>> { >>>>>>>>>>>> // Expansion of thumb1_gen_const_int ... >>>>>>>>>>>> t.ashift(f); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> // Usage... >>>>>>>>>>>> void f1() >>>>>>>>>>>> { >>>>>>>>>>>> // Use the RTL expander >>>>>>>>>>>> t1_rtl g; >>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> void f2() >>>>>>>>>>>> { >>>>>>>>>>>> // Use the printf expander writing to stdout >>>>>>>>>>>> t1_print g(stdout); >>>>>>>>>>>> thumb1_gen_const_int (g, 3); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry >>>>>>>>>>>> about which expander is being used in each instance and the template >>>>>>>>>>>> expansion will use the right version. >>>>>>>>>>>> >>>>>>>>>>>> R. >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I also suspect (but haven't check) that the base adjustment will >>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4). If that is >>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an >>>>>>>>>>>>>>>> even greater range of literals. >>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case, >>>>>>>>>>>>>>> I'll update my patch. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> More generally, any sequence of up to >>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the >>>>>>>>>>>>>>>> existing literal pool fall back. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will >>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could >>>>>>>>>>>>>>>> be merged into a single shift-by-16. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at >>>>>>>>>>>>>>>> all possible. >>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase >>>>>>>>>>>>>>> that showed the problem. There are also: >>>>>>>>>>>>>>> g++.dg/opt/thunk1.C >>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C >>>>>>>>>>>>>>> g++.dg/torture/pr46287.C >>>>>>>>>>>>>>> g++.dg/torture/pr45699.C >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add >>>>>>>>>>>>>>> -mpure-code in dg-options? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated >>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other >>>>>>>>>>>>>> sets of options on the command line and the fall-out from that). If >>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs >>>>>>>>>>>>>> with it enabled and that should be sufficient. >>>>>>>>>>>>> >>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a >>>>>>>>>>>>> modified simulator. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> >>>>>>>>>>>>> Christophe >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> R. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Christophe >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> R. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> k# (use "git pull" to merge the remote branch into yours) >>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>> gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++--------------- >>>>>>>>>>>>>>>>> 1 file changed, 66 insertions(+), 25 deletions(-) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644 >>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c >>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> if (mi_delta > 255) >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> - fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>> - fputs ("+4\n", file); >>>>>>>>>>>>>>>>> + /* With -mpure-code, we cannot load delta from the constant >>>>>>>>>>>>>>>>> + pool: we build it explicitly. */ >>>>>>>>>>>>>>>>> + if (target_pure_code) >>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>> + bool mov_done_p = false; >>>>>>>>>>>>>>>>> + int i; >>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>> + /* Emit upper 3 bytes if needed. */ >>>>>>>>>>>>>>>>> + for (i = 0; i < 3; i++) >>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>> + int byte = (mi_delta >> (8 * (3 - i))) & 0xff; >>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>> + if (byte) >>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", byte); >>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", byte); >>>>>>>>>>>>>>>>> + mov_done_p = true; >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>> + if (mov_done_p) >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tlsls\tr3, #8\n"); >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>> + /* Emit lower byte if needed. */ >>>>>>>>>>>>>>>>> + if (!mov_done_p) >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff); >>>>>>>>>>>>>>>>> + else if (mi_delta & 0xff) >>>>>>>>>>>>>>>>> + asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff); >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>> + fputs ("\tldr\tr3, ", file); >>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>> + fputs ("+4\n", file); >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> asm_fprintf (file, "\t%ss\t%r, %r, r3\n", >>>>>>>>>>>>>>>>> mi_op, this_regno, this_regno); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta, >>>>>>>>>>>>>>>>> fputs ("\tpop\t{r3}\n", file); >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> fprintf (file, "\tbx\tr12\n"); >>>>>>>>>>>>>>>>> - ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>> - assemble_name (file, label); >>>>>>>>>>>>>>>>> - fputs (":\n", file); >>>>>>>>>>>>>>>>> - if (flag_pic) >>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>> + /* With -mpure-code, we don't need to emit literals for the >>>>>>>>>>>>>>>>> + function address and delta since we emitted code to build >>>>>>>>>>>>>>>>> + them. */ >>>>>>>>>>>>>>>>> + if (!target_pure_code) >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>> - rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>> - /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC >>>>>>>>>>>>>>>>> - pipeline offset is four rather than eight. Adjust the offset >>>>>>>>>>>>>>>>> - accordingly. */ >>>>>>>>>>>>>>>>> - tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>> - TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>>> - tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>> - tem, >>>>>>>>>>>>>>>>> - gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>> - ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>> - assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>>>> - else >>>>>>>>>>>>>>>>> - /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>> - assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>> + ASM_OUTPUT_ALIGN (file, 2); >>>>>>>>>>>>>>>>> + assemble_name (file, label); >>>>>>>>>>>>>>>>> + fputs (":\n", file); >>>>>>>>>>>>>>>>> + if (flag_pic) >>>>>>>>>>>>>>>>> + { >>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn". */ >>>>>>>>>>>>>>>>> + rtx tem = XEXP (DECL_RTL (function), 0); >>>>>>>>>>>>>>>>> + /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC >>>>>>>>>>>>>>>>> + pipeline offset is four rather than eight. Adjust the offset >>>>>>>>>>>>>>>>> + accordingly. */ >>>>>>>>>>>>>>>>> + tem = plus_constant (GET_MODE (tem), tem, >>>>>>>>>>>>>>>>> + TARGET_THUMB1_ONLY ? -3 : -7); >>>>>>>>>>>>>>>>> + tem = gen_rtx_MINUS (GET_MODE (tem), >>>>>>>>>>>>>>>>> + tem, >>>>>>>>>>>>>>>>> + gen_rtx_SYMBOL_REF (Pmode, >>>>>>>>>>>>>>>>> + ggc_strdup (labelpc))); >>>>>>>>>>>>>>>>> + assemble_integer (tem, 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> + else >>>>>>>>>>>>>>>>> + /* Output ".word .LTHUNKn". */ >>>>>>>>>>>>>>>>> + assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> - if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>> - assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>> + if (TARGET_THUMB1_ONLY && mi_delta > 255) >>>>>>>>>>>>>>>>> + assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1); >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> > > +class thumb1_const_rtl > ... > + void mov (int val) > > This should take a HOST_WIDE_INT. Similarly for add and shift. The > same applies to the asm version as well. > > + asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val); > > Should be using reg_names[dst_regno] in all cases. In fact, you might > want to move that lookup to the constructor and just save a pointer to > the string there. You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED Correction, for a (signed) HOST_WIDE_INT, this should be HOST_WIDE_INT_PRINT_DEC. > rather than "%d" for the immediate. > > +template > +void > +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1) > +{ > + bool mov_done_p = false; > + int val = op1; > > This potentially silently loses precision. In fact, I think you really > want to use "unsigned HOST_WIDE_INT" throughout the following code, so > that the right shifts aren't undefined if dealing with negative numbers. > > For safety, you should also have an assertion in here that > > op1 == trunc_int_for_mode (op1, SImode) > > + int shift = 0; > + int i; > + > + if (val == 0) > > You can short-circuit 0..255 here for a quick exit. > > + { > + dst.mov (val); > + return; > + } > > Another trick: if the top nine bits of the 32-bit value are all set, > you're probably going to be better off (and certainly not worse off) by > generating -op1 and then negating the result in a final step - you can > do that via recursion. > > + > + /* In the general case, we need 7 instructions to build > + a 32 bits constant (1 movs, 3 lsls, 3 adds). We can > + do better if VAL is small enough, or > + right-shiftable by a suitable amount. If the > + right-shift enables to encode at least one less byte, > + it's worth it: we save a adds and a lsls at the > + expense of a final lsls. */ > + int final_shift = number_of_first_bit_set (val); > + > + int leading_zeroes = clz_hwi (val); > + int number_of_bytes_needed > + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes) > + / BITS_PER_UNIT) + 1; > + int number_of_bytes_needed2 > + = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift) > + / BITS_PER_UNIT) + 1; > + > + if (number_of_bytes_needed2 < number_of_bytes_needed) > + val >>= final_shift; > + else > + final_shift = 0; > + > + /* If we are in a very small range, we can use either a single movs > + or movs+adds. */ > + if ((val >= 0) && (val <= 510)) > > if val is made unsigned HWI as I suggest, the lower bounds test is not > needed. > > + { > + if (val > 255) > + { > + int high = val - 255; > > Again, watch your types. > > + > + dst.mov (high); > + dst.add (255); > + } > + else > + dst.mov (val); > + > + if (final_shift > 0) > + dst.ashift (final_shift); > + } > + else > + { > + /* General case, emit upper 3 bytes as needed. */ > + for (i = 0; i < 3; i++) > + { > + int byte = (val >> (8 * (3 - i))) & 0xff; > > and here. > > + > + if (byte) > + { > + /* We are about to emit new bits, stop accumulating a > + shift amount, and left-shift only if we have already > + emitted some upper bits. */ > + if (mov_done_p) > + { > + dst.ashift (shift); > + dst.add (byte); > + } > + else > + dst.mov (byte); > + > + /* Stop accumulating shift amount since we've just > + emitted some bits. */ > + shift = 0; > + > + mov_done_p = true; > + } > + > + if (mov_done_p) > + shift += 8; > + } > + > + /* Emit lower byte. */ > + if (!mov_done_p) > + dst.mov (val & 0xff); > + else > + { > + dst.ashift (shift); > + if (val & 0xff) > + dst.add (val & 0xff); > + } > + > + if (final_shift > 0) > + dst.ashift (final_shift); > + } > +} > + >