From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <Richard.Earnshaw@foss.arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id 97D933858039
 for <gcc-patches@gcc.gnu.org>; Mon,  2 Nov 2020 14:28:48 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 97D933858039
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 539CF30E;
 Mon,  2 Nov 2020 06:28:48 -0800 (PST)
Received: from [192.168.1.19] (unknown [172.31.20.19])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8DC633F66E;
 Mon,  2 Nov 2020 06:28:47 -0800 (PST)
Subject: Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with
 -mpure-code
To: Christophe Lyon <christophe.lyon@linaro.org>
Cc: gcc Patches <gcc-patches@gcc.gnu.org>
References: <1601409053-17310-1-git-send-email-christophe.lyon@linaro.org>
 <11736e3f-6e4b-7a42-e034-10e77473cfcc@foss.arm.com>
 <643448ae-7797-5883-ec7c-75eff98d0db0@foss.arm.com>
 <CAKdteOY1pSafRK98818z94GMr-vu+CHU=iaPdiEmx3FSn_rfcg@mail.gmail.com>
 <4604bc17-3ceb-6324-4286-402d2d9da174@foss.arm.com>
 <CAKdteOZx-QETLx=z+GGavzcYRX1dMQKEm44Lp=yafC167a1eaA@mail.gmail.com>
 <d99dc7ec-955e-2dbf-e689-322569da29ac@foss.arm.com>
 <CAKdteOb58iQDcS+_1UmgKOHe_ycg27rJ=Eo8KPBADsHSNp0xVQ@mail.gmail.com>
 <9cf6f7e4-2ba6-51fd-115a-8c9b6d9ef772@foss.arm.com>
 <CAKdteOaHhWxw4cbnimND1eS22SBtam7SbUoo3MwxMcvt_G91Tw@mail.gmail.com>
 <fc0813f8-4692-fb44-0f87-3a9d9483b3d0@foss.arm.com>
 <1813e83b-c3a0-00ed-2224-c5fc6a9c399c@foss.arm.com>
 <CAKdteOY-5vvfY8YokcPFv4FrMm3g0-Pp0CKE2J=ZRS7XDEL0NQ@mail.gmail.com>
 <d5cf6bdd-16be-c033-f9e0-c2f1e170bf30@foss.arm.com>
 <d8a13cf5-69af-1325-f6a2-7cf2707fa91a@foss.arm.com>
 <CAKdteOYmPe8cMSOa3fwxB+HcpbPkYc28-Y0+uWbrVKnDpSB6sQ@mail.gmail.com>
From: Richard Earnshaw <Richard.Earnshaw@foss.arm.com>
Message-ID: <07c87679-dda6-a0e0-d082-6a29d09457d3@foss.arm.com>
Date: Mon, 2 Nov 2020 14:28:45 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.10.0
MIME-Version: 1.0
In-Reply-To: <CAKdteOYmPe8cMSOa3fwxB+HcpbPkYc28-Y0+uWbrVKnDpSB6sQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-3498.5 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_ASCII_DIVIDERS, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT,
 NICE_REPLY_A, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 02 Nov 2020 14:28:51 -0000

On 02/11/2020 10:24, Christophe Lyon via Gcc-patches wrote:
> Hi,
> 
> On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
>>> On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
>>>> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>
>>>>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>>>>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>>>>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>
>>>>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>>>>>>>> review easier.
>>>>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>>>>>>>>>> same place,
>>>>>>>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>>>>>>>> good thing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>>>>         adds    r3, #1
>>>>>>>>>>>>>>>         lsls    r3, #4
>>>>>>>>>>>>>>> 0x1000011:
>>>>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>>>>         adds    r3, #17
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>>>>>>>> the first expand to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Err, I mean to:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>>>>>>>> range and need less instructions to build the higher part, then
>>>>>>>>>>>>> left-shift back.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>>>>>>>> zeros "inside" the constant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>>>>>>>> sequence, but for 764 I now generate
>>>>>>>>>>>>> movs    r3, #2
>>>>>>>>>>>>> lsls    r3, #8
>>>>>>>>>>>>> adds    r3, #252
>>>>>>>>>>>>> instead of
>>>>>>>>>>>>> movs    r3, #191
>>>>>>>>>>>>> lsls    r3, #2
>>>>>>>>>>>>>
>>>>>>>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>>>>>>>
>>>>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>>>>>>>> but we should remember we need to come back to it at some point.
>>>>>>>>>>>>
>>>>>>>>>>> Thanks, that's what I was thinking too.
>>>>>>>>>>>
>>>>>>>>>>>> There are other tricks as well, such as
>>>>>>>>>>>>
>>>>>>>>>>>>   0xffffff
>>>>>>>>>>>>
>>>>>>>>>>>> can be done as
>>>>>>>>>>>>
>>>>>>>>>>>>   0x1000000 - 1
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>>
>>>>>>>>>>>>   0xfffffd
>>>>>>>>>>>>
>>>>>>>>>>>> as
>>>>>>>>>>>>
>>>>>>>>>>>>   0x1000000 - 3
>>>>>>>>>>>>
>>>>>>>>>>>> but these can wait as well.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>>>>>>>> wasn't needed earlier.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>>>>>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>>>>>>>
>>>>>>>>> OK, this will also help when using -mslow-flash-data.
>>>>>>>>>
>>>>>>>>> Here are updated patches, now using a template as you suggested.
>>>>>>>>
>>>>>>>> Looking better, but when I try to apply this to my local tree patch 2
>>>>>>>> fails (I'm not exactly sure why, what was your baseline for these
>>>>>>>> patches?)
>>>>>>> I have the tree patches in this thread on top of these other two:
>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
>>>>>>>
>>>>>>> They have gradual improvements to thumb1_movsi_insn.
>>>>>>>
>>>>>>>> -- that patch looks suspicious anyway, you're replacing code
>>>>>>>> that prints out assembly with code that generates RTL.
>>>>>>> Right! I took me a while to understand how I could miss this, sorry.
>>>>>>> That was caused by improper testing, as this part of the code
>>>>>>> isn't used when targetting cortex-m0. I have added a testcase
>>>>>>> for cortex-m23 which crashes with the previous version of patch 2,
>>>>>>> and succeeds now.
>>>>>>>
>>>>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>>>>>>>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>>>>>>>> generic.
>>>>>>> OK, done.
>>>>>>>
>>>>>>> How about this new version?
>>>>>>> I'm not yet sure about the most appropriate naming for:
>>>>>>> thumb1_gen_const_int
>>>>>>> thumb1_gen_const_int_asm
>>>>>>> should they be
>>>>>>> thumb1_gen_const_int_rtl
>>>>>>> thumb1_gen_const_int_print
>>>>>>> to be consistent with the new classes?
>>>>>>
>>>>>> It would probably be better, yes.
>>>>>>
>>>>>> More detailed comments below.
>>>>>>
>>>>>> R.
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Christophe
>>>>>>>
>>>>>>>> R.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Christophe
>>>>>>>>>
>>>>>>>>>> R.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> R.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> class t1_rtl
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> class t1_print
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>>>>>>>  private:
>>>>>>>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> template <class T>
>>>>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> // Usage...
>>>>>>>>>>>>>>>>> void f1()
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> void f2()
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> +class thumb1_const_rtl
>>>>>> ...
>>>>>> +  void mov (int val)
>>>>>>
>>>>>> This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
>>>>>> same applies to the asm version as well.
>>>>>>
>>>>>> +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
>>>>>>
>>>>>> Should be using reg_names[dst_regno] in all cases.  In fact, you might
>>>>>> want to move that lookup to the constructor and just save a pointer to
>>>>>> the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
>>>>>
>>>>> Correction, for a (signed) HOST_WIDE_INT, this should be
>>>>> HOST_WIDE_INT_PRINT_DEC.
>>>>>
>>>>
>>>> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
>>>> two classes
>>>> can have unsigned HOST_WIDE_INT parameters, and thus use
>>>> HOST_WIDE_INT_PRINT_UNSIGNED?
>>>>
>>>
>>> It's generally safer to print it as a signed value.  We probably won't
>>> have cases where the top bit of a 32-bit word are set here; but if you
>>> do, and the value is unsigned, you end up with 16 digit hex numbers
>>> rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT
>>> is at least 64 bits in size.
>>>
>>
>> I don't mean hex numbers, of course, just very large decimal numbers.
>> But the point is still the same.
>>
> 
> Here is an updated version, which hopefully addresses your comments,
> and adds testcases for cortex-m0 and cortex-m23.
> 

This is OK.

Thanks,

R.

> Christophe
> 
>> R.
>>
>>> R.
>>>
>>>>
>>>>>> rather than "%d" for the immediate.
>>>>>>
>>>>>> +template <class T>
>>>>>> +void
>>>>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
>>>>>> +{
>>>>>> +  bool mov_done_p = false;
>>>>>> +  int val = op1;
>>>>>>
>>>>>> This potentially silently loses precision.  In fact, I think you really
>>>>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so
>>>>>> that the right shifts aren't undefined if dealing with negative numbers.
>>>>>>
>>>>>> For safety, you should also have an assertion in here that
>>>>>>
>>>>>>   op1 == trunc_int_for_mode (op1, SImode)
>>>>>>
>>>>>> +  int shift = 0;
>>>>>> +  int i;
>>>>>> +
>>>>>> +  if (val == 0)
>>>>>>
>>>>>> You can short-circuit 0..255 here for a quick exit.
>>>>>>
>>>>>> +    {
>>>>>> +      dst.mov (val);
>>>>>> +      return;
>>>>>> +    }
>>>>>>
>>>>>> Another trick: if the top nine bits of the 32-bit value are all set,
>>>>>> you're probably going to be better off (and certainly not worse off) by
>>>>>> generating -op1 and then negating the result in a final step - you can
>>>>>> do that via recursion.
>>>>>>
>>>>>> +
>>>>>> +  /* In the general case, we need 7 instructions to build
>>>>>> +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
>>>>>> +     do better if VAL is small enough, or
>>>>>> +     right-shiftable by a suitable amount.  If the
>>>>>> +     right-shift enables to encode at least one less byte,
>>>>>> +     it's worth it: we save a adds and a lsls at the
>>>>>> +     expense of a final lsls.  */
>>>>>> +  int final_shift = number_of_first_bit_set (val);
>>>>>> +
>>>>>> +  int leading_zeroes = clz_hwi (val);
>>>>>> +  int number_of_bytes_needed
>>>>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
>>>>>> +       / BITS_PER_UNIT) + 1;
>>>>>> +  int number_of_bytes_needed2
>>>>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
>>>>>> +       / BITS_PER_UNIT) + 1;
>>>>>> +
>>>>>> +  if (number_of_bytes_needed2 < number_of_bytes_needed)
>>>>>> +    val >>= final_shift;
>>>>>> +  else
>>>>>> +    final_shift = 0;
>>>>>> +
>>>>>> +  /* If we are in a very small range, we can use either a single movs
>>>>>> +     or movs+adds.  */
>>>>>> +  if ((val >= 0) && (val <= 510))
>>>>>>
>>>>>> if val is made unsigned HWI as I suggest, the lower bounds test is not
>>>>>> needed.
>>>>>>
>>>>>> +    {
>>>>>> +      if (val > 255)
>>>>>> +     {
>>>>>> +       int high = val - 255;
>>>>>>
>>>>>> Again, watch your types.
>>>>>>
>>>>>> +
>>>>>> +       dst.mov (high);
>>>>>> +       dst.add (255);
>>>>>> +     }
>>>>>> +      else
>>>>>> +     dst.mov (val);
>>>>>> +
>>>>>> +      if (final_shift > 0)
>>>>>> +     dst.ashift (final_shift);
>>>>>> +    }
>>>>>> +  else
>>>>>> +    {
>>>>>> +      /* General case, emit upper 3 bytes as needed.  */
>>>>>> +      for (i = 0; i < 3; i++)
>>>>>> +     {
>>>>>> +       int byte = (val >> (8 * (3 - i))) & 0xff;
>>>>>>
>>>>>> and here.
>>>>>>
>>>>>> +
>>>>>> +       if (byte)
>>>>>> +         {
>>>>>> +           /* We are about to emit new bits, stop accumulating a
>>>>>> +              shift amount, and left-shift only if we have already
>>>>>> +              emitted some upper bits.  */
>>>>>> +           if (mov_done_p)
>>>>>> +             {
>>>>>> +               dst.ashift (shift);
>>>>>> +               dst.add (byte);
>>>>>> +             }
>>>>>> +           else
>>>>>> +             dst.mov (byte);
>>>>>> +
>>>>>> +           /* Stop accumulating shift amount since we've just
>>>>>> +              emitted some bits.  */
>>>>>> +           shift = 0;
>>>>>> +
>>>>>> +           mov_done_p = true;
>>>>>> +         }
>>>>>> +
>>>>>> +       if (mov_done_p)
>>>>>> +         shift += 8;
>>>>>> +     }
>>>>>> +
>>>>>> +      /* Emit lower byte.  */
>>>>>> +      if (!mov_done_p)
>>>>>> +     dst.mov (val & 0xff);
>>>>>> +      else
>>>>>> +     {
>>>>>> +       dst.ashift (shift);
>>>>>> +       if (val & 0xff)
>>>>>> +         dst.add (val & 0xff);
>>>>>> +     }
>>>>>> +
>>>>>> +      if (final_shift > 0)
>>>>>> +     dst.ashift (final_shift);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>
>>>>>
>>>
>>