public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
@ 2020-09-29 19:50 Christophe Lyon
  2020-10-06  8:30 ` Christophe Lyon
  2020-10-06 16:02 ` Richard Earnshaw
  0 siblings, 2 replies; 24+ messages in thread
From: Christophe Lyon @ 2020-09-29 19:50 UTC (permalink / raw)
  To: gcc-patches

When mi_delta is > 255 and -mpure-code is used, we cannot load delta
from code memory (like we do without -mpure-code).

This patch builds the value of mi_delta into r3 with a series of
movs/adds/lsls.

We also do some cleanup by not emitting the function address and delta
via .word directives at the end of the thunk since we don't use them
with -mpure-code.

No need for new testcases, this bug was already identified by
eg. pr46287-3.C

2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
	do not emit function address and delta when -mpure-code is used.

k#   (use "git pull" to merge the remote branch into yours)
---
 gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ceeb91f..62abeb5 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
     {
       if (mi_delta > 255)
 	{
-	  fputs ("\tldr\tr3, ", file);
-	  assemble_name (file, label);
-	  fputs ("+4\n", file);
+	  /* With -mpure-code, we cannot load delta from the constant
+	     pool: we build it explicitly.  */
+	  if (target_pure_code)
+	    {
+	      bool mov_done_p = false;
+	      int i;
+
+	      /* Emit upper 3 bytes if needed.  */
+	      for (i = 0; i < 3; i++)
+		{
+		  int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
+
+		  if (byte)
+		    {
+		      if (mov_done_p)
+			asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
+		      else
+			asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
+		      mov_done_p = true;
+		    }
+
+		  if (mov_done_p)
+		    asm_fprintf (file, "\tlsls\tr3, #8\n");
+		}
+
+	      /* Emit lower byte if needed.  */
+	      if (!mov_done_p)
+		asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
+	      else if (mi_delta & 0xff)
+		asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
+	    }
+	  else
+	    {
+	      fputs ("\tldr\tr3, ", file);
+	      assemble_name (file, label);
+	      fputs ("+4\n", file);
+	    }
 	  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
 		       mi_op, this_regno, this_regno);
 	}
@@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
 	fputs ("\tpop\t{r3}\n", file);
 
       fprintf (file, "\tbx\tr12\n");
-      ASM_OUTPUT_ALIGN (file, 2);
-      assemble_name (file, label);
-      fputs (":\n", file);
-      if (flag_pic)
+
+      /* With -mpure-code, we don't need to emit literals for the
+	 function address and delta since we emitted code to build
+	 them.  */
+      if (!target_pure_code)
 	{
-	  /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
-	  rtx tem = XEXP (DECL_RTL (function), 0);
-	  /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
-	     pipeline offset is four rather than eight.  Adjust the offset
-	     accordingly.  */
-	  tem = plus_constant (GET_MODE (tem), tem,
-			       TARGET_THUMB1_ONLY ? -3 : -7);
-	  tem = gen_rtx_MINUS (GET_MODE (tem),
-			       tem,
-			       gen_rtx_SYMBOL_REF (Pmode,
-						   ggc_strdup (labelpc)));
-	  assemble_integer (tem, 4, BITS_PER_WORD, 1);
-	}
-      else
-	/* Output ".word .LTHUNKn".  */
-	assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
+	  ASM_OUTPUT_ALIGN (file, 2);
+	  assemble_name (file, label);
+	  fputs (":\n", file);
+	  if (flag_pic)
+	    {
+	      /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
+	      rtx tem = XEXP (DECL_RTL (function), 0);
+	      /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
+		 pipeline offset is four rather than eight.  Adjust the offset
+		 accordingly.  */
+	      tem = plus_constant (GET_MODE (tem), tem,
+				   TARGET_THUMB1_ONLY ? -3 : -7);
+	      tem = gen_rtx_MINUS (GET_MODE (tem),
+				   tem,
+				   gen_rtx_SYMBOL_REF (Pmode,
+						       ggc_strdup (labelpc)));
+	      assemble_integer (tem, 4, BITS_PER_WORD, 1);
+	    }
+	  else
+	    /* Output ".word .LTHUNKn".  */
+	    assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
 
-      if (TARGET_THUMB1_ONLY && mi_delta > 255)
-	assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	  if (TARGET_THUMB1_ONLY && mi_delta > 255)
+	    assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	}
     }
   else
     {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-09-29 19:50 [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code Christophe Lyon
@ 2020-10-06  8:30 ` Christophe Lyon
  2020-10-06 16:02 ` Richard Earnshaw
  1 sibling, 0 replies; 24+ messages in thread
From: Christophe Lyon @ 2020-10-06  8:30 UTC (permalink / raw)
  To: gcc Patches

ping?

On Tue, 29 Sep 2020 at 21:50, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
>
> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> from code memory (like we do without -mpure-code).
>
> This patch builds the value of mi_delta into r3 with a series of
> movs/adds/lsls.
>
> We also do some cleanup by not emitting the function address and delta
> via .word directives at the end of the thunk since we don't use them
> with -mpure-code.
>
> No need for new testcases, this bug was already identified by
> eg. pr46287-3.C
>
> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>
>         gcc/
>         * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>         do not emit function address and delta when -mpure-code is used.
>
> k#   (use "git pull" to merge the remote branch into yours)
> ---
>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 66 insertions(+), 25 deletions(-)
>
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index ceeb91f..62abeb5 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>      {
>        if (mi_delta > 255)
>         {
> -         fputs ("\tldr\tr3, ", file);
> -         assemble_name (file, label);
> -         fputs ("+4\n", file);
> +         /* With -mpure-code, we cannot load delta from the constant
> +            pool: we build it explicitly.  */
> +         if (target_pure_code)
> +           {
> +             bool mov_done_p = false;
> +             int i;
> +
> +             /* Emit upper 3 bytes if needed.  */
> +             for (i = 0; i < 3; i++)
> +               {
> +                 int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> +
> +                 if (byte)
> +                   {
> +                     if (mov_done_p)
> +                       asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> +                     else
> +                       asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> +                     mov_done_p = true;
> +                   }
> +
> +                 if (mov_done_p)
> +                   asm_fprintf (file, "\tlsls\tr3, #8\n");
> +               }
> +
> +             /* Emit lower byte if needed.  */
> +             if (!mov_done_p)
> +               asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> +             else if (mi_delta & 0xff)
> +               asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> +           }
> +         else
> +           {
> +             fputs ("\tldr\tr3, ", file);
> +             assemble_name (file, label);
> +             fputs ("+4\n", file);
> +           }
>           asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>                        mi_op, this_regno, this_regno);
>         }
> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>         fputs ("\tpop\t{r3}\n", file);
>
>        fprintf (file, "\tbx\tr12\n");
> -      ASM_OUTPUT_ALIGN (file, 2);
> -      assemble_name (file, label);
> -      fputs (":\n", file);
> -      if (flag_pic)
> +
> +      /* With -mpure-code, we don't need to emit literals for the
> +        function address and delta since we emitted code to build
> +        them.  */
> +      if (!target_pure_code)
>         {
> -         /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> -         rtx tem = XEXP (DECL_RTL (function), 0);
> -         /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> -            pipeline offset is four rather than eight.  Adjust the offset
> -            accordingly.  */
> -         tem = plus_constant (GET_MODE (tem), tem,
> -                              TARGET_THUMB1_ONLY ? -3 : -7);
> -         tem = gen_rtx_MINUS (GET_MODE (tem),
> -                              tem,
> -                              gen_rtx_SYMBOL_REF (Pmode,
> -                                                  ggc_strdup (labelpc)));
> -         assemble_integer (tem, 4, BITS_PER_WORD, 1);
> -       }
> -      else
> -       /* Output ".word .LTHUNKn".  */
> -       assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> +         ASM_OUTPUT_ALIGN (file, 2);
> +         assemble_name (file, label);
> +         fputs (":\n", file);
> +         if (flag_pic)
> +           {
> +             /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> +             rtx tem = XEXP (DECL_RTL (function), 0);
> +             /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> +                pipeline offset is four rather than eight.  Adjust the offset
> +                accordingly.  */
> +             tem = plus_constant (GET_MODE (tem), tem,
> +                                  TARGET_THUMB1_ONLY ? -3 : -7);
> +             tem = gen_rtx_MINUS (GET_MODE (tem),
> +                                  tem,
> +                                  gen_rtx_SYMBOL_REF (Pmode,
> +                                                      ggc_strdup (labelpc)));
> +             assemble_integer (tem, 4, BITS_PER_WORD, 1);
> +           }
> +         else
> +           /* Output ".word .LTHUNKn".  */
> +           assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>
> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> -       assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> +         if (TARGET_THUMB1_ONLY && mi_delta > 255)
> +           assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> +       }
>      }
>    else
>      {
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-09-29 19:50 [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code Christophe Lyon
  2020-10-06  8:30 ` Christophe Lyon
@ 2020-10-06 16:02 ` Richard Earnshaw
  2020-10-08  9:07   ` Christophe Lyon
  1 sibling, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-06 16:02 UTC (permalink / raw)
  To: Christophe Lyon, gcc-patches

On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> from code memory (like we do without -mpure-code).
> 
> This patch builds the value of mi_delta into r3 with a series of
> movs/adds/lsls.
> 
> We also do some cleanup by not emitting the function address and delta
> via .word directives at the end of the thunk since we don't use them
> with -mpure-code.
> 
> No need for new testcases, this bug was already identified by
> eg. pr46287-3.C
> 
> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> 
> 	gcc/
> 	* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> 	do not emit function address and delta when -mpure-code is used.

There are some optimizations you can make to this code.

Firstly, for values between 256 and 510 (inclusive), it would be better
to just expand a mov of 255 followed by an add.  This is also true for
the literal pools alternative as well, so should be handled before all
this.  I also suspect (but haven't check) that the base adjustment will
most commonly be a multiple of the machine word size (ie 4).  If that is
the case then you could generate n/4 and then shift it left by 2 for an
even greater range of literals.  More generally, any sequence of up to
three thumb1 instructions will be no larger, and probably as fast as the
existing literal pool fall back.

Secondly, if the value is, for example, 65536 (0x10000), your code will
emit a mov followed by two shift-by-8 instructions; the two shifts could
be merged into a single shift-by-16.

Finally, I'd really like to see some executable tests for this, if at
all possible.

R.

> 
> k#   (use "git pull" to merge the remote branch into yours)
> ---
>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 66 insertions(+), 25 deletions(-)
> 
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index ceeb91f..62abeb5 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>      {
>        if (mi_delta > 255)
>  	{
> -	  fputs ("\tldr\tr3, ", file);
> -	  assemble_name (file, label);
> -	  fputs ("+4\n", file);
> +	  /* With -mpure-code, we cannot load delta from the constant
> +	     pool: we build it explicitly.  */
> +	  if (target_pure_code)
> +	    {
> +	      bool mov_done_p = false;
> +	      int i;
> +
> +	      /* Emit upper 3 bytes if needed.  */
> +	      for (i = 0; i < 3; i++)
> +		{
> +		  int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> +
> +		  if (byte)
> +		    {
> +		      if (mov_done_p)
> +			asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> +		      else
> +			asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> +		      mov_done_p = true;
> +		    }
> +
> +		  if (mov_done_p)
> +		    asm_fprintf (file, "\tlsls\tr3, #8\n");
> +		}
> +
> +	      /* Emit lower byte if needed.  */
> +	      if (!mov_done_p)
> +		asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> +	      else if (mi_delta & 0xff)
> +		asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> +	    }
> +	  else
> +	    {
> +	      fputs ("\tldr\tr3, ", file);
> +	      assemble_name (file, label);
> +	      fputs ("+4\n", file);
> +	    }
>  	  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>  		       mi_op, this_regno, this_regno);
>  	}
> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>  	fputs ("\tpop\t{r3}\n", file);
>  
>        fprintf (file, "\tbx\tr12\n");
> -      ASM_OUTPUT_ALIGN (file, 2);
> -      assemble_name (file, label);
> -      fputs (":\n", file);
> -      if (flag_pic)
> +
> +      /* With -mpure-code, we don't need to emit literals for the
> +	 function address and delta since we emitted code to build
> +	 them.  */
> +      if (!target_pure_code)
>  	{
> -	  /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> -	  rtx tem = XEXP (DECL_RTL (function), 0);
> -	  /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> -	     pipeline offset is four rather than eight.  Adjust the offset
> -	     accordingly.  */
> -	  tem = plus_constant (GET_MODE (tem), tem,
> -			       TARGET_THUMB1_ONLY ? -3 : -7);
> -	  tem = gen_rtx_MINUS (GET_MODE (tem),
> -			       tem,
> -			       gen_rtx_SYMBOL_REF (Pmode,
> -						   ggc_strdup (labelpc)));
> -	  assemble_integer (tem, 4, BITS_PER_WORD, 1);
> -	}
> -      else
> -	/* Output ".word .LTHUNKn".  */
> -	assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> +	  ASM_OUTPUT_ALIGN (file, 2);
> +	  assemble_name (file, label);
> +	  fputs (":\n", file);
> +	  if (flag_pic)
> +	    {
> +	      /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> +	      rtx tem = XEXP (DECL_RTL (function), 0);
> +	      /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> +		 pipeline offset is four rather than eight.  Adjust the offset
> +		 accordingly.  */
> +	      tem = plus_constant (GET_MODE (tem), tem,
> +				   TARGET_THUMB1_ONLY ? -3 : -7);
> +	      tem = gen_rtx_MINUS (GET_MODE (tem),
> +				   tem,
> +				   gen_rtx_SYMBOL_REF (Pmode,
> +						       ggc_strdup (labelpc)));
> +	      assemble_integer (tem, 4, BITS_PER_WORD, 1);
> +	    }
> +	  else
> +	    /* Output ".word .LTHUNKn".  */
> +	    assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>  
> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> -	assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> +	  if (TARGET_THUMB1_ONLY && mi_delta > 255)
> +	    assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> +	}
>      }
>    else
>      {
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-06 16:02 ` Richard Earnshaw
@ 2020-10-08  9:07   ` Christophe Lyon
  2020-10-08  9:58     ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-08  9:07 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> > When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> > from code memory (like we do without -mpure-code).
> >
> > This patch builds the value of mi_delta into r3 with a series of
> > movs/adds/lsls.
> >
> > We also do some cleanup by not emitting the function address and delta
> > via .word directives at the end of the thunk since we don't use them
> > with -mpure-code.
> >
> > No need for new testcases, this bug was already identified by
> > eg. pr46287-3.C
> >
> > 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >
> >       gcc/
> >       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >       do not emit function address and delta when -mpure-code is used.
>
Hi Richard,

Thanks for your comments.

> There are some optimizations you can make to this code.
>
> Firstly, for values between 256 and 510 (inclusive), it would be better
> to just expand a mov of 255 followed by an add.
I now see the splitted for the "Pe" constraint which I hadn't noticed
before, so I can write something similar indeed.

However, I'm note quite sure to understand the benefit in the split
when -mpure-code is NOT used.
Consider:
int f3_1 (void) { return 510; }
int f3_2 (void) { return 511; }
Compile with -O2 -mcpu=cortex-m0:
f3_1:
        movs    r0, #255
        lsls    r0, r0, #1
        bx      lr
f3_2:
        ldr     r0, .L4
        bx      lr

The splitter makes the code bigger, does it "compensate" for this by
not having to load the constant?
Actually the constant uses 4 more bytes, which should be taken into
account when comparing code size,
so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
thumb1 instructions would be equivalent in size compared to loading
from the literal pool. Should the 256-510 range be extended?


> This is also true for
> the literal pools alternative as well, so should be handled before all
> this.
I am not sure what you mean: with -mpure-code, the above sample is compiled as:
f3_1:
        movs    r0, #255
        lsls    r0, r0, #1
        bx      lr
f3_2:
        movs    r0, #1
        lsls    r0, r0, #8
        adds    r0, r0, #255
        bx      lr

so the "return 510" case is already handled as without -mpure-code.

>  I also suspect (but haven't check) that the base adjustment will
> most commonly be a multiple of the machine word size (ie 4).  If that is
> the case then you could generate n/4 and then shift it left by 2 for an
> even greater range of literals.
I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
I'll update my patch.

>  More generally, any sequence of up to
> three thumb1 instructions will be no larger, and probably as fast as the
> existing literal pool fall back.
>
> Secondly, if the value is, for example, 65536 (0x10000), your code will
> emit a mov followed by two shift-by-8 instructions; the two shifts could
> be merged into a single shift-by-16.

Right, I'll try to make use of thumb_shiftable_const.

> Finally, I'd really like to see some executable tests for this, if at
> all possible.
I mentioned pr46287-3.C, but that's not the only existing testcase
that showed the problem. There are also:
g++.dg/opt/thunk1.C
g++.dg/ipa/pr46984.C
g++.dg/torture/pr46287.C
g++.dg/torture/pr45699.C

Do you want that I copy one of these in the arm subdir and add
-mpure-code in dg-options?

Thanks,

Christophe

> R.
>
> >
> > k#   (use "git pull" to merge the remote branch into yours)
> > ---
> >  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 66 insertions(+), 25 deletions(-)
> >
> > diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> > index ceeb91f..62abeb5 100644
> > --- a/gcc/config/arm/arm.c
> > +++ b/gcc/config/arm/arm.c
> > @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >      {
> >        if (mi_delta > 255)
> >       {
> > -       fputs ("\tldr\tr3, ", file);
> > -       assemble_name (file, label);
> > -       fputs ("+4\n", file);
> > +       /* With -mpure-code, we cannot load delta from the constant
> > +          pool: we build it explicitly.  */
> > +       if (target_pure_code)
> > +         {
> > +           bool mov_done_p = false;
> > +           int i;
> > +
> > +           /* Emit upper 3 bytes if needed.  */
> > +           for (i = 0; i < 3; i++)
> > +             {
> > +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> > +
> > +               if (byte)
> > +                 {
> > +                   if (mov_done_p)
> > +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> > +                   else
> > +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> > +                   mov_done_p = true;
> > +                 }
> > +
> > +               if (mov_done_p)
> > +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> > +             }
> > +
> > +           /* Emit lower byte if needed.  */
> > +           if (!mov_done_p)
> > +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> > +           else if (mi_delta & 0xff)
> > +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> > +         }
> > +       else
> > +         {
> > +           fputs ("\tldr\tr3, ", file);
> > +           assemble_name (file, label);
> > +           fputs ("+4\n", file);
> > +         }
> >         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >                      mi_op, this_regno, this_regno);
> >       }
> > @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >       fputs ("\tpop\t{r3}\n", file);
> >
> >        fprintf (file, "\tbx\tr12\n");
> > -      ASM_OUTPUT_ALIGN (file, 2);
> > -      assemble_name (file, label);
> > -      fputs (":\n", file);
> > -      if (flag_pic)
> > +
> > +      /* With -mpure-code, we don't need to emit literals for the
> > +      function address and delta since we emitted code to build
> > +      them.  */
> > +      if (!target_pure_code)
> >       {
> > -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> > -       rtx tem = XEXP (DECL_RTL (function), 0);
> > -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> > -          pipeline offset is four rather than eight.  Adjust the offset
> > -          accordingly.  */
> > -       tem = plus_constant (GET_MODE (tem), tem,
> > -                            TARGET_THUMB1_ONLY ? -3 : -7);
> > -       tem = gen_rtx_MINUS (GET_MODE (tem),
> > -                            tem,
> > -                            gen_rtx_SYMBOL_REF (Pmode,
> > -                                                ggc_strdup (labelpc)));
> > -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> > -     }
> > -      else
> > -     /* Output ".word .LTHUNKn".  */
> > -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> > +       ASM_OUTPUT_ALIGN (file, 2);
> > +       assemble_name (file, label);
> > +       fputs (":\n", file);
> > +       if (flag_pic)
> > +         {
> > +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> > +           rtx tem = XEXP (DECL_RTL (function), 0);
> > +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> > +              pipeline offset is four rather than eight.  Adjust the offset
> > +              accordingly.  */
> > +           tem = plus_constant (GET_MODE (tem), tem,
> > +                                TARGET_THUMB1_ONLY ? -3 : -7);
> > +           tem = gen_rtx_MINUS (GET_MODE (tem),
> > +                                tem,
> > +                                gen_rtx_SYMBOL_REF (Pmode,
> > +                                                    ggc_strdup (labelpc)));
> > +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> > +         }
> > +       else
> > +         /* Output ".word .LTHUNKn".  */
> > +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >
> > -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> > -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> > +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> > +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> > +     }
> >      }
> >    else
> >      {
> >
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-08  9:07   ` Christophe Lyon
@ 2020-10-08  9:58     ` Richard Earnshaw
  2020-10-12  7:59       ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-08  9:58 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>> from code memory (like we do without -mpure-code).
>>>
>>> This patch builds the value of mi_delta into r3 with a series of
>>> movs/adds/lsls.
>>>
>>> We also do some cleanup by not emitting the function address and delta
>>> via .word directives at the end of the thunk since we don't use them
>>> with -mpure-code.
>>>
>>> No need for new testcases, this bug was already identified by
>>> eg. pr46287-3.C
>>>
>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>
>>>       gcc/
>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>       do not emit function address and delta when -mpure-code is used.
>>
> Hi Richard,
> 
> Thanks for your comments.
> 
>> There are some optimizations you can make to this code.
>>
>> Firstly, for values between 256 and 510 (inclusive), it would be better
>> to just expand a mov of 255 followed by an add.
> I now see the splitted for the "Pe" constraint which I hadn't noticed
> before, so I can write something similar indeed.
> 
> However, I'm note quite sure to understand the benefit in the split
> when -mpure-code is NOT used.
> Consider:
> int f3_1 (void) { return 510; }
> int f3_2 (void) { return 511; }
> Compile with -O2 -mcpu=cortex-m0:
> f3_1:
>         movs    r0, #255
>         lsls    r0, r0, #1
>         bx      lr
> f3_2:
>         ldr     r0, .L4
>         bx      lr
> 
> The splitter makes the code bigger, does it "compensate" for this by
> not having to load the constant?
> Actually the constant uses 4 more bytes, which should be taken into
> account when comparing code size,

Yes, the size of the literal pool entry needs to be taken into account.
 It might happen that the entry could be shared with another use of that
literal, but in general that's rare.

> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> thumb1 instructions would be equivalent in size compared to loading
> from the literal pool. Should the 256-510 range be extended?

It's a bit borderline at three instructions when literal pools are not
expensive to use, but in thumb1 literal pools tend to be quite small due
to the limited pc offsets we can use.  I think on balance we probably
want to use the instruction sequence unless optimizing for size.

> 
> 
>> This is also true for
>> the literal pools alternative as well, so should be handled before all
>> this.
> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> f3_1:
>         movs    r0, #255
>         lsls    r0, r0, #1
>         bx      lr
> f3_2:
>         movs    r0, #1
>         lsls    r0, r0, #8
>         adds    r0, r0, #255
>         bx      lr
> 
> so the "return 510" case is already handled as without -mpure-code.

I was thinking specifically of the thunk sequence where you seem to be
emitting instructions directly rather than generating RTL.  The examples
you show here are not thunks.


> 
>>  I also suspect (but haven't check) that the base adjustment will
>> most commonly be a multiple of the machine word size (ie 4).  If that is
>> the case then you could generate n/4 and then shift it left by 2 for an
>> even greater range of literals.
> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> I'll update my patch.
> 
>>  More generally, any sequence of up to
>> three thumb1 instructions will be no larger, and probably as fast as the
>> existing literal pool fall back.
>>
>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>> be merged into a single shift-by-16.
> 
> Right, I'll try to make use of thumb_shiftable_const.
> 
>> Finally, I'd really like to see some executable tests for this, if at
>> all possible.
> I mentioned pr46287-3.C, but that's not the only existing testcase
> that showed the problem. There are also:
> g++.dg/opt/thunk1.C
> g++.dg/ipa/pr46984.C
> g++.dg/torture/pr46287.C
> g++.dg/torture/pr45699.C
> 
> Do you want that I copy one of these in the arm subdir and add
> -mpure-code in dg-options?

On reflection, probably not - that just makes things more complicated
with all the dg-options mess (I'm worried about interactions with other
sets of options on the command line and the fall-out from that).  If
someone cares about pure-code they should be doing full testsuite runs
with it enabled and that should be sufficient.

R.

> 
> Thanks,
> 
> Christophe
> 
>> R.
>>
>>>
>>> k#   (use "git pull" to merge the remote branch into yours)
>>> ---
>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>> index ceeb91f..62abeb5 100644
>>> --- a/gcc/config/arm/arm.c
>>> +++ b/gcc/config/arm/arm.c
>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>      {
>>>        if (mi_delta > 255)
>>>       {
>>> -       fputs ("\tldr\tr3, ", file);
>>> -       assemble_name (file, label);
>>> -       fputs ("+4\n", file);
>>> +       /* With -mpure-code, we cannot load delta from the constant
>>> +          pool: we build it explicitly.  */
>>> +       if (target_pure_code)
>>> +         {
>>> +           bool mov_done_p = false;
>>> +           int i;
>>> +
>>> +           /* Emit upper 3 bytes if needed.  */
>>> +           for (i = 0; i < 3; i++)
>>> +             {
>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>> +
>>> +               if (byte)
>>> +                 {
>>> +                   if (mov_done_p)
>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>> +                   else
>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>> +                   mov_done_p = true;
>>> +                 }
>>> +
>>> +               if (mov_done_p)
>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>> +             }
>>> +
>>> +           /* Emit lower byte if needed.  */
>>> +           if (!mov_done_p)
>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>> +           else if (mi_delta & 0xff)
>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>> +         }
>>> +       else
>>> +         {
>>> +           fputs ("\tldr\tr3, ", file);
>>> +           assemble_name (file, label);
>>> +           fputs ("+4\n", file);
>>> +         }
>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>                      mi_op, this_regno, this_regno);
>>>       }
>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>       fputs ("\tpop\t{r3}\n", file);
>>>
>>>        fprintf (file, "\tbx\tr12\n");
>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>> -      assemble_name (file, label);
>>> -      fputs (":\n", file);
>>> -      if (flag_pic)
>>> +
>>> +      /* With -mpure-code, we don't need to emit literals for the
>>> +      function address and delta since we emitted code to build
>>> +      them.  */
>>> +      if (!target_pure_code)
>>>       {
>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>> -          accordingly.  */
>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>> -                            tem,
>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>> -                                                ggc_strdup (labelpc)));
>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>> -     }
>>> -      else
>>> -     /* Output ".word .LTHUNKn".  */
>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>> +       assemble_name (file, label);
>>> +       fputs (":\n", file);
>>> +       if (flag_pic)
>>> +         {
>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>> +              accordingly.  */
>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>> +                                tem,
>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>> +                                                    ggc_strdup (labelpc)));
>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>> +         }
>>> +       else
>>> +         /* Output ".word .LTHUNKn".  */
>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>
>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>> +     }
>>>      }
>>>    else
>>>      {
>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-08  9:58     ` Richard Earnshaw
@ 2020-10-12  7:59       ` Christophe Lyon
  2020-10-19 14:39         ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-12  7:59 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

[-- Attachment #1: Type: text/plain, Size: 11213 bytes --]

On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> > On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> > <Richard.Earnshaw@foss.arm.com> wrote:
> >>
> >> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>> from code memory (like we do without -mpure-code).
> >>>
> >>> This patch builds the value of mi_delta into r3 with a series of
> >>> movs/adds/lsls.
> >>>
> >>> We also do some cleanup by not emitting the function address and delta
> >>> via .word directives at the end of the thunk since we don't use them
> >>> with -mpure-code.
> >>>
> >>> No need for new testcases, this bug was already identified by
> >>> eg. pr46287-3.C
> >>>
> >>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>
> >>>       gcc/
> >>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>       do not emit function address and delta when -mpure-code is used.
> >>
> > Hi Richard,
> >
> > Thanks for your comments.
> >
> >> There are some optimizations you can make to this code.
> >>
> >> Firstly, for values between 256 and 510 (inclusive), it would be better
> >> to just expand a mov of 255 followed by an add.
> > I now see the splitted for the "Pe" constraint which I hadn't noticed
> > before, so I can write something similar indeed.
> >
> > However, I'm note quite sure to understand the benefit in the split
> > when -mpure-code is NOT used.
> > Consider:
> > int f3_1 (void) { return 510; }
> > int f3_2 (void) { return 511; }
> > Compile with -O2 -mcpu=cortex-m0:
> > f3_1:
> >         movs    r0, #255
> >         lsls    r0, r0, #1
> >         bx      lr
> > f3_2:
> >         ldr     r0, .L4
> >         bx      lr
> >
> > The splitter makes the code bigger, does it "compensate" for this by
> > not having to load the constant?
> > Actually the constant uses 4 more bytes, which should be taken into
> > account when comparing code size,
>
> Yes, the size of the literal pool entry needs to be taken into account.
>  It might happen that the entry could be shared with another use of that
> literal, but in general that's rare.
>
> > so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> > thumb1 instructions would be equivalent in size compared to loading
> > from the literal pool. Should the 256-510 range be extended?
>
> It's a bit borderline at three instructions when literal pools are not
> expensive to use, but in thumb1 literal pools tend to be quite small due
> to the limited pc offsets we can use.  I think on balance we probably
> want to use the instruction sequence unless optimizing for size.
>
> >
> >
> >> This is also true for
> >> the literal pools alternative as well, so should be handled before all
> >> this.
> > I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> > f3_1:
> >         movs    r0, #255
> >         lsls    r0, r0, #1
> >         bx      lr
> > f3_2:
> >         movs    r0, #1
> >         lsls    r0, r0, #8
> >         adds    r0, r0, #255
> >         bx      lr
> >
> > so the "return 510" case is already handled as without -mpure-code.
>
> I was thinking specifically of the thunk sequence where you seem to be
> emitting instructions directly rather than generating RTL.  The examples
> you show here are not thunks.
>
OK thanks for the clarification.

Here is an updated version, split into 3 patches to hopefully make
review easier.
They apply on top of my other mpure-code patches for PR96967 and PR96770:
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html

I kept it this way to make incremental changes easier to understand.

Patch 1: With the hope to avoid confusion and make maintenance easier,
I have updated thumb1_gen_const_int() so that it can generate either RTL or
asm. This way, all the code used to build thumb-1 constants is in the
same place,
 in case we need to improve/fix it later. We now generate shorter sequences in
several cases matching your comments.

Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
calls thumb1_gen_const_int.

Patch 3: Update of the original patch in this thread, now calls
thumb1_gen_const_int.

>
> >
> >>  I also suspect (but haven't check) that the base adjustment will
> >> most commonly be a multiple of the machine word size (ie 4).  If that is
> >> the case then you could generate n/4 and then shift it left by 2 for an
> >> even greater range of literals.
> > I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> > I'll update my patch.
> >
> >>  More generally, any sequence of up to
> >> three thumb1 instructions will be no larger, and probably as fast as the
> >> existing literal pool fall back.
> >>
> >> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >> be merged into a single shift-by-16.
> >
> > Right, I'll try to make use of thumb_shiftable_const.
> >
> >> Finally, I'd really like to see some executable tests for this, if at
> >> all possible.
> > I mentioned pr46287-3.C, but that's not the only existing testcase
> > that showed the problem. There are also:
> > g++.dg/opt/thunk1.C
> > g++.dg/ipa/pr46984.C
> > g++.dg/torture/pr46287.C
> > g++.dg/torture/pr45699.C
> >
> > Do you want that I copy one of these in the arm subdir and add
> > -mpure-code in dg-options?
>
> On reflection, probably not - that just makes things more complicated
> with all the dg-options mess (I'm worried about interactions with other
> sets of options on the command line and the fall-out from that).  If
> someone cares about pure-code they should be doing full testsuite runs
> with it enabled and that should be sufficient.

Yes, that's what I am doing manually, it's a bit tricky, and I use a
modified simulator.

Thanks,

Christophe

>
> R.
>
> >
> > Thanks,
> >
> > Christophe
> >
> >> R.
> >>
> >>>
> >>> k#   (use "git pull" to merge the remote branch into yours)
> >>> ---
> >>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>
> >>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>> index ceeb91f..62abeb5 100644
> >>> --- a/gcc/config/arm/arm.c
> >>> +++ b/gcc/config/arm/arm.c
> >>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>      {
> >>>        if (mi_delta > 255)
> >>>       {
> >>> -       fputs ("\tldr\tr3, ", file);
> >>> -       assemble_name (file, label);
> >>> -       fputs ("+4\n", file);
> >>> +       /* With -mpure-code, we cannot load delta from the constant
> >>> +          pool: we build it explicitly.  */
> >>> +       if (target_pure_code)
> >>> +         {
> >>> +           bool mov_done_p = false;
> >>> +           int i;
> >>> +
> >>> +           /* Emit upper 3 bytes if needed.  */
> >>> +           for (i = 0; i < 3; i++)
> >>> +             {
> >>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>> +
> >>> +               if (byte)
> >>> +                 {
> >>> +                   if (mov_done_p)
> >>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>> +                   else
> >>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>> +                   mov_done_p = true;
> >>> +                 }
> >>> +
> >>> +               if (mov_done_p)
> >>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>> +             }
> >>> +
> >>> +           /* Emit lower byte if needed.  */
> >>> +           if (!mov_done_p)
> >>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>> +           else if (mi_delta & 0xff)
> >>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>> +         }
> >>> +       else
> >>> +         {
> >>> +           fputs ("\tldr\tr3, ", file);
> >>> +           assemble_name (file, label);
> >>> +           fputs ("+4\n", file);
> >>> +         }
> >>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>                      mi_op, this_regno, this_regno);
> >>>       }
> >>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>       fputs ("\tpop\t{r3}\n", file);
> >>>
> >>>        fprintf (file, "\tbx\tr12\n");
> >>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>> -      assemble_name (file, label);
> >>> -      fputs (":\n", file);
> >>> -      if (flag_pic)
> >>> +
> >>> +      /* With -mpure-code, we don't need to emit literals for the
> >>> +      function address and delta since we emitted code to build
> >>> +      them.  */
> >>> +      if (!target_pure_code)
> >>>       {
> >>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>> -          accordingly.  */
> >>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>> -                            tem,
> >>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>> -                                                ggc_strdup (labelpc)));
> >>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>> -     }
> >>> -      else
> >>> -     /* Output ".word .LTHUNKn".  */
> >>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>> +       assemble_name (file, label);
> >>> +       fputs (":\n", file);
> >>> +       if (flag_pic)
> >>> +         {
> >>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>> +              accordingly.  */
> >>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>> +                                tem,
> >>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>> +                                                    ggc_strdup (labelpc)));
> >>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>> +         }
> >>> +       else
> >>> +         /* Output ".word .LTHUNKn".  */
> >>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>
> >>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>> +     }
> >>>      }
> >>>    else
> >>>      {
> >>>
> >>
>

[-- Attachment #2: 0001-arm-Improve-thumb1_gen_const_int.patch --]
[-- Type: text/x-patch, Size: 8808 bytes --]

From 99ca30b86d986d4247c736973767f791a6abac44 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 12:42:39 +0000
Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int

Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
thumb-1 with -mpure-code.

When the OP0 parameter is NULL, generate asm in FILE, targeting the
REGNO register. When OP0 is not NULL, generate RTL targeting OP0.

The generated sequence is improved to handle right-shiftable and small
values with less instructions. We now generate:

128:
        subs    r0, r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17

2020-10-12  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Update prototype.
	* config/arm/arm.c (thumb1_gen_const_int): Add capability to emit
	either RTL or asm, improve generated code.
	* config/arm/thumb1.md: Update splitter to use the new
	thumb1_gen_const_int prototype.
---
 gcc/config/arm/arm-protos.h |   2 +-
 gcc/config/arm/arm.c        | 190 ++++++++++++++++++++++++++++++++++++--------
 gcc/config/arm/thumb1.md    |   2 +-
 3 files changed, 160 insertions(+), 34 deletions(-)

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 0cc0ae7..673f94b 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -74,7 +74,7 @@ extern bool arm_small_register_classes_for_mode_p (machine_mode);
 extern int const_ok_for_arm (HOST_WIDE_INT);
 extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code);
 extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code);
-extern void thumb1_gen_const_int (rtx, HOST_WIDE_INT);
+extern void thumb1_gen_const_int (FILE *, int, rtx, HOST_WIDE_INT);
 extern int arm_split_constant (RTX_CODE, machine_mode, rtx,
 			       HOST_WIDE_INT, rtx, rtx, int);
 extern int legitimate_pic_operand_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ceeb91f..eb4675a 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code)
     }
 }
 
-/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
-   Avoid generating useless code when one of the bytes is zero.  */
-void
-thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1)
-{
-  bool mov_done_p = false;
-  int i;
-
-  /* Emit upper 3 bytes if needed.  */
-  for (i = 0; i < 3; i++)
-    {
-      int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-      if (byte)
-	{
-	  emit_set_insn (op0, mov_done_p
-			 ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte))
-			 : GEN_INT (byte));
-	  mov_done_p = true;
-	}
-
-      if (mov_done_p)
-	emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8)));
-    }
-
-  /* Emit lower byte if needed.  */
-  if (!mov_done_p)
-    emit_set_insn (op0, GEN_INT (op1 & 0xff));
-  else if (op1 & 0xff)
-    emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff)));
-}
-
 /* Emit a sequence of insns to handle a large constant.
    CODE is the code of the operation required, it can be any of SET, PLUS,
    IOR, AND, XOR, MINUS;
@@ -28244,6 +28212,164 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno)
   default_internal_label (stream, prefix, labelno);
 }
 
+/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
+   Avoid generating useless code when one of the bytes is zero.  If
+   OP0 is NULL, use REGNO as output register, use REGNO(OP0)
+   otherwise.  */
+void
+thumb1_gen_const_int (FILE *file, int regno, rtx op0, HOST_WIDE_INT op1)
+{
+  bool mov_done_p = false;
+  int val = op1;
+  int shift = 0;
+  int i;
+
+  if (val == 0)
+    {
+      if (op0)
+	emit_set_insn (op0, GEN_INT (val));
+      else
+	asm_fprintf (file, "\tmovs\tr%d, #%d\n", regno, val);
+      return;
+    }
+
+  /* In the general case, we need 7 instructions to build
+     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
+     do better if VAL is small enough, or
+     right-shiftable by a suitable amount.  If the
+     right-shift enables to encode at least one less byte,
+     it's worth it: we save a movs and a lsls at the
+     expense of a final lsls.  */
+  int final_shift = number_of_first_bit_set (val);
+
+  int leading_zeroes = clz_hwi (val);
+  int number_of_bytes_needed =
+    ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
+     / BITS_PER_UNIT) + 1;
+  int number_of_bytes_needed2 =
+    ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
+     / BITS_PER_UNIT) + 1;
+
+  if (number_of_bytes_needed2 < number_of_bytes_needed)
+    val >>= final_shift;
+  else
+    final_shift = 0;
+
+  /* If we are in a very small range, we can use either a single movs
+     or movs+adds.  */
+  if ((val >= 0) && (val <= 510))
+    {
+      /* It's possible that VAL is now <= 255, if we
+	 left-shifted it.  */
+      if (val > 255)
+	{
+	  int high = val - 255;
+
+	  if (op0)
+	    {
+	      emit_set_insn (op0, GEN_INT (high));
+	      emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (255)));
+	    }
+	  else
+	    {
+	      asm_fprintf (file, "\tmovs\tr%d, #%d\n", regno, high);
+	      asm_fprintf (file, "\tadds\tr%d, #%d\n", regno, 255);
+	    }
+	}
+      else
+	{
+	  if (op0)
+	    emit_set_insn (op0, GEN_INT (val));
+	  else
+	    asm_fprintf (file, "\tmovs\tr%d, #%d\n", regno, val);
+	}
+
+      if (final_shift > 0)
+	{
+	  if (op0)
+	    emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (final_shift)));
+	  else
+	    asm_fprintf (file, "\tlsls\tr%d, #%d\n", regno, final_shift);
+	}
+    }
+  else
+    {
+      /* General case, emit upper 3 bytes as needed.  */
+      for (i = 0; i < 3; i++)
+	{
+	  int byte = (val >> (8 * (3 - i))) & 0xff;
+
+	  if (byte)
+	    {
+	      /* We are about to emit new bits, stop accumulating a
+		 shift amount, and left-shift only if we have already
+		 emitted some upper bits.  */
+	      if (mov_done_p)
+		{
+		  if (op0)
+		    emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (shift)));
+		  else
+		    asm_fprintf (file, "\tlsls\tr%d, #%d\n", regno, shift);
+		}
+
+	      if (mov_done_p)
+		{
+		  if (op0)
+		    emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (byte)));
+		  else
+		    asm_fprintf (file, "\tadds\tr%d, #%d\n", regno, byte);
+		}
+	      else
+		{
+		  if (op0)
+		    emit_set_insn (op0, GEN_INT (byte));
+		  else
+		    asm_fprintf (file, "\tmovs\tr%d, #%d\n", regno, byte);
+		}
+
+	      /* Stop accumulating shift amount since we've just
+		 emitted some bits.  */
+	      shift = 0;
+
+	      mov_done_p = true;
+	    }
+
+	  if (mov_done_p)
+	    shift += 8;
+	}
+
+      /* Emit lower byte; it's not zero because we handled that case above.  */
+      if (!mov_done_p)
+	{
+	  if (op0)
+	    emit_set_insn (op0, GEN_INT (val & 0xff));
+	  else
+	    asm_fprintf (file, "\tmovs\tr%d, #%d\n", regno, val & 0xff);
+	}
+      else
+	{
+	  if (op0)
+	    {
+	      emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (shift)));
+	      emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (val & 0xff)));
+	    }
+	  else
+	    {
+	      asm_fprintf (file, "\tlsls\tr%d, #%d\n", regno, shift);
+	      asm_fprintf (file, "\tadds\tr%d, #%d\n", regno, val & 0xff);
+	    }
+	}
+
+      if (final_shift > 0)
+	{
+	  if (op0)
+	    emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (final_shift)));
+	  else
+	    asm_fprintf (file, "\tlsls\tr%d, #%d\n", regno, final_shift);
+	}
+    }
+}
+
 /* Output code to add DELTA to the first argument, and then jump
    to FUNCTION.  Used for C++ multiple inheritance.  */
 
diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index 2258a52..406edc5 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -820,7 +820,7 @@ (define_split
    && !satisfies_constraint_K (operands[1])"
   [(clobber (const_int 0))]
   "
-    thumb1_gen_const_int (operands[0], INTVAL (operands[1]));
+    thumb1_gen_const_int (NULL, 0, operands[0], INTVAL (operands[1]));
     DONE;
   "
 )
-- 
2.7.4


[-- Attachment #3: 0003-arm-Fix-multiple-inheritance-thunks-for-thumb-1-with.patch --]
[-- Type: text/x-patch, Size: 4017 bytes --]

From 6b7d3f02f275ee140227fa293636687a321a21f6 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Tue, 29 Sep 2020 19:40:39 +0000
Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with
 -mpure-code

When -mpure-code is used, we cannot load delta from code memory (like
we do without -mpure-code).

This patch builds the value of mi_delta into r3 with a series of
movs/adds/lsls.

We also do some cleanup by not emitting the function address and delta
via .word directives at the end of the thunk since we don't use them
with -mpure-code.

No need for new testcases, this bug was already identified by:
g++.dg/ipa/pr46287-3.C
g++.dg/ipa/pr46984.C
g++.dg/opt/thunk1.C
g++.dg/torture/pr46287.C
g++.dg/torture/pr45699.C

2020-10-12  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
	do not emit function address and delta when -mpure-code is used.
---
 gcc/config/arm/arm.c | 64 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index eb4675a..ed0828c 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28468,9 +28468,16 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
     {
       if (mi_delta > 255)
 	{
-	  fputs ("\tldr\tr3, ", file);
-	  assemble_name (file, label);
-	  fputs ("+4\n", file);
+	  /* With -mpure-code, we cannot load MI_DELTA from the
+	     constant pool: we build it explicitly.  */
+	  if (target_pure_code)
+	    thumb1_gen_const_int (file, 3, NULL, mi_delta);
+	  else
+	    {
+	      fputs ("\tldr\tr3, ", file);
+	      assemble_name (file, label);
+	      fputs ("+4\n", file);
+	    }
 	  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
 		       mi_op, this_regno, this_regno);
 	}
@@ -28506,30 +28513,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
 	fputs ("\tpop\t{r3}\n", file);
 
       fprintf (file, "\tbx\tr12\n");
-      ASM_OUTPUT_ALIGN (file, 2);
-      assemble_name (file, label);
-      fputs (":\n", file);
-      if (flag_pic)
+
+      /* With -mpure-code, we don't need to emit literals for the
+	 function address and delta since we emitted code to build
+	 them.  */
+      if (!target_pure_code)
 	{
-	  /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
-	  rtx tem = XEXP (DECL_RTL (function), 0);
-	  /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
-	     pipeline offset is four rather than eight.  Adjust the offset
-	     accordingly.  */
-	  tem = plus_constant (GET_MODE (tem), tem,
-			       TARGET_THUMB1_ONLY ? -3 : -7);
-	  tem = gen_rtx_MINUS (GET_MODE (tem),
-			       tem,
-			       gen_rtx_SYMBOL_REF (Pmode,
-						   ggc_strdup (labelpc)));
-	  assemble_integer (tem, 4, BITS_PER_WORD, 1);
-	}
-      else
-	/* Output ".word .LTHUNKn".  */
-	assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
+	  ASM_OUTPUT_ALIGN (file, 2);
+	  assemble_name (file, label);
+	  fputs (":\n", file);
+	  if (flag_pic)
+	    {
+	      /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
+	      rtx tem = XEXP (DECL_RTL (function), 0);
+	      /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
+		 pipeline offset is four rather than eight.  Adjust the offset
+		 accordingly.  */
+	      tem = plus_constant (GET_MODE (tem), tem,
+				   TARGET_THUMB1_ONLY ? -3 : -7);
+	      tem = gen_rtx_MINUS (GET_MODE (tem),
+				   tem,
+				   gen_rtx_SYMBOL_REF (Pmode,
+						       ggc_strdup (labelpc)));
+	      assemble_integer (tem, 4, BITS_PER_WORD, 1);
+	    }
+	  else
+	    /* Output ".word .LTHUNKn".  */
+	    assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
 
-      if (TARGET_THUMB1_ONLY && mi_delta > 255)
-	assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	  if (TARGET_THUMB1_ONLY && mi_delta > 255)
+	    assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	}
     }
   else
     {
-- 
2.7.4


[-- Attachment #4: 0002-arm-Call-thumb1_gen_const_int-from-thumb1_movsi_insn.patch --]
[-- Type: text/x-patch, Size: 2058 bytes --]

From 31b09a3cf77b7731d17c5ff85d569481fc831bc7 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 15:18:07 +0000
Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn

thumb1_movsi_insn used the same algorithm to build a constant in asm
than thumb1_gen_const_int does in RTL. Since the previous patch added
support for asm generation in thumb1_gen_const_int, this patch calls
it from thumb1_movsi_insn to avoid duplication.

2020-10-12  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/thumb1.md (thumb1_movsi_insn): Call
	thumb1_movsi_insn.
---
 gcc/config/arm/thumb1.md | 37 ++++---------------------------------
 1 file changed, 4 insertions(+), 33 deletions(-)

diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index 406edc5..83499dd 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn"
 	  }
 	else if (GET_CODE (operands[1]) == CONST_INT)
 	  {
-	    int i;
-	    HOST_WIDE_INT op1 = INTVAL (operands[1]);
-	    bool mov_done_p = false;
-	    rtx ops[2];
-	    ops[0] = operands[0];
-
-	    /* Emit upper 3 bytes if needed.  */
-	    for (i = 0; i < 3; i++)
-	      {
-		int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-		if (byte)
-		  {
-		    ops[1] = GEN_INT (byte);
-		    if (mov_done_p)
-		      output_asm_insn ("adds\t%0, %1", ops);
-		    else
-		      output_asm_insn ("movs\t%0, %1", ops);
-		    mov_done_p = true;
-		  }
-
-		if (mov_done_p)
-		  output_asm_insn ("lsls\t%0, #8", ops);
-	      }
-
-	    /* Emit lower byte if needed.  */
-	    ops[1] = GEN_INT (op1 & 0xff);
-	    if (!mov_done_p)
-	      output_asm_insn ("movs\t%0, %1", ops);
-	    else if (op1 & 0xff)
-	      output_asm_insn ("adds\t%0, %1", ops);
-	    return "";
+	    thumb1_gen_const_int (NULL, 0, operands[0], INTVAL (operands[1]));
+	    return \"\";
 	  }
-	  gcc_unreachable ();
+
+	gcc_unreachable ();
 
       case 8: return "ldr\t%0, %1";
       case 9: return "str\t%1, %0";
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-12  7:59       ` Christophe Lyon
@ 2020-10-19 14:39         ` Richard Earnshaw
  2020-10-19 16:32           ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-19 14:39 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>
>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>> from code memory (like we do without -mpure-code).
>>>>>
>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>> movs/adds/lsls.
>>>>>
>>>>> We also do some cleanup by not emitting the function address and delta
>>>>> via .word directives at the end of the thunk since we don't use them
>>>>> with -mpure-code.
>>>>>
>>>>> No need for new testcases, this bug was already identified by
>>>>> eg. pr46287-3.C
>>>>>
>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>
>>>>>       gcc/
>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>
>>> Hi Richard,
>>>
>>> Thanks for your comments.
>>>
>>>> There are some optimizations you can make to this code.
>>>>
>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>> to just expand a mov of 255 followed by an add.
>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>> before, so I can write something similar indeed.
>>>
>>> However, I'm note quite sure to understand the benefit in the split
>>> when -mpure-code is NOT used.
>>> Consider:
>>> int f3_1 (void) { return 510; }
>>> int f3_2 (void) { return 511; }
>>> Compile with -O2 -mcpu=cortex-m0:
>>> f3_1:
>>>         movs    r0, #255
>>>         lsls    r0, r0, #1
>>>         bx      lr
>>> f3_2:
>>>         ldr     r0, .L4
>>>         bx      lr
>>>
>>> The splitter makes the code bigger, does it "compensate" for this by
>>> not having to load the constant?
>>> Actually the constant uses 4 more bytes, which should be taken into
>>> account when comparing code size,
>>
>> Yes, the size of the literal pool entry needs to be taken into account.
>>  It might happen that the entry could be shared with another use of that
>> literal, but in general that's rare.
>>
>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>> thumb1 instructions would be equivalent in size compared to loading
>>> from the literal pool. Should the 256-510 range be extended?
>>
>> It's a bit borderline at three instructions when literal pools are not
>> expensive to use, but in thumb1 literal pools tend to be quite small due
>> to the limited pc offsets we can use.  I think on balance we probably
>> want to use the instruction sequence unless optimizing for size.
>>
>>>
>>>
>>>> This is also true for
>>>> the literal pools alternative as well, so should be handled before all
>>>> this.
>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>> f3_1:
>>>         movs    r0, #255
>>>         lsls    r0, r0, #1
>>>         bx      lr
>>> f3_2:
>>>         movs    r0, #1
>>>         lsls    r0, r0, #8
>>>         adds    r0, r0, #255
>>>         bx      lr
>>>
>>> so the "return 510" case is already handled as without -mpure-code.
>>
>> I was thinking specifically of the thunk sequence where you seem to be
>> emitting instructions directly rather than generating RTL.  The examples
>> you show here are not thunks.
>>
> OK thanks for the clarification.
> 
> Here is an updated version, split into 3 patches to hopefully make
> review easier.
> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> 
> I kept it this way to make incremental changes easier to understand.
> 
> Patch 1: With the hope to avoid confusion and make maintenance easier,
> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> asm. This way, all the code used to build thumb-1 constants is in the
> same place,
>  in case we need to improve/fix it later. We now generate shorter sequences in
> several cases matching your comments.
> 
> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> calls thumb1_gen_const_int.
> 
> Patch 3: Update of the original patch in this thread, now calls
> thumb1_gen_const_int.

Yuk!  Those changes to thumb1_gen_const_int are horrible.

I think we should be able to leverage the fact that the compiler can use
C++ now to do much better than that, for example by making that function
a template.  For example (and this is just a sketch):

class t1_rtl
{
 public:
  void ashift(int a) { gen_rtx_ASHIFT(a); }
  void rshift(int b) { gen_rtx_SHIFTRT(b); }
};

class t1_print
{
 public:
  t1_print (FILE *f) : t_file(f) {}
  void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
  void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
 private:
  FILE *t_file;
};

template <class T>
void thumb1_gen_const_int(T t, int f)
{
  // Expansion of thumb1_gen_const_int ...
  t.ashift(f);
}

// Usage...
void f1()
{
  // Use the RTL expander
  t1_rtl g;
  thumb1_gen_const_int (g, 3);
}

void f2()
{
  // Use the printf expander writing to stdout
  t1_print g(stdout);
  thumb1_gen_const_int (g, 3);
}

With this you can write thumb1_gen_const_int without having to worry
about which expander is being used in each instance and the template
expansion will use the right version.

R.

> 
>>
>>>
>>>>  I also suspect (but haven't check) that the base adjustment will
>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>> even greater range of literals.
>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>> I'll update my patch.
>>>
>>>>  More generally, any sequence of up to
>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>> existing literal pool fall back.
>>>>
>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>> be merged into a single shift-by-16.
>>>
>>> Right, I'll try to make use of thumb_shiftable_const.
>>>
>>>> Finally, I'd really like to see some executable tests for this, if at
>>>> all possible.
>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>> that showed the problem. There are also:
>>> g++.dg/opt/thunk1.C
>>> g++.dg/ipa/pr46984.C
>>> g++.dg/torture/pr46287.C
>>> g++.dg/torture/pr45699.C
>>>
>>> Do you want that I copy one of these in the arm subdir and add
>>> -mpure-code in dg-options?
>>
>> On reflection, probably not - that just makes things more complicated
>> with all the dg-options mess (I'm worried about interactions with other
>> sets of options on the command line and the fall-out from that).  If
>> someone cares about pure-code they should be doing full testsuite runs
>> with it enabled and that should be sufficient.
> 
> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> modified simulator.
> 
> Thanks,
> 
> Christophe
> 
>>
>> R.
>>
>>>
>>> Thanks,
>>>
>>> Christophe
>>>
>>>> R.
>>>>
>>>>>
>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>> ---
>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>
>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>> index ceeb91f..62abeb5 100644
>>>>> --- a/gcc/config/arm/arm.c
>>>>> +++ b/gcc/config/arm/arm.c
>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>      {
>>>>>        if (mi_delta > 255)
>>>>>       {
>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>> -       assemble_name (file, label);
>>>>> -       fputs ("+4\n", file);
>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>> +          pool: we build it explicitly.  */
>>>>> +       if (target_pure_code)
>>>>> +         {
>>>>> +           bool mov_done_p = false;
>>>>> +           int i;
>>>>> +
>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>> +           for (i = 0; i < 3; i++)
>>>>> +             {
>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>> +
>>>>> +               if (byte)
>>>>> +                 {
>>>>> +                   if (mov_done_p)
>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>> +                   else
>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>> +                   mov_done_p = true;
>>>>> +                 }
>>>>> +
>>>>> +               if (mov_done_p)
>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>> +             }
>>>>> +
>>>>> +           /* Emit lower byte if needed.  */
>>>>> +           if (!mov_done_p)
>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>> +           else if (mi_delta & 0xff)
>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>> +         }
>>>>> +       else
>>>>> +         {
>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>> +           assemble_name (file, label);
>>>>> +           fputs ("+4\n", file);
>>>>> +         }
>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>                      mi_op, this_regno, this_regno);
>>>>>       }
>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>
>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>> -      assemble_name (file, label);
>>>>> -      fputs (":\n", file);
>>>>> -      if (flag_pic)
>>>>> +
>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>> +      function address and delta since we emitted code to build
>>>>> +      them.  */
>>>>> +      if (!target_pure_code)
>>>>>       {
>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>> -          accordingly.  */
>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>> -                            tem,
>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>> -                                                ggc_strdup (labelpc)));
>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>> -     }
>>>>> -      else
>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>> +       assemble_name (file, label);
>>>>> +       fputs (":\n", file);
>>>>> +       if (flag_pic)
>>>>> +         {
>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>> +              accordingly.  */
>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>> +                                tem,
>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>> +                                                    ggc_strdup (labelpc)));
>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>> +         }
>>>>> +       else
>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>
>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>> +     }
>>>>>      }
>>>>>    else
>>>>>      {
>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-19 14:39         ` Richard Earnshaw
@ 2020-10-19 16:32           ` Christophe Lyon
  2020-10-20 11:22             ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-19 16:32 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> > On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> > <Richard.Earnshaw@foss.arm.com> wrote:
> >>
> >> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>
> >>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>> from code memory (like we do without -mpure-code).
> >>>>>
> >>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>> movs/adds/lsls.
> >>>>>
> >>>>> We also do some cleanup by not emitting the function address and delta
> >>>>> via .word directives at the end of the thunk since we don't use them
> >>>>> with -mpure-code.
> >>>>>
> >>>>> No need for new testcases, this bug was already identified by
> >>>>> eg. pr46287-3.C
> >>>>>
> >>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>
> >>>>>       gcc/
> >>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>
> >>> Hi Richard,
> >>>
> >>> Thanks for your comments.
> >>>
> >>>> There are some optimizations you can make to this code.
> >>>>
> >>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>> to just expand a mov of 255 followed by an add.
> >>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>> before, so I can write something similar indeed.
> >>>
> >>> However, I'm note quite sure to understand the benefit in the split
> >>> when -mpure-code is NOT used.
> >>> Consider:
> >>> int f3_1 (void) { return 510; }
> >>> int f3_2 (void) { return 511; }
> >>> Compile with -O2 -mcpu=cortex-m0:
> >>> f3_1:
> >>>         movs    r0, #255
> >>>         lsls    r0, r0, #1
> >>>         bx      lr
> >>> f3_2:
> >>>         ldr     r0, .L4
> >>>         bx      lr
> >>>
> >>> The splitter makes the code bigger, does it "compensate" for this by
> >>> not having to load the constant?
> >>> Actually the constant uses 4 more bytes, which should be taken into
> >>> account when comparing code size,
> >>
> >> Yes, the size of the literal pool entry needs to be taken into account.
> >>  It might happen that the entry could be shared with another use of that
> >> literal, but in general that's rare.
> >>
> >>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>> thumb1 instructions would be equivalent in size compared to loading
> >>> from the literal pool. Should the 256-510 range be extended?
> >>
> >> It's a bit borderline at three instructions when literal pools are not
> >> expensive to use, but in thumb1 literal pools tend to be quite small due
> >> to the limited pc offsets we can use.  I think on balance we probably
> >> want to use the instruction sequence unless optimizing for size.
> >>
> >>>
> >>>
> >>>> This is also true for
> >>>> the literal pools alternative as well, so should be handled before all
> >>>> this.
> >>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>> f3_1:
> >>>         movs    r0, #255
> >>>         lsls    r0, r0, #1
> >>>         bx      lr
> >>> f3_2:
> >>>         movs    r0, #1
> >>>         lsls    r0, r0, #8
> >>>         adds    r0, r0, #255
> >>>         bx      lr
> >>>
> >>> so the "return 510" case is already handled as without -mpure-code.
> >>
> >> I was thinking specifically of the thunk sequence where you seem to be
> >> emitting instructions directly rather than generating RTL.  The examples
> >> you show here are not thunks.
> >>
> > OK thanks for the clarification.
> >
> > Here is an updated version, split into 3 patches to hopefully make
> > review easier.
> > They apply on top of my other mpure-code patches for PR96967 and PR96770:
> > https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> > https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >
> > I kept it this way to make incremental changes easier to understand.
> >
> > Patch 1: With the hope to avoid confusion and make maintenance easier,
> > I have updated thumb1_gen_const_int() so that it can generate either RTL or
> > asm. This way, all the code used to build thumb-1 constants is in the
> > same place,
> >  in case we need to improve/fix it later. We now generate shorter sequences in
> > several cases matching your comments.
> >
> > Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> > calls thumb1_gen_const_int.
> >
> > Patch 3: Update of the original patch in this thread, now calls
> > thumb1_gen_const_int.
>
> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>
> I think we should be able to leverage the fact that the compiler can use
> C++ now to do much better than that, for example by making that function
> a template.  For example (and this is just a sketch):
>

Indeed! I didn't think about it since there is no other use of
templates in arm.c yet.
I'll send an update soon.

Other than that, does the approach look OK to you?

Thanks,

Christophe

> class t1_rtl
> {
>  public:
>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> };
>
> class t1_print
> {
>  public:
>   t1_print (FILE *f) : t_file(f) {}
>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>  private:
>   FILE *t_file;
> };
>
> template <class T>
> void thumb1_gen_const_int(T t, int f)
> {
>   // Expansion of thumb1_gen_const_int ...
>   t.ashift(f);
> }
>
> // Usage...
> void f1()
> {
>   // Use the RTL expander
>   t1_rtl g;
>   thumb1_gen_const_int (g, 3);
> }
>
> void f2()
> {
>   // Use the printf expander writing to stdout
>   t1_print g(stdout);
>   thumb1_gen_const_int (g, 3);
> }
>
> With this you can write thumb1_gen_const_int without having to worry
> about which expander is being used in each instance and the template
> expansion will use the right version.
>
> R.
>
> >
> >>
> >>>
> >>>>  I also suspect (but haven't check) that the base adjustment will
> >>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>> even greater range of literals.
> >>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>> I'll update my patch.
> >>>
> >>>>  More generally, any sequence of up to
> >>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>> existing literal pool fall back.
> >>>>
> >>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>> be merged into a single shift-by-16.
> >>>
> >>> Right, I'll try to make use of thumb_shiftable_const.
> >>>
> >>>> Finally, I'd really like to see some executable tests for this, if at
> >>>> all possible.
> >>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>> that showed the problem. There are also:
> >>> g++.dg/opt/thunk1.C
> >>> g++.dg/ipa/pr46984.C
> >>> g++.dg/torture/pr46287.C
> >>> g++.dg/torture/pr45699.C
> >>>
> >>> Do you want that I copy one of these in the arm subdir and add
> >>> -mpure-code in dg-options?
> >>
> >> On reflection, probably not - that just makes things more complicated
> >> with all the dg-options mess (I'm worried about interactions with other
> >> sets of options on the command line and the fall-out from that).  If
> >> someone cares about pure-code they should be doing full testsuite runs
> >> with it enabled and that should be sufficient.
> >
> > Yes, that's what I am doing manually, it's a bit tricky, and I use a
> > modified simulator.
> >
> > Thanks,
> >
> > Christophe
> >
> >>
> >> R.
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Christophe
> >>>
> >>>> R.
> >>>>
> >>>>>
> >>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>> ---
> >>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>
> >>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>> index ceeb91f..62abeb5 100644
> >>>>> --- a/gcc/config/arm/arm.c
> >>>>> +++ b/gcc/config/arm/arm.c
> >>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>      {
> >>>>>        if (mi_delta > 255)
> >>>>>       {
> >>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>> -       assemble_name (file, label);
> >>>>> -       fputs ("+4\n", file);
> >>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>> +          pool: we build it explicitly.  */
> >>>>> +       if (target_pure_code)
> >>>>> +         {
> >>>>> +           bool mov_done_p = false;
> >>>>> +           int i;
> >>>>> +
> >>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>> +           for (i = 0; i < 3; i++)
> >>>>> +             {
> >>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>> +
> >>>>> +               if (byte)
> >>>>> +                 {
> >>>>> +                   if (mov_done_p)
> >>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>> +                   else
> >>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>> +                   mov_done_p = true;
> >>>>> +                 }
> >>>>> +
> >>>>> +               if (mov_done_p)
> >>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>> +             }
> >>>>> +
> >>>>> +           /* Emit lower byte if needed.  */
> >>>>> +           if (!mov_done_p)
> >>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>> +           else if (mi_delta & 0xff)
> >>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>> +         }
> >>>>> +       else
> >>>>> +         {
> >>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>> +           assemble_name (file, label);
> >>>>> +           fputs ("+4\n", file);
> >>>>> +         }
> >>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>                      mi_op, this_regno, this_regno);
> >>>>>       }
> >>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>
> >>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>> -      assemble_name (file, label);
> >>>>> -      fputs (":\n", file);
> >>>>> -      if (flag_pic)
> >>>>> +
> >>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>> +      function address and delta since we emitted code to build
> >>>>> +      them.  */
> >>>>> +      if (!target_pure_code)
> >>>>>       {
> >>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>> -          accordingly.  */
> >>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>> -                            tem,
> >>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>> -                                                ggc_strdup (labelpc)));
> >>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>> -     }
> >>>>> -      else
> >>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>> +       assemble_name (file, label);
> >>>>> +       fputs (":\n", file);
> >>>>> +       if (flag_pic)
> >>>>> +         {
> >>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>> +              accordingly.  */
> >>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>> +                                tem,
> >>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>> +                                                    ggc_strdup (labelpc)));
> >>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>> +         }
> >>>>> +       else
> >>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>
> >>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>> +     }
> >>>>>      }
> >>>>>    else
> >>>>>      {
> >>>>>
> >>>>
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-19 16:32           ` Christophe Lyon
@ 2020-10-20 11:22             ` Richard Earnshaw
  2020-10-20 11:25               ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-20 11:22 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>
>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>
>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>
>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>> movs/adds/lsls.
>>>>>>>
>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>> with -mpure-code.
>>>>>>>
>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>> eg. pr46287-3.C
>>>>>>>
>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>
>>>>>>>       gcc/
>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>
>>>>> Hi Richard,
>>>>>
>>>>> Thanks for your comments.
>>>>>
>>>>>> There are some optimizations you can make to this code.
>>>>>>
>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>> to just expand a mov of 255 followed by an add.
>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>> before, so I can write something similar indeed.
>>>>>
>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>> when -mpure-code is NOT used.
>>>>> Consider:
>>>>> int f3_1 (void) { return 510; }
>>>>> int f3_2 (void) { return 511; }
>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>> f3_1:
>>>>>         movs    r0, #255
>>>>>         lsls    r0, r0, #1
>>>>>         bx      lr
>>>>> f3_2:
>>>>>         ldr     r0, .L4
>>>>>         bx      lr
>>>>>
>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>> not having to load the constant?
>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>> account when comparing code size,
>>>>
>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>  It might happen that the entry could be shared with another use of that
>>>> literal, but in general that's rare.
>>>>
>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>
>>>> It's a bit borderline at three instructions when literal pools are not
>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>> want to use the instruction sequence unless optimizing for size.
>>>>
>>>>>
>>>>>
>>>>>> This is also true for
>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>> this.
>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>> f3_1:
>>>>>         movs    r0, #255
>>>>>         lsls    r0, r0, #1
>>>>>         bx      lr
>>>>> f3_2:
>>>>>         movs    r0, #1
>>>>>         lsls    r0, r0, #8
>>>>>         adds    r0, r0, #255
>>>>>         bx      lr
>>>>>
>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>
>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>> emitting instructions directly rather than generating RTL.  The examples
>>>> you show here are not thunks.
>>>>
>>> OK thanks for the clarification.
>>>
>>> Here is an updated version, split into 3 patches to hopefully make
>>> review easier.
>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>
>>> I kept it this way to make incremental changes easier to understand.
>>>
>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>> asm. This way, all the code used to build thumb-1 constants is in the
>>> same place,
>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>> several cases matching your comments.
>>>
>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>> calls thumb1_gen_const_int.
>>>
>>> Patch 3: Update of the original patch in this thread, now calls
>>> thumb1_gen_const_int.
>>
>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>
>> I think we should be able to leverage the fact that the compiler can use
>> C++ now to do much better than that, for example by making that function
>> a template.  For example (and this is just a sketch):
>>
> 
> Indeed! I didn't think about it since there is no other use of
> templates in arm.c yet.
> I'll send an update soon.
> 
> Other than that, does the approach look OK to you?
> 

Yes, I think this is heading in the right direction.  Bringing the two
immediate generating operations into a single function can only be a
good thing.

Looking again at your example constant sequences, I see:

0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17

The first of these looks odd, given the second sequence.  Why doesn't
the first expand to

0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #16

?

R.

> Thanks,
> 
> Christophe
> 
>> class t1_rtl
>> {
>>  public:
>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>> };
>>
>> class t1_print
>> {
>>  public:
>>   t1_print (FILE *f) : t_file(f) {}
>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>  private:
>>   FILE *t_file;
>> };
>>
>> template <class T>
>> void thumb1_gen_const_int(T t, int f)
>> {
>>   // Expansion of thumb1_gen_const_int ...
>>   t.ashift(f);
>> }
>>
>> // Usage...
>> void f1()
>> {
>>   // Use the RTL expander
>>   t1_rtl g;
>>   thumb1_gen_const_int (g, 3);
>> }
>>
>> void f2()
>> {
>>   // Use the printf expander writing to stdout
>>   t1_print g(stdout);
>>   thumb1_gen_const_int (g, 3);
>> }
>>
>> With this you can write thumb1_gen_const_int without having to worry
>> about which expander is being used in each instance and the template
>> expansion will use the right version.
>>
>> R.
>>
>>>
>>>>
>>>>>
>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>> even greater range of literals.
>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>> I'll update my patch.
>>>>>
>>>>>>  More generally, any sequence of up to
>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>> existing literal pool fall back.
>>>>>>
>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>> be merged into a single shift-by-16.
>>>>>
>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>
>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>> all possible.
>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>> that showed the problem. There are also:
>>>>> g++.dg/opt/thunk1.C
>>>>> g++.dg/ipa/pr46984.C
>>>>> g++.dg/torture/pr46287.C
>>>>> g++.dg/torture/pr45699.C
>>>>>
>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>> -mpure-code in dg-options?
>>>>
>>>> On reflection, probably not - that just makes things more complicated
>>>> with all the dg-options mess (I'm worried about interactions with other
>>>> sets of options on the command line and the fall-out from that).  If
>>>> someone cares about pure-code they should be doing full testsuite runs
>>>> with it enabled and that should be sufficient.
>>>
>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>> modified simulator.
>>>
>>> Thanks,
>>>
>>> Christophe
>>>
>>>>
>>>> R.
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Christophe
>>>>>
>>>>>> R.
>>>>>>
>>>>>>>
>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>> ---
>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>
>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>      {
>>>>>>>        if (mi_delta > 255)
>>>>>>>       {
>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>> -       assemble_name (file, label);
>>>>>>> -       fputs ("+4\n", file);
>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>> +          pool: we build it explicitly.  */
>>>>>>> +       if (target_pure_code)
>>>>>>> +         {
>>>>>>> +           bool mov_done_p = false;
>>>>>>> +           int i;
>>>>>>> +
>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>> +             {
>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>> +
>>>>>>> +               if (byte)
>>>>>>> +                 {
>>>>>>> +                   if (mov_done_p)
>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>> +                   else
>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>> +                   mov_done_p = true;
>>>>>>> +                 }
>>>>>>> +
>>>>>>> +               if (mov_done_p)
>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>> +             }
>>>>>>> +
>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>> +           if (!mov_done_p)
>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>> +         }
>>>>>>> +       else
>>>>>>> +         {
>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>> +           assemble_name (file, label);
>>>>>>> +           fputs ("+4\n", file);
>>>>>>> +         }
>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>       }
>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>
>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>> -      assemble_name (file, label);
>>>>>>> -      fputs (":\n", file);
>>>>>>> -      if (flag_pic)
>>>>>>> +
>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>> +      function address and delta since we emitted code to build
>>>>>>> +      them.  */
>>>>>>> +      if (!target_pure_code)
>>>>>>>       {
>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>> -          accordingly.  */
>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>> -                            tem,
>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>> -     }
>>>>>>> -      else
>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>> +       assemble_name (file, label);
>>>>>>> +       fputs (":\n", file);
>>>>>>> +       if (flag_pic)
>>>>>>> +         {
>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>> +              accordingly.  */
>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>> +                                tem,
>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>> +         }
>>>>>>> +       else
>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>
>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>> +     }
>>>>>>>      }
>>>>>>>    else
>>>>>>>      {
>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-20 11:22             ` Richard Earnshaw
@ 2020-10-20 11:25               ` Richard Earnshaw
  2020-10-21 15:49                 ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-20 11:25 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 20/10/2020 12:22, Richard Earnshaw wrote:
> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>
>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>
>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>
>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>
>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>> movs/adds/lsls.
>>>>>>>>
>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>> with -mpure-code.
>>>>>>>>
>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>> eg. pr46287-3.C
>>>>>>>>
>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>
>>>>>>>>       gcc/
>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>
>>>>>> Hi Richard,
>>>>>>
>>>>>> Thanks for your comments.
>>>>>>
>>>>>>> There are some optimizations you can make to this code.
>>>>>>>
>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>> before, so I can write something similar indeed.
>>>>>>
>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>> when -mpure-code is NOT used.
>>>>>> Consider:
>>>>>> int f3_1 (void) { return 510; }
>>>>>> int f3_2 (void) { return 511; }
>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>> f3_1:
>>>>>>         movs    r0, #255
>>>>>>         lsls    r0, r0, #1
>>>>>>         bx      lr
>>>>>> f3_2:
>>>>>>         ldr     r0, .L4
>>>>>>         bx      lr
>>>>>>
>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>> not having to load the constant?
>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>> account when comparing code size,
>>>>>
>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>  It might happen that the entry could be shared with another use of that
>>>>> literal, but in general that's rare.
>>>>>
>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>
>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>
>>>>>>
>>>>>>
>>>>>>> This is also true for
>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>> this.
>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>> f3_1:
>>>>>>         movs    r0, #255
>>>>>>         lsls    r0, r0, #1
>>>>>>         bx      lr
>>>>>> f3_2:
>>>>>>         movs    r0, #1
>>>>>>         lsls    r0, r0, #8
>>>>>>         adds    r0, r0, #255
>>>>>>         bx      lr
>>>>>>
>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>
>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>> you show here are not thunks.
>>>>>
>>>> OK thanks for the clarification.
>>>>
>>>> Here is an updated version, split into 3 patches to hopefully make
>>>> review easier.
>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>
>>>> I kept it this way to make incremental changes easier to understand.
>>>>
>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>> same place,
>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>> several cases matching your comments.
>>>>
>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>> calls thumb1_gen_const_int.
>>>>
>>>> Patch 3: Update of the original patch in this thread, now calls
>>>> thumb1_gen_const_int.
>>>
>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>
>>> I think we should be able to leverage the fact that the compiler can use
>>> C++ now to do much better than that, for example by making that function
>>> a template.  For example (and this is just a sketch):
>>>
>>
>> Indeed! I didn't think about it since there is no other use of
>> templates in arm.c yet.
>> I'll send an update soon.
>>
>> Other than that, does the approach look OK to you?
>>
> 
> Yes, I think this is heading in the right direction.  Bringing the two
> immediate generating operations into a single function can only be a
> good thing.
> 
> Looking again at your example constant sequences, I see:
> 
> 0x1000010:
>         movs    r3, #16
>         lsls    r3, #16
>         adds    r3, #1
>         lsls    r3, #4
> 0x1000011:
>         movs    r3, #1
>         lsls    r3, #24
>         adds    r3, #17
> 
> The first of these looks odd, given the second sequence.  Why doesn't
> the first expand to
> 
> 0x1000010:
>         movs    r3, #16
>         lsls    r3, #16
>         adds    r3, #16
> 
Err, I mean to:


0x1000010:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #16

> ?
> 
> R.
> 
>> Thanks,
>>
>> Christophe
>>
>>> class t1_rtl
>>> {
>>>  public:
>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>> };
>>>
>>> class t1_print
>>> {
>>>  public:
>>>   t1_print (FILE *f) : t_file(f) {}
>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>  private:
>>>   FILE *t_file;
>>> };
>>>
>>> template <class T>
>>> void thumb1_gen_const_int(T t, int f)
>>> {
>>>   // Expansion of thumb1_gen_const_int ...
>>>   t.ashift(f);
>>> }
>>>
>>> // Usage...
>>> void f1()
>>> {
>>>   // Use the RTL expander
>>>   t1_rtl g;
>>>   thumb1_gen_const_int (g, 3);
>>> }
>>>
>>> void f2()
>>> {
>>>   // Use the printf expander writing to stdout
>>>   t1_print g(stdout);
>>>   thumb1_gen_const_int (g, 3);
>>> }
>>>
>>> With this you can write thumb1_gen_const_int without having to worry
>>> about which expander is being used in each instance and the template
>>> expansion will use the right version.
>>>
>>> R.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>> even greater range of literals.
>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>> I'll update my patch.
>>>>>>
>>>>>>>  More generally, any sequence of up to
>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>> existing literal pool fall back.
>>>>>>>
>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>> be merged into a single shift-by-16.
>>>>>>
>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>
>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>> all possible.
>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>> that showed the problem. There are also:
>>>>>> g++.dg/opt/thunk1.C
>>>>>> g++.dg/ipa/pr46984.C
>>>>>> g++.dg/torture/pr46287.C
>>>>>> g++.dg/torture/pr45699.C
>>>>>>
>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>> -mpure-code in dg-options?
>>>>>
>>>>> On reflection, probably not - that just makes things more complicated
>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>> sets of options on the command line and the fall-out from that).  If
>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>> with it enabled and that should be sufficient.
>>>>
>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>> modified simulator.
>>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>>>>
>>>>> R.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Christophe
>>>>>>
>>>>>>> R.
>>>>>>>
>>>>>>>>
>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>> ---
>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>      {
>>>>>>>>        if (mi_delta > 255)
>>>>>>>>       {
>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>> -       assemble_name (file, label);
>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>> +       if (target_pure_code)
>>>>>>>> +         {
>>>>>>>> +           bool mov_done_p = false;
>>>>>>>> +           int i;
>>>>>>>> +
>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>> +             {
>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>> +
>>>>>>>> +               if (byte)
>>>>>>>> +                 {
>>>>>>>> +                   if (mov_done_p)
>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>> +                   else
>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>> +                   mov_done_p = true;
>>>>>>>> +                 }
>>>>>>>> +
>>>>>>>> +               if (mov_done_p)
>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>> +             }
>>>>>>>> +
>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>> +           if (!mov_done_p)
>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>> +         }
>>>>>>>> +       else
>>>>>>>> +         {
>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>> +           assemble_name (file, label);
>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>> +         }
>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>       }
>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>
>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>> -      assemble_name (file, label);
>>>>>>>> -      fputs (":\n", file);
>>>>>>>> -      if (flag_pic)
>>>>>>>> +
>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>> +      them.  */
>>>>>>>> +      if (!target_pure_code)
>>>>>>>>       {
>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>> -          accordingly.  */
>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>> -                            tem,
>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>> -     }
>>>>>>>> -      else
>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>> +       assemble_name (file, label);
>>>>>>>> +       fputs (":\n", file);
>>>>>>>> +       if (flag_pic)
>>>>>>>> +         {
>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>> +              accordingly.  */
>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>> +                                tem,
>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>> +         }
>>>>>>>> +       else
>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>
>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>> +     }
>>>>>>>>      }
>>>>>>>>    else
>>>>>>>>      {
>>>>>>>>
>>>>>>>
>>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-20 11:25               ` Richard Earnshaw
@ 2020-10-21 15:49                 ` Christophe Lyon
  2020-10-21 16:07                   ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-21 15:49 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 20/10/2020 12:22, Richard Earnshaw wrote:
> > On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>
> >>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>
> >>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>
> >>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>
> >>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>> movs/adds/lsls.
> >>>>>>>>
> >>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>> with -mpure-code.
> >>>>>>>>
> >>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>> eg. pr46287-3.C
> >>>>>>>>
> >>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>
> >>>>>>>>       gcc/
> >>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>
> >>>>>> Hi Richard,
> >>>>>>
> >>>>>> Thanks for your comments.
> >>>>>>
> >>>>>>> There are some optimizations you can make to this code.
> >>>>>>>
> >>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>> before, so I can write something similar indeed.
> >>>>>>
> >>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>> when -mpure-code is NOT used.
> >>>>>> Consider:
> >>>>>> int f3_1 (void) { return 510; }
> >>>>>> int f3_2 (void) { return 511; }
> >>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>> f3_1:
> >>>>>>         movs    r0, #255
> >>>>>>         lsls    r0, r0, #1
> >>>>>>         bx      lr
> >>>>>> f3_2:
> >>>>>>         ldr     r0, .L4
> >>>>>>         bx      lr
> >>>>>>
> >>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>> not having to load the constant?
> >>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>> account when comparing code size,
> >>>>>
> >>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>  It might happen that the entry could be shared with another use of that
> >>>>> literal, but in general that's rare.
> >>>>>
> >>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>
> >>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> This is also true for
> >>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>> this.
> >>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>> f3_1:
> >>>>>>         movs    r0, #255
> >>>>>>         lsls    r0, r0, #1
> >>>>>>         bx      lr
> >>>>>> f3_2:
> >>>>>>         movs    r0, #1
> >>>>>>         lsls    r0, r0, #8
> >>>>>>         adds    r0, r0, #255
> >>>>>>         bx      lr
> >>>>>>
> >>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>
> >>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>> you show here are not thunks.
> >>>>>
> >>>> OK thanks for the clarification.
> >>>>
> >>>> Here is an updated version, split into 3 patches to hopefully make
> >>>> review easier.
> >>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>
> >>>> I kept it this way to make incremental changes easier to understand.
> >>>>
> >>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>> same place,
> >>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>> several cases matching your comments.
> >>>>
> >>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>> calls thumb1_gen_const_int.
> >>>>
> >>>> Patch 3: Update of the original patch in this thread, now calls
> >>>> thumb1_gen_const_int.
> >>>
> >>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>
> >>> I think we should be able to leverage the fact that the compiler can use
> >>> C++ now to do much better than that, for example by making that function
> >>> a template.  For example (and this is just a sketch):
> >>>
> >>
> >> Indeed! I didn't think about it since there is no other use of
> >> templates in arm.c yet.
> >> I'll send an update soon.
> >>
> >> Other than that, does the approach look OK to you?
> >>
> >
> > Yes, I think this is heading in the right direction.  Bringing the two
> > immediate generating operations into a single function can only be a
> > good thing.
> >
> > Looking again at your example constant sequences, I see:
> >
> > 0x1000010:
> >         movs    r3, #16
> >         lsls    r3, #16
> >         adds    r3, #1
> >         lsls    r3, #4
> > 0x1000011:
> >         movs    r3, #1
> >         lsls    r3, #24
> >         adds    r3, #17
> >
> > The first of these looks odd, given the second sequence.  Why doesn't
> > the first expand to
> >
> > 0x1000010:
> >         movs    r3, #16
> >         lsls    r3, #16
> >         adds    r3, #16
> >
> Err, I mean to:
>
>
> 0x1000010:
>         movs    r3, #1
>         lsls    r3, #24
>         adds    r3, #16
>
> ?

Because I first try to right-shift the constant, hoping to reduce its
range and need less instructions to build the higher part, then
left-shift back.

In this particular case, we'd need to realize that there are many
zeros "inside" the constant.

If I remove the part that tries to reduce the range, I do get that
sequence, but for 764 I now generate
movs    r3, #2
lsls    r3, #8
adds    r3, #252
instead of
movs    r3, #191
lsls    r3, #2

A possibility would be to try both approaches and keep the shortest one.


> >
> > R.
> >
> >> Thanks,
> >>
> >> Christophe
> >>
> >>> class t1_rtl
> >>> {
> >>>  public:
> >>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>> };
> >>>
> >>> class t1_print
> >>> {
> >>>  public:
> >>>   t1_print (FILE *f) : t_file(f) {}
> >>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>  private:
> >>>   FILE *t_file;
> >>> };
> >>>
> >>> template <class T>
> >>> void thumb1_gen_const_int(T t, int f)
> >>> {
> >>>   // Expansion of thumb1_gen_const_int ...
> >>>   t.ashift(f);
> >>> }
> >>>
> >>> // Usage...
> >>> void f1()
> >>> {
> >>>   // Use the RTL expander
> >>>   t1_rtl g;
> >>>   thumb1_gen_const_int (g, 3);
> >>> }
> >>>
> >>> void f2()
> >>> {
> >>>   // Use the printf expander writing to stdout
> >>>   t1_print g(stdout);
> >>>   thumb1_gen_const_int (g, 3);
> >>> }
> >>>
> >>> With this you can write thumb1_gen_const_int without having to worry
> >>> about which expander is being used in each instance and the template
> >>> expansion will use the right version.
> >>>
> >>> R.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>> even greater range of literals.
> >>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>> I'll update my patch.
> >>>>>>
> >>>>>>>  More generally, any sequence of up to
> >>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>> existing literal pool fall back.
> >>>>>>>
> >>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>> be merged into a single shift-by-16.
> >>>>>>
> >>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>
> >>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>> all possible.
> >>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>> that showed the problem. There are also:
> >>>>>> g++.dg/opt/thunk1.C
> >>>>>> g++.dg/ipa/pr46984.C
> >>>>>> g++.dg/torture/pr46287.C
> >>>>>> g++.dg/torture/pr45699.C
> >>>>>>
> >>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>> -mpure-code in dg-options?
> >>>>>
> >>>>> On reflection, probably not - that just makes things more complicated
> >>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>> sets of options on the command line and the fall-out from that).  If
> >>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>> with it enabled and that should be sufficient.
> >>>>
> >>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>> modified simulator.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Christophe
> >>>>
> >>>>>
> >>>>> R.
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Christophe
> >>>>>>
> >>>>>>> R.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>> ---
> >>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>      {
> >>>>>>>>        if (mi_delta > 255)
> >>>>>>>>       {
> >>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>> -       assemble_name (file, label);
> >>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>> +       if (target_pure_code)
> >>>>>>>> +         {
> >>>>>>>> +           bool mov_done_p = false;
> >>>>>>>> +           int i;
> >>>>>>>> +
> >>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>> +             {
> >>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>> +
> >>>>>>>> +               if (byte)
> >>>>>>>> +                 {
> >>>>>>>> +                   if (mov_done_p)
> >>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>> +                   else
> >>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>> +                   mov_done_p = true;
> >>>>>>>> +                 }
> >>>>>>>> +
> >>>>>>>> +               if (mov_done_p)
> >>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>> +             }
> >>>>>>>> +
> >>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>> +           if (!mov_done_p)
> >>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>> +         }
> >>>>>>>> +       else
> >>>>>>>> +         {
> >>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>> +           assemble_name (file, label);
> >>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>> +         }
> >>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>       }
> >>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>
> >>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>> -      assemble_name (file, label);
> >>>>>>>> -      fputs (":\n", file);
> >>>>>>>> -      if (flag_pic)
> >>>>>>>> +
> >>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>> +      them.  */
> >>>>>>>> +      if (!target_pure_code)
> >>>>>>>>       {
> >>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>> -          accordingly.  */
> >>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>> -                            tem,
> >>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>> -     }
> >>>>>>>> -      else
> >>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>> +       assemble_name (file, label);
> >>>>>>>> +       fputs (":\n", file);
> >>>>>>>> +       if (flag_pic)
> >>>>>>>> +         {
> >>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>> +              accordingly.  */
> >>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>> +                                tem,
> >>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>> +         }
> >>>>>>>> +       else
> >>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>
> >>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>> +     }
> >>>>>>>>      }
> >>>>>>>>    else
> >>>>>>>>      {
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-21 15:49                 ` Christophe Lyon
@ 2020-10-21 16:07                   ` Richard Earnshaw
  2020-10-21 16:11                     ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-21 16:07 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>
>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>
>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>
>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>
>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>> with -mpure-code.
>>>>>>>>>>
>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>
>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>
>>>>>>>>>>       gcc/
>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>
>>>>>>>> Hi Richard,
>>>>>>>>
>>>>>>>> Thanks for your comments.
>>>>>>>>
>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>
>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>
>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>> when -mpure-code is NOT used.
>>>>>>>> Consider:
>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>> f3_1:
>>>>>>>>         movs    r0, #255
>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>         bx      lr
>>>>>>>> f3_2:
>>>>>>>>         ldr     r0, .L4
>>>>>>>>         bx      lr
>>>>>>>>
>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>> not having to load the constant?
>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>> account when comparing code size,
>>>>>>>
>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>> literal, but in general that's rare.
>>>>>>>
>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>
>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> This is also true for
>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>> this.
>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>> f3_1:
>>>>>>>>         movs    r0, #255
>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>         bx      lr
>>>>>>>> f3_2:
>>>>>>>>         movs    r0, #1
>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>         adds    r0, r0, #255
>>>>>>>>         bx      lr
>>>>>>>>
>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>
>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>> you show here are not thunks.
>>>>>>>
>>>>>> OK thanks for the clarification.
>>>>>>
>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>> review easier.
>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>
>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>
>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>> same place,
>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>> several cases matching your comments.
>>>>>>
>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>> calls thumb1_gen_const_int.
>>>>>>
>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>> thumb1_gen_const_int.
>>>>>
>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>
>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>> C++ now to do much better than that, for example by making that function
>>>>> a template.  For example (and this is just a sketch):
>>>>>
>>>>
>>>> Indeed! I didn't think about it since there is no other use of
>>>> templates in arm.c yet.
>>>> I'll send an update soon.
>>>>
>>>> Other than that, does the approach look OK to you?
>>>>
>>>
>>> Yes, I think this is heading in the right direction.  Bringing the two
>>> immediate generating operations into a single function can only be a
>>> good thing.
>>>
>>> Looking again at your example constant sequences, I see:
>>>
>>> 0x1000010:
>>>         movs    r3, #16
>>>         lsls    r3, #16
>>>         adds    r3, #1
>>>         lsls    r3, #4
>>> 0x1000011:
>>>         movs    r3, #1
>>>         lsls    r3, #24
>>>         adds    r3, #17
>>>
>>> The first of these looks odd, given the second sequence.  Why doesn't
>>> the first expand to
>>>
>>> 0x1000010:
>>>         movs    r3, #16
>>>         lsls    r3, #16
>>>         adds    r3, #16
>>>
>> Err, I mean to:
>>
>>
>> 0x1000010:
>>         movs    r3, #1
>>         lsls    r3, #24
>>         adds    r3, #16
>>
>> ?
> 
> Because I first try to right-shift the constant, hoping to reduce its
> range and need less instructions to build the higher part, then
> left-shift back.
> 
> In this particular case, we'd need to realize that there are many
> zeros "inside" the constant.
> 
> If I remove the part that tries to reduce the range, I do get that
> sequence, but for 764 I now generate
> movs    r3, #2
> lsls    r3, #8
> adds    r3, #252
> instead of
> movs    r3, #191
> lsls    r3, #2
> 
> A possibility would be to try both approaches and keep the shortest one.

Lets leave that for now, it's not important to fixing the main issue;
but we should remember we need to come back to it at some point.

There are other tricks as well, such as

  0xffffff

can be done as

  0x1000000 - 1

and

  0xfffffd

as

  0x1000000 - 3

but these can wait as well.


R.

> 
> 
>>>
>>> R.
>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>>>> class t1_rtl
>>>>> {
>>>>>  public:
>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>> };
>>>>>
>>>>> class t1_print
>>>>> {
>>>>>  public:
>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>  private:
>>>>>   FILE *t_file;
>>>>> };
>>>>>
>>>>> template <class T>
>>>>> void thumb1_gen_const_int(T t, int f)
>>>>> {
>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>   t.ashift(f);
>>>>> }
>>>>>
>>>>> // Usage...
>>>>> void f1()
>>>>> {
>>>>>   // Use the RTL expander
>>>>>   t1_rtl g;
>>>>>   thumb1_gen_const_int (g, 3);
>>>>> }
>>>>>
>>>>> void f2()
>>>>> {
>>>>>   // Use the printf expander writing to stdout
>>>>>   t1_print g(stdout);
>>>>>   thumb1_gen_const_int (g, 3);
>>>>> }
>>>>>
>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>> about which expander is being used in each instance and the template
>>>>> expansion will use the right version.
>>>>>
>>>>> R.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>> even greater range of literals.
>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>> I'll update my patch.
>>>>>>>>
>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>> existing literal pool fall back.
>>>>>>>>>
>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>
>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>
>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>> all possible.
>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>> that showed the problem. There are also:
>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>
>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>> -mpure-code in dg-options?
>>>>>>>
>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>> with it enabled and that should be sufficient.
>>>>>>
>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>> modified simulator.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Christophe
>>>>>>
>>>>>>>
>>>>>>> R.
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Christophe
>>>>>>>>
>>>>>>>>> R.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>> ---
>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>      {
>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>       {
>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>> +         {
>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>> +           int i;
>>>>>>>>>> +
>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>> +             {
>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>> +
>>>>>>>>>> +               if (byte)
>>>>>>>>>> +                 {
>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>> +                   else
>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>> +                 }
>>>>>>>>>> +
>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>> +             }
>>>>>>>>>> +
>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>> +         }
>>>>>>>>>> +       else
>>>>>>>>>> +         {
>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>> +         }
>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>       }
>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>
>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>> +
>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>> +      them.  */
>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>       {
>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>> -          accordingly.  */
>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>> -                            tem,
>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>> -     }
>>>>>>>>>> -      else
>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>> +         {
>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>> +              accordingly.  */
>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>> +                                tem,
>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>> +         }
>>>>>>>>>> +       else
>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>
>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>> +     }
>>>>>>>>>>      }
>>>>>>>>>>    else
>>>>>>>>>>      {
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-21 16:07                   ` Richard Earnshaw
@ 2020-10-21 16:11                     ` Christophe Lyon
  2020-10-21 17:36                       ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-21 16:11 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> > On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> > <Richard.Earnshaw@foss.arm.com> wrote:
> >>
> >> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>
> >>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>
> >>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>>>
> >>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>>>> movs/adds/lsls.
> >>>>>>>>>>
> >>>>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>>>> with -mpure-code.
> >>>>>>>>>>
> >>>>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>>>> eg. pr46287-3.C
> >>>>>>>>>>
> >>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>>>
> >>>>>>>>>>       gcc/
> >>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>>>
> >>>>>>>> Hi Richard,
> >>>>>>>>
> >>>>>>>> Thanks for your comments.
> >>>>>>>>
> >>>>>>>>> There are some optimizations you can make to this code.
> >>>>>>>>>
> >>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>>>> before, so I can write something similar indeed.
> >>>>>>>>
> >>>>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>>>> when -mpure-code is NOT used.
> >>>>>>>> Consider:
> >>>>>>>> int f3_1 (void) { return 510; }
> >>>>>>>> int f3_2 (void) { return 511; }
> >>>>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>>>> f3_1:
> >>>>>>>>         movs    r0, #255
> >>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>         bx      lr
> >>>>>>>> f3_2:
> >>>>>>>>         ldr     r0, .L4
> >>>>>>>>         bx      lr
> >>>>>>>>
> >>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>>>> not having to load the constant?
> >>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>>>> account when comparing code size,
> >>>>>>>
> >>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>>>  It might happen that the entry could be shared with another use of that
> >>>>>>> literal, but in general that's rare.
> >>>>>>>
> >>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>>>
> >>>>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> This is also true for
> >>>>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>>>> this.
> >>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>>>> f3_1:
> >>>>>>>>         movs    r0, #255
> >>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>         bx      lr
> >>>>>>>> f3_2:
> >>>>>>>>         movs    r0, #1
> >>>>>>>>         lsls    r0, r0, #8
> >>>>>>>>         adds    r0, r0, #255
> >>>>>>>>         bx      lr
> >>>>>>>>
> >>>>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>>>
> >>>>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>>>> you show here are not thunks.
> >>>>>>>
> >>>>>> OK thanks for the clarification.
> >>>>>>
> >>>>>> Here is an updated version, split into 3 patches to hopefully make
> >>>>>> review easier.
> >>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>>>
> >>>>>> I kept it this way to make incremental changes easier to understand.
> >>>>>>
> >>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>>>> same place,
> >>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>>>> several cases matching your comments.
> >>>>>>
> >>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>>>> calls thumb1_gen_const_int.
> >>>>>>
> >>>>>> Patch 3: Update of the original patch in this thread, now calls
> >>>>>> thumb1_gen_const_int.
> >>>>>
> >>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>>>
> >>>>> I think we should be able to leverage the fact that the compiler can use
> >>>>> C++ now to do much better than that, for example by making that function
> >>>>> a template.  For example (and this is just a sketch):
> >>>>>
> >>>>
> >>>> Indeed! I didn't think about it since there is no other use of
> >>>> templates in arm.c yet.
> >>>> I'll send an update soon.
> >>>>
> >>>> Other than that, does the approach look OK to you?
> >>>>
> >>>
> >>> Yes, I think this is heading in the right direction.  Bringing the two
> >>> immediate generating operations into a single function can only be a
> >>> good thing.
> >>>
> >>> Looking again at your example constant sequences, I see:
> >>>
> >>> 0x1000010:
> >>>         movs    r3, #16
> >>>         lsls    r3, #16
> >>>         adds    r3, #1
> >>>         lsls    r3, #4
> >>> 0x1000011:
> >>>         movs    r3, #1
> >>>         lsls    r3, #24
> >>>         adds    r3, #17
> >>>
> >>> The first of these looks odd, given the second sequence.  Why doesn't
> >>> the first expand to
> >>>
> >>> 0x1000010:
> >>>         movs    r3, #16
> >>>         lsls    r3, #16
> >>>         adds    r3, #16
> >>>
> >> Err, I mean to:
> >>
> >>
> >> 0x1000010:
> >>         movs    r3, #1
> >>         lsls    r3, #24
> >>         adds    r3, #16
> >>
> >> ?
> >
> > Because I first try to right-shift the constant, hoping to reduce its
> > range and need less instructions to build the higher part, then
> > left-shift back.
> >
> > In this particular case, we'd need to realize that there are many
> > zeros "inside" the constant.
> >
> > If I remove the part that tries to reduce the range, I do get that
> > sequence, but for 764 I now generate
> > movs    r3, #2
> > lsls    r3, #8
> > adds    r3, #252
> > instead of
> > movs    r3, #191
> > lsls    r3, #2
> >
> > A possibility would be to try both approaches and keep the shortest one.
>
> Lets leave that for now, it's not important to fixing the main issue;
> but we should remember we need to come back to it at some point.
>
Thanks, that's what I was thinking too.

> There are other tricks as well, such as
>
>   0xffffff
>
> can be done as
>
>   0x1000000 - 1
>
> and
>
>   0xfffffd
>
> as
>
>   0x1000000 - 3
>
> but these can wait as well.
>

Didn't we already need to handle such tricks? I'm surprised this
wasn't needed earlier.


>
> R.
>
> >
> >
> >>>
> >>> R.
> >>>
> >>>> Thanks,
> >>>>
> >>>> Christophe
> >>>>
> >>>>> class t1_rtl
> >>>>> {
> >>>>>  public:
> >>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>>>> };
> >>>>>
> >>>>> class t1_print
> >>>>> {
> >>>>>  public:
> >>>>>   t1_print (FILE *f) : t_file(f) {}
> >>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>>>  private:
> >>>>>   FILE *t_file;
> >>>>> };
> >>>>>
> >>>>> template <class T>
> >>>>> void thumb1_gen_const_int(T t, int f)
> >>>>> {
> >>>>>   // Expansion of thumb1_gen_const_int ...
> >>>>>   t.ashift(f);
> >>>>> }
> >>>>>
> >>>>> // Usage...
> >>>>> void f1()
> >>>>> {
> >>>>>   // Use the RTL expander
> >>>>>   t1_rtl g;
> >>>>>   thumb1_gen_const_int (g, 3);
> >>>>> }
> >>>>>
> >>>>> void f2()
> >>>>> {
> >>>>>   // Use the printf expander writing to stdout
> >>>>>   t1_print g(stdout);
> >>>>>   thumb1_gen_const_int (g, 3);
> >>>>> }
> >>>>>
> >>>>> With this you can write thumb1_gen_const_int without having to worry
> >>>>> about which expander is being used in each instance and the template
> >>>>> expansion will use the right version.
> >>>>>
> >>>>> R.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>>>> even greater range of literals.
> >>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>>>> I'll update my patch.
> >>>>>>>>
> >>>>>>>>>  More generally, any sequence of up to
> >>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>>>> existing literal pool fall back.
> >>>>>>>>>
> >>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>>>> be merged into a single shift-by-16.
> >>>>>>>>
> >>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>>>
> >>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>>>> all possible.
> >>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>>>> that showed the problem. There are also:
> >>>>>>>> g++.dg/opt/thunk1.C
> >>>>>>>> g++.dg/ipa/pr46984.C
> >>>>>>>> g++.dg/torture/pr46287.C
> >>>>>>>> g++.dg/torture/pr45699.C
> >>>>>>>>
> >>>>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>>>> -mpure-code in dg-options?
> >>>>>>>
> >>>>>>> On reflection, probably not - that just makes things more complicated
> >>>>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>>>> sets of options on the command line and the fall-out from that).  If
> >>>>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>>>> with it enabled and that should be sufficient.
> >>>>>>
> >>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>>>> modified simulator.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Christophe
> >>>>>>
> >>>>>>>
> >>>>>>> R.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Christophe
> >>>>>>>>
> >>>>>>>>> R.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>>>> ---
> >>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>      {
> >>>>>>>>>>        if (mi_delta > 255)
> >>>>>>>>>>       {
> >>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>>>> -       assemble_name (file, label);
> >>>>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>>>> +       if (target_pure_code)
> >>>>>>>>>> +         {
> >>>>>>>>>> +           bool mov_done_p = false;
> >>>>>>>>>> +           int i;
> >>>>>>>>>> +
> >>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>>>> +             {
> >>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>>>> +
> >>>>>>>>>> +               if (byte)
> >>>>>>>>>> +                 {
> >>>>>>>>>> +                   if (mov_done_p)
> >>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>>>> +                   else
> >>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>>>> +                   mov_done_p = true;
> >>>>>>>>>> +                 }
> >>>>>>>>>> +
> >>>>>>>>>> +               if (mov_done_p)
> >>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>>>> +             }
> >>>>>>>>>> +
> >>>>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>>>> +           if (!mov_done_p)
> >>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>> +         }
> >>>>>>>>>> +       else
> >>>>>>>>>> +         {
> >>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>>>> +           assemble_name (file, label);
> >>>>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>>>> +         }
> >>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>>>       }
> >>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>>>
> >>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>> -      assemble_name (file, label);
> >>>>>>>>>> -      fputs (":\n", file);
> >>>>>>>>>> -      if (flag_pic)
> >>>>>>>>>> +
> >>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>>>> +      them.  */
> >>>>>>>>>> +      if (!target_pure_code)
> >>>>>>>>>>       {
> >>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>> -          accordingly.  */
> >>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>> -                            tem,
> >>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>> -     }
> >>>>>>>>>> -      else
> >>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>> +       assemble_name (file, label);
> >>>>>>>>>> +       fputs (":\n", file);
> >>>>>>>>>> +       if (flag_pic)
> >>>>>>>>>> +         {
> >>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>> +              accordingly.  */
> >>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>> +                                tem,
> >>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>> +         }
> >>>>>>>>>> +       else
> >>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>
> >>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>> +     }
> >>>>>>>>>>      }
> >>>>>>>>>>    else
> >>>>>>>>>>      {
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-21 16:11                     ` Christophe Lyon
@ 2020-10-21 17:36                       ` Richard Earnshaw
  2020-10-22  8:45                         ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-21 17:36 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>
>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>
>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>
>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>
>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>
>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>
>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>
>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>
>>>>>>>>>> Hi Richard,
>>>>>>>>>>
>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>
>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>
>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>
>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>> Consider:
>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>> f3_1:
>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>         bx      lr
>>>>>>>>>> f3_2:
>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>         bx      lr
>>>>>>>>>>
>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>> not having to load the constant?
>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>> account when comparing code size,
>>>>>>>>>
>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>
>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>
>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> This is also true for
>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>> this.
>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>> f3_1:
>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>         bx      lr
>>>>>>>>>> f3_2:
>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>         bx      lr
>>>>>>>>>>
>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>
>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>> you show here are not thunks.
>>>>>>>>>
>>>>>>>> OK thanks for the clarification.
>>>>>>>>
>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>> review easier.
>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>
>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>
>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>> same place,
>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>> several cases matching your comments.
>>>>>>>>
>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>
>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>> thumb1_gen_const_int.
>>>>>>>
>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>
>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>
>>>>>>
>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>> templates in arm.c yet.
>>>>>> I'll send an update soon.
>>>>>>
>>>>>> Other than that, does the approach look OK to you?
>>>>>>
>>>>>
>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>> immediate generating operations into a single function can only be a
>>>>> good thing.
>>>>>
>>>>> Looking again at your example constant sequences, I see:
>>>>>
>>>>> 0x1000010:
>>>>>         movs    r3, #16
>>>>>         lsls    r3, #16
>>>>>         adds    r3, #1
>>>>>         lsls    r3, #4
>>>>> 0x1000011:
>>>>>         movs    r3, #1
>>>>>         lsls    r3, #24
>>>>>         adds    r3, #17
>>>>>
>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>> the first expand to
>>>>>
>>>>> 0x1000010:
>>>>>         movs    r3, #16
>>>>>         lsls    r3, #16
>>>>>         adds    r3, #16
>>>>>
>>>> Err, I mean to:
>>>>
>>>>
>>>> 0x1000010:
>>>>         movs    r3, #1
>>>>         lsls    r3, #24
>>>>         adds    r3, #16
>>>>
>>>> ?
>>>
>>> Because I first try to right-shift the constant, hoping to reduce its
>>> range and need less instructions to build the higher part, then
>>> left-shift back.
>>>
>>> In this particular case, we'd need to realize that there are many
>>> zeros "inside" the constant.
>>>
>>> If I remove the part that tries to reduce the range, I do get that
>>> sequence, but for 764 I now generate
>>> movs    r3, #2
>>> lsls    r3, #8
>>> adds    r3, #252
>>> instead of
>>> movs    r3, #191
>>> lsls    r3, #2
>>>
>>> A possibility would be to try both approaches and keep the shortest one.
>>
>> Lets leave that for now, it's not important to fixing the main issue;
>> but we should remember we need to come back to it at some point.
>>
> Thanks, that's what I was thinking too.
> 
>> There are other tricks as well, such as
>>
>>   0xffffff
>>
>> can be done as
>>
>>   0x1000000 - 1
>>
>> and
>>
>>   0xfffffd
>>
>> as
>>
>>   0x1000000 - 3
>>
>> but these can wait as well.
>>
> 
> Didn't we already need to handle such tricks? I'm surprised this
> wasn't needed earlier.
> 

I don't think we ever worried about them.  Most of them need at least 3
instructions so aren't a code size saving over using a literal pool entry.

R.

> 
>>
>> R.
>>
>>>
>>>
>>>>>
>>>>> R.
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Christophe
>>>>>>
>>>>>>> class t1_rtl
>>>>>>> {
>>>>>>>  public:
>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>> };
>>>>>>>
>>>>>>> class t1_print
>>>>>>> {
>>>>>>>  public:
>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>  private:
>>>>>>>   FILE *t_file;
>>>>>>> };
>>>>>>>
>>>>>>> template <class T>
>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>> {
>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>   t.ashift(f);
>>>>>>> }
>>>>>>>
>>>>>>> // Usage...
>>>>>>> void f1()
>>>>>>> {
>>>>>>>   // Use the RTL expander
>>>>>>>   t1_rtl g;
>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>> }
>>>>>>>
>>>>>>> void f2()
>>>>>>> {
>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>   t1_print g(stdout);
>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>> }
>>>>>>>
>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>> about which expander is being used in each instance and the template
>>>>>>> expansion will use the right version.
>>>>>>>
>>>>>>> R.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>> even greater range of literals.
>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>> I'll update my patch.
>>>>>>>>>>
>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>
>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>
>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>
>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>> all possible.
>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>
>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>
>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>
>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>> modified simulator.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Christophe
>>>>>>>>
>>>>>>>>>
>>>>>>>>> R.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Christophe
>>>>>>>>>>
>>>>>>>>>>> R.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>> ---
>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>      {
>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>       {
>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>> +         {
>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>> +           int i;
>>>>>>>>>>>> +
>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>> +             {
>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>> +
>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>> +                 {
>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>> +                   else
>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>> +                 }
>>>>>>>>>>>> +
>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +
>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>> +         }
>>>>>>>>>>>> +       else
>>>>>>>>>>>> +         {
>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>> +         }
>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>       }
>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>
>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>> +
>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>       {
>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>> -     }
>>>>>>>>>>>> -      else
>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>> +         {
>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>> +         }
>>>>>>>>>>>> +       else
>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>
>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>> +     }
>>>>>>>>>>>>      }
>>>>>>>>>>>>    else
>>>>>>>>>>>>      {
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-21 17:36                       ` Richard Earnshaw
@ 2020-10-22  8:45                         ` Christophe Lyon
  2020-10-22 15:22                           ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-22  8:45 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

[-- Attachment #1: Type: text/plain, Size: 19013 bytes --]

On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> > On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> > <Richard.Earnshaw@foss.arm.com> wrote:
> >>
> >> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>
> >>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>
> >>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>>>>>
> >>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>>>>>> movs/adds/lsls.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>>>>>> with -mpure-code.
> >>>>>>>>>>>>
> >>>>>>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>>>>>> eg. pr46287-3.C
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>>>>>
> >>>>>>>>>>>>       gcc/
> >>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>>>>>
> >>>>>>>>>> Hi Richard,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for your comments.
> >>>>>>>>>>
> >>>>>>>>>>> There are some optimizations you can make to this code.
> >>>>>>>>>>>
> >>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>>>>>> before, so I can write something similar indeed.
> >>>>>>>>>>
> >>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>>>>>> when -mpure-code is NOT used.
> >>>>>>>>>> Consider:
> >>>>>>>>>> int f3_1 (void) { return 510; }
> >>>>>>>>>> int f3_2 (void) { return 511; }
> >>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>>>>>> f3_1:
> >>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>         bx      lr
> >>>>>>>>>> f3_2:
> >>>>>>>>>>         ldr     r0, .L4
> >>>>>>>>>>         bx      lr
> >>>>>>>>>>
> >>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>>>>>> not having to load the constant?
> >>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>>>>>> account when comparing code size,
> >>>>>>>>>
> >>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>>>>>  It might happen that the entry could be shared with another use of that
> >>>>>>>>> literal, but in general that's rare.
> >>>>>>>>>
> >>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>>>>>
> >>>>>>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>>>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> This is also true for
> >>>>>>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>>>>>> this.
> >>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>>>>>> f3_1:
> >>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>         bx      lr
> >>>>>>>>>> f3_2:
> >>>>>>>>>>         movs    r0, #1
> >>>>>>>>>>         lsls    r0, r0, #8
> >>>>>>>>>>         adds    r0, r0, #255
> >>>>>>>>>>         bx      lr
> >>>>>>>>>>
> >>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>>>>>
> >>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>>>>>> you show here are not thunks.
> >>>>>>>>>
> >>>>>>>> OK thanks for the clarification.
> >>>>>>>>
> >>>>>>>> Here is an updated version, split into 3 patches to hopefully make
> >>>>>>>> review easier.
> >>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>>>>>
> >>>>>>>> I kept it this way to make incremental changes easier to understand.
> >>>>>>>>
> >>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>>>>>> same place,
> >>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>>>>>> several cases matching your comments.
> >>>>>>>>
> >>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>>>>>> calls thumb1_gen_const_int.
> >>>>>>>>
> >>>>>>>> Patch 3: Update of the original patch in this thread, now calls
> >>>>>>>> thumb1_gen_const_int.
> >>>>>>>
> >>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>>>>>
> >>>>>>> I think we should be able to leverage the fact that the compiler can use
> >>>>>>> C++ now to do much better than that, for example by making that function
> >>>>>>> a template.  For example (and this is just a sketch):
> >>>>>>>
> >>>>>>
> >>>>>> Indeed! I didn't think about it since there is no other use of
> >>>>>> templates in arm.c yet.
> >>>>>> I'll send an update soon.
> >>>>>>
> >>>>>> Other than that, does the approach look OK to you?
> >>>>>>
> >>>>>
> >>>>> Yes, I think this is heading in the right direction.  Bringing the two
> >>>>> immediate generating operations into a single function can only be a
> >>>>> good thing.
> >>>>>
> >>>>> Looking again at your example constant sequences, I see:
> >>>>>
> >>>>> 0x1000010:
> >>>>>         movs    r3, #16
> >>>>>         lsls    r3, #16
> >>>>>         adds    r3, #1
> >>>>>         lsls    r3, #4
> >>>>> 0x1000011:
> >>>>>         movs    r3, #1
> >>>>>         lsls    r3, #24
> >>>>>         adds    r3, #17
> >>>>>
> >>>>> The first of these looks odd, given the second sequence.  Why doesn't
> >>>>> the first expand to
> >>>>>
> >>>>> 0x1000010:
> >>>>>         movs    r3, #16
> >>>>>         lsls    r3, #16
> >>>>>         adds    r3, #16
> >>>>>
> >>>> Err, I mean to:
> >>>>
> >>>>
> >>>> 0x1000010:
> >>>>         movs    r3, #1
> >>>>         lsls    r3, #24
> >>>>         adds    r3, #16
> >>>>
> >>>> ?
> >>>
> >>> Because I first try to right-shift the constant, hoping to reduce its
> >>> range and need less instructions to build the higher part, then
> >>> left-shift back.
> >>>
> >>> In this particular case, we'd need to realize that there are many
> >>> zeros "inside" the constant.
> >>>
> >>> If I remove the part that tries to reduce the range, I do get that
> >>> sequence, but for 764 I now generate
> >>> movs    r3, #2
> >>> lsls    r3, #8
> >>> adds    r3, #252
> >>> instead of
> >>> movs    r3, #191
> >>> lsls    r3, #2
> >>>
> >>> A possibility would be to try both approaches and keep the shortest one.
> >>
> >> Lets leave that for now, it's not important to fixing the main issue;
> >> but we should remember we need to come back to it at some point.
> >>
> > Thanks, that's what I was thinking too.
> >
> >> There are other tricks as well, such as
> >>
> >>   0xffffff
> >>
> >> can be done as
> >>
> >>   0x1000000 - 1
> >>
> >> and
> >>
> >>   0xfffffd
> >>
> >> as
> >>
> >>   0x1000000 - 3
> >>
> >> but these can wait as well.
> >>
> >
> > Didn't we already need to handle such tricks? I'm surprised this
> > wasn't needed earlier.
> >
>
> I don't think we ever worried about them.  Most of them need at least 3
> instructions so aren't a code size saving over using a literal pool entry.
>
OK, this will also help when using -mslow-flash-data.

Here are updated patches, now using a template as you suggested.

Thanks,

Christophe

> R.
>
> >
> >>
> >> R.
> >>
> >>>
> >>>
> >>>>>
> >>>>> R.
> >>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Christophe
> >>>>>>
> >>>>>>> class t1_rtl
> >>>>>>> {
> >>>>>>>  public:
> >>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>>>>>> };
> >>>>>>>
> >>>>>>> class t1_print
> >>>>>>> {
> >>>>>>>  public:
> >>>>>>>   t1_print (FILE *f) : t_file(f) {}
> >>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>>>>>  private:
> >>>>>>>   FILE *t_file;
> >>>>>>> };
> >>>>>>>
> >>>>>>> template <class T>
> >>>>>>> void thumb1_gen_const_int(T t, int f)
> >>>>>>> {
> >>>>>>>   // Expansion of thumb1_gen_const_int ...
> >>>>>>>   t.ashift(f);
> >>>>>>> }
> >>>>>>>
> >>>>>>> // Usage...
> >>>>>>> void f1()
> >>>>>>> {
> >>>>>>>   // Use the RTL expander
> >>>>>>>   t1_rtl g;
> >>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>> }
> >>>>>>>
> >>>>>>> void f2()
> >>>>>>> {
> >>>>>>>   // Use the printf expander writing to stdout
> >>>>>>>   t1_print g(stdout);
> >>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>> }
> >>>>>>>
> >>>>>>> With this you can write thumb1_gen_const_int without having to worry
> >>>>>>> about which expander is being used in each instance and the template
> >>>>>>> expansion will use the right version.
> >>>>>>>
> >>>>>>> R.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>>>>>> even greater range of literals.
> >>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>>>>>> I'll update my patch.
> >>>>>>>>>>
> >>>>>>>>>>>  More generally, any sequence of up to
> >>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>>>>>> existing literal pool fall back.
> >>>>>>>>>>>
> >>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>>>>>> be merged into a single shift-by-16.
> >>>>>>>>>>
> >>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>>>>>
> >>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>>>>>> all possible.
> >>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>>>>>> that showed the problem. There are also:
> >>>>>>>>>> g++.dg/opt/thunk1.C
> >>>>>>>>>> g++.dg/ipa/pr46984.C
> >>>>>>>>>> g++.dg/torture/pr46287.C
> >>>>>>>>>> g++.dg/torture/pr45699.C
> >>>>>>>>>>
> >>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>>>>>> -mpure-code in dg-options?
> >>>>>>>>>
> >>>>>>>>> On reflection, probably not - that just makes things more complicated
> >>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>>>>>> sets of options on the command line and the fall-out from that).  If
> >>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>>>>>> with it enabled and that should be sufficient.
> >>>>>>>>
> >>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>>>>>> modified simulator.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Christophe
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> R.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Christophe
> >>>>>>>>>>
> >>>>>>>>>>> R.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>      {
> >>>>>>>>>>>>        if (mi_delta > 255)
> >>>>>>>>>>>>       {
> >>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>> -       assemble_name (file, label);
> >>>>>>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>>>>>> +       if (target_pure_code)
> >>>>>>>>>>>> +         {
> >>>>>>>>>>>> +           bool mov_done_p = false;
> >>>>>>>>>>>> +           int i;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>>>>>> +             {
> >>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +               if (byte)
> >>>>>>>>>>>> +                 {
> >>>>>>>>>>>> +                   if (mov_done_p)
> >>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>>>>>> +                   else
> >>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>>>>>> +                   mov_done_p = true;
> >>>>>>>>>>>> +                 }
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +               if (mov_done_p)
> >>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>>>>>> +             }
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>>>>>> +           if (!mov_done_p)
> >>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>> +         }
> >>>>>>>>>>>> +       else
> >>>>>>>>>>>> +         {
> >>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>> +           assemble_name (file, label);
> >>>>>>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>>>>>> +         }
> >>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>>>>>       }
> >>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>>>>>
> >>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>> -      assemble_name (file, label);
> >>>>>>>>>>>> -      fputs (":\n", file);
> >>>>>>>>>>>> -      if (flag_pic)
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>>>>>> +      them.  */
> >>>>>>>>>>>> +      if (!target_pure_code)
> >>>>>>>>>>>>       {
> >>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>> -          accordingly.  */
> >>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>> -                            tem,
> >>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>> -     }
> >>>>>>>>>>>> -      else
> >>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>> +       assemble_name (file, label);
> >>>>>>>>>>>> +       fputs (":\n", file);
> >>>>>>>>>>>> +       if (flag_pic)
> >>>>>>>>>>>> +         {
> >>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>> +              accordingly.  */
> >>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>> +                                tem,
> >>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>> +         }
> >>>>>>>>>>>> +       else
> >>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>
> >>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>> +     }
> >>>>>>>>>>>>      }
> >>>>>>>>>>>>    else
> >>>>>>>>>>>>      {
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>
>

[-- Attachment #2: 0003-arm-Fix-multiple-inheritance-thunks-for-thumb-1-with.patch --]
[-- Type: text/x-patch, Size: 4058 bytes --]

From 592f8871ea23530409a030fd4a5806d34d7b6b7f Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Tue, 29 Sep 2020 19:40:39 +0000
Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with
 -mpure-code

When -mpure-code is used, we cannot load delta from code memory (like
we do without -mpure-code).

This patch builds the value of mi_delta into r3 with a series of
movs/adds/lsls.

We also do some cleanup by not emitting the function address and delta
via .word directives at the end of the thunk since we don't use them
with -mpure-code.

No need for new testcases, this bug was already identified by:
g++.dg/ipa/pr46287-3.C
g++.dg/ipa/pr46984.C
g++.dg/opt/thunk1.C
g++.dg/torture/pr46287.C
g++.dg/torture/pr45699.C

2020-10-12  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
	do not emit function address and delta when -mpure-code is used.
---
 gcc/config/arm/arm.c | 67 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 0a78344..8be64f2 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28473,9 +28473,19 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
     {
       if (mi_delta > 255)
 	{
-	  fputs ("\tldr\tr3, ", file);
-	  assemble_name (file, label);
-	  fputs ("+4\n", file);
+	  /* With -mpure-code, we cannot load MI_DELTA from the
+	     constant pool: we build it explicitly.  */
+	  if (target_pure_code)
+	    {
+	      t1_print r3 (file, 3);
+	      thumb1_gen_const_int_1 (r3, mi_delta);
+	    }
+	  else
+	    {
+	      fputs ("\tldr\tr3, ", file);
+	      assemble_name (file, label);
+	      fputs ("+4\n", file);
+	    }
 	  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
 		       mi_op, this_regno, this_regno);
 	}
@@ -28511,30 +28521,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
 	fputs ("\tpop\t{r3}\n", file);
 
       fprintf (file, "\tbx\tr12\n");
-      ASM_OUTPUT_ALIGN (file, 2);
-      assemble_name (file, label);
-      fputs (":\n", file);
-      if (flag_pic)
+
+      /* With -mpure-code, we don't need to emit literals for the
+	 function address and delta since we emitted code to build
+	 them.  */
+      if (!target_pure_code)
 	{
-	  /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
-	  rtx tem = XEXP (DECL_RTL (function), 0);
-	  /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
-	     pipeline offset is four rather than eight.  Adjust the offset
-	     accordingly.  */
-	  tem = plus_constant (GET_MODE (tem), tem,
-			       TARGET_THUMB1_ONLY ? -3 : -7);
-	  tem = gen_rtx_MINUS (GET_MODE (tem),
-			       tem,
-			       gen_rtx_SYMBOL_REF (Pmode,
-						   ggc_strdup (labelpc)));
-	  assemble_integer (tem, 4, BITS_PER_WORD, 1);
-	}
-      else
-	/* Output ".word .LTHUNKn".  */
-	assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
+	  ASM_OUTPUT_ALIGN (file, 2);
+	  assemble_name (file, label);
+	  fputs (":\n", file);
+	  if (flag_pic)
+	    {
+	      /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
+	      rtx tem = XEXP (DECL_RTL (function), 0);
+	      /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
+		 pipeline offset is four rather than eight.  Adjust the offset
+		 accordingly.  */
+	      tem = plus_constant (GET_MODE (tem), tem,
+				   TARGET_THUMB1_ONLY ? -3 : -7);
+	      tem = gen_rtx_MINUS (GET_MODE (tem),
+				   tem,
+				   gen_rtx_SYMBOL_REF (Pmode,
+						       ggc_strdup (labelpc)));
+	      assemble_integer (tem, 4, BITS_PER_WORD, 1);
+	    }
+	  else
+	    /* Output ".word .LTHUNKn".  */
+	    assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
 
-      if (TARGET_THUMB1_ONLY && mi_delta > 255)
-	assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	  if (TARGET_THUMB1_ONLY && mi_delta > 255)
+	    assemble_integer (GEN_INT (mi_delta), 4, BITS_PER_WORD, 1);
+	}
     }
   else
     {
-- 
2.7.4


[-- Attachment #3: 0002-arm-Call-thumb1_gen_const_int-from-thumb1_movsi_insn.patch --]
[-- Type: text/x-patch, Size: 2052 bytes --]

From e5a75d819e976b8d8fb0aad80d7450c98d012cbf Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 15:18:07 +0000
Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn

thumb1_movsi_insn used the same algorithm to build a constant in asm
than thumb1_gen_const_int does in RTL. Since the previous patch added
support for asm generation in thumb1_gen_const_int, this patch calls
it from thumb1_movsi_insn to avoid duplication.

2020-10-22  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/thumb1.md (thumb1_movsi_insn): Call
	thumb1_gen_const_int.
---
 gcc/config/arm/thumb1.md | 37 ++++---------------------------------
 1 file changed, 4 insertions(+), 33 deletions(-)

diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index 2258a52..e8e8c11 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn"
 	  }
 	else if (GET_CODE (operands[1]) == CONST_INT)
 	  {
-	    int i;
-	    HOST_WIDE_INT op1 = INTVAL (operands[1]);
-	    bool mov_done_p = false;
-	    rtx ops[2];
-	    ops[0] = operands[0];
-
-	    /* Emit upper 3 bytes if needed.  */
-	    for (i = 0; i < 3; i++)
-	      {
-		int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-		if (byte)
-		  {
-		    ops[1] = GEN_INT (byte);
-		    if (mov_done_p)
-		      output_asm_insn ("adds\t%0, %1", ops);
-		    else
-		      output_asm_insn ("movs\t%0, %1", ops);
-		    mov_done_p = true;
-		  }
-
-		if (mov_done_p)
-		  output_asm_insn ("lsls\t%0, #8", ops);
-	      }
-
-	    /* Emit lower byte if needed.  */
-	    ops[1] = GEN_INT (op1 & 0xff);
-	    if (!mov_done_p)
-	      output_asm_insn ("movs\t%0, %1", ops);
-	    else if (op1 & 0xff)
-	      output_asm_insn ("adds\t%0, %1", ops);
-	    return "";
+	    thumb1_gen_const_int (operands[0], INTVAL (operands[1]));
+	    return \"\";
 	  }
-	  gcc_unreachable ();
+
+	gcc_unreachable ();
 
       case 8: return "ldr\t%0, %1";
       case 9: return "str\t%1, %0";
-- 
2.7.4


[-- Attachment #4: 0001-arm-Improve-thumb1_gen_const_int.patch --]
[-- Type: text/x-patch, Size: 6979 bytes --]

From ad092339f4172141fc0e567cec724f452d95a603 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 12:42:39 +0000
Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int

Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
thumb-1 with -mpure-code.

Use a template so that the algorithm is effectively shared, and
rely on two classes to handle the actual emission as RTL or asm.

The generated sequence is improved to handle right-shiftable and small
values with less instructions. We now generate:

128:
        movs    r0, r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17

2020-10-22  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (t1_rtl, t1_print): New classes.
	(thumb1_gen_const_int_1): New helper function.
	(thumb1_gen_const_int): Add capability to emit either RTL or asm,
	improve generated code by calling thumb1_gen_const_int_1.
---
 gcc/config/arm/arm.c | 195 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 163 insertions(+), 32 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ceeb91f..0a78344 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code)
     }
 }
 
-/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
-   Avoid generating useless code when one of the bytes is zero.  */
-void
-thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1)
-{
-  bool mov_done_p = false;
-  int i;
-
-  /* Emit upper 3 bytes if needed.  */
-  for (i = 0; i < 3; i++)
-    {
-      int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-      if (byte)
-	{
-	  emit_set_insn (op0, mov_done_p
-			 ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte))
-			 : GEN_INT (byte));
-	  mov_done_p = true;
-	}
-
-      if (mov_done_p)
-	emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8)));
-    }
-
-  /* Emit lower byte if needed.  */
-  if (!mov_done_p)
-    emit_set_insn (op0, GEN_INT (op1 & 0xff));
-  else if (op1 & 0xff)
-    emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff)));
-}
-
 /* Emit a sequence of insns to handle a large constant.
    CODE is the code of the operation required, it can be any of SET, PLUS,
    IOR, AND, XOR, MINUS;
@@ -28244,6 +28212,169 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno)
   default_internal_label (stream, prefix, labelno);
 }
 
+/* Define classes to generate code as RTL or output asm to a file.
+   Using templates then allows to use the same code to output code
+   sequences in the two formats.  */
+class t1_rtl
+{
+ public:
+  t1_rtl (rtx dst) : dst (dst) {}
+
+  void mov (int val)
+  {
+    emit_set_insn (dst, GEN_INT (val));
+  }
+
+  void add (int val)
+  {
+    emit_set_insn (dst, gen_rtx_PLUS (SImode, dst, GEN_INT (val)));
+  }
+
+  void ashift (int shift)
+  {
+    emit_set_insn (dst, gen_rtx_ASHIFT (SImode, dst, GEN_INT (shift)));
+  }
+
+ private:
+  rtx dst;
+};
+
+class t1_print
+{
+ public:
+  t1_print (FILE *f, int regno) : t_file (f), dst_regno (regno) {}
+
+  void mov (int val)
+  {
+    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
+  }
+
+  void add (int val)
+  {
+    asm_fprintf (t_file, "\tadds\tr%d, #%d\n", dst_regno, val);
+  }
+
+  void ashift (int shift)
+  {
+    asm_fprintf (t_file, "\tlsls\tr%d, #%d\n", dst_regno, shift);
+  }
+
+ private:
+  FILE *t_file;
+  int dst_regno;
+};
+
+/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
+   Avoid generating useless code when one of the bytes is zero.  */
+template <class T>
+void
+thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
+{
+  bool mov_done_p = false;
+  int val = op1;
+  int shift = 0;
+  int i;
+
+  if (val == 0)
+    {
+      dst.mov (val);
+      return;
+    }
+
+  /* In the general case, we need 7 instructions to build
+     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
+     do better if VAL is small enough, or
+     right-shiftable by a suitable amount.  If the
+     right-shift enables to encode at least one less byte,
+     it's worth it: we save a adds and a lsls at the
+     expense of a final lsls.  */
+  int final_shift = number_of_first_bit_set (val);
+
+  int leading_zeroes = clz_hwi (val);
+  int number_of_bytes_needed
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
+       / BITS_PER_UNIT) + 1;
+  int number_of_bytes_needed2
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
+       / BITS_PER_UNIT) + 1;
+
+  if (number_of_bytes_needed2 < number_of_bytes_needed)
+    val >>= final_shift;
+  else
+    final_shift = 0;
+
+  /* If we are in a very small range, we can use either a single movs
+     or movs+adds.  */
+  if ((val >= 0) && (val <= 510))
+    {
+      if (val > 255)
+	{
+	  int high = val - 255;
+
+	  dst.mov (high);
+	  dst.add (255);
+	}
+      else
+	dst.mov (val);
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+  else
+    {
+      /* General case, emit upper 3 bytes as needed.  */
+      for (i = 0; i < 3; i++)
+	{
+	  int byte = (val >> (8 * (3 - i))) & 0xff;
+
+	  if (byte)
+	    {
+	      /* We are about to emit new bits, stop accumulating a
+		 shift amount, and left-shift only if we have already
+		 emitted some upper bits.  */
+	      if (mov_done_p)
+		{
+		  dst.ashift (shift);
+		  dst.add (byte);
+		}
+	      else
+		dst.mov (byte);
+
+	      /* Stop accumulating shift amount since we've just
+		 emitted some bits.  */
+	      shift = 0;
+
+	      mov_done_p = true;
+	    }
+
+	  if (mov_done_p)
+	    shift += 8;
+	}
+
+      /* Emit lower byte.  */
+      if (!mov_done_p)
+	dst.mov (val & 0xff);
+      else
+	{
+	  dst.ashift (shift);
+	  if (val & 0xff)
+	    dst.add (val & 0xff);
+	}
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+}
+
+/* Proxy for thumb1.md, since the t1_print and t1_rtl classes are not
+   exported.  */
+void
+thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1)
+{
+  t1_rtl t (dst);
+  thumb1_gen_const_int_1 (t, op1);
+}
+
 /* Output code to add DELTA to the first argument, and then jump
    to FUNCTION.  Used for C++ multiple inheritance.  */
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-22  8:45                         ` Christophe Lyon
@ 2020-10-22 15:22                           ` Richard Earnshaw
  2020-10-26 10:52                             ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-22 15:22 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>
>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>
>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>
>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>
>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>
>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>> Consider:
>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>
>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>
>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>
>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>
>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>> this.
>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>
>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>
>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>
>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>> review easier.
>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>
>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>
>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>> same place,
>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>
>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>
>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>
>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>
>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>
>>>>>>>>
>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>> templates in arm.c yet.
>>>>>>>> I'll send an update soon.
>>>>>>>>
>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>> immediate generating operations into a single function can only be a
>>>>>>> good thing.
>>>>>>>
>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>
>>>>>>> 0x1000010:
>>>>>>>         movs    r3, #16
>>>>>>>         lsls    r3, #16
>>>>>>>         adds    r3, #1
>>>>>>>         lsls    r3, #4
>>>>>>> 0x1000011:
>>>>>>>         movs    r3, #1
>>>>>>>         lsls    r3, #24
>>>>>>>         adds    r3, #17
>>>>>>>
>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>> the first expand to
>>>>>>>
>>>>>>> 0x1000010:
>>>>>>>         movs    r3, #16
>>>>>>>         lsls    r3, #16
>>>>>>>         adds    r3, #16
>>>>>>>
>>>>>> Err, I mean to:
>>>>>>
>>>>>>
>>>>>> 0x1000010:
>>>>>>         movs    r3, #1
>>>>>>         lsls    r3, #24
>>>>>>         adds    r3, #16
>>>>>>
>>>>>> ?
>>>>>
>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>> range and need less instructions to build the higher part, then
>>>>> left-shift back.
>>>>>
>>>>> In this particular case, we'd need to realize that there are many
>>>>> zeros "inside" the constant.
>>>>>
>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>> sequence, but for 764 I now generate
>>>>> movs    r3, #2
>>>>> lsls    r3, #8
>>>>> adds    r3, #252
>>>>> instead of
>>>>> movs    r3, #191
>>>>> lsls    r3, #2
>>>>>
>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>
>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>> but we should remember we need to come back to it at some point.
>>>>
>>> Thanks, that's what I was thinking too.
>>>
>>>> There are other tricks as well, such as
>>>>
>>>>   0xffffff
>>>>
>>>> can be done as
>>>>
>>>>   0x1000000 - 1
>>>>
>>>> and
>>>>
>>>>   0xfffffd
>>>>
>>>> as
>>>>
>>>>   0x1000000 - 3
>>>>
>>>> but these can wait as well.
>>>>
>>>
>>> Didn't we already need to handle such tricks? I'm surprised this
>>> wasn't needed earlier.
>>>
>>
>> I don't think we ever worried about them.  Most of them need at least 3
>> instructions so aren't a code size saving over using a literal pool entry.
>>
> OK, this will also help when using -mslow-flash-data.
> 
> Here are updated patches, now using a template as you suggested.

Looking better, but when I try to apply this to my local tree patch 2
fails (I'm not exactly sure why, what was your baseline for these
patches?) -- that patch looks suspicious anyway, you're replacing code
that prints out assembly with code that generates RTL.

Could you also rename t1_print and t1_rtl to thumb1_const_print and
thumb1_const_rtl.  I think the names as they stand are likely to be too
generic.

R.

> 
> Thanks,
> 
> Christophe
> 
>> R.
>>
>>>
>>>>
>>>> R.
>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> R.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Christophe
>>>>>>>>
>>>>>>>>> class t1_rtl
>>>>>>>>> {
>>>>>>>>>  public:
>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> class t1_print
>>>>>>>>> {
>>>>>>>>>  public:
>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>  private:
>>>>>>>>>   FILE *t_file;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> template <class T>
>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>> {
>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>   t.ashift(f);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> // Usage...
>>>>>>>>> void f1()
>>>>>>>>> {
>>>>>>>>>   // Use the RTL expander
>>>>>>>>>   t1_rtl g;
>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> void f2()
>>>>>>>>> {
>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>> expansion will use the right version.
>>>>>>>>>
>>>>>>>>> R.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>
>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>
>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>
>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>> all possible.
>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>
>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>
>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>
>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>> modified simulator.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Christophe
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> R.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Christophe
>>>>>>>>>>>>
>>>>>>>>>>>>> R.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-22 15:22                           ` Richard Earnshaw
@ 2020-10-26 10:52                             ` Christophe Lyon
  2020-10-27 15:42                               ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-26 10:52 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

[-- Attachment #1: Type: text/plain, Size: 21383 bytes --]

On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> > On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> > <Richard.Earnshaw@foss.arm.com> wrote:
> >>
> >> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>
> >>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>
> >>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>>>>>>>> movs/adds/lsls.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>>>>>>>> with -mpure-code.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>>>>>>>> eg. pr46287-3.C
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>       gcc/
> >>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Hi Richard,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for your comments.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> There are some optimizations you can make to this code.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>>>>>>>> before, so I can write something similar indeed.
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>>>>>>>> when -mpure-code is NOT used.
> >>>>>>>>>>>> Consider:
> >>>>>>>>>>>> int f3_1 (void) { return 510; }
> >>>>>>>>>>>> int f3_2 (void) { return 511; }
> >>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>         ldr     r0, .L4
> >>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>
> >>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>>>>>>>> not having to load the constant?
> >>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>>>>>>>> account when comparing code size,
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>>>>>>>  It might happen that the entry could be shared with another use of that
> >>>>>>>>>>> literal, but in general that's rare.
> >>>>>>>>>>>
> >>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>>>>>>>
> >>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> This is also true for
> >>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>>>>>>>> this.
> >>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>         movs    r0, #1
> >>>>>>>>>>>>         lsls    r0, r0, #8
> >>>>>>>>>>>>         adds    r0, r0, #255
> >>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>
> >>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>>>>>>>
> >>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>>>>>>>> you show here are not thunks.
> >>>>>>>>>>>
> >>>>>>>>>> OK thanks for the clarification.
> >>>>>>>>>>
> >>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
> >>>>>>>>>> review easier.
> >>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>>>>>>>
> >>>>>>>>>> I kept it this way to make incremental changes easier to understand.
> >>>>>>>>>>
> >>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>>>>>>>> same place,
> >>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>>>>>>>> several cases matching your comments.
> >>>>>>>>>>
> >>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>>>>>>>> calls thumb1_gen_const_int.
> >>>>>>>>>>
> >>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
> >>>>>>>>>> thumb1_gen_const_int.
> >>>>>>>>>
> >>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>>>>>>>
> >>>>>>>>> I think we should be able to leverage the fact that the compiler can use
> >>>>>>>>> C++ now to do much better than that, for example by making that function
> >>>>>>>>> a template.  For example (and this is just a sketch):
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Indeed! I didn't think about it since there is no other use of
> >>>>>>>> templates in arm.c yet.
> >>>>>>>> I'll send an update soon.
> >>>>>>>>
> >>>>>>>> Other than that, does the approach look OK to you?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
> >>>>>>> immediate generating operations into a single function can only be a
> >>>>>>> good thing.
> >>>>>>>
> >>>>>>> Looking again at your example constant sequences, I see:
> >>>>>>>
> >>>>>>> 0x1000010:
> >>>>>>>         movs    r3, #16
> >>>>>>>         lsls    r3, #16
> >>>>>>>         adds    r3, #1
> >>>>>>>         lsls    r3, #4
> >>>>>>> 0x1000011:
> >>>>>>>         movs    r3, #1
> >>>>>>>         lsls    r3, #24
> >>>>>>>         adds    r3, #17
> >>>>>>>
> >>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
> >>>>>>> the first expand to
> >>>>>>>
> >>>>>>> 0x1000010:
> >>>>>>>         movs    r3, #16
> >>>>>>>         lsls    r3, #16
> >>>>>>>         adds    r3, #16
> >>>>>>>
> >>>>>> Err, I mean to:
> >>>>>>
> >>>>>>
> >>>>>> 0x1000010:
> >>>>>>         movs    r3, #1
> >>>>>>         lsls    r3, #24
> >>>>>>         adds    r3, #16
> >>>>>>
> >>>>>> ?
> >>>>>
> >>>>> Because I first try to right-shift the constant, hoping to reduce its
> >>>>> range and need less instructions to build the higher part, then
> >>>>> left-shift back.
> >>>>>
> >>>>> In this particular case, we'd need to realize that there are many
> >>>>> zeros "inside" the constant.
> >>>>>
> >>>>> If I remove the part that tries to reduce the range, I do get that
> >>>>> sequence, but for 764 I now generate
> >>>>> movs    r3, #2
> >>>>> lsls    r3, #8
> >>>>> adds    r3, #252
> >>>>> instead of
> >>>>> movs    r3, #191
> >>>>> lsls    r3, #2
> >>>>>
> >>>>> A possibility would be to try both approaches and keep the shortest one.
> >>>>
> >>>> Lets leave that for now, it's not important to fixing the main issue;
> >>>> but we should remember we need to come back to it at some point.
> >>>>
> >>> Thanks, that's what I was thinking too.
> >>>
> >>>> There are other tricks as well, such as
> >>>>
> >>>>   0xffffff
> >>>>
> >>>> can be done as
> >>>>
> >>>>   0x1000000 - 1
> >>>>
> >>>> and
> >>>>
> >>>>   0xfffffd
> >>>>
> >>>> as
> >>>>
> >>>>   0x1000000 - 3
> >>>>
> >>>> but these can wait as well.
> >>>>
> >>>
> >>> Didn't we already need to handle such tricks? I'm surprised this
> >>> wasn't needed earlier.
> >>>
> >>
> >> I don't think we ever worried about them.  Most of them need at least 3
> >> instructions so aren't a code size saving over using a literal pool entry.
> >>
> > OK, this will also help when using -mslow-flash-data.
> >
> > Here are updated patches, now using a template as you suggested.
>
> Looking better, but when I try to apply this to my local tree patch 2
> fails (I'm not exactly sure why, what was your baseline for these
> patches?)
I have the tree patches in this thread on top of these other two:
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html

They have gradual improvements to thumb1_movsi_insn.

> -- that patch looks suspicious anyway, you're replacing code
> that prints out assembly with code that generates RTL.
Right! I took me a while to understand how I could miss this, sorry.
That was caused by improper testing, as this part of the code
isn't used when targetting cortex-m0. I have added a testcase
for cortex-m23 which crashes with the previous version of patch 2,
and succeeds now.

> Could you also rename t1_print and t1_rtl to thumb1_const_print and
> thumb1_const_rtl.  I think the names as they stand are likely to be too
> generic.
OK, done.

How about this new version?
I'm not yet sure about the most appropriate naming for:
thumb1_gen_const_int
thumb1_gen_const_int_asm
should they be
thumb1_gen_const_int_rtl
thumb1_gen_const_int_print
to be consistent with the new classes?

Thanks,

Christophe

> R.
>
> >
> > Thanks,
> >
> > Christophe
> >
> >> R.
> >>
> >>>
> >>>>
> >>>> R.
> >>>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> R.
> >>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Christophe
> >>>>>>>>
> >>>>>>>>> class t1_rtl
> >>>>>>>>> {
> >>>>>>>>>  public:
> >>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> class t1_print
> >>>>>>>>> {
> >>>>>>>>>  public:
> >>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
> >>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>>>>>>>  private:
> >>>>>>>>>   FILE *t_file;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> template <class T>
> >>>>>>>>> void thumb1_gen_const_int(T t, int f)
> >>>>>>>>> {
> >>>>>>>>>   // Expansion of thumb1_gen_const_int ...
> >>>>>>>>>   t.ashift(f);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> // Usage...
> >>>>>>>>> void f1()
> >>>>>>>>> {
> >>>>>>>>>   // Use the RTL expander
> >>>>>>>>>   t1_rtl g;
> >>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> void f2()
> >>>>>>>>> {
> >>>>>>>>>   // Use the printf expander writing to stdout
> >>>>>>>>>   t1_print g(stdout);
> >>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
> >>>>>>>>> about which expander is being used in each instance and the template
> >>>>>>>>> expansion will use the right version.
> >>>>>>>>>
> >>>>>>>>> R.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>>>>>>>> even greater range of literals.
> >>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>>>>>>>> I'll update my patch.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>  More generally, any sequence of up to
> >>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>>>>>>>> existing literal pool fall back.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>>>>>>>> be merged into a single shift-by-16.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>>>>>>>> all possible.
> >>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>>>>>>>> that showed the problem. There are also:
> >>>>>>>>>>>> g++.dg/opt/thunk1.C
> >>>>>>>>>>>> g++.dg/ipa/pr46984.C
> >>>>>>>>>>>> g++.dg/torture/pr46287.C
> >>>>>>>>>>>> g++.dg/torture/pr45699.C
> >>>>>>>>>>>>
> >>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>>>>>>>> -mpure-code in dg-options?
> >>>>>>>>>>>
> >>>>>>>>>>> On reflection, probably not - that just makes things more complicated
> >>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
> >>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>>>>>>>> with it enabled and that should be sufficient.
> >>>>>>>>>>
> >>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>>>>>>>> modified simulator.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Christophe
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> R.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Christophe
> >>>>>>>>>>>>
> >>>>>>>>>>>>> R.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>        if (mi_delta > 255)
> >>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>> -       assemble_name (file, label);
> >>>>>>>>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>>>>>>>> +       if (target_pure_code)
> >>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>> +           bool mov_done_p = false;
> >>>>>>>>>>>>>> +           int i;
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>>>>>>>> +             {
> >>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +               if (byte)
> >>>>>>>>>>>>>> +                 {
> >>>>>>>>>>>>>> +                   if (mov_done_p)
> >>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>>>>>>>> +                   else
> >>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>>>>>>>> +                   mov_done_p = true;
> >>>>>>>>>>>>>> +                 }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +               if (mov_done_p)
> >>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>>>>>>>> +             }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>>>>>>>> +           if (!mov_done_p)
> >>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>> +           assemble_name (file, label);
> >>>>>>>>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>>>>>>>       }
> >>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>> -      assemble_name (file, label);
> >>>>>>>>>>>>>> -      fputs (":\n", file);
> >>>>>>>>>>>>>> -      if (flag_pic)
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>>>>>>>> +      them.  */
> >>>>>>>>>>>>>> +      if (!target_pure_code)
> >>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>> -          accordingly.  */
> >>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>> -                            tem,
> >>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>> -     }
> >>>>>>>>>>>>>> -      else
> >>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>> +       assemble_name (file, label);
> >>>>>>>>>>>>>> +       fputs (":\n", file);
> >>>>>>>>>>>>>> +       if (flag_pic)
> >>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>> +              accordingly.  */
> >>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>> +                                tem,
> >>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>> +     }
> >>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>    else
> >>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
>

[-- Attachment #2: 0002-arm-Call-thumb1_gen_const_int-from-thumb1_movsi_insn.patch --]
[-- Type: text/x-patch, Size: 4776 bytes --]

From 1ea1d3300f253601afbf3173b705db1186a2e6fa Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 15:18:07 +0000
Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn

thumb1_movsi_insn used the same algorithm to build a constant in asm
than thumb1_gen_const_int_1 does in RTL. Since the previous patch added
support for asm generation in thumb1_gen_const_int_1, this patch calls
it from thumb1_movsi_insn to avoid duplication.

We need to introduce a new proxy function, thumb1_gen_const_int_asm to
select the right template.

This patch also adds a new testcase as the updated alternative is only
used by thumb-1 processors that also support movt/movw.

2020-10-26  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/thumb1.md (thumb1_movsi_insn): Call
	thumb1_gen_const_int_asm.
	* config/arm/arm-protos.h (thumb1_gen_const_int_asm): Add
	prototype.
	* config/arm/arm.c (thumb1_gen_const_int_asm): New.

	gcc/testsuite/
	* gcc.target/arm/pure-code/no-literal-pool-m23.c: New.
---
 gcc/config/arm/arm-protos.h                        |  1 +
 gcc/config/arm/arm.c                               |  9 +++++-
 gcc/config/arm/thumb1.md                           | 37 +++-------------------
 .../gcc.target/arm/pure-code/no-literal-pool-m23.c | 17 ++++++++++
 4 files changed, 30 insertions(+), 34 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 0cc0ae7..1b1992b 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -75,6 +75,7 @@ extern int const_ok_for_arm (HOST_WIDE_INT);
 extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code);
 extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code);
 extern void thumb1_gen_const_int (rtx, HOST_WIDE_INT);
+extern void thumb1_gen_const_int_asm (rtx, HOST_WIDE_INT);
 extern int arm_split_constant (RTX_CODE, machine_mode, rtx,
 			       HOST_WIDE_INT, rtx, rtx, int);
 extern int legitimate_pic_operand_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 1d80c6d..366b761 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28366,7 +28366,7 @@ thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
     }
 }
 
-/* Proxy for thumb1.md, since the thumb1_const_print and
+/* Proxies for thumb1.md, since the thumb1_const_print and
    thumb1_const_rtl classes are not exported.  */
 void
 thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1)
@@ -28375,6 +28375,13 @@ thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1)
   thumb1_gen_const_int_1 (t, op1);
 }
 
+void
+thumb1_gen_const_int_asm (rtx dst, HOST_WIDE_INT op1)
+{
+  thumb1_const_print t (asm_out_file, REGNO (dst));
+  thumb1_gen_const_int_1 (t, op1);
+}
+
 /* Output code to add DELTA to the first argument, and then jump
    to FUNCTION.  Used for C++ multiple inheritance.  */
 
diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index 2258a52..2f8df67 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn"
 	  }
 	else if (GET_CODE (operands[1]) == CONST_INT)
 	  {
-	    int i;
-	    HOST_WIDE_INT op1 = INTVAL (operands[1]);
-	    bool mov_done_p = false;
-	    rtx ops[2];
-	    ops[0] = operands[0];
-
-	    /* Emit upper 3 bytes if needed.  */
-	    for (i = 0; i < 3; i++)
-	      {
-		int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-		if (byte)
-		  {
-		    ops[1] = GEN_INT (byte);
-		    if (mov_done_p)
-		      output_asm_insn ("adds\t%0, %1", ops);
-		    else
-		      output_asm_insn ("movs\t%0, %1", ops);
-		    mov_done_p = true;
-		  }
-
-		if (mov_done_p)
-		  output_asm_insn ("lsls\t%0, #8", ops);
-	      }
-
-	    /* Emit lower byte if needed.  */
-	    ops[1] = GEN_INT (op1 & 0xff);
-	    if (!mov_done_p)
-	      output_asm_insn ("movs\t%0, %1", ops);
-	    else if (op1 & 0xff)
-	      output_asm_insn ("adds\t%0, %1", ops);
-	    return "";
+	    thumb1_gen_const_int_asm (operands[0], INTVAL (operands[1]));
+	    return \"\";
 	  }
-	  gcc_unreachable ();
+
+	gcc_unreachable ();
 
       case 8: return "ldr\t%0, %1";
       case 9: return "str\t%1, %0";
diff --git a/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c
new file mode 100644
index 0000000..08d3164
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-mpure-code -mcpu=cortex-m23 -march=armv8-m.base -mthumb" } */
+
+int
+testi (int *p)
+{
+  if (*p > 0x12345678)
+    return *p-8192;
+  else
+    return *p+8192;
+}
+
+int
+testi2 ()
+{
+  return -8192;
+}
-- 
2.7.4


[-- Attachment #3: 0001-arm-Improve-thumb1_gen_const_int.patch --]
[-- Type: text/x-patch, Size: 7070 bytes --]

From 1ab6393e935dce2abae98e8f9e8442f87b22e9ed Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 12:42:39 +0000
Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int

Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
thumb-1 with -mpure-code.

Use a template so that the algorithm is effectively shared, and
rely on two classes to handle the actual emission as RTL or asm.

The generated sequence is improved to handle right-shiftable and small
values with less instructions. We now generate:

128:
        movs    r0, r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17

2020-10-26  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New
	classes.
	(thumb1_gen_const_int_1): New helper function.
	(thumb1_gen_const_int): Add capability to emit either RTL or asm,
	improve generated code by calling thumb1_gen_const_int_1.
---
 gcc/config/arm/arm.c | 195 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 163 insertions(+), 32 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ceeb91f..1d80c6d 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code)
     }
 }
 
-/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
-   Avoid generating useless code when one of the bytes is zero.  */
-void
-thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1)
-{
-  bool mov_done_p = false;
-  int i;
-
-  /* Emit upper 3 bytes if needed.  */
-  for (i = 0; i < 3; i++)
-    {
-      int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-      if (byte)
-	{
-	  emit_set_insn (op0, mov_done_p
-			 ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte))
-			 : GEN_INT (byte));
-	  mov_done_p = true;
-	}
-
-      if (mov_done_p)
-	emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8)));
-    }
-
-  /* Emit lower byte if needed.  */
-  if (!mov_done_p)
-    emit_set_insn (op0, GEN_INT (op1 & 0xff));
-  else if (op1 & 0xff)
-    emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff)));
-}
-
 /* Emit a sequence of insns to handle a large constant.
    CODE is the code of the operation required, it can be any of SET, PLUS,
    IOR, AND, XOR, MINUS;
@@ -28244,6 +28212,169 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno)
   default_internal_label (stream, prefix, labelno);
 }
 
+/* Define classes to generate code as RTL or output asm to a file.
+   Using templates then allows to use the same code to output code
+   sequences in the two formats.  */
+class thumb1_const_rtl
+{
+ public:
+  thumb1_const_rtl (rtx dst) : dst (dst) {}
+
+  void mov (int val)
+  {
+    emit_set_insn (dst, GEN_INT (val));
+  }
+
+  void add (int val)
+  {
+    emit_set_insn (dst, gen_rtx_PLUS (SImode, dst, GEN_INT (val)));
+  }
+
+  void ashift (int shift)
+  {
+    emit_set_insn (dst, gen_rtx_ASHIFT (SImode, dst, GEN_INT (shift)));
+  }
+
+ private:
+  rtx dst;
+};
+
+class thumb1_const_print
+{
+ public:
+  thumb1_const_print (FILE *f, int regno) : t_file (f), dst_regno (regno) {}
+
+  void mov (int val)
+  {
+    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
+  }
+
+  void add (int val)
+  {
+    asm_fprintf (t_file, "\tadds\tr%d, #%d\n", dst_regno, val);
+  }
+
+  void ashift (int shift)
+  {
+    asm_fprintf (t_file, "\tlsls\tr%d, #%d\n", dst_regno, shift);
+  }
+
+ private:
+  FILE *t_file;
+  int dst_regno;
+};
+
+/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
+   Avoid generating useless code when one of the bytes is zero.  */
+template <class T>
+void
+thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
+{
+  bool mov_done_p = false;
+  int val = op1;
+  int shift = 0;
+  int i;
+
+  if (val == 0)
+    {
+      dst.mov (val);
+      return;
+    }
+
+  /* In the general case, we need 7 instructions to build
+     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
+     do better if VAL is small enough, or
+     right-shiftable by a suitable amount.  If the
+     right-shift enables to encode at least one less byte,
+     it's worth it: we save a adds and a lsls at the
+     expense of a final lsls.  */
+  int final_shift = number_of_first_bit_set (val);
+
+  int leading_zeroes = clz_hwi (val);
+  int number_of_bytes_needed
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
+       / BITS_PER_UNIT) + 1;
+  int number_of_bytes_needed2
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
+       / BITS_PER_UNIT) + 1;
+
+  if (number_of_bytes_needed2 < number_of_bytes_needed)
+    val >>= final_shift;
+  else
+    final_shift = 0;
+
+  /* If we are in a very small range, we can use either a single movs
+     or movs+adds.  */
+  if ((val >= 0) && (val <= 510))
+    {
+      if (val > 255)
+	{
+	  int high = val - 255;
+
+	  dst.mov (high);
+	  dst.add (255);
+	}
+      else
+	dst.mov (val);
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+  else
+    {
+      /* General case, emit upper 3 bytes as needed.  */
+      for (i = 0; i < 3; i++)
+	{
+	  int byte = (val >> (8 * (3 - i))) & 0xff;
+
+	  if (byte)
+	    {
+	      /* We are about to emit new bits, stop accumulating a
+		 shift amount, and left-shift only if we have already
+		 emitted some upper bits.  */
+	      if (mov_done_p)
+		{
+		  dst.ashift (shift);
+		  dst.add (byte);
+		}
+	      else
+		dst.mov (byte);
+
+	      /* Stop accumulating shift amount since we've just
+		 emitted some bits.  */
+	      shift = 0;
+
+	      mov_done_p = true;
+	    }
+
+	  if (mov_done_p)
+	    shift += 8;
+	}
+
+      /* Emit lower byte.  */
+      if (!mov_done_p)
+	dst.mov (val & 0xff);
+      else
+	{
+	  dst.ashift (shift);
+	  if (val & 0xff)
+	    dst.add (val & 0xff);
+	}
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+}
+
+/* Proxy for thumb1.md, since the thumb1_const_print and
+   thumb1_const_rtl classes are not exported.  */
+void
+thumb1_gen_const_int (rtx dst, HOST_WIDE_INT op1)
+{
+  thumb1_const_rtl t (dst);
+  thumb1_gen_const_int_1 (t, op1);
+}
+
 /* Output code to add DELTA to the first argument, and then jump
    to FUNCTION.  Used for C++ multiple inheritance.  */
 
-- 
2.7.4


[-- Attachment #4: 0003-arm-Fix-multiple-inheritance-thunks-for-thumb-1-with.patch --]
[-- Type: text/x-patch, Size: 4068 bytes --]

From 54f0bb788be7cea9bb125dc921b5fc00b0d3aa05 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Tue, 29 Sep 2020 19:40:39 +0000
Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with
 -mpure-code

When -mpure-code is used, we cannot load delta from code memory (like
we do without -mpure-code).

This patch builds the value of mi_delta into r3 with a series of
movs/adds/lsls.

We also do some cleanup by not emitting the function address and delta
via .word directives at the end of the thunk since we don't use them
with -mpure-code.

No need for new testcases, this bug was already identified by:
g++.dg/ipa/pr46287-3.C
g++.dg/ipa/pr46984.C
g++.dg/opt/thunk1.C
g++.dg/torture/pr46287.C
g++.dg/torture/pr45699.C

2020-10-12  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
	do not emit function address and delta when -mpure-code is used.
---
 gcc/config/arm/arm.c | 67 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 366b761..aee3a40 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28480,9 +28480,19 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
     {
       if (mi_delta > 255)
 	{
-	  fputs ("\tldr\tr3, ", file);
-	  assemble_name (file, label);
-	  fputs ("+4\n", file);
+	  /* With -mpure-code, we cannot load MI_DELTA from the
+	     constant pool: we build it explicitly.  */
+	  if (target_pure_code)
+	    {
+	      thumb1_const_print r3 (file, 3);
+	      thumb1_gen_const_int_1 (r3, mi_delta);
+	    }
+	  else
+	    {
+	      fputs ("\tldr\tr3, ", file);
+	      assemble_name (file, label);
+	      fputs ("+4\n", file);
+	    }
 	  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
 		       mi_op, this_regno, this_regno);
 	}
@@ -28518,30 +28528,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
 	fputs ("\tpop\t{r3}\n", file);
 
       fprintf (file, "\tbx\tr12\n");
-      ASM_OUTPUT_ALIGN (file, 2);
-      assemble_name (file, label);
-      fputs (":\n", file);
-      if (flag_pic)
+
+      /* With -mpure-code, we don't need to emit literals for the
+	 function address and delta since we emitted code to build
+	 them.  */
+      if (!target_pure_code)
 	{
-	  /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
-	  rtx tem = XEXP (DECL_RTL (function), 0);
-	  /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
-	     pipeline offset is four rather than eight.  Adjust the offset
-	     accordingly.  */
-	  tem = plus_constant (GET_MODE (tem), tem,
-			       TARGET_THUMB1_ONLY ? -3 : -7);
-	  tem = gen_rtx_MINUS (GET_MODE (tem),
-			       tem,
-			       gen_rtx_SYMBOL_REF (Pmode,
-						   ggc_strdup (labelpc)));
-	  assemble_integer (tem, 4, BITS_PER_WORD, 1);
-	}
-      else
-	/* Output ".word .LTHUNKn".  */
-	assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
+	  ASM_OUTPUT_ALIGN (file, 2);
+	  assemble_name (file, label);
+	  fputs (":\n", file);
+	  if (flag_pic)
+	    {
+	      /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
+	      rtx tem = XEXP (DECL_RTL (function), 0);
+	      /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
+		 pipeline offset is four rather than eight.  Adjust the offset
+		 accordingly.  */
+	      tem = plus_constant (GET_MODE (tem), tem,
+				   TARGET_THUMB1_ONLY ? -3 : -7);
+	      tem = gen_rtx_MINUS (GET_MODE (tem),
+				   tem,
+				   gen_rtx_SYMBOL_REF (Pmode,
+						       ggc_strdup (labelpc)));
+	      assemble_integer (tem, 4, BITS_PER_WORD, 1);
+	    }
+	  else
+	    /* Output ".word .LTHUNKn".  */
+	    assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
 
-      if (TARGET_THUMB1_ONLY && mi_delta > 255)
-	assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	  if (TARGET_THUMB1_ONLY && mi_delta > 255)
+	    assemble_integer (GEN_INT (mi_delta), 4, BITS_PER_WORD, 1);
+	}
     }
   else
     {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-26 10:52                             ` Christophe Lyon
@ 2020-10-27 15:42                               ` Richard Earnshaw
  2020-10-28 17:44                                 ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-27 15:42 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>
>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>
>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>
>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>
>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>> review easier.
>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>
>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>>>> same place,
>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>
>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>
>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>
>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>> good thing.
>>>>>>>>>
>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>
>>>>>>>>> 0x1000010:
>>>>>>>>>         movs    r3, #16
>>>>>>>>>         lsls    r3, #16
>>>>>>>>>         adds    r3, #1
>>>>>>>>>         lsls    r3, #4
>>>>>>>>> 0x1000011:
>>>>>>>>>         movs    r3, #1
>>>>>>>>>         lsls    r3, #24
>>>>>>>>>         adds    r3, #17
>>>>>>>>>
>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>> the first expand to
>>>>>>>>>
>>>>>>>>> 0x1000010:
>>>>>>>>>         movs    r3, #16
>>>>>>>>>         lsls    r3, #16
>>>>>>>>>         adds    r3, #16
>>>>>>>>>
>>>>>>>> Err, I mean to:
>>>>>>>>
>>>>>>>>
>>>>>>>> 0x1000010:
>>>>>>>>         movs    r3, #1
>>>>>>>>         lsls    r3, #24
>>>>>>>>         adds    r3, #16
>>>>>>>>
>>>>>>>> ?
>>>>>>>
>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>> range and need less instructions to build the higher part, then
>>>>>>> left-shift back.
>>>>>>>
>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>> zeros "inside" the constant.
>>>>>>>
>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>> sequence, but for 764 I now generate
>>>>>>> movs    r3, #2
>>>>>>> lsls    r3, #8
>>>>>>> adds    r3, #252
>>>>>>> instead of
>>>>>>> movs    r3, #191
>>>>>>> lsls    r3, #2
>>>>>>>
>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>
>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>> but we should remember we need to come back to it at some point.
>>>>>>
>>>>> Thanks, that's what I was thinking too.
>>>>>
>>>>>> There are other tricks as well, such as
>>>>>>
>>>>>>   0xffffff
>>>>>>
>>>>>> can be done as
>>>>>>
>>>>>>   0x1000000 - 1
>>>>>>
>>>>>> and
>>>>>>
>>>>>>   0xfffffd
>>>>>>
>>>>>> as
>>>>>>
>>>>>>   0x1000000 - 3
>>>>>>
>>>>>> but these can wait as well.
>>>>>>
>>>>>
>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>> wasn't needed earlier.
>>>>>
>>>>
>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>
>>> OK, this will also help when using -mslow-flash-data.
>>>
>>> Here are updated patches, now using a template as you suggested.
>>
>> Looking better, but when I try to apply this to my local tree patch 2
>> fails (I'm not exactly sure why, what was your baseline for these
>> patches?)
> I have the tree patches in this thread on top of these other two:
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
> 
> They have gradual improvements to thumb1_movsi_insn.
> 
>> -- that patch looks suspicious anyway, you're replacing code
>> that prints out assembly with code that generates RTL.
> Right! I took me a while to understand how I could miss this, sorry.
> That was caused by improper testing, as this part of the code
> isn't used when targetting cortex-m0. I have added a testcase
> for cortex-m23 which crashes with the previous version of patch 2,
> and succeeds now.
> 
>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>> generic.
> OK, done.
> 
> How about this new version?
> I'm not yet sure about the most appropriate naming for:
> thumb1_gen_const_int
> thumb1_gen_const_int_asm
> should they be
> thumb1_gen_const_int_rtl
> thumb1_gen_const_int_print
> to be consistent with the new classes?

It would probably be better, yes.

More detailed comments below.

R.

> 
> Thanks,
> 
> Christophe
> 
>> R.
>>
>>>
>>> Thanks,
>>>
>>> Christophe
>>>
>>>> R.
>>>>
>>>>>
>>>>>>
>>>>>> R.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> R.
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Christophe
>>>>>>>>>>
>>>>>>>>>>> class t1_rtl
>>>>>>>>>>> {
>>>>>>>>>>>  public:
>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> class t1_print
>>>>>>>>>>> {
>>>>>>>>>>>  public:
>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>  private:
>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> template <class T>
>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>> {
>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> // Usage...
>>>>>>>>>>> void f1()
>>>>>>>>>>> {
>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> void f2()
>>>>>>>>>>> {
>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>
>>>>>>>>>>> R.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Christophe
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> R.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

+class thumb1_const_rtl
...
+  void mov (int val)

This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
same applies to the asm version as well.

+    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);

Should be using reg_names[dst_regno] in all cases.  In fact, you might
want to move that lookup to the constructor and just save a pointer to
the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
rather than "%d" for the immediate.

+template <class T>
+void
+thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
+{
+  bool mov_done_p = false;
+  int val = op1;

This potentially silently loses precision.  In fact, I think you really
want to use "unsigned HOST_WIDE_INT" throughout the following code, so
that the right shifts aren't undefined if dealing with negative numbers.

For safety, you should also have an assertion in here that

  op1 == trunc_int_for_mode (op1, SImode)

+  int shift = 0;
+  int i;
+
+  if (val == 0)

You can short-circuit 0..255 here for a quick exit.

+    {
+      dst.mov (val);
+      return;
+    }

Another trick: if the top nine bits of the 32-bit value are all set,
you're probably going to be better off (and certainly not worse off) by
generating -op1 and then negating the result in a final step - you can
do that via recursion.

+
+  /* In the general case, we need 7 instructions to build
+     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
+     do better if VAL is small enough, or
+     right-shiftable by a suitable amount.  If the
+     right-shift enables to encode at least one less byte,
+     it's worth it: we save a adds and a lsls at the
+     expense of a final lsls.  */
+  int final_shift = number_of_first_bit_set (val);
+
+  int leading_zeroes = clz_hwi (val);
+  int number_of_bytes_needed
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
+       / BITS_PER_UNIT) + 1;
+  int number_of_bytes_needed2
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
+       / BITS_PER_UNIT) + 1;
+
+  if (number_of_bytes_needed2 < number_of_bytes_needed)
+    val >>= final_shift;
+  else
+    final_shift = 0;
+
+  /* If we are in a very small range, we can use either a single movs
+     or movs+adds.  */
+  if ((val >= 0) && (val <= 510))

if val is made unsigned HWI as I suggest, the lower bounds test is not
needed.

+    {
+      if (val > 255)
+	{
+	  int high = val - 255;

Again, watch your types.

+
+	  dst.mov (high);
+	  dst.add (255);
+	}
+      else
+	dst.mov (val);
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+  else
+    {
+      /* General case, emit upper 3 bytes as needed.  */
+      for (i = 0; i < 3; i++)
+	{
+	  int byte = (val >> (8 * (3 - i))) & 0xff;

and here.

+
+	  if (byte)
+	    {
+	      /* We are about to emit new bits, stop accumulating a
+		 shift amount, and left-shift only if we have already
+		 emitted some upper bits.  */
+	      if (mov_done_p)
+		{
+		  dst.ashift (shift);
+		  dst.add (byte);
+		}
+	      else
+		dst.mov (byte);
+
+	      /* Stop accumulating shift amount since we've just
+		 emitted some bits.  */
+	      shift = 0;
+
+	      mov_done_p = true;
+	    }
+
+	  if (mov_done_p)
+	    shift += 8;
+	}
+
+      /* Emit lower byte.  */
+      if (!mov_done_p)
+	dst.mov (val & 0xff);
+      else
+	{
+	  dst.ashift (shift);
+	  if (val & 0xff)
+	    dst.add (val & 0xff);
+	}
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+}
+

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-27 15:42                               ` Richard Earnshaw
@ 2020-10-28 17:44                                 ` Richard Earnshaw
  2020-10-28 18:10                                   ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-28 17:44 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>
>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>
>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>
>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>>> review easier.
>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>>>>> same place,
>>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>>
>>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>>
>>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>>> good thing.
>>>>>>>>>>
>>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>>
>>>>>>>>>> 0x1000010:
>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>         adds    r3, #1
>>>>>>>>>>         lsls    r3, #4
>>>>>>>>>> 0x1000011:
>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>         adds    r3, #17
>>>>>>>>>>
>>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>>> the first expand to
>>>>>>>>>>
>>>>>>>>>> 0x1000010:
>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>
>>>>>>>>> Err, I mean to:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 0x1000010:
>>>>>>>>>         movs    r3, #1
>>>>>>>>>         lsls    r3, #24
>>>>>>>>>         adds    r3, #16
>>>>>>>>>
>>>>>>>>> ?
>>>>>>>>
>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>>> range and need less instructions to build the higher part, then
>>>>>>>> left-shift back.
>>>>>>>>
>>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>>> zeros "inside" the constant.
>>>>>>>>
>>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>>> sequence, but for 764 I now generate
>>>>>>>> movs    r3, #2
>>>>>>>> lsls    r3, #8
>>>>>>>> adds    r3, #252
>>>>>>>> instead of
>>>>>>>> movs    r3, #191
>>>>>>>> lsls    r3, #2
>>>>>>>>
>>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>>
>>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>>> but we should remember we need to come back to it at some point.
>>>>>>>
>>>>>> Thanks, that's what I was thinking too.
>>>>>>
>>>>>>> There are other tricks as well, such as
>>>>>>>
>>>>>>>   0xffffff
>>>>>>>
>>>>>>> can be done as
>>>>>>>
>>>>>>>   0x1000000 - 1
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>>   0xfffffd
>>>>>>>
>>>>>>> as
>>>>>>>
>>>>>>>   0x1000000 - 3
>>>>>>>
>>>>>>> but these can wait as well.
>>>>>>>
>>>>>>
>>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>>> wasn't needed earlier.
>>>>>>
>>>>>
>>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>>
>>>> OK, this will also help when using -mslow-flash-data.
>>>>
>>>> Here are updated patches, now using a template as you suggested.
>>>
>>> Looking better, but when I try to apply this to my local tree patch 2
>>> fails (I'm not exactly sure why, what was your baseline for these
>>> patches?)
>> I have the tree patches in this thread on top of these other two:
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
>>
>> They have gradual improvements to thumb1_movsi_insn.
>>
>>> -- that patch looks suspicious anyway, you're replacing code
>>> that prints out assembly with code that generates RTL.
>> Right! I took me a while to understand how I could miss this, sorry.
>> That was caused by improper testing, as this part of the code
>> isn't used when targetting cortex-m0. I have added a testcase
>> for cortex-m23 which crashes with the previous version of patch 2,
>> and succeeds now.
>>
>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>>> generic.
>> OK, done.
>>
>> How about this new version?
>> I'm not yet sure about the most appropriate naming for:
>> thumb1_gen_const_int
>> thumb1_gen_const_int_asm
>> should they be
>> thumb1_gen_const_int_rtl
>> thumb1_gen_const_int_print
>> to be consistent with the new classes?
> 
> It would probably be better, yes.
> 
> More detailed comments below.
> 
> R.
> 
>>
>> Thanks,
>>
>> Christophe
>>
>>> R.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>>>> R.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> R.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> R.
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Christophe
>>>>>>>>>>>
>>>>>>>>>>>> class t1_rtl
>>>>>>>>>>>> {
>>>>>>>>>>>>  public:
>>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> class t1_print
>>>>>>>>>>>> {
>>>>>>>>>>>>  public:
>>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>>  private:
>>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>>> };
>>>>>>>>>>>>
>>>>>>>>>>>> template <class T>
>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>>> {
>>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> // Usage...
>>>>>>>>>>>> void f1()
>>>>>>>>>>>> {
>>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> void f2()
>>>>>>>>>>>> {
>>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>>
>>>>>>>>>>>> R.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
> 
> +class thumb1_const_rtl
> ...
> +  void mov (int val)
> 
> This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
> same applies to the asm version as well.
> 
> +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
> 
> Should be using reg_names[dst_regno] in all cases.  In fact, you might
> want to move that lookup to the constructor and just save a pointer to
> the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED

Correction, for a (signed) HOST_WIDE_INT, this should be
HOST_WIDE_INT_PRINT_DEC.

> rather than "%d" for the immediate.
> 
> +template <class T>
> +void
> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
> +{
> +  bool mov_done_p = false;
> +  int val = op1;
> 
> This potentially silently loses precision.  In fact, I think you really
> want to use "unsigned HOST_WIDE_INT" throughout the following code, so
> that the right shifts aren't undefined if dealing with negative numbers.
> 
> For safety, you should also have an assertion in here that
> 
>   op1 == trunc_int_for_mode (op1, SImode)
> 
> +  int shift = 0;
> +  int i;
> +
> +  if (val == 0)
> 
> You can short-circuit 0..255 here for a quick exit.
> 
> +    {
> +      dst.mov (val);
> +      return;
> +    }
> 
> Another trick: if the top nine bits of the 32-bit value are all set,
> you're probably going to be better off (and certainly not worse off) by
> generating -op1 and then negating the result in a final step - you can
> do that via recursion.
> 
> +
> +  /* In the general case, we need 7 instructions to build
> +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
> +     do better if VAL is small enough, or
> +     right-shiftable by a suitable amount.  If the
> +     right-shift enables to encode at least one less byte,
> +     it's worth it: we save a adds and a lsls at the
> +     expense of a final lsls.  */
> +  int final_shift = number_of_first_bit_set (val);
> +
> +  int leading_zeroes = clz_hwi (val);
> +  int number_of_bytes_needed
> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
> +       / BITS_PER_UNIT) + 1;
> +  int number_of_bytes_needed2
> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
> +       / BITS_PER_UNIT) + 1;
> +
> +  if (number_of_bytes_needed2 < number_of_bytes_needed)
> +    val >>= final_shift;
> +  else
> +    final_shift = 0;
> +
> +  /* If we are in a very small range, we can use either a single movs
> +     or movs+adds.  */
> +  if ((val >= 0) && (val <= 510))
> 
> if val is made unsigned HWI as I suggest, the lower bounds test is not
> needed.
> 
> +    {
> +      if (val > 255)
> +	{
> +	  int high = val - 255;
> 
> Again, watch your types.
> 
> +
> +	  dst.mov (high);
> +	  dst.add (255);
> +	}
> +      else
> +	dst.mov (val);
> +
> +      if (final_shift > 0)
> +	dst.ashift (final_shift);
> +    }
> +  else
> +    {
> +      /* General case, emit upper 3 bytes as needed.  */
> +      for (i = 0; i < 3; i++)
> +	{
> +	  int byte = (val >> (8 * (3 - i))) & 0xff;
> 
> and here.
> 
> +
> +	  if (byte)
> +	    {
> +	      /* We are about to emit new bits, stop accumulating a
> +		 shift amount, and left-shift only if we have already
> +		 emitted some upper bits.  */
> +	      if (mov_done_p)
> +		{
> +		  dst.ashift (shift);
> +		  dst.add (byte);
> +		}
> +	      else
> +		dst.mov (byte);
> +
> +	      /* Stop accumulating shift amount since we've just
> +		 emitted some bits.  */
> +	      shift = 0;
> +
> +	      mov_done_p = true;
> +	    }
> +
> +	  if (mov_done_p)
> +	    shift += 8;
> +	}
> +
> +      /* Emit lower byte.  */
> +      if (!mov_done_p)
> +	dst.mov (val & 0xff);
> +      else
> +	{
> +	  dst.ashift (shift);
> +	  if (val & 0xff)
> +	    dst.add (val & 0xff);
> +	}
> +
> +      if (final_shift > 0)
> +	dst.ashift (final_shift);
> +    }
> +}
> +
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-28 17:44                                 ` Richard Earnshaw
@ 2020-10-28 18:10                                   ` Christophe Lyon
  2020-10-29 19:18                                     ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-10-28 18:10 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
> > On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> >> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> >> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>
> >>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> >>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> >>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>
> >>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>
> >>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>>>>>>>>>>> movs/adds/lsls.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>>>>>>>>>>> with -mpure-code.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>>>>>>>>>>> eg. pr46287-3.C
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>       gcc/
> >>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Richard,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for your comments.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>>>>>>>>>>> before, so I can write something similar indeed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>>>>>>>>>>> when -mpure-code is NOT used.
> >>>>>>>>>>>>>>> Consider:
> >>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
> >>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
> >>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>>>>         ldr     r0, .L4
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>>>>>>>>>>> not having to load the constant?
> >>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>>>>>>>>>>> account when comparing code size,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
> >>>>>>>>>>>>>> literal, but in general that's rare.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is also true for
> >>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>>>>>>>>>>> this.
> >>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>>>>         movs    r0, #1
> >>>>>>>>>>>>>>>         lsls    r0, r0, #8
> >>>>>>>>>>>>>>>         adds    r0, r0, #255
> >>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>>>>>>>>>>> you show here are not thunks.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> OK thanks for the clarification.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
> >>>>>>>>>>>>> review easier.
> >>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>>>>>>>>>>> same place,
> >>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>>>>>>>>>>> several cases matching your comments.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>>>>>>>>>>> calls thumb1_gen_const_int.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
> >>>>>>>>>>>>> thumb1_gen_const_int.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
> >>>>>>>>>>>> C++ now to do much better than that, for example by making that function
> >>>>>>>>>>>> a template.  For example (and this is just a sketch):
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
> >>>>>>>>>>> templates in arm.c yet.
> >>>>>>>>>>> I'll send an update soon.
> >>>>>>>>>>>
> >>>>>>>>>>> Other than that, does the approach look OK to you?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
> >>>>>>>>>> immediate generating operations into a single function can only be a
> >>>>>>>>>> good thing.
> >>>>>>>>>>
> >>>>>>>>>> Looking again at your example constant sequences, I see:
> >>>>>>>>>>
> >>>>>>>>>> 0x1000010:
> >>>>>>>>>>         movs    r3, #16
> >>>>>>>>>>         lsls    r3, #16
> >>>>>>>>>>         adds    r3, #1
> >>>>>>>>>>         lsls    r3, #4
> >>>>>>>>>> 0x1000011:
> >>>>>>>>>>         movs    r3, #1
> >>>>>>>>>>         lsls    r3, #24
> >>>>>>>>>>         adds    r3, #17
> >>>>>>>>>>
> >>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
> >>>>>>>>>> the first expand to
> >>>>>>>>>>
> >>>>>>>>>> 0x1000010:
> >>>>>>>>>>         movs    r3, #16
> >>>>>>>>>>         lsls    r3, #16
> >>>>>>>>>>         adds    r3, #16
> >>>>>>>>>>
> >>>>>>>>> Err, I mean to:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 0x1000010:
> >>>>>>>>>         movs    r3, #1
> >>>>>>>>>         lsls    r3, #24
> >>>>>>>>>         adds    r3, #16
> >>>>>>>>>
> >>>>>>>>> ?
> >>>>>>>>
> >>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
> >>>>>>>> range and need less instructions to build the higher part, then
> >>>>>>>> left-shift back.
> >>>>>>>>
> >>>>>>>> In this particular case, we'd need to realize that there are many
> >>>>>>>> zeros "inside" the constant.
> >>>>>>>>
> >>>>>>>> If I remove the part that tries to reduce the range, I do get that
> >>>>>>>> sequence, but for 764 I now generate
> >>>>>>>> movs    r3, #2
> >>>>>>>> lsls    r3, #8
> >>>>>>>> adds    r3, #252
> >>>>>>>> instead of
> >>>>>>>> movs    r3, #191
> >>>>>>>> lsls    r3, #2
> >>>>>>>>
> >>>>>>>> A possibility would be to try both approaches and keep the shortest one.
> >>>>>>>
> >>>>>>> Lets leave that for now, it's not important to fixing the main issue;
> >>>>>>> but we should remember we need to come back to it at some point.
> >>>>>>>
> >>>>>> Thanks, that's what I was thinking too.
> >>>>>>
> >>>>>>> There are other tricks as well, such as
> >>>>>>>
> >>>>>>>   0xffffff
> >>>>>>>
> >>>>>>> can be done as
> >>>>>>>
> >>>>>>>   0x1000000 - 1
> >>>>>>>
> >>>>>>> and
> >>>>>>>
> >>>>>>>   0xfffffd
> >>>>>>>
> >>>>>>> as
> >>>>>>>
> >>>>>>>   0x1000000 - 3
> >>>>>>>
> >>>>>>> but these can wait as well.
> >>>>>>>
> >>>>>>
> >>>>>> Didn't we already need to handle such tricks? I'm surprised this
> >>>>>> wasn't needed earlier.
> >>>>>>
> >>>>>
> >>>>> I don't think we ever worried about them.  Most of them need at least 3
> >>>>> instructions so aren't a code size saving over using a literal pool entry.
> >>>>>
> >>>> OK, this will also help when using -mslow-flash-data.
> >>>>
> >>>> Here are updated patches, now using a template as you suggested.
> >>>
> >>> Looking better, but when I try to apply this to my local tree patch 2
> >>> fails (I'm not exactly sure why, what was your baseline for these
> >>> patches?)
> >> I have the tree patches in this thread on top of these other two:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
> >> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
> >>
> >> They have gradual improvements to thumb1_movsi_insn.
> >>
> >>> -- that patch looks suspicious anyway, you're replacing code
> >>> that prints out assembly with code that generates RTL.
> >> Right! I took me a while to understand how I could miss this, sorry.
> >> That was caused by improper testing, as this part of the code
> >> isn't used when targetting cortex-m0. I have added a testcase
> >> for cortex-m23 which crashes with the previous version of patch 2,
> >> and succeeds now.
> >>
> >>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
> >>> thumb1_const_rtl.  I think the names as they stand are likely to be too
> >>> generic.
> >> OK, done.
> >>
> >> How about this new version?
> >> I'm not yet sure about the most appropriate naming for:
> >> thumb1_gen_const_int
> >> thumb1_gen_const_int_asm
> >> should they be
> >> thumb1_gen_const_int_rtl
> >> thumb1_gen_const_int_print
> >> to be consistent with the new classes?
> >
> > It would probably be better, yes.
> >
> > More detailed comments below.
> >
> > R.
> >
> >>
> >> Thanks,
> >>
> >> Christophe
> >>
> >>> R.
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Christophe
> >>>>
> >>>>> R.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> R.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> R.
> >>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Christophe
> >>>>>>>>>>>
> >>>>>>>>>>>> class t1_rtl
> >>>>>>>>>>>> {
> >>>>>>>>>>>>  public:
> >>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>>>>>>>>>>> };
> >>>>>>>>>>>>
> >>>>>>>>>>>> class t1_print
> >>>>>>>>>>>> {
> >>>>>>>>>>>>  public:
> >>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
> >>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>>>>>>>>>>  private:
> >>>>>>>>>>>>   FILE *t_file;
> >>>>>>>>>>>> };
> >>>>>>>>>>>>
> >>>>>>>>>>>> template <class T>
> >>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
> >>>>>>>>>>>>   t.ashift(f);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> // Usage...
> >>>>>>>>>>>> void f1()
> >>>>>>>>>>>> {
> >>>>>>>>>>>>   // Use the RTL expander
> >>>>>>>>>>>>   t1_rtl g;
> >>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> void f2()
> >>>>>>>>>>>> {
> >>>>>>>>>>>>   // Use the printf expander writing to stdout
> >>>>>>>>>>>>   t1_print g(stdout);
> >>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
> >>>>>>>>>>>> about which expander is being used in each instance and the template
> >>>>>>>>>>>> expansion will use the right version.
> >>>>>>>>>>>>
> >>>>>>>>>>>> R.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>>>>>>>>>>> even greater range of literals.
> >>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>>>>>>>>>>> I'll update my patch.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  More generally, any sequence of up to
> >>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>>>>>>>>>>> existing literal pool fall back.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>>>>>>>>>>> be merged into a single shift-by-16.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>>>>>>>>>>> all possible.
> >>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>>>>>>>>>>> that showed the problem. There are also:
> >>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
> >>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
> >>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
> >>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>>>>>>>>>>> -mpure-code in dg-options?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
> >>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
> >>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>>>>>>>>>>> with it enabled and that should be sufficient.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>>>>>>>>>>> modified simulator.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>>>        if (mi_delta > 255)
> >>>>>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>>>>> -       assemble_name (file, label);
> >>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>>>>>>>>>>> +       if (target_pure_code)
> >>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
> >>>>>>>>>>>>>>>>> +           int i;
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>>>>>>>>>>> +             {
> >>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +               if (byte)
> >>>>>>>>>>>>>>>>> +                 {
> >>>>>>>>>>>>>>>>> +                   if (mov_done_p)
> >>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>>>>>>>>>>> +                   else
> >>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>>>>>>>>>>> +                   mov_done_p = true;
> >>>>>>>>>>>>>>>>> +                 }
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +               if (mov_done_p)
> >>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>>>>>>>>>>> +             }
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>>>>>>>>>>> +           if (!mov_done_p)
> >>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>>>>> +           assemble_name (file, label);
> >>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>>>>>>>>>>       }
> >>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>>>>> -      assemble_name (file, label);
> >>>>>>>>>>>>>>>>> -      fputs (":\n", file);
> >>>>>>>>>>>>>>>>> -      if (flag_pic)
> >>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>>>>>>>>>>> +      them.  */
> >>>>>>>>>>>>>>>>> +      if (!target_pure_code)
> >>>>>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>>>>> -          accordingly.  */
> >>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>>>>> -                            tem,
> >>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> -     }
> >>>>>>>>>>>>>>>>> -      else
> >>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>>>>> +       assemble_name (file, label);
> >>>>>>>>>>>>>>>>> +       fputs (":\n", file);
> >>>>>>>>>>>>>>>>> +       if (flag_pic)
> >>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>>>>> +              accordingly.  */
> >>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>>>>> +                                tem,
> >>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>> +     }
> >>>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>>    else
> >>>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >
> > +class thumb1_const_rtl
> > ...
> > +  void mov (int val)
> >
> > This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
> > same applies to the asm version as well.
> >
> > +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
> >
> > Should be using reg_names[dst_regno] in all cases.  In fact, you might
> > want to move that lookup to the constructor and just save a pointer to
> > the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
>
> Correction, for a (signed) HOST_WIDE_INT, this should be
> HOST_WIDE_INT_PRINT_DEC.
>

Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
two classes
can have unsigned HOST_WIDE_INT parameters, and thus use
HOST_WIDE_INT_PRINT_UNSIGNED?


> > rather than "%d" for the immediate.
> >
> > +template <class T>
> > +void
> > +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
> > +{
> > +  bool mov_done_p = false;
> > +  int val = op1;
> >
> > This potentially silently loses precision.  In fact, I think you really
> > want to use "unsigned HOST_WIDE_INT" throughout the following code, so
> > that the right shifts aren't undefined if dealing with negative numbers.
> >
> > For safety, you should also have an assertion in here that
> >
> >   op1 == trunc_int_for_mode (op1, SImode)
> >
> > +  int shift = 0;
> > +  int i;
> > +
> > +  if (val == 0)
> >
> > You can short-circuit 0..255 here for a quick exit.
> >
> > +    {
> > +      dst.mov (val);
> > +      return;
> > +    }
> >
> > Another trick: if the top nine bits of the 32-bit value are all set,
> > you're probably going to be better off (and certainly not worse off) by
> > generating -op1 and then negating the result in a final step - you can
> > do that via recursion.
> >
> > +
> > +  /* In the general case, we need 7 instructions to build
> > +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
> > +     do better if VAL is small enough, or
> > +     right-shiftable by a suitable amount.  If the
> > +     right-shift enables to encode at least one less byte,
> > +     it's worth it: we save a adds and a lsls at the
> > +     expense of a final lsls.  */
> > +  int final_shift = number_of_first_bit_set (val);
> > +
> > +  int leading_zeroes = clz_hwi (val);
> > +  int number_of_bytes_needed
> > +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
> > +       / BITS_PER_UNIT) + 1;
> > +  int number_of_bytes_needed2
> > +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
> > +       / BITS_PER_UNIT) + 1;
> > +
> > +  if (number_of_bytes_needed2 < number_of_bytes_needed)
> > +    val >>= final_shift;
> > +  else
> > +    final_shift = 0;
> > +
> > +  /* If we are in a very small range, we can use either a single movs
> > +     or movs+adds.  */
> > +  if ((val >= 0) && (val <= 510))
> >
> > if val is made unsigned HWI as I suggest, the lower bounds test is not
> > needed.
> >
> > +    {
> > +      if (val > 255)
> > +     {
> > +       int high = val - 255;
> >
> > Again, watch your types.
> >
> > +
> > +       dst.mov (high);
> > +       dst.add (255);
> > +     }
> > +      else
> > +     dst.mov (val);
> > +
> > +      if (final_shift > 0)
> > +     dst.ashift (final_shift);
> > +    }
> > +  else
> > +    {
> > +      /* General case, emit upper 3 bytes as needed.  */
> > +      for (i = 0; i < 3; i++)
> > +     {
> > +       int byte = (val >> (8 * (3 - i))) & 0xff;
> >
> > and here.
> >
> > +
> > +       if (byte)
> > +         {
> > +           /* We are about to emit new bits, stop accumulating a
> > +              shift amount, and left-shift only if we have already
> > +              emitted some upper bits.  */
> > +           if (mov_done_p)
> > +             {
> > +               dst.ashift (shift);
> > +               dst.add (byte);
> > +             }
> > +           else
> > +             dst.mov (byte);
> > +
> > +           /* Stop accumulating shift amount since we've just
> > +              emitted some bits.  */
> > +           shift = 0;
> > +
> > +           mov_done_p = true;
> > +         }
> > +
> > +       if (mov_done_p)
> > +         shift += 8;
> > +     }
> > +
> > +      /* Emit lower byte.  */
> > +      if (!mov_done_p)
> > +     dst.mov (val & 0xff);
> > +      else
> > +     {
> > +       dst.ashift (shift);
> > +       if (val & 0xff)
> > +         dst.add (val & 0xff);
> > +     }
> > +
> > +      if (final_shift > 0)
> > +     dst.ashift (final_shift);
> > +    }
> > +}
> > +
> >
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-28 18:10                                   ` Christophe Lyon
@ 2020-10-29 19:18                                     ` Richard Earnshaw
  2020-10-30 12:49                                       ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-29 19:18 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>
>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>
>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>>>>> review easier.
>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>>>>>>> same place,
>>>>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>>>>> good thing.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>>>>
>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>         adds    r3, #1
>>>>>>>>>>>>         lsls    r3, #4
>>>>>>>>>>>> 0x1000011:
>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>         adds    r3, #17
>>>>>>>>>>>>
>>>>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>>>>> the first expand to
>>>>>>>>>>>>
>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>
>>>>>>>>>>> Err, I mean to:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>
>>>>>>>>>>> ?
>>>>>>>>>>
>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>>>>> range and need less instructions to build the higher part, then
>>>>>>>>>> left-shift back.
>>>>>>>>>>
>>>>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>>>>> zeros "inside" the constant.
>>>>>>>>>>
>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>>>>> sequence, but for 764 I now generate
>>>>>>>>>> movs    r3, #2
>>>>>>>>>> lsls    r3, #8
>>>>>>>>>> adds    r3, #252
>>>>>>>>>> instead of
>>>>>>>>>> movs    r3, #191
>>>>>>>>>> lsls    r3, #2
>>>>>>>>>>
>>>>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>>>>
>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>>>>> but we should remember we need to come back to it at some point.
>>>>>>>>>
>>>>>>>> Thanks, that's what I was thinking too.
>>>>>>>>
>>>>>>>>> There are other tricks as well, such as
>>>>>>>>>
>>>>>>>>>   0xffffff
>>>>>>>>>
>>>>>>>>> can be done as
>>>>>>>>>
>>>>>>>>>   0x1000000 - 1
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>>   0xfffffd
>>>>>>>>>
>>>>>>>>> as
>>>>>>>>>
>>>>>>>>>   0x1000000 - 3
>>>>>>>>>
>>>>>>>>> but these can wait as well.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>>>>> wasn't needed earlier.
>>>>>>>>
>>>>>>>
>>>>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>>>>
>>>>>> OK, this will also help when using -mslow-flash-data.
>>>>>>
>>>>>> Here are updated patches, now using a template as you suggested.
>>>>>
>>>>> Looking better, but when I try to apply this to my local tree patch 2
>>>>> fails (I'm not exactly sure why, what was your baseline for these
>>>>> patches?)
>>>> I have the tree patches in this thread on top of these other two:
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
>>>>
>>>> They have gradual improvements to thumb1_movsi_insn.
>>>>
>>>>> -- that patch looks suspicious anyway, you're replacing code
>>>>> that prints out assembly with code that generates RTL.
>>>> Right! I took me a while to understand how I could miss this, sorry.
>>>> That was caused by improper testing, as this part of the code
>>>> isn't used when targetting cortex-m0. I have added a testcase
>>>> for cortex-m23 which crashes with the previous version of patch 2,
>>>> and succeeds now.
>>>>
>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>>>>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>>>>> generic.
>>>> OK, done.
>>>>
>>>> How about this new version?
>>>> I'm not yet sure about the most appropriate naming for:
>>>> thumb1_gen_const_int
>>>> thumb1_gen_const_int_asm
>>>> should they be
>>>> thumb1_gen_const_int_rtl
>>>> thumb1_gen_const_int_print
>>>> to be consistent with the new classes?
>>>
>>> It would probably be better, yes.
>>>
>>> More detailed comments below.
>>>
>>> R.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Christophe
>>>>
>>>>> R.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Christophe
>>>>>>
>>>>>>> R.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> R.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> R.
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>
>>>>>>>>>>>>>> class t1_rtl
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> class t1_print
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>>>>  private:
>>>>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> template <class T>
>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // Usage...
>>>>>>>>>>>>>> void f1()
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> void f2()
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>> +class thumb1_const_rtl
>>> ...
>>> +  void mov (int val)
>>>
>>> This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
>>> same applies to the asm version as well.
>>>
>>> +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
>>>
>>> Should be using reg_names[dst_regno] in all cases.  In fact, you might
>>> want to move that lookup to the constructor and just save a pointer to
>>> the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
>>
>> Correction, for a (signed) HOST_WIDE_INT, this should be
>> HOST_WIDE_INT_PRINT_DEC.
>>
> 
> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
> two classes
> can have unsigned HOST_WIDE_INT parameters, and thus use
> HOST_WIDE_INT_PRINT_UNSIGNED?
> 

It's generally safer to print it as a signed value.  We probably won't
have cases where the top bit of a 32-bit word are set here; but if you
do, and the value is unsigned, you end up with 16 digit hex numbers
rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT
is at least 64 bits in size.

R.

> 
>>> rather than "%d" for the immediate.
>>>
>>> +template <class T>
>>> +void
>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
>>> +{
>>> +  bool mov_done_p = false;
>>> +  int val = op1;
>>>
>>> This potentially silently loses precision.  In fact, I think you really
>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so
>>> that the right shifts aren't undefined if dealing with negative numbers.
>>>
>>> For safety, you should also have an assertion in here that
>>>
>>>   op1 == trunc_int_for_mode (op1, SImode)
>>>
>>> +  int shift = 0;
>>> +  int i;
>>> +
>>> +  if (val == 0)
>>>
>>> You can short-circuit 0..255 here for a quick exit.
>>>
>>> +    {
>>> +      dst.mov (val);
>>> +      return;
>>> +    }
>>>
>>> Another trick: if the top nine bits of the 32-bit value are all set,
>>> you're probably going to be better off (and certainly not worse off) by
>>> generating -op1 and then negating the result in a final step - you can
>>> do that via recursion.
>>>
>>> +
>>> +  /* In the general case, we need 7 instructions to build
>>> +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
>>> +     do better if VAL is small enough, or
>>> +     right-shiftable by a suitable amount.  If the
>>> +     right-shift enables to encode at least one less byte,
>>> +     it's worth it: we save a adds and a lsls at the
>>> +     expense of a final lsls.  */
>>> +  int final_shift = number_of_first_bit_set (val);
>>> +
>>> +  int leading_zeroes = clz_hwi (val);
>>> +  int number_of_bytes_needed
>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
>>> +       / BITS_PER_UNIT) + 1;
>>> +  int number_of_bytes_needed2
>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
>>> +       / BITS_PER_UNIT) + 1;
>>> +
>>> +  if (number_of_bytes_needed2 < number_of_bytes_needed)
>>> +    val >>= final_shift;
>>> +  else
>>> +    final_shift = 0;
>>> +
>>> +  /* If we are in a very small range, we can use either a single movs
>>> +     or movs+adds.  */
>>> +  if ((val >= 0) && (val <= 510))
>>>
>>> if val is made unsigned HWI as I suggest, the lower bounds test is not
>>> needed.
>>>
>>> +    {
>>> +      if (val > 255)
>>> +     {
>>> +       int high = val - 255;
>>>
>>> Again, watch your types.
>>>
>>> +
>>> +       dst.mov (high);
>>> +       dst.add (255);
>>> +     }
>>> +      else
>>> +     dst.mov (val);
>>> +
>>> +      if (final_shift > 0)
>>> +     dst.ashift (final_shift);
>>> +    }
>>> +  else
>>> +    {
>>> +      /* General case, emit upper 3 bytes as needed.  */
>>> +      for (i = 0; i < 3; i++)
>>> +     {
>>> +       int byte = (val >> (8 * (3 - i))) & 0xff;
>>>
>>> and here.
>>>
>>> +
>>> +       if (byte)
>>> +         {
>>> +           /* We are about to emit new bits, stop accumulating a
>>> +              shift amount, and left-shift only if we have already
>>> +              emitted some upper bits.  */
>>> +           if (mov_done_p)
>>> +             {
>>> +               dst.ashift (shift);
>>> +               dst.add (byte);
>>> +             }
>>> +           else
>>> +             dst.mov (byte);
>>> +
>>> +           /* Stop accumulating shift amount since we've just
>>> +              emitted some bits.  */
>>> +           shift = 0;
>>> +
>>> +           mov_done_p = true;
>>> +         }
>>> +
>>> +       if (mov_done_p)
>>> +         shift += 8;
>>> +     }
>>> +
>>> +      /* Emit lower byte.  */
>>> +      if (!mov_done_p)
>>> +     dst.mov (val & 0xff);
>>> +      else
>>> +     {
>>> +       dst.ashift (shift);
>>> +       if (val & 0xff)
>>> +         dst.add (val & 0xff);
>>> +     }
>>> +
>>> +      if (final_shift > 0)
>>> +     dst.ashift (final_shift);
>>> +    }
>>> +}
>>> +
>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-29 19:18                                     ` Richard Earnshaw
@ 2020-10-30 12:49                                       ` Richard Earnshaw
  2020-11-02 10:24                                         ` Christophe Lyon
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Earnshaw @ 2020-10-30 12:49 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
> On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
>> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>
>>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>
>>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>
>>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>>>>>> review easier.
>>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>>>>>>>> same place,
>>>>>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>>>>>> good thing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>>         adds    r3, #1
>>>>>>>>>>>>>         lsls    r3, #4
>>>>>>>>>>>>> 0x1000011:
>>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>>         adds    r3, #17
>>>>>>>>>>>>>
>>>>>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>>>>>> the first expand to
>>>>>>>>>>>>>
>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>>
>>>>>>>>>>>> Err, I mean to:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>
>>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>>>>>> range and need less instructions to build the higher part, then
>>>>>>>>>>> left-shift back.
>>>>>>>>>>>
>>>>>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>>>>>> zeros "inside" the constant.
>>>>>>>>>>>
>>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>>>>>> sequence, but for 764 I now generate
>>>>>>>>>>> movs    r3, #2
>>>>>>>>>>> lsls    r3, #8
>>>>>>>>>>> adds    r3, #252
>>>>>>>>>>> instead of
>>>>>>>>>>> movs    r3, #191
>>>>>>>>>>> lsls    r3, #2
>>>>>>>>>>>
>>>>>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>>>>>
>>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>>>>>> but we should remember we need to come back to it at some point.
>>>>>>>>>>
>>>>>>>>> Thanks, that's what I was thinking too.
>>>>>>>>>
>>>>>>>>>> There are other tricks as well, such as
>>>>>>>>>>
>>>>>>>>>>   0xffffff
>>>>>>>>>>
>>>>>>>>>> can be done as
>>>>>>>>>>
>>>>>>>>>>   0x1000000 - 1
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>>   0xfffffd
>>>>>>>>>>
>>>>>>>>>> as
>>>>>>>>>>
>>>>>>>>>>   0x1000000 - 3
>>>>>>>>>>
>>>>>>>>>> but these can wait as well.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>>>>>> wasn't needed earlier.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>>>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>>>>>
>>>>>>> OK, this will also help when using -mslow-flash-data.
>>>>>>>
>>>>>>> Here are updated patches, now using a template as you suggested.
>>>>>>
>>>>>> Looking better, but when I try to apply this to my local tree patch 2
>>>>>> fails (I'm not exactly sure why, what was your baseline for these
>>>>>> patches?)
>>>>> I have the tree patches in this thread on top of these other two:
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
>>>>>
>>>>> They have gradual improvements to thumb1_movsi_insn.
>>>>>
>>>>>> -- that patch looks suspicious anyway, you're replacing code
>>>>>> that prints out assembly with code that generates RTL.
>>>>> Right! I took me a while to understand how I could miss this, sorry.
>>>>> That was caused by improper testing, as this part of the code
>>>>> isn't used when targetting cortex-m0. I have added a testcase
>>>>> for cortex-m23 which crashes with the previous version of patch 2,
>>>>> and succeeds now.
>>>>>
>>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>>>>>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>>>>>> generic.
>>>>> OK, done.
>>>>>
>>>>> How about this new version?
>>>>> I'm not yet sure about the most appropriate naming for:
>>>>> thumb1_gen_const_int
>>>>> thumb1_gen_const_int_asm
>>>>> should they be
>>>>> thumb1_gen_const_int_rtl
>>>>> thumb1_gen_const_int_print
>>>>> to be consistent with the new classes?
>>>>
>>>> It would probably be better, yes.
>>>>
>>>> More detailed comments below.
>>>>
>>>> R.
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Christophe
>>>>>
>>>>>> R.
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Christophe
>>>>>>>
>>>>>>>> R.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> R.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> R.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> class t1_rtl
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> class t1_print
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>>>>>  private:
>>>>>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> template <class T>
>>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // Usage...
>>>>>>>>>>>>>>> void f1()
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> void f2()
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>> +class thumb1_const_rtl
>>>> ...
>>>> +  void mov (int val)
>>>>
>>>> This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
>>>> same applies to the asm version as well.
>>>>
>>>> +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
>>>>
>>>> Should be using reg_names[dst_regno] in all cases.  In fact, you might
>>>> want to move that lookup to the constructor and just save a pointer to
>>>> the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
>>>
>>> Correction, for a (signed) HOST_WIDE_INT, this should be
>>> HOST_WIDE_INT_PRINT_DEC.
>>>
>>
>> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
>> two classes
>> can have unsigned HOST_WIDE_INT parameters, and thus use
>> HOST_WIDE_INT_PRINT_UNSIGNED?
>>
> 
> It's generally safer to print it as a signed value.  We probably won't
> have cases where the top bit of a 32-bit word are set here; but if you
> do, and the value is unsigned, you end up with 16 digit hex numbers
> rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT
> is at least 64 bits in size.
> 

I don't mean hex numbers, of course, just very large decimal numbers.
But the point is still the same.

R.

> R.
> 
>>
>>>> rather than "%d" for the immediate.
>>>>
>>>> +template <class T>
>>>> +void
>>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
>>>> +{
>>>> +  bool mov_done_p = false;
>>>> +  int val = op1;
>>>>
>>>> This potentially silently loses precision.  In fact, I think you really
>>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so
>>>> that the right shifts aren't undefined if dealing with negative numbers.
>>>>
>>>> For safety, you should also have an assertion in here that
>>>>
>>>>   op1 == trunc_int_for_mode (op1, SImode)
>>>>
>>>> +  int shift = 0;
>>>> +  int i;
>>>> +
>>>> +  if (val == 0)
>>>>
>>>> You can short-circuit 0..255 here for a quick exit.
>>>>
>>>> +    {
>>>> +      dst.mov (val);
>>>> +      return;
>>>> +    }
>>>>
>>>> Another trick: if the top nine bits of the 32-bit value are all set,
>>>> you're probably going to be better off (and certainly not worse off) by
>>>> generating -op1 and then negating the result in a final step - you can
>>>> do that via recursion.
>>>>
>>>> +
>>>> +  /* In the general case, we need 7 instructions to build
>>>> +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
>>>> +     do better if VAL is small enough, or
>>>> +     right-shiftable by a suitable amount.  If the
>>>> +     right-shift enables to encode at least one less byte,
>>>> +     it's worth it: we save a adds and a lsls at the
>>>> +     expense of a final lsls.  */
>>>> +  int final_shift = number_of_first_bit_set (val);
>>>> +
>>>> +  int leading_zeroes = clz_hwi (val);
>>>> +  int number_of_bytes_needed
>>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
>>>> +       / BITS_PER_UNIT) + 1;
>>>> +  int number_of_bytes_needed2
>>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
>>>> +       / BITS_PER_UNIT) + 1;
>>>> +
>>>> +  if (number_of_bytes_needed2 < number_of_bytes_needed)
>>>> +    val >>= final_shift;
>>>> +  else
>>>> +    final_shift = 0;
>>>> +
>>>> +  /* If we are in a very small range, we can use either a single movs
>>>> +     or movs+adds.  */
>>>> +  if ((val >= 0) && (val <= 510))
>>>>
>>>> if val is made unsigned HWI as I suggest, the lower bounds test is not
>>>> needed.
>>>>
>>>> +    {
>>>> +      if (val > 255)
>>>> +     {
>>>> +       int high = val - 255;
>>>>
>>>> Again, watch your types.
>>>>
>>>> +
>>>> +       dst.mov (high);
>>>> +       dst.add (255);
>>>> +     }
>>>> +      else
>>>> +     dst.mov (val);
>>>> +
>>>> +      if (final_shift > 0)
>>>> +     dst.ashift (final_shift);
>>>> +    }
>>>> +  else
>>>> +    {
>>>> +      /* General case, emit upper 3 bytes as needed.  */
>>>> +      for (i = 0; i < 3; i++)
>>>> +     {
>>>> +       int byte = (val >> (8 * (3 - i))) & 0xff;
>>>>
>>>> and here.
>>>>
>>>> +
>>>> +       if (byte)
>>>> +         {
>>>> +           /* We are about to emit new bits, stop accumulating a
>>>> +              shift amount, and left-shift only if we have already
>>>> +              emitted some upper bits.  */
>>>> +           if (mov_done_p)
>>>> +             {
>>>> +               dst.ashift (shift);
>>>> +               dst.add (byte);
>>>> +             }
>>>> +           else
>>>> +             dst.mov (byte);
>>>> +
>>>> +           /* Stop accumulating shift amount since we've just
>>>> +              emitted some bits.  */
>>>> +           shift = 0;
>>>> +
>>>> +           mov_done_p = true;
>>>> +         }
>>>> +
>>>> +       if (mov_done_p)
>>>> +         shift += 8;
>>>> +     }
>>>> +
>>>> +      /* Emit lower byte.  */
>>>> +      if (!mov_done_p)
>>>> +     dst.mov (val & 0xff);
>>>> +      else
>>>> +     {
>>>> +       dst.ashift (shift);
>>>> +       if (val & 0xff)
>>>> +         dst.add (val & 0xff);
>>>> +     }
>>>> +
>>>> +      if (final_shift > 0)
>>>> +     dst.ashift (final_shift);
>>>> +    }
>>>> +}
>>>> +
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-10-30 12:49                                       ` Richard Earnshaw
@ 2020-11-02 10:24                                         ` Christophe Lyon
  2020-11-02 14:28                                           ` Richard Earnshaw
  0 siblings, 1 reply; 24+ messages in thread
From: Christophe Lyon @ 2020-11-02 10:24 UTC (permalink / raw)
  To: Richard Earnshaw; +Cc: gcc Patches

[-- Attachment #1: Type: text/plain, Size: 30699 bytes --]

Hi,

On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw
<Richard.Earnshaw@foss.arm.com> wrote:
>
> On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
> > On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
> >> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
> >> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>
> >>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
> >>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
> >>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
> >>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>
> >>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> >>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> >>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
> >>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
> >>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
> >>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
> >>>>>>>>>>>>>>>>>>>> movs/adds/lsls.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
> >>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
> >>>>>>>>>>>>>>>>>>>> with -mpure-code.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
> >>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>       gcc/
> >>>>>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
> >>>>>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Richard,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for your comments.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
> >>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
> >>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
> >>>>>>>>>>>>>>>>>> before, so I can write something similar indeed.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
> >>>>>>>>>>>>>>>>>> when -mpure-code is NOT used.
> >>>>>>>>>>>>>>>>>> Consider:
> >>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
> >>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
> >>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
> >>>>>>>>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>>>>>>>         ldr     r0, .L4
> >>>>>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
> >>>>>>>>>>>>>>>>>> not having to load the constant?
> >>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
> >>>>>>>>>>>>>>>>>> account when comparing code size,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
> >>>>>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
> >>>>>>>>>>>>>>>>> literal, but in general that's rare.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
> >>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
> >>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
> >>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
> >>>>>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
> >>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This is also true for
> >>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
> >>>>>>>>>>>>>>>>>>> this.
> >>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
> >>>>>>>>>>>>>>>>>> f3_1:
> >>>>>>>>>>>>>>>>>>         movs    r0, #255
> >>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
> >>>>>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>>>> f3_2:
> >>>>>>>>>>>>>>>>>>         movs    r0, #1
> >>>>>>>>>>>>>>>>>>         lsls    r0, r0, #8
> >>>>>>>>>>>>>>>>>>         adds    r0, r0, #255
> >>>>>>>>>>>>>>>>>>         bx      lr
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
> >>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
> >>>>>>>>>>>>>>>>> you show here are not thunks.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> OK thanks for the clarification.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
> >>>>>>>>>>>>>>>> review easier.
> >>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
> >>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
> >>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
> >>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
> >>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
> >>>>>>>>>>>>>>>> same place,
> >>>>>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
> >>>>>>>>>>>>>>>> several cases matching your comments.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
> >>>>>>>>>>>>>>>> calls thumb1_gen_const_int.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
> >>>>>>>>>>>>>>>> thumb1_gen_const_int.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
> >>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that function
> >>>>>>>>>>>>>>> a template.  For example (and this is just a sketch):
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
> >>>>>>>>>>>>>> templates in arm.c yet.
> >>>>>>>>>>>>>> I'll send an update soon.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Other than that, does the approach look OK to you?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
> >>>>>>>>>>>>> immediate generating operations into a single function can only be a
> >>>>>>>>>>>>> good thing.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Looking again at your example constant sequences, I see:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0x1000010:
> >>>>>>>>>>>>>         movs    r3, #16
> >>>>>>>>>>>>>         lsls    r3, #16
> >>>>>>>>>>>>>         adds    r3, #1
> >>>>>>>>>>>>>         lsls    r3, #4
> >>>>>>>>>>>>> 0x1000011:
> >>>>>>>>>>>>>         movs    r3, #1
> >>>>>>>>>>>>>         lsls    r3, #24
> >>>>>>>>>>>>>         adds    r3, #17
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
> >>>>>>>>>>>>> the first expand to
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 0x1000010:
> >>>>>>>>>>>>>         movs    r3, #16
> >>>>>>>>>>>>>         lsls    r3, #16
> >>>>>>>>>>>>>         adds    r3, #16
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Err, I mean to:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 0x1000010:
> >>>>>>>>>>>>         movs    r3, #1
> >>>>>>>>>>>>         lsls    r3, #24
> >>>>>>>>>>>>         adds    r3, #16
> >>>>>>>>>>>>
> >>>>>>>>>>>> ?
> >>>>>>>>>>>
> >>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
> >>>>>>>>>>> range and need less instructions to build the higher part, then
> >>>>>>>>>>> left-shift back.
> >>>>>>>>>>>
> >>>>>>>>>>> In this particular case, we'd need to realize that there are many
> >>>>>>>>>>> zeros "inside" the constant.
> >>>>>>>>>>>
> >>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that
> >>>>>>>>>>> sequence, but for 764 I now generate
> >>>>>>>>>>> movs    r3, #2
> >>>>>>>>>>> lsls    r3, #8
> >>>>>>>>>>> adds    r3, #252
> >>>>>>>>>>> instead of
> >>>>>>>>>>> movs    r3, #191
> >>>>>>>>>>> lsls    r3, #2
> >>>>>>>>>>>
> >>>>>>>>>>> A possibility would be to try both approaches and keep the shortest one.
> >>>>>>>>>>
> >>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue;
> >>>>>>>>>> but we should remember we need to come back to it at some point.
> >>>>>>>>>>
> >>>>>>>>> Thanks, that's what I was thinking too.
> >>>>>>>>>
> >>>>>>>>>> There are other tricks as well, such as
> >>>>>>>>>>
> >>>>>>>>>>   0xffffff
> >>>>>>>>>>
> >>>>>>>>>> can be done as
> >>>>>>>>>>
> >>>>>>>>>>   0x1000000 - 1
> >>>>>>>>>>
> >>>>>>>>>> and
> >>>>>>>>>>
> >>>>>>>>>>   0xfffffd
> >>>>>>>>>>
> >>>>>>>>>> as
> >>>>>>>>>>
> >>>>>>>>>>   0x1000000 - 3
> >>>>>>>>>>
> >>>>>>>>>> but these can wait as well.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this
> >>>>>>>>> wasn't needed earlier.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't think we ever worried about them.  Most of them need at least 3
> >>>>>>>> instructions so aren't a code size saving over using a literal pool entry.
> >>>>>>>>
> >>>>>>> OK, this will also help when using -mslow-flash-data.
> >>>>>>>
> >>>>>>> Here are updated patches, now using a template as you suggested.
> >>>>>>
> >>>>>> Looking better, but when I try to apply this to my local tree patch 2
> >>>>>> fails (I'm not exactly sure why, what was your baseline for these
> >>>>>> patches?)
> >>>>> I have the tree patches in this thread on top of these other two:
> >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
> >>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
> >>>>>
> >>>>> They have gradual improvements to thumb1_movsi_insn.
> >>>>>
> >>>>>> -- that patch looks suspicious anyway, you're replacing code
> >>>>>> that prints out assembly with code that generates RTL.
> >>>>> Right! I took me a while to understand how I could miss this, sorry.
> >>>>> That was caused by improper testing, as this part of the code
> >>>>> isn't used when targetting cortex-m0. I have added a testcase
> >>>>> for cortex-m23 which crashes with the previous version of patch 2,
> >>>>> and succeeds now.
> >>>>>
> >>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
> >>>>>> thumb1_const_rtl.  I think the names as they stand are likely to be too
> >>>>>> generic.
> >>>>> OK, done.
> >>>>>
> >>>>> How about this new version?
> >>>>> I'm not yet sure about the most appropriate naming for:
> >>>>> thumb1_gen_const_int
> >>>>> thumb1_gen_const_int_asm
> >>>>> should they be
> >>>>> thumb1_gen_const_int_rtl
> >>>>> thumb1_gen_const_int_print
> >>>>> to be consistent with the new classes?
> >>>>
> >>>> It would probably be better, yes.
> >>>>
> >>>> More detailed comments below.
> >>>>
> >>>> R.
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Christophe
> >>>>>
> >>>>>> R.
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Christophe
> >>>>>>>
> >>>>>>>> R.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> R.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> R.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> class t1_rtl
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>  public:
> >>>>>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
> >>>>>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
> >>>>>>>>>>>>>>> };
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> class t1_print
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>  public:
> >>>>>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
> >>>>>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
> >>>>>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
> >>>>>>>>>>>>>>>  private:
> >>>>>>>>>>>>>>>   FILE *t_file;
> >>>>>>>>>>>>>>> };
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> template <class T>
> >>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
> >>>>>>>>>>>>>>>   t.ashift(f);
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> // Usage...
> >>>>>>>>>>>>>>> void f1()
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>   // Use the RTL expander
> >>>>>>>>>>>>>>>   t1_rtl g;
> >>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> void f2()
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>   // Use the printf expander writing to stdout
> >>>>>>>>>>>>>>>   t1_print g(stdout);
> >>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
> >>>>>>>>>>>>>>> about which expander is being used in each instance and the template
> >>>>>>>>>>>>>>> expansion will use the right version.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
> >>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
> >>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
> >>>>>>>>>>>>>>>>>>> even greater range of literals.
> >>>>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
> >>>>>>>>>>>>>>>>>> I'll update my patch.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>  More generally, any sequence of up to
> >>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
> >>>>>>>>>>>>>>>>>>> existing literal pool fall back.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
> >>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
> >>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
> >>>>>>>>>>>>>>>>>>> all possible.
> >>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
> >>>>>>>>>>>>>>>>>> that showed the problem. There are also:
> >>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
> >>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
> >>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
> >>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
> >>>>>>>>>>>>>>>>>> -mpure-code in dg-options?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
> >>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
> >>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
> >>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
> >>>>>>>>>>>>>>>>> with it enabled and that should be sufficient.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
> >>>>>>>>>>>>>>>> modified simulator.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Christophe
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> R.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
> >>>>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
> >>>>>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
> >>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
> >>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>>>>>>        if (mi_delta > 255)
> >>>>>>>>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>>>>>>>> -       assemble_name (file, label);
> >>>>>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
> >>>>>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
> >>>>>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
> >>>>>>>>>>>>>>>>>>>> +       if (target_pure_code)
> >>>>>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
> >>>>>>>>>>>>>>>>>>>> +           int i;
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
> >>>>>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
> >>>>>>>>>>>>>>>>>>>> +             {
> >>>>>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +               if (byte)
> >>>>>>>>>>>>>>>>>>>> +                 {
> >>>>>>>>>>>>>>>>>>>> +                   if (mov_done_p)
> >>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
> >>>>>>>>>>>>>>>>>>>> +                   else
> >>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
> >>>>>>>>>>>>>>>>>>>> +                   mov_done_p = true;
> >>>>>>>>>>>>>>>>>>>> +                 }
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +               if (mov_done_p)
> >>>>>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
> >>>>>>>>>>>>>>>>>>>> +             }
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
> >>>>>>>>>>>>>>>>>>>> +           if (!mov_done_p)
> >>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
> >>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
> >>>>>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
> >>>>>>>>>>>>>>>>>>>> +           assemble_name (file, label);
> >>>>>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
> >>>>>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
> >>>>>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
> >>>>>>>>>>>>>>>>>>>>       }
> >>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
> >>>>>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
> >>>>>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>>>>>>>> -      assemble_name (file, label);
> >>>>>>>>>>>>>>>>>>>> -      fputs (":\n", file);
> >>>>>>>>>>>>>>>>>>>> -      if (flag_pic)
> >>>>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
> >>>>>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
> >>>>>>>>>>>>>>>>>>>> +      them.  */
> >>>>>>>>>>>>>>>>>>>> +      if (!target_pure_code)
> >>>>>>>>>>>>>>>>>>>>       {
> >>>>>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>>>>>>>> -          accordingly.  */
> >>>>>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>>>>>>>> -                            tem,
> >>>>>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
> >>>>>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>>>> -     }
> >>>>>>>>>>>>>>>>>>>> -      else
> >>>>>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
> >>>>>>>>>>>>>>>>>>>> +       assemble_name (file, label);
> >>>>>>>>>>>>>>>>>>>> +       fputs (":\n", file);
> >>>>>>>>>>>>>>>>>>>> +       if (flag_pic)
> >>>>>>>>>>>>>>>>>>>> +         {
> >>>>>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
> >>>>>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
> >>>>>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
> >>>>>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
> >>>>>>>>>>>>>>>>>>>> +              accordingly.  */
> >>>>>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
> >>>>>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
> >>>>>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
> >>>>>>>>>>>>>>>>>>>> +                                tem,
> >>>>>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
> >>>>>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
> >>>>>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>>>> +         }
> >>>>>>>>>>>>>>>>>>>> +       else
> >>>>>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
> >>>>>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
> >>>>>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
> >>>>>>>>>>>>>>>>>>>> +     }
> >>>>>>>>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>>>>>>>>    else
> >>>>>>>>>>>>>>>>>>>>      {
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>>> +class thumb1_const_rtl
> >>>> ...
> >>>> +  void mov (int val)
> >>>>
> >>>> This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
> >>>> same applies to the asm version as well.
> >>>>
> >>>> +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
> >>>>
> >>>> Should be using reg_names[dst_regno] in all cases.  In fact, you might
> >>>> want to move that lookup to the constructor and just save a pointer to
> >>>> the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
> >>>
> >>> Correction, for a (signed) HOST_WIDE_INT, this should be
> >>> HOST_WIDE_INT_PRINT_DEC.
> >>>
> >>
> >> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
> >> two classes
> >> can have unsigned HOST_WIDE_INT parameters, and thus use
> >> HOST_WIDE_INT_PRINT_UNSIGNED?
> >>
> >
> > It's generally safer to print it as a signed value.  We probably won't
> > have cases where the top bit of a 32-bit word are set here; but if you
> > do, and the value is unsigned, you end up with 16 digit hex numbers
> > rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT
> > is at least 64 bits in size.
> >
>
> I don't mean hex numbers, of course, just very large decimal numbers.
> But the point is still the same.
>

Here is an updated version, which hopefully addresses your comments,
and adds testcases for cortex-m0 and cortex-m23.

Christophe

> R.
>
> > R.
> >
> >>
> >>>> rather than "%d" for the immediate.
> >>>>
> >>>> +template <class T>
> >>>> +void
> >>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
> >>>> +{
> >>>> +  bool mov_done_p = false;
> >>>> +  int val = op1;
> >>>>
> >>>> This potentially silently loses precision.  In fact, I think you really
> >>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so
> >>>> that the right shifts aren't undefined if dealing with negative numbers.
> >>>>
> >>>> For safety, you should also have an assertion in here that
> >>>>
> >>>>   op1 == trunc_int_for_mode (op1, SImode)
> >>>>
> >>>> +  int shift = 0;
> >>>> +  int i;
> >>>> +
> >>>> +  if (val == 0)
> >>>>
> >>>> You can short-circuit 0..255 here for a quick exit.
> >>>>
> >>>> +    {
> >>>> +      dst.mov (val);
> >>>> +      return;
> >>>> +    }
> >>>>
> >>>> Another trick: if the top nine bits of the 32-bit value are all set,
> >>>> you're probably going to be better off (and certainly not worse off) by
> >>>> generating -op1 and then negating the result in a final step - you can
> >>>> do that via recursion.
> >>>>
> >>>> +
> >>>> +  /* In the general case, we need 7 instructions to build
> >>>> +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
> >>>> +     do better if VAL is small enough, or
> >>>> +     right-shiftable by a suitable amount.  If the
> >>>> +     right-shift enables to encode at least one less byte,
> >>>> +     it's worth it: we save a adds and a lsls at the
> >>>> +     expense of a final lsls.  */
> >>>> +  int final_shift = number_of_first_bit_set (val);
> >>>> +
> >>>> +  int leading_zeroes = clz_hwi (val);
> >>>> +  int number_of_bytes_needed
> >>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
> >>>> +       / BITS_PER_UNIT) + 1;
> >>>> +  int number_of_bytes_needed2
> >>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
> >>>> +       / BITS_PER_UNIT) + 1;
> >>>> +
> >>>> +  if (number_of_bytes_needed2 < number_of_bytes_needed)
> >>>> +    val >>= final_shift;
> >>>> +  else
> >>>> +    final_shift = 0;
> >>>> +
> >>>> +  /* If we are in a very small range, we can use either a single movs
> >>>> +     or movs+adds.  */
> >>>> +  if ((val >= 0) && (val <= 510))
> >>>>
> >>>> if val is made unsigned HWI as I suggest, the lower bounds test is not
> >>>> needed.
> >>>>
> >>>> +    {
> >>>> +      if (val > 255)
> >>>> +     {
> >>>> +       int high = val - 255;
> >>>>
> >>>> Again, watch your types.
> >>>>
> >>>> +
> >>>> +       dst.mov (high);
> >>>> +       dst.add (255);
> >>>> +     }
> >>>> +      else
> >>>> +     dst.mov (val);
> >>>> +
> >>>> +      if (final_shift > 0)
> >>>> +     dst.ashift (final_shift);
> >>>> +    }
> >>>> +  else
> >>>> +    {
> >>>> +      /* General case, emit upper 3 bytes as needed.  */
> >>>> +      for (i = 0; i < 3; i++)
> >>>> +     {
> >>>> +       int byte = (val >> (8 * (3 - i))) & 0xff;
> >>>>
> >>>> and here.
> >>>>
> >>>> +
> >>>> +       if (byte)
> >>>> +         {
> >>>> +           /* We are about to emit new bits, stop accumulating a
> >>>> +              shift amount, and left-shift only if we have already
> >>>> +              emitted some upper bits.  */
> >>>> +           if (mov_done_p)
> >>>> +             {
> >>>> +               dst.ashift (shift);
> >>>> +               dst.add (byte);
> >>>> +             }
> >>>> +           else
> >>>> +             dst.mov (byte);
> >>>> +
> >>>> +           /* Stop accumulating shift amount since we've just
> >>>> +              emitted some bits.  */
> >>>> +           shift = 0;
> >>>> +
> >>>> +           mov_done_p = true;
> >>>> +         }
> >>>> +
> >>>> +       if (mov_done_p)
> >>>> +         shift += 8;
> >>>> +     }
> >>>> +
> >>>> +      /* Emit lower byte.  */
> >>>> +      if (!mov_done_p)
> >>>> +     dst.mov (val & 0xff);
> >>>> +      else
> >>>> +     {
> >>>> +       dst.ashift (shift);
> >>>> +       if (val & 0xff)
> >>>> +         dst.add (val & 0xff);
> >>>> +     }
> >>>> +
> >>>> +      if (final_shift > 0)
> >>>> +     dst.ashift (final_shift);
> >>>> +    }
> >>>> +}
> >>>> +
> >>>>
> >>>
> >
>

[-- Attachment #2: 0001-arm-Improve-thumb1_gen_const_int.patch --]
[-- Type: text/x-patch, Size: 12498 bytes --]

From 8e4298b4d46b6d2555c70aa6ce95d9afda4dfa69 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Fri, 9 Oct 2020 12:42:39 +0000
Subject: [PATCH 1/3] arm: Improve thumb1_gen_const_int

Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
Thumb-1 with -mpure-code.

Use a template so that the algorithm is effectively shared, and
rely on two classes to handle the actual emission as RTL or asm.

The generated sequence is improved to handle right-shiftable and small
values with less instructions. We now generate:

128:
        movs    r0, r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17
-8192:
	movs	r3, #1
	lsls	r3, #13
	rsbs	r3, #0

The patch adds a testcase which does not fully exercise
thumb1_gen_const_int, as other existing patterns already catch small
constants.  These parts of thumb1_gen_const_int are used by
arm_thumb1_mi_thunk.

2020-11-02  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New
	classes.
	(thumb1_gen_const_int): Rename to ...
	(thumb1_gen_const_int_1): ... New helper function. Add capability
	to emit either RTL or asm, improve generated code.
	(thumb1_gen_const_int_rtl): New function.
	* gcc/config/arm/arm-protos.h (thumb1_gen_const_int): Rename to
	thumb1_gen_const_int_rtl.
	* gcc/config/arm/thumb1.md: Call thumb1_gen_const_int_rtl instead
	of thumb1_gen_const_int.

	gcc/testsuite/
	* gcc.target/arm/pure-code/no-literal-pool-m0.c: New.
---
 gcc/config/arm/arm-protos.h                        |   2 +-
 gcc/config/arm/arm.c                               | 224 ++++++++++++++++++---
 gcc/config/arm/thumb1.md                           |   2 +-
 .../gcc.target/arm/pure-code/no-literal-pool-m0.c  | 175 ++++++++++++++++
 4 files changed, 369 insertions(+), 34 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 0cc0ae7..23b329d 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -74,7 +74,7 @@ extern bool arm_small_register_classes_for_mode_p (machine_mode);
 extern int const_ok_for_arm (HOST_WIDE_INT);
 extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code);
 extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code);
-extern void thumb1_gen_const_int (rtx, HOST_WIDE_INT);
+extern void thumb1_gen_const_int_rtl (rtx, HOST_WIDE_INT);
 extern int arm_split_constant (RTX_CODE, machine_mode, rtx,
 			       HOST_WIDE_INT, rtx, rtx, int);
 extern int legitimate_pic_operand_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ceeb91f..29fa536 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -4513,38 +4513,6 @@ const_ok_for_dimode_op (HOST_WIDE_INT i, enum rtx_code code)
     }
 }
 
-/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
-   Avoid generating useless code when one of the bytes is zero.  */
-void
-thumb1_gen_const_int (rtx op0, HOST_WIDE_INT op1)
-{
-  bool mov_done_p = false;
-  int i;
-
-  /* Emit upper 3 bytes if needed.  */
-  for (i = 0; i < 3; i++)
-    {
-      int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-      if (byte)
-	{
-	  emit_set_insn (op0, mov_done_p
-			 ? gen_rtx_PLUS (SImode,op0, GEN_INT (byte))
-			 : GEN_INT (byte));
-	  mov_done_p = true;
-	}
-
-      if (mov_done_p)
-	emit_set_insn (op0, gen_rtx_ASHIFT (SImode, op0, GEN_INT (8)));
-    }
-
-  /* Emit lower byte if needed.  */
-  if (!mov_done_p)
-    emit_set_insn (op0, GEN_INT (op1 & 0xff));
-  else if (op1 & 0xff)
-    emit_set_insn (op0, gen_rtx_PLUS (SImode, op0, GEN_INT (op1 & 0xff)));
-}
-
 /* Emit a sequence of insns to handle a large constant.
    CODE is the code of the operation required, it can be any of SET, PLUS,
    IOR, AND, XOR, MINUS;
@@ -28244,6 +28212,198 @@ arm_internal_label (FILE *stream, const char *prefix, unsigned long labelno)
   default_internal_label (stream, prefix, labelno);
 }
 
+/* Define classes to generate code as RTL or output asm to a file.
+   Using templates then allows to use the same code to output code
+   sequences in the two formats.  */
+class thumb1_const_rtl
+{
+ public:
+  thumb1_const_rtl (rtx dst) : dst (dst) {}
+
+  void mov (HOST_WIDE_INT val)
+  {
+    emit_set_insn (dst, GEN_INT (val));
+  }
+
+  void add (HOST_WIDE_INT val)
+  {
+    emit_set_insn (dst, gen_rtx_PLUS (SImode, dst, GEN_INT (val)));
+  }
+
+  void ashift (HOST_WIDE_INT shift)
+  {
+    emit_set_insn (dst, gen_rtx_ASHIFT (SImode, dst, GEN_INT (shift)));
+  }
+
+  void neg ()
+  {
+    emit_set_insn (dst, gen_rtx_NEG (SImode, dst));
+  }
+
+ private:
+  rtx dst;
+};
+
+class thumb1_const_print
+{
+ public:
+  thumb1_const_print (FILE *f, int regno)
+  {
+    t_file = f;
+    dst_regname = reg_names[regno];
+  }
+
+  void mov (HOST_WIDE_INT val)
+  {
+    asm_fprintf (t_file, "\tmovs\t%s, #" HOST_WIDE_INT_PRINT_DEC "\n",
+		 dst_regname, val);
+  }
+
+  void add (HOST_WIDE_INT val)
+  {
+    asm_fprintf (t_file, "\tadds\t%s, #" HOST_WIDE_INT_PRINT_DEC "\n",
+		 dst_regname, val);
+  }
+
+  void ashift (HOST_WIDE_INT shift)
+  {
+    asm_fprintf (t_file, "\tlsls\t%s, #" HOST_WIDE_INT_PRINT_DEC "\n",
+		 dst_regname, shift);
+  }
+
+  void neg ()
+  {
+    asm_fprintf (t_file, "\trsbs\t%s, #0\n", dst_regname);
+  }
+
+ private:
+  FILE *t_file;
+  const char *dst_regname;
+};
+
+/* Emit a sequence of movs/adds/shift to produce a 32-bit constant.
+   Avoid generating useless code when one of the bytes is zero.  */
+template <class T>
+void
+thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
+{
+  bool mov_done_p = false;
+  unsigned HOST_WIDE_INT val = op1;
+  int shift = 0;
+  int i;
+
+  gcc_assert (op1 == trunc_int_for_mode (op1, SImode));
+
+  if (val <= 255)
+    {
+      dst.mov (val);
+      return;
+    }
+
+  /* For negative numbers with the first nine bits set, build the
+     opposite of OP1, then negate it, it's generally shorter and not
+     longer.  */
+  if ((val & 0xFF800000) == 0xFF800000)
+    {
+      thumb1_gen_const_int_1 (dst, -op1);
+      dst.neg ();
+      return;
+    }
+
+  /* In the general case, we need 7 instructions to build
+     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
+     do better if VAL is small enough, or
+     right-shiftable by a suitable amount.  If the
+     right-shift enables to encode at least one less byte,
+     it's worth it: we save a adds and a lsls at the
+     expense of a final lsls.  */
+  int final_shift = number_of_first_bit_set (val);
+
+  int leading_zeroes = clz_hwi (val);
+  int number_of_bytes_needed
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
+       / BITS_PER_UNIT) + 1;
+  int number_of_bytes_needed2
+    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
+       / BITS_PER_UNIT) + 1;
+
+  if (number_of_bytes_needed2 < number_of_bytes_needed)
+    val >>= final_shift;
+  else
+    final_shift = 0;
+
+  /* If we are in a very small range, we can use either a single movs
+     or movs+adds.  */
+  if (val <= 510)
+    {
+      if (val > 255)
+	{
+	  unsigned HOST_WIDE_INT high = val - 255;
+
+	  dst.mov (high);
+	  dst.add (255);
+	}
+      else
+	dst.mov (val);
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+  else
+    {
+      /* General case, emit upper 3 bytes as needed.  */
+      for (i = 0; i < 3; i++)
+	{
+	  unsigned HOST_WIDE_INT byte = (val >> (8 * (3 - i))) & 0xff;
+
+	  if (byte)
+	    {
+	      /* We are about to emit new bits, stop accumulating a
+		 shift amount, and left-shift only if we have already
+		 emitted some upper bits.  */
+	      if (mov_done_p)
+		{
+		  dst.ashift (shift);
+		  dst.add (byte);
+		}
+	      else
+		dst.mov (byte);
+
+	      /* Stop accumulating shift amount since we've just
+		 emitted some bits.  */
+	      shift = 0;
+
+	      mov_done_p = true;
+	    }
+
+	  if (mov_done_p)
+	    shift += 8;
+	}
+
+      /* Emit lower byte.  */
+      if (!mov_done_p)
+	dst.mov (val & 0xff);
+      else
+	{
+	  dst.ashift (shift);
+	  if (val & 0xff)
+	    dst.add (val & 0xff);
+	}
+
+      if (final_shift > 0)
+	dst.ashift (final_shift);
+    }
+}
+
+/* Proxy for thumb1.md, since the thumb1_const_print and
+   thumb1_const_rtl classes are not exported.  */
+void
+thumb1_gen_const_int_rtl (rtx dst, HOST_WIDE_INT op1)
+{
+  thumb1_const_rtl t (dst);
+  thumb1_gen_const_int_1 (t, op1);
+}
+
 /* Output code to add DELTA to the first argument, and then jump
    to FUNCTION.  Used for C++ multiple inheritance.  */
 
diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index 2258a52..b32dd4f 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -820,7 +820,7 @@ (define_split
    && !satisfies_constraint_K (operands[1])"
   [(clobber (const_int 0))]
   "
-    thumb1_gen_const_int (operands[0], INTVAL (operands[1]));
+    thumb1_gen_const_int_rtl (operands[0], INTVAL (operands[1]));
     DONE;
   "
 )
diff --git a/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c
new file mode 100644
index 0000000..787a61a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m0.c
@@ -0,0 +1,175 @@
+/* { dg-do compile } */
+/* { dg-options "-mpure-code -mcpu=cortex-m0 -march=armv6s-m -mthumb" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+/* Does not use thumb1_gen_const_int.
+** test_0:
+**	...
+**	movs	r[0-3], #0
+**	...
+*/
+int
+test_0 ()
+{
+  return 0;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_128:
+**	...
+**	movs	r[0-3], #128
+**	...
+*/
+int
+test_128 ()
+{
+  return 128;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_264:
+**	...
+**	movs	r[0-3], #132
+**	lsls	r[0-3], r[0-3], #1
+**	...
+*/
+int
+test_264 ()
+{
+  return 264;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_510:
+**	...
+**	movs	r[0-3], #255
+**	lsls	r[0-3], r[0-3], #1
+**	...
+*/
+int
+test_510 ()
+{
+  return 510;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_512:
+**	...
+**	movs	r[0-3], #128
+**	lsls	r[0-3], r[0-3], #2
+**	...
+*/
+int
+test_512 ()
+{
+  return 512;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_764:
+**	...
+**	movs	r[0-3], #191
+**	lsls	r[0-3], r[0-3], #2
+**	...
+*/
+int
+test_764 ()
+{
+  return 764;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_65536:
+**	...
+**	movs	r[0-3], #128
+**	lsls	r[0-3], r[0-3], #9
+**	...
+*/
+int
+test_65536 ()
+{
+  return 65536;
+}
+
+/*
+** test_0x123456:
+**	...
+**	movs	r[0-3], #18
+**	lsls	r[0-3], r[0-3], #8
+**	adds	r[0-3], r[0-3], #52
+**	lsls	r[0-3], r[0-3], #8
+**	adds	r[0-3], r[0-3], #86
+**	...
+*/
+int
+test_0x123456 ()
+{
+  return 0x123456;
+}
+
+/*
+** test_0x1123456:
+**	...
+**	movs	r[0-3], #137
+**	lsls	r[0-3], r[0-3], #8
+**	adds	r[0-3], r[0-3], #26
+**	lsls	r[0-3], r[0-3], #8
+**	adds	r[0-3], r[0-3], #43
+**	lsls	r[0-3], r[0-3], #1
+**	...
+*/
+int
+test_0x1123456 ()
+{
+  return 0x1123456;
+}
+
+/* With -Os, we generate:
+   movs r0, #16
+   lsls r0, r0, r0
+   With the other optimization levels, we generate:
+   movs r0, #16
+   lsls r0, r0, #16
+   hence the two alternatives.  */
+/*
+** test_0x1000010:
+**	...
+**	movs	r[0-3], #16
+**	lsls	r[0-3], r[0-3], (#16|r[0-3])
+**	adds	r[0-3], r[0-3], #1
+**	lsls	r[0-3], r[0-3], #4
+**	...
+*/
+int
+test_0x1000010 ()
+{
+  return 0x1000010;
+}
+
+/*
+** test_0x1000011:
+**	...
+**	movs	r[0-3], #1
+**	lsls	r[0-3], r[0-3], #24
+**	adds	r[0-3], r[0-3], #17
+**	...
+*/
+int
+test_0x1000011 ()
+{
+  return 0x1000011;
+}
+
+/*
+** test_m8192:
+**	...
+**	movs	r[0-3], #1
+**	lsls	r[0-3], r[0-3], #13
+**	rsbs	r[0-3], r[0-3], #0
+**	...
+*/
+int
+test_m8192 ()
+{
+  return -8192;
+}
-- 
2.7.4


[-- Attachment #3: 0002-arm-Call-thumb1_gen_const_int-from-thumb1_movsi_insn.patch --]
[-- Type: text/x-patch, Size: 6730 bytes --]

From 4e91a0986014446071a031e29ea0efbc90dda407 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Tue, 27 Oct 2020 16:16:08 +0000
Subject: [PATCH 2/3] arm: Call thumb1_gen_const_int from thumb1_movsi_insn

thumb1_movsi_insn used the same algorithm to build a constant in asm
than thumb1_gen_const_int_1 does in RTL. Since the previous patch added
support for asm generation in thumb1_gen_const_int_1, this patch calls
it from thumb1_movsi_insn to avoid duplication.

We need to introduce a new proxy function, thumb1_gen_const_int_print
to select the right template.

This patch also adds a new testcase as the updated alternative is only
used by thumb-1 processors that also support movt/movw.

2020-11-02  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/thumb1.md (thumb1_movsi_insn): Call
	thumb1_gen_const_int_print.
	* config/arm/arm-protos.h (thumb1_gen_const_int_print): Add
	prototype.
	* config/arm/arm.c (thumb1_gen_const_int_print): New.

	gcc/testsuite/
	* gcc.target/arm/pure-code/no-literal-pool-m23.c: New.
---
 gcc/config/arm/arm-protos.h                        |   1 +
 gcc/config/arm/arm.c                               |   9 +-
 gcc/config/arm/thumb1.md                           |  37 +----
 .../gcc.target/arm/pure-code/no-literal-pool-m23.c | 171 +++++++++++++++++++++
 4 files changed, 184 insertions(+), 34 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23b329d..d09e6a2 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -75,6 +75,7 @@ extern int const_ok_for_arm (HOST_WIDE_INT);
 extern int const_ok_for_op (HOST_WIDE_INT, enum rtx_code);
 extern int const_ok_for_dimode_op (HOST_WIDE_INT, enum rtx_code);
 extern void thumb1_gen_const_int_rtl (rtx, HOST_WIDE_INT);
+extern void thumb1_gen_const_int_print (rtx, HOST_WIDE_INT);
 extern int arm_split_constant (RTX_CODE, machine_mode, rtx,
 			       HOST_WIDE_INT, rtx, rtx, int);
 extern int legitimate_pic_operand_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 29fa536..584714b 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28395,7 +28395,7 @@ thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
     }
 }
 
-/* Proxy for thumb1.md, since the thumb1_const_print and
+/* Proxies for thumb1.md, since the thumb1_const_print and
    thumb1_const_rtl classes are not exported.  */
 void
 thumb1_gen_const_int_rtl (rtx dst, HOST_WIDE_INT op1)
@@ -28404,6 +28404,13 @@ thumb1_gen_const_int_rtl (rtx dst, HOST_WIDE_INT op1)
   thumb1_gen_const_int_1 (t, op1);
 }
 
+void
+thumb1_gen_const_int_print (rtx dst, HOST_WIDE_INT op1)
+{
+  thumb1_const_print t (asm_out_file, REGNO (dst));
+  thumb1_gen_const_int_1 (t, op1);
+}
+
 /* Output code to add DELTA to the first argument, and then jump
    to FUNCTION.  Used for C++ multiple inheritance.  */
 
diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index b32dd4f..52468da 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -688,40 +688,11 @@ (define_insn "*thumb1_movsi_insn"
 	  }
 	else if (GET_CODE (operands[1]) == CONST_INT)
 	  {
-	    int i;
-	    HOST_WIDE_INT op1 = INTVAL (operands[1]);
-	    bool mov_done_p = false;
-	    rtx ops[2];
-	    ops[0] = operands[0];
-
-	    /* Emit upper 3 bytes if needed.  */
-	    for (i = 0; i < 3; i++)
-	      {
-		int byte = (op1 >> (8 * (3 - i))) & 0xff;
-
-		if (byte)
-		  {
-		    ops[1] = GEN_INT (byte);
-		    if (mov_done_p)
-		      output_asm_insn ("adds\t%0, %1", ops);
-		    else
-		      output_asm_insn ("movs\t%0, %1", ops);
-		    mov_done_p = true;
-		  }
-
-		if (mov_done_p)
-		  output_asm_insn ("lsls\t%0, #8", ops);
-	      }
-
-	    /* Emit lower byte if needed.  */
-	    ops[1] = GEN_INT (op1 & 0xff);
-	    if (!mov_done_p)
-	      output_asm_insn ("movs\t%0, %1", ops);
-	    else if (op1 & 0xff)
-	      output_asm_insn ("adds\t%0, %1", ops);
-	    return "";
+	    thumb1_gen_const_int_print (operands[0], INTVAL (operands[1]));
+	    return \"\";
 	  }
-	  gcc_unreachable ();
+
+	gcc_unreachable ();
 
       case 8: return "ldr\t%0, %1";
       case 9: return "str\t%1, %0";
diff --git a/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c
new file mode 100644
index 0000000..67d63d2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pure-code/no-literal-pool-m23.c
@@ -0,0 +1,171 @@
+/* { dg-do compile } */
+/* { dg-options "-mpure-code -mcpu=cortex-m23 -march=armv8-m.base -mthumb" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+/*
+** testi:
+**	...
+**	movs	r[0-3], #1
+**	lsls	r[0-3], #13
+**	rsbs	r[0-3], #0
+**	...
+*/
+int
+testi (int *p)
+{
+  if (*p > 0x12345678)
+    return *p-8192;
+  else
+    return *p+8192;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_0:
+**	...
+**	movs	r[0-3], #0
+**	...
+*/
+int
+test_0 ()
+{
+  return 0;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_128:
+**	...
+**	movs	r[0-3], #128
+**	...
+*/
+int
+test_128 ()
+{
+  return 128;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_264:
+**	...
+**	movw	r[0-3], #264
+**	...
+*/
+int
+test_264 ()
+{
+  return 264;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_510:
+**	...
+**	movw	r[0-3], #510
+**	...
+*/
+int
+test_510 ()
+{
+  return 510;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_512:
+**	...
+**	movw	r[0-3], #512
+**	...
+*/
+int
+test_512 ()
+{
+  return 512;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_764:
+**	...
+**	movw	r[0-3], #764
+**	...
+*/
+int
+test_764 ()
+{
+  return 764;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_65536:
+**	...
+**	movs	r[0-3], #128
+**	lsls	r[0-3], r[0-3], #9
+**	...
+*/
+int
+test_65536 ()
+{
+  return 65536;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_0x123456:
+**	...
+**	movw	r[0-3], #13398
+**	movt	r[0-3], 18
+**	...
+*/
+int
+test_0x123456 ()
+{
+  return 0x123456;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_0x1123456:
+**	...
+**	movw	r[0-3], #13398
+**	movt	r[0-3], 274
+**	...
+*/
+int
+test_0x1123456 ()
+{
+  return 0x1123456;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_0x1000010:
+**	...
+**	movs	r[0-3], #16
+**	movt	r[0-3], 256
+**	...
+*/
+int
+test_0x1000010 ()
+{
+  return 0x1000010;
+}
+
+/* Does not use thumb1_gen_const_int.
+** test_0x1000011:
+**	...
+**	movs	r[0-3], #17
+**	movt	r[0-3], 256
+**	...
+*/
+int
+test_0x1000011 ()
+{
+  return 0x1000011;
+}
+
+/*
+** test_m8192:
+**	...
+**	movs	r[0-3], #1
+**	lsls	r[0-3], #13
+**	rsbs	r[0-3], #0
+**	...
+*/
+int
+test_m8192 ()
+{
+  return -8192;
+}
-- 
2.7.4


[-- Attachment #4: 0003-arm-Fix-multiple-inheritance-thunks-for-thumb-1-with.patch --]
[-- Type: text/x-patch, Size: 4068 bytes --]

From 696d56ed48b6b7ac3293212118d3f157e8f25477 Mon Sep 17 00:00:00 2001
From: Christophe Lyon <christophe.lyon@linaro.org>
Date: Tue, 29 Sep 2020 19:40:39 +0000
Subject: [PATCH 3/3] arm: Fix multiple inheritance thunks for thumb-1 with
 -mpure-code

When -mpure-code is used, we cannot load delta from code memory (like
we do without -mpure-code).

This patch builds the value of mi_delta into r3 with a series of
movs/adds/lsls.

We also do some cleanup by not emitting the function address and delta
via .word directives at the end of the thunk since we don't use them
with -mpure-code.

No need for new testcases, this bug was already identified by:
g++.dg/ipa/pr46287-3.C
g++.dg/ipa/pr46984.C
g++.dg/opt/thunk1.C
g++.dg/torture/pr46287.C
g++.dg/torture/pr45699.C

2020-11-02  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
	do not emit function address and delta when -mpure-code is used.
---
 gcc/config/arm/arm.c | 67 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 584714b..fc88071 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -28509,9 +28509,19 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
     {
       if (mi_delta > 255)
 	{
-	  fputs ("\tldr\tr3, ", file);
-	  assemble_name (file, label);
-	  fputs ("+4\n", file);
+	  /* With -mpure-code, we cannot load MI_DELTA from the
+	     constant pool: we build it explicitly.  */
+	  if (target_pure_code)
+	    {
+	      thumb1_const_print r3 (file, 3);
+	      thumb1_gen_const_int_1 (r3, mi_delta);
+	    }
+	  else
+	    {
+	      fputs ("\tldr\tr3, ", file);
+	      assemble_name (file, label);
+	      fputs ("+4\n", file);
+	    }
 	  asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
 		       mi_op, this_regno, this_regno);
 	}
@@ -28547,30 +28557,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
 	fputs ("\tpop\t{r3}\n", file);
 
       fprintf (file, "\tbx\tr12\n");
-      ASM_OUTPUT_ALIGN (file, 2);
-      assemble_name (file, label);
-      fputs (":\n", file);
-      if (flag_pic)
+
+      /* With -mpure-code, we don't need to emit literals for the
+	 function address and delta since we emitted code to build
+	 them.  */
+      if (!target_pure_code)
 	{
-	  /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
-	  rtx tem = XEXP (DECL_RTL (function), 0);
-	  /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
-	     pipeline offset is four rather than eight.  Adjust the offset
-	     accordingly.  */
-	  tem = plus_constant (GET_MODE (tem), tem,
-			       TARGET_THUMB1_ONLY ? -3 : -7);
-	  tem = gen_rtx_MINUS (GET_MODE (tem),
-			       tem,
-			       gen_rtx_SYMBOL_REF (Pmode,
-						   ggc_strdup (labelpc)));
-	  assemble_integer (tem, 4, BITS_PER_WORD, 1);
-	}
-      else
-	/* Output ".word .LTHUNKn".  */
-	assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
+	  ASM_OUTPUT_ALIGN (file, 2);
+	  assemble_name (file, label);
+	  fputs (":\n", file);
+	  if (flag_pic)
+	    {
+	      /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
+	      rtx tem = XEXP (DECL_RTL (function), 0);
+	      /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
+		 pipeline offset is four rather than eight.  Adjust the offset
+		 accordingly.  */
+	      tem = plus_constant (GET_MODE (tem), tem,
+				   TARGET_THUMB1_ONLY ? -3 : -7);
+	      tem = gen_rtx_MINUS (GET_MODE (tem),
+				   tem,
+				   gen_rtx_SYMBOL_REF (Pmode,
+						       ggc_strdup (labelpc)));
+	      assemble_integer (tem, 4, BITS_PER_WORD, 1);
+	    }
+	  else
+	    /* Output ".word .LTHUNKn".  */
+	    assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
 
-      if (TARGET_THUMB1_ONLY && mi_delta > 255)
-	assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
+	  if (TARGET_THUMB1_ONLY && mi_delta > 255)
+	    assemble_integer (GEN_INT (mi_delta), 4, BITS_PER_WORD, 1);
+	}
     }
   else
     {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code
  2020-11-02 10:24                                         ` Christophe Lyon
@ 2020-11-02 14:28                                           ` Richard Earnshaw
  0 siblings, 0 replies; 24+ messages in thread
From: Richard Earnshaw @ 2020-11-02 14:28 UTC (permalink / raw)
  To: Christophe Lyon; +Cc: gcc Patches

On 02/11/2020 10:24, Christophe Lyon via Gcc-patches wrote:
> Hi,
> 
> On Fri, 30 Oct 2020 at 13:49, Richard Earnshaw
> <Richard.Earnshaw@foss.arm.com> wrote:
>>
>> On 29/10/2020 19:18, Richard Earnshaw via Gcc-patches wrote:
>>> On 28/10/2020 18:10, Christophe Lyon via Gcc-patches wrote:
>>>> On Wed, 28 Oct 2020 at 18:44, Richard Earnshaw
>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>
>>>>> On 27/10/2020 15:42, Richard Earnshaw via Gcc-patches wrote:
>>>>>> On 26/10/2020 10:52, Christophe Lyon via Gcc-patches wrote:
>>>>>>> On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>
>>>>>>>> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>> On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>> On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 20/10/2020 12:22, Richard Earnshaw wrote:
>>>>>>>>>>>>>>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
>>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>>>>>>>>>>>>>>>>>>>> <Richard.Earnshaw@foss.arm.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
>>>>>>>>>>>>>>>>>>>>>> When mi_delta is > 255 and -mpure-code is used, we cannot load delta
>>>>>>>>>>>>>>>>>>>>>> from code memory (like we do without -mpure-code).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This patch builds the value of mi_delta into r3 with a series of
>>>>>>>>>>>>>>>>>>>>>> movs/adds/lsls.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We also do some cleanup by not emitting the function address and delta
>>>>>>>>>>>>>>>>>>>>>> via .word directives at the end of the thunk since we don't use them
>>>>>>>>>>>>>>>>>>>>>> with -mpure-code.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> No need for new testcases, this bug was already identified by
>>>>>>>>>>>>>>>>>>>>>> eg. pr46287-3.C
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2020-09-29  Christophe Lyon  <christophe.lyon@linaro.org>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>       gcc/
>>>>>>>>>>>>>>>>>>>>>>       * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta in r3 and
>>>>>>>>>>>>>>>>>>>>>>       do not emit function address and delta when -mpure-code is used.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Richard,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There are some optimizations you can make to this code.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Firstly, for values between 256 and 510 (inclusive), it would be better
>>>>>>>>>>>>>>>>>>>>> to just expand a mov of 255 followed by an add.
>>>>>>>>>>>>>>>>>>>> I now see the splitted for the "Pe" constraint which I hadn't noticed
>>>>>>>>>>>>>>>>>>>> before, so I can write something similar indeed.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, I'm note quite sure to understand the benefit in the split
>>>>>>>>>>>>>>>>>>>> when -mpure-code is NOT used.
>>>>>>>>>>>>>>>>>>>> Consider:
>>>>>>>>>>>>>>>>>>>> int f3_1 (void) { return 510; }
>>>>>>>>>>>>>>>>>>>> int f3_2 (void) { return 511; }
>>>>>>>>>>>>>>>>>>>> Compile with -O2 -mcpu=cortex-m0:
>>>>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>>>>         ldr     r0, .L4
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The splitter makes the code bigger, does it "compensate" for this by
>>>>>>>>>>>>>>>>>>>> not having to load the constant?
>>>>>>>>>>>>>>>>>>>> Actually the constant uses 4 more bytes, which should be taken into
>>>>>>>>>>>>>>>>>>>> account when comparing code size,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, the size of the literal pool entry needs to be taken into account.
>>>>>>>>>>>>>>>>>>>  It might happen that the entry could be shared with another use of that
>>>>>>>>>>>>>>>>>>> literal, but in general that's rare.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>>>>>>>>>>>>>>>>>>>> thumb1 instructions would be equivalent in size compared to loading
>>>>>>>>>>>>>>>>>>>> from the literal pool. Should the 256-510 range be extended?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's a bit borderline at three instructions when literal pools are not
>>>>>>>>>>>>>>>>>>> expensive to use, but in thumb1 literal pools tend to be quite small due
>>>>>>>>>>>>>>>>>>> to the limited pc offsets we can use.  I think on balance we probably
>>>>>>>>>>>>>>>>>>> want to use the instruction sequence unless optimizing for size.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is also true for
>>>>>>>>>>>>>>>>>>>>> the literal pools alternative as well, so should be handled before all
>>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>> I am not sure what you mean: with -mpure-code, the above sample is compiled as:
>>>>>>>>>>>>>>>>>>>> f3_1:
>>>>>>>>>>>>>>>>>>>>         movs    r0, #255
>>>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #1
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>> f3_2:
>>>>>>>>>>>>>>>>>>>>         movs    r0, #1
>>>>>>>>>>>>>>>>>>>>         lsls    r0, r0, #8
>>>>>>>>>>>>>>>>>>>>         adds    r0, r0, #255
>>>>>>>>>>>>>>>>>>>>         bx      lr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> so the "return 510" case is already handled as without -mpure-code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was thinking specifically of the thunk sequence where you seem to be
>>>>>>>>>>>>>>>>>>> emitting instructions directly rather than generating RTL.  The examples
>>>>>>>>>>>>>>>>>>> you show here are not thunks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OK thanks for the clarification.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here is an updated version, split into 3 patches to hopefully make
>>>>>>>>>>>>>>>>>> review easier.
>>>>>>>>>>>>>>>>>> They apply on top of my other mpure-code patches for PR96967 and PR96770:
>>>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554956.html
>>>>>>>>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-September/554957.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I kept it this way to make incremental changes easier to understand.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch 1: With the hope to avoid confusion and make maintenance easier,
>>>>>>>>>>>>>>>>>> I have updated thumb1_gen_const_int() so that it can generate either RTL or
>>>>>>>>>>>>>>>>>> asm. This way, all the code used to build thumb-1 constants is in the
>>>>>>>>>>>>>>>>>> same place,
>>>>>>>>>>>>>>>>>>  in case we need to improve/fix it later. We now generate shorter sequences in
>>>>>>>>>>>>>>>>>> several cases matching your comments.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch 2: Removes the equivalent loop from thumb1_movsi_insn pattern and
>>>>>>>>>>>>>>>>>> calls thumb1_gen_const_int.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch 3: Update of the original patch in this thread, now calls
>>>>>>>>>>>>>>>>>> thumb1_gen_const_int.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yuk!  Those changes to thumb1_gen_const_int are horrible.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think we should be able to leverage the fact that the compiler can use
>>>>>>>>>>>>>>>>> C++ now to do much better than that, for example by making that function
>>>>>>>>>>>>>>>>> a template.  For example (and this is just a sketch):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Indeed! I didn't think about it since there is no other use of
>>>>>>>>>>>>>>>> templates in arm.c yet.
>>>>>>>>>>>>>>>> I'll send an update soon.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Other than that, does the approach look OK to you?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I think this is heading in the right direction.  Bringing the two
>>>>>>>>>>>>>>> immediate generating operations into a single function can only be a
>>>>>>>>>>>>>>> good thing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking again at your example constant sequences, I see:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>>>>         adds    r3, #1
>>>>>>>>>>>>>>>         lsls    r3, #4
>>>>>>>>>>>>>>> 0x1000011:
>>>>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>>>>         adds    r3, #17
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The first of these looks odd, given the second sequence.  Why doesn't
>>>>>>>>>>>>>>> the first expand to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>>>         movs    r3, #16
>>>>>>>>>>>>>>>         lsls    r3, #16
>>>>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Err, I mean to:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0x1000010:
>>>>>>>>>>>>>>         movs    r3, #1
>>>>>>>>>>>>>>         lsls    r3, #24
>>>>>>>>>>>>>>         adds    r3, #16
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Because I first try to right-shift the constant, hoping to reduce its
>>>>>>>>>>>>> range and need less instructions to build the higher part, then
>>>>>>>>>>>>> left-shift back.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In this particular case, we'd need to realize that there are many
>>>>>>>>>>>>> zeros "inside" the constant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I remove the part that tries to reduce the range, I do get that
>>>>>>>>>>>>> sequence, but for 764 I now generate
>>>>>>>>>>>>> movs    r3, #2
>>>>>>>>>>>>> lsls    r3, #8
>>>>>>>>>>>>> adds    r3, #252
>>>>>>>>>>>>> instead of
>>>>>>>>>>>>> movs    r3, #191
>>>>>>>>>>>>> lsls    r3, #2
>>>>>>>>>>>>>
>>>>>>>>>>>>> A possibility would be to try both approaches and keep the shortest one.
>>>>>>>>>>>>
>>>>>>>>>>>> Lets leave that for now, it's not important to fixing the main issue;
>>>>>>>>>>>> but we should remember we need to come back to it at some point.
>>>>>>>>>>>>
>>>>>>>>>>> Thanks, that's what I was thinking too.
>>>>>>>>>>>
>>>>>>>>>>>> There are other tricks as well, such as
>>>>>>>>>>>>
>>>>>>>>>>>>   0xffffff
>>>>>>>>>>>>
>>>>>>>>>>>> can be done as
>>>>>>>>>>>>
>>>>>>>>>>>>   0x1000000 - 1
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>>
>>>>>>>>>>>>   0xfffffd
>>>>>>>>>>>>
>>>>>>>>>>>> as
>>>>>>>>>>>>
>>>>>>>>>>>>   0x1000000 - 3
>>>>>>>>>>>>
>>>>>>>>>>>> but these can wait as well.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Didn't we already need to handle such tricks? I'm surprised this
>>>>>>>>>>> wasn't needed earlier.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't think we ever worried about them.  Most of them need at least 3
>>>>>>>>>> instructions so aren't a code size saving over using a literal pool entry.
>>>>>>>>>>
>>>>>>>>> OK, this will also help when using -mslow-flash-data.
>>>>>>>>>
>>>>>>>>> Here are updated patches, now using a template as you suggested.
>>>>>>>>
>>>>>>>> Looking better, but when I try to apply this to my local tree patch 2
>>>>>>>> fails (I'm not exactly sure why, what was your baseline for these
>>>>>>>> patches?)
>>>>>>> I have the tree patches in this thread on top of these other two:
>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556768.html
>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556769.html
>>>>>>>
>>>>>>> They have gradual improvements to thumb1_movsi_insn.
>>>>>>>
>>>>>>>> -- that patch looks suspicious anyway, you're replacing code
>>>>>>>> that prints out assembly with code that generates RTL.
>>>>>>> Right! I took me a while to understand how I could miss this, sorry.
>>>>>>> That was caused by improper testing, as this part of the code
>>>>>>> isn't used when targetting cortex-m0. I have added a testcase
>>>>>>> for cortex-m23 which crashes with the previous version of patch 2,
>>>>>>> and succeeds now.
>>>>>>>
>>>>>>>> Could you also rename t1_print and t1_rtl to thumb1_const_print and
>>>>>>>> thumb1_const_rtl.  I think the names as they stand are likely to be too
>>>>>>>> generic.
>>>>>>> OK, done.
>>>>>>>
>>>>>>> How about this new version?
>>>>>>> I'm not yet sure about the most appropriate naming for:
>>>>>>> thumb1_gen_const_int
>>>>>>> thumb1_gen_const_int_asm
>>>>>>> should they be
>>>>>>> thumb1_gen_const_int_rtl
>>>>>>> thumb1_gen_const_int_print
>>>>>>> to be consistent with the new classes?
>>>>>>
>>>>>> It would probably be better, yes.
>>>>>>
>>>>>> More detailed comments below.
>>>>>>
>>>>>> R.
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Christophe
>>>>>>>
>>>>>>>> R.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Christophe
>>>>>>>>>
>>>>>>>>>> R.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> R.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> class t1_rtl
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>>>>   void ashift(int a) { gen_rtx_ASHIFT(a); }
>>>>>>>>>>>>>>>>>   void rshift(int b) { gen_rtx_SHIFTRT(b); }
>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> class t1_print
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>  public:
>>>>>>>>>>>>>>>>>   t1_print (FILE *f) : t_file(f) {}
>>>>>>>>>>>>>>>>>   void ashift (int a) { fprintf (t_file, "a shift %d\n", a); }
>>>>>>>>>>>>>>>>>   void rshift (int b) { fprintf (t_file, "r shift %d\n", b); }
>>>>>>>>>>>>>>>>>  private:
>>>>>>>>>>>>>>>>>   FILE *t_file;
>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> template <class T>
>>>>>>>>>>>>>>>>> void thumb1_gen_const_int(T t, int f)
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>   // Expansion of thumb1_gen_const_int ...
>>>>>>>>>>>>>>>>>   t.ashift(f);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> // Usage...
>>>>>>>>>>>>>>>>> void f1()
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>   // Use the RTL expander
>>>>>>>>>>>>>>>>>   t1_rtl g;
>>>>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> void f2()
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>   // Use the printf expander writing to stdout
>>>>>>>>>>>>>>>>>   t1_print g(stdout);
>>>>>>>>>>>>>>>>>   thumb1_gen_const_int (g, 3);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With this you can write thumb1_gen_const_int without having to worry
>>>>>>>>>>>>>>>>> about which expander is being used in each instance and the template
>>>>>>>>>>>>>>>>> expansion will use the right version.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>  I also suspect (but haven't check) that the base adjustment will
>>>>>>>>>>>>>>>>>>>>> most commonly be a multiple of the machine word size (ie 4).  If that is
>>>>>>>>>>>>>>>>>>>>> the case then you could generate n/4 and then shift it left by 2 for an
>>>>>>>>>>>>>>>>>>>>> even greater range of literals.
>>>>>>>>>>>>>>>>>>>> I can see there is provision for this in the !TARGET_THUMB1_ONLY case,
>>>>>>>>>>>>>>>>>>>> I'll update my patch.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>  More generally, any sequence of up to
>>>>>>>>>>>>>>>>>>>>> three thumb1 instructions will be no larger, and probably as fast as the
>>>>>>>>>>>>>>>>>>>>> existing literal pool fall back.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Secondly, if the value is, for example, 65536 (0x10000), your code will
>>>>>>>>>>>>>>>>>>>>> emit a mov followed by two shift-by-8 instructions; the two shifts could
>>>>>>>>>>>>>>>>>>>>> be merged into a single shift-by-16.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Right, I'll try to make use of thumb_shiftable_const.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Finally, I'd really like to see some executable tests for this, if at
>>>>>>>>>>>>>>>>>>>>> all possible.
>>>>>>>>>>>>>>>>>>>> I mentioned pr46287-3.C, but that's not the only existing testcase
>>>>>>>>>>>>>>>>>>>> that showed the problem. There are also:
>>>>>>>>>>>>>>>>>>>> g++.dg/opt/thunk1.C
>>>>>>>>>>>>>>>>>>>> g++.dg/ipa/pr46984.C
>>>>>>>>>>>>>>>>>>>> g++.dg/torture/pr46287.C
>>>>>>>>>>>>>>>>>>>> g++.dg/torture/pr45699.C
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Do you want that I copy one of these in the arm subdir and add
>>>>>>>>>>>>>>>>>>>> -mpure-code in dg-options?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On reflection, probably not - that just makes things more complicated
>>>>>>>>>>>>>>>>>>> with all the dg-options mess (I'm worried about interactions with other
>>>>>>>>>>>>>>>>>>> sets of options on the command line and the fall-out from that).  If
>>>>>>>>>>>>>>>>>>> someone cares about pure-code they should be doing full testsuite runs
>>>>>>>>>>>>>>>>>>> with it enabled and that should be sufficient.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, that's what I am doing manually, it's a bit tricky, and I use a
>>>>>>>>>>>>>>>>>> modified simulator.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Christophe
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> R.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> k#   (use "git pull" to merge the remote branch into yours)
>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>  gcc/config/arm/arm.c | 91 +++++++++++++++++++++++++++++++++++++---------------
>>>>>>>>>>>>>>>>>>>>>>  1 file changed, 66 insertions(+), 25 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>>>> index ceeb91f..62abeb5 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>>>> +++ b/gcc/config/arm/arm.c
>>>>>>>>>>>>>>>>>>>>>> @@ -28342,9 +28342,43 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>>>>        if (mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>>>>> -       fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>>>>> -       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> -       fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>>>>> +       /* With -mpure-code, we cannot load delta from the constant
>>>>>>>>>>>>>>>>>>>>>> +          pool: we build it explicitly.  */
>>>>>>>>>>>>>>>>>>>>>> +       if (target_pure_code)
>>>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>>>> +           bool mov_done_p = false;
>>>>>>>>>>>>>>>>>>>>>> +           int i;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +           /* Emit upper 3 bytes if needed.  */
>>>>>>>>>>>>>>>>>>>>>> +           for (i = 0; i < 3; i++)
>>>>>>>>>>>>>>>>>>>>>> +             {
>>>>>>>>>>>>>>>>>>>>>> +               int byte = (mi_delta >> (8 * (3 - i))) & 0xff;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +               if (byte)
>>>>>>>>>>>>>>>>>>>>>> +                 {
>>>>>>>>>>>>>>>>>>>>>> +                   if (mov_done_p)
>>>>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tadds\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>>>>> +                   else
>>>>>>>>>>>>>>>>>>>>>> +                     asm_fprintf (file, "\tmovs\tr3, #%d\n", byte);
>>>>>>>>>>>>>>>>>>>>>> +                   mov_done_p = true;
>>>>>>>>>>>>>>>>>>>>>> +                 }
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +               if (mov_done_p)
>>>>>>>>>>>>>>>>>>>>>> +                 asm_fprintf (file, "\tlsls\tr3, #8\n");
>>>>>>>>>>>>>>>>>>>>>> +             }
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +           /* Emit lower byte if needed.  */
>>>>>>>>>>>>>>>>>>>>>> +           if (!mov_done_p)
>>>>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tmovs\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>>>>> +           else if (mi_delta & 0xff)
>>>>>>>>>>>>>>>>>>>>>> +             asm_fprintf (file, "\tadds\tr3, #%d\n", mi_delta & 0xff);
>>>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>>>> +           fputs ("\tldr\tr3, ", file);
>>>>>>>>>>>>>>>>>>>>>> +           assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> +           fputs ("+4\n", file);
>>>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>>>         asm_fprintf (file, "\t%ss\t%r, %r, r3\n",
>>>>>>>>>>>>>>>>>>>>>>                      mi_op, this_regno, this_regno);
>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>> @@ -28380,30 +28414,37 @@ arm_thumb1_mi_thunk (FILE *file, tree, HOST_WIDE_INT delta,
>>>>>>>>>>>>>>>>>>>>>>       fputs ("\tpop\t{r3}\n", file);
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>        fprintf (file, "\tbx\tr12\n");
>>>>>>>>>>>>>>>>>>>>>> -      ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>>>>> -      assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> -      fputs (":\n", file);
>>>>>>>>>>>>>>>>>>>>>> -      if (flag_pic)
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +      /* With -mpure-code, we don't need to emit literals for the
>>>>>>>>>>>>>>>>>>>>>> +      function address and delta since we emitted code to build
>>>>>>>>>>>>>>>>>>>>>> +      them.  */
>>>>>>>>>>>>>>>>>>>>>> +      if (!target_pure_code)
>>>>>>>>>>>>>>>>>>>>>>       {
>>>>>>>>>>>>>>>>>>>>>> -       /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>>>>> -       rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>>>>> -       /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>>>>> -          pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>>>>> -          accordingly.  */
>>>>>>>>>>>>>>>>>>>>>> -       tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>>>>> -                            TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>>>>> -       tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>>>>> -                            tem,
>>>>>>>>>>>>>>>>>>>>>> -                            gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>>>>> -                                                ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>>>>> -       assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> -     }
>>>>>>>>>>>>>>>>>>>>>> -      else
>>>>>>>>>>>>>>>>>>>>>> -     /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>>>>> -     assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +       ASM_OUTPUT_ALIGN (file, 2);
>>>>>>>>>>>>>>>>>>>>>> +       assemble_name (file, label);
>>>>>>>>>>>>>>>>>>>>>> +       fputs (":\n", file);
>>>>>>>>>>>>>>>>>>>>>> +       if (flag_pic)
>>>>>>>>>>>>>>>>>>>>>> +         {
>>>>>>>>>>>>>>>>>>>>>> +           /* Output ".word .LTHUNKn-[3,7]-.LTHUNKPCn".  */
>>>>>>>>>>>>>>>>>>>>>> +           rtx tem = XEXP (DECL_RTL (function), 0);
>>>>>>>>>>>>>>>>>>>>>> +           /* For TARGET_THUMB1_ONLY the thunk is in Thumb mode, so the PC
>>>>>>>>>>>>>>>>>>>>>> +              pipeline offset is four rather than eight.  Adjust the offset
>>>>>>>>>>>>>>>>>>>>>> +              accordingly.  */
>>>>>>>>>>>>>>>>>>>>>> +           tem = plus_constant (GET_MODE (tem), tem,
>>>>>>>>>>>>>>>>>>>>>> +                                TARGET_THUMB1_ONLY ? -3 : -7);
>>>>>>>>>>>>>>>>>>>>>> +           tem = gen_rtx_MINUS (GET_MODE (tem),
>>>>>>>>>>>>>>>>>>>>>> +                                tem,
>>>>>>>>>>>>>>>>>>>>>> +                                gen_rtx_SYMBOL_REF (Pmode,
>>>>>>>>>>>>>>>>>>>>>> +                                                    ggc_strdup (labelpc)));
>>>>>>>>>>>>>>>>>>>>>> +           assemble_integer (tem, 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +         }
>>>>>>>>>>>>>>>>>>>>>> +       else
>>>>>>>>>>>>>>>>>>>>>> +         /* Output ".word .LTHUNKn".  */
>>>>>>>>>>>>>>>>>>>>>> +         assemble_integer (XEXP (DECL_RTL (function), 0), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -      if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>>> -     assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +       if (TARGET_THUMB1_ONLY && mi_delta > 255)
>>>>>>>>>>>>>>>>>>>>>> +         assemble_integer (GEN_INT(mi_delta), 4, BITS_PER_WORD, 1);
>>>>>>>>>>>>>>>>>>>>>> +     }
>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>    else
>>>>>>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> +class thumb1_const_rtl
>>>>>> ...
>>>>>> +  void mov (int val)
>>>>>>
>>>>>> This should take a HOST_WIDE_INT.  Similarly for add and shift.  The
>>>>>> same applies to the asm version as well.
>>>>>>
>>>>>> +    asm_fprintf (t_file, "\tmovs\tr%d, #%d\n", dst_regno, val);
>>>>>>
>>>>>> Should be using reg_names[dst_regno] in all cases.  In fact, you might
>>>>>> want to move that lookup to the constructor and just save a pointer to
>>>>>> the string there.  You'll need to use HOST_WIDE_INT_PRINT_UNSIGNED
>>>>>
>>>>> Correction, for a (signed) HOST_WIDE_INT, this should be
>>>>> HOST_WIDE_INT_PRINT_DEC.
>>>>>
>>>>
>>>> Right, but if "val" is unsigned HOST_WIDE_INT, all the methods in the
>>>> two classes
>>>> can have unsigned HOST_WIDE_INT parameters, and thus use
>>>> HOST_WIDE_INT_PRINT_UNSIGNED?
>>>>
>>>
>>> It's generally safer to print it as a signed value.  We probably won't
>>> have cases where the top bit of a 32-bit word are set here; but if you
>>> do, and the value is unsigned, you end up with 16 digit hex numbers
>>> rather than the 8 you'd expect on a 32-bit target because HOST_WIDE_INT
>>> is at least 64 bits in size.
>>>
>>
>> I don't mean hex numbers, of course, just very large decimal numbers.
>> But the point is still the same.
>>
> 
> Here is an updated version, which hopefully addresses your comments,
> and adds testcases for cortex-m0 and cortex-m23.
> 

This is OK.

Thanks,

R.

> Christophe
> 
>> R.
>>
>>> R.
>>>
>>>>
>>>>>> rather than "%d" for the immediate.
>>>>>>
>>>>>> +template <class T>
>>>>>> +void
>>>>>> +thumb1_gen_const_int_1 (T dst, HOST_WIDE_INT op1)
>>>>>> +{
>>>>>> +  bool mov_done_p = false;
>>>>>> +  int val = op1;
>>>>>>
>>>>>> This potentially silently loses precision.  In fact, I think you really
>>>>>> want to use "unsigned HOST_WIDE_INT" throughout the following code, so
>>>>>> that the right shifts aren't undefined if dealing with negative numbers.
>>>>>>
>>>>>> For safety, you should also have an assertion in here that
>>>>>>
>>>>>>   op1 == trunc_int_for_mode (op1, SImode)
>>>>>>
>>>>>> +  int shift = 0;
>>>>>> +  int i;
>>>>>> +
>>>>>> +  if (val == 0)
>>>>>>
>>>>>> You can short-circuit 0..255 here for a quick exit.
>>>>>>
>>>>>> +    {
>>>>>> +      dst.mov (val);
>>>>>> +      return;
>>>>>> +    }
>>>>>>
>>>>>> Another trick: if the top nine bits of the 32-bit value are all set,
>>>>>> you're probably going to be better off (and certainly not worse off) by
>>>>>> generating -op1 and then negating the result in a final step - you can
>>>>>> do that via recursion.
>>>>>>
>>>>>> +
>>>>>> +  /* In the general case, we need 7 instructions to build
>>>>>> +     a 32 bits constant (1 movs, 3 lsls, 3 adds). We can
>>>>>> +     do better if VAL is small enough, or
>>>>>> +     right-shiftable by a suitable amount.  If the
>>>>>> +     right-shift enables to encode at least one less byte,
>>>>>> +     it's worth it: we save a adds and a lsls at the
>>>>>> +     expense of a final lsls.  */
>>>>>> +  int final_shift = number_of_first_bit_set (val);
>>>>>> +
>>>>>> +  int leading_zeroes = clz_hwi (val);
>>>>>> +  int number_of_bytes_needed
>>>>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes)
>>>>>> +       / BITS_PER_UNIT) + 1;
>>>>>> +  int number_of_bytes_needed2
>>>>>> +    = ((HOST_BITS_PER_WIDE_INT - 1 - leading_zeroes - final_shift)
>>>>>> +       / BITS_PER_UNIT) + 1;
>>>>>> +
>>>>>> +  if (number_of_bytes_needed2 < number_of_bytes_needed)
>>>>>> +    val >>= final_shift;
>>>>>> +  else
>>>>>> +    final_shift = 0;
>>>>>> +
>>>>>> +  /* If we are in a very small range, we can use either a single movs
>>>>>> +     or movs+adds.  */
>>>>>> +  if ((val >= 0) && (val <= 510))
>>>>>>
>>>>>> if val is made unsigned HWI as I suggest, the lower bounds test is not
>>>>>> needed.
>>>>>>
>>>>>> +    {
>>>>>> +      if (val > 255)
>>>>>> +     {
>>>>>> +       int high = val - 255;
>>>>>>
>>>>>> Again, watch your types.
>>>>>>
>>>>>> +
>>>>>> +       dst.mov (high);
>>>>>> +       dst.add (255);
>>>>>> +     }
>>>>>> +      else
>>>>>> +     dst.mov (val);
>>>>>> +
>>>>>> +      if (final_shift > 0)
>>>>>> +     dst.ashift (final_shift);
>>>>>> +    }
>>>>>> +  else
>>>>>> +    {
>>>>>> +      /* General case, emit upper 3 bytes as needed.  */
>>>>>> +      for (i = 0; i < 3; i++)
>>>>>> +     {
>>>>>> +       int byte = (val >> (8 * (3 - i))) & 0xff;
>>>>>>
>>>>>> and here.
>>>>>>
>>>>>> +
>>>>>> +       if (byte)
>>>>>> +         {
>>>>>> +           /* We are about to emit new bits, stop accumulating a
>>>>>> +              shift amount, and left-shift only if we have already
>>>>>> +              emitted some upper bits.  */
>>>>>> +           if (mov_done_p)
>>>>>> +             {
>>>>>> +               dst.ashift (shift);
>>>>>> +               dst.add (byte);
>>>>>> +             }
>>>>>> +           else
>>>>>> +             dst.mov (byte);
>>>>>> +
>>>>>> +           /* Stop accumulating shift amount since we've just
>>>>>> +              emitted some bits.  */
>>>>>> +           shift = 0;
>>>>>> +
>>>>>> +           mov_done_p = true;
>>>>>> +         }
>>>>>> +
>>>>>> +       if (mov_done_p)
>>>>>> +         shift += 8;
>>>>>> +     }
>>>>>> +
>>>>>> +      /* Emit lower byte.  */
>>>>>> +      if (!mov_done_p)
>>>>>> +     dst.mov (val & 0xff);
>>>>>> +      else
>>>>>> +     {
>>>>>> +       dst.ashift (shift);
>>>>>> +       if (val & 0xff)
>>>>>> +         dst.add (val & 0xff);
>>>>>> +     }
>>>>>> +
>>>>>> +      if (final_shift > 0)
>>>>>> +     dst.ashift (final_shift);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>
>>>>>
>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-11-02 14:28 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-29 19:50 [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code Christophe Lyon
2020-10-06  8:30 ` Christophe Lyon
2020-10-06 16:02 ` Richard Earnshaw
2020-10-08  9:07   ` Christophe Lyon
2020-10-08  9:58     ` Richard Earnshaw
2020-10-12  7:59       ` Christophe Lyon
2020-10-19 14:39         ` Richard Earnshaw
2020-10-19 16:32           ` Christophe Lyon
2020-10-20 11:22             ` Richard Earnshaw
2020-10-20 11:25               ` Richard Earnshaw
2020-10-21 15:49                 ` Christophe Lyon
2020-10-21 16:07                   ` Richard Earnshaw
2020-10-21 16:11                     ` Christophe Lyon
2020-10-21 17:36                       ` Richard Earnshaw
2020-10-22  8:45                         ` Christophe Lyon
2020-10-22 15:22                           ` Richard Earnshaw
2020-10-26 10:52                             ` Christophe Lyon
2020-10-27 15:42                               ` Richard Earnshaw
2020-10-28 17:44                                 ` Richard Earnshaw
2020-10-28 18:10                                   ` Christophe Lyon
2020-10-29 19:18                                     ` Richard Earnshaw
2020-10-30 12:49                                       ` Richard Earnshaw
2020-11-02 10:24                                         ` Christophe Lyon
2020-11-02 14:28                                           ` Richard Earnshaw

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).